Assignment: Character LLM
Introduction
Large language models (LLMs) such as GPT are built upon the transformer architecture. In this project, you will implement and train a small transformer for character-level language modeling using the well-known text8 dataset. The text8 dataset consists of the first 100 million characters of a cleaned and preprocessed English Wikipedia dump. It includes only lowercase letters and spaces, making it compact and suitable for benchmarking text models.
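For concreteness, here is a minimal loading-and-encoding sketch in Python. The file path is an assumption (the unzipped text8 file in the working directory), and the starter kit may provide its own loader:

```python
# Minimal text8 loading/encoding sketch. Assumes the unzipped `text8`
# file is in the working directory; adapt the path to your setup.
with open("text8") as f:
    text = f.read()                       # ~100M characters

vocab = sorted(set(text))                 # 27 symbols: space plus a-z
char_to_id = {c: i for i, c in enumerate(vocab)}
id_to_char = {i: c for c, i in char_to_id.items()}

data = [char_to_id[c] for c in text]      # integer-encoded corpus
print(len(vocab), data[:10])
```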
The goal is to train a model that, given a sequence of L previous characters, can accurately predict the next character. This task, known as next-character prediction, is a fundamental stepping stone toward understanding and building modern language models.
Through this assignment, you will gain hands-on experience with optimization, architecture tuning, and hyperparameter search—core components of training large models efficiently. I have created a starter-kit to help you get started quickly.
Problem Statement
Given a text corpus \(\mathcal{D} = (x_1, x_2, \dots, x_T)\), where each \(x_t\) is a character in the text8 dataset, your task is to train a model \(f_\theta\) that learns the conditional distribution:
\[ P_\theta(x_{t+1} \mid x_{t-L+1}, \dots, x_t), \]
for a context window of length \(L\). The model should predict the most likely next character given the previous \(L\) characters.
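The model is typically trained by minimizing the average cross-entropy (negative log-likelihood) of the next character over the corpus; the assignment does not prescribe a particular loss, but this is the standard choice:
\[ \mathcal{L}(\theta) = -\frac{1}{T-L} \sum_{t=L}^{T-1} \log P_\theta(x_{t+1} \mid x_{t-L+1}, \dots, x_t). \]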
Main objectives:
- Implement and train a small transformer-based model for next-character prediction.
- Evaluate the model’s performance on a held-out test set.
- Report the accuracy (percentage of correct predictions) when predicting the next character given the previous \(L\) characters; a minimal evaluation sketch follows this list.
- You should optimize the value of \(L\) for best performance.
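As a reference for the accuracy metric above, here is one possible evaluation routine in PyTorch. The model interface is an assumption (a model taking (batch, L) character ids and returning per-position logits of shape (batch, L, vocab)); adapt it to whatever your implementation returns:

```python
import torch

@torch.no_grad()
def next_char_accuracy(model, data, L, n_eval=10_000, batch_size=256, device="cpu"):
    # `data`: 1-D LongTensor of character ids from the held-out split.
    # Assumed interface: model(x) maps (B, L) contexts to logits of shape
    # (B, L, V); the next-character prediction is read off the last position.
    model.eval()
    starts = torch.randint(0, len(data) - L, (n_eval,)).tolist()
    correct = 0
    for i in range(0, n_eval, batch_size):
        batch = starts[i : i + batch_size]
        contexts = torch.stack([data[s : s + L] for s in batch]).to(device)
        targets = torch.stack([data[s + L] for s in batch]).to(device)
        preds = model(contexts)[:, -1, :].argmax(dim=-1)
        correct += (preds == targets).sum().item()
    return correct / n_eval
```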
Organization of the Report
Your report should follow this general structure:
- Introduction: Briefly explain the problem, dataset, and goals of the assignment.
- Transformer: Briefly describe the transformer architecture.
- Experiments and Tuning: This is the main part of the report and the one on which you will be evaluated. Describe the experiments you performed to optimize your chosen model and training procedure (e.g., hyperparameter tuning, architecture choices, optimization strategies, loss functions, regularization).
- Discussion: Reflect on what worked, what didn’t, and why.
Experimentation Hints
Metrics & Loss:
Predicting multiple future characters could change training dynamics, though it may complicate optimization. Applying the loss only at the last token instead of at every position can also shift the model’s learning focus. Label smoothing or temperature scaling sometimes improves generalization but may reduce the sharpness of predictions.
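Both loss placements are easy to compare side by side. A sketch, under the same interface assumption as above (per-position logits of shape (batch, L, vocab) and integer targets of shape (batch, L), where targets[:, t] is the character following position t); the label_smoothing argument is built into PyTorch's cross_entropy (1.10+):

```python
import torch
import torch.nn.functional as F

def loss_all_positions(logits, targets, smoothing=0.0):
    # Standard setup: cross-entropy at every position,
    # optionally with label smoothing.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        label_smoothing=smoothing,
    )

def loss_last_position(logits, targets):
    # Alternative: supervise only the final token of each window.
    return F.cross_entropy(logits[:, -1, :], targets[:, -1])
```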
Model Tuning:
The choice of hidden size, number of layers, and attention heads can strongly influence performance and training time. The feedforward dimension and dropout may help balance capacity and regularization. Different positional encodings (sinusoidal, learned, rotary), or injecting them at different layers, could affect how well the model captures long-range dependencies. Simpler recurrent alternatives such as LSTMs might perform surprisingly well on this dataset and are worth comparing; if an LSTM outperforms a transformer, that would be an interesting finding to discuss in your report.
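For orientation, a minimal decoder-style character model can be assembled from PyTorch's built-in encoder layers with a causal mask. Everything here (sizes, learned positional embeddings, pre-norm) is an illustrative starting point, not a tuned recommendation:

```python
import torch
import torch.nn as nn

class CharTransformer(nn.Module):
    # Minimal causal character model; all sizes are illustrative defaults.
    def __init__(self, vocab_size=27, L=64, d_model=128, n_heads=4,
                 n_layers=4, d_ff=512, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(L, d_model)          # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, d_ff, dropout,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):                                # x: (B, L) char ids
        B, L = x.shape
        pos = torch.arange(L, device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.full((L, L), float("-inf"), device=x.device),
                          diagonal=1)
        h = self.encoder(h, mask=mask)
        return self.head(h)                              # logits: (B, L, V)
```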
Optimization:
Hyperparameters like batch size, optimizer, learning rate, and scheduling strategy often dominate performance differences. Gradient clipping and mixed precision can improve training stability and efficiency. Smaller exploratory runs are crucial and should guide your settings before you commit to longer training. This is especially true when training large LLMs, since the final training run can take days and be extremely costly. For example, it is often worth performing a large number of smaller runs to explore scaling laws (e.g., how performance scales with model size, data size, and compute budget).
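A sketch of one possible training step combining these pieces (AdamW, a cosine schedule, gradient clipping, and mixed precision); it assumes the CharTransformer and loss_all_positions sketches above, a CUDA device, and illustrative hyperparameter values:

```python
import torch

model = CharTransformer().to("cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scaler = torch.cuda.amp.GradScaler()

def train_step(contexts, targets, clip=1.0):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():        # mixed-precision forward pass
        logits = model(contexts)
        loss = loss_all_positions(logits, targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)             # clip gradients in full precision
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```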
Submission
Submission format:
- A single submission per group via Canvas.
- Include:
- A PDF report (maximum 8 pages; shorter is encouraged).
- The code (as a zip file under 10 MB, or link to a public GitHub repository).
Guidelines:
- The report should be self-contained and not include any code.
- You are encouraged to use LaTeX; a clean Overleaf template is available here.
- Use plots and tables to summarize results.
- Cite all external sources (papers, repos, blogs).
Evaluation Criteria
- Understanding and exploration (70%): Depth and breadth of experiments, analysis of results, and creativity in exploring model variants.
- Report quality (20%): Clarity, structure, and interpretation of results.
- Code quality (10%): Correctness, readability, and documentation.
Using LLM-based tools for writing, coding, or idea exploration is allowed and encouraged, provided you understand and can explain all included material. Collaboration and discussion with peers are encouraged, but the report and code must be your own work. Cite all external resources properly.
Final Remarks
This project is designed to be exploratory and open-ended. There is no single “correct” solution—creativity and thoughtful experimentation are key. If you get stuck, don’t hesitate to reach out for guidance. Consider maintaining a proper git repository to track progress and document results. You will learn a lot about the inner workings of transformers, optimization, and large-scale training by engaging deeply with this project.