Assignment: Character LLM

DSA4212

Introduction

Large language models (LLMs) such as GPT are built upon the transformer architecture. In this project, you will implement and train a small transformer for character-level language modeling using the well-known text8 dataset. The text8 dataset consists of the first 100 million characters of a cleaned and preprocessed English Wikipedia dump. It includes only lowercase letters and spaces, making it compact and suitable for benchmarking text models.

The goal is to train a model that, given a sequence of L previous characters, can accurately predict the next character. This task, known as next-character prediction, is a fundamental stepping stone toward understanding and building modern language models.

Through this assignment, you will gain hands-on experience with optimization, architecture tuning, and hyperparameter search—core components of training large models efficiently. I have created a starter-kit to help you get started quickly.


Problem Statement

Given a text corpus \(\mathcal{D} = (x_1, x_2, \dots, x_T)\), where each \(x_t\) is a character in the text8 dataset, your task is to train a model \(f_\theta\) that learns the conditional distribution:

\[ P_\theta(x_{t+1} \mid x_{t-L+1}, \dots, x_t), \]

for a context window of length \(L\). The model should predict the most likely next character given the previous \(L\) characters.
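
To make the setup concrete, the sketch below shows one way, in Python with NumPy, to encode text8 characters as integers and slice the stream into (context of length L, next character) pairs. The starter-kit may already provide this; the file path, slice sizes, and function names here are placeholders, not part of the assignment.

    import numpy as np

    # text8 contains only the 26 lowercase letters and the space character (27 symbols).
    VOCAB = " abcdefghijklmnopqrstuvwxyz"
    CHAR_TO_ID = {c: i for i, c in enumerate(VOCAB)}

    def load_text8(path="text8", n_chars=1_000_000):
        """Read the first n_chars characters of the text8 file and encode them as ids."""
        with open(path) as f:
            text = f.read(n_chars)
        return np.array([CHAR_TO_ID[c] for c in text], dtype=np.int64)

    def make_examples(ids, L):
        """contexts[t] = ids[t:t+L] (a read-only view), targets[t] = ids[t+L]."""
        windows = np.lib.stride_tricks.sliding_window_view(ids, L)
        return windows[:-1], ids[L:]

    ids = load_text8(n_chars=1_000_000)   # a small slice for quick experiments
    contexts, targets = make_examples(ids, L=64)
    print(contexts.shape, targets.shape)  # (999936, 64) (999936,)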

Main objectives:

  1. Implement and train a small transformer-based model for next-character prediction.
  2. Evaluate the model’s performance on a held-out test set.
  3. Report the accuracy (percentage of correct predictions) when predicting the next character given the previous \(L\) characters; a minimal evaluation sketch is given after this list.
  4. Optimize the value of \(L\) for best performance.
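
As a minimal illustration of objective 3, the sketch below computes next-character accuracy on a held-out split. It assumes PyTorch (the starter-kit may use a different framework), integer context and target tensors shaped (n, L) and (n,), and a model that returns per-position logits of shape (batch, L, vocab); all names are illustrative.

    import torch

    @torch.no_grad()
    def next_char_accuracy(model, contexts, targets, batch_size=1024):
        """Fraction of held-out windows whose most likely next character is correct."""
        model.eval()
        correct = 0
        for start in range(0, len(contexts), batch_size):
            x = contexts[start:start + batch_size]     # (batch, L) integer ids
            y = targets[start:start + batch_size]      # (batch,) true next characters
            logits = model(x)                          # assumed (batch, L, vocab)
            preds = logits[:, -1, :].argmax(dim=-1)    # prediction at the last position
            correct += (preds == y).sum().item()
        return correct / len(contexts)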

Organization of the Report

Your report should follow this general structure:

  1. Introduction: Briefly explain the problem, dataset, and goals of the assignment.
  2. Transformer: Briefly describe the transformer architecture.
  3. Experiments and Tuning:
    This is the main part of the report and the one on which you will be evaluated most heavily. Describe the experiments you performed to optimize your chosen model and training procedure (e.g., hyperparameter tuning, architecture choices, optimization strategies, loss functions, regularization).
  4. Discussion: Reflect on what worked, what didn’t, and why.

Experimentation Hints

  • Metrics & Loss:
    Predicting multiple future characters can change the training dynamics, though it may complicate optimization. Applying the loss only at the last token, rather than at every position, also shifts what the model learns. Label smoothing or temperature scaling sometimes improves generalization but can blunt the sharpness of predictions. A hedged loss sketch is given after this list.

  • Model Tuning:
    The choice of hidden size, number of layers, and number of attention heads can strongly influence both performance and training time. The feedforward dimension and dropout help balance capacity and regularization. Different positional encodings (sinusoidal, learned, rotary), or injecting them at different layers, can affect how well the model captures long-range dependencies. Simpler recurrent alternatives such as LSTMs might perform surprisingly well on this dataset and are worth comparing; if an LSTM outperforms a transformer, that is an interesting finding to discuss in your report. A minimal model sketch follows this list.

  • Optimization:
    Hyperparameters such as batch size, optimizer, learning rate, and scheduling strategy often dominate performance differences. Gradient clipping and mixed precision can improve training stability and efficiency. Smaller exploratory runs are crucial and should guide your settings before you commit to longer training. When training large LLMs this is especially true, since the final training run can take days and be extremely costly; for example, it is often worthwhile to run a large number of smaller experiments and explore scaling laws (e.g., how performance scales with model size, data size, and compute budget). An optimization sketch follows this list.
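
To make the Metrics & Loss hint concrete, here is a minimal loss sketch in PyTorch. The starter-kit may expose a different interface; the function name, the assumption that the model emits per-position logits of shape (batch, L, vocab), and the assumption that the targets are the inputs shifted by one character are ours.

    import torch
    import torch.nn.functional as F

    def char_lm_loss(logits, targets, last_token_only=False, label_smoothing=0.0):
        """Cross-entropy for next-character prediction.

        logits:  (batch, L, vocab), one prediction per position
        targets: (batch, L), the input sequence shifted by one character
        """
        if last_token_only:
            # Supervise only the final position of each window.
            logits, targets = logits[:, -1, :], targets[:, -1]
        else:
            # Supervise every position (usually a much stronger training signal).
            logits = logits.reshape(-1, logits.size(-1))
            targets = targets.reshape(-1)
        return F.cross_entropy(logits, targets, label_smoothing=label_smoothing)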
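For the Model Tuning hint, the sketch below is one compact decoder-only character model built from standard PyTorch modules. Every size, the learned positional embedding, and the class name are illustrative starting points rather than prescribed choices.

    import torch
    import torch.nn as nn

    class CharTransformer(nn.Module):
        """Compact decoder-only character model; all sizes are starting points to tune."""

        def __init__(self, vocab_size=27, L=64, d_model=128, n_heads=4,
                     n_layers=4, d_ff=512, dropout=0.1):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(L, d_model)  # learned positions; sinusoidal or rotary are alternatives
            layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=d_ff,
                                               dropout=dropout, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, x):                         # x: (batch, L) integer ids
            seq_len = x.size(1)
            pos = torch.arange(seq_len, device=x.device)
            h = self.tok_emb(x) + self.pos_emb(pos)   # (batch, L, d_model)
            # Additive causal mask: -inf above the diagonal blocks attention to future positions.
            mask = torch.full((seq_len, seq_len), float("-inf"), device=x.device).triu(1)
            h = self.blocks(h, mask=mask)
            return self.head(h)                       # (batch, L, vocab) per-position logits

Swapping self.blocks for an nn.LSTM (and dropping the positional embedding and causal mask) gives a quick recurrent baseline for the comparison suggested above.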
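For the Optimization hint, here is a hedged sketch of one typical setup: AdamW, linear warmup followed by cosine decay, and gradient clipping. The targets y are assumed to be the inputs shifted by one character, and all constants are placeholders to tune; mixed precision would additionally wrap the forward pass in torch.autocast and scale gradients with torch.cuda.amp.GradScaler.

    import math
    import torch
    import torch.nn.functional as F

    def make_optimizer(model, total_steps, warmup_steps=1000, lr=3e-4, weight_decay=0.01):
        """AdamW with linear warmup followed by cosine decay to zero."""
        opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

        def lr_lambda(step):
            if step < warmup_steps:
                return (step + 1) / warmup_steps
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            return 0.5 * (1.0 + math.cos(math.pi * progress))

        return opt, torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

    def train_step(model, opt, sched, x, y, max_grad_norm=1.0):
        """One update: loss over all positions, clipped gradients, scheduler step."""
        opt.zero_grad()
        logits = model(x)                                     # (batch, L, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        opt.step()
        sched.step()
        return loss.item()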

Some potentially interesting references:

  • “Attention is all you need”, Vaswani et al. (2017) : The seminal paper introducing the transformer architecture, which has become the foundation for many modern language models.
  • “Character-level language modeling with deeper self-attention”, Al-Rfou et al. (2019) : Explores character-level language modeling using deep self-attention mechanisms, demonstrating the effectiveness of transformers for this task.
  • “Scaling laws for neural language models”, Kaplan et al. (2020) : Describes empirical scaling laws for language models, showing how performance improves with model size, dataset size, and compute budget.
  • “Training compute-optimal large language models”, Hoffmann et al. (2022) : Proposes strategies for training large language models efficiently, balancing model size and training duration to optimize performance.
  • “Roformer: Enhanced transformer with rotary position embedding”, Su et al. (2024) : Introduces the Roformer architecture, which incorporates rotary position embeddings to improve the performance of transformers on various NLP tasks.
  • Tricks of the Trade for Training Large Neural Networks: A collection of practical tips and techniques for training large neural networks effectively, including optimization strategies and architecture choices.

Submission

Submission format:

  • A single submission per group via Canvas.
  • Include:
    1. A PDF report (8 pages at most; shorter is encouraged).
    2. The code (as a zip file under 10 MB, or link to a public GitHub repository).

Guidelines:

  • The report should be self-contained and not include any code.
  • You are encouraged to use LaTeX; a clean Overleaf template is available here.
  • Use plots and tables to summarize results.
  • Cite all external sources (papers, repos, blogs).

Evaluation Criteria

  • Understanding and exploration (70%)
    Depth and breadth of experiments, analysis of results, and creativity in exploring model variants.
  • Report quality (20%)
    Clarity, structure, and interpretation of results.
  • Code quality (10%)
    Correctness, readability, and documentation.

Using LLM-based tools for writing, coding, or idea exploration is allowed and encouraged, provided you understand and can explain all included material. Collaboration and discussion with peers are encouraged, but the report and code must be your own work. Cite all external resources properly.


Final Remarks

This project is designed to be exploratory and open-ended. There is no single “correct” solution—creativity and thoughtful experimentation are key. If you get stuck, don’t hesitate to reach out for guidance. Consider maintaining a proper git repository to track progress and document results. You will learn a lot about the inner workings of transformers, optimization, and large-scale training by engaging deeply with this project.

References

Al-Rfou, Rami, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2019. “Character-Level Language Modeling with Deeper Self-Attention.” In Proceedings of the AAAI Conference on Artificial Intelligence, 33 (01): 3159–66.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” arXiv Preprint arXiv:2203.15556.
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” arXiv Preprint arXiv:2001.08361.
Su, Jianlin, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. “Roformer: Enhanced Transformer with Rotary Position Embedding.” Neurocomputing 568. Elsevier: 127063.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.