This seminar-style course is a practical study of maximizing computational intelligence within strict parameter constraints. The focus is on training small, rather than large, language models that retain robust capabilities despite their size, with a specific target of 1 billion parameters or fewer.
While traditional scaling laws suggest that performance scales primarily with model size, this seminar challenges that orthodoxy by focusing on data quality, architectural efficiency, and information density. Students will not just study these concepts but will actively and collectively build and train a sub-1B parameter model from scratch, implementing everything from the dataset curator to the final alignment loop.
Each week is dedicated to a key concept in building SLMs. Students will meet for 1 hour each week to discuss 1-3 papers and identify implementation goals. Each student is expected to present an overview of a paper approximately three times during the semester.
The class will use a shared GitHub repository for collectively building the SLM. Students will contribute three substantive changes based on the class's agreed and prioritized development goals. This is a 1-credit course; students should expect an average of 3 hours of work outside of class per week, though the workload varies from week to week.
Build a high-quality, synthetic-heavy dataset pipeline. For <1B models, data composition is the primary determinant of "intelligence."
Initialize the project repository and define the model configuration (e.g., 300M parameter target) based on available compute.
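A minimal sketch of what the shared model configuration might look like, assuming a LLaMA-style decoder-only Transformer with tied embeddings; every value below is an illustrative placeholder, not a decision the class has made.

```python
from dataclasses import dataclass

@dataclass
class SLMConfig:
    # Illustrative values for a ~300M-parameter decoder-only Transformer;
    # the real numbers should come from the compute-budget discussion.
    vocab_size: int = 32_000
    d_model: int = 1024
    n_layers: int = 24
    n_heads: int = 16
    n_kv_heads: int = 4          # for GQA later in the semester
    d_ff: int = 2816             # SwiGLU-style FFN width
    max_seq_len: int = 4096

    def approx_params(self) -> int:
        """Rough count: tied embeddings + attention + FFN weights per layer."""
        emb = self.vocab_size * self.d_model
        d_head = self.d_model // self.n_heads
        attn = 2 * self.d_model * self.d_model \
             + 2 * self.d_model * d_head * self.n_kv_heads
        ffn = 3 * self.d_model * self.d_ff    # gate, up, and down projections
        return emb + self.n_layers * (attn + ffn)

cfg = SLMConfig()
print(f"~{cfg.approx_params() / 1e6:.0f}M parameters")   # ~303M with these values
```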
Implement a "Textbook Synthesizer." Use a teacher model (e.g., Llama 3.1 70B) to generate a small corpus of synthetic reasoning data.
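A sketch of one way the synthesizer could work, assuming the teacher is served behind an OpenAI-compatible endpoint (e.g., a vLLM deployment hosting Llama 3.1 70B); the base URL, model name, topics, and prompt template are all placeholders.

```python
import json
from openai import OpenAI

# Placeholder endpoint: any OpenAI-compatible server (e.g., a local vLLM instance).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

TOPICS = ["fractions", "unit conversion", "basic probability"]
PROMPT = (
    "Write a short textbook-style explanation of {topic} for a beginner, "
    "followed by one worked example with step-by-step reasoning."
)

with open("synthetic_textbook.jsonl", "w") as f:
    for topic in TOPICS:
        resp = client.chat.completions.create(
            model="meta-llama/Llama-3.1-70B-Instruct",   # placeholder teacher
            messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
            temperature=0.8,
            max_tokens=512,
        )
        f.write(json.dumps({"topic": topic,
                            "text": resp.choices[0].message.content}) + "\n")
```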
Run a "Proxy Experiment." Train three tiny (10M param) models on different data ratios to select the optimal mix for the main build.
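A sketch of the proxy-experiment scaffolding under assumed source names and ratios; the toy corpora and the commented-out training call stand in for the class's dataloader and a short 10M-parameter run.

```python
import random

MIXTURES = {
    "web_heavy":   {"web": 0.8, "code": 0.1, "synthetic": 0.1},
    "balanced":    {"web": 0.5, "code": 0.2, "synthetic": 0.3},
    "synth_heavy": {"web": 0.3, "code": 0.2, "synthetic": 0.5},
}

def build_mixture(sources, ratios, n_docs, seed=0):
    """Sample documents from each source in proportion to its ratio."""
    rng = random.Random(seed)
    docs = []
    for name, ratio in ratios.items():
        docs += rng.choices(sources[name], k=int(n_docs * ratio))
    rng.shuffle(docs)
    return docs

# Toy stand-in corpora; the real pipeline pulls from the curated dataset.
sources = {"web": ["web doc"] * 100, "code": ["code doc"] * 100,
           "synthetic": ["synthetic doc"] * 100}

for name, ratios in MIXTURES.items():
    corpus = build_mixture(sources, ratios, n_docs=1_000)
    print(name, len(corpus), "documents")
    # val_loss = train_and_evaluate(corpus, n_params=10_000_000)  # hypothetical helper
```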
Finalize the main Dataloader. Implement MinHash deduplication and a "Data Scheduler."
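One way to implement the deduplication half of this step, assuming the `datasketch` library; the shingle size and Jaccard threshold are placeholders to tune, and the "Data Scheduler" is not shown.

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Hash 5-token shingles of a document into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(max(len(tokens) - 4, 1)):
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if no previously kept document is a near-duplicate."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if lsh.query(sig):            # near-duplicate of an already kept doc
            continue
        lsh.insert(str(idx), sig)
        kept.append(doc)
    return kept

docs = ["the cat sat on the quiet mat " * 3,
        "the cat sat on the quiet mat " * 3,          # exact duplicate, dropped
        "a completely different document about data quality"]
print(len(deduplicate(docs)))
```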
Design the neural architecture. Focus on efficiency hacks that break the O(N²) attention barrier and reduce parameter counts.
Create a "Hybrid Block." Implement a custom layer in PyTorch that can switch among Standard Attention, a Mamba-style SSM, and Linear Attention. Benchmark memory usage on a 4k context window.
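A minimal sketch of a hybrid block whose token mixer is chosen by config. Standard attention and a simple non-causal linear-attention variant are shown; a Mamba-style SSM branch would plug in the same way (e.g., via an external SSM package) but is omitted for brevity. All dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, mixer: str = "attention"):
        super().__init__()
        self.mixer = mixer
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, T, D)
        B, T, _ = x.shape
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.d_head)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))  # (B, H, T, d)
        if self.mixer == "attention":
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        elif self.mixer == "linear":
            # Kernel-feature linear attention: O(T * d^2) instead of O(T^2 * d).
            q, k = F.elu(q) + 1, F.elu(k) + 1
            kv = torch.einsum("bhtd,bhte->bhde", k, v)
            z = 1.0 / (torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2)) + 1e-6)
            y = torch.einsum("bhtd,bhde,bht->bhte", q, kv, z)
        else:
            raise ValueError(f"unknown mixer: {self.mixer}")
        y = y.transpose(1, 2).reshape(B, T, -1)
        return x + self.out(y)

with torch.no_grad():
    x = torch.randn(1, 4096, 512)                    # 4k-token context window
    for mixer in ("attention", "linear"):
        print(mixer, HybridBlock(mixer=mixer)(x).shape)
```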
Integrate Grouped Query Attention (GQA) or parameter sharing into the model definition to reduce the KV cache size.
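A sketch of a GQA layer under assumed dimensions (16 query heads sharing 4 KV heads), which shrinks the K/V projections and the KV cache by a factor of n_heads / n_kv_heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16, n_kv_heads: int = 4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, T, D)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Repeat each KV head so every group of query heads attends to it.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(B, T, -1))

attn = GQAttention()
print(attn(torch.randn(2, 128, 1024)).shape)   # K/V projections are 4x smaller than MHA
```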
Explore mechanisms for integrating external knowledge into small models via scalable lookup or flexible data routing.
Pre-train the model. Focus on predictive scaling, "mid-training" annealing, and sparsity.
Modify your training script to implement Mid-Training Annealing. Create an intermediate "Annealing Phase" for the final 50B tokens.
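A sketch of the schedule logic, assuming a hypothetical 1T-token total budget with the last 50B tokens forming the annealing phase; the learning rates and sampling weights are placeholders for decisions made in class.

```python
TOTAL_TOKENS  = 1_000_000_000_000        # assumed total budget
ANNEAL_TOKENS = 50_000_000_000           # final 50B tokens form the annealing phase
ANNEAL_START  = TOTAL_TOKENS - ANNEAL_TOKENS
PEAK_LR, MIN_LR = 3e-4, 3e-5             # placeholder learning rates

def learning_rate(tokens_seen: int) -> float:
    """Hold the peak LR during the main phase, then decay linearly in the anneal."""
    if tokens_seen < ANNEAL_START:
        return PEAK_LR
    frac = (tokens_seen - ANNEAL_START) / ANNEAL_TOKENS
    return PEAK_LR + frac * (MIN_LR - PEAK_LR)

def data_weights(tokens_seen: int) -> dict[str, float]:
    """Shift the sampling mix toward curated/synthetic data during the anneal."""
    if tokens_seen < ANNEAL_START:
        return {"web": 0.7, "code": 0.2, "synthetic": 0.1}
    return {"web": 0.3, "code": 0.2, "synthetic": 0.5}

for t in (0, ANNEAL_START, ANNEAL_START + ANNEAL_TOKENS // 2, TOTAL_TOKENS):
    print(f"{t:>16,d} tokens  lr={learning_rate(t):.1e}  mix={data_weights(t)}")
```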
Plot a "Scaling Forecast." Train your architecture at three small scales (10M, 50M, 100M) on a fixed 1B-token subset. Use a log-log linear fit to predict the validation loss of your final 300M+ run.
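The forecast itself is a short linear fit in log-log space; the loss values below are made-up placeholders to show the mechanics, and the real numbers come from the three proxy runs.

```python
import numpy as np

params = np.array([10e6, 50e6, 100e6])
val_loss = np.array([4.10, 3.55, 3.30])          # placeholder measurements

# Fit log(loss) = intercept + slope * log(params) and extrapolate.
slope, intercept = np.polyfit(np.log(params), np.log(val_loss), deg=1)

def predict_loss(n_params: float) -> float:
    return float(np.exp(intercept + slope * np.log(n_params)))

print(f"fitted exponent: {slope:.3f}")
print(f"predicted validation loss at 300M params: {predict_loss(300e6):.2f}")
```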
Fork the architecture to test a "Sparsity Block." Implement either a Granular MoE (DeepSeekMoE style) or the Lightning Indexer attention mechanism.
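A sketch of the granular-MoE option: many small experts, a couple of always-active shared experts, and top-k routing. The routing is computed densely for readability, and the sizes, expert counts, and the omitted load-balancing loss are all placeholders.

```python
import torch
import torch.nn as nn

class GranularMoE(nn.Module):
    def __init__(self, d_model=512, d_expert=128, n_experts=16, n_shared=2, top_k=4):
        super().__init__()
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
        self.experts = nn.ModuleList(make_expert() for _ in range(n_experts))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, T, D)
        scores = self.router(x).softmax(dim=-1)                # (B, T, E)
        top_w, top_i = scores.topk(self.top_k, dim=-1)         # (B, T, k)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)        # renormalize gate weights
        out = sum(e(x) for e in self.shared)                   # always-active experts
        # Dense routing for clarity; a real layer gathers tokens per expert instead.
        for slot in range(self.top_k):
            for e_idx, expert in enumerate(self.experts):
                mask = (top_i[..., slot] == e_idx).unsqueeze(-1)   # (B, T, 1)
                if mask.any():
                    out = out + mask * top_w[..., slot:slot + 1] * expert(x)
        return out

moe = GranularMoE()
print(moe(torch.randn(2, 16, 512)).shape)   # torch.Size([2, 16, 512])
```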
Fine-tune your model on a "Needle-in-a-Haystack" synthetic dataset to verify recall at 16k+ context.
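A sketch of the synthetic dataset generator: bury a random key/value "needle" at a random depth inside roughly 16k tokens of filler and ask the model to retrieve it. The filler sentence, prompt format, and token estimate are placeholders; real haystacks should use natural text.

```python
import json
import random

FILLER = "The sky was clear and the market was quiet that afternoon. "

def make_example(context_tokens: int = 16_384, seed: int = 0) -> dict:
    """Build one needle-in-a-haystack retrieval example."""
    rng = random.Random(seed)
    key, value = f"code-{rng.randint(1000, 9999)}", rng.randint(100_000, 999_999)
    needle = f" The secret number for {key} is {value}. "
    n_fill = context_tokens // 12        # very rough tokens-per-sentence estimate
    pos = rng.randint(0, n_fill)
    sentences = [FILLER] * n_fill
    sentences.insert(pos, needle)
    return {
        "prompt": "".join(sentences) + f"\nWhat is the secret number for {key}?",
        "answer": str(value),
        "needle_depth": round(pos / n_fill, 3),   # where in the context the needle sits
    }

with open("niah_16k.jsonl", "w") as f:
    for i in range(100):
        f.write(json.dumps(make_example(seed=i)) + "\n")
```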
Transform the raw base model into a useful product via Reasoning, Multimodality, and Alignment.
Evaluate the model on GSM8K (grade-school math word problems).
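A sketch of a GSM8K accuracy check, assuming the checkpoint loads with Hugging Face `transformers`; the model path is a placeholder, decoding is greedy and zero-shot, and answers are compared against the dataset's standard "#### <answer>" field.

```python
import re
from datasets import load_dataset
from transformers import pipeline

generator = pipeline("text-generation", model="./checkpoints/slm-300m")  # placeholder path
test = load_dataset("gsm8k", "main", split="test").select(range(50))     # small slice

def extract_answer(text: str) -> str | None:
    """Take the last number in the model's output as its final answer."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else None

correct = 0
for ex in test:
    gold = ex["answer"].split("####")[-1].strip().replace(",", "")
    out = generator(ex["question"] + "\nAnswer:", max_new_tokens=256,
                    do_sample=False, return_full_text=False)
    pred = extract_answer(out[0]["generated_text"])
    correct += int(pred == gold)
print(f"accuracy: {correct / len(test):.2%}")
```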
Train a linear projector to allow the SLM to describe simple images.
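A sketch of the vision-to-language projector (LLaVA-style): features from a frozen image encoder are mapped into the SLM's embedding space and prepended to the text embeddings. The dimensions and the encoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 768, lm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)    # the only trainable piece

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, n_patches, vision_dim), e.g. from a frozen CLIP/SigLIP encoder
        return self.proj(image_feats)                # (B, n_patches, lm_dim)

projector = VisionProjector()
image_feats = torch.randn(2, 196, 768)               # 14x14 patch grid from the encoder
image_tokens = projector(image_feats)
text_embeds = torch.randn(2, 32, 1024)               # embedded caption prompt
lm_input = torch.cat([image_tokens, text_embeds], dim=1)   # fed to the SLM
print(lm_input.shape)                                # torch.Size([2, 228, 1024])
```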
Fine-tune the student model on a preference dataset using ORPO (Odds Ratio Preference Optimization).
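A sketch of the ORPO step using the TRL library, which is an assumption; exact argument names have shifted across TRL versions (e.g., `tokenizer=` vs. `processing_class=`), and the checkpoint path and preference dataset are placeholders.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model = AutoModelForCausalLM.from_pretrained("./checkpoints/slm-300m-sft")   # placeholder
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/slm-300m-sft")

# Preference data needs prompt/chosen/rejected-style columns.
prefs = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:1%]")

args = ORPOConfig(
    output_dir="./checkpoints/slm-300m-orpo",
    beta=0.1,                        # weight of the odds-ratio preference term
    per_device_train_batch_size=4,
    num_train_epochs=1,
)
trainer = ORPOTrainer(model=model, args=args, train_dataset=prefs,
                      processing_class=tokenizer)
trainer.train()
```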
Can a 1B model ever truly reason, or is it mathematically limited to surface-level pattern matching?