CSE 598-004

Building Small Language Models
Instructor: David Jurgens · jurgens@umich.edu

Overview

This seminar-style course is a practical study of maximizing computational intelligence within strict parameter constraints. The focus is on training small, rather than large, language models that retain robust capabilities despite their size, with a target of 1 billion parameters or fewer.

While traditional scaling laws suggest that performance scales primarily with model size, this seminar challenges that orthodoxy by focusing on data quality, architectural efficiency, and information density. Students will not just study these concepts but will actively and collectively build and train a sub-1B parameter model from scratch, implementing everything from the dataset curator to the final alignment loop.

Course Structure

Each week is dedicated to a key concept in building SLMs. Students will meet for 1 hour to discuss 1-3 papers and identify implementation goals. Students are expected to present an overview of a paper ~3 times during the semester.

The class will use a shared GitHub repository for collectively building the SLM. Each student will contribute three substantive changes based on the development goals the class agrees on and prioritizes. This is a 1-credit course; students should expect an average of 3 hours of work outside of class per week, though the workload varies week to week.

Learning Goals

Schedule

Phase 1 The Data Engine

Build a high-quality, synthetic-heavy dataset pipeline. For <1B models, data composition is the primary determinant of "intelligence."

Week 1: Scaling Laws & The 2025 Landscape Jan 9
Build Goal

Initialize the project repository and define the model configuration (e.g., 300M parameter target) based on available compute.
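
As a concrete starting point, a minimal configuration sketch in Python is shown below; the dimensions are illustrative placeholders rather than the class's agreed values, and the parameter estimate ignores norms and biases.

# config.py -- illustrative starting point; dimensions are placeholders
# to be replaced with the class's agreed configuration.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 32_000
    n_layers: int = 24
    d_model: int = 1024
    n_heads: int = 16
    d_ff: int = 4096          # feed-forward hidden size
    max_seq_len: int = 4096

    def approx_params(self) -> int:
        """Rough parameter count: attention + MLP per layer, plus a tied embedding."""
        attn = 4 * self.d_model * self.d_model   # Q, K, V, O projections
        mlp = 2 * self.d_model * self.d_ff       # up + down projections
        emb = self.vocab_size * self.d_model     # tied input/output embedding
        return self.n_layers * (attn + mlp) + emb

if __name__ == "__main__":
    cfg = ModelConfig()
    print(f"~{cfg.approx_params() / 1e6:.0f}M parameters")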

Readings
Small Language Models (SLMs) Can Still Pack a Punch: A Survey
Gamage, Chathurika, et al.
arXiv:2501.05465
Training Compute-Optimal Large Language Models
Hoffmann, Jordan, et al.
arXiv:2203.15556
On the Slow Death of Scaling
Hooker, Sarah
SSRN:5877662
Week 2: 100% Synthetic Data — The Baguettotron Protocol Jan 16
Build Goal

Implement a "Textbook Synthesizer." Use a teacher model (e.g., Llama 3.1 70B) to generate a small corpus of synthetic reasoning data.

Readings
Baguettotron: A 321M Parameter Generative Model
PleIAs
huggingface.co
Textbooks Are All You Need
Gunasekar, Suriya, et al.
arXiv:2306.11644
Week 3: The Science of Data Mixtures Jan 23
Build Goal

Run a "Proxy Experiment." Train three tiny (10M param) models on different data ratios to select the optimal mix for the main build.

Readings
Scaling Laws for Optimal Data Mixtures
Shukor, Bethune, Busbridge, Grangier, Fini, El-Nouby, Ablin
arXiv:2507.09404
Topic Over Source: The Key to Effective Data Mixing for Language Models Pre-training
Zhu, Tong, et al.
arXiv:2502.16802
Week 4: The Pipeline & Practical Curricula (SmolLM2) Jan 30
Build Goal

Finalize the main Dataloader. Implement MinHash deduplication and a "Data Scheduler."
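
For the deduplication step, a sketch using the datasketch library's MinHash LSH is shown below; the shingle size and similarity threshold are assumptions to tune on the real corpus, and the data-scheduler logic is omitted here.

# dedup.py -- near-duplicate filtering sketch using MinHash LSH (datasketch).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:  # 5-gram character shingles
        m.update(shingle.encode("utf8"))
    return m

def deduplicate(docs: list, threshold: float = 0.8) -> list:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc)
        if not lsh.query(m):          # no near-duplicate already kept
            lsh.insert(str(i), m)
            kept.append(doc)
    return kept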

Readings
SmolLM2 Technical Report
Ben Allal, Loubna, et al.
arXiv:2502.02737
The Smol Training Playbook
Ben Allal, Loubna, et al.
huggingface.co
Phase 2 The Model Skeleton

Design the neural architecture. Focus on efficiency hacks to break the O(N²) barrier and reduce parameter counts.

Week 5: Breaking Quadratic Complexity (Hymba & RWKV) Feb 6
Build Goal

Create a "Hybrid Block." Implement a custom layer in PyTorch that toggles between Standard Attention, a Mamba-style SSM, or Linear Attention. Benchmark memory usage on a 4k context window.

Readings
Hymba: A Hybrid-head Architecture for Small Language Models
NVIDIA Research
arXiv:2411.13676
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Peng, Bo, et al.
arXiv:2404.05892
Week 6: Parameter Sharing & Non-Uniform Scaling Feb 13
Build Goal

Integrate Grouped Query Attention (GQA) or parameter sharing into the model definition to reduce the KV cache size.
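
A sketch of a grouped-query attention layer with illustrative dimensions is shown below; using fewer KV heads than query heads shrinks the KV cache by a factor of n_heads / n_kv_heads.

# gqa.py -- grouped-query attention sketch; dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16, n_kv_heads: int = 4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (batch, seq, d_model)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Repeat each KV head across its query group so shapes line up.
        group = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(group, dim=1), v.repeat_interleave(group, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(B, T, -1))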

Readings
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
Thawakar, Omkar, et al.
arXiv:2402.16840
OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
Mehta, Sachin, et al.
arXiv:2404.14619
Week 7: Composing Models & External Knowledge Feb 20
Build Goal

Explore mechanisms for integrating external knowledge into small models via scalable lookup or flexible data routing.
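
As one toy illustration (not the method of either assigned paper), a brute-force key-value memory layer is sketched below; sizes are placeholders, and a genuinely scalable lookup would replace the dense scoring with approximate nearest-neighbor search.

# memory_layer.py -- toy key-value memory lookup sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LookupMemory(nn.Module):
    def __init__(self, d_model: int = 1024, n_slots: int = 65_536, k: int = 8):
        super().__init__()
        self.k = k
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)

    def forward(self, h: torch.Tensor) -> torch.Tensor:      # (batch, seq, d_model)
        scores = h @ self.keys.T                              # dense scoring over all slots
        top, idx = scores.topk(self.k, dim=-1)                # sparse: only k slots per token
        weights = F.softmax(top, dim=-1)
        retrieved = (weights.unsqueeze(-1) * self.values[idx]).sum(dim=-2)
        return h + retrieved                                  # residual mix-in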

Readings
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Cheng, et al.
arXiv:2601.07372
FlexOlmo: Open Language Models for Flexible Data Use
Shi, et al.
arXiv:2507.07024
Phase 3 The Training Loop

Pre-train the model. Focus on predictive scaling, "mid-training" annealing, and sparsity.

Week 8: The Unified Data Lifecycle (Front-Loading & Mid-Training) Feb 27
Build Goal

Modify your training script to implement Mid-Training Annealing. Create an intermediate "Annealing Phase" for the final 50B tokens.
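
One way to sketch this is a two-phase schedule in the style of warmup-stable-decay training; the token budgets, learning rates, and loader names below are placeholders.

# annealing.py -- sketch of a two-phase "mid-training" schedule: after a token
# budget is hit, switch to a higher-quality data mix and decay the learning rate.
import math

TOTAL_TOKENS = 250e9
ANNEAL_TOKENS = 50e9            # final 50B tokens form the annealing phase
PEAK_LR, MIN_LR = 3e-4, 3e-5

def lr_at(tokens_seen: float) -> float:
    """Constant LR in the main phase, cosine decay to MIN_LR during annealing."""
    anneal_start = TOTAL_TOKENS - ANNEAL_TOKENS
    if tokens_seen < anneal_start:
        return PEAK_LR
    progress = (tokens_seen - anneal_start) / ANNEAL_TOKENS
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

def pick_loader(tokens_seen: float, main_loader, anneal_loader):
    """Upweight curated/synthetic data once the annealing phase begins."""
    return anneal_loader if tokens_seen >= TOTAL_TOKENS - ANNEAL_TOKENS else main_loader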

Readings
Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data
NVIDIA & Hugging Face Researchers
arXiv:2510.03264
Midtraining Bridges Pretraining and Posttraining Distributions
Liu, Emmy, Graham Neubig, and Chenyan Xiong
arXiv:2510.14865
Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning
Raghavendra, Mohit, Junmo Kang, and Alan Ritter
ACL 2025
Week 9: Predictive Scaling & The "Herd" Method Mar 13
Build Goal

Plot a "Scaling Forecast." Train your architecture at three small scales (10M, 50M, 100M) on a fixed 1B token subset. Use log-log linear fit to predict validation loss of your final 300M+ run.

Readings
The Llama 3 Herd of Models
Dubey, Abhimanyu, et al. (Meta AI)
arXiv:2407.21783
DeepSeek-V3 Technical Report
DeepSeek AI
arXiv:2412.19437
JEST: Joint Example Selection and Training for Efficient Data Selection
Mindermann, Sören, et al.
arXiv:2406.17711
Week 10: Sparse Architectures (MoE & DeepSeek Sparse Attention) Mar 20
Build Goal

Fork the architecture to test a "Sparsity Block." Implement either a Granular MoE (DeepSeekMoE style) or the Lightning Indexer attention mechanism.
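
A sketch of a granular top-k MoE MLP is shown below; expert count and sizes are placeholders, routing is written as a plain loop for clarity rather than speed, and no load-balancing loss is included.

# moe_block.py -- sketch of a granular top-k mixture-of-experts MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    def __init__(self, d_model: int = 1024, d_expert: int = 512, n_experts: int = 16, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # (batch, seq, d_model)
        B, T, D = x.shape
        flat = x.reshape(-1, D)                                 # one row per token
        gate = F.softmax(self.router(flat), dim=-1)
        weights, idx = gate.topk(self.k, dim=-1)                # (tokens, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):               # loop form: clear, not fast
            mask = (idx == e)
            if mask.any():
                rows = mask.any(dim=-1)
                w = (weights * mask).sum(dim=-1, keepdim=True)[rows]
                out[rows] += w * expert(flat[rows])
        return out.reshape(B, T, D)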

Readings
DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention
DeepSeek AI
PDF
OLMoE: Open Mixture-of-Experts Language Models
Muennighoff, Niklas, et al.
arXiv:2409.02060
Week 11: Infinite Context on a Budget (Context Scaling Laws) Mar 27
Build Goal

Fine-tune your model on a "Needle-in-a-Haystack" synthetic dataset to verify recall at 16k+ context.
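
A sketch for generating the synthetic dataset is shown below; the filler text, needle phrasing, and rough token accounting are all placeholders.

# niah_data.py -- generate synthetic needle-in-a-haystack examples: a random
# key-value "needle" buried in filler text, plus a question asking for the value.
import json
import random

FILLER = "The sky was a pale shade of grey and the streets were quiet. "

def make_example(context_tokens: int = 16_000, rng: random.Random = random.Random(0)) -> dict:
    key, value = f"code-{rng.randint(1000, 9999)}", rng.randint(100000, 999999)
    needle = f"The secret number for {key} is {value}. "
    filler_chunks = [FILLER] * (context_tokens // 12)    # ~12 filler tokens per chunk (rough)
    filler_chunks.insert(rng.randrange(len(filler_chunks)), needle)
    return {
        "context": "".join(filler_chunks),
        "question": f"What is the secret number for {key}?",
        "answer": str(value),
    }

with open("niah_16k.jsonl", "w") as f:
    for _ in range(200):
        f.write(json.dumps(make_example()) + "\n")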

Readings
Explaining Context Length Scaling and Bounds for Language Models
Chen, Y., et al.
arXiv:2502.01481
Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements
Haller, Patrick, et al.
arXiv:2511.05560
LIFT: Improving Long Context Understanding through Long Input Fine-Tuning
Wang, Y., et al.
arXiv:2502.14644
Phase 4 Specialization & Frontiers

Transform the raw base model into a useful product via Reasoning, Multimodality, and Alignment.

Week 12: Reasoning & Context (Phi-4) Apr 3
Build Goal

Evaluate the model on GSM8K (math).
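
A minimal exact-match evaluation sketch is shown below, assuming the checkpoint loads with transformers; the model path, prompt format, and subset size are placeholders.

# eval_gsm8k.py -- exact-match evaluation sketch on GSM8K.
import re
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "checkpoints/slm-300m"                 # placeholder local checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

def final_number(text: str):
    """Return the last number mentioned in the model's answer, if any."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else None

data = load_dataset("gsm8k", "main", split="test")
correct, n = 0, 200                                 # small subset for a quick pass
for ex in data.select(range(n)):
    prompt = ex["question"] + "\nLet's think step by step."
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    gold = ex["answer"].split("####")[-1].strip().replace(",", "")
    correct += int(final_number(completion) == gold)
print(f"Exact match: {correct / n:.2%}")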

Readings
Phi-4 Technical Report
Microsoft Research
arXiv:2412.08905
Week 13: Multimodal Projection (MiniCPM) Apr 10
Build Goal

Train a linear projector to allow the SLM to describe simple images.
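
A sketch of the projector itself is shown below; the vision embedding size (e.g., from a CLIP ViT) and the LM hidden size are assumptions, and both the vision encoder and the LM stay frozen while only the projector trains.

# projector.py -- linear projector mapping image patch embeddings into the LM's
# embedding space; dimensions are placeholders.
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    def __init__(self, d_vision: int = 768, d_model: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, n_patches, d_vision) -> (batch, n_patches, d_model); the result is
        # prepended to the text token embeddings before the LM forward pass.
        return self.proj(patch_embeds)

# Train only the projector, e.g.:
# optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)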

Readings
MiniCPM: Unveiling the Potential of Small Language Models
Hu, Shengding, et al.
arXiv:2404.06395
Week 14: Alignment Without Tax (ORPO/DPO) Apr 17
Build Goal

Fine-tune the student model on a preference dataset using ORPO.
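
In practice a library trainer such as TRL's ORPOTrainer can handle this; for intuition, a minimal sketch of the ORPO objective itself, computed from length-normalized log-probabilities the training loop already has, is shown below.

# orpo_loss.py -- sketch of the ORPO objective: standard NLL on the chosen
# response plus a log-odds-ratio penalty that pushes chosen above rejected.
import torch
import torch.nn.functional as F

def orpo_loss(nll_chosen: torch.Tensor,
              avg_logp_chosen: torch.Tensor,
              avg_logp_rejected: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """nll_chosen: SFT loss on chosen responses; avg_logp_*: mean token log-prob per response."""
    def log_odds(logp: torch.Tensor) -> torch.Tensor:
        # log( p / (1 - p) ) from the response's average log-probability
        return logp - torch.log1p(-torch.exp(logp).clamp(max=1 - 1e-6))
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    return nll_chosen.mean() + lam * (-F.logsigmoid(ratio)).mean()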

Readings
ORPO: Monolithic Preference Optimization without Reference Model
Hong, Jiwoo, et al.
arXiv:2403.07691
Week 15: The Theoretical Limits of Compression Apr 24
Readings
Language Modeling Is Compression
Delétang, Grégoire, et al.
arXiv:2309.10668
Discussion

Can a 1B model ever truly reason, or is it mathematically limited to surface-level pattern matching?