This seminar-style course is a practical study of maximizing computational intelligence within strict parameter constraints. The focus is on training small, rather than large, language models that retain robust capabilities despite their size, with a specific target of 1 billion parameters or fewer.
While traditional scaling laws suggest that performance scales primarily with model size, this seminar challenges that orthodoxy by focusing on data quality, architectural efficiency, and information density. Students will not just study these concepts but will actively and collectively build and train a sub-1B parameter model from scratch, implementing everything from the dataset curator to the final alignment loop.
Each week is dedicated to a key concept in building SLMs. Students will meet for 1 hour each week to discuss 1-3 papers and identify implementation goals. Each student is expected to present an overview of a paper approximately three times during the semester.
The class will use a shared GitHub repository for collectively building the SLM. Students will contribute three substantive changes based on the class's agreed and prioritized development goals. This is a 1-credit course; students should expect an average of 3 hours of work outside of class per week, though the workload varies from week to week.
Build a high-quality, synthetic-heavy dataset pipeline. For <1B models, data composition is the primary determinant of "intelligence."
Initialize the project repository and define the model configuration (e.g., 300M parameter target) based on available compute.
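A minimal sketch of what the shared model configuration might look like, assuming a LLaMA-style decoder-only Transformer with tied embeddings; every value below is an illustrative placeholder, not a decision the class has made.

```python
from dataclasses import dataclass

@dataclass
class SLMConfig:
    # Illustrative values for a ~300M-parameter decoder-only Transformer;
    # the real numbers should come from the compute-budget discussion.
    vocab_size: int = 32_000
    d_model: int = 1024
    n_layers: int = 24
    n_heads: int = 16
    n_kv_heads: int = 4          # for GQA later in the semester
    d_ff: int = 2816             # SwiGLU-style FFN width
    max_seq_len: int = 4096

    def approx_params(self) -> int:
        """Rough count: tied embeddings + attention + FFN weights per layer."""
        emb = self.vocab_size * self.d_model
        d_head = self.d_model // self.n_heads
        attn = 2 * self.d_model * self.d_model \
             + 2 * self.d_model * d_head * self.n_kv_heads
        ffn = 3 * self.d_model * self.d_ff    # gate, up, and down projections
        return emb + self.n_layers * (attn + ffn)

cfg = SLMConfig()
print(f"~{cfg.approx_params() / 1e6:.0f}M parameters")   # ~303M with these values
```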
Implement a "Textbook Synthesizer." Use a teacher model (e.g., Llama 3.1 70B) to generate a small corpus of synthetic reasoning data.
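A sketch of one way the synthesizer could work, assuming the teacher is served behind an OpenAI-compatible endpoint (e.g., a vLLM deployment hosting Llama 3.1 70B); the base URL, model name, topics, and prompt template are all placeholders.

```python
import json
from openai import OpenAI

# Placeholder endpoint: any OpenAI-compatible server (e.g., a local vLLM instance).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

TOPICS = ["fractions", "unit conversion", "basic probability"]
PROMPT = (
    "Write a short textbook-style explanation of {topic} for a beginner, "
    "followed by one worked example with step-by-step reasoning."
)

with open("synthetic_textbook.jsonl", "w") as f:
    for topic in TOPICS:
        resp = client.chat.completions.create(
            model="meta-llama/Llama-3.1-70B-Instruct",   # placeholder teacher
            messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
            temperature=0.8,
            max_tokens=512,
        )
        f.write(json.dumps({"topic": topic,
                            "text": resp.choices[0].message.content}) + "\n")
```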
Run a "Proxy Experiment." Train three tiny (10M param) models on different data ratios to select the optimal mix for the main build.
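A sketch of the proxy-experiment scaffolding under assumed source names and ratios; the toy corpora and the commented-out training call stand in for the class's dataloader and a short 10M-parameter run.

```python
import random

MIXTURES = {
    "web_heavy":   {"web": 0.8, "code": 0.1, "synthetic": 0.1},
    "balanced":    {"web": 0.5, "code": 0.2, "synthetic": 0.3},
    "synth_heavy": {"web": 0.3, "code": 0.2, "synthetic": 0.5},
}

def build_mixture(sources, ratios, n_docs, seed=0):
    """Sample documents from each source in proportion to its ratio."""
    rng = random.Random(seed)
    docs = []
    for name, ratio in ratios.items():
        docs += rng.choices(sources[name], k=int(n_docs * ratio))
    rng.shuffle(docs)
    return docs

# Toy stand-in corpora; the real pipeline pulls from the curated dataset.
sources = {"web": ["web doc"] * 100, "code": ["code doc"] * 100,
           "synthetic": ["synthetic doc"] * 100}

for name, ratios in MIXTURES.items():
    corpus = build_mixture(sources, ratios, n_docs=1_000)
    print(name, len(corpus), "documents")
    # val_loss = train_and_evaluate(corpus, n_params=10_000_000)  # hypothetical helper
```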
Finalize the main Dataloader. Implement MinHash deduplication and a "Data Scheduler."
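One way to implement the deduplication half of this step, assuming the `datasketch` library; the shingle size and Jaccard threshold are placeholders to tune, and the "Data Scheduler" is not shown.

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Hash 5-token shingles of a document into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(max(len(tokens) - 4, 1)):
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if no previously kept document is a near-duplicate."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if lsh.query(sig):            # near-duplicate of an already kept doc
            continue
        lsh.insert(str(idx), sig)
        kept.append(doc)
    return kept

docs = ["the cat sat on the quiet mat " * 3,
        "the cat sat on the quiet mat " * 3,          # exact duplicate, dropped
        "a completely different document about data quality"]
print(len(deduplicate(docs)))
```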
Design the neural architecture. Focus on efficiency hacks that break the O(N²) attention barrier and reduce parameter counts.
Create a "Hybrid Block." Implement a custom layer in PyTorch that can switch among Standard Attention, a Mamba-style SSM, and Linear Attention. Benchmark memory usage on a 4k context window.
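A minimal sketch of a hybrid block whose token mixer is chosen by config. Standard attention and a simple non-causal linear-attention variant are shown; a Mamba-style SSM branch would plug in the same way (e.g., via an external SSM package) but is omitted for brevity. All dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, mixer: str = "attention"):
        super().__init__()
        self.mixer = mixer
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, T, D)
        B, T, _ = x.shape
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.d_head)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))  # (B, H, T, d)
        if self.mixer == "attention":
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        elif self.mixer == "linear":
            # Kernel-feature linear attention: O(T * d^2) instead of O(T^2 * d).
            q, k = F.elu(q) + 1, F.elu(k) + 1
            kv = torch.einsum("bhtd,bhte->bhde", k, v)
            z = 1.0 / (torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2)) + 1e-6)
            y = torch.einsum("bhtd,bhde,bht->bhte", q, kv, z)
        else:
            raise ValueError(f"unknown mixer: {self.mixer}")
        y = y.transpose(1, 2).reshape(B, T, -1)
        return x + self.out(y)

with torch.no_grad():
    x = torch.randn(1, 4096, 512)                    # 4k-token context window
    for mixer in ("attention", "linear"):
        print(mixer, HybridBlock(mixer=mixer)(x).shape)
```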
Integrate Grouped Query Attention (GQA) or parameter sharing into the model definition to reduce the KV cache size.
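A sketch of a GQA layer under assumed dimensions (16 query heads sharing 4 KV heads), which shrinks the K/V projections and the KV cache by a factor of n_heads / n_kv_heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16, n_kv_heads: int = 4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, T, D)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Repeat each KV head so every group of query heads attends to it.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(B, T, -1))

attn = GQAttention()
print(attn(torch.randn(2, 128, 1024)).shape)   # K/V projections are 4x smaller than MHA
```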
Explore mechanisms for integrating external knowledge into small models via scalable lookup or flexible data routing.
Pre-train the model. Focus on predictive scaling, "mid-training" annealing, and sparsity.
Modify your training script to implement Mid-Training Annealing. Create an intermediate "Annealing Phase" for the final 50B tokens.
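A sketch of the schedule logic, assuming a hypothetical 1T-token total budget with the last 50B tokens forming the annealing phase; the learning rates and sampling weights are placeholders for decisions made in class.

```python
TOTAL_TOKENS  = 1_000_000_000_000        # assumed total budget
ANNEAL_TOKENS = 50_000_000_000           # final 50B tokens form the annealing phase
ANNEAL_START  = TOTAL_TOKENS - ANNEAL_TOKENS
PEAK_LR, MIN_LR = 3e-4, 3e-5             # placeholder learning rates

def learning_rate(tokens_seen: int) -> float:
    """Hold the peak LR during the main phase, then decay linearly in the anneal."""
    if tokens_seen < ANNEAL_START:
        return PEAK_LR
    frac = (tokens_seen - ANNEAL_START) / ANNEAL_TOKENS
    return PEAK_LR + frac * (MIN_LR - PEAK_LR)

def data_weights(tokens_seen: int) -> dict[str, float]:
    """Shift the sampling mix toward curated/synthetic data during the anneal."""
    if tokens_seen < ANNEAL_START:
        return {"web": 0.7, "code": 0.2, "synthetic": 0.1}
    return {"web": 0.3, "code": 0.2, "synthetic": 0.5}

for t in (0, ANNEAL_START, ANNEAL_START + ANNEAL_TOKENS // 2, TOTAL_TOKENS):
    print(f"{t:>16,d} tokens  lr={learning_rate(t):.1e}  mix={data_weights(t)}")
```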
Plot a "Scaling Forecast." Train your architecture at three small scales (10M, 50M, 100M) on a fixed 1B-token subset. Use a log-log linear fit to predict the validation loss of your final 300M+ run.
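The forecast itself is a short linear fit in log-log space; the loss values below are made-up placeholders to show the mechanics, and the real numbers come from the three proxy runs.

```python
import numpy as np

params = np.array([10e6, 50e6, 100e6])
val_loss = np.array([4.10, 3.55, 3.30])          # placeholder measurements

# Fit log(loss) = intercept + slope * log(params) and extrapolate.
slope, intercept = np.polyfit(np.log(params), np.log(val_loss), deg=1)

def predict_loss(n_params: float) -> float:
    return float(np.exp(intercept + slope * np.log(n_params)))

print(f"fitted exponent: {slope:.3f}")
print(f"predicted validation loss at 300M params: {predict_loss(300e6):.2f}")
```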
Fork the architecture to test a "Sparsity Block." Implement either a Granular MoE (DeepSeekMoE style) or the Lightning Indexer attention mechanism.
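A sketch of the granular-MoE option: many small experts, a couple of always-active shared experts, and top-k routing. The routing is computed densely for readability, and the sizes, expert counts, and the omitted load-balancing loss are all placeholders.

```python
import torch
import torch.nn as nn

class GranularMoE(nn.Module):
    def __init__(self, d_model=512, d_expert=128, n_experts=16, n_shared=2, top_k=4):
        super().__init__()
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
        self.experts = nn.ModuleList(make_expert() for _ in range(n_experts))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, T, D)
        scores = self.router(x).softmax(dim=-1)                # (B, T, E)
        top_w, top_i = scores.topk(self.top_k, dim=-1)         # (B, T, k)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)        # renormalize gate weights
        out = sum(e(x) for e in self.shared)                   # always-active experts
        # Dense routing for clarity; a real layer gathers tokens per expert instead.
        for slot in range(self.top_k):
            for e_idx, expert in enumerate(self.experts):
                mask = (top_i[..., slot] == e_idx).unsqueeze(-1)   # (B, T, 1)
                if mask.any():
                    out = out + mask * top_w[..., slot:slot + 1] * expert(x)
        return out

moe = GranularMoE()
print(moe(torch.randn(2, 16, 512)).shape)   # torch.Size([2, 16, 512])
```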
Fine-tune your model on a "Needle-in-a-Haystack" synthetic dataset to verify recall at 16k+ context.
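A sketch of the synthetic dataset generator: bury a random key/value "needle" at a random depth inside roughly 16k tokens of filler and ask the model to retrieve it. The filler sentence, prompt format, and token estimate are placeholders; real haystacks should use natural text.

```python
import json
import random

FILLER = "The sky was clear and the market was quiet that afternoon. "

def make_example(context_tokens: int = 16_384, seed: int = 0) -> dict:
    """Build one needle-in-a-haystack retrieval example."""
    rng = random.Random(seed)
    key, value = f"code-{rng.randint(1000, 9999)}", rng.randint(100_000, 999_999)
    needle = f" The secret number for {key} is {value}. "
    n_fill = context_tokens // 12        # very rough tokens-per-sentence estimate
    pos = rng.randint(0, n_fill)
    sentences = [FILLER] * n_fill
    sentences.insert(pos, needle)
    return {
        "prompt": "".join(sentences) + f"\nWhat is the secret number for {key}?",
        "answer": str(value),
        "needle_depth": round(pos / n_fill, 3),   # where in the context the needle sits
    }

with open("niah_16k.jsonl", "w") as f:
    for i in range(100):
        f.write(json.dumps(make_example(seed=i)) + "\n")
```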
Transform the raw base model into a useful product via Reasoning, Multimodality, and Alignment.
Evaluate the model on GSM8K (grade-school math word problems).
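A sketch of a GSM8K accuracy check, assuming the checkpoint loads with Hugging Face `transformers`; the model path is a placeholder, decoding is greedy and zero-shot, and answers are compared against the dataset's standard "#### <answer>" field.

```python
import re
from datasets import load_dataset
from transformers import pipeline

generator = pipeline("text-generation", model="./checkpoints/slm-300m")  # placeholder path
test = load_dataset("gsm8k", "main", split="test").select(range(50))     # small slice

def extract_answer(text: str) -> str | None:
    """Take the last number in the model's output as its final answer."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else None

correct = 0
for ex in test:
    gold = ex["answer"].split("####")[-1].strip().replace(",", "")
    out = generator(ex["question"] + "\nAnswer:", max_new_tokens=256,
                    do_sample=False, return_full_text=False)
    pred = extract_answer(out[0]["generated_text"])
    correct += int(pred == gold)
print(f"accuracy: {correct / len(test):.2%}")
```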
Train a linear projector to allow the SLM to describe simple images.
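A sketch of the vision-to-language projector (LLaVA-style): features from a frozen image encoder are mapped into the SLM's embedding space and prepended to the text embeddings. The dimensions and the encoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 768, lm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)    # the only trainable piece

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, n_patches, vision_dim), e.g. from a frozen CLIP/SigLIP encoder
        return self.proj(image_feats)                # (B, n_patches, lm_dim)

projector = VisionProjector()
image_feats = torch.randn(2, 196, 768)               # 14x14 patch grid from the encoder
image_tokens = projector(image_feats)
text_embeds = torch.randn(2, 32, 1024)               # embedded caption prompt
lm_input = torch.cat([image_tokens, text_embeds], dim=1)   # fed to the SLM
print(lm_input.shape)                                # torch.Size([2, 228, 1024])
```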
Fine-tune the student model on a preference dataset using ORPO (Odds Ratio Preference Optimization).
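A sketch of the ORPO step using the TRL library, which is an assumption; exact argument names have shifted across TRL versions (e.g., `tokenizer=` vs. `processing_class=`), and the checkpoint path and preference dataset are placeholders.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model = AutoModelForCausalLM.from_pretrained("./checkpoints/slm-300m-sft")   # placeholder
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/slm-300m-sft")

# Preference data needs prompt/chosen/rejected-style columns.
prefs = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:1%]")

args = ORPOConfig(
    output_dir="./checkpoints/slm-300m-orpo",
    beta=0.1,                        # weight of the odds-ratio preference term
    per_device_train_batch_size=4,
    num_train_epochs=1,
)
trainer = ORPOTrainer(model=model, args=args, train_dataset=prefs,
                      processing_class=tokenizer)
trainer.train()
```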
Can a 1B model ever truly reason, or is it mathematically limited to surface-level pattern matching?