# alignrl

*From base model to deployed reasoning agent.*

## Training Pipeline

**Base** (Qwen2.5-3B, pre-trained) → **SFT** (supervised fine-tuning) → **GRPO** (verifiable rewards) *or* **DPO** (preference alignment) → **Eval** (GSM8K, ARC-Challenge, MATH) → **Serve** (Gradio demo)
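
As a rough sketch, each stage starts from the checkpoint the previous stage produced. A minimal example of loading the base model with Hugging Face `transformers`; the Hub id is the public Qwen2.5-3B release, while the stage output paths are hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-3B"  # stage 0: the pre-trained base model

# Hypothetical output directories written by the later stages;
# SFT initializes from BASE, GRPO/DPO initialize from the SFT checkpoint.
SFT_DIR, RL_DIR = "checkpoints/sft", "checkpoints/grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
```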

## Benchmark Results

| Benchmark | Base | SFT | GRPO | DPO |
| --- | --- | --- | --- | --- |

## Training Curves

*Figure: Training Loss (SFT + GRPO)*

*Figure: GRPO Reward Over Time*

## 💡 Reasoning Comparison

**Prompt**

> Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

**Base**

> Natalia sold 48 clips in April and half that in May, which is 48/2 = 24 clips. So she sold 48 + 24 = 72 clips total. The answer is 72.

**SFT**

> Let me solve this step by step.
> Step 1: Clips sold in April = 48
> Step 2: Clips sold in May = 48 / 2 = 24
> Step 3: Total clips = 48 + 24 = 72
> Therefore, Natalia sold \boxed{72} clips altogether.

**GRPO**

> I need to find the total clips sold across April and May.
> April sales: 48 clips
> May sales: "half as many" as April = 48 / 2 = 24 clips
> Total = April + May = 48 + 24 = 72
> Let me verify: 48 is double 24 (correct), and 48 + 24 = 72 (correct).
> \boxed{72}

## 📖 Colab Notebooks

### 01 - Supervised Fine-Tuning

QLoRA fine-tuning with Unsloth on OpenHermes-2.5. Learn chat template formatting and 4-bit training.
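
A minimal sketch of the notebook's setup, assuming Unsloth's 4-bit-ready mirror of the base model and TRL's `SFTTrainer` (recent API); the ShareGPT-style field names (`conversations`, `from`, `value`) follow the public OpenHermes-2.5 release, and the hyperparameters are illustrative:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

# QLoRA: load the base model with 4-bit quantized weights, then attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B",  # assumption: Unsloth's mirror of the base model
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (illustrative)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Chat-template formatting: map ShareGPT-style turns onto the model's chat template.
ROLE = {"system": "system", "human": "user", "gpt": "assistant"}

def to_text(example):
    messages = [{"role": ROLE[turn["from"]], "content": turn["value"]}
                for turn in example["conversations"]]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("teknium/OpenHermes-2.5", split="train").map(to_text)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(dataset_text_field="text", output_dir="checkpoints/sft",
                   per_device_train_batch_size=2, max_steps=1000),
)
trainer.train()
```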

### 02 - GRPO Training

Group Relative Policy Optimization (GRPO) with math-verify rewards on GSM8K: reinforcement learning with verifiable rewards for reasoning.
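
A minimal sketch of the verifiable-reward setup, assuming TRL's `GRPOTrainer` and the `math_verify` package's `parse`/`verify` helpers; the checkpoint path and hyperparameters are illustrative:

```python
from datasets import load_dataset
from math_verify import parse, verify
from trl import GRPOConfig, GRPOTrainer

def correctness_reward(completions, answer, **kwargs):
    # 1.0 when the completion's final answer verifiably matches the gold answer.
    # Extra dataset columns (here `answer`) are forwarded to reward functions by TRL.
    return [float(verify(parse(gold), parse(completion)))
            for completion, gold in zip(completions, answer)]

def prep(example):
    # GSM8K stores the gold answer after '####'.
    return {"prompt": example["question"],
            "answer": example["answer"].split("####")[-1].strip()}

dataset = load_dataset("openai/gsm8k", "main", split="train").map(prep)

trainer = GRPOTrainer(
    model="checkpoints/sft",  # hypothetical path to the SFT checkpoint
    reward_funcs=correctness_reward,
    train_dataset=dataset,
    args=GRPOConfig(output_dir="checkpoints/grpo", num_generations=8),
)
trainer.train()
```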

### 03 - DPO Alignment

Direct Preference Optimization on UltraFeedback. Align models to human preferences without training a separate reward model.
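
A minimal sketch with TRL's `DPOTrainer`, assuming a binarized UltraFeedback variant with `prompt`/`chosen`/`rejected` columns; the checkpoint paths and `beta` value are illustrative:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("checkpoints/sft")  # hypothetical SFT output
tokenizer = AutoTokenizer.from_pretrained("checkpoints/sft")

# Preference pairs: each row holds a prompt plus a chosen and a rejected response.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

trainer = DPOTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    # beta scales how strongly the policy is kept near the SFT reference model.
    args=DPOConfig(output_dir="checkpoints/dpo", beta=0.1),
)
trainer.train()
```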

### 04 - Evaluation

Benchmark all stages on GSM8K, ARC-Challenge, and MATH using lm-evaluation-harness.
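
A minimal sketch using the harness's Python entry point, `lm_eval.simple_evaluate`; the checkpoint path is hypothetical, and since the MATH task id varies across harness versions, only GSM8K and ARC-Challenge are shown:

```python
import lm_eval

# Evaluate one checkpoint; run this per stage (base, SFT, GRPO, DPO) to fill the table.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=checkpoints/grpo",  # hypothetical path
    tasks=["gsm8k", "arc_challenge"],
    num_fewshot=5,
)
print(results["results"])
```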

### 05 - Inference & Demo

Run inference with vLLM/MLX and launch the Gradio comparison demo. Includes side-by-side outputs.
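
A minimal sketch of the serving path with vLLM plus a single-model Gradio app (the notebook's comparison demo loads several checkpoints side by side); the checkpoint path and sampling settings are illustrative:

```python
import gradio as gr
from vllm import LLM, SamplingParams

llm = LLM(model="checkpoints/grpo")  # hypothetical path to the trained checkpoint
sampling = SamplingParams(temperature=0.7, max_tokens=512)

def answer(prompt: str) -> str:
    # vLLM returns one RequestOutput per prompt; take its first completion.
    outputs = llm.generate([prompt], sampling)
    return outputs[0].outputs[0].text

gr.Interface(fn=answer, inputs="text", outputs="text",
             title="alignrl reasoning demo").launch()
```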