# alignrl

*From base model to deployed reasoning agent.*

## Training Pipeline

**Base** (Qwen2.5-3B, pre-trained) → **SFT** (supervised fine-tuning) → **GRPO** (verifiable rewards) *or* **DPO** (preference alignment) → **Eval** (GSM8K, ARC-Challenge, MATH) → **Serve** (Gradio demo)
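
As a rough sketch, each stage starts from the checkpoint the previous stage produced. A minimal example of loading the base model with Hugging Face `transformers`; the Hub id is the public Qwen2.5-3B release, while the stage output paths are hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-3B"  # stage 0: the pre-trained base model

# Hypothetical output directories written by the later stages;
# SFT initializes from BASE, GRPO/DPO initialize from the SFT checkpoint.
SFT_DIR, RL_DIR = "checkpoints/sft", "checkpoints/grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
```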

## Benchmark Results

| Benchmark | Base | SFT | GRPO | DPO |
| --- | --- | --- | --- | --- |

## Training Curves

*Figure: Training Loss (SFT + GRPO)*

*Figure: GRPO Reward Over Time*

## 💡 Reasoning Comparison

**Prompt**

> Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

**Base**

> Natalia sold 48 clips in April and half that in May, which is 48/2 = 24 clips. So she sold 48 + 24 = 72 clips total. The answer is 72.

**SFT**

> Let me solve this step by step.
> Step 1: Clips sold in April = 48
> Step 2: Clips sold in May = 48 / 2 = 24
> Step 3: Total clips = 48 + 24 = 72
> Therefore, Natalia sold \boxed{72} clips altogether.

**GRPO**

> I need to find the total clips sold across April and May.
> April sales: 48 clips
> May sales: "half as many" as April = 48 / 2 = 24 clips
> Total = April + May = 48 + 24 = 72
> Let me verify: 48 is double 24 (correct), and 48 + 24 = 72 (correct).
> \boxed{72}

## 📖 Colab Notebooks

### 01 - Supervised Fine-Tuning

QLoRA fine-tuning with Unsloth on OpenHermes-2.5. Learn chat template formatting and 4-bit training.
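
A minimal sketch of the notebook's setup, assuming Unsloth's 4-bit-ready mirror of the base model and TRL's `SFTTrainer` (recent API); the ShareGPT-style field names (`conversations`, `from`, `value`) follow the public OpenHermes-2.5 release, and the hyperparameters are illustrative:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

# QLoRA: load the base model with 4-bit quantized weights, then attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B",  # assumption: Unsloth's mirror of the base model
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (illustrative)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Chat-template formatting: map ShareGPT-style turns onto the model's chat template.
ROLE = {"system": "system", "human": "user", "gpt": "assistant"}

def to_text(example):
    messages = [{"role": ROLE[turn["from"]], "content": turn["value"]}
                for turn in example["conversations"]]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("teknium/OpenHermes-2.5", split="train").map(to_text)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(dataset_text_field="text", output_dir="checkpoints/sft",
                   per_device_train_batch_size=2, max_steps=1000),
)
trainer.train()
```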

### 02 - GRPO Training

Group Relative Policy Optimization (GRPO) with math-verify rewards on GSM8K: reinforcement learning with verifiable rewards for reasoning.
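
A minimal sketch of the verifiable-reward setup, assuming TRL's `GRPOTrainer` and the `math_verify` package's `parse`/`verify` helpers; the checkpoint path and hyperparameters are illustrative:

```python
from datasets import load_dataset
from math_verify import parse, verify
from trl import GRPOConfig, GRPOTrainer

def correctness_reward(completions, answer, **kwargs):
    # 1.0 when the completion's final answer verifiably matches the gold answer.
    # Extra dataset columns (here `answer`) are forwarded to reward functions by TRL.
    return [float(verify(parse(gold), parse(completion)))
            for completion, gold in zip(completions, answer)]

def prep(example):
    # GSM8K stores the gold answer after '####'.
    return {"prompt": example["question"],
            "answer": example["answer"].split("####")[-1].strip()}

dataset = load_dataset("openai/gsm8k", "main", split="train").map(prep)

trainer = GRPOTrainer(
    model="checkpoints/sft",  # hypothetical path to the SFT checkpoint
    reward_funcs=correctness_reward,
    train_dataset=dataset,
    args=GRPOConfig(output_dir="checkpoints/grpo", num_generations=8),
)
trainer.train()
```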

### 03 - DPO Alignment

Direct Preference Optimization on UltraFeedback. Align models to human preferences without training a separate reward model.
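
A minimal sketch with TRL's `DPOTrainer`, assuming a binarized UltraFeedback variant with `prompt`/`chosen`/`rejected` columns; the checkpoint paths and `beta` value are illustrative:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("checkpoints/sft")  # hypothetical SFT output
tokenizer = AutoTokenizer.from_pretrained("checkpoints/sft")

# Preference pairs: each row holds a prompt plus a chosen and a rejected response.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

trainer = DPOTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    # beta scales how strongly the policy is kept near the SFT reference model.
    args=DPOConfig(output_dir="checkpoints/dpo", beta=0.1),
)
trainer.train()
```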

### 04 - Evaluation

Benchmark all stages on GSM8K, ARC-Challenge, and MATH using lm-evaluation-harness.
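
A minimal sketch using the harness's Python entry point, `lm_eval.simple_evaluate`; the checkpoint path is hypothetical, and since the MATH task id varies across harness versions, only GSM8K and ARC-Challenge are shown:

```python
import lm_eval

# Evaluate one checkpoint; run this per stage (base, SFT, GRPO, DPO) to fill the table.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=checkpoints/grpo",  # hypothetical path
    tasks=["gsm8k", "arc_challenge"],
    num_fewshot=5,
)
print(results["results"])
```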

### 05 - Inference & Demo

Run inference with vLLM/MLX and launch the Gradio comparison demo. Includes side-by-side outputs.
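
A minimal sketch of the serving path with vLLM plus a single-model Gradio app (the notebook's comparison demo loads several checkpoints side by side); the checkpoint path and sampling settings are illustrative:

```python
import gradio as gr
from vllm import LLM, SamplingParams

llm = LLM(model="checkpoints/grpo")  # hypothetical path to the trained checkpoint
sampling = SamplingParams(temperature=0.7, max_tokens=512)

def answer(prompt: str) -> str:
    # vLLM returns one RequestOutput per prompt; take its first completion.
    outputs = llm.generate([prompt], sampling)
    return outputs[0].outputs[0].text

gr.Interface(fn=answer, inputs="text", outputs="text",
             title="alignrl reasoning demo").launch()
```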