| Benchmark | Base | SFT | GRPO | DPO |
|---|---|---|---|---|
**SFT:** QLoRA fine-tuning with Unsloth on OpenHermes-2.5. Covers chat-template formatting and 4-bit training.
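The chat-template step can be sketched as follows. This is a minimal illustration of the ChatML format that OpenHermes-2.5 conversations are typically rendered into before tokenization; `format_chatml` is a hypothetical helper written for this sketch, not part of Unsloth's API.

```python
# Minimal sketch: render an OpenHermes-2.5-style conversation into ChatML
# (<|im_start|>role\ncontent<|im_end|>), the template commonly applied
# before SFT tokenization. `format_chatml` is illustrative, not an Unsloth API.

def format_chatml(conversation):
    """Render a list of {"role", "content"} turns as one ChatML string."""
    parts = []
    for turn in conversation:
        parts.append(f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>")
    return "\n".join(parts) + "\n"

example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

print(format_chatml(example))
```

In practice the same result comes from `tokenizer.apply_chat_template(...)` once the tokenizer's chat template is set; the sketch just makes the wire format visible.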
**GRPO:** Group Relative Policy Optimization on GSM8K with math-verify rewards. Verifiable-reward RL for reasoning.
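The core of GRPO can be sketched in a few lines: score a group of sampled completions with a verifiable reward, then normalize each reward against the group's mean and standard deviation to get advantages. The function names and the exact-match reward below are illustrative simplifications, not the TRL or math-verify API.

```python
# Minimal sketch of GRPO's group-relative advantage with a verifiable
# (exact-match) reward, as used for GSM8K-style math problems.
# Names are illustrative; real setups use math-verify for robust checking.
import statistics

def verify_reward(completion_answer, gold_answer):
    """Return 1.0 if the extracted answer matches the reference, else 0.0."""
    return 1.0 if completion_answer.strip() == gold_answer.strip() else 0.0

def group_relative_advantages(rewards):
    """Normalize each reward against its group's mean and (population) std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, scored against the gold answer "72".
answers = ["72", "68", "72", "70"]
rewards = [verify_reward(a, "72") for a in answers]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed within each group of samples for the same prompt, no learned value model is needed, which is what makes the method attractive for verifiable reasoning rewards.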
**DPO:** Direct Preference Optimization on UltraFeedback. Aligns the model to human preferences without training a separate reward model.
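The DPO objective on a single preference pair can be written directly from summed sequence log-probabilities under the policy and a frozen reference model. The values below are illustrative, and this per-pair scalar form is a sketch of the loss that `trl`'s `DPOTrainer` computes batched over tensors.

```python
# Minimal sketch of the per-pair DPO loss:
#   loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
# where each term is a summed log-probability over the response tokens.
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probs."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid)

# Policy favors the chosen response more than the reference does -> loss
# drops below log(2), the value at zero preference margin.
loss = dpo_loss(-10.0, -20.0, -12.0, -18.0, beta=0.1)
```

The reference model acts as a KL anchor: only the change in relative preference versus the reference is rewarded, which is how DPO sidesteps fitting an explicit reward model.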