This experiment demonstrates how to use FlexAI to fine-tune language models using reinforcement learning (RL) techniques with EasyR1, a framework for training reasoning-capable models using GRPO (Group Relative Policy Optimization), DAPO, and REINFORCE algorithms.
For illustration purposes, we’ll fine-tune the Qwen2.5-7B-Instruct model on mathematical reasoning tasks, using the math12k dataset and the GRPO algorithm to improve reasoning capabilities.
If you haven’t already connected FlexAI to GitHub, run flexai code-registry connect to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.
Quick Start
Run GRPO training on Qwen2.5-7B with this single command:
flexai training run grpo \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct
Replace <WANDB_API_KEY_SECRET_NAME> and <HF_AUTH_TOKEN_SECRET_NAME> with the names of the secrets you created (see the Create Secrets section below).
What is EasyR1?
EasyR1 is a reinforcement learning framework specifically designed for training language models with enhanced reasoning capabilities. It implements several RL algorithms optimized for LLMs:
- GRPO (Group Relative Policy Optimization): Efficient policy optimization using group-based advantage estimation
- DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization): Enhanced training with dynamic sampling and decoupled clipping for more stable optimization
- REINFORCE: Classic policy gradient method for LLM fine-tuning
The framework is built on top of verl (Volcano Engine Reinforcement Learning), providing distributed training capabilities with FSDP and vLLM integration.
Directory Structure
The code/easyR1/ directory contains:
config.yaml - Main GRPO training configuration
format_prompt/ - Jinja templates for prompt formatting
reward_function/ - Custom reward scoring functions
For baseline training scripts and additional examples, refer to the EasyR1 GitHub repository.
Understand the Configuration
EasyR1 uses a comprehensive YAML configuration file that controls all aspects of RL training. The main configuration file is located at code/easyR1/config.yaml in this repository.
Key Configuration Sections
Data Configuration
data:
train_files: hiyouga/math12k@train
val_files: hiyouga/math12k@test
prompt_key: problem
answer_key: answer
format_prompt: ./code/easyR1/format_prompt/math.jinja
max_prompt_length: 2048
max_response_length: 2048
rollout_batch_size: 512
Algorithm Settings
algorithm:
adv_estimator: grpo # GRPO, DAPO, or REINFORCE
use_kl_loss: true
kl_coef: 1.0e-2
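To build intuition for what adv_estimator: grpo does, here is a minimal sketch of group-relative advantage estimation: each prompt gets rollout.n sampled responses, and each response's advantage is its reward normalized against the group's mean and standard deviation. This is an illustrative simplification, not EasyR1's exact implementation.

```python
# Illustrative sketch of GRPO's group-relative advantage estimation:
# within one prompt's group of rollouts, each reward is normalized
# against the group mean and standard deviation.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rollout rewards within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# With rollout.n = 5, one prompt might yield these reward scores:
rewards = [1.0, 0.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Correct responses receive positive advantages and incorrect ones negative, so the policy is pushed toward the better responses within each group without needing a learned value function.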
Worker Configuration
worker:
actor:
model:
model_path: Qwen/Qwen2.5-7B-Instruct
optim:
lr: 1.0e-6
rollout:
n: 5 # number of rollout samples per prompt
temperature: 1.0
reward:
reward_type: batch
reward_function: ./code/easyR1/reward_function/math.py:compute_score
Reference Baseline Examples
For pre-configured training scripts and baseline examples, refer to the EasyR1 repository. The repository provides multiple baseline configurations for different models and tasks:
Available Baselines (in EasyR1 repo)
- Mathematical Reasoning:
qwen2_5_7b_math_grpo.sh, qwen3_4b_math_grpo.sh
- Geometric Reasoning (Vision-Language):
qwen2_5_vl_7b_geo3k_grpo.sh, qwen2_5_vl_7b_geo3k_dapo.sh, qwen2_5_vl_7b_geo3k_reinforce.sh
- Multi-Image Tasks:
qwen2_5_vl_7b_multi_image.sh
You can adapt these examples to work with FlexAI by following the training commands in this blueprint.
Customize Your Configuration
For your specific use case, you may want to create a custom configuration. Here’s how to customize config.yaml:
Custom Dataset
Replace the dataset configuration:
data:
train_files: your-username/your-dataset@train
val_files: your-username/your-dataset@test
prompt_key: question # adjust based on your dataset
answer_key: solution # adjust based on your dataset
Custom Reward Function
Create your own reward function in code/easyR1/reward_function/custom.py. With reward_type: batch (as set in config.yaml), the function receives one dict per sample and returns one dict of float scores per sample, matching the signature of the bundled math.py:
from typing import Any

def compute_score(reward_inputs: list[dict[str, Any]]) -> list[dict[str, float]]:
    """
    Args:
        reward_inputs: one dict per sample, containing at least the
            "response" (model output) and "ground_truth" keys
    Returns:
        One dict of float scores per sample; the "overall" score
        drives training
    """
    scores = []
    for reward_input in reward_inputs:
        # Your custom reward logic here
        score = your_evaluation_function(
            reward_input["response"], reward_input["ground_truth"]
        )
        scores.append({"overall": score})
    return scores
Then update the config to reference your custom reward function:
worker:
reward:
reward_function: ./code/easyR1/reward_function/custom.py:compute_score
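It is worth smoke-testing a reward function locally before launching a training run. The sketch below uses a hypothetical exact-match scorer written in the batch format shown in the bundled math.py (one input dict and one score dict per sample):

```python
# Hypothetical exact-match batch reward function, smoke-tested locally.
from typing import Any

def compute_score(reward_inputs: list[dict[str, Any]]) -> list[dict[str, float]]:
    scores = []
    for item in reward_inputs:
        # Reward 1.0 for an exact (whitespace-insensitive) match, else 0.0.
        correct = item["response"].strip() == item["ground_truth"].strip()
        scores.append({"overall": 1.0 if correct else 0.0})
    return scores

samples = [
    {"response": "60 mph", "ground_truth": "60 mph"},
    {"response": "55 mph", "ground_truth": "60 mph"},
]
scores = compute_score(samples)
print(scores)
```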
Custom Prompt Template
Create a custom Jinja template in code/easyR1/format_prompt/custom.jinja:
{{ problem }}
Please solve this step by step and provide your final answer.
Update the config:
data:
format_prompt: ./code/easyR1/format_prompt/custom.jinja
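To preview what the model will actually see, you can render the template yourself. The sketch below assumes the hypothetical custom.jinja above and uses the third-party jinja2 package to substitute the dataset's prompt_key value for {{ problem }}:

```python
# Render the hypothetical custom template the way a Jinja engine would:
# the dataset's prompt_key value is substituted for {{ problem }}.
from jinja2 import Template  # third-party; pip install jinja2

template = Template(
    "{{ problem }}\n"
    "Please solve this step by step and provide your final answer."
)
prompt = template.render(problem="What is 15 * 8?")
print(prompt)
```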
Create Secrets
To access HuggingFace models and datasets, you need a HuggingFace token.
Use the flexai secret create command to store your HuggingFace Token as a secret:
flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>
Then paste your HuggingFace token value when prompted.
Use the same command to store your Weights & Biases (wandb) API key as a secret:
flexai secret create <WANDB_API_KEY_SECRET_NAME>
Then paste your Weights & Biases API key value.
[Optional] Pre-fetch the Model
To speed up training and avoid downloading large models at runtime, you can pre-fetch your HuggingFace model to FlexAI storage:
1. Create a HuggingFace storage provider:
flexai storage create HF-STORAGE --provider huggingface --hf-token-name <HF_AUTH_TOKEN_SECRET_NAME>
2. Push the model checkpoint to your storage:
flexai checkpoint push qwen25-7b-instruct --storage-provider HF-STORAGE --source-path Qwen/Qwen2.5-7B-Instruct
Training
For RL training with EasyR1, we recommend using 1 node (8 × H100 GPUs) for 7B models to handle the actor, reference model, and rollout workers efficiently.
The commands below use this repository which contains all necessary configuration files in the code/easyR1/ directory.
Standard Training: Mathematical Reasoning with GRPO
flexai training run grpo \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct
Training with Model Prefetch
flexai training run grpo-prefetched \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--checkpoint qwen25-7b-instruct \
--env FORCE_TORCHRUN=1 \
--secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=/input-checkpoint/qwen25-7b-instruct
Training with Custom Configuration
To use a modified configuration or different dataset, override config values:
flexai training run grpo-custom \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct \
data.train_files=your-username/your-dataset@train \
data.val_files=your-username/your-dataset@test \
trainer.experiment_name=custom-experiment
Monitoring Training Progress
You can check the status and lifecycle events of your Training Job:
flexai training inspect grpo
View the logs of your Training Job:
flexai training logs grpo
Training Observability with Weights & Biases
EasyR1 supports Weights & Biases (wandb) integration for detailed training metrics visualization. The configuration already includes wandb logging:
trainer:
logger: ["file", "wandb"]
project_name: easy_r1
experiment_name: qwen2_5_7b_math_grpo
Getting Training Checkpoints
Once the Training Job completes successfully, you can list all produced checkpoints:
flexai training checkpoints grpo
Look for checkpoints marked as INFERENCE READY = true - these are ready for serving.
Serving the Trained Model
Deploy your RL-trained model directly from the checkpoint using FlexAI inference. Replace <CHECKPOINT_ID> with the ID from an inference-ready checkpoint:
flexai inference serve easyr1-reasoning-endpoint --checkpoint <CHECKPOINT_ID>
Monitor your inference endpoint status:
# List all inference endpoints
flexai inference list
# Get detailed endpoint information
flexai inference inspect easyr1-reasoning-endpoint
# Check endpoint logs
flexai inference logs easyr1-reasoning-endpoint
Testing Your RL-Trained Model
Once the endpoint is running, you can test it with reasoning tasks. For our mathematical reasoning example, the model should demonstrate improved step-by-step reasoning and accurate problem-solving.
Before and After Training Comparison
To illustrate the improvement from RL fine-tuning, here’s a comparison using a math problem:
Problem: “If a train travels 120 miles in 2 hours, what is its average speed in miles per hour?”
Base Model Response (Qwen2.5-7B-Instruct before RL training):
The average speed is 60 mph.
Issues: Correct answer but no reasoning steps shown
RL Fine-tuned Model Response (after GRPO training on math12k):
Let me solve this step by step:
Step 1: Identify the given information
- Distance traveled = 120 miles
- Time taken = 2 hours
Step 2: Apply the speed formula
Speed = Distance / Time
Step 3: Calculate
Speed = 120 miles / 2 hours = 60 miles per hour
Therefore, the average speed of the train is 60 mph.
Improvements: Clear reasoning steps, structured approach, educational value
This demonstrates how RL training encourages the model to show its reasoning process, making it more reliable and transparent.
Example API Call
curl -X POST "https://your-endpoint-url/v1/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"prompt": "Solve the following problem step by step: A rectangle has a length of 15 cm and a width of 8 cm. What is its area?",
"max_tokens": 500,
"temperature": 0.7
}'
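The same request can be issued from Python using only the standard library. As in the curl example, the endpoint URL and API key below are placeholders to replace with your own:

```python
# Build the same completion request with Python's standard library.
# The endpoint URL and API key are placeholders.
import json
import urllib.request

payload = {
    "prompt": ("Solve the following problem step by step: A rectangle has a "
               "length of 15 cm and a width of 8 cm. What is its area?"),
    "max_tokens": 500,
    "temperature": 0.7,
}
request = urllib.request.Request(
    "https://your-endpoint-url/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    method="POST",
)
# response = urllib.request.urlopen(request)  # uncomment with a live endpoint
print(request.get_method(), request.full_url)
```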
Expected Results
After RL fine-tuning with EasyR1, your model should achieve:
- Enhanced Reasoning: Step-by-step problem-solving with clear explanations
- Improved Accuracy: Higher success rate on reasoning tasks
- Better Generalization: Ability to apply learned reasoning patterns to new problems
- Structured Outputs: More organized and educational responses
For mathematical reasoning tasks:
- Explicit Step-by-Step Solutions: Clear breakdown of problem-solving process
- Higher Success Rate: Improved accuracy on math benchmarks
- Better Error Detection: Ability to identify and correct mistakes
Technical Details
Training Configuration Breakdown
Reinforcement Learning Components:
- Actor Model: The model being trained (policy network)
- Reference Model: Frozen copy for KL divergence computation
- Rollout Workers: Generate multiple responses for each prompt (n=5)
- Reward Function: Evaluates response quality (custom per task)
Distributed Training:
- FSDP (Fully Sharded Data Parallel): Efficient memory usage for large models
- vLLM Integration: Fast inference during rollout generation
- Tensor Parallelism: For rollout workers (size=2)
Optimization:
- GRPO Algorithm: Group-based advantage estimation for stable training
- KL Penalty: Prevents model from deviating too far from base model
- Gradient Checkpointing: Reduces memory usage during backpropagation
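The config's kl_penalty: low_var_kl setting refers to a low-variance KL estimator (often called "k3"). A rough per-token sketch, assuming actor and reference log-probabilities are available:

```python
# Sketch of the low-variance "k3" KL estimator: for log-prob ratio
# log_ratio = logp_ref - logp_actor, the estimate is
# exp(log_ratio) - log_ratio - 1, which is always non-negative.
import math

def low_var_kl(logp_actor, logp_ref):
    log_ratio = logp_ref - logp_actor
    return math.exp(log_ratio) - log_ratio - 1.0

# Identical distributions contribute zero penalty...
print(low_var_kl(-1.2, -1.2))
# ...and any divergence in either direction contributes a positive one.
print(low_var_kl(-1.2, -1.5))
```

Scaled by kl_coef (1.0e-2 here), this penalty discourages the actor from drifting too far from the frozen reference model.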
Resource Requirements
Recommended Configuration for Qwen2.5-7B:
- Nodes: 1 node (sufficient for RL training with actor + reference + rollout)
- Accelerators: 8 × H100 GPUs per node
- Memory: ~400GB+ GPU memory total (actor, reference, and rollout workers)
- Training Time: ~8-12 hours for 15 epochs
- Storage: ~50GB for checkpoints
Command Line Parameters Explained:
FORCE_TORCHRUN=1: Ensures proper distributed training setup
--runtime pytorch-28-vllm-0110-nvidia: PyTorch 2.8 with vLLM 0.11.0 optimized for EasyR1
--repository-url: Points to the FlexAI blueprints repository
config=code/easyR1/config.yaml: Main configuration file path relative to repository root
Key Configuration Parameters
Data Settings:
rollout_batch_size: 512: Number of prompts per training iteration
max_prompt_length: 2048: Maximum input length
max_response_length: 2048: Maximum output length
Algorithm Settings:
adv_estimator: grpo: Choice of RL algorithm
kl_coef: 1.0e-2: Strength of KL penalty
use_kl_loss: true: Enable KL divergence loss
Training Settings:
total_epochs: 15: Number of training epochs
n_gpus_per_node: 8: GPUs per node
val_freq: 5: Validation every 5 epochs
save_freq: 5: Save checkpoint every 5 epochs
Scaling Options
- For faster training: Increase to 2 nodes (16 × H100)
- For larger models: Increase tensor_parallel_size for rollout
- For better exploration: Increase rollout.n (more samples per prompt)
- For memory efficiency: Enable CPU offloading (enable_cpu_offload: true)
- For different tasks: Modify reward function and prompt templates
Advanced Examples
Vision-Language Model with Geometric Reasoning
flexai training run grpo-VL-Geo \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen2.5-VL-7B-Instruct \
data.train_files=hiyouga/geometry3k@train \
data.val_files=hiyouga/geometry3k@test \
data.format_prompt=./code/easyR1/format_prompt/r1v.jinja \
worker.reward.reward_function=./code/easyR1/reward_function/r1v.py:compute_score \
trainer.experiment_name=qwen2_5_vl_7b_geo3k_grpo
Using DAPO Algorithm
flexai training run Dapo-14B \
--accels 8 --nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--env FORCE_TORCHRUN=1 \
--secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
--secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
--requirements-path code/easyR1/requirements.txt \
--runtime pytorch-28-vllm-0110-nvidia \
-- python3 -m verl.trainer.main \
config=code/easyR1/config.yaml \
worker.actor.model.model_path=Qwen/Qwen3-14B \
algorithm.adv_estimator=dapo \
algorithm.online_filtering=true \
data.train_files=hiyouga/dapo17k@train \
data.val_files=hiyouga/dapo17k@test \
data.format_prompt=./code/easyR1/format_prompt/dapo.jinja \
worker.reward.reward_function=./code/easyR1/reward_function/dapo.py:compute_score \
trainer.experiment_name=qwen3_14b_dapo17k_dapo
Troubleshooting
Training Job Fails to Start:
# Check FlexAI authentication
flexai auth status
# Verify repository access
git clone https://github.com/flexaihq/blueprints
Out of Memory Errors:
- Reduce rollout_batch_size from 512 to 256
- Reduce rollout.n from 5 to 3 (fewer samples per prompt)
- Enable CPU offloading: enable_cpu_offload: true in the FSDP config
- Reduce tensor_parallel_size for rollout workers
Reward Function Errors:
- Verify reward function path is correct in config
- Test reward function locally before training
- Ensure reward function returns float scores for all inputs
- Check for NaN or infinite reward values
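The last two checks can be automated with a small sanity harness run before training. The helper below is a hypothetical illustration that validates scores in the batch dict format used by this blueprint's reward functions:

```python
# Sanity-check a batch reward function's outputs before training:
# every score must be a finite float (no NaN or inf values).
import math

def check_reward_outputs(scores):
    for i, score in enumerate(scores):
        for key, value in score.items():
            if not isinstance(value, float) or not math.isfinite(value):
                raise ValueError(f"sample {i}: {key}={value!r} is not a finite float")
    return True

# A well-formed batch passes...
assert check_reward_outputs([{"overall": 1.0, "format": 0.0}])
# ...while NaN is caught before it can poison training.
try:
    check_reward_outputs([{"overall": float("nan")}])
except ValueError as e:
    print("caught:", e)
```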
Checkpoint Not Inference Ready:
- Wait for training to complete fully
- Check save_model_only: false in config to include all necessary files
- Verify training completed without errors
Endpoint Deployment Issues:
- Verify the checkpoint shows INFERENCE READY = true status
- Check FlexAI cluster availability
- Review detailed logs with flexai inference logs <endpoint-name>
Dataset Loading Issues:
- Verify the dataset path format: username/dataset@split
- Ensure your HuggingFace token has access to the dataset
- Check that prompt_key and answer_key match your dataset schema
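A quick way to catch malformed paths before a run is to validate them locally. The helper below is a hypothetical validator for the username/dataset@split convention described above:

```python
# Hypothetical validator for the username/dataset@split path format.
import re

PATTERN = re.compile(r"^(?P<repo>[\w.-]+/[\w.-]+)@(?P<split>\w+)$")

def parse_dataset_path(path):
    match = PATTERN.match(path)
    if not match:
        raise ValueError(f"bad dataset path: {path!r}")
    return match.group("repo"), match.group("split")

print(parse_dataset_path("hiyouga/math12k@train"))
```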
vLLM Rollout Errors:
- Adjust gpu_memory_utilization (default 0.6)
- Reduce tensor_parallel_size if GPUs are insufficient
- Enable enforce_eager: true for debugging
References
Code
requirements.txt
git+https://github.com/hiyouga/EasyR1.git@d146d24e990c8102fee44e61e5ca389907712960
config.yaml
---
data:
train_files: hiyouga/math12k@train
val_files: hiyouga/math12k@test
prompt_key: problem
answer_key: answer
image_key: images
video_key: videos
image_dir:
video_fps: 2.0
max_prompt_length: 2048
max_response_length: 2048
rollout_batch_size: 512
mini_rollout_batch_size:
val_batch_size: 1024
format_prompt: ./code/easyR1/format_prompt/math.jinja
override_chat_template:
shuffle: true
seed: 1
min_pixels: 262144
max_pixels: 4194304
filter_overlong_prompts: true
algorithm:
adv_estimator: grpo
disable_kl: false
use_kl_loss: true
kl_penalty: low_var_kl
kl_coef: 1.0e-2
online_filtering: false
filter_key: overall
filter_low: 0.01
filter_high: 0.99
worker:
actor:
global_batch_size: 128
micro_batch_size_per_device_for_update: 1
micro_batch_size_per_device_for_experience: 2
max_grad_norm: 1.0
padding_free: true
dynamic_batching: true
ulysses_size: 1
model:
model_path: Qwen/Qwen2.5-7B-Instruct
enable_gradient_checkpointing: true
trust_remote_code: false
freeze_vision_tower: false
optim:
lr: 1.0e-6
weight_decay: 1.0e-2
strategy: adamw
lr_warmup_ratio: 0.0
fsdp:
enable_full_shard: true
enable_cpu_offload: false
enable_rank0_init: true
offload:
offload_params: true
offload_optimizer: true
rollout:
n: 5
temperature: 1.0
top_p: 1.0
limit_images: 0
gpu_memory_utilization: 0.6
enforce_eager: false
enable_chunked_prefill: false
tensor_parallel_size: 2
disable_tqdm: true
val_override_config:
temperature: 0.6
top_p: 0.95
n: 1
ref:
fsdp:
enable_full_shard: true
enable_cpu_offload: true
enable_rank0_init: true
offload:
offload_params: false
reward:
reward_type: batch
reward_function: ./code/easyR1/reward_function/math.py:compute_score
trainer:
total_epochs: 15
max_steps:
project_name: easy_r1
experiment_name: qwen2_5_7b_math_grpo
logger: [file, wandb]
nnodes: 1
n_gpus_per_node: 8
max_try_make_batch: 20
val_freq: 5
val_before_train: true
val_only: false
val_generations_to_log: 3
save_freq: 5
save_limit: 3
save_model_only: false
save_checkpoint_path: /output-checkpoint
load_checkpoint_path:
find_last_checkpoint: true
format_prompt/math.jinja
{{ content | trim }} You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}.
reward_function/math.py
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
import re
from typing import Any
from mathruler.grader import extract_boxed_content, grade_answer
def format_reward(response: str) -> float:
pattern = re.compile(r"<think>.*</think>.*\\boxed\{.*\}.*", re.DOTALL)
format_match = re.fullmatch(pattern, response)
return 1.0 if format_match else 0.0
def accuracy_reward(response: str, ground_truth: str) -> float:
answer = extract_boxed_content(response)
return 1.0 if grade_answer(answer, ground_truth) else 0.0
def compute_score(
reward_inputs: list[dict[str, Any]], format_weight: float = 0.1
) -> list[dict[str, float]]:
if not isinstance(reward_inputs, list):
raise ValueError("Please use `reward_type=batch` for math reward function.")
scores = []
for reward_input in reward_inputs:
response = re.sub(
r"\s*(<|>|/)\s*", r"\1", reward_input["response"]
)
format_score = format_reward(response)
accuracy_score = accuracy_reward(response, reward_input["ground_truth"])
scores.append(
{
"overall": (1 - format_weight) * accuracy_score
+ format_weight * format_score,
"format": format_score,
"accuracy": accuracy_score,
}
)
return scores