This experiment demonstrates how to use FlexAI to fine-tune language models with reinforcement learning (RL) using EasyR1, a framework for training reasoning-capable models with the GRPO (Group Relative Policy Optimization), DAPO, and REINFORCE algorithms. For illustration purposes, we’ll fine-tune the Qwen2.5-7B-Instruct model on mathematical reasoning tasks using the math12k dataset and the GRPO algorithm to improve its reasoning capabilities.
If you haven’t already connected FlexAI to GitHub, run flexai code-registry connect to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.

Quick Start

Run GRPO training on Qwen2.5-7B with this single command:
flexai training run grpo \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct
Replace <WANDB_API_KEY_SECRET_NAME> and <HF_AUTH_TOKEN_SECRET_NAME> with the names of the FlexAI secrets that hold your Weights & Biases API key and HuggingFace token (see Create Secrets below).

What is EasyR1?

EasyR1 is a reinforcement learning framework specifically designed for training language models with enhanced reasoning capabilities. It implements several RL algorithms optimized for LLMs:
  • GRPO (Group Relative Policy Optimization): Efficient policy optimization using group-based advantage estimation
  • DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): GRPO variant with decoupled clipping ranges and dynamic sampling (online filtering of uninformative prompts)
  • REINFORCE: Classic policy gradient method for LLM fine-tuning
The framework is built on top of verl (Volcano Engine Reinforcement Learning), providing distributed training capabilities with FSDP and vLLM integration.
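To build intuition for the group-based advantage estimation behind GRPO, here is a minimal, framework-agnostic sketch (illustrative only, not EasyR1’s implementation): each prompt is rolled out several times, and every response’s reward is normalized against its own group.
# Minimal sketch of GRPO-style group-relative advantages (illustrative only).
# One group = the n rollouts generated for the same prompt (rollout.n = 5 in this config).
from statistics import mean, stdev

def group_relative_advantages(rewards_per_prompt: list[list[float]], eps: float = 1e-6) -> list[list[float]]:
    advantages = []
    for group in rewards_per_prompt:
        mu = mean(group)
        sigma = stdev(group) if len(group) > 1 else 0.0
        advantages.append([(r - mu) / (sigma + eps) for r in group])
    return advantages

# Example: rewards for 2 prompts, 5 rollouts each
print(group_relative_advantages([[1.0, 0.0, 1.0, 0.0, 1.0], [0.9, 0.9, 1.0, 0.8, 0.9]]))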

Directory Structure

The code/easyR1/ directory contains:
  • config.yaml - Main GRPO training configuration
  • format_prompt/ - Jinja templates for prompt formatting
  • reward_function/ - Custom reward scoring functions
For baseline training scripts and additional examples, refer to the EasyR1 GitHub repository.

Understand the Configuration

EasyR1 uses a comprehensive YAML configuration file that controls all aspects of RL training. The main configuration file is located at code/easyR1/config.yaml in this repository.

Key Configuration Sections

Data Configuration

data:
  train_files: hiyouga/math12k@train
  val_files: hiyouga/math12k@test
  prompt_key: problem
  answer_key: answer
  format_prompt: ./code/easyR1/format_prompt/math.jinja
  max_prompt_length: 2048
  max_response_length: 2048
  rollout_batch_size: 512
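Before launching a job, you can sanity-check that prompt_key and answer_key match the dataset schema by peeking at the dataset locally (a sketch using the Hugging Face datasets library; assumes it is installed and your HF token grants access to the dataset):
# Sketch: verify that the configured prompt_key/answer_key exist in the dataset.
from datasets import load_dataset

ds = load_dataset("hiyouga/math12k", split="train")
print(ds.column_names)         # expect "problem" and "answer", matching the config above
print(ds[0]["problem"][:200])  # peek at one prompt
print(ds[0]["answer"])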

Algorithm Settings

algorithm:
  adv_estimator: grpo  # GRPO, DAPO, or REINFORCE
  use_kl_loss: true
  kl_coef: 1.0e-2

Worker Configuration

worker:
  actor:
    model:
      model_path: Qwen/Qwen2.5-7B-Instruct
    optim:
      lr: 1.0e-6
  rollout:
    n: 5  # number of rollout samples per prompt
    temperature: 1.0
  reward:
    reward_type: batch
    reward_function: ./code/easyR1/reward_function/math.py:compute_score
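The reward_function value follows a path/to/file.py:function_name convention. Conceptually, such a reference can be resolved with standard importlib machinery (a simplified sketch, not EasyR1’s exact loader):
# Sketch: resolve a "path/to/file.py:function_name" reward reference.
import importlib.util

def load_reward_function(spec_str: str):
    path, func_name = spec_str.rsplit(":", 1)
    spec = importlib.util.spec_from_file_location("custom_reward", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, func_name)

# compute_score = load_reward_function("./code/easyR1/reward_function/math.py:compute_score")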

Reference Baseline Examples

For pre-configured training scripts and baseline examples, refer to the EasyR1 repository. The repository provides multiple baseline configurations for different models and tasks:

Available Baselines (in EasyR1 repo)

  • Mathematical Reasoning: qwen2_5_7b_math_grpo.sh, qwen3_4b_math_grpo.sh
  • Geometric Reasoning (Vision-Language): qwen2_5_vl_7b_geo3k_grpo.sh, qwen2_5_vl_7b_geo3k_dapo.sh, qwen2_5_vl_7b_geo3k_reinforce.sh
  • Multi-Image Tasks: qwen2_5_vl_7b_multi_image.sh
You can adapt these examples to work with FlexAI by following the training commands in this blueprint.

Customize Your Configuration

For your specific use case, you may want to create a custom configuration. Here’s how to customize the config.yaml:

Custom Dataset

Replace the dataset configuration:
data:
  train_files: your-username/your-dataset@train
  val_files: your-username/your-dataset@test
  prompt_key: question  # adjust based on your dataset
  answer_key: solution  # adjust based on your dataset

Custom Reward Function

Create your own reward function in code/easyR1/reward_function/custom.py:
from typing import Any


def compute_score(reward_inputs: list[dict[str, Any]]) -> list[dict[str, float]]:
    """Batch reward function (matches reward_type: batch in the config).

    Args:
        reward_inputs: One dict per sample, containing at least the model
            "response" and the sample's "ground_truth" answer.

    Returns:
        One dict per sample with an "overall" score (add extra keys such as
        "format" or "accuracy" if you want them logged).
    """
    scores = []
    for reward_input in reward_inputs:
        # Your custom reward logic here
        score = your_evaluation_function(
            reward_input["response"], reward_input["ground_truth"]
        )
        scores.append({"overall": score})
    return scores
Then update the config to reference your custom reward function:
worker:
  reward:
    reward_function: ./code/easyR1/reward_function/custom.py:compute_score
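Before launching a full training job, exercise the reward function locally on a few hand-written samples, as the Troubleshooting section also recommends. A minimal sketch, assuming you run it next to custom.py and have replaced the your_evaluation_function placeholder:
# Sketch: call the batch reward function on hand-written samples before training.
from custom import compute_score

samples = [
    {"response": "The area is 120 square cm.", "ground_truth": "120"},
    {"response": "I am not sure.", "ground_truth": "120"},
]
print(compute_score(samples))  # expect one {"overall": ...} dict per sample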

Custom Prompt Format

Create a custom Jinja template in code/easyR1/format_prompt/custom.jinja:
{{ "{{" }} problem {{ "}}" }}

Please solve this step by step and provide your final answer.
Update the config:
data:
  format_prompt: ./code/easyR1/format_prompt/custom.jinja
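To preview how the template formats a sample before training, you can render it locally (a sketch; assumes jinja2 is installed and that the framework exposes the prompt under the content variable, as the bundled math.jinja suggests):
# Sketch: render the prompt template on one sample to eyeball the final prompt.
from jinja2 import Template

template_text = open("./code/easyR1/format_prompt/custom.jinja").read()
prompt = Template(template_text).render(content="If a train travels 120 miles in 2 hours, what is its average speed?")
print(prompt)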

Create Secrets

To access HuggingFace models and datasets, you need a HuggingFace token. Use the flexai secret create command to store your HuggingFace Token as a secret:
flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>
Then paste your HuggingFace token value. Use the same command to store your Weights & Biases (wandb) API key as a secret:
flexai secret create <WANDB_API_KEY_SECRET_NAME>
Then paste your Weights & Biases API key value.

[Optional] Pre-fetch the Model

To speed up training and avoid downloading large models at runtime, you can pre-fetch your HuggingFace model to FlexAI storage:
  1. Create a HuggingFace storage provider:
    flexai storage create HF-STORAGE --provider huggingface --hf-token-name <HF_AUTH_TOKEN_SECRET_NAME>
    
  2. Push the model checkpoint to your storage:
    flexai checkpoint push qwen25-7b-instruct --storage-provider HF-STORAGE --source-path Qwen/Qwen2.5-7B-Instruct
    

Training

For RL training with EasyR1, we recommend using 1 node (8 × H100 GPUs) for 7B models to handle the actor, reference model, and rollout workers efficiently.
The commands below use this repository, which contains all the necessary configuration files in the code/easyR1/ directory.

Standard Training: Mathematical Reasoning with GRPO

flexai training run grpo \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct

Training with Model Prefetch

flexai training run grpo-prefetched \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --checkpoint qwen25-7b-instruct \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=/input-checkpoint/qwen25-7b-instruct

Training with Custom Configuration

To use a modified configuration or different dataset, override config values:
flexai training run grpo-custom \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct \
      data.train_files=your-username/your-dataset@train \
      data.val_files=your-username/your-dataset@test \
      trainer.experiment_name=custom-experiment

Monitoring Training Progress

You can check the status and lifecycle events of your Training Job:
flexai training inspect grpo
View the logs of your Training Job:
flexai training logs grpo

Training Observability with Weights & Biases

EasyR1 supports Weights & Biases (wandb) integration for detailed training metrics visualization. The configuration already includes wandb logging:
trainer:
  logger: ["file", "wandb"]
  project_name: easy_r1
  experiment_name: qwen2_5_7b_math_grpo

Getting Training Checkpoints

Once the Training Job completes successfully, you can list all produced checkpoints:
flexai training checkpoints grpo
Look for checkpoints marked as INFERENCE READY = true - these are ready for serving.

Serving the Trained Model

Deploy your RL-trained model directly from the checkpoint using FlexAI inference. Replace <CHECKPOINT_ID> with the ID from an inference-ready checkpoint:
flexai inference serve easyr1-reasoning-endpoint --checkpoint <CHECKPOINT_ID>
Monitor your inference endpoint status:
# List all inference endpoints
flexai inference list

# Get detailed endpoint information
flexai inference inspect easyr1-reasoning-endpoint

# Check endpoint logs
flexai inference logs easyr1-reasoning-endpoint

Testing Your RL-Trained Model

Once the endpoint is running, you can test it with reasoning tasks. For our mathematical reasoning example, the model should demonstrate improved step-by-step reasoning and accurate problem-solving.

Before and After Training Comparison

To illustrate the improvement from RL fine-tuning, here’s a comparison using a math problem.

Problem: “If a train travels 120 miles in 2 hours, what is its average speed in miles per hour?”

Base Model Response (Qwen2.5-7B-Instruct before RL training):
The average speed is 60 mph.
Issues: correct answer, but no reasoning steps shown.

RL Fine-tuned Model Response (after GRPO training on math12k):
Let me solve this step by step:

Step 1: Identify the given information
- Distance traveled = 120 miles
- Time taken = 2 hours

Step 2: Apply the speed formula
Speed = Distance / Time

Step 3: Calculate
Speed = 120 miles / 2 hours = 60 miles per hour

Therefore, the average speed of the train is 60 mph.
Improvements: clear reasoning steps, a structured approach, and educational value.

This demonstrates how RL training encourages the model to show its reasoning process, making it more reliable and transparent.

Example API Call

curl -X POST "https://your-endpoint-url/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Solve the following problem step by step: A rectangle has a length of 15 cm and a width of 8 cm. What is its area?",
    "max_tokens": 500,
    "temperature": 0.7
  }'
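The same request in Python, using the requests library (a sketch; the endpoint URL and API key are placeholders):
# Sketch: same completion request as the curl example above, via requests.
import requests

resp = requests.post(
    "https://your-endpoint-url/v1/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "prompt": "Solve the following problem step by step: A rectangle has a length of 15 cm and a width of 8 cm. What is its area?",
        "max_tokens": 500,
        "temperature": 0.7,
    },
)
print(resp.json())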

Expected Results

After RL fine-tuning with EasyR1, your model should achieve:
  • Enhanced Reasoning: Step-by-step problem-solving with clear explanations
  • Improved Accuracy: Higher success rate on reasoning tasks
  • Better Generalization: Ability to apply learned reasoning patterns to new problems
  • Structured Outputs: More organized and educational responses
For mathematical reasoning tasks:
  • Explicit Step-by-Step Solutions: Clear breakdown of problem-solving process
  • Higher Success Rate: Improved accuracy on math benchmarks
  • Better Error Detection: Ability to identify and correct mistakes

Technical Details

Training Configuration Breakdown

Reinforcement Learning Components:
  • Actor Model: The model being trained (policy network)
  • Reference Model: Frozen copy for KL divergence computation
  • Rollout Workers: Generate multiple responses for each prompt (n=5)
  • Reward Function: Evaluates response quality (custom per task)
Distributed Training:
  • FSDP (Fully Sharded Data Parallel): Efficient memory usage for large models
  • vLLM Integration: Fast inference during rollout generation
  • Tensor Parallelism: For rollout workers (size=2)
Optimization:
  • GRPO Algorithm: Group-based advantage estimation for stable training
  • KL Penalty: Prevents the model from deviating too far from the base model (sketched below)
  • Gradient Checkpointing: Reduces memory usage during backpropagation
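To connect the pieces above, here is a simplified (non-clipped) per-token loss with a KL penalty toward the reference model, written as illustrative PyTorch pseudocode rather than EasyR1’s actual implementation; the KL term uses the low-variance “k3” estimator corresponding to the low_var_kl setting in the config.
# Sketch: policy-gradient loss with a KL penalty toward the frozen reference model.
import torch

def kl_penalized_pg_loss(logprobs, ref_logprobs, advantages, kl_coef=1.0e-2):
    # logprobs, ref_logprobs: (batch, seq) token log-probs from the actor and the reference
    # advantages: (batch, 1) group-relative advantages, broadcast over the sequence
    pg_loss = -(advantages * logprobs)
    log_ratio = ref_logprobs - logprobs
    kl = torch.exp(log_ratio) - log_ratio - 1.0  # low-variance "k3" KL estimator
    return (pg_loss + kl_coef * kl).mean()

# Toy shapes: 2 sequences of 4 tokens each
loss = kl_penalized_pg_loss(torch.full((2, 4), -1.0), torch.full((2, 4), -1.2), torch.tensor([[0.5], [-0.5]]))
print(loss)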

Resource Requirements

Recommended Configuration for Qwen2.5-7B:
  • Nodes: 1 node (sufficient for RL training with actor + reference + rollout)
  • Accelerators: 8 × H100 GPUs per node
  • Memory: ~400GB+ GPU memory total (actor, reference, and rollout workers)
  • Training Time: ~8-12 hours for 15 epochs
  • Storage: ~50GB for checkpoints
Command Line Parameters Explained:
  • FORCE_TORCHRUN=1: Ensures proper distributed training setup
  • --runtime pytorch-28-vllm-0110-nvidia: PyTorch 2.8 with vLLM 0.11.0 optimized for EasyR1
  • --repository-url: Points to the FlexAI blueprints repository
  • config=code/easyR1/config.yaml: Main configuration file path relative to repository root

Key Configuration Parameters

Data Settings:
  • rollout_batch_size: 512: Number of prompts per training iteration
  • max_prompt_length: 2048: Maximum input length
  • max_response_length: 2048: Maximum output length
Algorithm Settings:
  • adv_estimator: grpo: Choice of RL algorithm
  • kl_coef: 1.0e-2: Strength of KL penalty
  • use_kl_loss: true: Enable KL divergence loss
Training Settings:
  • total_epochs: 15: Number of training epochs
  • n_gpus_per_node: 8: GPUs per node
  • val_freq: 5: Validation every 5 epochs
  • save_freq: 5: Save checkpoint every 5 epochs
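To see how these settings interact, here is the rough per-iteration arithmetic (a back-of-the-envelope sketch, assuming each rollout batch of prompt-response pairs is split into global_batch_size-sized update batches):
# Back-of-the-envelope: responses generated and actor updates per rollout iteration.
rollout_batch_size = 512   # prompts sampled per iteration (data.rollout_batch_size)
rollout_n = 5              # responses per prompt (worker.rollout.n)
global_batch_size = 128    # samples per actor update (worker.actor.global_batch_size)

responses_per_iteration = rollout_batch_size * rollout_n              # 2560 responses to score
updates_per_iteration = responses_per_iteration // global_batch_size  # about 20 actor updates
print(responses_per_iteration, updates_per_iteration)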

Scaling Options

  • For faster training: Increase to 2 nodes (16 × H100)
  • For larger models: Increase tensor_parallel_size for rollout
  • For better exploration: Increase rollout.n (more samples per prompt)
  • For memory efficiency: Enable CPU offloading (enable_cpu_offload: true)
  • For different tasks: Modify reward function and prompt templates

Advanced Examples

Vision-Language Model with Geometric Reasoning

flexai training run grpo-VL-Geo \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-VL-7B-Instruct \
      data.train_files=hiyouga/geometry3k@train \
      data.val_files=hiyouga/geometry3k@test \
      data.format_prompt=./code/easyR1/format_prompt/r1v.jinja \
      worker.reward.reward_function=./code/easyR1/reward_function/r1v.py:compute_score \
      trainer.experiment_name=qwen2_5_vl_7b_geo3k_grpo

Using DAPO Algorithm

flexai training run Dapo-14B \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen3-14B \
      algorithm.adv_estimator=dapo \
      algorithm.online_filtering=true \
      data.train_files=hiyouga/dapo17k@train \
      data.val_files=hiyouga/dapo17k@test \
      data.format_prompt=./code/easyR1/format_prompt/dapo.jinja \
      worker.reward.reward_function=./code/easyR1/reward_function/dapo.py:compute_score \
      trainer.experiment_name=qwen3_14b_dapo17k_dapo

Troubleshooting

Training Job Fails to Start:
# Check FlexAI authentication
flexai auth status

# Verify repository access
git clone https://github.com/flexaihq/blueprints
Out of Memory Errors:
  • Reduce rollout_batch_size from 512 to 256
  • Reduce rollout.n from 5 to 3 (fewer samples per prompt)
  • Enable CPU offloading: enable_cpu_offload: true in FSDP config
  • Reduce tensor_parallel_size for rollout workers
Reward Function Errors:
  • Verify reward function path is correct in config
  • Test reward function locally before training
  • Ensure reward function returns float scores for all inputs
  • Check for NaN or infinite reward values
Checkpoint Not Inference Ready:
  • Wait for training to complete fully
  • Ensure save_model_only is set to false in the config so that all files needed for serving are saved
  • Verify training completed without errors
Endpoint Deployment Issues:
  • Verify checkpoint shows INFERENCE READY = true status
  • Check FlexAI cluster availability
  • Review detailed logs with flexai inference logs <endpoint-name>
Dataset Loading Issues:
  • Verify dataset path format: username/dataset@split
  • Ensure HuggingFace token has access to datasets
  • Check prompt_key and answer_key match your dataset schema
vLLM Rollout Errors:
  • Adjust gpu_memory_utilization (default 0.6)
  • Reduce tensor_parallel_size if GPUs are insufficient
  • Enable enforce_eager: true for debugging

References

Code

requirements.txt

git+https://github.com/hiyouga/EasyR1.git@d146d24e990c8102fee44e61e5ca389907712960

config.yaml

---
data:
  train_files: hiyouga/math12k@train
  val_files: hiyouga/math12k@test
  prompt_key: problem
  answer_key: answer
  image_key: images
  video_key: videos
  image_dir:
  video_fps: 2.0
  max_prompt_length: 2048
  max_response_length: 2048
  rollout_batch_size: 512
  mini_rollout_batch_size:
  val_batch_size: 1024
  format_prompt: ./code/easyR1/format_prompt/math.jinja
  override_chat_template:
  shuffle: true
  seed: 1
  min_pixels: 262144
  max_pixels: 4194304
  filter_overlong_prompts: true

algorithm:
  adv_estimator: grpo
  disable_kl: false
  use_kl_loss: true
  kl_penalty: low_var_kl
  kl_coef: 1.0e-2
  online_filtering: false
  filter_key: overall
  filter_low: 0.01
  filter_high: 0.99

worker:
  actor:
    global_batch_size: 128
    micro_batch_size_per_device_for_update: 1
    micro_batch_size_per_device_for_experience: 2
    max_grad_norm: 1.0
    padding_free: true
    dynamic_batching: true
    ulysses_size: 1
    model:
      model_path: Qwen/Qwen2.5-7B-Instruct
      enable_gradient_checkpointing: true
      trust_remote_code: false
      freeze_vision_tower: false
    optim:
      lr: 1.0e-6
      weight_decay: 1.0e-2
      strategy: adamw
      lr_warmup_ratio: 0.0
    fsdp:
      enable_full_shard: true
      enable_cpu_offload: false
      enable_rank0_init: true
    offload:
      offload_params: true
      offload_optimizer: true

  rollout:
    n: 5
    temperature: 1.0
    top_p: 1.0
    limit_images: 0
    gpu_memory_utilization: 0.6
    enforce_eager: false
    enable_chunked_prefill: false
    tensor_parallel_size: 2
    disable_tqdm: true
    val_override_config:
      temperature: 0.6
      top_p: 0.95
      n: 1

  ref:
    fsdp:
      enable_full_shard: true
      enable_cpu_offload: true
      enable_rank0_init: true
    offload:
      offload_params: false

  reward:
    reward_type: batch
    reward_function: ./code/easyR1/reward_function/math.py:compute_score

trainer:
  total_epochs: 15
  max_steps:
  project_name: easy_r1
  experiment_name: qwen2_5_7b_math_grpo
  logger: [file, wandb]
  nnodes: 1
  n_gpus_per_node: 8
  max_try_make_batch: 20
  val_freq: 5
  val_before_train: true
  val_only: false
  val_generations_to_log: 3
  save_freq: 5
  save_limit: 3
  save_model_only: false
  save_checkpoint_path: /output-checkpoint
  load_checkpoint_path:
  find_last_checkpoint: true

format_prompt/math.jinja

{{ "{{" }} content | trim {{ "}}" }} You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}.

reward_function/math.py

# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0

import re
from typing import Any

from mathruler.grader import extract_boxed_content, grade_answer


def format_reward(response: str) -> float:
    pattern = re.compile(r"<think>.*</think>.*\\boxed\{.*\}.*", re.DOTALL)
    format_match = re.fullmatch(pattern, response)
    return 1.0 if format_match else 0.0


def accuracy_reward(response: str, ground_truth: str) -> float:
    answer = extract_boxed_content(response)
    return 1.0 if grade_answer(answer, ground_truth) else 0.0


def compute_score(
    reward_inputs: list[dict[str, Any]], format_weight: float = 0.1
) -> list[dict[str, float]]:
    if not isinstance(reward_inputs, list):
        raise ValueError("Please use `reward_type=batch` for math reward function.")

    scores = []
    for reward_input in reward_inputs:
        response = re.sub(
            r"\s*(<|>|/)\s*", r"\1", reward_input["response"]
        )
        format_score = format_reward(response)
        accuracy_score = accuracy_reward(response, reward_input["ground_truth"])
        scores.append(
            {
                "overall": (1 - format_weight) * accuracy_score
                + format_weight * format_score,
                "format": format_score,
                "accuracy": accuracy_score,
            }
        )

    return scores
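A quick local check of this reward function, run alongside the function above (a sketch; assumes the mathruler dependency pulled in by the EasyR1 requirements is installed):
# Sketch: score one well-formatted and one unformatted response.
samples = [
    {"response": "<think>120 / 2 = 60</think> The answer is \\boxed{60}", "ground_truth": "60"},
    {"response": "The answer is 60.", "ground_truth": "60"},
]
print(compute_score(samples))
# Expect one dict per sample with "overall", "format", and "accuracy" keys;
# only the first sample should receive the format reward.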