
# RL Fine-Tuning with EasyR1: GRPO & DAPO for Better Reasoning

> Fine-tune LLMs with reinforcement learning using EasyR1 on FlexAI. GRPO and DAPO algorithms, better reasoning, distributed training with FSDP and vLLM.

This experiment demonstrates how to fine-tune language models on FlexAI with reinforcement learning (RL) using [EasyR1](https://github.com/hiyouga/EasyR1), a framework for training reasoning-capable models with the GRPO (Group Relative Policy Optimization), DAPO, and REINFORCE algorithms.

For illustration purposes, we'll fine-tune the `Qwen2.5-7B-Instruct` model on mathematical reasoning tasks, using the `math12k` dataset and the GRPO algorithm.

<Note>
  If you haven't already connected FlexAI to GitHub, run `flexai code-registry connect` to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.
</Note>

## Quick Start

Run GRPO training on Qwen2.5-7B with this single command:

```bash theme={null}
flexai training run grpo \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct
```

Replace `<WANDB_API_KEY_SECRET_NAME>` and `<HF_AUTH_TOKEN_SECRET_NAME>` with the names of the FlexAI secrets that store your Weights & Biases API key and HuggingFace token.

## What is EasyR1?

EasyR1 is a reinforcement learning framework specifically designed for training language models with enhanced reasoning capabilities. It implements several RL algorithms optimized for LLMs:

* **GRPO (Group Relative Policy Optimization)**: Efficient policy optimization using group-based advantage estimation
* **DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)**: Extends GRPO-style training with decoupled clipping bounds and dynamic sampling that filters out uninformative prompts
* **REINFORCE**: Classic policy gradient method for LLM fine-tuning

The framework is built on top of [VERL (Versatile Efficient Reinforcement Learning)](https://github.com/volcengine/verl), providing distributed training capabilities with FSDP and vLLM integration.
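
As a rough illustration of the group-based advantage estimation that GRPO relies on, the sketch below normalizes the rewards of the responses sampled for a single prompt by the group's mean and standard deviation. It is a minimal, framework-free sketch and not EasyR1's implementation; the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` term are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other responses to the same prompt."""
    if not rewards:
        return []
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; real implementations may differ
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Five sampled responses for one prompt (matching rollout.n = 5); 1.0 = correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advantages])
```

Responses that score above the group average receive positive advantages and the rest receive negative ones. When every response in a group gets the same reward, all advantages are zero, so that prompt contributes no learning signal.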

## Directory Structure

The `code/easyR1/` directory contains:

* `config.yaml` - Main GRPO training configuration
* `format_prompt/` - Jinja templates for prompt formatting
* `reward_function/` - Custom reward scoring functions

For baseline training scripts and additional examples, refer to the [EasyR1 GitHub repository](https://github.com/hiyouga/EasyR1).

<Steps>
  <Step title="Understand the Configuration">
    EasyR1 uses a comprehensive YAML configuration file that controls all aspects of RL training. The main configuration file is located at `code/easyR1/config.yaml` in this repository.

    ### Key Configuration Sections

    #### Data Configuration

    ```yaml theme={null}
    data:
      train_files: hiyouga/math12k@train
      val_files: hiyouga/math12k@test
      prompt_key: problem
      answer_key: answer
      format_prompt: ./code/easyR1/format_prompt/math.jinja
      max_prompt_length: 2048
      max_response_length: 2048
      rollout_batch_size: 512
    ```

    #### Algorithm Settings

    ```yaml theme={null}
    algorithm:
      adv_estimator: grpo  # GRPO, DAPO, or REINFORCE
      use_kl_loss: true
      kl_coef: 1.0e-2
    ```

    #### Worker Configuration

    ```yaml theme={null}
    worker:
      actor:
        model:
          model_path: Qwen/Qwen2.5-7B-Instruct
        optim:
          lr: 1.0e-6
      rollout:
        n: 5  # number of rollout samples per prompt
        temperature: 1.0
      reward:
        reward_type: batch
        reward_function: ./code/easyR1/reward_function/math.py:compute_score
    ```
  </Step>

  <Step title="Reference Baseline Examples">
    For pre-configured training scripts and baseline examples, refer to the [EasyR1 repository](https://github.com/hiyouga/EasyR1). The repository provides multiple baseline configurations for different models and tasks:

    ### Available Baselines (in EasyR1 repo)

    * **Mathematical Reasoning**: `qwen2_5_7b_math_grpo.sh`, `qwen3_4b_math_grpo.sh`
    * **Geometric Reasoning (Vision-Language)**: `qwen2_5_vl_7b_geo3k_grpo.sh`, `qwen2_5_vl_7b_geo3k_dapo.sh`, `qwen2_5_vl_7b_geo3k_reinforce.sh`
    * **Multi-Image Tasks**: `qwen2_5_vl_7b_multi_image.sh`

    You can adapt these examples to work with FlexAI by following the training commands in this blueprint.
  </Step>

  <Step title="Customize Your Configuration">
    For your specific use case, you may want to create a custom configuration. Here's how to customize the `config.yaml`:

    ### Custom Dataset

    Replace the dataset configuration:

    ```yaml theme={null}
    data:
      train_files: your-username/your-dataset@train
      val_files: your-username/your-dataset@test
      prompt_key: question  # adjust based on your dataset
      answer_key: solution  # adjust based on your dataset
    ```

    ### Custom Reward Function

    Create your own reward function in `code/easyR1/reward_function/custom.py`:

    ```python theme={null}
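    def compute_score(reward_inputs):
        """
        Batch reward function (use with `reward_type: batch`).

        Args:
            reward_inputs: List of dicts, each with a "response" (model output)
                and a "ground_truth" (reference answer) entry, as in the
                shipped `math.py`

        Returns:
            List of dicts of float scores, using the same "overall" key as
            the shipped `math.py`
        """
        scores = []
        for reward_input in reward_inputs:
            # Your custom reward logic here; exact match is only a placeholder
            response = reward_input["response"].strip()
            answer = reward_input["ground_truth"].strip()
            scores.append({"overall": 1.0 if response == answer else 0.0})
        return scores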
    ```

    Then update the config to reference your custom reward function:

    ```yaml theme={null}
    worker:
      reward:
        reward_function: ./code/easyR1/reward_function/custom.py:compute_score
    ```

    ### Custom Prompt Format

    Create a custom Jinja template in `code/easyR1/format_prompt/custom.jinja`:

    ```
    {{ content | trim }}

    Please solve this step by step and provide your final answer.
    ```

    Update the config:

    ```yaml theme={null}
    data:
      format_prompt: ./code/easyR1/format_prompt/custom.jinja
    ```
  </Step>
</Steps>

## Create Secrets

To access HuggingFace models and datasets, you need a HuggingFace token.

Use the [`flexai secret create` command](https://docs.flex.ai/cli/reference/secret/create) to store your *HuggingFace Token* as a secret:

```bash theme={null}
flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>
```

Then paste your HuggingFace token value.

Use the same command to store your Weights & Biases (wandb) API key as a secret:

```bash theme={null}
flexai secret create <WANDB_API_KEY_SECRET_NAME>
```

Then paste your Weights & Biases API key value.

## \[Optional] Pre-fetch the Model

To speed up training and avoid downloading large models at runtime, you can pre-fetch your HuggingFace model to FlexAI storage:

1. **Create a HuggingFace storage provider:**

   ```bash theme={null}
   flexai storage create HF-STORAGE --provider huggingface --hf-token-name <HF_AUTH_TOKEN_SECRET_NAME>
   ```

2. **Push the model checkpoint to your storage:**

   ```bash theme={null}
   flexai checkpoint push qwen25-7b-instruct --storage-provider HF-STORAGE --source-path Qwen/Qwen2.5-7B-Instruct
   ```

## Training

For RL training with EasyR1, we recommend using **1 node (8 × H100 GPUs)** for 7B models to handle the actor, reference model, and rollout workers efficiently.

<Note>
  The commands below use the FlexAI blueprints repository, which contains all necessary configuration files in the `code/easyR1/` directory.
</Note>

### Standard Training: Mathematical Reasoning with GRPO

```bash theme={null}
flexai training run grpo \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct
```

### Training with Model Prefetch

```bash theme={null}
flexai training run grpo-prefetched \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --checkpoint qwen25-7b-instruct \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=/input-checkpoint/qwen25-7b-instruct
```

### Training with Custom Configuration

To use a modified configuration or different dataset, override config values:

```bash theme={null}
flexai training run grpo-custom \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-7B-Instruct \
      data.train_files=your-username/your-dataset@train \
      data.val_files=your-username/your-dataset@test \
      trainer.experiment_name=custom-experiment
```

## Monitoring Training Progress

You can check the status and lifecycle events of your Training Job:

```bash theme={null}
flexai training inspect grpo
```

View the logs of your Training Job:

```bash theme={null}
flexai training logs grpo
```

### Training Observability with Weights & Biases

EasyR1 supports Weights & Biases (wandb) integration for detailed training metrics visualization. The configuration already includes wandb logging:

```yaml theme={null}
trainer:
  logger: ["file", "wandb"]
  project_name: easy_r1
  experiment_name: qwen2_5_7b_math_grpo
```

## Getting Training Checkpoints

Once the Training Job completes successfully, you can list all produced checkpoints:

```bash theme={null}
flexai training checkpoints grpo
```

Look for checkpoints marked as `INFERENCE READY = true` - these are ready for serving.

## Serving the Trained Model

Deploy your RL-trained model directly from the checkpoint using FlexAI inference. Replace `<CHECKPOINT_ID>` with the ID from an inference-ready checkpoint:

```bash theme={null}
flexai inference serve easyr1-reasoning-endpoint --checkpoint <CHECKPOINT_ID>
```

Monitor your inference endpoint status:

```bash theme={null}
# List all inference endpoints
flexai inference list

# Get detailed endpoint information
flexai inference inspect easyr1-reasoning-endpoint

# Check endpoint logs
flexai inference logs easyr1-reasoning-endpoint
```

## Testing Your RL-Trained Model

Once the endpoint is running, you can test it with reasoning tasks. For our mathematical reasoning example, the model should demonstrate improved step-by-step reasoning and accurate problem-solving.

### Before and After Training Comparison

To illustrate the improvement from RL fine-tuning, here's a comparison using a math problem:

**Problem**: "If a train travels 120 miles in 2 hours, what is its average speed in miles per hour?"

**Base Model Response (Qwen2.5-7B-Instruct before RL training):**

```
The average speed is 60 mph.
```

*Issues: Correct answer but no reasoning steps shown*

**RL Fine-tuned Model Response (after GRPO training on math12k):**

```
Let me solve this step by step:

Step 1: Identify the given information
- Distance traveled = 120 miles
- Time taken = 2 hours

Step 2: Apply the speed formula
Speed = Distance / Time

Step 3: Calculate
Speed = 120 miles / 2 hours = 60 miles per hour

Therefore, the average speed of the train is 60 mph.
```

*Improvements: Clear reasoning steps, structured approach, educational value*

This illustrates how RL training encourages the model to show its reasoning process, which makes its answers easier to verify.

### Example API Call

```bash theme={null}
curl -X POST "https://your-endpoint-url/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Solve the following problem step by step: A rectangle has a length of 15 cm and a width of 8 cm. What is its area?",
    "max_tokens": 500,
    "temperature": 0.7
  }'
```
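
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; the endpoint URL and API key are the same placeholders, and the helper name `build_completion_request` is hypothetical.

```python
import json
import urllib.request


def build_completion_request(endpoint_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build a POST request with the same headers and body as the curl example."""
    payload = {"prompt": prompt, "max_tokens": 500, "temperature": 0.7}
    return urllib.request.Request(
        url=f"{endpoint_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_completion_request(
    "https://your-endpoint-url",  # placeholder, as in the curl example
    "YOUR_API_KEY",               # placeholder, as in the curl example
    "Solve the following problem step by step: A rectangle has a length of "
    "15 cm and a width of 8 cm. What is its area?",
)

# Uncomment once the placeholders are replaced with real values:
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(json.loads(response.read()))
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.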

## Expected Results

After RL fine-tuning with EasyR1, your model should achieve:

* **Enhanced Reasoning**: Step-by-step problem-solving with clear explanations
* **Improved Accuracy**: Higher success rate on reasoning tasks
* **Better Generalization**: Ability to apply learned reasoning patterns to new problems
* **Structured Outputs**: More organized and educational responses

For mathematical reasoning tasks:

* **Explicit Step-by-Step Solutions**: Clear breakdown of problem-solving process
* **Higher Success Rate**: Improved accuracy on math benchmarks
* **Better Error Detection**: Ability to identify and correct mistakes

## Technical Details

### Training Configuration Breakdown

**Reinforcement Learning Components:**

* **Actor Model**: The model being trained (policy network)
* **Reference Model**: Frozen copy for KL divergence computation
* **Rollout Workers**: Generate multiple responses for each prompt (n=5)
* **Reward Function**: Evaluates response quality (custom per task)

**Distributed Training:**

* **FSDP (Fully Sharded Data Parallel)**: Efficient memory usage for large models
* **vLLM Integration**: Fast inference during rollout generation
* **Tensor Parallelism**: For rollout workers (size=2)

**Optimization:**

* **GRPO Algorithm**: Group-based advantage estimation for stable training
* **KL Penalty**: Prevents model from deviating too far from base model
* **Gradient Checkpointing**: Reduces memory usage during backpropagation
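
The KL penalty listed above can be made concrete with a small numeric sketch. The shipped configuration sets `kl_penalty: low_var_kl`; the code below shows the commonly used low-variance ("k3") estimator of KL divergence computed from per-token log-probabilities. Whether this matches the exact estimator EasyR1 uses under that name is an assumption, the function name is hypothetical, and the log-probabilities are made-up numbers.

```python
import math


def low_variance_kl(policy_logprob: float, ref_logprob: float) -> float:
    """k3 estimator: exp(d) - d - 1 with d = ref_logprob - policy_logprob.

    It is always >= 0 and equals 0 when both models assign the same probability.
    """
    delta = ref_logprob - policy_logprob
    return math.exp(delta) - delta - 1.0


kl_coef = 1.0e-2  # value from config.yaml

# Per-token log-probabilities of one sampled response (made-up numbers).
policy_logprobs = [-0.10, -1.20, -0.05]
ref_logprobs = [-0.30, -1.00, -0.05]

per_token_kl = [low_variance_kl(p, r) for p, r in zip(policy_logprobs, ref_logprobs)]
kl_loss = kl_coef * sum(per_token_kl) / len(per_token_kl)
print(per_token_kl, kl_loss)
```

Tokens on which the actor and the reference model agree contribute nothing, while any disagreement adds a positive term that `kl_coef` scales into the loss.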

### Resource Requirements

**Recommended Configuration for Qwen2.5-7B:**

* **Nodes**: 1 node (sufficient for RL training with actor + reference + rollout)
* **Accelerators**: 8 × H100 GPUs per node
* **Memory**: \~400GB+ GPU memory total (actor, reference, and rollout workers)
* **Training Time**: \~8-12 hours for 15 epochs
* **Storage**: \~50GB for checkpoints

**Command Line Parameters Explained:**

* `FORCE_TORCHRUN=1`: Ensures proper distributed training setup
* `--runtime pytorch-28-vllm-0110-nvidia`: PyTorch 2.8 with vLLM 0.11.0 optimized for EasyR1
* `--repository-url`: Points to the FlexAI blueprints repository
* `config=code/easyR1/config.yaml`: Main configuration file path relative to repository root

### Key Configuration Parameters

**Data Settings:**

* `rollout_batch_size: 512`: Number of prompts per training iteration
* `max_prompt_length: 2048`: Maximum input length
* `max_response_length: 2048`: Maximum output length

**Algorithm Settings:**

* `adv_estimator: grpo`: Advantage estimator, which determines the RL algorithm
* `kl_coef: 1.0e-2`: Strength of KL penalty
* `use_kl_loss: true`: Enable KL divergence loss

**Training Settings:**

* `total_epochs: 15`: Number of training epochs
* `n_gpus_per_node: 8`: GPUs per node
* `val_freq: 5`: Run validation every 5 training steps
* `save_freq: 5`: Save a checkpoint every 5 training steps
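
To see how the data and rollout settings above interact, the following back-of-the-envelope sketch computes how many responses are generated per rollout step and the worst-case token count. All values are copied from `config.yaml`; the replica count assumes that each rollout engine spans `tensor_parallel_size` GPUs of a single node, which is an assumption about the runtime layout rather than something the configuration states.

```python
# Values copied from config.yaml
rollout_batch_size = 512      # prompts per rollout step
rollout_n = 5                 # responses sampled per prompt
max_prompt_length = 2048      # tokens
max_response_length = 2048    # tokens
n_gpus_per_node = 8
tensor_parallel_size = 2

responses_per_step = rollout_batch_size * rollout_n
max_tokens_per_sequence = max_prompt_length + max_response_length
worst_case_tokens_per_step = responses_per_step * max_tokens_per_sequence

# Assumption: each rollout engine replica spans `tensor_parallel_size` GPUs.
rollout_replicas = n_gpus_per_node // tensor_parallel_size

print(f"responses per rollout step: {responses_per_step}")          # 2560
print(f"max tokens per sequence:    {max_tokens_per_sequence}")     # 4096
print(f"worst-case tokens per step: {worst_case_tokens_per_step}")  # 10485760
print(f"rollout replicas per node:  {rollout_replicas}")            # 4
```

Because the number of generated responses scales linearly with both `rollout_batch_size` and `rollout.n`, lowering either one is a direct way to shrink the per-step workload.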

### Scaling Options

* **For faster training**: Increase to 2 nodes (16 × H100)
* **For larger models**: Increase `tensor_parallel_size` for rollout
* **For better exploration**: Increase `rollout.n` (more samples per prompt)
* **For memory efficiency**: Enable CPU offloading (`enable_cpu_offload: true`)
* **For different tasks**: Modify reward function and prompt templates

## Advanced Examples

### Vision-Language Model with Geometric Reasoning

```bash theme={null}
flexai training run grpo-VL-Geo \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen2.5-VL-7B-Instruct \
      data.train_files=hiyouga/geometry3k@train \
      data.val_files=hiyouga/geometry3k@test \
      data.format_prompt=./code/easyR1/format_prompt/r1v.jinja \
      worker.reward.reward_function=./code/easyR1/reward_function/r1v.py:compute_score \
      trainer.experiment_name=qwen2_5_vl_7b_geo3k_grpo
```

### Using DAPO Algorithm

```bash theme={null}
flexai training run Dapo-14B \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/easyR1/requirements.txt \
  --runtime pytorch-28-vllm-0110-nvidia \
  -- python3 -m verl.trainer.main \
      config=code/easyR1/config.yaml \
      worker.actor.model.model_path=Qwen/Qwen3-14B \
      algorithm.adv_estimator=dapo \
      algorithm.online_filtering=true \
      data.train_files=hiyouga/dapo17k@train \
      data.val_files=hiyouga/dapo17k@test \
      data.format_prompt=./code/easyR1/format_prompt/dapo.jinja \
      worker.reward.reward_function=./code/easyR1/reward_function/dapo.py:compute_score \
      trainer.experiment_name=qwen3_14b_dapo17k_dapo
```

## Troubleshooting

**Training Job Fails to Start:**

```bash theme={null}
# Check FlexAI authentication
flexai auth status

# Verify repository access
git clone https://github.com/flexaihq/blueprints
```

**Out of Memory Errors:**

* Reduce `rollout_batch_size` from 512 to 256
* Reduce `rollout.n` from 5 to 3 (fewer samples per prompt)
* Enable CPU offloading: `enable_cpu_offload: true` in FSDP config
* Reduce `tensor_parallel_size` for rollout workers

**Reward Function Errors:**

* Verify reward function path is correct in config
* Test reward function locally before training
* Ensure reward function returns float scores for all inputs
* Check for NaN or infinite reward values
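
One way to test a reward function locally is to call it on a few hand-written inputs and check the output before submitting a job. The sketch below assumes the batch interface used by `reward_function/math.py` (a list of dicts with `response` and `ground_truth` keys in, a list of score dicts out); `toy_compute_score` and `validate_scores` are hypothetical names for this example.

```python
import math
from typing import Any, Callable


def toy_compute_score(reward_inputs: list[dict[str, Any]]) -> list[dict[str, float]]:
    """Stand-in reward: 1.0 if the ground truth appears in the response."""
    return [
        {"overall": 1.0 if item["ground_truth"] in item["response"] else 0.0}
        for item in reward_inputs
    ]


def validate_scores(
    compute_score: Callable[[list[dict[str, Any]]], list[dict[str, float]]],
    reward_inputs: list[dict[str, Any]],
) -> list[dict[str, float]]:
    """Run the reward function and fail loudly on malformed output."""
    scores = compute_score(reward_inputs)
    if len(scores) != len(reward_inputs):
        raise ValueError("reward function must return one score per input")
    for index, score in enumerate(scores):
        for name, value in score.items():
            if not isinstance(value, float) or not math.isfinite(value):
                raise ValueError(f"score {index} has invalid {name!r}: {value!r}")
    return scores


samples = [
    {"response": "<think>2 + 2 = 4</think> \\boxed{4}", "ground_truth": "4"},
    {"response": "I am not sure.", "ground_truth": "4"},
]
print(validate_scores(toy_compute_score, samples))
```

Swapping `toy_compute_score` for your own function runs the same checks against it, covering the length, float-type, and NaN/infinity points from the list above.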

**Checkpoint Not Inference Ready:**

* Wait for training to complete fully
* Check `save_model_only: false` in config to include all necessary files
* Verify training completed without errors

**Endpoint Deployment Issues:**

* Verify checkpoint shows `INFERENCE READY = true` status
* Check FlexAI cluster availability
* Review detailed logs with `flexai inference logs <endpoint-name>`

**Dataset Loading Issues:**

* Verify dataset path format: `username/dataset@split`
* Ensure HuggingFace token has access to datasets
* Check that `prompt_key` and `answer_key` match your dataset schema
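
A quick schema check can catch a mismatched `prompt_key` or `answer_key` before a job is submitted. The sketch below works on plain dictionaries, so it can be applied to a few records loaded by any means; the helper name `missing_keys` and the sample records are made up for illustration.

```python
def missing_keys(records: list[dict], prompt_key: str, answer_key: str) -> list[tuple[int, str]]:
    """Return (record index, key) pairs for every required key that is absent or empty."""
    problems = []
    for index, record in enumerate(records):
        for key in (prompt_key, answer_key):
            if record.get(key) in (None, ""):
                problems.append((index, key))
    return problems


# Made-up records shaped like a question/solution dataset.
records = [
    {"question": "What is 3 * 7?", "solution": "21"},
    {"question": "What is 10 / 4?", "answer": "2.5"},  # wrong column name
]

print(missing_keys(records, prompt_key="question", answer_key="solution"))
```

An empty result means every checked record carries both configured keys.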

**vLLM Rollout Errors:**

* Adjust `gpu_memory_utilization` (default 0.6)
* Reduce `tensor_parallel_size` if GPUs are insufficient
* Enable `enforce_eager: true` for debugging

## References

* **EasyR1 GitHub**: [https://github.com/hiyouga/EasyR1](https://github.com/hiyouga/EasyR1)
* **VERL Framework**: [https://github.com/volcengine/verl](https://github.com/volcengine/verl)
* **FlexAI Documentation**: [https://docs.flex.ai](https://docs.flex.ai)
* **HybridFlow Paper**: [https://arxiv.org/abs/2409.19256](https://arxiv.org/abs/2409.19256)
* **GRPO Algorithm**: Introduced in DeepSeekMath paper - [https://arxiv.org/abs/2402.03300](https://arxiv.org/abs/2402.03300)
* **GRPO Documentation**: [https://huggingface.co/docs/trl/grpo\_trainer](https://huggingface.co/docs/trl/grpo_trainer)

## Code

### `requirements.txt`

```text theme={null}
git+https://github.com/hiyouga/EasyR1.git@d146d24e990c8102fee44e61e5ca389907712960
```

### `config.yaml`

```yaml theme={null}
---
data:
  train_files: hiyouga/math12k@train
  val_files: hiyouga/math12k@test
  prompt_key: problem
  answer_key: answer
  image_key: images
  video_key: videos
  image_dir:
  video_fps: 2.0
  max_prompt_length: 2048
  max_response_length: 2048
  rollout_batch_size: 512
  mini_rollout_batch_size:
  val_batch_size: 1024
  format_prompt: ./code/easyR1/format_prompt/math.jinja
  override_chat_template:
  shuffle: true
  seed: 1
  min_pixels: 262144
  max_pixels: 4194304
  filter_overlong_prompts: true

algorithm:
  adv_estimator: grpo
  disable_kl: false
  use_kl_loss: true
  kl_penalty: low_var_kl
  kl_coef: 1.0e-2
  online_filtering: false
  filter_key: overall
  filter_low: 0.01
  filter_high: 0.99

worker:
  actor:
    global_batch_size: 128
    micro_batch_size_per_device_for_update: 1
    micro_batch_size_per_device_for_experience: 2
    max_grad_norm: 1.0
    padding_free: true
    dynamic_batching: true
    ulysses_size: 1
    model:
      model_path: Qwen/Qwen2.5-7B-Instruct
      enable_gradient_checkpointing: true
      trust_remote_code: false
      freeze_vision_tower: false
    optim:
      lr: 1.0e-6
      weight_decay: 1.0e-2
      strategy: adamw
      lr_warmup_ratio: 0.0
    fsdp:
      enable_full_shard: true
      enable_cpu_offload: false
      enable_rank0_init: true
    offload:
      offload_params: true
      offload_optimizer: true

  rollout:
    n: 5
    temperature: 1.0
    top_p: 1.0
    limit_images: 0
    gpu_memory_utilization: 0.6
    enforce_eager: false
    enable_chunked_prefill: false
    tensor_parallel_size: 2
    disable_tqdm: true
    val_override_config:
      temperature: 0.6
      top_p: 0.95
      n: 1

  ref:
    fsdp:
      enable_full_shard: true
      enable_cpu_offload: true
      enable_rank0_init: true
    offload:
      offload_params: false

  reward:
    reward_type: batch
    reward_function: ./code/easyR1/reward_function/math.py:compute_score

trainer:
  total_epochs: 15
  max_steps:
  project_name: easy_r1
  experiment_name: qwen2_5_7b_math_grpo
  logger: [file, wandb]
  nnodes: 1
  n_gpus_per_node: 8
  max_try_make_batch: 20
  val_freq: 5
  val_before_train: true
  val_only: false
  val_generations_to_log: 3
  save_freq: 5
  save_limit: 3
  save_model_only: false
  save_checkpoint_path: /output-checkpoint
  load_checkpoint_path:
  find_last_checkpoint: true
```

### `format_prompt/math.jinja`

```
{{ content | trim }} You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}.
```

### `reward_function/math.py`

```python theme={null}
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0

import re
from typing import Any

from mathruler.grader import extract_boxed_content, grade_answer


def format_reward(response: str) -> float:
    pattern = re.compile(r"<think>.*</think>.*\\boxed\{.*\}.*", re.DOTALL)
    format_match = re.fullmatch(pattern, response)
    return 1.0 if format_match else 0.0


def accuracy_reward(response: str, ground_truth: str) -> float:
    answer = extract_boxed_content(response)
    return 1.0 if grade_answer(answer, ground_truth) else 0.0


def compute_score(
    reward_inputs: list[dict[str, Any]], format_weight: float = 0.1
) -> list[dict[str, float]]:
    if not isinstance(reward_inputs, list):
        raise ValueError("Please use `reward_type=batch` for math reward function.")

    scores = []
    for reward_input in reward_inputs:
        response = re.sub(
            r"\s*(<|>|/)\s*", r"\1", reward_input["response"]
        )
        format_score = format_reward(response)
        accuracy_score = accuracy_reward(response, reward_input["ground_truth"])
        scores.append(
            {
                "overall": (1 - format_weight) * accuracy_score
                + format_weight * format_score,
                "format": format_score,
                "accuracy": accuracy_score,
            }
        )

    return scores
```
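
To make the scoring concrete, the sketch below reuses the whitespace normalization and format pattern from `math.py` and combines the format score with a given accuracy value, so it runs without the `mathruler` dependency. The helper name `overall_score` and the sample responses are assumptions for this example; in the real reward function the accuracy comes from `grade_answer`.

```python
import re

FORMAT_PATTERN = re.compile(r"<think>.*</think>.*\\boxed\{.*\}.*", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response has <think>...</think> followed by \\boxed{...}."""
    # Same normalization as compute_score: drop whitespace around <, > and /.
    normalized = re.sub(r"\s*(<|>|/)\s*", r"\1", response)
    return 1.0 if FORMAT_PATTERN.fullmatch(normalized) else 0.0


def overall_score(accuracy: float, format_score: float, format_weight: float = 0.1) -> float:
    """Weighted sum used for the `overall` key."""
    return (1 - format_weight) * accuracy + format_weight * format_score


well_formed = "<think> 120 / 2 = 60 </think> The speed is \\boxed{60} mph."
no_thinking = "The speed is \\boxed{60} mph."

for response in (well_formed, no_thinking):
    fmt = format_reward(response)
    # Accuracy is assumed to be 1.0 here because both answers are correct.
    print(fmt, round(overall_score(1.0, fmt), 2))
```

With the default `format_weight` of 0.1, a correct answer without `<think>` tags still earns most of the reward, while a well-formatted but wrong answer earns only the small format share.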

