This experiment demonstrates how to perform Group Relative Policy Optimization (GRPO) training on a Vision-Language Model (VLM) using the TRL library. GRPO is a reinforcement learning technique that optimizes the model based on relative rewards within groups of generated responses. We will use Qwen2.5-VL-3B-Instruct as our base VLM and apply LoRA (Low-Rank Adaptation) for efficient fine-tuning. The training leverages vLLM for fast inference during the policy optimization phase and DeepSpeed ZeRO-3 for distributed training.
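The group-relative reward normalization at the heart of GRPO can be sketched in a few lines. This is a simplified illustration of the idea (not TRL's actual implementation): each prompt gets a group of sampled completions, and each completion's advantage is its reward normalized against the group's mean and standard deviation.

```python
import statistics

def group_relative_advantages(rewards, group_size):
    # Split the flat reward list into groups of completions for the same
    # prompt, then normalize each reward against its group's statistics.
    advantages = []
    for i in range(0, len(rewards), group_size):
        group = rewards[i:i + group_size]
        mean = statistics.mean(group)
        std = statistics.pstdev(group) or 1.0  # avoid division by zero
        advantages.extend((r - mean) / std for r in group)
    return advantages

# Two prompts, 4 completions each; above-average completions get positive
# advantages, below-average ones negative, and a uniform group gets zero.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.5, 0.5, 0.5, 0.5], group_size=4)
```

Because the baseline is computed within each group rather than by a learned value model, GRPO avoids training a separate critic, which is what makes it comparatively cheap for fine-tuning large models.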

Prerequisites

Connect to GitHub (if needed)

If you haven’t already connected FlexAI to GitHub, you’ll need to set up a code registry connection:
flexai code-registry connect
This will allow FlexAI to pull repositories directly from GitHub using the -u flag in training commands.

Create Secrets

To access the Qwen2.5-VL-3B-Instruct model, you may need to authenticate with your HuggingFace account, depending on the model’s access requirements. Use the flexai secret create command to store your HuggingFace Token as a secret. Replace <HF_AUTH_TOKEN_SECRET_NAME> with your desired name for the secret:
flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>
Then paste your HuggingFace Token value.

Training

To start the GRPO Training Job, run the following command:
flexai training run vlm-grpo-training --repository-url https://github.com/flexaihq/blueprints \
  --requirements-path code/vlm-grpo/requirements.txt \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --runtime pytorch-28-vllm-0110-nvidia \
  --nodes 1 --accels 2 \
  -- accelerate launch \
    --config_file=code/vlm-grpo/deepspeed_zero3.yaml \
    code/vlm-grpo/grpo_vlm.py \
    --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
    --output_dir /output-checkpoint/grpo-Qwen2.5-VL-3B-Instruct \
    --learning_rate 1e-5 \
    --gradient_checkpointing \
    --dtype bfloat16 \
    --max_prompt_length 2048 \
    --max_completion_length 1024 \
    --use_vllm \
    --vllm_mode colocate \
    --use_peft \
    --lora_target_modules "q_proj" "v_proj" \
    --log_completions \
    --logging_steps 200

Key Arguments Explained

  • --model_name_or_path: The base VLM model to fine-tune (Qwen2.5-VL-3B-Instruct)
  • --output_dir: Directory where checkpoints will be saved
  • --learning_rate: Learning rate for optimization (1e-5)
  • --gradient_checkpointing: Enables gradient checkpointing to reduce memory usage
  • --dtype bfloat16: Uses bfloat16 precision for training
  • --max_prompt_length: Maximum length of input prompts (2048 tokens)
  • --max_completion_length: Maximum length of generated completions (1024 tokens)
  • --use_vllm: Enables vLLM for fast inference during policy optimization
  • --vllm_mode colocate: Runs vLLM on the same GPUs as training
  • --use_peft: Enables LoRA for parameter-efficient fine-tuning
  • --lora_target_modules: Specifies which modules LoRA is applied to (q_proj, v_proj)
  • --log_completions: Logs generated completions during training
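A back-of-envelope calculation shows why targeting only the q_proj and v_proj attention projections with LoRA is so cheap. The dimensions below are illustrative, not the model's exact shapes:

```python
def lora_param_count(d_in, d_out, rank):
    # LoRA freezes the original d_in x d_out weight and learns two low-rank
    # factors instead: A (d_in x rank) and B (rank x d_out).
    return d_in * rank + rank * d_out

# Hypothetical numbers: a 2048-dim attention projection, LoRA rank 16.
full = 2048 * 2048                        # params to fully fine-tune one projection
lora = lora_param_count(2048, 2048, 16)   # params LoRA trains for the same layer
```

Here LoRA trains 65,536 parameters per projection instead of ~4.2 million, roughly a 64x reduction per targeted layer, which is what makes fine-tuning a 3B VLM feasible on 2 GPUs.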

Configuration Files

The experiment uses a DeepSpeed ZeRO-3 configuration file (code/vlm-grpo/deepspeed_zero3.yaml) that specifies:
  • ZeRO Stage 3 for memory-efficient distributed training
  • 2 processes (GPUs) for multi-GPU training
  • Mixed precision training with bfloat16
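The core idea of ZeRO Stage 3 can be sketched with a toy partitioning function. This is a conceptual illustration only, not DeepSpeed's implementation: each rank permanently stores just its slice of every tensor, and full tensors are all-gathered on demand for compute, then released.

```python
def zero3_shard(flat_params, world_size):
    # Each rank keeps only ~1/world_size of the parameters; nothing is
    # replicated, so memory per GPU shrinks as world_size grows.
    n = len(flat_params)
    per_rank = (n + world_size - 1) // world_size
    return [flat_params[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

# With num_processes: 2, each GPU holds half of the (toy) parameter list.
shards = zero3_shard(list(range(10)), world_size=2)
```

DeepSpeed applies the same partitioning to optimizer states and gradients as well, which is why ZeRO-3 fits models that would not fit replicated on a single GPU.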

Monitoring the Training Job

You can check the status and life cycle events of your Training Job by running:
flexai training inspect vlm-grpo-training
Additionally, you can view the logs of your Training Job by running:
flexai training logs vlm-grpo-training

Fetching the Trained Model

Once the Training Job completes successfully, you can list all the produced checkpoints:
flexai training checkpoints vlm-grpo-training
Download a checkpoint with:
flexai checkpoint fetch "<CPKT-ID>"
The checkpoint will contain the LoRA adapters that can be merged with the base model for inference.
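A merge could look like the following sketch using peft's merge_and_unload. The paths are placeholders, and the AutoModelForImageTextToText class is an assumption based on recent transformers releases; adjust to the model class and transformers version in your environment:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the frozen base model, then attach the fetched LoRA adapter on top.
base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "<PATH_TO_FETCHED_CHECKPOINT>")

# Fold the adapter weights into the base weights and drop the PEFT wrappers,
# producing a standalone model that serves without peft at inference time.
merged = model.merge_and_unload()
merged.save_pretrained("grpo-qwen2.5-vl-3b-merged")
AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct").save_pretrained(
    "grpo-qwen2.5-vl-3b-merged"
)
```

The merged directory can then be loaded directly by vLLM or transformers like any regular checkpoint.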

Additional Resources

Code

grpo_vlm.py

# Adapted from: https://github.com/huggingface/trl/blob/e622196097109080b73584d598d4162e64fc6bea/examples/scripts/grpo_vlm.py
# Copyright 2020-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# /// script
# dependencies = [
#     "trl",
#     "Pillow",
#     "peft",
#     "math-verify",
#     "latex2sympy2_extended",
#     "torchvision",
#     "trackio",
#     "kernels",
# ]
# ///

"""
pip install math_verify

# For Qwen/Qwen2.5-VL-3B-Instruct
accelerate launch \
    --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/grpo_vlm.py \
    --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
    --output_dir grpo-Qwen2.5-VL-3B-Instruct \
    --learning_rate 1e-5 \
    --gradient_checkpointing \
    --dtype bfloat16 \
    --max_prompt_length 2048 \
    --max_completion_length 1024 \
    --use_vllm \
    --vllm_mode colocate \
    --use_peft \
    --lora_target_modules "q_proj" "v_proj" \
    --log_completions

# For HuggingFaceTB/SmolVLM2-2.2B-Instruct
pip install num2words==0.5.14

accelerate launch \
    --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/grpo_vlm.py \
    --model_name_or_path HuggingFaceTB/SmolVLM2-2.2B-Instruct \
    --output_dir grpo-SmolVLM2-2.2B-Instruct \
    --learning_rate 1e-5 \
    --dtype bfloat16 \
    --max_prompt_length 2048 \
    --max_completion_length 1024 \
    --use_peft \
    --lora_target_modules "q_proj" "v_proj" \
    --log_completions \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --num_generations 2

"""

import os

import torch
from datasets import load_dataset
from trl import (
    GRPOConfig,
    GRPOTrainer,
    ModelConfig,
    ScriptArguments,
    TrlParser,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)
from trl.rewards import accuracy_reward, think_format_reward

# Enable logging in a Hugging Face Space
os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")


if __name__ == "__main__":
    parser = TrlParser((ScriptArguments, GRPOConfig, ModelConfig))
    script_args, training_args, model_args = parser.parse_args_and_config()
    ################
    # Model
    ################
    dtype = (
        model_args.dtype
        if model_args.dtype in ["auto", None]
        else getattr(torch, model_args.dtype)
    )
    training_args.model_init_kwargs = dict(
        revision=model_args.model_revision,
        attn_implementation=model_args.attn_implementation,
        dtype=dtype,
    )
    quantization_config = get_quantization_config(model_args)
    if quantization_config is not None:
        # Passing None would not be treated the same as omitting the argument, so we include it only when valid.
        training_args.model_init_kwargs["device_map"] = get_kbit_device_map()
        training_args.model_init_kwargs["quantization_config"] = quantization_config

    ################
    # Dataset
    ################
    dataset = load_dataset("lmms-lab/multimodal-open-r1-8k-verified", split="train")
    dataset = dataset.train_test_split(test_size=100, seed=42)

    SYSTEM_PROMPT = (
        "A conversation between user and assistant. The user asks a question, and the assistant solves it. The "
        "assistant first thinks about the reasoning process in the mind and then provides the user with the answer. "
        "The reasoning process and answer are enclosed within <think></think> tags, i.e., <think>\nThis is my "
        "reasoning.\n</think>\nThis is my answer."
    )

    def make_conversation(example):
        prompt = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ]
        return {"prompt": prompt}

    dataset = dataset.map(make_conversation)

    # Filter out examples with large images
    def filter_big_images(example):
        image = example["image"]
        return image.size[0] < 512 and image.size[1] < 512

    dataset = dataset.filter(filter_big_images)

    def convert_to_rgb(example):
        image = example["image"]
        if image.mode != "RGB":
            image = image.convert("RGB")
        example["image"] = image
        return example

    dataset = dataset.map(convert_to_rgb)

    train_dataset = dataset["train"]
    eval_dataset = dataset["test"] if training_args.eval_strategy != "no" else None

    ################
    # Training
    ################
    trainer = GRPOTrainer(
        model=model_args.model_name_or_path,
        args=training_args,
        reward_funcs=[think_format_reward, accuracy_reward],
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=get_peft_config(model_args),
    )

    trainer.train()

    # Save and push to hub
    trainer.save_model(training_args.output_dir)
    if training_args.push_to_hub:
        trainer.push_to_hub(dataset_name=script_args.dataset_name)

requirements.txt

deepspeed==0.18.2
math_verify==0.8.0
peft==0.18.0
trl==0.25.1

deepspeed_zero3.yaml

---
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: no
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false