This experiment demonstrates how to perform Group Relative Policy Optimization (GRPO) training on a Vision-Language Model (VLM) using the TRL library. GRPO is a reinforcement learning technique that optimizes the model based on relative rewards within groups of generated responses. We will use Qwen2.5-VL-3B-Instruct as our base VLM and apply LoRA (Low-Rank Adaptation) for efficient fine-tuning. The training leverages vLLM for fast inference during the policy optimization phase and DeepSpeed ZeRO-3 for distributed training.
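To make "relative rewards within groups" concrete, here is a minimal sketch of how a group-relative advantage could be computed for the completions sampled from one prompt. It is an illustration only, not the TRL implementation: the function name, the `eps` value, the example rewards, and the use of the population standard deviation are all assumptions, and real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each completion relative to the other completions in its group.

    rewards: one scalar reward per completion sampled for the same prompt.
    eps: small constant (assumed value) that avoids division by zero when
         every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions, scored by a hypothetical reward function.
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Completions that score above the group mean get a positive advantage and are reinforced, while those below the mean get a negative one. When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.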
Prerequisites
Connect to GitHub (if needed)
If you haven’t already connected FlexAI to GitHub, you’ll need to set up a code registry connection so that repositories can be referenced with the -u flag in training commands.
Create Secrets
To access the Qwen2.5-VL-3B-Instruct model, you may need authentication with your HuggingFace account depending on the model’s access requirements. Use the flexai secret create command to store your HuggingFace Token as a secret. Replace <HF_AUTH_TOKEN_SECRET_NAME> with your desired name for the secret:
Training
To start the GRPO Training Job, run the following command:
Key Arguments Explained
- --model_name_or_path: The base VLM model to fine-tune (Qwen2.5-VL-3B-Instruct)
- --output_dir: Directory where checkpoints will be saved
- --learning_rate: Learning rate for optimization (1e-5)
- --gradient_checkpointing: Enables gradient checkpointing to reduce memory usage
- --dtype bfloat16: Uses bfloat16 precision for training
- --max_prompt_length: Maximum length of input prompts (2048 tokens)
- --max_completion_length: Maximum length of generated completions (1024 tokens)
- --use_vllm: Enables vLLM for fast inference during policy optimization
- --vllm_mode colocate: Runs vLLM on the same GPUs as training
- --use_peft: Enables LoRA for parameter-efficient fine-tuning
- --lora_target_modules: Specifies which modules to apply LoRA (q_proj, v_proj)
- --log_completions: Logs generated completions during training
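The arguments described above can be gathered into one argument list, which makes it easier to see how they fit together. The sketch below only assembles and prints the script arguments. It is not the full launch command. The values come from the descriptions above, except for `OUTPUT_DIR`, which is a hypothetical placeholder, and the `Qwen/` organization prefix on the model name, which is assumed to be the Hugging Face Hub identifier.

```python
import shlex

# Hypothetical placeholder: choose the checkpoint directory for your own job.
OUTPUT_DIR = "/path/to/checkpoints"

script_args = [
    "--model_name_or_path", "Qwen/Qwen2.5-VL-3B-Instruct",  # assumed Hub ID
    "--output_dir", OUTPUT_DIR,
    "--learning_rate", "1e-5",
    "--gradient_checkpointing",
    "--dtype", "bfloat16",
    "--max_prompt_length", "2048",
    "--max_completion_length", "1024",
    "--use_vllm",
    "--vllm_mode", "colocate",
    "--use_peft",
    "--lora_target_modules", "q_proj", "v_proj",
    "--log_completions",
]

# shlex.join quotes each item so the result is safe to paste into a shell.
print(shlex.join(script_args))
```

Keeping the arguments in a list, rather than one long string, makes it simpler to toggle a single flag (for example dropping `--use_vllm` for a quick comparison) without rewriting the rest.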
Configuration Files
The experiment uses a DeepSpeed ZeRO-3 configuration file (code/vlm-grpo/deepspeed_zero3.yaml) that specifies:
- ZeRO Stage 3 for memory-efficient distributed training
- 2 processes (GPUs) for multi-GPU training
- Mixed precision training with bfloat16
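As a rough illustration of why ZeRO Stage 3 helps with memory, the sketch below estimates the per-GPU footprint of the model parameters alone when they are sharded across processes. It rests on stated assumptions: roughly 3 billion parameters (inferred from the model name, not measured), 2 bytes per parameter for bfloat16, and an even split across the 2 processes. It ignores activations, gradients, optimizer state, the colocated vLLM engine, and the temporary gathering of parameters during forward and backward passes, so treat it as a lower bound.

```python
def sharded_param_gib(num_params, bytes_per_param=2, num_processes=2):
    """Approximate per-GPU memory (GiB) for parameters alone under ZeRO-3.

    ZeRO Stage 3 partitions the parameters across processes, so each process
    holds about 1/num_processes of them at rest.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_processes / 2**30


# Assumed parameter count, taken loosely from the "3B" in the model name.
ASSUMED_NUM_PARAMS = 3_000_000_000

per_gpu = sharded_param_gib(ASSUMED_NUM_PARAMS)
unsharded = sharded_param_gib(ASSUMED_NUM_PARAMS, num_processes=1)

print(f"unsharded: {unsharded:.2f} GiB, sharded over 2 GPUs: {per_gpu:.2f} GiB each")
```

Under these assumptions, sharding halves the resident parameter memory on each GPU, which leaves more room for the vLLM engine running on the same devices.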
Monitoring the Training Job
You can check the status and life cycle events of your Training Job by running:
Fetching the Trained Model
Once the Training Job completes successfully, you can list all the produced checkpoints:
Additional Resources
Code
grpo_vlm.py
requirements.txt
deepspeed_zero3.yaml