LlamaFactory is a unified framework for fine-tuning 100+ open-source LLMs from a single YAML config — SFT, PPO, DPO, and more. This blueprint shows how to run it end-to-end on FlexAI: configure a training run, launch with managed checkpoints, and deploy the resulting model as a production inference endpoint. The workflow is model- and dataset-agnostic. Throughout this guide we useDocumentation Index
Fetch the complete documentation index at: https://docs.flex.ai/llms.txt
Use this file to discover all available pages before exploring further.
Qwen/Qwen2.5-7B + the openhermes-fr dataset as a worked example — swap them for any LlamaFactory-supported model and any HuggingFace dataset by editing the YAML config.
Note: If you haven’t already connected FlexAI to GitHub, run flexai code-registry connect to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.
Verify Dataset Configuration
Register your dataset in LlamaFactory’s dataset registry. The example below uses For your own use case, replace the entry with your dataset and update the
openhermes-fr; replace it with any HuggingFace dataset by editing the hf_hub_url and column mapping.Navigate to experiments/code/llama-factory/data/dataset_info.json and verify the dataset entry exists:columns mapping to match your dataset’s schema.Configure Training Parameters
The
qwen25-7B_sft.yaml file contains the training configuration. Swap the model_name_or_path and dataset values for your own. Key settings shown in the example:- Model:
Qwen/Qwen2.5-7B— example base model (replacemodel_name_or_pathwith any HuggingFace model ID) - Stage:
sft(Supervised Fine-Tuning) — switch todpo,ppo, orrewardfor other alignment methods LlamaFactory supports - Dataset:
openhermes-fr— example dataset (replace with your own entry indataset_info.json) - Training: Full fine-tuning with DeepSpeed ZeRO Stage 3 for memory-efficient distributed training
Create Secrets
To access gated models and datasets on HuggingFace, you need a HuggingFace token. Use theflexai secret create command to store your HuggingFace Token as a secret. Replace <HF_AUTH_TOKEN_SECRET_NAME> with your desired name for the secret:
[Optional] Pre-fetch the Model
To speed up training and avoid downloading large models at runtime, you can pre-fetch your HuggingFace model to FlexAI storage. For example, to pre-fetch theQwen/Qwen2.5-7B model:
-
Create a HuggingFace storage provider:
-
Push the model checkpoint to your storage:
Training
For a 7B model, we recommend using 1 node (8 × H100 GPUs) to ensure reasonable training time and avoid out-of-memory issues.Standard Training (without prefetch)
Training with Model Prefetch
To take advantage of model pre-fetching performed in the Optional: Pre-fetch the Model section, use:Monitoring Training Progress
You can check the status and lifecycle events of your Training Job by running:Training Observability with TensorBoard
For advanced monitoring and visualization of training metrics, you can leverage TensorBoard integration. FlexAI supports TensorBoard logging for detailed insights into training progress, loss curves, and model performance. To enable TensorBoard logging, update your YAML configuration:Getting Training Checkpoints
Once the Training Job completes successfully, you will be able to list all the produced checkpoints:INFERENCE READY = true - these are ready for serving.
Serving the Trained Model
Deploy your trained model directly from the checkpoint using FlexAI inference. Replace<CHECKPOINT_ID> with the ID from an inference-ready checkpoint:
Note: GPU specification for inference endpoints is currently managed automatically by FlexAI. Future versions will allow explicit GPU count specification for inference workloads to optimize cost and performance based on your specific requirements.You can monitor your inference endpoint status:
Testing Your Fine-Tuned Model
Once the endpoint is running, test it with prompts representative of your domain. A well-fine-tuned model should show measurable improvements over the base model on your target tasks — better in-domain vocabulary and terminology, more accurate factual recall, and more appropriate tone or structure.Example API Call
Expected Results
After fine-tuning on domain-specific data, your model should achieve:- Domain Expertise: Specialized knowledge and terminology understanding for your target domain
- Task-Specific Performance: Enhanced capabilities for domain-relevant tasks and workflows
- Maintained General Capabilities: Preserved reasoning, problem-solving, and general language skills
Technical Details
Training Configuration Breakdown
- DeepSpeed ZeRO Stage 3: Enables training of 7B model on 1 node efficiently
- Mixed Precision (bf16): Accelerates training while maintaining numerical stability
- Gradient Accumulation: Effective batch size of 4 (2 steps × 2 per device)
- Learning Rate Schedule: Cosine decay with 10% warmup for stable convergence
- Context Length: 2048 tokens, optimized for conversation tasks
Resource Requirements
Recommended Configuration for Qwen2.5-7B:- Nodes: 1 node (cost-effective for 7B models)
- Accelerators: 8 × H100 GPUs per node
- Memory: ~200GB+ GPU memory total
- Training Time: ~2-4 hours for 3 epochs
- Storage: ~30GB for checkpoints
FORCE_TORCHRUN=1: Ensures proper distributed training setup
Scaling Options
- For faster training: Increase to 2 nodes (16 × H100)
- For larger datasets: Adjust
max_samplesparameter - For longer context: Increase
cutoff_len(requires more memory) - For memory efficiency: Switch to
finetuning_type: lorafor QLoRA training
Troubleshooting
Common Issues: Training Job Fails to Start:- Reduce
per_device_train_batch_sizefrom 1 to lower value - Increase
gradient_accumulation_stepsto maintain effective batch size - Consider using
finetuning_type: lorafor memory efficiency
- Wait for training to complete fully (check with
flexai training inspect) - Ensure
save_only_model: falsein YAML configuration - Verify training completed successfully without errors
- Verify checkpoint shows
INFERENCE READY = truestatus - Check FlexAI cluster availability with
flexai inference list - Review detailed logs with
flexai inference logs <endpoint-name>
Code
code/llama-factory/qwen25-7B_sft.yaml
code/llama-factory/ds_z3_config.json
code/llama-factory/data/dataset_info.json
code/llama-factory/requirements.txt
🚀 Run this on FlexAI
Managed checkpoints mean you never lose a run to preemption. Jobs launch in under 60 seconds — no infra setup, built-in observability.
Get started →Talk to us