Qwen2.5-7B model and the openhermes-fr dataset.
You will see that this process requires configuring Axolotl’s training parameters, leveraging FlexAI’s managed training infrastructure, and deploying the fine-tuned model as a scalable inference endpoint.
If you haven’t already connected FlexAI to GitHub, run flexai code-registry connect to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.

Verify Dataset Configuration
First, ensure your domain-specific dataset is properly configured in your Axolotl YAML file. For our French language example, we’ll use the openhermes-fr dataset; for your own use case, replace this with your domain-specific dataset. The openhermes-fr dataset is specifically designed for French language tasks and serves as an excellent example of domain specialization.

Navigate to code/axolotl/qwen2/fft-7b-french.yaml and verify the dataset configuration.

Configure Training Parameters
The qwen2/fft-7b-french.yaml file contains the training configuration for domain-specific fine-tuning. Key settings include:

- Model: Qwen/Qwen2.5-7B - an excellent multilingual base model suitable for domain adaptation
- Stage: Full fine-tuning - well suited to task-specific and domain-specific adaptation
- Dataset: openhermes-fr - an example domain-specific dataset (replace with your own)
- Training: Full fine-tuning with FSDP (Fully Sharded Data Parallel) for optimal performance
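For reference, the key settings above correspond to an Axolotl configuration along these lines. This is a sketch using standard Axolotl key names; treat the actual qwen2/fft-7b-french.yaml in the repository as authoritative:

```yaml
base_model: Qwen/Qwen2.5-7B       # multilingual base model

datasets:
  - path: legmlai/openhermes-fr   # replace with your domain-specific dataset
    type: sharegpt                # must match your dataset's format (sharegpt, alpaca, etc.)

# Full fine-tuning: no `adapter` key is set (LoRA/QLoRA runs would set one).
# FSDP shards model states across the node's GPUs:
fsdp:
  - full_shard
  - auto_wrap
```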
Create Secrets
To access the Qwen2.5-7B model and OpenHermes-FR dataset, you need a HuggingFace token. Use the flexai secret create command to store your HuggingFace token as a secret. Replace <HF_AUTH_TOKEN_SECRET_NAME> with your desired name for the secret.
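A minimal invocation looks like the following sketch; depending on your FlexAI CLI version, the token value may be supplied interactively or via a flag, so check the command’s help output:

```bash
flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>
# Paste your HuggingFace token (from https://huggingface.co/settings/tokens) when prompted
```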
[Optional] Pre-fetch the Model
To speed up training and avoid downloading large models at runtime, you can pre-fetch your HuggingFace model to FlexAI storage. For example, to pre-fetch the Qwen/Qwen2.5-7B model:
- Create a HuggingFace storage provider.
- Push the model checkpoint to your storage.
Training
For a 7B model, we recommend using 1 node (4 × H100 GPUs) to ensure reasonable training time and avoid out-of-memory issues.

Standard Training (without prefetch)
Training with Model Prefetch
To take advantage of the model pre-fetching performed in the [Optional] Pre-fetch the Model section, adjust the training command to reference the pre-fetched model in FlexAI storage.

Monitoring Training Progress
You can check the status and lifecycle events of your Training Job by running flexai training inspect.

Training Observability with Weights & Biases
For advanced monitoring and visualization of training metrics, Axolotl supports Weights & Biases (wandb) integration. You can leverage wandb logging for detailed insights into training progress, loss curves, and model performance. To enable wandb logging, add the wandb settings (for example, wandb_project and wandb_name) to your YAML configuration.

Getting Training Checkpoints
Once the Training Job completes successfully, you will be able to list all the produced checkpoints. Checkpoints marked INFERENCE READY = true are ready for serving.
Serving the Trained Model
Deploy your trained model directly from the checkpoint using FlexAI inference. Replace <CHECKPOINT_ID> with the ID of an inference-ready checkpoint.
GPU specification for inference endpoints is currently managed automatically by FlexAI. Future versions will allow explicit GPU count specification for inference workloads to optimize cost and performance based on your specific requirements.
Testing Your Domain-Specific Model
Once the endpoint is running, you can test it with domain-specific prompts. For our French language example, the model should demonstrate strong French language understanding, proper grammar and syntax, and cultural context awareness.

Before and After Training Comparison
To illustrate the improvement from fine-tuning on French data, here’s a comparison using the question “Qui a gagné la Coupe du monde 2018 ?” (“Who won the 2018 World Cup?”). Compare the base model’s response (Qwen/Qwen2.5-7B before training) with the fine-tuned model’s output.

Example API Call
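As an illustrative sketch of an API call, assuming the FlexAI endpoint exposes an OpenAI-compatible chat completions route (the endpoint URL, API key handling, and model field below are placeholders, not confirmed FlexAI specifics):

```bash
curl https://<YOUR_ENDPOINT_URL>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -d '{
    "model": "<CHECKPOINT_ID>",
    "messages": [
      {"role": "user", "content": "Qui a gagné la Coupe du monde 2018 ?"}
    ]
  }'
```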
Expected Results
After fine-tuning on domain-specific data, your model should achieve:

- Domain Expertise: Specialized knowledge and terminology understanding for your target domain
- Task-Specific Performance: Enhanced capabilities for domain-relevant tasks and workflows
- Maintained General Capabilities: Preserved reasoning, problem-solving, and general language skills
- Strong French Language Understanding: Natural conversation flow, proper grammar, cultural context
- High Performance on French Tasks: Question answering, text summarization, creative writing
Technical Details
Training Configuration Breakdown:
- Full Fine-tuning with FSDP: Enables training of 7B model on 1 node efficiently
- Mixed Precision (bf16): Accelerates training while maintaining numerical stability
- Gradient Accumulation: Effective batch size of 2 (2 steps × 1 per device)
- Learning Rate Schedule: Cosine decay with 10% warmup for stable convergence
- Context Length: 2048 tokens, optimized for conversation tasks
- Sample Packing: Efficient batch utilization for variable-length sequences
- Flash Attention: Optimized attention mechanism for faster training
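Assuming standard Axolotl key names, the breakdown above maps onto configuration entries roughly like the following (a sketch, not a verbatim copy of fft-7b-french.yaml):

```yaml
bf16: true                      # mixed precision for speed with numerical stability
flash_attention: true           # optimized attention kernel
sequence_len: 2048              # context length for conversation tasks
sample_packing: true            # pack variable-length sequences into full batches
micro_batch_size: 1             # per-device batch size
gradient_accumulation_steps: 2  # effective per-device batch size of 2
lr_scheduler: cosine
warmup_ratio: 0.1               # 10% warmup (some configs use warmup_steps instead)
```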
Resource Requirements
Recommended Configuration for Qwen2.5-7B:

- Nodes: 1 node (cost-effective for 7B models)
- Accelerators: 4 × H100 GPUs per node
- Memory: ~200GB+ GPU memory total
- Training Time: ~2-4 hours for 3 epochs
- Storage: ~30GB for checkpoints
Note: Set FORCE_TORCHRUN=1 in the training command to ensure a proper distributed training setup.
Scaling Options:

- For faster training: Increase to 2 nodes (8 × H100, at 4 GPUs per node)
- For larger datasets: Adjust the num_epochs parameter
- For longer context: Increase sequence_len (requires more memory)
- For memory efficiency: Switch to QLoRA with load_in_4bit: true and adapter: qlora
- For other models: Use the configs in the code/axolotl/ directory (Llama, Mistral, Gemma, Phi, etc.)
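The QLoRA switch mentioned above amounts to a small configuration change. A sketch of the relevant Axolotl keys (the LoRA hyperparameter values here are illustrative defaults, not taken from this guide):

```yaml
load_in_4bit: true      # 4-bit quantized base model weights
adapter: qlora          # train a low-rank adapter instead of full weights
lora_r: 32              # illustrative values -- tune for your task
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
```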
Additional Training Examples
Axolotl provides extensive configuration examples for various models and training strategies:

Llama 3.1 8B with LoRA

Mistral 7B with QLoRA

See the code/axolotl/ directory for more examples including Gemma, Phi, Qwen2, multimodal models, and advanced configurations.
Troubleshooting
Training Job Fails to Start:

- Reduce micro_batch_size (it cannot go below 1)
- Increase gradient_accumulation_steps to maintain the effective batch size
- Consider switching to QLoRA: set load_in_4bit: true and adapter: qlora for memory efficiency
- Enable fsdp_offload_params: true for additional memory savings

Checkpoints Not Appearing:

- Wait for training to complete fully (check with flexai training inspect)
- Ensure the Axolotl configuration saves the model in a compatible format
- Verify training completed successfully without errors

Inference Endpoint Issues:

- Verify the checkpoint shows INFERENCE READY = true status
- Check FlexAI cluster availability with flexai inference list
- Review detailed logs with flexai inference logs <endpoint-name>

Dataset Loading Issues:

- Verify the dataset path is correct in the YAML configuration (e.g., legmlai/openhermes-fr)
- Ensure your HuggingFace token has access to private datasets
- Check that the dataset format matches the specified type (sharegpt, alpaca, etc.)