
# Fine-Tune and Deploy an LLM on FlexAI with LlamaFactory

> Fine-tune and deploy any LLM with LlamaFactory on FlexAI. Managed checkpoints, train and serve in one platform, no lost runs from spot preemption.

[LlamaFactory](https://github.com/hiyouga/LLaMA-Factory) is a unified framework for fine-tuning 100+ open-source LLMs from a single YAML config — SFT, PPO, DPO, and more. This blueprint shows how to run it end-to-end on FlexAI: configure a training run, launch with managed checkpoints, and deploy the resulting model as a production inference endpoint.

The workflow is model- and dataset-agnostic. Throughout this guide we use `Qwen/Qwen2.5-7B` + the [`openhermes-fr`](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset as a worked example — swap them for any LlamaFactory-supported model and any HuggingFace dataset by editing the YAML config.

> **Note**: If you haven't already connected FlexAI to GitHub, run `flexai code-registry connect` to set up a code registry connection. This allows FlexAI to pull repositories directly using the repository URL in training commands.

<Steps>
  <Step title="Verify Dataset Configuration">
    Register your dataset in LlamaFactory's dataset registry. The example below uses `openhermes-fr`; replace it with any HuggingFace dataset by editing the `hf_hub_url` and column mapping.

    Open `code/llama-factory/data/dataset_info.json` in the repository and verify that the dataset entry exists:

    ```json theme={null}
    {
        "openhermes-fr": {
            "hf_hub_url": "legmlai/openhermes-fr",
            "columns": {
                "prompt": "prompt",
                "response": "accepted_completion"
            }
        }
    }
    ```

    For your own use case, replace the entry with your dataset and update the `columns` mapping to match your dataset's schema.
  </Step>

  <Step title="Configure Training Parameters">
    The `qwen25-7B_sft.yaml` file contains the training configuration. Swap the `model_name_or_path` and `dataset` values for your own. Key settings shown in the example:

    * **Model**: `Qwen/Qwen2.5-7B` — example base model (replace `model_name_or_path` with any HuggingFace model ID)
    * **Stage**: `sft` (Supervised Fine-Tuning) — switch to `dpo`, `ppo`, or `reward` for other alignment methods LlamaFactory supports
    * **Dataset**: `openhermes-fr` — example dataset (replace with your own entry in `dataset_info.json`)
    * **Training**: Full fine-tuning with DeepSpeed ZeRO Stage 3 for memory-efficient distributed training

    ```yaml theme={null}
    ---
    ### model
    model_name_or_path: Qwen/Qwen2.5-7B
    trust_remote_code: true

    ### method
    stage: sft
    do_train: true
    finetuning_type: full
    deepspeed: code/llama-factory/ds_z3_config.json

    ### dataset
    dataset: openhermes-fr
    dataset_dir: code/llama-factory/data
    template: qwen
    cutoff_len: 2048
    max_samples: 100000
    overwrite_cache: true
    preprocessing_num_workers: 16
    dataloader_num_workers: 4

    ### output
    output_dir: /output-checkpoint
    logging_steps: 10
    save_steps: 1000
    plot_loss: true
    overwrite_output_dir: true
    save_only_model: false
    report_to: none

    ### train
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 2
    learning_rate: 1.0e-5
    num_train_epochs: 3.0
    lr_scheduler_type: cosine
    warmup_ratio: 0.1
    bf16: true
    ddp_timeout: 180000000
    resume_from_checkpoint:
    ```
  </Step>
</Steps>
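
Before launching a run with your own dataset, it can help to confirm that the `columns` mapping points at fields that really exist in your data. The following is a minimal sketch, not part of LlamaFactory or FlexAI: the `missing_columns` helper and the sample rows are hypothetical and only illustrate the check.

```python
import json

# Entry copied from dataset_info.json in this guide.
REGISTRY = json.loads("""
{
    "openhermes-fr": {
        "hf_hub_url": "legmlai/openhermes-fr",
        "columns": {
            "prompt": "prompt",
            "response": "accepted_completion"
        }
    }
}
""")


def missing_columns(entry: dict, sample_row: dict) -> list[str]:
    """Return mapped column names that are absent from one dataset row."""
    mapped = entry.get("columns", {}).values()
    return sorted(name for name in mapped if name not in sample_row)


# Hypothetical rows: one matching the mapping, one with a different schema.
good_row = {"prompt": "Bonjour", "accepted_completion": "Salut"}
other_row = {"question": "Bonjour", "answer": "Salut"}

print(missing_columns(REGISTRY["openhermes-fr"], good_row))
print(missing_columns(REGISTRY["openhermes-fr"], other_row))
```

An empty list means every mapped column was found in the row. Any names returned are the ones to fix in the `columns` mapping.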

## Create Secrets

To access gated models and datasets on HuggingFace, you need a HuggingFace token.

Use the [`flexai secret create` command](https://docs.flex.ai/cli/commands/secret/) to store your *HuggingFace Token* as a secret. Replace `<HF_AUTH_TOKEN_SECRET_NAME>` with your desired name for the secret:

```bash theme={null}
flexai secret create <HF_AUTH_TOKEN_SECRET_NAME>
```

When prompted, paste the value of your *HuggingFace Token*.

## \[Optional] Pre-fetch the Model

To speed up training and avoid downloading large models at runtime, you can pre-fetch your HuggingFace model to FlexAI storage. For example, to pre-fetch the `Qwen/Qwen2.5-7B` model:

1. **Create a HuggingFace storage provider:**

   ```bash theme={null}
   flexai storage create HF-STORAGE --provider huggingface --hf-token-name <HF_AUTH_TOKEN_SECRET_NAME>
   ```

2. **Push the model checkpoint to your storage:**

   ```bash theme={null}
   flexai checkpoint push qwen25-7b --storage-provider HF-STORAGE --source-path Qwen/Qwen2.5-7B
   ```

This pre-fetched checkpoint can then be used in your training command to reduce startup time.

## Training

For a 7B model, we recommend using **1 node (8 × H100 GPUs)** to ensure reasonable training time and avoid out-of-memory issues.

### Standard Training (without prefetch)

```bash theme={null}
flexai training run domain-specific-qwen25-7b-sft \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --env FORCE_TORCHRUN=1 \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/llama-factory/requirements.txt \
  -- /layers/flexai_pip-install/packages/bin/llamafactory-cli train code/llama-factory/qwen25-7B_sft.yaml
```

### Training with Model Prefetch

To take advantage of model pre-fetching performed in the [Optional: Pre-fetch the Model](#optional-pre-fetch-the-model) section, use:

```bash theme={null}
flexai training run domain-specific-qwen25-7b-sft-prefetched \
  --accels 8 --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --checkpoint qwen25-7b \
  --env FORCE_TORCHRUN=1 \
  --secret HF_TOKEN=<HF_AUTH_TOKEN_SECRET_NAME> \
  --requirements-path code/llama-factory/requirements.txt \
  -- /layers/flexai_pip-install/packages/bin/llamafactory-cli train code/llama-factory/qwen25-7B_sft.yaml
```

## Monitoring Training Progress

You can check the status and lifecycle events of your Training Job by running:

```bash theme={null}
flexai training inspect domain-specific-qwen25-7b-sft
```

Additionally, you can view the logs of your Training Job by running:

```bash theme={null}
flexai training logs domain-specific-qwen25-7b-sft
```

### Training Observability with TensorBoard

For advanced monitoring and visualization of training metrics, you can leverage TensorBoard integration. FlexAI supports TensorBoard logging for detailed insights into training progress, loss curves, and model performance.

To enable TensorBoard logging, update your YAML configuration:

```yaml theme={null}
report_to: tensorboard
```

Once enabled, you can access training metrics and visualizations through the FlexAI console. For more details on observability features, see the [FlexAI TensorBoard documentation](https://docs.flex.ai/platform/tensorboard/).

## Getting Training Checkpoints

Once the Training Job completes successfully, you will be able to list all the produced checkpoints:

```bash theme={null}
flexai training checkpoints domain-specific-qwen25-7b-sft
```

Look for checkpoints marked as `INFERENCE READY = true`; these are ready for serving.

## Serving the Trained Model

Deploy your trained model directly from the checkpoint using FlexAI inference. Replace `<CHECKPOINT_ID>` with the ID from an inference-ready checkpoint:

```bash theme={null}
flexai -v inference serve domain-specific-endpoint --checkpoint <CHECKPOINT_ID>
```

> **Note**: GPU specification for inference endpoints is currently managed automatically by FlexAI. Future versions will allow explicit GPU count specification for inference workloads to optimize cost and performance based on your specific requirements.

You can monitor your inference endpoint status:

```bash theme={null}
# List all inference endpoints
flexai inference list

# Get detailed endpoint information
flexai inference inspect domain-specific-endpoint

# Check endpoint logs
flexai inference logs domain-specific-endpoint
```

## Testing Your Fine-Tuned Model

Once the endpoint is running, test it with prompts representative of your domain. A well-fine-tuned model should show measurable improvements over the base model on your target tasks — better in-domain vocabulary and terminology, more accurate factual recall, and more appropriate tone or structure.

### Example API Call

```bash theme={null}
curl -X POST "https://your-endpoint-url/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "prompt": "Your evaluation prompt here",
    "max_tokens": 200,
    "temperature": 0.7
  }'
```

Compare base-model and fine-tuned responses side-by-side on a held-out prompt set to quantify improvement.
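
The same request can be issued from Python when you want to loop over a held-out prompt set. This is a sketch using only the standard library. The endpoint URL and API key are placeholders, and the response parsing assumes an OpenAI-style completions payload (`choices[0].text`), which this guide does not confirm.

```python
import json
import urllib.request


def build_completion_request(base_url: str, api_key: str, prompt: str,
                             max_tokens: int = 200,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Build the same POST request as the curl example above."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the generated text.

    Assumption: the endpoint answers with {"choices": [{"text": ...}]}.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Placeholder values; nothing is sent until complete() is called.
request = build_completion_request(
    "https://your-endpoint-url", "YOUR_API_KEY", "Your evaluation prompt here"
)
```

Calling `complete(request)` once against the fine-tuned endpoint and once against a base-model endpoint, for each held-out prompt, gives the paired outputs for the side-by-side comparison.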

## Expected Results

After fine-tuning on domain-specific data, your model should achieve:

* **Domain Expertise**: Specialized knowledge and terminology understanding for your target domain
* **Task-Specific Performance**: Enhanced capabilities for domain-relevant tasks and workflows
* **Maintained General Capabilities**: Preserved reasoning, problem-solving, and general language skills

## Technical Details

### Training Configuration Breakdown

* **DeepSpeed ZeRO Stage 3**: Shards parameters, gradients, and optimizer states across GPUs so that a 7B model trains efficiently on 1 node
* **Mixed Precision (bf16)**: Accelerates training while maintaining numerical stability
* **Gradient Accumulation**: Effective batch size of 16 (1 per device × 2 accumulation steps × 8 GPUs)
* **Learning Rate Schedule**: Cosine decay with 10% warmup for stable convergence
* **Context Length**: 2048 tokens, optimized for conversation tasks
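
The batch-size arithmetic can be checked with a short calculation. This sketch uses the values from the YAML file and the 8-GPU node recommended above. It assumes the dataset supplies at least `max_samples` (100,000) examples and that none are dropped during preprocessing, so treat the step counts as approximations. The `training_schedule` helper is illustrative only.

```python
import math


def training_schedule(num_samples: int, per_device_batch: int, grad_accum: int,
                      num_gpus: int, epochs: int, save_steps: int) -> dict:
    """Approximate optimizer-step counts for a data-parallel run."""
    # One optimizer step consumes this many samples across all GPUs.
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Checkpoints written every `save_steps` optimizer steps.
        "periodic_checkpoints": total_steps // save_steps,
    }


schedule = training_schedule(
    num_samples=100_000,   # max_samples
    per_device_batch=1,    # per_device_train_batch_size
    grad_accum=2,          # gradient_accumulation_steps
    num_gpus=8,            # --accels 8 --nodes 1
    epochs=3,              # num_train_epochs
    save_steps=1000,       # save_steps
)
print(schedule)
# {'effective_batch': 16, 'steps_per_epoch': 6250, 'total_steps': 18750, 'periodic_checkpoints': 18}
```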

### Resource Requirements

**Recommended Configuration for Qwen2.5-7B:**

* **Nodes**: 1 node (cost-effective for 7B models)
* **Accelerators**: 8 × H100 GPUs per node
* **Memory**: \~200GB+ GPU memory total
* **Training Time**: \~2-4 hours for 3 epochs
* **Storage**: \~30GB for checkpoints
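
A common rule of thumb helps explain the memory figure. It assumes full fine-tuning with an Adam-style optimizer in mixed precision, where model states take roughly 16 bytes per parameter: 2 for the 16-bit weights, 2 for the 16-bit gradients, and 12 for the fp32 master weights, momentum, and variance. The sketch below applies that rule with an approximate parameter count of 7.6 billion for Qwen2.5-7B. It excludes activations and framework overhead, so actual usage is higher.

```python
def model_state_gb(num_params: int, num_gpus: int,
                   bytes_per_param: int = 16) -> tuple[float, float]:
    """Estimate model-state memory in GB: (total, per GPU under ZeRO-3 sharding)."""
    total_gb = num_params * bytes_per_param / 1e9
    # ZeRO Stage 3 partitions weights, gradients, and optimizer states evenly.
    return total_gb, total_gb / num_gpus


# Assumption: about 7.6e9 parameters; 8 GPUs as recommended above.
total_gb, per_gpu_gb = model_state_gb(num_params=7_600_000_000, num_gpus=8)
print(f"total: {total_gb:.1f} GB, per GPU: {per_gpu_gb:.1f} GB")
# total: 121.6 GB, per GPU: 15.2 GB
```

Activations for 2048-token sequences and communication buffers come on top of this estimate.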

**Command Line Parameters Explained:**

* `FORCE_TORCHRUN=1`: Makes `llamafactory-cli` launch training through `torchrun`, which the multi-GPU DeepSpeed setup in this guide relies on

### Scaling Options

* For faster training: Increase to 2 nodes (16 × H100)
* For larger datasets: Adjust `max_samples` parameter
* For longer context: Increase `cutoff_len` (requires more memory)
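* For memory efficiency: Switch to `finetuning_type: lora` for LoRA training (QLoRA additionally requires quantizing the base model, for example with LlamaFactory's `quantization_bit` option)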

## Troubleshooting

**Common Issues:**

**Training Job Fails to Start:**

```bash theme={null}
# Check FlexAI authentication
flexai auth status

# Verify repository access
git clone https://github.com/flexaihq/blueprints
```

**Out of Memory Errors:**

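* `per_device_train_batch_size` is already 1 in the example and cannot go lower; reduce `cutoff_len` to shrink activation memory instead
* If you raised the batch size for your own run, lower it and increase `gradient_accumulation_steps` to keep the effective batch size constant
* Consider using `finetuning_type: lora` for memory efficiency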

**Checkpoint Not Inference Ready:**

* Wait for training to complete fully (check with `flexai training inspect`)
* Ensure `save_only_model: false` in YAML configuration
* Verify training completed successfully without errors

**Endpoint Deployment Issues:**

* Verify checkpoint shows `INFERENCE READY = true` status
* Check FlexAI cluster availability with `flexai inference list`
* Review detailed logs with `flexai inference logs <endpoint-name>`

## Code

### `code/llama-factory/qwen25-7B_sft.yaml`

```yaml theme={null}
---
### model
model_name_or_path: Qwen/Qwen2.5-7B
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: freeze
deepspeed: code/llama-factory/ds_z3_config.json

### dataset
dataset: openhermes-fr
dataset_dir: code/llama-factory/data
template: qwen
cutoff_len: 2048
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: /output-checkpoint
logging_steps: 10
save_steps: 1000
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint:

### eval
# eval_dataset: alpaca_en_demo
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
```

### `code/llama-factory/ds_z3_config.json`

```json theme={null}
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
```

### `code/llama-factory/data/dataset_info.json`

```json theme={null}
{
    "identity": {
        "file_name": "identity.json"
    },
    "alpaca_en_demo": {
        "file_name": "alpaca_en_demo.json"
    },
    "wikitext": {
        "file_name": "/input/hf-wikitext/wikitext-2-raw-v1"
    },
    "openhermes-fr": {
        "hf_hub_url": "legmlai/openhermes-fr",
        "columns": {
            "prompt": "prompt",
            "response": "accepted_completion"
        }
    },
    "openhermes-fr-prefetched": {
        "hf_hub_url": "/input/openhermes-fr",
        "columns": {
            "prompt": "prompt",
            "response": "accepted_completion"
        }
    }
}
```

### `code/llama-factory/requirements.txt`

```text theme={null}
llamafactory @ git+https://github.com/hiyouga/LLaMA-Factory.git@v0.9.3
deepspeed>=0.10.0,<=0.16.9
transformers==4.51.3
```

