The goal of this experiment is to fine-tune the parler_tts_mini_v0.1 model to create a French version. The model generates high-quality speech from input text, which can be controlled with a description prompt (e.g., gender, speaking rate). The training uses a French text-to-speech dataset, enabling the model to produce natural and expressive speech in that language.

Connect to GitHub (if needed)

If you haven’t already connected FlexAI to GitHub, you’ll need to set up a code registry connection:
flexai code-registry connect
This allows FlexAI to pull repositories directly from GitHub when they are referenced in training commands (the -u / --repository-url flag).

Getting the Dataset

You can download the pre-processed version of the dataset by running the following command:
DATASET_NAME=text-to-speech-fr && \
  curl -L -o ${DATASET_NAME}.zip "https://bucket-docs-samples-99b3a05.s3.eu-west-1.amazonaws.com/${DATASET_NAME}.zip" && \
  unzip ${DATASET_NAME}.zip && \
  rm ${DATASET_NAME}.zip
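If you prefer to script this step in Python (for example inside a notebook), the same download-extract-clean sequence can be sketched as below. The helper names are illustrative, not part of the repository:

```python
# Python equivalent of the one-line shell command above:
# download the archive, unpack it, then remove the zip file.
import os
import urllib.request
import zipfile

DATASET_NAME = "text-to-speech-fr"
BASE_URL = "https://bucket-docs-samples-99b3a05.s3.eu-west-1.amazonaws.com"


def download(url: str, dest: str) -> None:
    """Save the remote archive to a local path."""
    urllib.request.urlretrieve(url, dest)


def extract_and_clean(zip_path: str, target_dir: str = ".") -> None:
    """Unpack the archive, then delete it (mirrors `unzip && rm`)."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
    os.remove(zip_path)


if __name__ == "__main__":
    archive = f"{DATASET_NAME}.zip"
    download(f"{BASE_URL}/{DATASET_NAME}.zip", archive)
    extract_and_clean(archive)
```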
If you’d like to reproduce the pre-processing steps yourself to use a different dataset or simply to learn more about the process, you can refer to the Manual Dataset Pre-processing section below.
Next, push the contents of the text-to-speech-fr/ directory as a new FlexAI dataset:
flexai dataset push text-to-speech-fr --file text-to-speech-fr

Training

To start the Training Job, run the following command:
flexai training run text-to-speech-ddp \
  --repository-url https://github.com/flexaihq/blueprints \
  --dataset text-to-speech-fr \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --requirements-path code/text-to-speech/requirements.txt \
  --nodes 1 --accels 8 \
  -- code/text-to-speech/run_parler_tts_training.py ./code/text-to-speech/french_training.json
Instead of passing a .json file as input, you can also set the arguments manually. For example:
flexai training run text-to-speech-ddp \
  --repository-url https://github.com/flexaihq/blueprints \
  --dataset text-to-speech-fr \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --requirements-path code/text-to-speech/requirements.txt \
  --nodes 1 --accels 8 \
  -- code/text-to-speech/run_parler_tts_training.py \
    --model_name_or_path=parler-tts/parler_tts_mini_v0.1 \
    --save_to_disk=/input/text-to-speech-fr \
    --temporary_save_to_disk=./audio_code_tmp/ \
    --wandb_project=parler-francais \
    --feature_extractor_name=ylacombe/dac_44khZ_8kbps \
    --description_tokenizer_name=google/flan-t5-large \
    --prompt_tokenizer_name=google/flan-t5-large \
    --report_to=wandb \
    --overwrite_output_dir \
    --output_dir=/output-checkpoint \
    --train_dataset_name=PHBJT/cml-tts-20percent-subset \
    --train_metadata_dataset_name=PHBJT/cml-tts-20percent-subset-description \
    --train_dataset_config_name=default \
    --train_split_name=train \
    --eval_dataset_name=PHBJT/cml-tts-20percent-subset \
    --eval_metadata_dataset_name=PHBJT/cml-tts-20percent-subset-description \
    --eval_dataset_config_name=default \
    --eval_split_name=test \
    --target_audio_column_name=audio \
    --description_column_name=text_description \
    --prompt_column_name=text \
    --max_eval_samples=10
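The two invocation styles are equivalent: each key in french_training.json becomes a --key flag on the command line. A minimal sketch of that mapping is below; the helper name and the boolean/list handling shown here are illustrative, not part of the training script:

```python
# Illustrative mapping from a JSON config dict to CLI-style flags:
# True booleans become bare flags, False booleans are omitted,
# and lists are joined with commas.
def config_to_cli_args(config: dict) -> list:
    args = []
    for key, value in config.items():
        if isinstance(value, bool):
            if value:  # e.g. "overwrite_output_dir": true -> --overwrite_output_dir
                args.append(f"--{key}")
        elif isinstance(value, list):
            args.append(f"--{key}={','.join(map(str, value))}")
        else:
            args.append(f"--{key}={value}")
    return args


if __name__ == "__main__":
    sample = {
        "model_name_or_path": "parler-tts/parler_tts_mini_v0.1",
        "report_to": ["wandb"],
        "overwrite_output_dir": True,
        "max_eval_samples": 10,
    }
    print(" ".join(config_to_cli_args(sample)))
```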

Optional Extra Steps

You can run these extra steps in a FlexAI Interactive Session or in a local environment (e.g. pipenv install --python 3.10), provided your hardware is capable of running inference.

Inference

A simple inference script that you can easily adapt to your needs is available at code/text-to-speech/predict.py.

Manual Dataset Pre-processing

If you’d prefer to perform the dataset pre-processing step yourself, you can follow these instructions.

Clone this repository

If you haven’t already, clone this repository on your host machine:
git clone https://github.com/flexaihq/blueprints.git blueprints --depth 1 --branch main && cd blueprints

Install the dependencies

Depending on your environment, you may need to install the experiment's dependencies (if you haven't already) by running:
pip install -r code/text-to-speech/requirements.txt

Dataset preparation

Prepare the dataset by adding the --preprocessing_only flag to ./code/text-to-speech/french_training.json and then running the training command.
For large datasets, it is recommended to run the preprocessing on a single machine to avoid timeouts that can occur when running the script in distributed mode.
The content will be saved to the destination specified by --save_to_disk (e.g. --save_to_disk=./text-to-speech-fr/ for a local run). Run the dataset preparation using:
python code/text-to-speech/run_parler_tts_training.py ./code/text-to-speech/french_training.json
Make sure to remove the --preprocessing_only flag before attempting to run the script for training purposes.
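Since the same JSON file drives both preprocessing and training, a small helper can toggle the flag between the two runs. This sketch is a hypothetical convenience, not part of the repository:

```python
# Toggle the "preprocessing_only" key in a training config file:
# add it for the preprocessing run, remove it before training.
import json


def set_preprocessing_only(config_path: str, enabled: bool) -> None:
    with open(config_path) as f:
        config = json.load(f)
    if enabled:
        config["preprocessing_only"] = True
    else:
        config.pop("preprocessing_only", None)  # no-op if the key is absent
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)


if __name__ == "__main__":
    set_preprocessing_only("./code/text-to-speech/french_training.json", True)
```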

Code

code/text-to-speech/french_training.json

{
    "model_name_or_path": "parler-tts/parler_tts_mini_v0.1",
    "save_to_disk": "/input/text-to-speech-fr",
    "temporary_save_to_disk": "./audio_code_tmp/",
    "wandb_project": "parler-francais",
    "feature_extractor_name": "ylacombe/dac_44khZ_8kbps",
    "description_tokenizer_name": "google/flan-t5-large",
    "prompt_tokenizer_name": "google/flan-t5-large",
    "report_to": [
        "wandb"
    ],
    "overwrite_output_dir": true,
    "output_dir": "/output-checkpoint",
    "train_dataset_name": "PHBJT/cml-tts-20percent-subset",
    "train_metadata_dataset_name": "PHBJT/cml-tts-20percent-subset-description",
    "train_dataset_config_name": "default",
    "train_split_name": "train",
    "eval_dataset_name": "PHBJT/cml-tts-20percent-subset",
    "eval_metadata_dataset_name": "PHBJT/cml-tts-20percent-subset-description",
    "eval_dataset_config_name": "default",
    "eval_split_name": "test",
    "target_audio_column_name": "audio",
    "description_column_name": "text_description",
    "prompt_column_name": "text",
    "max_eval_samples": 10,
    "max_duration_in_seconds": 30,
    "min_duration_in_seconds": 2.0,
    "max_text_length": 600,
    "group_by_length": true,
    "add_audio_samples_to_wandb": true,
    "preprocessing_num_workers": 8,
    "do_train": true,
    "num_train_epochs": 100,
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": false,
    "per_device_train_batch_size": 6,
    "learning_rate": 0.00095,
    "adam_beta1": 0.9,
    "adam_beta2": 0.99,
    "weight_decay": 0.01,
    "lr_scheduler_type": "constant_with_warmup",
    "warmup_steps": 500,
    "logging_steps": 100,
    "freeze_text_encoder": true,
    "do_eval": true,
    "predict_with_generate": true,
    "include_inputs_for_metrics": true,
    "evaluation_strategy": "steps",
    "eval_steps": 1000,
    "save_steps": 1000,
    "per_device_eval_batch_size": 4,
    "audio_encoder_per_device_batch_size": 24,
    "dtype": "bfloat16",
    "seed": 456,
    "dataloader_num_workers": 8,
    "attn_implementation": "sdpa"
}
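Assuming the 1-node, 8-accelerator topology from the training command (an assumption; the config itself does not fix the device count), the effective global batch size implied by this config works out as follows:

```python
# Effective global batch size = per-device batch * grad accumulation * devices.
per_device_train_batch_size = 6   # from french_training.json
gradient_accumulation_steps = 4   # from french_training.json
num_accelerators = 8              # from the --accels 8 flag in the run command

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_accelerators)
print(effective_batch_size)  # 192
```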

code/text-to-speech/predict.py

# Copyright (c) 2025 FlexAI
# This file is part of the FlexAI Experiments repository.
# SPDX-License-Identifier: MIT

# This is a basic inference script you can easily modify for your needs
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

checkpoint_path = "YOUR CHECKPOINT PATH"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained(checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)

prompt = "Salut, comment vas-tu aujourd'hui ?"
description = "A man speaking at a moderate speed with moderate pitch, very clear audio recording that has no background noise."

# Tokenize the voice description (conditioning input) and the prompt (text to speak)
description_inputs = tokenizer(description, return_tensors="pt").to(device)
input_ids = description_inputs.input_ids
attention_mask = description_inputs.attention_mask
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(
    input_ids=input_ids,
    prompt_input_ids=prompt_input_ids,
    attention_mask=attention_mask,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("audio.wav", audio_arr, model.config.sampling_rate)
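If soundfile is not available in your environment, the final write step can be approximated with the standard library's wave module. This is a sketch assuming mono float samples in [-1, 1]; the helper is illustrative and not part of the repository:

```python
# Stdlib-only fallback for sf.write: write mono 16-bit PCM WAV.
import struct
import wave


def write_wav(path: str, samples, sampling_rate: int) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit
        w.setframerate(sampling_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(frames)
```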

code/text-to-speech/requirements.txt

accelerate==1.0.1
datasets>=2.21.0
evaluate>=0.4.3
jiwer>=3.0.4
parler-tts @ git+https://github.com/huggingface/parler-tts.git@5d0aca9753ab74ded179732f5bd797f7a8c6f8ee
wandb>=0.18.1