The goal of this experiment is to fine-tune the parler_tts_mini_v0.1 model to create a French version. The model generates high-quality speech from input text, which can be controlled with a description prompt (e.g., gender, speaking rate). The training uses a French text-to-speech dataset, enabling the model to produce natural and expressive speech in that language.

Connect to GitHub (if needed)

If you haven’t already connected FlexAI to GitHub, you’ll need to set up a code registry connection:
flexai code-registry connect
This allows FlexAI to pull repositories directly from GitHub when they are referenced in training commands (the -u / --repository-url flag).

Getting the Dataset

You can download the pre-processed version of the dataset by running the following command:
DATASET_NAME=text-to-speech-fr && \
  curl -L -o ${DATASET_NAME}.zip "https://bucket-docs-samples-99b3a05.s3.eu-west-1.amazonaws.com/${DATASET_NAME}.zip" && \
  unzip ${DATASET_NAME}.zip && \
  rm ${DATASET_NAME}.zip
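If you prefer to script this step in Python (for example inside a notebook), the same download-extract-clean sequence can be sketched as below. The helper names are illustrative, not part of the repository:

```python
# Python equivalent of the one-line shell command above:
# download the archive, unpack it, then remove the zip file.
import os
import urllib.request
import zipfile

DATASET_NAME = "text-to-speech-fr"
BASE_URL = "https://bucket-docs-samples-99b3a05.s3.eu-west-1.amazonaws.com"


def download(url: str, dest: str) -> None:
    """Save the remote archive to a local path."""
    urllib.request.urlretrieve(url, dest)


def extract_and_clean(zip_path: str, target_dir: str = ".") -> None:
    """Unpack the archive, then delete it (mirrors `unzip && rm`)."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
    os.remove(zip_path)


if __name__ == "__main__":
    archive = f"{DATASET_NAME}.zip"
    download(f"{BASE_URL}/{DATASET_NAME}.zip", archive)
    extract_and_clean(archive)
```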
If you’d like to reproduce the pre-processing steps yourself to use a different dataset or simply to learn more about the process, you can refer to the Manual Dataset Pre-processing section below.
Next, push the contents of the text-to-speech-fr/ directory as a new FlexAI dataset:
flexai dataset push text-to-speech-fr --file text-to-speech-fr

Training

To start the Training Job, run the following command:
flexai training run text-to-speech-ddp \
  --repository-url https://github.com/flexaihq/blueprints \
  --dataset text-to-speech-fr \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --requirements-path code/text-to-speech/requirements.txt \
  --nodes 1 --accels 8 \
  -- code/text-to-speech/run_parler_tts_training.py ./code/text-to-speech/french_training.json
Instead of passing a .json file as input, you can also set the arguments manually. For example:
flexai training run text-to-speech-ddp \
  --repository-url https://github.com/flexaihq/blueprints \
  --dataset text-to-speech-fr \
  --secret WANDB_API_KEY=<WANDB_API_KEY_SECRET_NAME> \
  --requirements-path code/text-to-speech/requirements.txt \
  --nodes 1 --accels 8 \
  -- code/text-to-speech/run_parler_tts_training.py \
    --model_name_or_path=parler-tts/parler_tts_mini_v0.1 \
    --save_to_disk=/input/text-to-speech-fr \
    --temporary_save_to_disk=./audio_code_tmp/ \
    --wandb_project=parler-francais \
    --feature_extractor_name=ylacombe/dac_44khZ_8kbps \
    --description_tokenizer_name=google/flan-t5-large \
    --prompt_tokenizer_name=google/flan-t5-large \
    --report_to=wandb \
    --overwrite_output_dir \
    --output_dir=/output-checkpoint \
    --train_dataset_name=PHBJT/cml-tts-20percent-subset \
    --train_metadata_dataset_name=PHBJT/cml-tts-20percent-subset-description \
    --train_dataset_config_name=default \
    --train_split_name=train \
    --eval_dataset_name=PHBJT/cml-tts-20percent-subset \
    --eval_metadata_dataset_name=PHBJT/cml-tts-20percent-subset-description \
    --eval_dataset_config_name=default \
    --eval_split_name=test \
    --target_audio_column_name=audio \
    --description_column_name=text_description \
    --prompt_column_name=text \
    --max_eval_samples=10
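The two invocation styles are equivalent: each key in french_training.json becomes a --key flag on the command line. A minimal sketch of that mapping is below; the helper name and the boolean/list handling shown here are illustrative, not part of the training script:

```python
# Illustrative mapping from a JSON config dict to CLI-style flags:
# True booleans become bare flags, False booleans are omitted,
# and lists are joined with commas.
def config_to_cli_args(config: dict) -> list:
    args = []
    for key, value in config.items():
        if isinstance(value, bool):
            if value:  # e.g. "overwrite_output_dir": true -> --overwrite_output_dir
                args.append(f"--{key}")
        elif isinstance(value, list):
            args.append(f"--{key}={','.join(map(str, value))}")
        else:
            args.append(f"--{key}={value}")
    return args


if __name__ == "__main__":
    sample = {
        "model_name_or_path": "parler-tts/parler_tts_mini_v0.1",
        "report_to": ["wandb"],
        "overwrite_output_dir": True,
        "max_eval_samples": 10,
    }
    print(" ".join(config_to_cli_args(sample)))
```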

Optional Extra Steps

You can run these extra steps in a FlexAI Interactive Session or in a local environment (e.g. pipenv install --python 3.10), provided your hardware is capable of running inference.

Inference

A simple inference script that you can easily adapt to your needs is available at code/text-to-speech/predict.py.

Manual Dataset Pre-processing

If you’d prefer to perform the dataset pre-processing step yourself, you can follow these instructions.

Clone this repository

If you haven’t already, clone this repository on your host machine:
git clone https://github.com/flexaihq/blueprints.git blueprints --depth 1 --branch main && cd blueprints

Install the dependencies

Depending on your environment, you may need to install the experiment's dependencies (if you haven't already) by running:
pip install -r code/text-to-speech/requirements.txt

Dataset preparation

Prepare the dataset by adding the --preprocessing_only flag to ./code/text-to-speech/french_training.json and then running the training command.
For large datasets, it is recommended to run the preprocessing on a single machine to avoid timeouts that can occur when running the script in distributed mode.
The content will be saved to the destination specified by --save_to_disk (e.g. --save_to_disk=./text-to-speech-fr/ for a local run). Run the dataset preparation using:
python code/text-to-speech/run_parler_tts_training.py ./code/text-to-speech/french_training.json
Make sure to remove the --preprocessing_only flag before attempting to run the script for training purposes.
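Since the same JSON file drives both preprocessing and training, a small helper can toggle the flag between the two runs. This sketch is a hypothetical convenience, not part of the repository:

```python
# Toggle the "preprocessing_only" key in a training config file:
# add it for the preprocessing run, remove it before training.
import json


def set_preprocessing_only(config_path: str, enabled: bool) -> None:
    with open(config_path) as f:
        config = json.load(f)
    if enabled:
        config["preprocessing_only"] = True
    else:
        config.pop("preprocessing_only", None)  # no-op if the key is absent
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)


if __name__ == "__main__":
    set_preprocessing_only("./code/text-to-speech/french_training.json", True)
```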

Code

code/text-to-speech/french_training.json

{
    "model_name_or_path": "parler-tts/parler_tts_mini_v0.1",
    "save_to_disk": "/input/text-to-speech-fr",
    "temporary_save_to_disk": "./audio_code_tmp/",
    "wandb_project": "parler-francais",
    "feature_extractor_name": "ylacombe/dac_44khZ_8kbps",
    "description_tokenizer_name": "google/flan-t5-large",
    "prompt_tokenizer_name": "google/flan-t5-large",
    "report_to": [
        "wandb"
    ],
    "overwrite_output_dir": true,
    "output_dir": "/output-checkpoint",
    "train_dataset_name": "PHBJT/cml-tts-20percent-subset",
    "train_metadata_dataset_name": "PHBJT/cml-tts-20percent-subset-description",
    "train_dataset_config_name": "default",
    "train_split_name": "train",
    "eval_dataset_name": "PHBJT/cml-tts-20percent-subset",
    "eval_metadata_dataset_name": "PHBJT/cml-tts-20percent-subset-description",
    "eval_dataset_config_name": "default",
    "eval_split_name": "test",
    "target_audio_column_name": "audio",
    "description_column_name": "text_description",
    "prompt_column_name": "text",
    "max_eval_samples": 10,
    "max_duration_in_seconds": 30,
    "min_duration_in_seconds": 2.0,
    "max_text_length": 600,
    "group_by_length": true,
    "add_audio_samples_to_wandb": true,
    "preprocessing_num_workers": 8,
    "do_train": true,
    "num_train_epochs": 100,
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": false,
    "per_device_train_batch_size": 6,
    "learning_rate": 0.00095,
    "adam_beta1": 0.9,
    "adam_beta2": 0.99,
    "weight_decay": 0.01,
    "lr_scheduler_type": "constant_with_warmup",
    "warmup_steps": 500,
    "logging_steps": 100,
    "freeze_text_encoder": true,
    "do_eval": true,
    "predict_with_generate": true,
    "include_inputs_for_metrics": true,
    "evaluation_strategy": "steps",
    "eval_steps": 1000,
    "save_steps": 1000,
    "per_device_eval_batch_size": 4,
    "audio_encoder_per_device_batch_size": 24,
    "dtype": "bfloat16",
    "seed": 456,
    "dataloader_num_workers": 8,
    "attn_implementation": "sdpa"
}
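Assuming the 1-node, 8-accelerator topology from the training command (an assumption; the config itself does not fix the device count), the effective global batch size implied by this config works out as follows:

```python
# Effective global batch size = per-device batch * grad accumulation * devices.
per_device_train_batch_size = 6   # from french_training.json
gradient_accumulation_steps = 4   # from french_training.json
num_accelerators = 8              # from the --accels 8 flag in the run command

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_accelerators)
print(effective_batch_size)  # 192
```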

code/text-to-speech/predict.py

# Copyright (c) 2025 FlexAI
# This file is part of the FlexAI Experiments repository.
# SPDX-License-Identifier: MIT

# This is a basic inference script you can easily modify for your needs
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

checkpoint_path = "YOUR CHECKPOINT PATH"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained(checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)

prompt = "Salut, comment vas-tu aujourd'hui ?"
description = "A man speaking at a moderate speed with moderate pitch, very clear audio recording that has no background noise."

# Tokenize the voice description (conditioning input) and the prompt (text to speak)
description_inputs = tokenizer(description, return_tensors="pt").to(device)
input_ids = description_inputs.input_ids
attention_mask = description_inputs.attention_mask
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(
    input_ids=input_ids,
    prompt_input_ids=prompt_input_ids,
    attention_mask=attention_mask,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("audio.wav", audio_arr, model.config.sampling_rate)
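If soundfile is not available in your environment, the final write step can be approximated with the standard library's wave module. This is a sketch assuming mono float samples in [-1, 1]; the helper is illustrative and not part of the repository:

```python
# Stdlib-only fallback for sf.write: write mono 16-bit PCM WAV.
import struct
import wave


def write_wav(path: str, samples, sampling_rate: int) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit
        w.setframerate(sampling_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(frames)
```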

code/text-to-speech/requirements.txt

accelerate==1.0.1
datasets>=2.21.0
evaluate>=0.4.3
jiwer>=3.0.4
parler-tts @ git+https://github.com/huggingface/parler-tts.git@5d0aca9753ab74ded179732f5bd797f7a8c6f8ee
wandb>=0.18.1