Setup
The demo code for this experiment is located at `code/qwen3-tts`, but it is advisable to follow the steps below before jumping to the full demo.
Prerequisites
Before starting, make sure you have:

- A FlexAI account with access to the platform
- The `flexai` CLI installed and configured
Start the FlexAI Inference Endpoint for voice cloning
The Base model can rapidly clone a voice from as little as 3 seconds of user audio and can also serve as a starting point for fine-tuning other models. Start the FlexAI endpoint for the Qwen3-TTS-12Hz-1.7B-Base model:

```shell
INFERENCE_NAME=Qwen3-base
flexai inference serve $INFERENCE_NAME --runtime vllm-omni-0.14.0 -- Qwen/Qwen3-TTS-12Hz-1.7B-Base --stage-configs-path /workspace/vllm-omni/vllm_omni/model_executor/stage_configs/qwen3_tts.yaml --omni --trust-remote-code --enforce-eager
```
This command will:

- Create an inference endpoint named `Qwen3-base`
- Use the `vllm-omni` runtime
- Load the Qwen3-TTS-12Hz-1.7B-Base model from Hugging Face
Get Endpoint Information
Once the endpoint is deployed, you’ll see the API key displayed in the output. Store it, along with the endpoint URL, in environment variables:

```shell
export INFERENCE_API_KEY_BASE=<API_KEY_FROM_ENDPOINT_CREATION_OUTPUT>
export INFERENCE_URL_BASE=$(flexai inference inspect $INFERENCE_NAME -j | jq .config.endpointUrl -r)
```
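If you’d rather not install `jq`, the same field can be extracted with Python’s standard library. The JSON below is a minimal stand-in inferred from the `.config.endpointUrl` filter, not actual `flexai` output:

```python
import json

# Minimal stand-in for `flexai inference inspect $INFERENCE_NAME -j` output;
# the real payload has more fields, but the filter only needs config.endpointUrl.
raw = '{"config": {"endpointUrl": "https://inference.example/endpoint"}}'

# Equivalent of: jq -r .config.endpointUrl
endpoint_url = json.loads(raw)["config"]["endpointUrl"]
print(endpoint_url)  # → https://inference.example/endpoint
```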
You’ll notice these `export` lines use the `jq` tool to extract values from the JSON output of the `inspect` command. If you don’t have it already, you can get `jq` from its official website: https://jqlang.org/

Generate Audio
Now you can clone a voice by making HTTP POST requests to your endpoint. Here is an example:

```shell
curl -v -X POST -H "Authorization: Bearer $INFERENCE_API_KEY_BASE" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you? I am so excited. For Sure!",
    "ref_text": "Exactly. And, you know, one of the things we have, we ve been wondering about why some of these companies in the last seven, eight years are in the graveyard. And one of the challenges was they all went after CUDA or NVIDIA silicon.",
    "ref_audio": "https://tmpfiles.org/26240870/sample_1.wav",
    "language": "Auto",
    "task_type": "Base"
  }' -o excited.wav \
  $INFERENCE_URL_BASE/v1/audio/speech
```
Parameters Explanation
The API accepts the following parameters:

- `input`: The text you want the model to speak
- `ref_text`: The transcription of the reference audio file (the voice you want to clone)
- `ref_audio`: A URL pointing to the reference audio file (must be in WAV format and less than 10 seconds long)
- `language`: Language of the input text (set to “Auto” for automatic detection)
- `task_type`: Set to “Base” for voice cloning
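The `ref_audio` field does not have to be a public URL: the demo app further below sends the reference clip inline as a base64 `data:` URL instead. A minimal sketch of that encoding using only the standard library (the one-second silent clip is a stand-in for a real reference recording):

```python
import base64
import io
import wave

# Build a short mono 16-bit WAV in memory (stand-in for a real reference clip).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit PCM
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)  # one second of silence

# The same "data:audio/wav;base64,..." form the demo app puts in ref_audio.
ref_audio = "data:audio/wav;base64," + base64.b64encode(buf.getvalue()).decode("ascii")
print(ref_audio[:22])  # → data:audio/wav;base64,
```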
Start the FlexAI Inference Endpoint for using a custom voice
This model provides style control over target timbres via user instructions and supports 9 premium timbres covering various combinations of gender, age, language, and dialect. Start the FlexAI endpoint for the Qwen3-TTS-12Hz-1.7B-CustomVoice model:

```shell
INFERENCE_NAME=Qwen3-custom-voice
flexai inference serve $INFERENCE_NAME --runtime vllm-omni-0.14.0 -- Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --stage-configs-path /workspace/vllm-omni/vllm_omni/model_executor/stage_configs/qwen3_tts.yaml --omni --trust-remote-code --enforce-eager
```
This command will:

- Create an inference endpoint named `Qwen3-custom-voice`
- Use the `vllm-omni` runtime
- Load the Qwen3-TTS-12Hz-1.7B-CustomVoice model from Hugging Face
Get Endpoint Information
Once the endpoint is deployed, you’ll see the API key displayed in the output. Store it, along with the endpoint URL, in environment variables:

```shell
export INFERENCE_API_KEY_TTS=<API_KEY_FROM_ENDPOINT_CREATION_OUTPUT>
export INFERENCE_URL_TTS=$(flexai inference inspect $INFERENCE_NAME -j | jq .config.endpointUrl -r)
```
Generate Audio
Now you can generate audio with one of the provided custom voices by making HTTP POST requests to your endpoint. Here is an example:

```shell
curl -v -X POST \
  -H "Authorization: Bearer $INFERENCE_API_KEY_TTS" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "voice": "vivian",
    "language": "English"
  }' \
  -o output.wav \
  "$INFERENCE_URL_TTS/v1/audio/speech"
```

You can also pass a style instruction:

```shell
curl -X POST "$INFERENCE_URL_TTS/v1/audio/speech" \
  -H "Authorization: Bearer $INFERENCE_API_KEY_TTS" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "I am so excited!",
    "voice": "vivian",
    "language": "English",
    "instructions": "Speak with great enthusiasm"
  }' --output excited.wav
```
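The same request can be issued from Python with only the standard library. The sketch below just builds the request object so it runs without a live endpoint; uncomment the last lines to actually send it:

```python
import json
import os
import urllib.request

payload = {
    "input": "I am so excited!",
    "voice": "vivian",
    "language": "English",
    "instructions": "Speak with great enthusiasm",
}
req = urllib.request.Request(
    os.environ.get("INFERENCE_URL_TTS", "https://example.invalid") + "/v1/audio/speech",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": "Bearer " + os.environ.get("INFERENCE_API_KEY_TTS", ""),
        "Content-Type": "application/json",
    },
    method="POST",
)
print(req.get_method(), req.get_full_url())
# To send against a live endpoint:
# with urllib.request.urlopen(req) as resp, open("excited.wav", "wb") as f:
#     f.write(resp.read())
```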
Parameters Explanation
The API accepts the following parameters:

- `input`: The text you want the model to speak
- `voice`: The voice you want to use (you can list the available voices by making a GET request to the endpoint’s `/v1/audio/voices` path)
- `instructions`: How the model should speak the text (e.g., “Speak with great enthusiasm”, “Speak like a news anchor”, etc.)
- `language`: Language of the input text (set to “Auto” for automatic detection)
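Note that the `curl` examples above pass a lowercase voice id (`vivian`), while the demo app shows capitalized display names. The app derives the id from the display name like this:

```python
# Display names as shown in the demo UI (subset).
SPEAKERS = ["Aiden", "Ono_anna", "Uncle_fu", "Vivian"]

def to_voice_id(name: str) -> str:
    """Map a UI display name to the API voice id (lowercase, underscores for spaces)."""
    return name.lower().replace(" ", "_")

print([to_voice_id(s) for s in SPEAKERS])  # → ['aiden', 'ono_anna', 'uncle_fu', 'vivian']
```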
Demo App
The demo app allows you to easily test the endpoints you just created. You can find it in the `code/qwen3-tts` directory.
In the same directory you will find scripts to start the inference endpoints for both the base and custom voice models.
Make sure to start both endpoints before running the demo app:
Start the Qwen3-TTS-12Hz-1.7B-Base model:

```shell
source ./START/run-qwen3-tts-cloneVoice.sh
```

Start the Qwen3-TTS-12Hz-1.7B-CustomVoice model:

```shell
source ./START/run-qwen3-tts-customVoice.sh
```
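Sourcing both scripts exports the four environment variables the demo app validates at startup; you can run the same check by hand before launching:

```python
import os

# The four variables app.py requires (see its startup check).
REQUIRED = [
    "INFERENCE_API_KEY_BASE",
    "INFERENCE_URL_BASE",
    "INFERENCE_API_KEY_TTS",
    "INFERENCE_URL_TTS",
]

def missing_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [v for v in REQUIRED if not env.get(v)]

print(missing_vars())  # prints [] once both scripts have been sourced
```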
Start the demo app
Using uv (recommended):

```shell
uv run app.py
```

Using pip:

```shell
python3.13 -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
python app.py
```
Code
app.py
# coding=utf-8
# Qwen3-TTS Gradio Demo
# Supports: Voice Clone (Base), TTS (CustomVoice)
# based on https://huggingface.co/spaces/Qwen/Qwen3-TTS
# Copyright 2026 The Alibaba Qwen team.
# SPDX-License-Identifier: Apache-2.0
import argparse
import base64
import io
import os
import gradio as gr
import numpy as np
import soundfile as sf
from openai import OpenAI
# Speaker and language choices for CustomVoice model
SPEAKERS = [
"Aiden",
"Dylan",
"Eric",
"Ono_anna",
"Ryan",
"Serena",
"Sohee",
"Uncle_fu",
"Vivian",
]
LANGUAGES = [
"Auto",
"Chinese",
"English",
"Japanese",
"Korean",
"French",
"German",
"Spanish",
"Portuguese",
"Russian",
]
## ============================================================================
## GLOBAL MODEL LOADING - Load all models at startup
## ============================================================================
# Check required environment variables
required_env_vars = [
"INFERENCE_API_KEY_BASE",
"INFERENCE_URL_BASE",
"INFERENCE_API_KEY_TTS",
"INFERENCE_URL_TTS",
]
missing_vars = [var for var in required_env_vars if not os.getenv(var)]
if missing_vars:
raise EnvironmentError(
f"Missing required environment variables: {', '.join(missing_vars)}"
)
class EndpointConfig:
def __init__(
self, tts_api_key=None, tts_url=None, voice_api_key=None, voice_url=None
):
self.tts_api_key = tts_api_key
self.tts_url = tts_url
self.voice_api_key = voice_api_key
self.voice_url = voice_url
    def get_endpoint_config(self):
        # Return a list (not a tuple) so the caller can append a status message
        return [self.tts_api_key, self.tts_url, self.voice_api_key, self.voice_url]
def set_endpoint_config(self, tts_api_key, tts_url, voice_api_key, voice_url):
self.tts_api_key = tts_api_key
self.tts_url = tts_url
self.voice_api_key = voice_api_key
self.voice_url = voice_url
class Models:
def __init__(self):
self.base_model_1_7b = None
self.base_model_1_7b_model = None
self.custom_voice_model_1_7b = None
self.custom_voice_model_1_7b_model = None
endpoint_config = EndpointConfig()
models_endpoint = Models()
def update_config(tts_api_key, tts_url, voice_api_key, voice_url):
endpoint_config.set_endpoint_config(tts_api_key, tts_url, voice_api_key, voice_url)
outcome = ""
try:
models_endpoint.base_model_1_7b = OpenAI(api_key=tts_api_key, base_url=tts_url)
models = models_endpoint.base_model_1_7b.models.list()
        if len(models.data) != 1:
            raise ValueError(
                f"Expected exactly one model in the endpoint {tts_url}, but got {len(models.data)}"
            )
models_endpoint.base_model_1_7b_model = models.data[0].id
outcome = "Saving config successfully!"
except Exception as e:
outcome = f"Error: {type(e).__name__}: {e}"
try:
models_endpoint.custom_voice_model_1_7b = OpenAI(
api_key=voice_api_key,
base_url=voice_url,
)
models = models_endpoint.custom_voice_model_1_7b.models.list()
        if len(models.data) != 1:
            raise ValueError(
                f"Expected exactly one model in the endpoint {voice_url}, but got {len(models.data)}"
            )
models_endpoint.custom_voice_model_1_7b_model = models.data[0].id
outcome = "Saving config successfully!"
except Exception as e:
outcome = f"Error: {type(e).__name__}: {e}"
config = endpoint_config.get_endpoint_config()
config.append(outcome)
return config
update_config(
tts_api_key=os.getenv("INFERENCE_API_KEY_BASE"),
tts_url=f"{os.getenv('INFERENCE_URL_BASE')}/v1",
voice_api_key=os.getenv("INFERENCE_API_KEY_TTS"),
voice_url=f"{os.getenv('INFERENCE_URL_TTS')}/v1",
)
## ============================================================================
def _normalize_audio(wav, eps=1e-12, clip=True):
"""Normalize audio to float32 in [-1, 1] range."""
x = np.asarray(wav)
if np.issubdtype(x.dtype, np.integer):
info = np.iinfo(x.dtype)
if info.min < 0:
y = x.astype(np.float32) / max(abs(info.min), info.max)
else:
mid = (info.max + 1) / 2.0
y = (x.astype(np.float32) - mid) / mid
elif np.issubdtype(x.dtype, np.floating):
y = x.astype(np.float32)
m = np.max(np.abs(y)) if y.size else 0.0
if m > 1.0 + 1e-6:
y = y / (m + eps)
else:
raise TypeError(f"Unsupported dtype: {x.dtype}")
if clip:
y = np.clip(y, -1.0, 1.0)
if y.ndim > 1:
y = np.mean(y, axis=-1).astype(np.float32)
return y
def _audio_to_tuple(audio):
"""Convert Gradio audio input to (wav, sr) tuple."""
if audio is None:
return None
if isinstance(audio, tuple) and len(audio) == 2 and isinstance(audio[0], int):
sr, wav = audio
wav = _normalize_audio(wav)
return wav, int(sr)
if isinstance(audio, dict) and "sampling_rate" in audio and "data" in audio:
sr = int(audio["sampling_rate"])
wav = _normalize_audio(audio["data"])
return wav, sr
return None
def encode_audio_to_base64(audio_tuple) -> str:
"""Encode a (wav, sr) tuple to base64 data URL (WAV format)."""
wav, sr = audio_tuple
buffer = io.BytesIO()
sf.write(buffer, wav, sr, format="WAV")
buffer.seek(0)
audio_bytes = buffer.read()
audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
return f"data:audio/wav;base64,{audio_b64}"
def decode_response_audio(content) -> tuple:
"""Decode audio bytes from API response to (wav, sr) tuple."""
audio_buffer = io.BytesIO(content)
audio_np, sr = sf.read(audio_buffer)
return sr, audio_np
def generate_voice_clone(
ref_audio,
ref_text,
target_text,
language,
use_xvector_only,
progress=gr.Progress(track_tqdm=True),
):
"""Generate speech using Base (Voice Clone) model."""
if not target_text or not target_text.strip():
return None, "Error: Target text is required."
    audio_tuple = _audio_to_tuple(ref_audio)
    if audio_tuple is None:
        return None, "Error: Reference audio is required."
    audio_b64 = encode_audio_to_base64(audio_tuple)
if not use_xvector_only and (not ref_text or not ref_text.strip()):
return (
None,
"Error: Reference text is required when 'Use x-vector only' is not enabled.",
)
try:
response = models_endpoint.base_model_1_7b.audio.speech.create(
input=target_text.strip(),
voice=None,
model=models_endpoint.base_model_1_7b_model,
extra_body={
"language": language,
"ref_audio": audio_b64,
"ref_text": ref_text.strip() if ref_text else None,
"x_vector_only_mode": use_xvector_only,
"task_type": "Base",
},
)
return (
decode_response_audio(response.content),
"Generation completed successfully!",
)
    except Exception as e:
        # Note: `response` may be unbound if the request itself failed
        return None, f"Error: {type(e).__name__}: {e}"
def generate_custom_voice(
text, language, speaker, instruct, progress=gr.Progress(track_tqdm=True)
):
"""Generate speech using CustomVoice model."""
if not text or not text.strip():
return None, "Error: Text is required."
if not speaker:
return None, "Error: Speaker is required."
try:
response = models_endpoint.custom_voice_model_1_7b.audio.speech.create(
model=models_endpoint.custom_voice_model_1_7b_model,
voice=speaker.lower().replace(" ", "_"),
input=text.strip(),
instructions=instruct.strip() if instruct else None,
extra_body={
"language": language,
},
)
return (
decode_response_audio(response.content),
"Generation completed successfully!",
)
    except Exception as e:
        # Note: `response` may be unbound if the request itself failed
        return None, f"Error: {type(e).__name__}: {e}"
# Build Gradio UI
def build_ui():
theme = gr.themes.Soft(
font=[gr.themes.GoogleFont("Source Sans Pro"), "Arial", "sans-serif"],
)
css = """
.gradio-container {max-width: none !important;}
.tab-content {padding: 20px;}
"""
with gr.Blocks(theme=theme, css=css, title="Qwen3-TTS Demo") as demo:
gr.Markdown(
"""
# Qwen3-TTS
This demo is using flex.ai inference API:
- **TTS (CustomVoice)**: Generate speech with predefined speakers and optional style instructions
- **Voice Clone (Base)**: Clone any voice from a reference audio
"""
)
def toggle_api_key_visibility(visible, value):
type = "text" if visible else "password"
return gr.Textbox(label="API Key", type=type, value=value)
with gr.Tabs():
# Tab 1: TTS (CustomVoice)
with gr.Tab("TTS (CustomVoice)"):
gr.Markdown("### Text-to-Speech with Predefined Speakers")
with gr.Row():
with gr.Column(scale=2):
tts_text = gr.Textbox(
label="Text to Synthesize",
lines=4,
placeholder="Enter the text you want to convert to speech...",
value="""
Hello! Welcome to Flex dot AI.
This demo is using Flex private cloud inference API to run Qwen3-TTS models in the cloud, so no GPU is required on your end!
Just enter your text and settings, and let the flex compute do the heavy lifting to generate high-quality speech at scale for you, fast and scalable inference capabilities included!
And do not forget that you can also finetune your own custom TTS model here and then deploy on flex dot ai in just one click!
""",
)
with gr.Row():
tts_language = gr.Dropdown(
label="Language",
choices=LANGUAGES,
value="English",
interactive=True,
)
tts_speaker = gr.Dropdown(
label="Speaker",
choices=SPEAKERS,
value="Aiden",
interactive=True,
)
with gr.Row():
tts_instruct = gr.Textbox(
label="Style Instruction (Optional)",
lines=2,
placeholder="e.g., Speak in a cheerful and energetic tone",
value="Speak in a cheerful and energetic tone",
)
tts_btn = gr.Button("Generate Speech", variant="primary")
with gr.Column(scale=2):
tts_audio_out = gr.Audio(label="Generated Audio", type="numpy")
tts_status = gr.Textbox(
label="Status", lines=2, interactive=False
)
tts_btn.click(
generate_custom_voice,
inputs=[tts_text, tts_language, tts_speaker, tts_instruct],
outputs=[tts_audio_out, tts_status],
)
# Tab 2: Voice Clone (Base)
with gr.Tab("Voice Clone (Base)"):
gr.Markdown("### Clone Voice from Reference Audio")
with gr.Row():
with gr.Column(scale=2):
clone_ref_audio = gr.Audio(
label="Reference Audio (Upload a voice sample to clone)",
type="numpy",
)
clone_ref_text = gr.Textbox(
label="Reference Text (Transcript of the reference audio)",
lines=2,
placeholder="Enter the exact text spoken in the reference audio...",
)
clone_xvector = gr.Checkbox(
label="Use x-vector only (No reference text needed, but lower quality)",
value=False,
)
with gr.Column(scale=2):
clone_target_text = gr.Textbox(
label="Target Text (Text to synthesize with cloned voice)",
lines=4,
placeholder="Enter the text you want the cloned voice to speak...",
)
with gr.Row():
clone_language = gr.Dropdown(
label="Language",
choices=LANGUAGES,
value="Auto",
interactive=True,
)
clone_btn = gr.Button("Clone & Generate", variant="primary")
with gr.Row():
clone_audio_out = gr.Audio(label="Generated Audio", type="numpy")
clone_status = gr.Textbox(
label="Status", lines=2, interactive=False
)
clone_btn.click(
generate_voice_clone,
inputs=[
clone_ref_audio,
clone_ref_text,
clone_target_text,
clone_language,
clone_xvector,
],
outputs=[clone_audio_out, clone_status],
)
# Tab 3: Endpoint Config
with gr.Tab("Endpoint Config"):
with gr.Group(elem_id="tab-content"):
tts_url = gr.Textbox(label="TTS CustomVoice URL")
with gr.Row():
tts_api_key = gr.Textbox(
placeholder="TTS CustomVoice API Key",
type="password",
scale=4,
show_label=False,
)
tts_api_key_visible = gr.Checkbox(
label="Show TTS CustomVoice API Key",
value=False,
scale=1,
)
voice_url = gr.Textbox(label="Voice Clone URL")
with gr.Row():
voice_api_key = gr.Textbox(
label="Voice Clone API Key",
type="password",
scale=4,
show_label=False,
)
voice_api_key_visible = gr.Checkbox(
label="Voice Clone API Key",
value=False,
scale=1,
)
save_btn = gr.Button("Save")
config_error = gr.Markdown(value="", visible=False)
save_btn.click(
update_config,
inputs=[
tts_api_key,
tts_url,
voice_api_key,
voice_url,
],
outputs=[
tts_api_key,
tts_url,
voice_api_key,
voice_url,
config_error,
],
)
tts_api_key_visible.change(
toggle_api_key_visibility,
inputs=[tts_api_key_visible, tts_api_key],
outputs=tts_api_key,
)
voice_api_key_visible.change(
toggle_api_key_visibility,
inputs=[voice_api_key_visible, voice_api_key],
outputs=voice_api_key,
)
demo.load(
endpoint_config.get_endpoint_config,
inputs=None,
outputs=[
tts_api_key,
tts_url,
voice_api_key,
voice_url,
],
)
return demo
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Qwen3-TTS Gradio Demo")
parser.add_argument(
"--share",
action="store_true",
default=False,
help="Enable public sharing of the Gradio app (default: False)",
)
args = parser.parse_args()
demo = build_ui()
demo.launch(share=args.share)
requirements.txt
# Qwen3-TTS Dependencies for flexAI
gradio==6.9.0
numpy==2.4.3
openai==2.26.0
soundfile==0.13.1
sox==1.5.0
START/run-qwen3-tts-cloneVoice.sh
#!/bin/bash
tmp_log=$(mktemp)
export MODEL_NAME=Qwen/Qwen3-TTS-12Hz-1.7B-Base
export INFERENCE_NAME=Qwen3-Base-$(whoami)-$(uuidgen | cut -d '-' -f 1)
export TENSOR_PARALLEL_SIZE=1
export RUNTIME=vllm-omni-0.14.0
flexai inference serve $INFERENCE_NAME \
--affinity "cluster=k8s-training-smc-001" \
--runtime ${RUNTIME} \
--accels ${TENSOR_PARALLEL_SIZE} \
-- $MODEL_NAME --stage-configs-path /workspace/vllm-omni/vllm_omni/model_executor/stage_configs/qwen3_tts.yaml --omni --trust-remote-code --enforce-eager \
2>&1 | tee "${tmp_log}"
export INFERENCE_API_KEY_BASE=$(grep 'API Key' ${tmp_log} | cut -d ':' -f 2 | tr -d ' ')
sleep 10
# wait for the inference to be ready
STATUS="enqueued"
while [ "${STATUS}" != "running" ]; do
STATUS=$(flexai inference inspect $INFERENCE_NAME -j | jq -r .runtime.status)
echo "${STATUS}"
sleep 10
done
export INFERENCE_URL_BASE=$(flexai inference inspect $INFERENCE_NAME -j | jq .config.endpointUrl -r)
START/run-qwen3-tts-customVoice.sh
#!/bin/bash
tmp_log=$(mktemp)
# CustomVoice model
export MODEL_NAME=Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
export INFERENCE_NAME=Qwen3-TTS-$(whoami)-$(uuidgen | cut -d '-' -f 1)
export TENSOR_PARALLEL_SIZE=1
export RUNTIME=vllm-omni-0.14.0
flexai inference serve $INFERENCE_NAME \
--affinity "cluster=k8s-training-smc-001" \
--runtime ${RUNTIME} \
--accels ${TENSOR_PARALLEL_SIZE} \
-- $MODEL_NAME --stage-configs-path /workspace/vllm-omni/vllm_omni/model_executor/stage_configs/qwen3_tts.yaml --omni --trust-remote-code --enforce-eager \
2>&1 | tee "${tmp_log}"
export INFERENCE_API_KEY_TTS=$(grep 'API Key' ${tmp_log} | cut -d ':' -f 2 | tr -d ' ')
sleep 10
# wait for the inference to be ready
STATUS="enqueued"
while [ "${STATUS}" != "running" ]; do
STATUS=$(flexai inference inspect $INFERENCE_NAME -j | jq -r .runtime.status)
echo "${STATUS}"
sleep 10
done
export INFERENCE_URL_TTS=$(flexai inference inspect $INFERENCE_NAME -j | jq .config.endpointUrl -r)