what it is
qwen3-tts is a new family of open-source text-to-speech models from the Qwen team, designed for natural, human-like prosody, expressiveness, and consistent voice identity across languages.
supported features:
- real-time streaming synthesis with ultra-low latency (~97 ms)
- multilingual output (10+ languages, including Chinese, English, and Japanese)
- voice design via plain language descriptors
- voice cloning from a few seconds of audio
qwen3-tts models come in multiple sizes (e.g., ~0.6B and ~1.7B parameters), with prebuilt variants such as base, customvoice, and voicedesign.
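the streaming mode is the interesting part architecturally: audio comes back in small chunks, and only the first chunk pays the full latency cost. here's a minimal sketch of what a streaming consumer loop looks like. note that `synthesize_stream` is a hypothetical stand-in, not the real qwen3-tts API (which may differ); it just fakes chunked PCM output so the shape of the loop is clear.

```python
import time

# Hypothetical stand-in for a streaming TTS call: yields short PCM chunks.
# The real qwen3-tts interface may differ; this only illustrates the shape
# of a low-latency streaming consumer.
def synthesize_stream(text, chunk_ms=40, sample_rate=24_000):
    samples_per_chunk = sample_rate * chunk_ms // 1000
    n_chunks = max(1, len(text) // 10)  # fake duration proportional to text
    for _ in range(n_chunks):
        yield bytes(2 * samples_per_chunk)  # silent 16-bit PCM placeholder

def play_streaming(text):
    t0 = time.monotonic()
    first_chunk_latency = None
    total_bytes = 0
    for chunk in synthesize_stream(text):
        if first_chunk_latency is None:
            # with the real model this is the ~97 ms number that matters
            first_chunk_latency = time.monotonic() - t0
        total_bytes += len(chunk)  # a real app hands this to an audio device
    return first_chunk_latency, total_bytes

latency, n_bytes = play_streaming("hello from a streaming tts sketch")
print(f"first chunk after {latency * 1000:.2f} ms, {n_bytes} bytes total")
```

the point is that playback can start as soon as the first chunk arrives, so perceived latency is the time-to-first-chunk, not the time to synthesize the whole utterance.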
why it’s interesting
unlike basic tts tools, qwen3-tts takes expressiveness and flexibility seriously. it's optimized for:
- natural intonation and rhythm rather than flat robotic output
- long-form, stable synthesis without awkward pauses
- streaming and real-time use cases (e.g., agents, assistants, dubbing)
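whether streaming stays real-time comes down to simple arithmetic: if each chunk takes less compute time than the audio it contains, the stream never stalls after the first packet. the numbers below are illustrative assumptions, not benchmarks of qwen3-tts.

```python
# Back-of-envelope real-time budget (illustrative numbers, not benchmarks).
first_packet_ms = 97    # the claimed first-packet latency
chunk_audio_ms = 40     # hypothetical audio duration per chunk
chunk_compute_ms = 25   # hypothetical compute time per chunk

# Real-time factor: compute time / audio time. RTF < 1 means synthesis
# outruns playback, so the buffer only ever grows once streaming starts.
rtf = chunk_compute_ms / chunk_audio_ms
print(f"RTF = {rtf:.3f} -> {'real-time capable' if rtf < 1 else 'too slow'}")
```

for interactive agents, the ~97 ms first-packet latency sets the response feel, and an RTF below 1 is what keeps long-form output glitch-free.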
it’s also fully open source (Apache 2.0), meaning you can:
- run locally
- customize voices
- build products without steep api costs
anyway, personally i feel this is close to sota among open-source tts models. can't wait to try it out on prod use cases.
read more: