qwen3-tts: next-gen open source text-to-speech

Jan 23, 2026

ai · tts · models · speech

what it is

qwen3-tts is a new family of open-source text-to-speech models from the qwen team. it is designed for natural, human-like prosody, expressiveness, and consistent voice identity across languages.

supported features:

  • real-time, streaming synthesis with ultra-low latency (~97 ms)
  • multilingual output (10+ languages: Chinese, English, Japanese, etc.)
  • voice design via plain language descriptors
  • voice cloning from a few seconds of audio
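to make the streaming claim concrete, here's a minimal sketch of the consume-as-you-synthesize pattern. `stream_tts` is a stub standing in for the real qwen3-tts streaming API (whose actual interface i haven't confirmed); the point is that time-to-first-chunk, not total synthesis time, is what a latency figure like ~97 ms refers to.

```python
import time

# stub synthesizer standing in for qwen3-tts's streaming interface
# (hypothetical -- the real API will differ). yields fixed-size chunks
# of 16-bit mono pcm as they're "generated".
def stream_tts(text, chunk_ms=40, sample_rate=24_000):
    samples_per_chunk = sample_rate * chunk_ms // 1000
    # fake audio: one silent chunk per ~6 characters of input
    for _ in range(max(1, len(text) // 6)):
        yield bytes(2 * samples_per_chunk)

def consume(text):
    """forward chunks as they arrive instead of waiting for the full clip."""
    start = time.perf_counter()
    first_chunk_latency = None
    total_bytes = 0
    for chunk in stream_tts(text):
        if first_chunk_latency is None:
            # perceived latency = time until the first audible chunk
            first_chunk_latency = time.perf_counter() - start
        total_bytes += len(chunk)  # here you'd write to an audio device/socket
    return first_chunk_latency, total_bytes

latency, nbytes = consume("hello from a streaming tts pipeline")
```

the same loop shape works for agents and dubbing: playback starts on the first chunk while the model is still generating the rest.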

qwen3-tts models come in different sizes (e.g., ~0.6B or ~1.7B parameters), with corresponding prebuilt variants like base, customvoice, and voicedesign.

why it’s interesting

unlike basic tts tools, qwen3-tts takes expressiveness and flexibility seriously. it's optimized for:

  • natural intonation and rhythm rather than flat robotic output
  • long-form, stable synthesis without awkward pauses
  • streaming & real-time use cases — e.g., agents, assistants, dubbing

it’s also fully open source (apache 2.0), meaning you can:

  • run locally
  • customize voices
  • build products without steep api costs

anyway, personally i feel like this is close to sota among open-source tts models. can't wait to try it out on prod use cases.

read more: