Qwen3-Omni: an omni-modal model for text, image, audio, and video, with real-time speech output. It pairs a Thinker–Talker architecture with MoE and a multi-codebook design for low latency, supports 119 languages, and comes with vLLM and Transformers usage tips.