Qwen3-Next Series Explained: 80B-A3B Hybrid Architecture with Instruct and Thinking
4 min read

Qwen introduces the next-generation hybrid architecture series — Qwen3-Next
. It features 80B total parameters with sparse activation (only ~3B active per inference), significantly improving cost-efficiency for long-context, high-concurrency, and low-latency scenarios. The series contains two primary variants:
Qwen3-Next-80B-A3B-Instruct
: Optimized for instruction following and stable formatting; it does not emit thinking traces (no<think></think>
blocks).Qwen3-Next-80B-A3B-Thinking
: Designed for deep reasoning; it generates thinking traces by default (with<think></think>
; some chat templates may only show the closing tag), and typically supports longer chains of thought than the previous generation.
References:
- Hugging Face (model cards and collection for Instruct/Thinking)
- Qwen official blog and community materials
- OpenRouter model page and third-party summary (ai-bot.cn)
Why “Next”: Three Engineering Goals
- Cost & Throughput:
80B total × ~3B active
brings small-model efficiency while maintaining large-model capability. - Long Context & Stability: Optimized for 32K+ context, balancing key information recall and stable structured outputs.
- Production Readiness: Emphasis on latency and concurrency for online APIs, enterprise knowledge QA, and agentic workflows.
Architecture Overview: Gated DeltaNet × Gated Attention × MoE
- Gated DeltaNet (speed-first)
- Optimized for long-text processing with near-linear memory growth and fast streaming generation — ideal for high-concurrency services.
- Gated Attention (precision-first)
- Performs precise recall at critical positions, preserving key facts and paragraph structure in long-form generation.
- Large-scale MoE expert system
- Community sources indicate up to 512 experts; each request routes to only a small set (e.g., Top-K + 1 shared), enabling load balancing and compute throttling.
- Native MTP (Multi-Token Prediction) training
- Introduces multi-token prediction during pretraining, reducing inference steps, boosting long-text throughput, and lowering latency.
Instruct vs Thinking: How to Choose
-
Instruct (best for production chat/agents)
- No
<think></think>
blocks; more stable formatting and alignment — great for productized use cases (customer support, structured JSON/Markdown outputs). - Stronger instruction-following and format control; easy to template and evaluate.
- No
-
Thinking (best for deep/complex reasoning)
- Defaults to injecting thinking markers via the chat template; seeing only a closing
</think>
is normal in some templates (controlled by the official template). - Longer chain-of-thought, often improving accuracy and stability on multi-step logic, math, and programming tasks.
- Defaults to injecting thinking markers via the chat template; seeing only a closing
Practical tips:
- Choose Instruct for strict formatting and fast productization.
- Choose Thinking for complex reasoning, research, and engineering derivations.
- In RAG/Agent systems, you can route dynamically by task type.
Capabilities (Community Summary)
- Instruct: Matches or approaches larger flagship models (e.g., 235B) on many instruction/multi-task benchmarks, with advantages in long-text throughput and latency.
- Thinking: Shows stronger reasoning than many lightweight/fast models (reports indicate it can surpass Gemini 2.5 Flash-Thinking in some metrics) and can output longer reasoning chains.
Note: Please refer to official model cards and third-party evaluations for exact numbers. This article focuses on capability profiles and engineering trade-offs.
Typical Use Cases
- Long-form summarization and report generation: whitepapers, legal contracts, literature reviews.
- Code generation and refactoring: cross-file understanding, refactoring suggestions, test generation, and code reviews.
- Enterprise knowledge QA (RAG): multilingual QA, factual recall, traceable citations.
- Agentic workflows: stable tool-use, memory management, and structured outputs; mix Instruct/Thinking as needed.
- High-concurrency online services: low latency and stable alignment for commercial APIs.
Quick Start (Overview)
- Qwen Chat Web: switch between Instruct/Thinking (if available in the UI).
- Alibaba Model Studio (Bailian): production API access as per official docs, recommended for enterprise.
- Hugging Face: load according to the model card; the Thinking variant injects a
<think>
snippet in the default chat template. - OpenRouter: use models like
qwen/qwen3-next-80b-a3b-instruct
via OpenAI-compatible APIs.
Tip: The Thinking variant may output lengthy thinking content; if you want to hide it, strip
<think>...</think>
via system prompts or post-processing.
Comparison with Previous/Same-tier Models
- Versus dense models like Qwen3-32B: Next gains “lower cost × higher throughput” via sparse activation and scales better for long context.
- Versus traditional MoE: the mixed Gated DeltaNet/Gated Attention design improves the balance of speed and precision.
- Versus trillion-parameter flagships: community reports suggest Instruct approaches some instruction benchmarks of 235B, while Thinking makes large gains in reasoning; overall, Next emphasizes speed and stability for production.
Links
- Hugging Face (Thinking): https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
- Hugging Face (Instruct): https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
- Qwen Official Blog (Qwen3 series): https://qwenlm.github.io/blog/qwen3/
- Third-party summary (CN): https://ai-bot.cn/qwen3-next/
- OpenRouter model page: https://openrouter.ai/qwen/qwen3-next-80b-a3b-instruct
More Articles
![OpenAI 12-Day Technical Livestream Highlights Detailed Report [December 2024]](/_astro/openai-12day.C2KzT-7l_1ndTgg.jpg)
OpenAI 12-Day Technical Livestream Highlights Detailed Report [December 2024]

Understanding Core AI Technologies: The Synergy of MCP, Agent, RAG, and Function Call

AI Model Tools Comparison How to Choose Between SGLang, Ollama, VLLM, and LLaMA.cpp?

Ant Design X - React Component Library for Building AI Chat Applications

CES 2024 Review:Revisiting the Tech Highlights of 2024

VLC Automatic Subtitles and Translation (Based on Local Offline Open-Source AI Models) | CES 2025

Chrome(Chromium) Historical Version Offline Installer Download Guide

ClearerVoice-Studio: A One-Stop Solution for Speech Enhancement, Speech Denoising, Speech Separation and Speaker Extraction

CogAgent-9B Released: A GUI Interaction Model Jointly Developed by Zhipu AI and Tsinghua