Qwen3-Next Series Explained: 80B-A3B Hybrid Architecture with Instruct and Thinking


Qwen introduces Qwen3-Next, its next-generation hybrid-architecture series. It features 80B total parameters with sparse activation (only ~3B active per token), significantly improving cost-efficiency for long-context, high-concurrency, and low-latency scenarios. The series contains two primary variants:

  • Qwen3-Next-80B-A3B-Instruct: Optimized for instruction following and stable formatting; it does not emit thinking traces (no <think></think> blocks).
  • Qwen3-Next-80B-A3B-Thinking: Designed for deep reasoning; it generates thinking traces by default (wrapped in <think></think>; some chat templates pre-fill the opening tag, so only the closing tag may appear in output), and typically supports longer chains of thought than the previous generation.

References:

  • Hugging Face (model cards and collection for Instruct/Thinking)
  • Qwen official blog and community materials
  • OpenRouter model page and third-party summary (ai-bot.cn)

Why “Next”: Three Engineering Goals

  • Cost & Throughput: 80B total parameters with only ~3B active per token deliver small-model serving cost while retaining large-model capability.
  • Long Context & Stability: Optimized for 32K+ contexts, balancing recall of key information with stable structured output.
  • Production Readiness: Emphasis on latency and concurrency for online APIs, enterprise knowledge QA, and agentic workflows.

Architecture Overview: Gated DeltaNet × Gated Attention × MoE

  • Gated DeltaNet (speed-first)
    • Optimized for long-text processing with near-linear memory growth and fast streaming generation — ideal for high-concurrency services.
  • Gated Attention (precision-first)
    • Performs precise recall at critical positions, preserving key facts and paragraph structure in long-form generation.
  • Large-scale MoE expert system
    • Community sources indicate up to 512 experts; each token routes to only a small subset (e.g., top-k routed experts plus one shared expert), enabling load balancing and compute throttling (see the routing sketch after this list).
  • Native MTP (Multi-Token Prediction) training
    • Introduces multi-token prediction during pretraining, reducing inference decoding steps (e.g., via speculative decoding), boosting long-text throughput and lowering latency.
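
To make the routing pattern concrete, here is a toy sketch of a sparse MoE layer with top-k routed experts plus one always-on shared expert. The expert count, top-k, and dimensions below are illustrative placeholders, not the official Qwen3-Next configuration.

```python
# Toy sparse MoE layer: top-k routed experts + one shared expert.
# Sizes are placeholders, not the official Qwen3-Next config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2, d_ff=128):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # router
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.shared = make_ffn()  # always-on shared expert

    def forward(self, x):                                 # x: (tokens, d_model)
        weights, idx = self.gate(x).topk(self.top_k, -1)  # route each token
        weights = F.softmax(weights, dim=-1)              # renormalize over top-k
        out = self.shared(x)                              # shared path runs for all
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```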

Instruct vs Thinking: How to Choose

  • Instruct (best for production chat/agents)

    • No <think></think> blocks; more stable formatting and alignment, which suits production use cases (customer support, structured JSON/Markdown outputs).
    • Stronger instruction-following and format control; easy to template and evaluate.
  • Thinking (best for deep/complex reasoning)

    • The chat template injects thinking markers by default; because some templates pre-fill the opening <think> tag, output showing only a closing </think> is normal (see the template sketch after this list).
    • Longer chain-of-thought, often improving accuracy and stability on multi-step logic, math, and programming tasks.
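
As a minimal sketch of that pre-filled opening tag, the following renders the Thinking variant's chat template with the Hugging Face tokenizer (assuming the model id from the official collection):

```python
# Render the Thinking variant's chat template to inspect the pre-filled
# <think> tag. Model id assumed from the official Hugging Face collection.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Thinking")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant turn, incl. <think>
)
print(prompt)  # the rendered prompt should end with an opened <think> block
```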

Practical tips:

  • Choose Instruct for strict formatting and rapid production deployment.
  • Choose Thinking for complex reasoning, research, and engineering derivations.
  • In RAG/Agent systems, you can route dynamically by task type; a minimal router sketch follows this list.
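
A hypothetical router along those lines; the task labels and helper name are illustrative, with the model slugs taken from OpenRouter's naming:

```python
# Hypothetical task-type router for a RAG/agent pipeline: send
# formatting-sensitive tasks to Instruct and multi-step reasoning to
# Thinking. Task labels and the helper name are illustrative.
REASONING_TASKS = {"math", "code_review", "multi_step_planning"}

def pick_model(task_type: str) -> str:
    if task_type in REASONING_TASKS:
        return "qwen/qwen3-next-80b-a3b-thinking"
    return "qwen/qwen3-next-80b-a3b-instruct"

print(pick_model("math"))          # -> thinking variant
print(pick_model("support_chat"))  # -> instruct variant
```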

Capabilities (Community Summary)

  • Instruct: Matches or approaches larger flagship models (e.g., 235B-class Qwen3 models) on many instruction/multi-task benchmarks, with advantages in long-text throughput and latency.
  • Thinking: Shows stronger reasoning than many lightweight/fast models (reports indicate it can surpass Gemini 2.5 Flash-Thinking in some metrics) and can output longer reasoning chains.

Note: Please refer to official model cards and third-party evaluations for exact numbers. This article focuses on capability profiles and engineering trade-offs.

Typical Use Cases

  • Long-form summarization and report generation: whitepapers, legal contracts, literature reviews.
  • Code generation and refactoring: cross-file understanding, refactoring suggestions, test generation, and code reviews.
  • Enterprise knowledge QA (RAG): multilingual QA, factual recall, traceable citations.
  • Agentic workflows: stable tool-use, memory management, and structured outputs; mix Instruct/Thinking as needed.
  • High-concurrency online services: low latency and stable alignment for commercial APIs.

Quick Start (Overview)

  • Qwen Chat Web: switch between Instruct/Thinking (if available in the UI).
  • Alibaba Model Studio (Bailian): production API access per the official docs; recommended for enterprise use.
  • Hugging Face: load according to the model card; the Thinking variant's default chat template pre-fills a <think> opening snippet.
  • OpenRouter: use models like qwen/qwen3-next-80b-a3b-instruct via OpenAI-compatible APIs (see the sketch below).
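
For the OpenRouter path, a minimal sketch using the OpenAI-compatible API (assumes the openai Python package and an OPENROUTER_API_KEY environment variable):

```python
# Call the Instruct model through OpenRouter's OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
resp = client.chat.completions.create(
    model="qwen/qwen3-next-80b-a3b-instruct",
    messages=[{"role": "user", "content": "Summarize: ..."}],
)
print(resp.choices[0].message.content)
```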

Tip: The Thinking variant may output lengthy thinking content; if you want to hide it, strip the <think>...</think> span in post-processing (a small sketch follows).
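
A small post-processing sketch along those lines; the helper name is illustrative:

```python
# Strip <think>...</think> spans from model output. Also handles the case
# where the template pre-fills <think>, leaving only a closing tag.
import re

def strip_think(text: str) -> str:
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    if "</think>" in text:  # opening tag was in the prompt, not the output
        text = text.split("</think>", 1)[1]
    return text.strip()

print(strip_think("<think>scratch work</think>The answer is 42."))
```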

Comparison with Previous/Same-tier Models

  • Versus dense models like Qwen3-32B: Next gains lower cost and higher throughput via sparse activation and scales better for long contexts.
  • Versus traditional MoE: the mixed Gated DeltaNet/Gated Attention design improves the balance of speed and precision.
  • Versus trillion-parameter flagships: community reports suggest Instruct approaches 235B-class models on some instruction benchmarks, while Thinking makes large gains in reasoning; overall, Next emphasizes speed and stability for production.