Qwen-Image-Edit: Overview and ComfyUI Guide

Introduction

Qwen-Image-Edit is an image editing model from Alibaba Cloud's Qwen team, built on top of Qwen-Image. It carries Qwen-Image's strong text rendering and visual understanding into editing scenarios, enabling precise bilingual (Chinese and English) text edits while preserving overall semantic and visual consistency when content or appearance is modified.

Model and Training Overview

  • Further trained on the 20B-parameter Qwen-Image foundation model to extend it to image editing tasks.
  • Reinforces Qwen-Image's controllable generation for text regions, making in-image text modifications more stable and faithful to the original design.

Dual-Control Mechanism: Semantics + Appearance

To support complex edits, the input image is fed into two core components during inference:

  • Qwen2.5-VL: Provides visual semantic control to preserve high-level consistency for subjects and scenes.
  • VAE encoder: Provides visual appearance control to preserve low-level consistency for local regions and styles.
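
The sketch below illustrates this dual-path conditioning in a minimal, conceptual way; the names (EditConditioning, build_conditioning, the encoder callables) are placeholders for illustration, not the model's actual internal API.

```python
# Conceptual sketch of the dual-control idea (placeholder names, not the
# real internals): the same input image yields two complementary signals
# that jointly condition the editing process.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EditConditioning:
    semantic_features: Any    # high-level features from Qwen2.5-VL (subject, scene)
    appearance_latents: Any   # low-level latents from the VAE encoder (texture, layout)


def build_conditioning(
    image: Any,
    encode_semantics: Callable[[Any], Any],   # e.g. a Qwen2.5-VL vision encoder
    encode_appearance: Callable[[Any], Any],  # e.g. the VAE encoder
) -> EditConditioning:
    """Derive both control signals from a single input image."""
    return EditConditioning(
        semantic_features=encode_semantics(image),    # keeps identity and meaning coherent
        appearance_latents=encode_appearance(image),  # keeps untouched regions pixel-faithful
    )
```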

Semantic Editing (High-Level Vision)

  • Definition: Maintain original visual semantics while allowing large pixel-level changes; the main subject and meaning remain coherent.
  • Example use cases:
    • IP creation and novel-view synthesis: Rotate a mascot or object by 90°/180° while keeping identity consistent.
    • Style transfer: Convert portraits to different art styles (e.g., Studio Ghibli) for avatars and brand extensions.

Appearance Editing (Low-Level Vision)

  • Definition: Keep specified regions unchanged while adding/removing/modifying local elements in a controllable way.
  • Example use cases:
    • Add/Remove/Modify elements: e.g., add a signboard and its reflection.
    • Detail removal: Clean stray hairs or unnecessary small objects.
    • Targeted element modification: Precisely change the color/style of a specific letter or shape.
    • Background/Outfit adjustments: Suitable for background replacement and outfit changes in portraits.

Precise Text Editing (Chinese & English)

  • Add, delete, and modify text directly in images while preserving font, size, and style.
  • Works well for Chinese posters, fine-print corrections, and complex layouts.
  • Supports chain-of-edits: iteratively fix typos or annotations step by step (e.g., line-by-line corrections for calligraphy).
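
Below is a minimal sketch of the chain-of-edits pattern. The edit_image helper is hypothetical, standing in for whatever backend actually runs the edit (e.g., the diffusers pipeline in the Quick Start below or a ComfyUI workflow), and the correction prompts are purely illustrative.

```python
from typing import Callable

from PIL import Image


def chain_edits(
    image: Image.Image,
    corrections: list[str],
    edit_image: Callable[[Image.Image, str], Image.Image],  # hypothetical backend wrapper
) -> Image.Image:
    """Apply corrections one at a time: each step starts from the previous
    result, so regions not mentioned in the prompt stay consistent."""
    for step, instruction in enumerate(corrections, start=1):
        image = edit_image(image, instruction)
        image.save(f"edit_step_{step:02d}.png")  # keep intermediates for review/rollback
    return image


# Illustrative correction prompts for fixing a poster line by line:
corrections = [
    'In the headline, replace "Grand Openning" with "Grand Opening", keeping the font',
    "In the second line, change the date to September 1 without altering the layout",
]
```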

Performance

Across public benchmarks, Qwen-Image-Edit achieves state-of-the-art results on image editing tasks, with particular strength in the stability and consistency of text edits.

Quick Start

Try with Diffusers

  • Fetch the model weights from Hugging Face or ModelScope.
  • Use diffusers to run inference for both local and global edits, conditioned on a text prompt and an input image, as sketched below.
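
A minimal single-edit sketch might look like the following, assuming a recent diffusers release that ships QwenImageEditPipeline and the Qwen/Qwen-Image-Edit weights on Hugging Face; argument names can differ slightly between versions, and the prompt is only an example.

```python
import torch
from PIL import Image
from diffusers import QwenImageEditPipeline

# Weights are pulled from the Hugging Face Hub on first use.
pipe = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open("input.png").convert("RGB")
prompt = 'Change the signboard text to "OPEN 24 HOURS" while keeping the original font and color'

result = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(0),  # fixed seed for reproducibility
).images[0]
result.save("edited.png")
```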

Use with ComfyUI

In ComfyUI, you can build a workflow by loading the following model files:

  • qwen_image_edit_fp8_e4m3fn.safetensors: diffusion model
  • qwen_2.5_vl_7b_fp8_scaled.safetensors: text encoder / CLIP
  • qwen_image_vae.safetensors: VAE

Place the weights in the appropriate directories and combine common nodes (image loader, mask, prompt, KSampler, VAE decode, etc.) to jointly control semantics and appearance.
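
A common layout for these files is shown below; directory names follow typical ComfyUI conventions and may vary with your ComfyUI version.

```
ComfyUI/
└── models/
    ├── diffusion_models/
    │   └── qwen_image_edit_fp8_e4m3fn.safetensors
    ├── text_encoders/
    │   └── qwen_2.5_vl_7b_fp8_scaled.safetensors
    └── vae/
        └── qwen_image_vae.safetensors
```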

Typical Use Cases

  • Multilingual layout corrections and localization for brand assets.
  • Poster text replacement and fine-grained retouching for e-commerce and marketing.
  • Novel-view expansion and stylistic series creation for IP characters.

FAQ

  • Model fails to load or has no effect? Verify the weight file paths, available GPU memory, and version compatibility.
  • Text style drifts? Increase text-related prompt weight or refine via chain-of-edits.
  • Local edits affect global content? Use masks, lower global strength, and constrain scope with appearance-control nodes.

References & Resources
