A research team inside Alibaba Group’s Institute for Intelligent Computing has released “Qwen-Image,” an image-generation model that focuses on something text-to-image systems still struggle with: placing long, legible passages of text inside the pictures they create.
According to the team’s technical report, the new model renders multi-line English, Chinese and mixed-language copy with markedly higher accuracy than leading commercial services while also offering state-of-the-art performance on standard image-editing tasks. The researchers have open-sourced Qwen-Image on Hugging Face and ModelScope, positioning it as a community alternative to proprietary APIs such as OpenAI’s DALL·E 3 or Google’s Imagen 4.
A benchmark comparison against (1) Seedream 3.0, (2) GPT Image 1, (3) FLUX.1 Kontext, (4) Bagel and (5) FLUX.1 [dev] shows Qwen-Image outperforming all five models across the reported metrics, and the same pattern appears in the text-rendering benchmark.

Why text inside images is a hard problem

Modern diffusion models excel at photorealistic textures and artistic styles, yet they routinely misspell shop signs, book covers or presentation slides. The reason is simple: most public datasets supply captions that describe a scene, not the exact glyphs that appear within it. Without explicit character-level supervision, even large models “learn” text primarily as abstract shapes rather than discrete symbols. That gap becomes more obvious in logographic languages such as Chinese, where thousands of low-frequency characters rarely show up in training data.
Qwen-Image targets the issue head-on. The Alibaba group built a dedicated pipeline that gathers natural images, designs synthetic layouts and balances rarely seen characters so the model sees them often during training. It then schedules the learning process from “easy” primitives (simple shapes, no text) to “hard” ones (paragraph-length, multi-font layouts) in what the authors describe as a curriculum strategy. The end result, they claim, is native text rendering comparable to human typesetting in both alphabetic and logographic scripts.
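To make the balancing idea concrete, here is a minimal sketch of how a frequency-weighted sampler for text-rendering data might look. The actual Qwen-Image data pipeline is not public, so the function names and the weighting rule below are illustrative assumptions, not the team's implementation.

# Hypothetical sketch of frequency-balanced sampling for text-rendering data.
# The real Qwen-Image pipeline is not public; names and weights are illustrative.
from collections import Counter
import random

def char_frequencies(captions):
    """Count how often each character appears across the text corpus."""
    counts = Counter()
    for text in captions:
        counts.update(text)
    return counts

def sample_weight(text, counts, floor=1):
    """Rarer characters get larger weights, so low-frequency glyphs are seen more often."""
    rarest = min(counts.get(ch, floor) for ch in text) if text else floor
    return 1.0 / max(rarest, floor)

captions = ["OPEN 24 HOURS", "咖啡店", "Qwen-Image 发布", "SALE"]
counts = char_frequencies(captions)
weights = [sample_weight(t, counts) for t in captions]

# Oversample captions that contain under-represented characters.
batch = random.choices(captions, weights=weights, k=8)
print(batch)

In a curriculum setup, weights like these could be annealed over training so that simple, common glyphs dominate early steps and rare characters or paragraph-length layouts appear later.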
What Does Qwen-Image Bring to the Table?
1. A VAE that keeps the small print sharp
Qwen-Image relies on a variational auto-encoder (VAE) that compresses a 1,328-pixel image into a 16-channel latent, collapsing each 8×8 block of pixels into a single latent vector. The team fine-tuned only the decoder on a private corpus of documents, posters and synthetic paragraphs, then balanced the pixel-wise reconstruction loss against a "perceptual" loss so that the grid artifacts common in VAEs all but disappear. In their evaluation, the VAE reaches 33.4 dB PSNR on ImageNet and 36.6 dB on a text-heavy set, outperforming larger encoders from Hunyuan and Stable Diffusion 3.5.
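As a rough illustration of that decoder fine-tuning objective, the sketch below mixes a pixel-wise reconstruction term with a perceptual term computed on frozen features. The exact losses, feature network and weighting used for Qwen-Image are not specified in the article, so lambda_perceptual and feature_extractor are placeholders.

# Illustrative reconstruction + perceptual loss mix for decoder-only VAE fine-tuning.
# The specific losses and weights used for Qwen-Image are assumptions.
import torch
import torch.nn.functional as F

def decoder_finetune_loss(decoded, target, feature_extractor, lambda_perceptual=0.5):
    # Pixel-wise reconstruction keeps small glyph strokes exactly in place.
    recon = F.l1_loss(decoded, target)
    # A frozen feature network supplies a "perceptual" term that discourages
    # grid artifacts without blurring fine text.
    with torch.no_grad():
        target_feats = feature_extractor(target)
    decoded_feats = feature_extractor(decoded)
    perceptual = F.mse_loss(decoded_feats, target_feats)
    return recon + lambda_perceptual * perceptual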
2. Dual encoders for semantics and pixels
During editing, the system feeds the original image into two distinct encoders. Alibaba's Qwen2.5-VL, a large vision-language model, extracts high-level semantics ("this is a street sign that should read 'Café'"). Simultaneously, the frozen VAE captures low-level color and structure. Both streams condition a Multimodal Diffusion Transformer (MMDiT) backbone. The design lets the model modify only what the user asks for while leaving the rest of the picture untouched.
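The dual-stream idea can be sketched with placeholder modules rather than the real MMDiT backbone: semantic tokens from the vision-language model and latents from the frozen VAE are projected to a shared width and attended to jointly with the noisy latents being denoised. All module names and dimensions below are assumptions for illustration.

# Toy stand-in for an MMDiT-style backbone conditioned on two streams.
import torch
import torch.nn as nn

class DualConditionedDiT(nn.Module):
    def __init__(self, dim=512, vl_dim=3584, vae_channels=16):  # assumed widths
        super().__init__()
        self.semantic_proj = nn.Linear(vl_dim, dim)      # vision-language features -> model width
        self.pixel_proj = nn.Linear(vae_channels, dim)   # frozen-VAE latents -> model width
        self.noise_proj = nn.Linear(vae_channels, dim)   # noisy latents being denoised
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, vae_channels)

    def forward(self, noisy_latents, vl_tokens, vae_latents):
        # Concatenate semantic tokens, source-image tokens and noisy tokens into one
        # sequence; joint attention lets an edit touch only the requested region.
        seq = torch.cat([
            self.semantic_proj(vl_tokens),
            self.pixel_proj(vae_latents),
            self.noise_proj(noisy_latents),
        ], dim=1)
        seq = self.blocks(seq)
        # Only the positions corresponding to the noisy latents are decoded back.
        n = noisy_latents.shape[1]
        return self.out(seq[:, -n:, :])

model = DualConditionedDiT()
pred = model(torch.randn(1, 64, 16), torch.randn(1, 12, 3584), torch.randn(1, 64, 16))
print(pred.shape)  # torch.Size([1, 64, 16])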
3. Multimodal positional encoding
To help the transformer differentiate between thousands of visual patches and dozens of text tokens, the researchers propose Multimodal Scalable RoPE. Here, text tokens are mapped diagonally across the image grid so their position never overlaps with any specific row or column of pixels. The method keeps training stable at multiple resolutions without choosing a “magic row” where text begins.
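One plausible reading of that scheme, in a short sketch: image patches take ordinary (row, column) grid coordinates, while text tokens receive diagonal coordinates that extend past the grid, so a text position never coincides with a fixed image row or column at any resolution. This is a simplification for illustration, not the paper's exact formulation.

# Simplified position-ID layout in the spirit of Multimodal Scalable RoPE.
import torch

def build_position_ids(h_patches, w_patches, num_text_tokens):
    # Image patches: every (row, col) pair on the grid.
    rows = torch.arange(h_patches).repeat_interleave(w_patches)
    cols = torch.arange(w_patches).repeat(h_patches)
    image_pos = torch.stack([rows, cols], dim=-1)                    # (H*W, 2)

    # Text tokens: diagonal coordinates beyond the grid, stable as resolution changes.
    t = torch.arange(num_text_tokens)
    text_pos = torch.stack([h_patches + t, w_patches + t], dim=-1)   # (T, 2)

    return torch.cat([image_pos, text_pos], dim=0)                   # (H*W + T, 2)

print(build_position_ids(4, 4, 3))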
4. Producer–consumer training at scale
Running a 20-billion-parameter backbone required streaming terabytes of filtered data without stalling GPUs. The group split the workload into Ray-like producers (which clean, encode and store samples) and consumers (which only train). A custom HTTP layer passes batches in zero-copy mode while Megatron-LM handles 4-way tensor parallelism on the consumer side.
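The decoupling can be illustrated with a toy in-process producer-consumer queue. The real system replaces the queue with Ray-style workers, a custom zero-copy HTTP transport and Megatron-LM tensor parallelism, none of which this sketch reproduces.

# Toy producer-consumer split for streaming training data.
import queue
import threading
import time

batch_queue = queue.Queue(maxsize=8)  # bounded buffer keeps producers from racing ahead

def producer(worker_id):
    """CPU-side work: filter, caption and encode samples into ready-to-train batches."""
    for step in range(5):
        batch = f"encoded-batch-{worker_id}-{step}"
        batch_queue.put(batch)   # blocks when consumers fall behind
        time.sleep(0.01)         # stands in for decoding / VAE encoding cost

def consumer():
    """GPU-side work: pull prepared batches and run optimizer steps only."""
    for _ in range(10):
        batch = batch_queue.get()
        print("training on", batch)   # forward/backward pass would go here
        batch_queue.task_done()

producers = [threading.Thread(target=producer, args=(i,)) for i in range(2)]
trainer = threading.Thread(target=consumer)
for p in producers:
    p.start()
trainer.start()
for p in producers:
    p.join()
trainer.join()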
How well does it work?
Across public evaluations, Qwen-Image usually matches or beats proprietary APIs. Below is a snapshot of the numbers quoted in the research.
Benchmark | What it measures | Best closed model cited | Qwen-Image
--- | --- | --- | ---
DPG overall score | Prompt adherence (1,000 dense prompts) | Seedream 3.0 – 89.84 | 92.78
GenEval object-attribute score | Compositional accuracy | Seedream 3.0 – 0.84 | 0.91
OneIG-Bench (English) overall | Alignment, reasoning, style, diversity | GPT Image 1 – 0.539 | 0.650
ChineseWord accuracy | Single-character Chinese rendering | Seedream 3.0 – 33% | 58%
AI Arena Elo rating* | Human pairwise votes, 5k prompts | Imagen 4 Ultra Preview 0606 – top | 3rd place, ~30 Elo above GPT Image 1
*AI Arena is an open leaderboard operated by Alibaba; the team reports that every listed model has received more than 10,000 pairwise comparisons.
On image editing, the model reaches the highest overall score on GEdit (7.56 out of 10) in both English and Chinese tasks and leads the nine-task ImgEdit benchmark with an average 4.27 out of 5. Even on novel-view synthesis — a typical 3-D problem — Qwen-Image posts 15.11 PSNR, surpassing specialized tools such as Zero-123 and ImageDream.
How to Install
The first step is to install the latest version of diffusers:
pip install git+https://github.com/huggingface/diffusers
The code below shows how to use Qwen-Image-Edit-2509 to compose two input images according to a single prompt:
import os
import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline

# Load the multi-image editing pipeline in bfloat16 and move it to the GPU.
pipeline = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16)
print("pipeline loaded")
pipeline.to("cuda")
pipeline.set_progress_bar_config(disable=None)

# Two source images that the prompt will combine into one scene.
image1 = Image.open("input1.png")
image2 = Image.open("input2.png")
prompt = "The magician bear is on the left, the alchemist bear is on the right, facing each other in the central park square."

inputs = {
    "image": [image1, image2],
    "prompt": prompt,
    "generator": torch.manual_seed(0),   # fixed seed for reproducible output
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 40,
    "guidance_scale": 1.0,
    "num_images_per_prompt": 1,
}

with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save("output_image_edit_plus.png")
    print("image saved at", os.path.abspath("output_image_edit_plus.png"))
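For plain text-to-image generation, the base Qwen-Image checkpoint is loaded the same way. The snippet below follows the pattern from the published model card; argument names such as true_cfg_scale should be verified against the diffusers version you have installed.

# Text-to-image with the base Qwen-Image checkpoint (pattern from the model card).
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# A prompt that asks for legible in-image text, the model's headline capability.
prompt = 'A coffee shop entrance with a chalkboard sign reading "Qwen Coffee, $2 per cup".'

image = pipe(
    prompt=prompt,
    negative_prompt=" ",
    width=1328,
    height=1328,
    num_inference_steps=50,
    true_cfg_scale=4.0,
    generator=torch.manual_seed(0),
).images[0]
image.save("qwen_image_t2i.png")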
Why it matters
If the numbers hold up outside controlled tests, Qwen-Image pushes open-source generation into territory once reserved for private APIs. Brands that need posters in English and Chinese, UI designers who rely on Latin and logographic fonts, or chatbots that must generate presentation slides could all benefit from a system that treats text as first-class image content rather than a decorative afterthought.
The report also hints at a broader ambition. By equipping a vision-language model with strong generative skills, Alibaba envisions “vision-language user interfaces” where the assistant not only describes a scene but illustrates concepts on demand, rendering tables, flow charts or signage that the user can immediately edit. The same backbone already performs depth estimation and segmentation as by-products of its editing objective, an early sign that one model can unify perception and creation.
When compared to other models, Qwen-Image outperforms most of them. This is also visible in the image-generation results for the same prompt, analyzed using ChatGPT below.

Availability and next steps
The research team has released checkpoints, inference code and evaluation scripts under the Apache-2.0 license. Pre-trained weights, fine-tuned editing versions, and the enhanced VAE are hosted on both Hugging Face and ModelScope. Developers can try a web demo or pull the model into existing diffusion pipelines via an open-source adapter.
Looking forward, the authors say the same data pipeline can be extended to video, and the shared encoder already supports temporal inputs. They also point to the need for better safety filters before large-scale public deployment — a familiar caveat as generative systems become more accessible.
Read the full paper on arXiv.