Frontier Models

Qwen-AgentWorld-397B-A17B Leads AgentWorldBench at 58.71

The closed mixture-of-experts model edges GPT-5.4 while an Apache 2.0 35B variant ships with weights, data and a new open benchmark built from real agent trajectories.

By Marcus Vance | June 27, 2026 | 4 min read

[Illustration: engineers inspecting server racks in an AI research laboratory. Credit: AI Intel Report]

Qwen-AgentWorld is a family of language world models that simulate agent environments across seven domains via long chain-of-thought reasoning.

Qwen released two models under the Qwen-AgentWorld name. The larger closed variant uses a mixture-of-experts design.

The smaller variant carries an Apache 2.0 license and includes both weights and training data.

What background context surrounds the Qwen-AgentWorld release?

Agent systems require models that can predict outcomes of sequences of actions inside simulated environments. Earlier language models often lacked sufficient training on such interaction data.

Qwen drew from its existing large language model lineage to create specialized world models. The effort targets seven distinct domains that cover common agent use cases.

The domains are MCP, Search, Terminal, SWE, Web, OS and Android. Each domain supplies trajectories that reflect real agent behavior.
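To make the idea of a domain-tagged trajectory concrete, here is a minimal Python sketch. Only the seven domain names come from the announcement. The `Step` and `Trajectory` types, the `build_world_model_prompt` helper and the prompt layout are hypothetical illustrations, not the format Qwen uses.

```python
from dataclasses import dataclass, field

# The seven domains named in the announcement.
DOMAINS = ("MCP", "Search", "Terminal", "SWE", "Web", "OS", "Android")


@dataclass
class Step:
    """One agent action and the environment's response (hypothetical schema)."""
    action: str
    observation: str


@dataclass
class Trajectory:
    """A recorded agent session in a single domain (hypothetical schema)."""
    domain: str
    steps: list[Step] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject anything outside the seven announced domains.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


def build_world_model_prompt(traj: Trajectory, next_action: str) -> str:
    """Format the history plus a new action as text.

    A world model would be asked to continue this text with the
    observation that follows. The layout is an assumption.
    """
    lines = [f"Domain: {traj.domain}"]
    for i, step in enumerate(traj.steps, start=1):
        lines.append(f"Action {i}: {step.action}")
        lines.append(f"Observation {i}: {step.observation}")
    n = len(traj.steps) + 1
    lines.append(f"Action {n}: {next_action}")
    lines.append(f"Observation {n}:")
    return "\n".join(lines)
```

In this sketch, the world model takes the place of the real environment: it receives the history and the pending action and is expected to write the next observation.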

What new elements appear in the Qwen-AgentWorld announcement?

The announcement introduces both a 397 billion parameter closed model and a 35 billion parameter open model. The closed model is Qwen-AgentWorld-397B-A17B with 17 billion active parameters.

The open model is Qwen-AgentWorld-35B-A3B with 3 billion active parameters and a 256K context window. Weights for the open model appear on Hugging Face and ModelScope.

A new benchmark named AgentWorldBench accompanies the models. The benchmark aggregates real trajectories collected from five frontier models across nine prior agent benchmarks.

What technical details define the training and architecture?

Training proceeds through a three-stage pipeline. The stages are continual pre-training, supervised fine-tuning and reinforcement learning.

More than 10 million environment interaction trajectories supply the training signal. These trajectories span the seven listed domains.

The mixture-of-experts design activates only a fraction of total parameters during inference. This structure supports the large total parameter counts while controlling compute.
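As a rough illustration of how small that fraction is, the sketch below divides the active parameter count by the total for both models. It uses the nominal figures from the model names (397B/17B and 35B/3B), so the exact ratios may differ slightly from the true parameter counts.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters that are active, given counts in billions."""
    return active_b / total_b


# Nominal (total, active) parameter counts in billions, taken from the model names.
MODELS = {
    "Qwen-AgentWorld-397B-A17B": (397, 17),
    "Qwen-AgentWorld-35B-A3B": (35, 3),
}

for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active")

# Qwen-AgentWorld-397B-A17B: 4.3% of parameters active
# Qwen-AgentWorld-35B-A3B: 8.6% of parameters active
```

All of the weights still have to be stored and served, so the saving applies to compute rather than to memory footprint.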

Continual Pre-Training (CPT)
Supervised Fine-Tuning (SFT)
Reinforcement Learning (RL)

How do the models compare on AgentWorldBench?

AgentWorldBench measures simulation quality for agent environments. Higher scores indicate closer alignment with observed trajectories from frontier models.

Qwen-AgentWorld-397B-A17B reaches 58.71. GPT-5.4 reaches 58.25. Claude Opus 4.8 reaches 56.59. Gemini 3.1 Pro reaches 54.57.

The 35B variant records an 8.66-point gain over its base model, Qwen3.5-35B-A3B, on the same benchmark.

AgentWorldBench scores for Qwen-AgentWorld-397B-A17B and competing models
Model	AgentWorldBench Score
Qwen-AgentWorld-397B-A17B	58.71
GPT-5.4	58.25
Claude Opus 4.8	56.59
Gemini 3.1 Pro	54.57
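
The gaps between these scores are easy to compute. The short Python sketch below takes the four reported scores and prints the leader's margin over each competitor; the `margins_over` helper is a name chosen here for illustration.

```python
# AgentWorldBench scores as reported in the table above.
SCORES = {
    "Qwen-AgentWorld-397B-A17B": 58.71,
    "GPT-5.4": 58.25,
    "Claude Opus 4.8": 56.59,
    "Gemini 3.1 Pro": 54.57,
}


def margins_over(scores: dict[str, float], leader: str) -> dict[str, float]:
    """Return the leader's lead over every other model, rounded to 2 places."""
    lead = scores[leader]
    return {m: round(lead - s, 2) for m, s in scores.items() if m != leader}


leader = max(SCORES, key=SCORES.get)
margins = margins_over(SCORES, leader)
for model, gap in margins.items():
    print(f"{leader} leads {model} by {gap:.2f} points")

# Qwen-AgentWorld-397B-A17B leads GPT-5.4 by 0.46 points
# Qwen-AgentWorld-397B-A17B leads Claude Opus 4.8 by 2.12 points
# Qwen-AgentWorld-397B-A17B leads Gemini 3.1 Pro by 4.14 points
```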

What release terms apply to the open variant?

Qwen-AgentWorld-35B-A3B carries the Apache 2.0 license. The license permits commercial use and modification.

The model and the benchmark are also hosted on GitHub under the QwenLM organization. All open-weight artifacts use the same Apache 2.0 terms.

The 397B model remains closed. Only the 35B model provides public weights.

What market and stakeholder implications follow from the release?

The open 35B model supplies developers with a ready starting point for agent simulation research. Organizations can fine-tune, modify and commercially deploy the weights under the Apache 2.0 terms.

The closed 397B model preserves a performance margin for Qwen in internal evaluations. This dual strategy balances openness with competitive differentiation.

Agent framework builders gain access to a model trained specifically on interaction data rather than generic text. This specialization may reduce the need for custom environment simulators in some workflows.

What expert reactions have accompanied the models?

The release demonstrates that language models can serve as world simulators when trained on sufficient trajectory data. The narrow 0.46-point margin over GPT-5.4 shows how closely matched the leading models are in this category.

The decision to open the 35B variant while keeping the larger model closed reflects standard practices in frontier model releases. The accompanying benchmark release supports reproducible evaluation.

"Today we release Qwen-AgentWorld, a native language world model that simulates agent environments across seven domains."
QwenTeam, Qwen research team

What developments are likely next?

Additional trajectory data may further close the gap between open and closed variants. Community contributions to AgentWorldBench could expand domain coverage.

Integration of the open weights into existing agent orchestration libraries is expected. Such integration would test the models in live environments beyond the benchmark.

Future iterations may increase active parameter counts or context length while retaining the mixture-of-experts structure. The current 256K window already supports extended agent sessions.

Frequently asked

What is the score of the leading model on the benchmark?

The Qwen-AgentWorld-397B-A17B model scores 58.71 on AgentWorldBench. That is the highest score among the compared models, ahead of GPT-5.4 at 58.25.

Is the 397B model open sourced?

No. Only the 35B-A3B variant is open sourced under Apache 2.0. The larger model remains closed.

Where can the open model be accessed?

The Qwen-AgentWorld-35B-A3B model is available on Hugging Face and ModelScope. It comes with weights and the Apache 2.0 license.