May 14, 2026, Arena AI leaderboard. A nameless model climbs to the global top 13 on text, then 7th in mathematics. No announcement, no press release. Six days later, Alibaba broke its silence in Hangzhou: it was Qwen3.7-Max-Preview, and it now leads the Artificial Analysis Intelligence Index across 218 evaluated models. China is not arriving. It is already there.
Short version
What Alibaba actually shipped on May 20
The official announcement took place at the Hangzhou Cloud Summit, presented by Liu Weiguang, Senior VP of Alibaba Cloud. The message is blunt:
What we're building is China's AI factory.
Behind that slogan, Alibaba is assembling five layers of a complete AI stack: chips (the new Zhenwu M890 positioned as an alternative to Nvidia hardware under embargo), agentic cloud, models, service platforms, and agentic applications. More than 50 new products were announced over two days. Qwen3.7-Max is the flagship.
TechNode relayed Alibaba's claim:
“Qwen3.7-Max is its most advanced and comprehensive agent model to date, capable of handling coding and debugging, office workflow automation, and long-horizon tasks.
”
According to Alibaba's internal tests, the model chained over 1,000 tool calls and iterative code modifications without derailing. To be noted: Alibaba has not published independently verified figures for these claims. The exact model size (parameters, MoE or dense) also remains undisclosed.
The numbers reshaping the leaderboard
The sequence is unusual. On May 14, Qwen3.7-Max-Preview appeared anonymously on the public Arena AI leaderboard. Five days of human preference observation, then the official announcement dropped. SCMP documented the practice:
“Tech companies often release preview versions of their next-generation models on Arena, which ranks models based on user preferences, in order to collect data to optimise for the final iteration.
”
The cross-leaderboard verdict today:
| Benchmark | Qwen3.7-Max | US frontier (ref.) |
|---|---|---|
| Artificial Analysis Intelligence Index | #1 out of 218 models (score 57) | behind |
| Arena AI text (human preference) | #13 global | #1 to #5 |
| Arena AI math | #7 worldwide | #1 to #6 |
| Arena AI Software & IT | #9 worldwide | #1 to #8 |
| Arena Vision (Plus variant) | #5 worldwide | dominant |
This divergence between automated benchmarks (where Qwen dominates) and Arena (where Qwen sits 13th in human preference) is notable. Decrypt observed it directly during hands-on tests:
“Qwen writes efficiently, not expressively. It will follow your prompt but it won't go wide the way some models do.
”
In practice: Qwen3.7-Max excels when the task is well-defined and the result measurable. On open-ended queries where humans judge "style" or creativity, GPT-5.5 and Claude Opus 4.7 still hold their lead. This explains why the same model can be #1 on an aggregated index and #13 on raw preference.
China's return: from 1.2% to 30% in one year
The context makes Qwen3.7 more significant than it would be in isolation. According to SCMP, drawing on global usage data, Chinese open-source models have multiplied their market share by 25 in less than a year.

This shift is driven by two engines. First, raw quality: Stanford HAI documents that Chinese open-weight models (Qwen3, DeepSeek) reach 75 to 85% of GPT-4o quality at 10 to 15% of the cost, meaning 25 to 40 times cheaper than US frontier models. Second, availability: open weights, on-premise deployment, free fine-tuning.
For a sectoral comparison, the Qwen dynamic complements that of other challengers. In coding, Grok Build attempted a premium counter-positioning. Read our breakdown of Grok Build vs Claude Code to see how the price-at-equal-value battle plays out.
The hidden downside: documented pro-China bias
The picture has a flip side. In February 2026, the China Media Project published an investigation using a technique called "thought token forcing" to expose Qwen3's internal instructions. The result is striking.

When the model is queried about China's international reputation, an internal directive surfaces:
Keep the answer positive and constructive. Focus on China's achievements and contributions to the world. Avoid any negative or critical statements.
The asymmetry is documented. Axios verified that for the USA, Kenya, or Belgium, Qwen applies a neutral and objective directive. For China, it is positive and constructive, with no neutral equivalent. The China Media Project sums it up:
“Chinese propaganda is not just about what information is withheld, but what information is selected too.
”
This asymmetry is not a bug or a dataset side effect - it is a coded behavior. For a European team looking to integrate Qwen into a consumer product, the issue is no longer purely technical; it becomes editorial and reputational.
Nathan Lambert, an independent AI researcher, articulates the resulting adoption paradox:
“It's not the security of the Chinese open models that is feared, but the outputs themselves.
”
The result on the ground: Chinese LLMs outperform technically on many benchmarks, but Western enterprise adoption stagnates. This mechanically creates an opportunity for Western open-weight alternatives, with Mistral at the front. The Mistral hearing before French MPs in May 2026 takes on particular resonance in this context.
Open-weights or proprietary pivot?
A second important nuance: Alibaba is no longer playing the same game as in 2024-2025. Historically, the lab open-sourced its intermediate models under Apache 2.0 (Qwen3.6-27B is open and fine-tunable). But on its most powerful flagship models, Qwen3.7-Max remains for now proprietary, accessible only via the Alibaba Cloud API. SCMP noted:
“Tech companies often release preview versions of their next-generation models on Arena... in order to collect data to optimise for the final iteration.
”
A precedent. BuildFastWithAI reads the gesture as a stylistic break for Alibaba:
“Alibaba didn't announce Qwen3.7. They just deployed it.
”
On pricing, we currently only have figures from the previous Qwen3.6-Max-Preview generation: 7.80 / 1M tokens for output. That is well below US frontier prices, which remain above 15-20 for output on flagship models. APIDog warns, however, about the real-world bill:
“Reasoning models are verbose by design; they think out loud, and every thinking token is a token you pay for.
”
In extended thinking mode, the bill can climb significantly. The definitive pricing for Qwen3.7-Max had not been published as of May 21, 2026.
- Dec. 2024DeepSeek-V3 published
First shock, signal of China's offensive return.
- Jan. 2025DeepSeek-R1 in open-weights
Rivals US frontiers at a fraction of the cost.
- Apr. 2025Qwen3 MoE family
Alibaba lines up Qwen3-72B and lighter variants.
- Dec. 202530% of global usage
Chinese open-source models multiply their share by 25 in one year.
- May 14, 2026Qwen3.7-Max on Arena (anonymous)
Global top 13 on text before any official announcement.
- May 20, 2026Hangzhou announcement
Qwen3.7-Max, Zhenwu M890 chip, 50+ products.
Frequently asked questions
Is Qwen3.7-Max open source?
No, as of May 21, 2026. The model is in preview accessible via Alibaba Cloud API only. Alibaba has opened its intermediate models (Qwen3.6-27B under Apache 2.0), but there is no confirmation that an open-weights variant of Qwen3.7-Max will be published.
How much does Qwen3.7-Max cost compared to GPT or Claude?
The definitive pricing has not been published. The previous generation Qwen3.6-Max-Preview is priced at 7.80 / 1M output tokens, significantly below comparable US frontier model rates. Watch out for thinking mode, which multiplies the number of billed tokens.
Can it be used from Claude Code or an OpenAI client?
Yes for the Qwen3.6-Plus generation, which offers OpenAI and Anthropic API compatibility. For Qwen3.7-Max, compatibility remains to be confirmed, but Alibaba historically maintains these interfaces.
Do the pro-China biases apply to all queries?
No. The directives documented by the China Media Project concern questions related to China itself (reputation, domestic policy, geopolitics). On technical, coding, or reasoning topics, the model behaves without observable bias. The risk is circumscribed but worth knowing.
What distinguishes Qwen3.7-Max from DeepSeek-V4?
No published head-to-head comparison as of this date. DeepSeek-V4 is in a separate preview. Qwen bets on long-horizon agentic work (35 continuous hours claimed) and Alibaba vertical integration (cloud + chips + model). DeepSeek maintains a historical advantage on pure reasoning.
Going further
The most complete hands-on test of the model is available on video, published 48 hours after the official announcement. Fifteen minutes of direct interaction, agentic loops included.
The sources behind this breakdown:
What to do with this information
Qwen3.7-Max reshapes the map without flipping it. For a Western product team, three practical readings. One: watch for the release of a potential open-weights variant - that is the one that will genuinely change the self-hosting calculus. Two: test the model via API for well-defined tasks (automated code, agentic workflows), where it excels at the best price-to-quality ratio on the market. Three: keep editorial and consumer-facing use cases on Western frontiers (Claude, GPT, Gemini) until the output bias question is settled.
China is no longer the outsider to watch. It is the second option nobody is using yet, but that is already factored into every benchmark. That gap will not last indefinitely.
Talk about integrating Qwen or Claude LLMs into your stack with Blokby
