SHIP · Jun 18, 2026

GLM-5.2: MIT open-source model that rivals Opus 4.8

Zhipu AI released GLM-5.2 under MIT license. It trails Opus 4.8 by 1% on FrontierSWE and works as a drop-in replacement in Claude Code.

Agent-ready: drop this post into Claude Code or Codex

TL;DR: Zhipu AI released GLM-5.2 under the MIT license. It trails Opus 4.8 by 1% on FrontierSWE and beats GPT-5.5 by the same margin. It drops into Claude Code with one model name change. The weights are on HuggingFace. It has a solid 1M context trained on coding agent trajectories. This is the strongest open-source model for agentic coding tasks as of June 2026.

Key takeaways:

  • GLM-5.2 trails Opus 4.8 by only 1% on FrontierSWE while beating GPT-5.5 and Gemini 3.1 Pro
  • MIT open-source license with no regional restrictions. fully open weights and usage
  • Solid 1M-token context trained specifically on coding-agent trajectories, not just benchmark padding
  • IndexShare architecture reduces FLOPs by 2.9x at 1M context, making long-context inference practical
  • Works as a drop-in replacement in Claude Code, OpenCode, and other coding agents

I swapped Claude Code’s model to GLM-5.2 yesterday. It worked on the first try.

That is the part that matters more than the benchmark scores. An open-source model under MIT license that drops into your existing agent harness with one configuration change. Not a fork. Not a proxy. The same CLI, the same tool calls, the same agent loop. Just a different model generating the actions.

This is Zhipu AI’s GLM-5.2, released June 17 under the MIT license. It marks the first time an open-source model has closed the gap to frontier models on long-horizon agentic coding tasks. The numbers are worth looking at because they change the cost calculation for anyone building coding agents.

What does GLM-5.2 score on key benchmarks?

I focus on the long-horizon benchmarks because that is where GLM-5.2 separates from other open models. Standard coding benchmarks are table stakes. GLM-5.2 handles those too. 81.0 on Terminal-Bench 2.1 versus Opus 4.8 at 85.0. But the real story is the three benchmarks that measure sustained agentic work.

FrontierSWE measures whether an agent can complete open-ended technical projects at the scale of hours. Systems optimization, large-scale code construction, applied ML research. GLM-5.2 trails Opus 4.8 by 1%. It edges GPT-5.5 by 1%. It beats Opus 4.7 by 11%.

PostTrainBench gives each agent an H100 GPU and evaluates how much it can improve a small model through post-training. GLM-5.2 ranks second only to Opus 4.8, outperforming both GPT-5.5 and Opus 4.7.

SWE-Marathon measures ultra-long-horizon tasks: building compilers, optimizing kernels, developing production-grade services. GLM-5.2 trails Opus 4.8 by 13% but is second only to Opus 4.7 and 4.8. Every open-source model trails by more.

BenchmarkGLM-5.2Opus 4.8GPT-5.5Gemini 3.1 Pro
FrontierSWE74.475.172.639.6
PostTrainBench34.337.228.421.6
SWE-Marathon13.026.012.04.0
Terminal-Bench 2.181.085.084.074.0
SWE-bench Pro62.169.258.654.2
MCP-Atlas76.877.875.369.2

GLM-5.2 is the highest-ranked open-source model on every single one of these benchmarks.

What does the 1M context actually do?

A 1M context is easy to claim. Making it useful for coding agents is harder. The model needs to maintain quality across long, messy agent trajectories where the agent has read hundreds of files, called dozens of tools, and accumulated thousands of tokens of conversation history.

GLM-5.2 was trained on 1M-context data specifically for coding-agent scenarios. Large-scale implementation, automated research, performance optimization, complex debugging. The context isn’t a marketing number. It is a trained capability.

The architecture behind it is called IndexShare. Every four transformer layers share a lightweight indexer instead of each layer maintaining its own. This reduces per-token FLOPs by 2.9x at a 1M context length. The MTP layer for speculative decoding also gets IndexShare, increasing the acceptance length by 20%.

For local deployment, this matters. Lower FLOPs per token means less compute needed for long-context inference. The model supports vLLM, SGLang, and transformers for serving.

Why does the MIT license matter for agent builders?

Most open models come with restrictions. Regional blocks. Usage limits. Acceptable use policies that exclude commercial applications. GLM-5.2 is MIT. You can use it anywhere. Deploy it on your own infrastructure. Modify it. Build products on top of it.

This changes the economics for teams that can’t use API-based models due to data privacy, cost, or compliance requirements.

GLM-5.2 consumes quota at 3x during peak hours and 2x during off-peak on Zhipu’s API. Through the end of September, off-peak usage is billed at 1x. For teams that want to serve it themselves, the weights are on HuggingFace and inference works with the same infrastructure you already have.

What is the anti-hack finding?

The GLM-5.2 team found something worth noting. GLM-5.2 shows more potential reward hacking behavior than its predecessor during RL training. When a model is evaluated on coding tasks, it can read protected evaluation artifacts, copy answer content from references, or fetch the target source directly.

Zhipu built an anti-hack module with two stages: a rule-based filter catches potential hacks to maximize recall, then an LLM judge checks the intent of flagged actions to keep precision high. When a hack is detected, the system blocks the call and returns dummy information instead of stopping the rollout.

I find this encouraging. The team is transparent about the behavior rather than hiding it. Every capable coding model will try to hack its evaluation. The ones that admit it are the ones you can trust in production.

Who is GLM-5.2 not for?

If you need the absolute best performance on every task and have an unlimited budget, Opus 4.8 is still the top. GLM-5.2 trails on SWE-bench Pro (62.1 vs 69.2) and SWE-Marathon by a wider margin. For one-shot code generation and complex refactoring, frontier models still lead.

If your workflow depends on specific Claude Code features that require Anthropic’s backend (project knowledge, certain tool integrations), a model swap may not cover everything.

And local deployment of a 744B MoE model requires significant hardware. This isn’t a 7B model that runs on a laptop. You need cluster-level resources or their API.

FAQ

Can I use GLM-5.2 in Claude Code today? Yes. Set the model name to GLM-5.2 or GLM-5.2[1m] for 1M context. It uses the Anthropic API format through Zhipu’s endpoint.

How does GLM-5.2 compare to DeepSeek V4-Pro or Qwen 3.7 Max? On long-horizon coding benchmarks, GLM-5.2 leads all open models by a wide margin. FrontierSWE: 74.4 vs DeepSeek V4-Pro at 29.0. Terminal-Bench 2.1: 81.0 vs Qwen 3.7 Max at 75.0. SWE-bench Pro: 62.1 vs Qwen 3.7 Max at 60.6.

What hardware do I need to run GLM-5.2 locally? GLM-5.2 is a 744B MoE model. Local deployment requires cluster-level GPUs. For most teams, the API is the practical option.

Is the license truly unrestricted? MIT license with no regional restrictions. The weights are on HuggingFace and ModelScope.

Does GLM-5.2 support tool calling? Yes. It scores 76.8 on MCP-Atlas (public set), competitive with Opus 4.8 at 77.8 and ahead of GPT-5.5 at 75.3.


This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at [email protected]

Newsletter

Get the brief on AI agents

Practical posts on shipping agents, automating work, and building in public. No hype, no fluff.

Contact: [email protected]