Hugging Face x IBM Launch Open Agent Leaderboard: Evaluates 'Complete Agent Systems' Rather Than Individual Models; Adds DeepSeek V3.2 and Kimi K2.5
Confidence: High
Key Points: Hugging Face and IBM Research have launched the Open Agent Leaderboard, the industry's first open-source evaluation benchmark that treats 'complete agent systems' as the unit of measurement. Its premise is that 'the same model can produce dramatically different results depending on the agent architecture, tool set, and memory strategy used,' so evaluating the LLM alone is insufficient. The current setup covers 5 models x 5 agents x 6 benchmarks, a grid of up to 150 model-agent-benchmark combinations. Two newly added open-weight models, DeepSeek V3.2 and Kimi K2.5, are competitive in certain combinations but still trail the leading closed-source models by 18-29 percentage points on average. The leaderboard accepts community contributions along three axes: new agents (wrapped using the Exgentic protocol), new benchmarks (with programmatic evaluators), and new models.
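To make the size of the 5 x 5 x 6 setup concrete, here is a minimal Python sketch that enumerates the grid. The source gives only the counts, not the full lists of models, agents, and benchmarks, so every label below is a placeholder, and the sketch assumes every combination is evaluated.

```python
from itertools import product

# Placeholder labels: the brief states only the counts (5 models, 5 agents,
# 6 benchmarks), so these names are illustrative, not the leaderboard's own.
models = [f"model_{i}" for i in range(1, 6)]
agents = [f"agent_{i}" for i in range(1, 6)]
benchmarks = [f"benchmark_{i}" for i in range(1, 7)]

# A "complete agent system" is a (model, agent) pair.
systems = list(product(models, agents))

# Each system is scored on every benchmark, giving one cell per triple.
cells = list(product(models, agents, benchmarks))

print(len(systems))  # 25
print(len(cells))    # 150
```

The point of the pairing is that the unit being ranked is the (model, agent) system, so a single model appears in five different rows rather than one.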
Impact: For AI evaluation methodology: moving the focus from 'LLM benchmarks' to 'holistic agent system benchmarks' is a key change in practice for the second half of 2026. For the open-source community: DeepSeek V3.2 and Kimi K2.5 gain visibility, potentially accelerating the maturity of open-source agent stacks. For enterprise AI procurement: future RFPs should require 'agent system benchmarks' rather than just 'model scores.'
Detailed Analysis
Trade-offs
Pros:
- First-ever evaluation at the agent 'system' level, which reflects production performance more closely than model-only scores
- Open-weight models (DeepSeek, Kimi) gain evaluation visibility
- Three-axis contribution mechanism is clear: separate submission flows for agents, benchmarks, and models
- The Exgentic protocol standardizes agent wrapping and can scale to more agent frameworks in the future
Cons:
- The 5x5x6 matrix is still relatively small and cannot cover all practical agent designs
- Open-weight models trail leading closed-source ones by 18-29 pp on average, so the gap remains significant
- Exgentic is a new standard; mainstream agent frameworks (LangChain, CrewAI, AutoGen) do not yet support it natively
- Results may be biased by benchmark task selection; further diversification is needed
Quick Start (5-15 minutes)
- Visit huggingface.co/blog/ibm-research/open-agent-leaderboard to read the methodology
- Open the leaderboard on Hugging Face Spaces and compare the 5 models on the benchmarks that matter to you
- If you develop an agent framework, read the Exgentic protocol spec and try wrapping your agent
- Add 'agent system benchmarks' to your next round of model selection evaluations
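The comparison step above can be sketched in Python as a weighted ranking of (model, agent) systems over only the benchmarks you care about. This is a hypothetical helper, not part of the leaderboard or the Exgentic protocol, and all names and scores below are made-up illustrative values, not leaderboard results.

```python
# A system is a (model, agent) pair; scores are percentages per benchmark.
System = tuple[str, str]


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of the scores over the benchmarks listed in `weights`."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def rank_systems(
    results: dict[System, dict[str, float]], weights: dict[str, float]
) -> list[tuple[System, float]]:
    """Return systems sorted from best to worst weighted score."""
    ranked = [(system, weighted_score(s, weights)) for system, s in results.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Illustrative numbers only (NOT taken from the leaderboard).
results = {
    ("closed_model", "agent_a"): {"bench_x": 70.0, "bench_y": 60.0},
    ("open_model", "agent_a"): {"bench_x": 50.0, "bench_y": 40.0},
    ("open_model", "agent_b"): {"bench_x": 62.0, "bench_y": 44.0},
}
# bench_x matters twice as much as bench_y for this hypothetical use case.
weights = {"bench_x": 2.0, "bench_y": 1.0}

ranking = rank_systems(results, weights)
for (model, agent), score in ranking:
    print(f"{model} + {agent}: {score:.1f}")

# Gap in percentage points between the top system and the runner-up.
gap_pp = ranking[0][1] - ranking[1][1]
```

In this toy data the same open model moves from last to second place when paired with a different agent, which is the effect the leaderboard is designed to expose. Gaps between percentage scores are differences in percentage points, not relative percentages.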
Recommendation
AI platform and ML leaders should immediately incorporate the Open Agent Leaderboard into procurement and selection processes. Open-source community developers can contribute new agents or benchmarks to expand the leaderboard's breadth. Researchers can treat the 18-29 pp gap between open-weight and closed-source models as a research target and investigate how better agent harnesses can close it.
Sources: Hugging Face - The Open Agent Leaderboard (Official) | Hugging Face Spaces - Open Agent Leaderboard (Official)