
Corporate Collaboration Benchmark

The Corporate Collaboration Benchmark (corp-collab-bench) evaluates how effectively state-of-the-art code-generation language models coordinate with colleagues in shared codebases. It measures an agent's ability to align in real time with other networked agents, work in parallel on intersecting tasks, and avoid conflicting code changes across 2 challenging contribution scenarios spanning 3 repositories (Feb 2026 v1 testing harness).

The benchmark assesses successful Strands MCP tool usage and observable collaboration behavior, as well as the ultimate success or failure of each attempt.
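
For illustration only, a per-attempt record could capture these dimensions along the following lines; the field names are hypothetical and do not reflect the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class AttemptResult:
    """Hypothetical per-attempt record (illustrative sketch, not the real schema)."""
    task_id: str
    used_strands_mcp: bool        # successful Strands MCP tool usage
    coordinated_with_peers: bool  # observable collaboration behavior
    merge_conflict_free: bool     # no conflicting changes in the shared codebase
    task_completed: bool          # ultimate success or failure of the attempt
```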

Initially, this benchmark focuses on leading agentic coding systems, including OpenAI Codex and Claude Code, with the intention to generalize to future models and harnesses such as Cursor.

Evaluation Metric

The score is the proportion of tasks completed successfully in a single attempt, where use of the Strands MCP enables effective deconfliction and yields a correct, merge-conflict-free outcome.
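
As a minimal sketch, assuming each task is simply marked pass or fail after a single attempt (six tasks would match the percentages in the leaderboard below), the score can be computed as:

```python
def corp_collab_score(task_passed: list[bool]) -> float:
    """Proportion of tasks completed in a single attempt, as a percentage."""
    if not task_passed:
        return 0.0
    return 100.0 * sum(task_passed) / len(task_passed)

# Example: passing 4 of 6 tasks gives 66.67%, the top score in the table below.
print(f"{corp_collab_score([True, True, True, True, False, False]):.2f}%")
```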

Note: Currently, no tested harness is capable of live coordination across colleagues' agents without Strands. Without an active coordination layer, agents working in parallel on shared codebases consistently produce conflicting changes, resulting in a 0.00% score across all models and harnesses evaluated.

Ranked Models

Larger Test Set Coming Soon - Last Updated Feb 7, 2026
#  Model                      Harness      With Strands
1  Claude-4.5-Opus            Claude Code  66.67%
2  GPT-5.2-codex (high)       Codex        66.67%
3  GPT-5.1-codex (high)       Codex        50.00%
4  Claude-4.5-Sonnet          Claude Code  50.00%
5  GPT-5.1-codex-mini (high)  Codex        33.33%
6  Claude-Haiku-4-5           Claude Code  16.67%

Notable Models Not Yet Tested

These models will be added to future evaluations.

  • GPT-5.3-Codex
  • Claude-Opus-4.6
  • Cursor-Composer-1
  • GPT-5-Pro
  • Gemini-3-Pro-Preview
  • Gemini-3-Flash-Preview