Claude Skill

suyoumo/ClawProBench

ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.

Overview

Stars: 719
Forks: 51
Language: Python
Last pushed: 2026-06-08
Last synced: 2026-06-17

Repository

Owner: suyoumo
Repository: ClawProBench
Full name: suyoumo/ClawProBench
Repo ID: 941,429,098

Install this Skill

pip install uv

Registry

Type: openclaw_skill
Quality score: 80/100
Verification: readme_parsed
Last verified: 2026-06-07
Platforms: Claude, OpenClaw, Codex
Capabilities: pdf, memory, search, image, terminal, agent, benchmark, evaluation, harness, leaderboard
Detected files: README.md, README.zh-CN.md, docs, requirements.txt, tests

Summary

ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime, featuring deterministic grading and repeated-trial reliability.

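Chinese description (translated)

ClawProBench is a benchmark framework that prioritizes live execution. It evaluates LLM agents in the OpenClaw runtime environment and provides deterministic scoring and repeated-trial reliability.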

Key features

  • Live-first benchmark harness for LLM agents
  • Runs inside the OpenClaw runtime environment
  • Deterministic grading for consistent evaluation
  • Repeated-trial reliability measurement
  • Supports leaderboard-style comparisons
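One way to picture repeated-trial reliability is to run each scenario several times and summarise how consistently it passes. The sketch below is an illustrative assumption, not ClawProBench's actual scoring code: the `TrialResult` type, the scenario ids, and the two summary metrics (mean per-scenario pass rate, and the fraction of scenarios that pass on every trial) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    """One graded run of one scenario (hypothetical structure)."""
    scenario_id: str
    passed: bool


def summarize_reliability(results):
    """Return (mean_pass_rate, all_trials_pass_rate) across scenarios."""
    # Group pass/fail outcomes by scenario so repeated trials stay together.
    by_scenario = defaultdict(list)
    for result in results:
        by_scenario[result.scenario_id].append(result.passed)
    if not by_scenario:
        raise ValueError("no trial results supplied")

    # Per-scenario pass rate, then the unweighted mean over scenarios.
    pass_rates = [sum(v) / len(v) for v in by_scenario.values()]
    mean_pass_rate = sum(pass_rates) / len(pass_rates)

    # Stricter view: a scenario counts only if every trial passed.
    all_pass = sum(1 for v in by_scenario.values() if all(v)) / len(by_scenario)
    return mean_pass_rate, all_pass


# Hypothetical data: scenario_a passes twice, scenario_b passes once in two trials.
results = [
    TrialResult("scenario_a", True), TrialResult("scenario_a", True),
    TrialResult("scenario_b", True), TrialResult("scenario_b", False),
]
mean_rate, strict_rate = summarize_reliability(results)
print(mean_rate, strict_rate)
```

Reporting both numbers separates an agent that is usually right from one that is right every time, which is the distinction a repeated-trial design is meant to expose.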

Use cases

  • Evaluating LLM agent performance in live environments
  • Comparing agent reliability across repeated trials
  • Benchmarking agents for OpenClaw-based applications
  • Building reproducible agent evaluation pipelines
  • Tracking agent improvements via leaderboard metrics

README excerpt

<div align="center">
<img src="docs/assets/openclawprobench-logo.svg" width="160" alt="ClawProBench Logo">

# ClawProBench

[![Active Scenarios](https://img.shields.io/badge/active-102-blue)](#benchmark-profiles)
[![Catalog](https://img.shields.io/badge/catalog-162-green)](#benchmark-profiles)
[![Core Profile](https://img.shields.io/badge/core-26-orange)](#benchmark-profiles)
[![Execution](https://img.shields.io/badge/execution-live--first-black)](#quick-start)
[![License](https://img.shields.io/badge/license-Apache%202.0-red)](LICENSE)

> Transparent live-first benchmark harness for evaluating model capability inside the OpenClaw runtime. <br>
> 102 active scenarios, 162 catalog scenarios, deterministic grading, and OpenClaw-native coverage.

</div>

<p>
<a href="README.zh-CN.md"><strong>简体中文 README</strong></a>
</p>

ClawProBench focuses on real OpenClaw execution with deterministic grading, structured reports, and benchmark-profile selection. The default ranking path is the `core` profile; broader active coverage remains available through `intelligence`, `coverage`, `native`, and `full`.

The current worktree inventory reports `102` active scenarios and `162` total catalog scenarios (`60` incubating) via `python3 run.py inventory --json` and `python3 run.py inventory --benchmark-status all --json`.

## Leaderboard

Browse the public leaderboard and benchmark cases at **[suyoumo.github.io/bench](https://suyoumo.github.io/bench/)**.

[![ClawProBench leaderboard preview](docs/assets/leaderboard-preview-20260426.png)](https://suyoumo.github.io/bench/)

[![ClawProBench ModelPK preview](docs/assets/modelpk.png)](https://suyoumo.github.io/bench/modelpk/)

## Get Involved

We sincerely thank friends from Kimi and Qwen for their helpful feedback and improvement suggestions
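The inventory figures quoted in the README excerpt can be cross-checked with a few lines of Python. This is a minimal sketch: the two command strings and the three counts are copied from the excerpt, while the helper names `to_argv` and `counts_consistent` are hypothetical and not part of ClawProBench. The excerpt does not show the JSON schema the commands emit, so the sketch does not try to parse their output.

```python
import shlex

# Commands quoted verbatim in the README excerpt.
ACTIVE_CMD = "python3 run.py inventory --json"
ALL_CMD = "python3 run.py inventory --benchmark-status all --json"

# Scenario counts quoted in the README excerpt.
ACTIVE, CATALOG, INCUBATING = 102, 162, 60


def to_argv(command):
    """Split a shell-style command string into an argument list."""
    return shlex.split(command)


def counts_consistent(active, incubating, catalog):
    """Check that active plus incubating scenarios equals the catalog total."""
    return active + incubating == catalog


print(to_argv(ALL_CMD))
print(counts_consistent(ACTIVE, INCUBATING, CATALOG))
```

From a checkout of the repository, the lists returned by `to_argv` could be passed to `subprocess.run(..., capture_output=True, text=True)` to capture the real inventory output.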

Data from GitHub. Synced on 2026-06-17