Claude Skill
suyoumo/ClawProBench
ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
Overview
Repository info
Install this Skill
pip install uv
Registry info
Project introduction
ClawProBench is a live-first benchmark harness for evaluating LLM agents in the OpenClaw runtime with deterministic grading and repeated-trial reliability.
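Key points
- Live-first benchmark harness for LLM agents
- Runs inside the OpenClaw runtime
- Deterministic grading keeps evaluation results consistent
- Measures reliability across repeated trials
- Supports leaderboard-style comparison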
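To make "reliability across repeated trials" concrete, here is a minimal sketch of how repeated runs of the same scenario might be summarized. The `TrialResult` record, the `reliability` function, and the scenario names are hypothetical illustrations, not ClawProBench's actual schema or metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    """Outcome of one run of a scenario (hypothetical record, not ClawProBench's schema)."""
    scenario: str
    passed: bool


def reliability(trials: list[TrialResult]) -> dict[str, dict[str, float]]:
    """Summarize repeated trials per scenario.

    pass_rate:  fraction of trials that passed.
    all_passed: 1.0 only if every trial passed, else 0.0 (a strict consistency signal).
    """
    grouped: dict[str, list[bool]] = {}
    for trial in trials:
        grouped.setdefault(trial.scenario, []).append(trial.passed)
    return {
        name: {
            "pass_rate": sum(outcomes) / len(outcomes),
            "all_passed": float(all(outcomes)),
        }
        for name, outcomes in grouped.items()
    }


# Made-up trial outcomes for two illustrative scenarios.
trials = [
    TrialResult("file_edit", True),
    TrialResult("file_edit", True),
    TrialResult("file_edit", False),
    TrialResult("web_lookup", True),
    TrialResult("web_lookup", True),
]
summary = reliability(trials)
print(summary)
```

Reporting both numbers separates an agent that usually succeeds from one that succeeds every time, which a single-run score cannot show.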
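Use cases
- Evaluate LLM agent performance in a live environment
- Compare agent reliability across repeated trials
- Benchmark agents for OpenClaw-based applications
- Build reproducible agent evaluation pipelines
- Track agent improvements through leaderboard metrics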
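A reproducible evaluation pipeline depends on a grader that returns the same result for the same inputs. The sketch below shows one way to get that property, assuming exact-match checks on produced artifacts; the `grade` function, the artifact names, and the fingerprint field are hypothetical and do not describe ClawProBench's own grader.

```python
import hashlib
import json


def grade(expected: dict[str, str], produced: dict[str, str]) -> dict:
    """Score produced artifacts against expected ones using exact-match checks only.

    No randomness, clock, or model-based judge is involved, so identical
    inputs always yield an identical report.
    """
    checks = {
        name: produced.get(name) == content
        for name, content in sorted(expected.items())
    }
    score = sum(checks.values()) / len(checks) if checks else 0.0
    report = {"checks": checks, "score": score}
    # Fingerprint the report so two runs can be compared with a single string.
    report_bytes = json.dumps(report, sort_keys=True).encode("utf-8")
    report["fingerprint"] = hashlib.sha256(report_bytes).hexdigest()
    return report


# Made-up artifacts: one matches, one does not.
expected = {"answer.txt": "42", "notes.md": "done"}
produced = {"answer.txt": "42", "notes.md": "in progress"}

first = grade(expected, produced)
second = grade(expected, produced)
print(first["score"], first["fingerprint"] == second["fingerprint"])
```

Comparing fingerprints across reruns is a cheap way to detect a grader that has accidentally become nondeterministic.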
README summary
<div align="center">
  <img src="docs/assets/openclawprobench-logo.svg" width="160" alt="ClawProBench Logo">

# ClawProBench

> Transparent live-first benchmark harness for evaluating model capability inside the OpenClaw runtime.
> <br>
> 102 active scenarios, 162 catalog scenarios, deterministic grading, and OpenClaw-native coverage.

</div>

<p>
  <a href="README.zh-CN.md"><strong>简体中文 README</strong></a>
</p>

ClawProBench focuses on real OpenClaw execution with deterministic grading, structured reports, and benchmark-profile selection. The default ranking path is the `core` profile; broader active coverage remains available through `intelligence`, `coverage`, `native`, and `full`.

The current worktree inventory reports `102` active scenarios and `162` total catalog scenarios (`60` incubating) via `python3 run.py inventory --json` and `python3 run.py inventory --benchmark-status all --json`.

## Leaderboard

Browse the public leaderboard and benchmark cases at **[suyoumo.github.io/bench](https://suyoumo.github.io/bench/)**.

<https://suyoumo.github.io/bench/modelpk/>

## Get Involved

We sincerely thank friends from Kimi and Qwen for their helpful feedback and improvement suggestions