Claude Skill
InternLM/WildClawBench
WildClawBench is an in-the-wild benchmark for evaluating AI agents in the OpenClaw environment, supporting research on and evaluation of agentic AI.
Overview
Install this Skill
pip install -U "huggingface_hub[cli]"
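The install command above pulls in the `huggingface_hub` package, and the README excerpt further down links to a dataset at `huggingface.co/datasets/internlm/WildClawBench`. As a minimal sketch, assuming the benchmark data is meant to be fetched from that dataset repository, the download could be scripted as follows. The helper names and the `wildclawbench_data` target directory are hypothetical, and the layout of the downloaded files is not described on this page.

```python
from pathlib import Path

# Dataset repository ID, taken from the Hugging Face link in the README excerpt.
DATASET_REPO_ID = "internlm/WildClawBench"


def download_args(local_dir="wildclawbench_data"):
    """Build keyword arguments for huggingface_hub.snapshot_download.

    `local_dir` is a hypothetical target directory chosen for illustration.
    """
    return {
        "repo_id": DATASET_REPO_ID,
        "repo_type": "dataset",  # the link points at a dataset, not a model
        "local_dir": str(Path(local_dir)),
    }


def download_dataset(local_dir="wildclawbench_data"):
    """Download a snapshot of the dataset and return its local path."""
    # Imported lazily so download_args() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_args(local_dir))
```

Calling `download_dataset()` requires network access and returns the local folder containing the snapshot. Whether the project expects this route or its own setup scripts is not stated here, so the repository's README remains the authoritative source.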
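Project description
WildClawBench is an in-the-wild benchmark for evaluating AI agents running in the OpenClaw environment. It provides a realistic and challenging testbed for agentic AI systems.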
An in-the-wild benchmark for AI agents in the OpenClaw Environment.
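Key points
- An in-the-wild benchmark for AI agents
- Built on the OpenClaw environment
- Focused on evaluating agentic AI
- Realistic and challenging test scenarios
Use cases
- Evaluating the performance of AI agents in open environments
- Benchmarking agentic AI models
- Research on agentic evaluation methods
README excerpt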
<h1 align="center">WildClawBench</h1>

<p align="center">
  <img src="assets/lobster_battle.png" alt="WildClawBench Lobster" width="480">
</p>

<div align="center">

[Project page](https://internlm.github.io/WildClawBench/)
<br>
[arXiv](https://arxiv.org/abs/2605.10912)
[Hugging Face Papers](https://huggingface.co/papers/2605.10912)
[Dataset](https://huggingface.co/datasets/internlm/WildClawBench)
[Technical report](https://github.com/InternLM/WildClawBench/blob/main/WildClawBench_report.pdf)

</div>

> **Hard, practical, end-to-end evaluation for AI agents, in the wild.**

---

**WildClawBench** is an agent benchmark that tests what actually matters: can an AI agent do real work, end-to-end, without hand-holding?

We drop agents into a live [OpenClaw](https://github.com/openclaw/openclaw) environment (the same open-source personal AI assistant that real users rely on daily) and throw **60 original tasks** at them: clipping goal highlights from a football match, negotiating meeting times over multi-round emails, hunting down contradictions in search results, writing inference scripts for undocumented codebases, catching privacy leaks before they happen.

Useful things. Hard things.

Hard enough that **the strongest frontier model we tested still tops out around 62% overall** (technical report Main