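What are the main features of InternLM/WildClawBench?

An in-the-wild benchmark for AI agents; built on the OpenClaw environment; focused on agentic AI evaluation; realistic and challenging test scenarios

What are the use cases for InternLM/WildClawBench?

Evaluating how AI agents perform in open environments; benchmarking agentic AI models; research on agentic evaluation methods

What programming language does InternLM/WildClawBench use?

InternLM/WildClawBench is written primarily in Python.

How do I install InternLM/WildClawBench?

Run: openclaw install InternLM/WildClawBench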

Claude Skill

InternLM/WildClawBench

WildClawBench is an in-the-wild benchmark for evaluating AI agents in the OpenClaw environment, supporting research on and evaluation of agentic AI.

Overview

Stars: 462

Forks: 47

Language: Python

Last updated: 2026-06-25

Last synced: 2026-07-03

Repository info

Owner: InternLM

Repository: WildClawBench

Full name: InternLM/WildClawBench

Repo ID: 1,189,335,371

GitHub URL: https://github.com/InternLM/WildClawBench

Install this Skill

pip install -U "huggingface_hub[cli]"
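The README excerpt on this page links a Hugging Face dataset at internlm/WildClawBench, which is presumably why the install step pulls in huggingface_hub. As a minimal sketch, assuming the benchmark data is fetched from that dataset repository, the download could look like the following in Python. The helper names and the local directory data/WildClawBench are hypothetical and do not come from the project.

```python
from pathlib import Path

# Dataset id taken from the Hugging Face badge in the README excerpt.
DATASET_REPO_ID = "internlm/WildClawBench"


def download_kwargs(local_dir="data/WildClawBench"):
    """Build keyword arguments for huggingface_hub.snapshot_download.

    The default local_dir is a hypothetical location, not a project convention.
    """
    return {
        "repo_id": DATASET_REPO_ID,
        "repo_type": "dataset",  # a dataset repo, not a model repo
        "local_dir": str(Path(local_dir)),
    }


def download_dataset(local_dir="data/WildClawBench"):
    """Download a snapshot of the dataset; needs network access."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir))
```

Calling download_dataset() returns the local path of the downloaded snapshot. The project's own README may prescribe a different download procedure, in which case that one should be followed.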

Registry info

Type: openclaw_skill

Quality score: 75/100

Verification status: readme_parsed

Last verified: 2026-06-11

Platforms

Claude, OpenClaw, Codex

Capabilities

browser, pdf, memory, search, image, video, terminal, workflow, agentic-ai, agentic-evaluation

Detected files

README.md, requirements.txt

Configuration keys

OPENROUTER_API_KEY, BRAVE_API_KEY, MY_PROXY_API_KEY, GEMINI_API_KEY, FIRECRAWL_API_KEY
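Before a run it can help to confirm which of these keys are actually set. The sketch below assumes the keys are supplied as environment variables, which this page does not state. The function name missing_config_keys is hypothetical.

```python
import os

# Configuration keys listed in the registry entry above.
CONFIG_KEYS = [
    "OPENROUTER_API_KEY",
    "BRAVE_API_KEY",
    "MY_PROXY_API_KEY",
    "GEMINI_API_KEY",
    "FIRECRAWL_API_KEY",
]


def missing_config_keys(env=None):
    """Return the listed keys that are unset or empty in the given mapping.

    Assumption: keys are read from environment variables; pass a dict to
    check some other source instead.
    """
    env = os.environ if env is None else env
    return [key for key in CONFIG_KEYS if not env.get(key)]
```

Calling missing_config_keys() with no argument checks the current process environment and returns the names that still need a value. Whether every key is required for every task is not specified here, so an empty result is a sufficient check rather than a necessary one.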

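Project overview

WildClawBench is an in-the-wild benchmark for evaluating AI agents running in the OpenClaw environment. It gives agentic AI systems a realistic and challenging test platform.

English description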

An in-the-wild benchmark for AI agents in the OpenClaw Environment.

Highlights

An in-the-wild benchmark for AI agents
Built on the OpenClaw environment
Focused on agentic AI evaluation
Realistic and challenging test scenarios

Use cases

Evaluating how AI agents perform in open environments
Benchmarking agentic AI models
Research on agentic evaluation methods

README summary

<h1 align="center">WildClawBench</h1>

<p align="center">
  <img src="assets/lobster_battle.png" alt="WildClawBench Lobster" width="480">
</p>

<div align="center">

[![Tasks](https://img.shields.io/badge/Tasks-60-blue)]()
[![Harnesses](https://img.shields.io/badge/Harnesses-4-purple)]()
[![Models](https://img.shields.io/badge/Models-19-green)]()
[![Leaderboard](https://img.shields.io/badge/🏆_Leaderboard-WildClawBench-8c2416)](https://internlm.github.io/WildClawBench/)
<br>
[![arXiv](https://img.shields.io/badge/arXiv-2605.10912-b31b1b.svg)](https://arxiv.org/abs/2605.10912)
[![HF Daily Paper](https://img.shields.io/badge/🤗_Daily_Paper-Featured-ffcc00)](https://huggingface.co/papers/2605.10912)
[![HuggingFace](https://img.shields.io/badge/🤗_HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/internlm/WildClawBench)
[![PDF Report](https://img.shields.io/badge/📄_Paper-PDF-red)](https://github.com/InternLM/WildClawBench/blob/main/WildClawBench_report.pdf)

</div>

> **Hard, practical, end-to-end evaluation for AI agents, in the wild.**

---

**WildClawBench** is an agent benchmark that tests what actually matters: can an AI agent do real work, end-to-end, without hand-holding?

We drop agents into a live [OpenClaw](https://github.com/openclaw/openclaw) environment (the same open-source personal AI assistant that real users rely on daily) and throw **60 original tasks** at them: clipping goal highlights from a football match, negotiating meeting times over multi-round emails, hunting down contradictions in search results, writing inference scripts for undocumented codebases, catching privacy leaks before they happen. Useful things. Hard things.

Hard enough that **the strongest frontier model we tested still tops out around 62% overall** (technical report Main

Topics

agentic-ai agentic-evaluation agents benchmarks openclaw

Explore more

Related skills

ValueCell-ai/ClawX

infiniflow/ragflow

alirezarezvani/claude-skills

MervinPraison/PraisonAI

icip-cas/PPTAgent

kepano/obsidian-skills