Claude Skill

InternLM/WildClawBench

WildClawBench is an in-the-wild benchmark for evaluating AI agents in the OpenClaw environment, supporting agentic AI research and evaluation.

Overview

Stars: 462
Forks: 47
Language: Python
Last pushed: 2026-06-25
Last synced: 2026-07-03

Repository

Owner: InternLM
Repository: WildClawBench
Full name: InternLM/WildClawBench
Repo ID: 1,189,335,371

Install this Skill

pip install -U "huggingface_hub[cli]"
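The command above installs only the Hugging Face Hub client, not the benchmark itself. The listing does not say what the client is used for. One plausible use, assuming it is meant to fetch the dataset linked from the README's HuggingFace badge, is sketched below. The function name `download_benchmark_data` and the `wildclawbench_data` directory are hypothetical, chosen only for illustration.

```python
# Dataset repo ID taken from the Hugging Face dataset link in the README excerpt.
DATASET_REPO_ID = "internlm/WildClawBench"


def download_benchmark_data(local_dir="wildclawbench_data"):
    """Download a snapshot of the dataset repo and return its local path.

    `local_dir` is a hypothetical destination, not a path the project prescribes.
    """
    # Imported lazily so this module loads even before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=DATASET_REPO_ID,
        repo_type="dataset",  # the repo is a dataset, not a model
        local_dir=local_dir,
    )
```

Calling `download_benchmark_data()` needs network access and may need a Hugging Face login if the dataset is gated. Check the repository README for the supported setup steps before relying on this.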

Registry

Type: openclaw_skill
Quality score: 75/100
Verification: readme_parsed
Last verified: 2026-06-11
Platforms
Claude, OpenClaw, Codex
Capabilities
browser, pdf, memory, search, image, video, terminal, workflow, agentic-ai, agentic-evaluation
Detected files
README.md, requirements.txt
Config keys
OPENROUTER_API_KEY, BRAVE_API_KEY, MY_PROXY_API_KEY, GEMINI_API_KEY, FIRECRAWL_API_KEY
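
The listing names these config keys but does not say how they are supplied or which ones a given run needs. As a minimal sketch, assuming they are read from environment variables, a pre-flight check could look like this. The helper `missing_config_keys` is hypothetical and not part of the repository.

```python
import os

# Config keys as listed in the registry entry; whether all of them are
# required for every run is not stated there.
CONFIG_KEYS = [
    "OPENROUTER_API_KEY",
    "BRAVE_API_KEY",
    "MY_PROXY_API_KEY",
    "GEMINI_API_KEY",
    "FIRECRAWL_API_KEY",
]


def missing_config_keys(env=None):
    """Return the listed keys that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [key for key in CONFIG_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_config_keys()
    if missing:
        print("Missing config keys:", ", ".join(missing))
    else:
        print("All listed config keys are set.")
```

Passing a plain dict instead of `os.environ` keeps the check easy to test without touching the real environment.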

Summary

WildClawBench is an in-the-wild benchmark that evaluates AI agents end-to-end inside a live OpenClaw environment. According to the README, it comprises 60 original tasks, and the strongest frontier model tested tops out at around 62% overall.

Chinese description (translated)

An in-the-wild benchmark for AI agents in the OpenClaw environment.

Key features

  • In-the-wild benchmark for AI agents
  • Built on the OpenClaw environment
  • Focuses on agentic AI evaluation
  • Realistic and challenging test scenarios

Use cases

  • Evaluating AI agent performance in open environments
  • Benchmarking agentic AI models
  • Research on agentic evaluation methodologies

README excerpt

<h1 align="center">WildClawBench</h1>

<p align="center">
  <img src="assets/lobster_battle.png" alt="WildClawBench Lobster" width="480">
</p>

<div align="center">

[![Tasks](https://img.shields.io/badge/Tasks-60-blue)]()
[![Harnesses](https://img.shields.io/badge/Harnesses-4-purple)]()
[![Models](https://img.shields.io/badge/Models-19-green)]()
[![Leaderboard](https://img.shields.io/badge/🏆_Leaderboard-WildClawBench-8c2416)](https://internlm.github.io/WildClawBench/)
<br>
[![arXiv](https://img.shields.io/badge/arXiv-2605.10912-b31b1b.svg)](https://arxiv.org/abs/2605.10912)
[![HF Daily Paper](https://img.shields.io/badge/🤗_Daily_Paper-Featured-ffcc00)](https://huggingface.co/papers/2605.10912)
[![HuggingFace](https://img.shields.io/badge/🤗_HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/internlm/WildClawBench)
[![PDF Report](https://img.shields.io/badge/📄_Paper-PDF-red)](https://github.com/InternLM/WildClawBench/blob/main/WildClawBench_report.pdf)

</div>

> **Hard, practical, end-to-end evaluation for AI agents — in the wild.**

---

**WildClawBench** is an agent benchmark that tests what actually matters: can an AI agent do real work, end-to-end, without hand-holding?

We drop agents into a live [OpenClaw](https://github.com/openclaw/openclaw) environment — the same open-source personal AI assistant that real users rely on daily — and throw **60 original tasks** at them: clipping goal highlights from a football match, negotiating meeting times over multi-round emails, hunting down contradictions in search results, writing inference scripts for undocumented codebases, catching privacy leaks before they happen. Useful things. Hard things. Hard enough that **the strongest frontier model we tested still tops out around 62% overall** (technical report Main

Data from GitHub. Synced on 2026-07-03