Claude Skill
pinchbench/skill
PinchBench is a benchmarking system for evaluating LLMs as OpenClaw coding agents, built by the kilo.ai team.
Overview
Repository info
Install this Skill
git clone https://github.com/pinchbench/skill.git
Registry info
Project description
PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
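Key points
- Evaluates LLMs as coding agents
- Benchmark framework built on OpenClaw
- Built by the kilo.ai team
- Implemented in Python
Use cases
- Benchmarking the performance of LLM coding agents
- Comparing how different LLMs perform on coding tasks
- Research on agent-based code generation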
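To make the model-comparison use case concrete, here is a minimal Python sketch that builds one `./scripts/run.sh --model ...` command per model. The script path, the `--model` flag, and the two model identifiers are taken from the README's Quick Start. The helper name `build_run_commands` is hypothetical and not part of PinchBench. The sketch only prints the commands and does not run them.

```python
import shlex

# Model identifiers as they appear in the README's Quick Start.
MODELS = [
    "openrouter/anthropic/claude-sonnet-4",
    "openrouter/openai/gpt-4o",
]


def build_run_commands(models, script="./scripts/run.sh"):
    """Return one argv list per model (hypothetical helper, not part of PinchBench)."""
    return [[script, "--model", model] for model in models]


commands = build_run_commands(MODELS)
for argv in commands:
    # Print instead of executing, so the sketch works outside a cloned checkout.
    print(shlex.join(argv))
```

To launch the runs for real, one option is to pass each `argv` list to `subprocess.run(argv, check=True)` from inside the cloned `skill` directory. How results are stored or compared afterwards is not covered by the README excerpt, so it is left out here.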
README excerpt
# 🦀 PinchBench

**Real-world benchmarks for AI coding agents**

> **Note:** This repository contains the benchmark skill/tasks. It is NOT the source of official leaderboard results. To add models to the official results, modify [pinchbench/scripts/default-models.yml](https://github.com/pinchbench/scripts/blob/main/default-models.yml).

PinchBench measures how well LLM models perform as the brain of an [OpenClaw](https://github.com/openclaw/openclaw) agent. Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files.

Results are collected on a public leaderboard at **[pinchbench.com](https://pinchbench.com)**.

## Why PinchBench?

Most LLM benchmarks test isolated capabilities. PinchBench tests what actually matters for coding agents:

- **Tool usage**: Can the model call the right tools with the right parameters?
- **Multi-step reasoning**: Can it chain together actions to complete complex tasks?
- **Real-world messiness**: Can it handle ambiguous instructions and incomplete information?
- **Practical outcomes**: Did it actually create the file, send the email, or schedule the meeting?

## Quick Start

```bash
# Clone the skill
git clone https://github.com/pinchbench/skill.git
cd skill

# Run benchmarks with your model of choice
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4

# Or run specific tasks
./scripts/run.sh --model openrouter/openai/gpt-4o --suite t
```