Claude Skill
darkrishabh/agent-skills-eval
A TypeScript CLI tool for evaluating AI agent skills in agentskills.io format. Supports JSONL/YAML tests and OpenAI-compatible LLM evals.
Install this Skill
npm install agent-skills-eval
npx agent-skills-eval --help
npx agent-skills-eval [root] \
npx agent-skills-eval ./skills \
Summary
A TypeScript-based test runner for evaluating AI agent skills in the agentskills.io format. It supports CLI usage, JSONL and YAML test definitions, and OpenAI-compatible LLM evaluations.
A test runner for agentskills.io-style AI agent skills.
Key features
- Runs agent skill evaluations from agentskills.io-style definitions
- Supports JSONL and YAML test file formats
- OpenAI-compatible LLM evaluation integration
- Command-line interface (CLI) for easy automation
- Built with TypeScript for type safety and reliability
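To make the JSONL test format concrete, here is a minimal sketch of loading a JSONL file of eval cases in TypeScript. The field names (`id`, `prompt`, `expected`) and the helpers `parseJsonl` and `isEvalCase` are assumptions for illustration only; they are not the tool's actual schema or API, so check the project documentation for the real format.

```typescript
// Hypothetical shape of one eval case; the real agent-skills-eval schema may differ.
interface EvalCase {
  id: string;
  prompt: string;
  expected: string;
}

// Type guard: checks that a parsed value has the three assumed string fields.
function isEvalCase(value: unknown): value is EvalCase {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.id === "string" &&
    typeof record.prompt === "string" &&
    typeof record.expected === "string"
  );
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): EvalCase[] {
  const cases: EvalCase[] = [];
  const lines = text.split(/\r?\n/);
  lines.forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === "") return; // skip blank lines
    let value: unknown;
    try {
      value = JSON.parse(trimmed);
    } catch (_err) {
      throw new Error(`Invalid JSON on line ${index + 1}`);
    }
    if (!isEvalCase(value)) {
      throw new Error(`Line ${index + 1} is missing a required field`);
    }
    cases.push(value);
  });
  return cases;
}

// Two illustrative cases separated by a blank line.
const sample = [
  '{"id": "greet-1", "prompt": "Say hello", "expected": "hello"}',
  "",
  '{"id": "sum-1", "prompt": "Add 2 and 3", "expected": "5"}',
].join("\n");

const parsed = parseJsonl(sample);
console.log(parsed.map((c) => c.id));
```

Validating each line as it is read keeps error messages tied to a line number, which tends to matter once test files grow beyond a handful of cases.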
Use cases
- Evaluating AI agent performance on standardized skill tests
- Automating LLM evaluation pipelines in CI/CD workflows
- Benchmarking different AI agents against agentskills.io tasks
- Developing and testing new agent skills with reproducible evals
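Benchmarking, as in the use cases above, largely comes down to comparing pass rates between two runs over the same cases. The sketch below shows that arithmetic under stated assumptions: the `CaseResult` shape and the `passRate` and `passRateDelta` helpers are hypothetical and do not reflect the tool's actual output format or SDK.

```typescript
// Hypothetical per-case result; not the tool's actual output format.
interface CaseResult {
  caseId: string;
  passed: boolean;
}

// Fraction of cases that passed; an empty run is treated as 0.
function passRate(results: CaseResult[]): number {
  if (results.length === 0) return 0;
  const passedCount = results.filter((r) => r.passed).length;
  return passedCount / results.length;
}

// Difference in pass rate between a candidate run and a baseline run.
// Positive values mean the candidate passed a larger share of cases.
function passRateDelta(baseline: CaseResult[], candidate: CaseResult[]): number {
  return passRate(candidate) - passRate(baseline);
}

// Illustrative data only: 1 of 4 passes in the baseline, 3 of 4 in the candidate.
const baselineRun: CaseResult[] = [
  { caseId: "a", passed: true },
  { caseId: "b", passed: false },
  { caseId: "c", passed: false },
  { caseId: "d", passed: false },
];
const candidateRun: CaseResult[] = [
  { caseId: "a", passed: true },
  { caseId: "b", passed: true },
  { caseId: "c", passed: true },
  { caseId: "d", passed: false },
];

console.log(passRateDelta(baselineRun, candidateRun)); // 0.75 - 0.25 = 0.5
```

A single delta hides which cases changed, so in practice it is usually worth also listing the case IDs whose outcome differs between the two runs.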
README excerpt
<div align="center">
<img src="https://github.com/user-attachments/assets/094b8e11-e19e-4c96-ae82-ba701cfcf7e3" alt="agent-skills-eval — a test runner for Agent Skills" width="100%" />
<br />

# agent-skills-eval

**A test runner for [Agent Skills](https://agentskills.io).**

Write a `SKILL.md`, drop in some evals, and find out, empirically, whether your skill actually makes the model better at the task.

[Documentation](https://darkrishabh.github.io/agent-skills-eval/) · [Quickstart](#quickstart) · [SDK](#sdk) · [agentskills.io](https://agentskills.io)

</div>

---

## Why this exists

[Agent Skills](https://agentskills.io), the open standard from Anthropic for giving agents domain knowledge, make it easy to ship a `SKILL.md` and assume your agent is now better at the task. The hard part is *proving* it.

`agent-skills-eval` is the missing piece. It runs your skill against the same prompts twice: once `with_