Claude Skill
shenli/distributed-system-testing
Explore AI-agent skills for distributed-systems testing, combining chaos engineering and automation to validate resilience and correctness in complex systems.
Install this Skill
git clone https://github.com/shenli/distributed-system-testing.git
Summary
A collection of AI-agent skills designed for distributed-systems testing, leveraging chaos engineering and agent-based automation to validate system resilience and correctness.
AI-agent skills for distributed-systems testing
Key features
- AI-agent skills tailored for distributed-systems testing
- Integration with chaos engineering principles
- Automated resilience and correctness validation
- Modular skill design for flexible test scenarios
- Open-source and community-driven development
Use cases
- Resilience testing of distributed databases
- Fault injection and recovery validation in microservices
- Automated chaos experiments for cloud-native systems
- Correctness verification under network partitions
- Training AI agents to detect system anomalies
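As a rough illustration of what "correctness verification under network partitions" can involve, the sketch below checks a recorded operation history for a single register. It is not taken from the repository: the `Op` record and the `find_invalid_reads` function are hypothetical names, and the check assumes strictly sequential operations with unique written values. That is far simpler than a real linearizability checker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Op:
    """One completed operation against a single register (hypothetical schema)."""
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    ok: bool = True        # False: the client saw an error or timeout


def find_invalid_reads(history: List[Op], initial: Optional[int] = None) -> List[int]:
    """Return the indexes of successful reads that returned an impossible value.

    Assumptions (illustrative only): operations do not overlap in time, and
    every write uses a unique value. A failed write is indeterminate: it may
    never apply, or it may apply at any later point (for example, after a
    partition heals), so its value stays "pending" until a read observes it.
    """
    current = initial
    pending = set()
    violations = []
    for i, op in enumerate(history):
        if op.kind == "write":
            if op.ok:
                current = op.value
            else:
                pending.add(op.value)
        elif op.kind == "read" and op.ok:
            if op.value == current:
                continue
            if op.value in pending:
                # The indeterminate write evidently took effect.
                pending.discard(op.value)
                current = op.value
            else:
                violations.append(i)
    return violations


if __name__ == "__main__":
    history = [
        Op("write", 1),
        Op("read", 1),
        Op("write", 2, ok=False),  # timed out, e.g. during a partition
        Op("read", 2),             # the timed-out write landed after all
        Op("read", 1),             # stale: the register already held 2
    ]
    print(find_invalid_reads(history))
```

Treating failed writes as indeterminate rather than as no-ops matters here: a timeout during a partition does not prove the write was lost, so a later read of that value is legitimate, while a read that goes back to the older value is not.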
README excerpt
# Distributed Systems Testing Skills

**Two skills for AI coding agents that design and run claim-driven tests for distributed and stateful systems.** Together they produce a structured Markdown test plan and a findings report with 10-state verdicts and an explicit SUT / harness / checker / environment blame classification. A reviewer reads the two artifacts and decides whether to ship; nothing else has to be re-run.

Works with Claude Code, Codex, Copilot CLI, Cursor, Gemini, or any agent that reads Markdown and runs shell. The skills are plain SKILL.md files. The agent executes them; the plan and findings report are the output.

One skill designs the plan. The other runs it.

A plan starts from the product's claims, generates hypotheses tied to those claims, and writes scenarios named after the claim each tries to falsify. For consistency-critical scenarios, each scenario also binds an abstract model (`register | queue | log | lock | lease | ledger | …`) to an operation-history schema, a named checker, and a nemesis with observable landing evidence. The plan ends with a coverage adequacy argument and a conservative confidence statement.

## Why

The default for testing distributed and stateful systems (write a few integration tests and call it done) finds a small fraction of the bugs that actually break these systems in production: partial network partitions, non-deterministic concurrency, crash-recovery, upgrade/rollback, idempotency under replay, timing-sensitive ordering.

These skills enforce an opinionated workflow that pulls from the field's hard-won knowledge:

- **Claim-driven, not test-driven.** Start from what the product promises. Every scenario falsifies one claim under one fault. A test named after its claim is harder to weaken than one named after