mcp-eval tests Model Context Protocol servers and agents. It runs scripted scenarios, captures telemetry, and enforces assertions so you can confirm behavior is stable before releasing changes.
The full documentation lives at mcp-eval.ai. Keep it handy for configuration specifics, advanced examples, and release updates.
mcp-eval connects to your targets over MCP, executes scenarios, and records detailed metrics for every tool call.
Why teams run it
- Catch regressions when prompts, workflows, or model settings change
- Confirm that the right MCP tools fire in the expected order with the expected payloads
- Exercise recovery paths such as human-input pauses or fallback workflows
- Produce repeatable evidence—reports, traces, and badges—that a release is safe
What you can cover
- Test MCP Servers: validate tool definitions, edge cases, and error responses before exposing servers to users
- Evaluate Agents: measure tool usage, reasoning quality, and recovery behavior
- Track Performance: capture latency, token usage, and cost with built-in telemetry
- Assert Quality: combine structural checks, path validators, and LLM judges in one run
Install mcp-eval
The `mcp-eval init` wizard can generate decorator tests, pytest scaffolding, and dataset examples; you can rerun it as your suite grows.
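A typical first run might look like the sketch below; the PyPI package name is an assumption here, so check mcp-eval.ai for the published install command.

```bash
# Install the CLI (package name assumed; see mcp-eval.ai for the exact name)
pip install mcp-eval

# Launch the interactive wizard to scaffold tests, config, and dataset examples
mcp-eval init
```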
Register what you test
After an mcp-agent workflow or aggregator is running locally:

- Register servers with the same command or endpoint your agent uses.
- Register agents by pointing to an `AgentSpec`, an instantiated `Agent`, or your `MCPApp` (both registrations are sketched below).
- When you introduce a new workflow or capability, run `mcp-eval generate` to draft scenario ideas with LLM assistance.
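A hypothetical wiring sketch follows, assuming mcp-agent's `Agent` class and a `use_agent`-style hook on the mcp-eval side; the exact registration calls and config filenames come from the generated scaffolding and the docs at mcp-eval.ai.

```python
# Illustrative registration sketch. Agent and server_names come from mcp-agent;
# the use_agent hook is an assumption, so confirm it against the scaffolding
# that `mcp-eval init` generates.
#
# Servers are usually declared in the same config your agent already uses,
# e.g. an mcp-agent style entry (assumed layout):
#
#   mcp:
#     servers:
#       fetch:
#         command: "uvx"
#         args: ["mcp-server-fetch"]
import mcp_eval
from mcp_agent.agents.agent import Agent

agent = Agent(
    name="fetcher",
    instruction="Fetch URLs and report on their content.",
    server_names=["fetch"],  # must match a server name in your config
)

# Point mcp-eval at the agent under test; an AgentSpec or MCPApp can stand in
# for the Agent instance, per the docs.
mcp_eval.use_agent(agent)
```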
Structure evaluations
mcp-eval follows a code-first layout similar to Pydantic AI’s evals package: datasets hold cases, cases reference evaluators, and evaluators score the outputs.
Decorator tasks (decorator_style.py)
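A minimal sketch of a decorator-style task, assuming the `@task` decorator, the `Expect` assertion catalog, and an injected agent/session pair; the files generated by `mcp-eval init` show the exact signatures.

```python
# decorator_style.py (illustrative sketch; confirm names against the
# scaffolding that `mcp-eval init` generates)
from mcp_eval import task, Expect


@task("Agent fetches example.com and reports its title")
async def test_fetch_title(agent, session):
    # Drive the agent with a natural-language prompt
    response = await agent.generate_str(
        "Fetch https://example.com and tell me the page title"
    )

    # Structural check: the fetch tool was actually called
    await session.assert_that(Expect.tools.was_called("fetch"))

    # Content check against the final response
    await session.assert_that(
        Expect.content.contains("Example Domain"), response=response
    )
```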
Pytest suites (pytest_style.py)
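Pytest suites reuse the same assertions through fixtures provided by mcp-eval's pytest integration. The fixture names below are assumptions; the pytest scaffolding from `mcp-eval init` lists the ones actually exposed.

```python
# pytest_style.py (illustrative sketch; fixture names are assumptions)
import pytest

from mcp_eval import Expect


@pytest.mark.asyncio
async def test_fetch_tool_is_used(mcp_agent, mcp_session):
    response = await mcp_agent.generate_str("Fetch https://example.com")

    # Same Expect catalog as the decorator style
    await mcp_session.assert_that(Expect.tools.was_called("fetch"))
    await mcp_session.assert_that(
        Expect.content.contains("Example Domain"), response=response
    )
```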
Dataset runs (dataset_style.py)
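Dataset runs group cases and evaluators so the same scenarios can be replayed as the suite grows. The imports and evaluator names below are assumptions modeled on the Pydantic AI evals layout the docs reference; check the generated dataset example for the exact API.

```python
# dataset_style.py (illustrative sketch; import paths and evaluator names
# are assumptions modeled on the Pydantic AI evals layout)
from mcp_eval import Case, Dataset
from mcp_eval.evaluators import LLMJudge, ToolWasCalled

dataset = Dataset(
    name="fetch server smoke tests",
    cases=[
        Case(
            name="fetch_example_dot_com",
            inputs="Fetch https://example.com and summarize it in one sentence",
            evaluators=[
                ToolWasCalled("fetch"),
                LLMJudge("The answer is a single-sentence summary of the page"),
            ],
        ),
    ],
)

# Evaluating the dataset hands each case's inputs to your agent and scores the
# output with that case's evaluators (see the generated example for the exact
# evaluate call and its signature).
```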
As in Pydantic AI evals, cases define inputs and expectations, evaluators score results, and datasets group related cases for reuse.
Run and inspect
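A hedged sketch of the usual loop, assuming the bundled `mcp-eval run` command; pytest-style suites also run under plain pytest, and report flags are documented at mcp-eval.ai.

```bash
# Execute decorator and dataset tests with the bundled runner (command name
# assumed; see mcp-eval.ai for flags and report output options)
mcp-eval run tests/

# Pytest-style suites run under pytest as usual
pytest tests/
```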
Pick a focus area
- Work through end-to-end agent scenarios in Agent Evaluation.
- Validate server behavior and tool contracts in MCP Server Evaluation.
- Refer back to mcp-eval.ai for extended guides, configuration options, and community examples.
Observability, reports, and CI/CD
- OpenTelemetry traces flow to Grafana, Honeycomb, Pydantic Logfire, or any OTEL target
- JSON/Markdown/HTML reports are ready for CI artifacts or release notes
- Reusable GitHub Actions (`mcp-eval/.github/actions/mcp-eval/run`) publish test results, summaries, and badges
