Treat each tool as an API contract. Evaluations catch schema drift, unexpected latencies, and regressions in derived content before they reach users.
The full server guide lives at mcp-eval.ai/server-evaluation, with deeper dives on datasets, assertions, and debugging techniques.
Connect the server under test
Register the server exactly the way your agent launches it—stdio, SSE, Docker, or remote URL. If you compose servers through an MCPAggregator, evaluate both the aggregated view and the underlying servers. That ensures namespacing and tool discovery keep working when you refactor.
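Before pointing the eval suite at the server, it can be worth a quick sanity check that the exact launch command works and exposes the tools you expect. The sketch below uses the official MCP Python SDK directly (not mcp-eval) to start a stdio server and list its tools; the `uvx mcp-server-fetch` command is only an illustrative stand-in for whatever your agent actually launches.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Use the exact command/args your agent uses to launch the server.
# "uvx mcp-server-fetch" is an illustrative example, not a requirement.
server_params = StdioServerParameters(command="uvx", args=["mcp-server-fetch"])


async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Confirm the tool names your tests will assert against.
            print([tool.name for tool in tools.tools])


if __name__ == "__main__":
    asyncio.run(main())
```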
Baseline assertions
- Correctness: validate the content returned to the agent (`Expect.content.contains`, `Expect.tools.output_matches`)
- Tool usage: make sure the tool you expect is the one that fired (`Expect.tools.was_called`, `Expect.tools.sequence`)
- Performance: guard against regressions in latency or excessive retries (`Expect.performance.response_time_under`, `Expect.performance.max_iterations`)
- Quality: enlist LLM judges when outputs are qualitative (`Expect.judge.llm`, `Expect.judge.multi_criteria`)
`fetch_server_test.py`
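A minimal version of this test might look like the sketch below, which exercises each assertion family from the list above against a server assumed to be registered as `fetch`. It presumes mcp-eval's decorator-style API (`@task`, `TestAgent`/`TestSession`, `session.assert_that`); exact imports, signatures, and units may differ between releases, so verify against the guide at mcp-eval.ai/server-evaluation.

```python
from mcp_eval import Expect, task
from mcp_eval.session import TestAgent, TestSession


@task("Fetch example.com and summarize it")
async def test_fetch_and_summarize(agent: TestAgent, session: TestSession):
    # Drive the agent with a prompt, then assert on what actually happened.
    response = await agent.generate_str(
        "Fetch https://example.com and summarize it in one sentence."
    )

    # Tool usage: the fetch tool is the one that fired.
    await session.assert_that(Expect.tools.was_called("fetch"))

    # Correctness: the content handed back to the agent contains the marker text.
    await session.assert_that(
        Expect.content.contains("Example Domain"), response=response
    )

    # Performance: guard the latency budget (value assumed to be milliseconds).
    await session.assert_that(Expect.performance.response_time_under(10_000))

    # Quality: an LLM judge scores the qualitative summary.
    # The rubric string here is an illustrative assumption.
    await session.assert_that(
        Expect.judge.llm("The response accurately summarizes the fetched page"),
        response=response,
    )
```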
Encode golden paths and limits
`Expect.path.efficiency` and `Expect.tools.sequence` let you capture the ideal execution path. This is especially useful for servers that proxy other systems (databases, file systems, SaaS APIs), where unnecessary retries are costly.
`golden_path.py`
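Again a hedged sketch rather than the canonical example: it combines `Expect.tools.sequence` with `Expect.path.efficiency` to pin the ideal run to a single fetch call. The keyword arguments passed to `Expect.path.efficiency` are illustrative assumptions, so check them against the current API.

```python
from mcp_eval import Expect, task
from mcp_eval.session import TestAgent, TestSession


@task("Golden path: a single fetch, no retries")
async def test_fetch_golden_path(agent: TestAgent, session: TestSession):
    await agent.generate_str("Fetch https://example.com")

    # The ideal run calls the fetch tool exactly once, in this order.
    await session.assert_that(Expect.tools.sequence(["fetch"]))

    # Path efficiency: penalize detours and unnecessary retries.
    # NOTE: the argument names below are illustrative assumptions; check
    # the Expect.path.efficiency signature in the mcp-eval docs.
    await session.assert_that(
        Expect.path.efficiency(
            expected_tool_sequence=["fetch"],
            tool_usage_limits={"fetch": 1},
        )
    )

    # Cap agent loop iterations so runaway retries fail fast.
    await session.assert_that(Expect.performance.max_iterations(3))
```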
Inspect artifacts
- Per-test JSON plus OpenTelemetry `.jsonl` traces show detailed timings and tool payloads (`./test-reports` by default)
- HTML and Markdown summaries are ready for CI upload or PR comments
- Combine with mcp-agent tracing to correlate server-side telemetry with agent orchestration
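Those `.jsonl` traces can also be scanned programmatically, for example to drop a timing summary into a CI comment. The sketch below is generic Python; the span key names it reads (`name`, `start_time`, `end_time`) are assumptions about the export schema, so adjust them to whatever your exporter actually writes.

```python
import json
from pathlib import Path

# Default artifact directory used by the reports; adjust if your
# configuration writes them somewhere else.
REPORT_DIR = Path("./test-reports")

for trace_file in sorted(REPORT_DIR.rglob("*.jsonl")):
    print(f"== {trace_file.relative_to(REPORT_DIR)}")
    with trace_file.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            span = json.loads(line)
            # Key names are assumptions about the span schema; change them
            # to match the actual fields in your trace files.
            name = span.get("name", "<unnamed span>")
            start = span.get("start_time")
            end = span.get("end_time")
            if isinstance(start, (int, float)) and isinstance(end, (int, float)):
                print(f"  {name}: {end - start}")
            else:
                print(f"  {name}")
```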
