Server evaluations connect your MCP implementation to a reference agent and verify that tools behave correctly, error handling works, and performance stays within limits. Run them whether your server powers an mcp-agent workflow or an external client.
Treat each tool as an API contract. Evaluations catch schema drift, unexpected latencies, and regressions in derived content before they reach users.
The full server guide lives at mcp-eval.ai/server-evaluation, with deeper dives on datasets, assertions, and debugging techniques.

Connect the server under test

Register the server exactly the way your agent launches it—stdio, SSE, Docker, or remote URL:
mcp-eval server add \
  --name fetch \
  --transport stdio \
  --command "uv" "run" "python" "-m" "mcp_servers.fetch"
When you expose a suite of servers through MCPAggregator, evaluate both the aggregated view and the underlying servers. That ensures namespacing and tool discovery keep working when you refactor.
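A minimal sketch of such a check is below, reusing the same assertion primitives shown later in this guide. The file name, task name, and prompt are illustrative, and the tool name you assert on depends on how your aggregator namespaces tools (some setups prefix the server name), so adjust it to match your configuration.
aggregator_test.py
from mcp_eval import Expect, task

@task("Aggregated view still exposes the fetch tool")
async def test_fetch_via_aggregator(agent, session):
    # Assumes the agent is wired to the aggregated server rather than the
    # standalone fetch server.
    response = await agent.generate_str(
        "Use the fetch tool to read https://httpbin.org/html and report the page title."
    )

    # Adjust the tool name if your aggregator prefixes it with the server name.
    await session.assert_that(Expect.tools.was_called("fetch"))
    await session.assert_that(
        Expect.content.contains("html", case_sensitive=False), response=response
    )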

Baseline assertions

  • Correctness: validate the content returned to the agent (Expect.content.contains, Expect.tools.output_matches)
  • Tool usage: make sure the tool you expect is the one that fired (Expect.tools.was_called, Expect.tools.sequence)
  • Performance: guard against regressions in latency or excessive retries (Expect.performance.response_time_under, Expect.performance.max_iterations)
  • Quality: enlist LLM judges when outputs are qualitative (Expect.judge.llm, Expect.judge.multi_criteria)
fetch_server_test.py
from mcp_eval import Expect, task

@task("Fetch server returns HTML summary")
async def test_fetch_tool(agent, session):
    response = await agent.generate_str(
        "Use the fetch tool to read https://httpbin.org/html and summarize the page."
    )

    await session.assert_that(Expect.tools.was_called("fetch"))
    await session.assert_that(
        Expect.tools.output_matches("fetch", {"isError": False}, match_type="partial")
    )
    await session.assert_that(
        Expect.content.contains("httpbin", case_sensitive=False), response=response
    )
    await session.assert_that(
        Expect.judge.llm(
            "Summary should mention the simple HTML demonstration page", min_score=0.8
        ),
        response=response,
    )
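The performance assertions listed above slot into the same test. A minimal sketch follows; the thresholds are placeholders to tune for your server, and the time unit for response_time_under is assumed to be milliseconds, so verify it against your mcp-eval version.
performance_checks.py
# Placeholder thresholds; tune to your server. The unit for
# response_time_under is assumed to be milliseconds.
await session.assert_that(Expect.performance.response_time_under(5000))
await session.assert_that(Expect.performance.max_iterations(3))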

Encode golden paths and limits

Expect.path.efficiency and Expect.tools.sequence let you capture the ideal execution path. This is especially useful for servers that proxy other systems (databases, file systems, SaaS APIs) where unnecessary retries are costly.
golden_path.py
await session.assert_that(
    Expect.path.efficiency(
        expected_tool_sequence=["fetch"],
        allow_extra_steps=1,
        tool_usage_limits={"fetch": 1},
    )
)
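To pin the exact call order rather than an efficiency budget, Expect.tools.sequence takes the expected tool list. A short sketch, passing only the list of tool names; additional options are omitted here because their exact names vary, so consult the assertion reference for your version.
sequence_check.py
# Sketch: require that "fetch" is the tool sequence the agent executed.
await session.assert_that(Expect.tools.sequence(["fetch"]))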

Inspect artifacts

  • Per-test JSON plus OpenTelemetry .jsonl traces show detailed timings and tool payloads (./test-reports by default)
  • HTML and Markdown summaries are ready for CI upload or PR comments
  • Combine with mcp-agent tracing to correlate server-side telemetry with agent orchestration
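For a quick look at what a run produced without opening the HTML report, a small stdlib-only script can enumerate the artifacts. This sketch assumes the default ./test-reports directory noted above and only prints top-level keys, since the exact report schema varies by version.
inspect_reports.py
import json
from pathlib import Path

# Assumes the default output directory; change this if you configured another.
reports_dir = Path("./test-reports")

for path in sorted(reports_dir.rglob("*.json")):
    with path.open() as fh:
        data = json.load(fh)
    # Only top-level keys are printed because the report schema varies by version.
    summary = ", ".join(data.keys()) if isinstance(data, dict) else type(data).__name__
    print(f"{path.name}: {summary}")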