mcp-eval tests Model Context Protocol servers and agents. It runs scripted scenarios, captures telemetry, and enforces assertions so you can confirm behavior is stable before releasing changes.
The full documentation lives at mcp-eval.ai. Keep it handy for configuration specifics, advanced examples, and release updates.
mcp-eval connects to your targets over MCP, executes scenarios, and records detailed metrics for every tool call.

Why teams run it

  • Catch regressions when prompts, workflows, or model settings change
  • Confirm that the right MCP tools fire in the expected order with the expected payloads
  • Exercise recovery paths such as human-input pauses or fallback workflows
  • Produce repeatable evidence—reports, traces, and badges—that a release is safe

Install mcp-eval

uv tool install mcpevals      # CLI
uv add mcpevals               # project dependency
mcp-eval init                 # scaffold config, tests/, and datasets/
The init wizard can generate decorator tests, pytest scaffolding, and dataset examples—you can rerun it as your suite grows.

Register what you test

After an mcp-agent workflow or aggregator is running locally:
  1. Register servers with the same command or endpoint your agent uses:
    mcp-eval server add \
      --name fetch \
      --transport stdio \
      --command "uv" "run" "python" "-m" "mcp_servers.fetch"
    
  2. Register agents by pointing to an AgentSpec, an instantiated Agent, or your MCPApp:
    # tests/config/targets.yaml
    agents:
      - name: finder
        type: agent_spec
        path: ../../examples/basic/mcp_basic_agent/mcp_agent/agents/finder.py
    servers:
      - name: fetch
        transport: stdio
        command: ["uv", "run", "python", "-m", "mcp_servers.fetch"]
    
  3. When you introduce a new workflow or capability, run mcp-eval generate to draft scenario ideas with LLM assistance.
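
With the server and agent above registered, a quick smoke check confirms the wiring before you build out full suites. A minimal sketch that reuses only the calls shown elsewhere on this page; the file name and task description are illustrative:

smoke_test.py
from mcp_eval import Expect, task

@task("Configured agent can reach the fetch server")
async def test_fetch_smoke(agent, session):
    # The injected agent comes from your registered configuration.
    await agent.generate_str("Fetch https://example.com")

    await session.assert_that(Expect.tools.was_called("fetch"))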

Structure evaluations

mcp-eval follows a code-first layout similar to Pydantic AI’s evals package: datasets hold cases, cases reference evaluators, and evaluators score the outputs.

Decorator tasks

decorator_style.py
from mcp_eval import Expect, task

@task("Finder summarizes Example Domain")
async def test_finder_fetch(agent, session):
    response = await agent.generate_str("Fetch https://example.com and summarize it.")

    await session.assert_that(Expect.tools.was_called("fetch"))
    await session.assert_that(Expect.content.contains("Example Domain"), response=response)
    await session.assert_that(Expect.performance.max_iterations(3))

Pytest suites

pytest_style.py
import pytest
from mcp_eval import create_agent, Expect

@pytest.mark.asyncio
async def test_finder_fetch_pytest():
    agent = await create_agent("finder")
    response = await agent.generate_str("Fetch https://example.com")
    assert "Example Domain" in response
    await Expect.tools.was_called("fetch").evaluate(agent.session)
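
To run just this file, point pytest at it directly (assuming it lives under tests/ as scaffolded by init):
uv run pytest -q tests/pytest_style.py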

Dataset runs

dataset_style.py
from mcp_eval import Case, Dataset, Expect, create_agent

dataset = Dataset(
    cases=[
        Case(
            name="fetch_example_domain",
            inputs="Fetch https://example.com and summarize it.",
            evaluators=[Expect.tools.was_called("fetch")],
        )
    ]
)

async def run_case(prompt: str) -> str:
    agent = await create_agent("finder")
    return await agent.generate_str(prompt)

report = await dataset.evaluate(run_case)  # call from an async test or helper
Datasets, cases, and evaluators match the structure in Pydantic AI evals: cases define inputs and expectations, evaluators score results, and datasets group related cases for reuse.
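
To run the dataset file on its own rather than from an async test, replace the module-level await above with a standard entry point. A minimal sketch, assuming dataset and run_case are defined in the same file; print is a placeholder for however you want to inspect or serialize the report:

import asyncio

async def main() -> None:
    report = await dataset.evaluate(run_case)
    print(report)  # inspect, assert on, or serialize the report here

if __name__ == "__main__":
    asyncio.run(main())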

Run and inspect

mcp-eval run tests/   # decorator, dataset, and CLI suites
uv run pytest -q tests   # pytest-style suites
During a run you can pull structured telemetry:
metrics = session.get_metrics()
span_tree = session.get_span_tree()
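
Inside a decorator task, session is the same object your assertions use, so the telemetry calls can sit alongside them. A minimal sketch that only prints what it finds; the exact shape of the metrics and span tree depends on your mcp-eval version, and the file name is illustrative:

telemetry_probe.py
from mcp_eval import Expect, task

@task("Inspect telemetry for a fetch run")
async def test_fetch_telemetry(agent, session):
    await agent.generate_str("Fetch https://example.com and summarize it.")
    await session.assert_that(Expect.tools.was_called("fetch"))

    metrics = session.get_metrics()      # structured metrics for this run
    span_tree = session.get_span_tree()  # hierarchical trace of the tool calls
    print(metrics)
    print(span_tree)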

Pick a focus area

Observability, reports, and CI/CD

  • OpenTelemetry traces flow to Grafana, Honeycomb, Pydantic Logfire, or any OTEL target
  • JSON/Markdown/HTML reports are ready for CI artifacts or release notes
  • Reusable GitHub Actions (mcp-eval/.github/actions/mcp-eval/run) publish test results, summaries, and badges