Point mcp-eval at your agent, run realistic scenarios, and check that it uses tools correctly, follows instructions, and produces the expected outputs.
Use evaluations to confirm the agent behaves as expected before you release changes.
The full agent playbook is at mcp-eval.ai/agent-evaluation; reference it for extended patterns, datasets, and troubleshooting.
Define the agent under test
You can point mcp-eval at an AgentSpec, an instantiated Agent, or a factory that builds agents on demand. For example, to test the Finder agent from examples/basic/mcp_basic_agent:
agent_under_test.py
agent_factory.py
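Roughly, the two styles look like the sketch below. This is not the verbatim agent_under_test.py or agent_factory.py; the Agent import path, its constructor arguments, and the mcp_eval.use_agent hook are assumptions about the current APIs.

```python
# Sketch only: import paths, constructor arguments, and the use_agent hook
# are assumptions, not copied from the example project.
import mcp_eval
from mcp_agent.agents.agent import Agent  # assumed import path


def make_finder_agent() -> Agent:
    """Build a Finder-style agent that can read local files and fetch URLs."""
    return Agent(
        name="finder",
        instruction="You can read local files and fetch URLs for the user.",
        server_names=["fetch", "filesystem"],
    )


# agent_under_test.py style: register a concrete agent instance.
mcp_eval.use_agent(make_finder_agent())

# agent_factory.py style: register the factory so every test builds a fresh agent.
# mcp_eval.use_agent(make_finder_agent)
```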
Choose a test style
- Decorator tasks (@task) are great for narrative scenarios and map cleanly to the patterns described in Anthropic's Building Effective Agents paper.
- Pytest works when you want to stay inside your existing test harness.
- Datasets let you replay curated or generated cases against multiple agents.
Examples of each style live in lastmile-ai/mcp-eval/examples/mcp_server_fetch/tests/.
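As an illustration, a decorator-style task might look like the following sketch; the (agent, session) parameters, agent.generate_str, and session.assert_that are assumed shapes of the current mcp-eval API rather than a copy of the example tests.

```python
# Sketch of a decorator-style task; the parameters and assertion call
# are assumptions about the current mcp-eval API.
from mcp_eval import Expect, task


@task("Finder fetches a URL when asked")
async def test_finder_fetches(agent, session):
    await agent.generate_str("Fetch https://example.com and tell me what it says.")
    # Tool-usage check confirmed by the docs: the fetch tool should have run.
    await session.assert_that(Expect.tools.was_called("fetch"))
```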
Write assertions that match agent expectations
Expect covers tool usage, quality, efficiency, and performance. Combine multiple checks in a single run:
agent_eval_test.py
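A sketch of what combining several families of checks could look like; aside from Expect.tools.was_called, the matcher names below (content, judge, performance) are assumptions about the Expect catalog, not confirmed API.

```python
# Sketch: matcher names other than Expect.tools.was_called are assumptions.
from mcp_eval import Expect, task


@task("Fetch a page and report its title")
async def test_fetch_quality_and_efficiency(agent, session):
    response = await agent.generate_str(
        "Fetch https://example.com and give me its title."
    )

    # Tool usage
    await session.assert_that(Expect.tools.was_called("fetch"))
    # Quality (assumed matchers)
    await session.assert_that(Expect.content.contains("Example Domain"), response=response)
    await session.assert_that(
        Expect.judge.llm("The answer states the page title accurately."), response=response
    )
    # Efficiency / performance (assumed matcher)
    await session.assert_that(Expect.performance.max_iterations(3))
```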
The same assertions apply to agents composed from mcp_agent.workflows.* components:
path_validation.py
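As a rough sketch, a path check for a workflow-built agent could assert the order of tool calls; Expect.tools.sequence and its allow_other_calls flag are assumed names, not confirmed API, and the tool names are purely illustrative.

```python
# Sketch: Expect.tools.sequence and allow_other_calls are assumed names,
# and the tool names below are illustrative.
from mcp_eval import Expect, task


@task("Orchestrator keeps its tool chain")
async def test_tool_call_path(agent, session):
    await agent.generate_str("Find the README on disk, then fetch the URL it links to.")
    # Validate the call path: read the file first, fetch the URL second.
    await session.assert_that(
        Expect.tools.sequence(["read_file", "fetch"], allow_other_calls=True)
    )
```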
Inspect metrics and spans
Every evaluation captures telemetry you can read during or after the test:
telemetry.py
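A sketch of reading that telemetry from inside a test; the get_metrics() accessor and the field names shown are assumptions about the session's metrics object, not a guaranteed schema.

```python
# Sketch: get_metrics() and the attribute names below are assumptions.
from mcp_eval import task


@task("Inspect telemetry after a run")
async def test_inspect_telemetry(agent, session):
    await agent.generate_str("Fetch https://example.com")

    metrics = session.get_metrics()  # assumed accessor over the captured OTel data
    # Which tools ran, how many agent iterations, and how long the run took.
    print([call.name for call in metrics.tool_calls])
    print(metrics.iteration_count)
    print(metrics.total_duration_ms)
```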
Use the captured spans (exposed via context.tracer) to debug tool failures, retries, or human-input pauses.
Scenario ideas
- Regression suites for workflows in examples/workflows/: verify orchestrators still call the same tool chain after prompt or model changes.
- Human-in-the-loop flows: assert that Expect.tools.was_called("__human_input__") fires when you send a pause signal from the SignalRegistry (sketched after this list).
- Multi-agent swarms: ensure router decisions and downstream agents cooperate by validating tool sequences and final content.
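For the human-in-the-loop case, the assertion side could look like this sketch; actually delivering the pause signal through mcp-agent's SignalRegistry depends on your executor setup and is only noted as a comment.

```python
# Sketch: only the assertion side is shown; sending the pause signal via
# mcp-agent's SignalRegistry depends on your executor setup.
from mcp_eval import Expect, task


@task("Agent pauses for human confirmation")
async def test_human_input_pause(agent, session):
    # Assumption: this prompt routes through the agent's human-input path.
    await agent.generate_str("Before deleting any files, ask me to confirm.")
    await session.assert_that(Expect.tools.was_called("__human_input__"))
```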