Evaluator-optimizer workflow

When to use it

  • Quality matters and you need an automated reviewer to approve or demand revisions.
  • You have an explicit rubric (score threshold, policy checklist, guardrail) that can be evaluated programmatically.
  • You want traceability: each attempt, its score, and the feedback that drove the next revision.
  • You need to wrap another workflow (router, orchestrator, parallel, even a deterministic function) with an evaluation loop.

How the loop works

create_evaluator_optimizer_llm returns an EvaluatorOptimizerLLM that:
  1. Calls the optimizer to generate an initial response.
  2. Sends the response to the evaluator, which returns an EvaluationResult with:
    • rating: a QualityRating (POOR, FAIR, GOOD, EXCELLENT mapped to 0–3).
    • feedback: free-form comments.
    • needs_improvement: boolean.
    • focus_areas: list of bullet points for the next iteration.
  3. If the rating meets min_rating, the loop stops. Otherwise it regenerates with the optimizer, incorporating the evaluator’s feedback, until max_refinements is reached.
  4. Every attempt is recorded in refinement_history for audit or UI display.
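To make the control flow concrete, here is an illustrative, self-contained sketch of the loop. It is not the library's internal code: EvaluationResult mirrors the fields listed above, and generate_draft/evaluate are hypothetical stand-ins for the optimizer and evaluator LLM calls.

import asyncio
from dataclasses import dataclass, field

@dataclass
class EvaluationResult:
    rating: int                      # 0=POOR .. 3=EXCELLENT
    feedback: str
    needs_improvement: bool
    focus_areas: list[str] = field(default_factory=list)

async def generate_draft(task: str, feedback: str | None) -> str:
    # Placeholder for the optimizer LLM call.
    return f"draft for {task!r}" + (f" (addressing: {feedback})" if feedback else "")

async def evaluate(task: str, response: str) -> EvaluationResult:
    # Placeholder for the evaluator LLM call.
    good_enough = "addressing" in response
    return EvaluationResult(
        rating=3 if good_enough else 1,
        feedback="Add citations.",
        needs_improvement=not good_enough,
    )

async def refine(task: str, min_rating: int = 3, max_refinements: int = 3):
    history, feedback, response = [], None, ""
    for attempt in range(1 + max_refinements):       # initial draft plus refinements
        response = await generate_draft(task, feedback)
        evaluation = await evaluate(task, response)
        history.append({"attempt": attempt, "response": response, "evaluation_result": evaluation})
        if not evaluation.needs_improvement or evaluation.rating >= min_rating:
            break
        feedback = evaluation.feedback               # drives the next draft
    return response, history

print(asyncio.run(refine("Summarise the router pattern")))
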

Quick start

import asyncio

from mcp_agent.app import MCPApp
from mcp_agent.workflows.factory import AgentSpec, RequestParams, create_evaluator_optimizer_llm
from mcp_agent.workflows.evaluator_optimizer.evaluator_optimizer import QualityRating

app = MCPApp(name="eval_opt_example")

async def main():
    async with app.run() as running_app:
        evaluator_optimizer = create_evaluator_optimizer_llm(
            name="policy_checked_writer",
            optimizer=AgentSpec(
                name="draft",
                instruction="Write detailed answers with citations when available.",
            ),
            evaluator=AgentSpec(
                name="compliance",
                instruction=(
                    "Score the response from 0-3.\n"
                    "Reject anything that violates policy or lacks citations."
                ),
            ),
            min_rating=QualityRating.EXCELLENT,  # require top score
            max_refinements=4,
            provider="anthropic",  # evaluator/optimizer can use different providers internally
            request_params=RequestParams(temperature=0.4),
            context=running_app.context,
        )

        result = await evaluator_optimizer.generate_str(
            "Summarise MCP Agent's router pattern for a product manager."
        )

        # Inspect the iteration history
        for attempt in evaluator_optimizer.refinement_history:
            print(
                attempt["attempt"],
                attempt["evaluation_result"].rating,
                attempt["evaluation_result"].feedback,
            )

        return result

if __name__ == "__main__":
    asyncio.run(main())
You can pass an existing AugmentedLLM (router, orchestrator, parallel workflow) as the optimizer instead of an AgentSpec. For evaluators, strings are allowed: if you pass a literal string, the factory spins up an evaluator agent using that instruction.
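For example, a minimal sketch of both options (router_llm is assumed to be an AugmentedLLM you built elsewhere, and the evaluator here is a plain string instruction that the factory turns into an evaluator agent):

evaluator_optimizer = create_evaluator_optimizer_llm(
    name="routed_and_reviewed",
    optimizer=router_llm,  # any AugmentedLLM: router, orchestrator, parallel, ...
    evaluator=(
        "Score the response from 0-3. Flag missing citations, policy violations, "
        "and answers that do not address the question."
    ),
    min_rating=QualityRating.GOOD,
    max_refinements=2,
    context=running_app.context,
)
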

Configuration knobs

  • min_rating: the minimum QualityRating (POOR–EXCELLENT, i.e. 0–3) that ends the loop. Set to None to keep all iterations and let a human pick the best attempt (see the sketch after this list).
  • max_refinements: hard cap on iteration count; default is 3.
  • evaluator: accepts an AgentSpec, Agent, AugmentedLLM, or string instruction. Use this to plug in policy engines or MCP tools that act as judges.
  • request_params: forwarded to both optimizer and evaluator LLMs (temperature, max tokens, strict schema enforcement).
  • llm_factory: automatically injected based on the provider you specify; override if you need custom model selection or instrumentation.
  • evaluator_optimizer.refinement_history: list of dicts containing response and evaluation_result per attempt—useful for UI timelines or telemetry.
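For example, when min_rating is None you can pick the highest-rated attempt yourself. A sketch, assuming each history entry is a dict with the "attempt", "response", and "evaluation_result" keys described above:

result = await evaluator_optimizer.generate_str("Draft the release notes.")

best = max(
    evaluator_optimizer.refinement_history,
    key=lambda entry: entry["evaluation_result"].rating.value,  # 0-3
)
print("Best attempt:", best["attempt"], best["evaluation_result"].rating)
print(best["response"])
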

Pairing with other patterns

  • Router + evaluator: Route to a specialised agent, then run the evaluator loop before returning to the user.
  • Parallel + evaluator: Run multiple evaluators in parallel (e.g. clarity, policy, bias). Feed the aggregated verdict back into the optimizer.
  • Deep research failsafe: Wrap sections of a deep orchestrator plan with an evaluator-optimizer step to enforce domain-specific QA.
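The "parallel + evaluator" combination boils down to fan-out judging followed by a single aggregated verdict. A conceptual sketch (the judge function is a hypothetical placeholder, not an mcp-agent API; in practice each judge would be an evaluator agent):

import asyncio

async def judge(dimension: str, text: str) -> dict:
    # Placeholder judge; returns a per-dimension verdict.
    return {"dimension": dimension, "rating": 2, "feedback": f"{dimension}: needs work"}

async def aggregate_verdict(text: str) -> dict:
    # Run the judges concurrently, then let the strictest one decide.
    verdicts = await asyncio.gather(
        judge("clarity", text), judge("policy", text), judge("bias", text)
    )
    return {
        "rating": min(v["rating"] for v in verdicts),
        "feedback": "\n".join(v["feedback"] for v in verdicts),
        "needs_improvement": any(v["rating"] < 3 for v in verdicts),
    }

print(asyncio.run(aggregate_verdict("draft text")))
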

Operational tips

  • Evaluator instructions should reference the previous feedback: the default refinement prompt asks the optimizer to address each focus area, so make sure your instructions echo that requirement.
  • Call await evaluator_optimizer.get_token_node() to see how many tokens each iteration consumed (optimizer vs evaluator).
  • Log or persist refinement_history when you need postmortem evidence of what the evaluator flagged and how the optimizer reacted.
  • Combine with OpenTelemetry (otel.enabled: true) to capture spans for each iteration, including evaluation scores and decision rationale.
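A sketch of persisting the history as JSON lines (the file name is illustrative; it assumes each entry carries the "attempt", "response", and "evaluation_result" keys described above):

import json

with open("refinement_history.jsonl", "w") as fh:
    for entry in evaluator_optimizer.refinement_history:
        ev = entry["evaluation_result"]
        fh.write(json.dumps({
            "attempt": entry["attempt"],
            "rating": ev.rating.name,
            "feedback": ev.feedback,
            "focus_areas": ev.focus_areas,
            "response": str(entry["response"]),
        }) + "\n")
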

Example projects

I