Offline Evaluation: Testing Before Deployment
Offline evaluation enables you to test agents against curated datasets or simulated scenarios before they face real users. Maxim supports multiple methods for this:

Agent Simulation
Agent simulations are the most effective way to test conversational agents. Instead of static input-output pairs, simulations create dynamic interactions between your agent and a user persona to test multi-turn capabilities.

- Define Personas: Configure user characteristics (e.g., “impatient customer,” “technical user”) to test how the agent adapts.
- Set Scenarios: Define the user’s goal (e.g., “Process a refund”) and the expected steps the agent should take.
- Validate Context: Ensure the agent retains information across multiple exchanges.
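As a rough illustration, a persona-plus-scenario configuration can be thought of as structured data like the following; the field names are assumptions for the sketch, not Maxim's actual simulation schema:

```python
from dataclasses import dataclass, field

@dataclass
class SimulationScenario:
    """Illustrative container for a simulation setup (field names are assumptions)."""
    persona: str                                        # e.g. "impatient customer"
    goal: str                                           # e.g. "Process a refund"
    expected_steps: list = field(default_factory=list)  # steps the agent should take
    max_turns: int = 10                                 # cap on simulated exchanges

refund_test = SimulationScenario(
    persona="impatient customer",
    goal="Process a refund",
    expected_steps=["verify identity", "locate order", "issue refund"],
)
```

The simulator then plays the persona against your agent for up to `max_turns` exchanges and checks the expected steps and retained context along the way.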
HTTP Endpoint Evaluation
If your agent is hosted externally, you can use HTTP Endpoint Evaluation. This enables you to send payloads to your agent’s API endpoint and evaluate the response without code changes.

- Configuration: Connect Maxim to your agent’s URL.
- Payloads: Map your test dataset columns to your API’s input schema (e.g., mapping a `question` column to your API’s `messages` body).
- Multi-turn Support: Configure the endpoint to handle conversation history for realistic stateful testing.
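The column-to-schema mapping above is straightforward to picture in code. A minimal sketch, where the column names, body fields, and helper names are all illustrative rather than Maxim's configuration format:

```python
def build_payload(row, column_map):
    """Map dataset columns onto the API's input schema.

    `column_map` pairs dataset column names with request-body fields,
    e.g. {"question": "message"}. All names here are illustrative.
    """
    return {body_field: row[col] for col, body_field in column_map.items()}

def build_multiturn_payload(row, history):
    """Fold prior turns into the request body for stateful, multi-turn testing."""
    return {"messages": history + [{"role": "user", "content": row["question"]}]}

row = {"question": "Where is my order?"}
payload = build_payload(row, {"question": "message"})
# payload == {"message": "Where is my order?"}
```

For multi-turn runs, each turn's response is appended to `history` before the next request is built, so the endpoint sees the full conversation.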
No-Code & Dataset Evaluation
Test your agentic workflows with Datasets and Evaluators in minutes using the No-Code Agent Builder. Alternatively, if you have pre-generated logs, use Dataset Evaluation to score existing outputs without re-running the agent.

Online Evaluation: Monitoring in Production
Once deployed, use Online Evaluation to monitor agent performance in real time. Maxim uses Distributed Tracing to capture the hierarchy of agent operations:

- Sessions: Evaluate the full multi-turn conversation (e.g., “Did the user achieve their goal?”).
- Traces: Evaluate individual interactions or turns within a session.
- Spans: Evaluate specific steps, such as a Retrieval (RAG) step or a Tool Call.
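As a mental model, this hierarchy is nested records, with evaluators attachable at each level. The shape below is a hypothetical sketch for illustration, not Maxim's actual trace format:

```python
# Illustrative trace hierarchy: a session contains traces (turns),
# and each trace contains spans (steps such as retrieval or tool calls).
session = {
    "traces": [
        {
            "input": "Where is my order?",
            "spans": [
                {"type": "retrieval", "docs": ["order #123 status: shipped"]},
                {"type": "tool_call", "name": "lookup_order", "ok": True},
            ],
        }
    ]
}

def spans_of_type(session, span_type):
    """Collect spans of one type so a span-level evaluator can score them."""
    return [s for t in session["traces"] for s in t["spans"] if s["type"] == span_type]
```

A session-level evaluator would look at the whole `session`, a trace-level one at each turn, and a span-level one at the output of `spans_of_type`.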
Automating Evaluations with the SDK
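The SDK's exact call signatures aren't reproduced in this document, so the snippet below is only a hedged, self-contained sketch of the CI pattern: run each dataset row through the agent, score it with your evaluators, and gate the pipeline on the aggregate. Every name here (`run_offline_eval`, the dataset shape, the evaluator signature) is an illustrative assumption, not the Maxim SDK's actual API; consult the SDK reference for real method names.

```python
def run_offline_eval(agent_fn, dataset, evaluators, threshold=0.8):
    """Run every dataset row through the agent, score it, and gate CI on the mean."""
    scores = []
    for row in dataset:
        output = agent_fn(row["input"])
        # Average all evaluator scores for this row (each evaluator returns 0.0-1.0).
        scores.append(
            sum(ev(output, row.get("expected")) for ev in evaluators) / len(evaluators)
        )
    mean = sum(scores) / len(scores)
    return {"mean_score": mean, "passed": mean >= threshold}

# Toy usage: an echo agent checked with an exact-match evaluator.
result = run_offline_eval(
    agent_fn=lambda x: x,
    dataset=[{"input": "hi", "expected": "hi"}],
    evaluators=[lambda out, exp: 1.0 if out == exp else 0.0],
)
# result["passed"] is True
```

A real CI step would swap the lambda for a call into your deployed agent and fail the build when `passed` is False.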
For Continuous Integration (CI/CD), use the Maxim SDK to trigger evaluations programmatically as part of your pipeline, so agent regressions are caught before deployment.

Evaluators
Maxim supports various types of evaluators. You can use pre-built evaluators from the evaluator store or build your own custom evaluators according to your specific use case.

| Evaluator Type | Description | Examples | Documentation |
|---|---|---|---|
| AI Evaluators | Use LLMs to judge subjective quality | Faithfulness, Hallucination, Tone | AI Evaluators |
| Statistical Evaluators | Standard ML metrics for text comparison | BLEU, ROUGE, Exact Match | Statistical Evaluators |
| Programmatic Evaluators | Code-based assertions (JS/Python) | JSON schema validation, Regex checks | Programmatic Evaluators |
| Human Evaluators | Manual review by domain experts | Expert review, RLHF annotation | Human Annotation |
Building Custom Evaluators
While pre-built evaluators cover common use cases, Custom Evaluators are essential for validating domain-specific business rules and proprietary logic. Maxim allows you to build these tailored evaluators in the following ways:

Custom AI Evaluators (LLM-as-a-Judge)
Use this when you need to evaluate subjective criteria specific to your domain (e.g., “Is this response compliant with our brand voice?”).

- Configuration: You select the judge model (e.g., GPT-4, Claude 3.5 Sonnet) and define the Evaluation Instructions in the Definition Tab.
- Scoring Options: Configure the output as Binary (Pass/Fail), Scale (1-5), or Categorical (e.g., “Compliant”, “Minor Violation”, “Critical Violation”).
- Best For: Nuanced judgments that require reasoning, such as medical advice safety or legal disclaimer checks.
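To make the configuration concrete, here is a hedged sketch of what an LLM-as-a-judge setup reduces to: an instruction prompt, and a parser that normalizes the judge model's raw answer into one of the categorical labels. The prompt wording and the fail-closed fallback are illustrative choices, not Maxim defaults:

```python
JUDGE_INSTRUCTIONS = """You are a compliance reviewer. Classify the response as one of:
Compliant, Minor Violation, Critical Violation. Answer with the label only."""

ALLOWED_LABELS = {"Compliant", "Minor Violation", "Critical Violation"}

def build_judge_prompt(response_text):
    """Assemble the prompt sent to the judge model (wording is illustrative)."""
    return f"{JUDGE_INSTRUCTIONS}\n\nResponse to evaluate:\n{response_text}"

def parse_verdict(raw):
    """Normalize the judge model's raw answer to a known categorical label."""
    label = raw.strip()
    # Fail closed: anything unrecognized is treated as the worst category.
    return label if label in ALLOWED_LABELS else "Critical Violation"
```

The same structure works for Binary or Scale scoring; only the allowed labels and the parsing step change.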
Custom Programmatic Evaluators
Use this for deterministic, logic-based checks that code can handle better than an LLM.

- How it works: Write a `validate` function in Python or JavaScript directly in the Maxim UI.
- Logic: You can use regex to validate formats (e.g., “Must start with ‘SKU-’”), check for specific keywords, or validate complex JSON structures.
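As an illustration of such a check, the sketch below validates a JSON output whose `sku` field must match the “SKU-” format. The function name mirrors the `validate` hook described above, though the exact signature and return shape Maxim expects may differ:

```python
import json
import re

def validate(output):
    """Deterministic checks: well-formed JSON and an SKU prefix format.

    Return shape ({"score", "reason"}) is illustrative, not Maxim's contract.
    """
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return {"score": 0, "reason": "output is not valid JSON"}
    if not re.match(r"^SKU-\d+$", data.get("sku", "")):
        return {"score": 0, "reason": "sku must match SKU-<digits>"}
    return {"score": 1, "reason": "ok"}
```

Because the logic is pure code, the result is reproducible and costs nothing per run, unlike an LLM judge.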
API-Based Evaluators
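In practice, wrapping an external service mostly means translating its response into the score format your quality dashboard expects. A minimal sketch, assuming a hypothetical compliance API that returns a 0-100 `score` and a `label` (both field names are assumptions):

```python
def normalize_external_score(response, max_score=100):
    """Map an external scoring service's response onto a 0-1 score.

    The response shape ({"score": ..., "label": ...}) is an assumption about
    a hypothetical compliance API, shown only to illustrate the wrapping step.
    """
    raw = float(response.get("score", 0))
    return {
        "score": max(0.0, min(1.0, raw / max_score)),  # clamp into [0, 1]
        "label": response.get("label", "unknown"),
    }
```

The actual HTTP call to your service would happen just before this normalization step; keeping the two separate makes the wrapper easy to unit-test.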
Connect your evaluation pipeline to external systems. If you have an existing scoring service or a specialized compliance API, you can wrap it as a Custom Evaluator to maintain a unified quality view within Maxim.

Human-in-the-Loop Evaluations
For high-stakes validation or subjective quality assessment, Maxim provides a robust Human-in-the-Loop (HITL) pipeline. This allows subject matter experts (SMEs), domain specialists, or internal team members to review AI outputs alongside automated metrics.

Annotation Methods
Maxim supports two primary workflows for collecting human feedback:

| Method | Best For | How it Works |
|---|---|---|
| Annotate on Report | Internal Teams | Team members with Maxim access can directly rate entries within the test run report. Multiple team members can rate the same entry, and the platform calculates the average score. |
| Send via Email | External SMEs & Clients | Send evaluation requests to external reviewers (e.g., doctors, lawyers) via email. They receive a secure link to a Rater Dashboard where they can review outputs without needing a paid Maxim seat. |
Best Practices
- Curate from Production: Use Data Curation to turn failed production traces into test cases for your offline datasets.
- Test Tool Use: Specifically for agents, use the Tool Call Accuracy evaluator to verify that your agent selects the correct tools and parameters.
- Use Synthetic Data: If you lack test data, use Synthetic Data Generation to create datasets with edge cases and diverse user inputs.
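As a sketch of the first practice, curation can be as simple as filtering logged traces by score and reshaping them into dataset rows; the trace fields used here are illustrative, since real traces carry whatever metadata your logging attaches:

```python
def curate_failed_traces(traces, score_key="score", threshold=0.5):
    """Turn low-scoring production traces into offline test cases."""
    return [
        {"input": t["input"], "expected": t.get("corrected_output")}
        for t in traces
        if t.get(score_key, 1.0) < threshold  # unscored traces are skipped
    ]

traces = [
    {"input": "refund order 9", "score": 0.2, "corrected_output": "Refund issued."},
    {"input": "hello", "score": 0.9},
]
cases = curate_failed_traces(traces)
# cases == [{"input": "refund order 9", "expected": "Refund issued."}]
```

Feeding these curated cases back into your offline datasets closes the loop between production monitoring and pre-deployment testing.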