Offline Evaluation: Testing Before Deployment
Offline evaluation enables you to test agents against curated datasets or simulated scenarios before they face real users. Maxim supports multiple methods for this:

Agent Simulation
Agent simulations are the most effective way to test conversational agents. Instead of static input-output pairs, simulations create dynamic interactions between your agent and a user persona to test multi-turn capabilities.

- Define Personas: Configure user characteristics (e.g., “impatient customer,” “technical user”) to test how the agent adapts.
- Set Scenarios: Define the user’s goal (e.g., “Process a refund”) and the expected steps the agent should take.
- Validate Context: Ensure the agent retains information across multiple exchanges.
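As a rough illustration, a persona-plus-scenario configuration can be thought of as structured data like the following; the field names are assumptions for the sketch, not Maxim's actual simulation schema:

```python
from dataclasses import dataclass, field

@dataclass
class SimulationScenario:
    """Illustrative container for a simulation setup (field names are assumptions)."""
    persona: str                                        # e.g. "impatient customer"
    goal: str                                           # e.g. "Process a refund"
    expected_steps: list = field(default_factory=list)  # steps the agent should take
    max_turns: int = 10                                 # cap on simulated exchanges

refund_test = SimulationScenario(
    persona="impatient customer",
    goal="Process a refund",
    expected_steps=["verify identity", "locate order", "issue refund"],
)
```

The simulator then plays the persona against your agent for up to `max_turns` exchanges and checks the expected steps and retained context along the way.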
HTTP Endpoint Evaluation
If your agent is hosted externally, you can use HTTP Endpoint Evaluation. This enables you to send payloads to your agent’s API endpoint and evaluate the response without code changes.

- Configuration: Connect Maxim to your agent’s URL.
- Payloads: Map your test dataset columns to your API’s input schema (e.g., mapping a `question` column to your API’s `messages` body).
- Multi-turn Support: Configure the endpoint to handle conversation history for realistic stateful testing.
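The column-to-schema mapping above is straightforward to picture in code. A minimal sketch, where the column names, body fields, and helper names are all illustrative rather than Maxim's configuration format:

```python
def build_payload(row, column_map):
    """Map dataset columns onto the API's input schema.

    `column_map` pairs dataset column names with request-body fields,
    e.g. {"question": "message"}. All names here are illustrative.
    """
    return {body_field: row[col] for col, body_field in column_map.items()}

def build_multiturn_payload(row, history):
    """Fold prior turns into the request body for stateful, multi-turn testing."""
    return {"messages": history + [{"role": "user", "content": row["question"]}]}

row = {"question": "Where is my order?"}
payload = build_payload(row, {"question": "message"})
# payload == {"message": "Where is my order?"}
```

For multi-turn runs, each turn's response is appended to `history` before the next request is built, so the endpoint sees the full conversation.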
No-Code & Dataset Evaluation
Test your agentic workflows with Datasets and Evaluators in minutes using the No-Code Agent Builder. Alternatively, if you have pre-generated logs, use Dataset Evaluation to score existing outputs without re-running the agent.

Online Evaluation: Monitoring in Production
Once deployed, use Online Evaluation to monitor agent performance in real time. Maxim uses Distributed Tracing to capture the hierarchy of agent operations:

- Sessions: Evaluate the full multi-turn conversation (e.g., “Did the user achieve their goal?”).
- Traces: Evaluate individual interactions or turns within a session.
- Spans: Evaluate specific steps, such as a Retrieval (RAG) step or a Tool Call.
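As a mental model, this hierarchy is nested records, with evaluators attachable at each level. The shape below is a hypothetical sketch for illustration, not Maxim's actual trace format:

```python
# Illustrative trace hierarchy: a session contains traces (turns),
# and each trace contains spans (steps such as retrieval or tool calls).
session = {
    "traces": [
        {
            "input": "Where is my order?",
            "spans": [
                {"type": "retrieval", "docs": ["order #123 status: shipped"]},
                {"type": "tool_call", "name": "lookup_order", "ok": True},
            ],
        }
    ]
}

def spans_of_type(session, span_type):
    """Collect spans of one type so a span-level evaluator can score them."""
    return [s for t in session["traces"] for s in t["spans"] if s["type"] == span_type]
```

A session-level evaluator would look at the whole `session`, a trace-level one at each turn, and a span-level one at the output of `spans_of_type`.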
Automating Evaluations with the SDK
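The SDK's exact call signatures aren't reproduced in this document, so the snippet below is only a hedged, self-contained sketch of the CI pattern: run each dataset row through the agent, score it with your evaluators, and gate the pipeline on the aggregate. Every name here (`run_offline_eval`, the dataset shape, the evaluator signature) is an illustrative assumption, not the Maxim SDK's actual API; consult the SDK reference for real method names.

```python
def run_offline_eval(agent_fn, dataset, evaluators, threshold=0.8):
    """Run every dataset row through the agent, score it, and gate CI on the mean."""
    scores = []
    for row in dataset:
        output = agent_fn(row["input"])
        # Average all evaluator scores for this row (each evaluator returns 0.0-1.0).
        scores.append(
            sum(ev(output, row.get("expected")) for ev in evaluators) / len(evaluators)
        )
    mean = sum(scores) / len(scores)
    return {"mean_score": mean, "passed": mean >= threshold}

# Toy usage: an echo agent checked with an exact-match evaluator.
result = run_offline_eval(
    agent_fn=lambda x: x,
    dataset=[{"input": "hi", "expected": "hi"}],
    evaluators=[lambda out, exp: 1.0 if out == exp else 0.0],
)
# result["passed"] is True
```

A real CI step would swap the lambda for a call into your deployed agent and fail the build when `passed` is False.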
For Continuous Integration (CI/CD), use the Maxim SDK to trigger evaluations programmatically as part of your pipeline, so agent regressions are caught before deployment.

Evaluators
Maxim supports various types of evaluators. You can use pre-built evaluators from the evaluator store or build your own custom evaluators according to your specific use case.

| Evaluator Type | Description | Examples | Documentation |
|---|---|---|---|
| AI Evaluators | Use LLMs to judge subjective quality | Faithfulness, Hallucination, Tone | AI Evaluators |
| Statistical Evaluators | Standard ML metrics for text comparison | BLEU, ROUGE, Exact Match | Statistical Evaluators |
| Programmatic Evaluators | Code-based assertions (JS/Python) | JSON schema validation, Regex checks | Programmatic Evaluators |
| Human Evaluators | Manual review by domain experts | Expert review, RLHF annotation | Human Annotation |
Building Custom Evaluators
While pre-built evaluators cover common use cases, Custom Evaluators are essential for validating domain-specific business rules and proprietary logic. Maxim allows you to build these tailored evaluators in the following ways:

Custom AI Evaluators (LLM-as-a-Judge)
Use this when you need to evaluate subjective criteria specific to your domain (e.g., “Is this response compliant with our brand voice?”).

- Configuration: You select the judge model (e.g., GPT-4, Claude 3.5 Sonnet) and define the Evaluation Instructions in the Definition Tab.
- Scoring Options: Configure the output as Binary (Pass/Fail), Scale (1-5), or Categorical (e.g., “Compliant”, “Minor Violation”, “Critical Violation”).
- Best For: Nuanced judgments that require reasoning, such as medical advice safety or legal disclaimer checks.
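To make the configuration concrete, here is a hedged sketch of what an LLM-as-a-judge setup reduces to: an instruction prompt, and a parser that normalizes the judge model's raw answer into one of the categorical labels. The prompt wording and the fail-closed fallback are illustrative choices, not Maxim defaults:

```python
JUDGE_INSTRUCTIONS = """You are a compliance reviewer. Classify the response as one of:
Compliant, Minor Violation, Critical Violation. Answer with the label only."""

ALLOWED_LABELS = {"Compliant", "Minor Violation", "Critical Violation"}

def build_judge_prompt(response_text):
    """Assemble the prompt sent to the judge model (wording is illustrative)."""
    return f"{JUDGE_INSTRUCTIONS}\n\nResponse to evaluate:\n{response_text}"

def parse_verdict(raw):
    """Normalize the judge model's raw answer to a known categorical label."""
    label = raw.strip()
    # Fail closed: anything unrecognized is treated as the worst category.
    return label if label in ALLOWED_LABELS else "Critical Violation"
```

The same structure works for Binary or Scale scoring; only the allowed labels and the parsing step change.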
Custom Programmatic Evaluators
Use this for deterministic, logic-based checks that code can handle better than an LLM.

- How it works: Write a `validate` function in Python or JavaScript directly in the Maxim UI.
- Logic: You can use regex to validate formats (e.g., “Must start with ‘SKU-’”), check for specific keywords, or validate complex JSON structures.
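As an illustration of such a check, the sketch below validates a JSON output whose `sku` field must match the “SKU-” format. The function name mirrors the `validate` hook described above, though the exact signature and return shape Maxim expects may differ:

```python
import json
import re

def validate(output):
    """Deterministic checks: well-formed JSON and an SKU prefix format.

    Return shape ({"score", "reason"}) is illustrative, not Maxim's contract.
    """
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return {"score": 0, "reason": "output is not valid JSON"}
    if not re.match(r"^SKU-\d+$", data.get("sku", "")):
        return {"score": 0, "reason": "sku must match SKU-<digits>"}
    return {"score": 1, "reason": "ok"}
```

Because the logic is pure code, the result is reproducible and costs nothing per run, unlike an LLM judge.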
API-Based Evaluators
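In practice, wrapping an external service mostly means translating its response into the score format your quality dashboard expects. A minimal sketch, assuming a hypothetical compliance API that returns a 0-100 `score` and a `label` (both field names are assumptions):

```python
def normalize_external_score(response, max_score=100):
    """Map an external scoring service's response onto a 0-1 score.

    The response shape ({"score": ..., "label": ...}) is an assumption about
    a hypothetical compliance API, shown only to illustrate the wrapping step.
    """
    raw = float(response.get("score", 0))
    return {
        "score": max(0.0, min(1.0, raw / max_score)),  # clamp into [0, 1]
        "label": response.get("label", "unknown"),
    }
```

The actual HTTP call to your service would happen just before this normalization step; keeping the two separate makes the wrapper easy to unit-test.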
Connect your evaluation pipeline to external systems. If you have an existing scoring service or a specialized compliance API, you can wrap it as a Custom Evaluator to maintain a unified quality view within Maxim.

Human-in-the-Loop Evaluations
For high-stakes validation or subjective quality assessment, Maxim provides a robust Human-in-the-Loop (HITL) pipeline. This allows subject matter experts (SMEs), domain specialists, or internal team members to review AI outputs alongside automated metrics.

Annotation Methods
Maxim supports two primary workflows for collecting human feedback:

| Method | Best For | How it Works |
|---|---|---|
| Annotate on Report | Internal Teams | Team members with Maxim access can directly rate entries within the test run report. Multiple team members can rate the same entry, and the platform calculates the average score. |
| Send via Email | External SMEs & Clients | Send evaluation requests to external reviewers (e.g., doctors, lawyers) via email. They receive a secure link to a Rater Dashboard where they can review outputs without needing a paid Maxim seat. |
Best Practices
- Curate from Production: Use Data Curation to turn failed production traces into test cases for your offline datasets.
- Test Tool Use: Specifically for agents, use the Tool Call Accuracy evaluator to verify that your agent selects the correct tools and parameters.
- Use Synthetic Data: If you lack test data, use Synthetic Data Generation to create datasets with edge cases and diverse user inputs.
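As a sketch of the first practice, curation can be as simple as filtering logged traces by score and reshaping them into dataset rows; the trace fields used here are illustrative, since real traces carry whatever metadata your logging attaches:

```python
def curate_failed_traces(traces, score_key="score", threshold=0.5):
    """Turn low-scoring production traces into offline test cases."""
    return [
        {"input": t["input"], "expected": t.get("corrected_output")}
        for t in traces
        if t.get(score_key, 1.0) < threshold  # unscored traces are skipped
    ]

traces = [
    {"input": "refund order 9", "score": 0.2, "corrected_output": "Refund issued."},
    {"input": "hello", "score": 0.9},
]
cases = curate_failed_traces(traces)
# cases == [{"input": "refund order 9", "expected": "Refund issued."}]
```

Feeding these curated cases back into your offline datasets closes the loop between production monitoring and pre-deployment testing.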