Why Evaluate Locally?
Evaluating locally is ideal when:
- You are in the active development phase and want to iterate quickly without deployment cycles.
- Your agent has complex orchestration logic (e.g., multi-agent systems in CrewAI) that is hard to expose via a simple API.
- You need to capture granular metadata like token usage, cost, and latency directly from your local execution environment.
How It Works
To test a local agent, you define a wrapper function that accepts LocalData (the input row from your dataset) and returns a YieldedOutput object containing the agent's response and metadata. Maxim's SDK then iterates through your test dataset, runs this function for each entry, and scores the results.
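Conceptually, the test run reduces to a loop like the sketch below. Plain dicts stand in here for the SDK's LocalData and YieldedOutput models, and score is a hypothetical placeholder for the evaluators you configure; only the shape of the contract is the point:

```python
def my_agent(row: dict) -> dict:
    # Your wrapper: receives one dataset row (LocalData in the real SDK)
    # and returns the agent's response (YieldedOutput in the real SDK).
    return {"data": f"answer to: {row['input']}"}


def score(output: dict, row: dict) -> float:
    # Hypothetical evaluator stand-in; real runs use Maxim's evaluators.
    return 1.0 if output["data"] else 0.0


dataset = [{"input": "q1"}, {"input": "q2"}]

# The SDK iterates the dataset, invokes the wrapper once per row,
# and scores each result.
results = [(my_agent(row), row) for row in dataset]
scores = [score(out, row) for out, row in results]
print(scores)
```

Because the wrapper runs in-process, anything it touches (local files, in-memory state, intermediate tool calls) is available for reporting.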
Define the Agent Wrapper
Your function must take LocalData as input and return a YieldedOutput. This object allows you to report not just the output, but also retrieved context (for RAG evaluation), token usage, and costs.

Code Example: Testing a CrewAI Agent
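The following sketch shows the wrapper pattern for a CrewAI crew. To keep the snippet self-contained, simplified stand-ins replace the real imports (in a project you would use `from crewai import Agent, Task, Crew` and `from maxim.models import LocalData, YieldedOutput`), so treat the class shapes as assumptions rather than the exact SDK surface:

```python
from dataclasses import dataclass


# Stand-ins for maxim.models types; in real use:
#   from maxim.models import LocalData, YieldedOutput
@dataclass
class LocalData:
    input: str


@dataclass
class YieldedOutput:
    data: str


# Stand-in for a CrewAI crew; in real use you would assemble it from
# crewai Agent and Task objects and call its kickoff() method.
class Crew:
    def kickoff(self, inputs: dict) -> str:
        # A real crew orchestrates its agents here.
        return f"Crew answer for: {inputs['query']}"


crew = Crew()


def run_crew(row: LocalData) -> YieldedOutput:
    """Wrapper the Maxim SDK calls once per dataset row."""
    result = crew.kickoff(inputs={"query": row.input})
    return YieldedOutput(data=str(result))


# The SDK would invoke run_crew for each row and score the outputs;
# calling it directly shows the shape of the contract.
print(run_crew(LocalData(input="Summarize our Q3 report")).data)
```

Once the wrapper is registered with a test run, the orchestration logic inside the crew stays untouched: only the entry point changes.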
Best Practices
- Modular Design: Break your agent logic into smaller, testable functions so you can wrap them easily.
- Metadata Tracking: Always populate the YieldedOutputMeta fields (latency, tokens, cost). This ensures your dashboard reports are accurate and helps you track resource consumption.
- Context Handling: If your agent performs retrieval or multi-step reasoning, pass those intermediate outputs in the retrieved_context_to_evaluate field. This allows AI evaluators to judge whether the agent's reasoning was sound, not just whether the final answer was correct.
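The metadata and context practices above can be sketched as follows. The field names mirror the YieldedOutput and YieldedOutputMeta models described in this section, but the dataclasses here are simplified stand-ins (the real types live in maxim.models), and the token and cost figures are illustrative placeholders:

```python
import time
from dataclasses import dataclass, field


# Simplified stand-ins mirroring maxim.models; real code imports them instead.
@dataclass
class YieldedOutputMeta:
    usage: dict = field(default_factory=dict)   # token counts
    cost: dict = field(default_factory=dict)    # per-call cost amounts
    latency_ms: float = 0.0


@dataclass
class YieldedOutput:
    data: str
    retrieved_context_to_evaluate: list = field(default_factory=list)
    meta: YieldedOutputMeta = None


def answer_with_rag(question: str) -> YieldedOutput:
    start = time.monotonic()

    # Hypothetical retrieval step; pass its results on so AI evaluators
    # can judge the reasoning, not just the final answer.
    chunks = ["Local evaluation runs the agent in-process."]
    answer = f"Based on {len(chunks)} chunk(s): {chunks[0]}"

    # Illustrative numbers; in practice, read these from your LLM client.
    meta = YieldedOutputMeta(
        usage={"prompt_tokens": 120, "completion_tokens": 40, "total_tokens": 160},
        cost={"input": 0.00012, "output": 0.00008, "total": 0.0002},
        latency_ms=(time.monotonic() - start) * 1000,
    )
    return YieldedOutput(
        data=answer,
        retrieved_context_to_evaluate=chunks,
        meta=meta,
    )


out = answer_with_rag("What does local evaluation mean?")
print(out.meta.usage["total_tokens"], len(out.retrieved_context_to_evaluate))
```

Populating every field on each run keeps dashboard aggregates (cost per run, latency percentiles) trustworthy, and the attached context gives evaluators the evidence the agent actually used.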