What are Offline Evaluations via Logging?
Offline evaluations allow you to test and validate your AI Agent before it goes live with end users. Unlike online evaluations that run in production, offline evals give you the opportunity to:
- Test against a curated set of inputs with expected outputs
- Validate tool calls, retrieved context, and generation quality
- Run evaluations in a controlled environment
- Iterate quickly without impacting real users
By combining Maxim’s logging SDK with the withEvaluators function, you can capture every interaction of your AI system and automatically run evaluations against expected outcomes.
Prerequisites
Before you start, ensure you have:
- Maxim SDK installed in your project
- API key from the Maxim platform
- Log repository created in your Maxim workspace
Getting Started
Step 1: Install the SDK
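If you are using the Node SDK, installation looks roughly like this (the package name @maximai/maxim-js is an assumption; install the package that matches your language and check the Maxim docs for the exact name):

```bash
npm install @maximai/maxim-js
```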
Step 2: Initialize the Logger
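A minimal sketch of logger setup for the Node SDK, assuming a Maxim client class and a logger factory keyed by your log repository ID (exact constructor options and method names may differ by SDK version):

```ts
import { Maxim } from "@maximai/maxim-js";

// Create the client with your Maxim API key (from the platform settings).
const maxim = new Maxim({ apiKey: process.env.MAXIM_API_KEY! });

// The logger writes to the log repository you created in your workspace.
// Assumes an async factory; run this inside an async context.
const logger = await maxim.logger({ id: "<LOG_REPOSITORY_ID>" });
```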
Step 3: Log a Trace or Span
Before you log data, it is helpful to understand the hierarchy of Maxim’s logging objects:
- Trace: A trace represents a single interaction or request in your application (e.g. a user query). This is the core unit of logging.
- Span: A span represents a unit of work within a trace (e.g. a retrieval step, a generation step, or a custom function execution).
- Session (Optional): A session is a logical grouping of multiple traces (e.g. a multi-turn conversation).
Basic Workflow: Trace -> Span
The most common workflow is to create a trace for a single test case and add spans to it.
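A minimal sketch of that flow, building on the logger from Step 2; the trace and span helper names shown here are assumptions, so adjust them to your SDK version:

```ts
// One trace per test case.
const trace = logger.trace({ id: "test-case-001", name: "capital-question" });
trace.input("What is the capital of France?");

// A span for one unit of work inside the trace, e.g. an answer-generation step.
const span = trace.span({ id: "test-case-001-generate", name: "generate-answer" });
// ... do the work and log details on the span ...
span.end();

trace.output("The capital of France is Paris.");
trace.end();
```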
Session -> Trace -> Span
If you need to group multiple traces together (e.g. for a chat session), you can wrap them in a session.
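A sketch of the session variant, again with assumed helper names:

```ts
// A session groups related traces, e.g. the turns of one conversation.
const session = logger.session({ id: "chat-session-42", name: "support-chat" });

// Each turn becomes its own trace inside the session.
const turn1 = session.trace({ id: "chat-session-42-turn-1", name: "turn-1" });
turn1.input("Hi, I need help with my order.");
turn1.output("Sure, can you share your order ID?");
turn1.end();

session.end();
```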
Step 4: Log Generations, Retrievals, and Errors
Detailed logging allows you to debug issues and run granular evaluations. You can log LLM calls (Generations), context fetching (Retrievals), and any errors that occur.
Generations (LLM Calls)
Track each LLM call within your trace to capture detailed information about model interactions, including prompt, completion, and usage stats.
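A hedged sketch of logging a generation on the trace from Step 3; the field names and the OpenAI-style result payload are assumptions based on common SDK shapes:

```ts
// Register the LLM call with its prompt and model parameters.
const generation = trace.generation({
  id: "gen-001",
  name: "answer-generation",
  provider: "openai",
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "What is the capital of France?" }],
  modelParameters: { temperature: 0 },
});

// Attach the completion and usage stats once the call returns.
generation.result({
  id: "chatcmpl-123",
  object: "chat.completion",
  created: Math.floor(Date.now() / 1000),
  model: "gpt-4o-mini",
  choices: [
    {
      index: 0,
      message: { role: "assistant", content: "The capital of France is Paris." },
      finish_reason: "stop",
    },
  ],
  usage: { prompt_tokens: 12, completion_tokens: 8, total_tokens: 20 },
});
```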
Retrievals (RAG)
For RAG systems, logging retrieval steps helps you evaluate the quality of your context separately from the generation.
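A sketch of logging a retrieval step (helper names assumed):

```ts
// Log the query and the retrieved chunks so context quality can be evaluated on its own.
const retrieval = trace.retrieval({ id: "ret-001", name: "vector-search" });
retrieval.input("capital of France");
retrieval.output([
  "France is a country in Western Europe.",
  "Paris is the capital and most populous city of France.",
]);
```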
Tool Calls
If your agent uses tools (e.g., function calling), logging these interactions allows you to evaluate tool usage accuracy.
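A sketch of logging a tool call (field names assumed; get_order_status is a hypothetical tool):

```ts
// Register the tool invocation the agent made, with its arguments.
const toolCall = trace.toolCall({
  id: "tool-001",
  name: "get_order_status",
  description: "Looks up the status of a customer order",
  args: JSON.stringify({ orderId: "A-1001" }),
});

// Record the tool's result once it returns.
toolCall.result(JSON.stringify({ status: "shipped", eta: "2024-06-01" }));
```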
Custom Metrics
In addition to running evaluators, you may want to log custom numeric metrics such as cost, latency, or pre-computed scores. You can use the addMetric method (or add_metric in Python) on any entity (trace, generation, retrieval, or session).
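For example, on the trace and generation from the earlier steps (the (name, value) signature shown here is an assumption):

```ts
// Custom numeric metrics show up alongside evaluator scores in the dashboard.
trace.addMetric("total_cost_usd", 0.0042);
generation.addMetric("latency_ms", 840);
```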
Errors
Capturing errors is crucial for debugging. You can log errors on any entity (trace, span, generation, or tool call).
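A sketch of logging an error on the generation from the earlier step (the payload shape is an assumption; similar helpers are expected on traces, spans, and tool calls):

```ts
// Record the failure on the entity where it happened.
generation.error({
  message: "Rate limit exceeded while calling the model",
  code: "429",
  type: "RateLimitError",
});
```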
Running Evaluators
You can configure evaluations to run automatically on logs pushed via the SDK. In your log repository dashboard, click “Configure evaluation” and choose the evaluators to run on your traces or sessions. Set the sampling rate to 100% and remove all applied filters so that evaluations run on every log.
Attaching Evaluators via SDK
The withEvaluators function allows you to attach evaluators to any component of your trace (the trace itself, spans, generations, or retrievals). Evaluators run automatically once all required variables are provided.
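A minimal sketch of attaching evaluators to a trace; the evaluate accessor is an assumption, and the evaluator slugs must match evaluators configured in your workspace:

```ts
// Evaluators stay pending until all of their required variables are provided.
trace.evaluate.withEvaluators("semantic-similarity", "faithfulness");
```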
Providing Variables for Evaluation
Evaluators require specific variables to perform their assessment. Use the withVariables method to provide these values:
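A sketch of providing variables, assuming withVariables takes a map of values plus the list of attached evaluators they apply to (the variable keys depend on each evaluator's configuration):

```ts
trace.evaluate.withVariables(
  {
    input: "What is the capital of France?",
    output: "The capital of France is Paris.",
    expectedOutput: "Paris is the capital of France.",
  },
  ["semantic-similarity"] // which attached evaluators these variables are for (argument assumed)
);
```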
Chaining Evaluators and Variables
You can chain withEvaluators and withVariables together for cleaner code:
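A sketch of the chained form, under the same assumptions as above:

```ts
trace.evaluate
  .withEvaluators("semantic-similarity")
  // Variables apply to the evaluators attached in the same chain (assumed behavior).
  .withVariables({
    output: "The capital of France is Paris.",
    expectedOutput: "Paris is the capital of France.",
  });
```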
Putting it all together
Here’s a comprehensive example that demonstrates running offline evaluations with expected outputs:
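A hedged end-to-end sketch: it logs one test case with a tag and an expected output, then attaches a comparison evaluator. runAgent is a placeholder for your own agent, and the SDK identifiers (including addTag and cleanup) follow the assumptions used in the earlier steps:

```ts
import { Maxim } from "@maximai/maxim-js";

// Placeholder for your agent; replace with your real call.
async function runAgent(input: string): Promise<string> {
  return "The capital of France is Paris.";
}

const maxim = new Maxim({ apiKey: process.env.MAXIM_API_KEY! });
const logger = await maxim.logger({ id: "<LOG_REPOSITORY_ID>" });

const testCase = {
  id: "offline-eval-001",
  input: "What is the capital of France?",
  expectedOutput: "Paris is the capital of France.",
};

// One trace per test case, tagged so offline runs are easy to filter later.
const trace = logger.trace({ id: testCase.id, name: "offline-eval" });
trace.addTag("test_type", "offline_eval");
trace.input(testCase.input);

const answer = await runAgent(testCase.input);
trace.output(answer);

// Compare the actual output against the expected output.
trace.evaluate
  .withEvaluators("semantic-similarity")
  .withVariables({
    input: testCase.input,
    output: answer,
    expectedOutput: testCase.expectedOutput,
  });

trace.end();

// Flush pending logs before the process exits (method name assumed).
await maxim.cleanup();
```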
Example: RAG System
For RAG (Retrieval-Augmented Generation) systems, you can evaluate both retrieval quality and generation accuracy:
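A sketch under the same assumptions, evaluating the retrieval and the generation separately; the evaluator slugs and variable keys (context, expectedOutput) are assumptions that should match your workspace configuration:

```ts
const question = "What is our refund policy for digital goods?";
const ragTrace = logger.trace({ id: "rag-eval-001", name: "rag-offline-eval" });
ragTrace.input(question);

// 1. Log the retrieval step and evaluate context quality on it.
const retrieval = ragTrace.retrieval({ id: "rag-eval-001-retrieval", name: "kb-search" });
retrieval.input(question);
const retrievedDocs = [
  "Digital goods can be refunded within 14 days of purchase.",
  "Refunds are issued to the original payment method.",
];
retrieval.output(retrievedDocs);
retrieval.evaluate
  .withEvaluators("context-relevance")
  .withVariables({ input: question, context: retrievedDocs.join("\n") });

// 2. Log the answer and evaluate grounding plus similarity to the expected output.
const answer = "You can get a refund on digital goods within 14 days of purchase.";
ragTrace.output(answer);
ragTrace.evaluate
  .withEvaluators("faithfulness", "semantic-similarity")
  .withVariables({
    input: question,
    output: answer,
    context: retrievedDocs.join("\n"),
    expectedOutput: "Digital goods are refundable within 14 days of purchase.",
  });

ragTrace.end();
```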
Example: Tool Calls
For agent workflows that include tool calls, you can validate that the correct tools are being called:
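A sketch under the same assumptions; the tool-call-accuracy slug and the toolCalls/expectedToolCalls variable keys are assumptions, and get_weather_forecast is a hypothetical tool:

```ts
const toolQuestion = "What's the weather in Paris tomorrow?";
const toolTrace = logger.trace({ id: "tool-eval-001", name: "tool-offline-eval" });
toolTrace.input(toolQuestion);

// Log the tool call the agent actually made.
const weatherCall = toolTrace.toolCall({
  id: "tool-eval-001-call",
  name: "get_weather_forecast",
  description: "Fetches the weather forecast for a city",
  args: JSON.stringify({ city: "Paris", date: "tomorrow" }),
});
weatherCall.result(JSON.stringify({ forecast: "Sunny, 24°C" }));

// Evaluate whether the correct tool was selected with sensible arguments.
toolTrace.evaluate
  .withEvaluators("tool-call-accuracy")
  .withVariables({
    input: toolQuestion,
    toolCalls: JSON.stringify([
      { name: "get_weather_forecast", args: { city: "Paris", date: "tomorrow" } },
    ]),
    expectedToolCalls: JSON.stringify([{ name: "get_weather_forecast" }]),
  });

toolTrace.end();
```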
Viewing Evaluation Results
After running your offline evaluations, view the results in the Maxim dashboard:
- Navigate to your Log Repository
- View the Logs tab to see all logged traces
- Click on any trace to see detailed evaluation results
- Use the Evaluation tab to see scores, reasoning, and pass/fail status

The “Overview” tab in your log repository provides insights on your logs and evaluation runs, including metrics like latency, cost, score, and error rate. You can filter your logs by different criteria, such as tags, cost, and latency.

Best Practices
1. Use deterministic test IDs
Use consistent, meaningful IDs for your test cases to make it easy to track and compare runs over time.
2. Include expected outputs
Always include expected outputs in your test cases so that comparison evaluators like semantic-similarity can provide meaningful scores.
3. Tag your traces
Use tags to categorize your offline evaluation runs (e.g., test_type: offline_eval, version: v1.2.0) for easy filtering.
4. Choose appropriate evaluators
Select evaluators that match your use case:
- Semantic Similarity: Compare output against expected output
- Faithfulness: Ensure answers are grounded in provided context
- Tool Call Accuracy: Validate correct tool selection
- Context Relevance: Assess retrieval quality in RAG systems
Next Steps
- Node-Level Evaluation - Learn more about programmatic evaluation
- Pre-built Evaluators - Explore available evaluators
- Custom Evaluators - Create your own evaluation logic
- CI/CD Integration - Automate your evaluation pipeline
Schedule a demo to see how Maxim AI helps teams ship reliable agents.