
What is Node-Level Evaluation?

As your AI application grows in complexity, it becomes increasingly difficult to understand how individual flows and components are performing. This granular insight is necessary to identify bottlenecks and low-quality areas in your application’s or agent’s flow. By targeting the underperforming areas, you can optimize overall performance more effectively than with brute-force approaches. This is where Node-Level evaluation helps: it enables you to evaluate a trace or any of its components (a span, generation, or retrieval) in isolation, via a simple API on the Maxim SDK’s logger. Let’s see how to start evaluating your nodes.
Before you start: You need to have logging set up to capture interactions between your LLM and users before you can evaluate them. To do so, integrate the Maxim SDK into your application.

Understanding How the Maxim SDK Logger Evaluates

Evaluating a node mainly requires two actions:
  1. Attach Evaluators: This defines which evaluators to run on the particular node. It must be called to start an evaluation on any component.
  2. Attach Variables: Once evaluators are attached to a component, each evaluator waits for all the variables it needs. Only after all of its required variables have been attached does an evaluator start processing.
Once you have attached evaluators and their variables, we process each evaluator and display the results in the Evaluation tab under the respective node.
  • An evaluator will not run until all of the variables it needs are attached to it.
  • If we don’t receive all the variables an evaluator needs within 5 minutes, we start displaying a Missing variables message (although we will still process the evaluator if the variables arrive after 5 minutes).
  • The variables an evaluator needs are listed on the evaluator’s page: the evaluator test panel on the right lists every variable the evaluator needs (all of them are required).
Screenshot of evaluator test panel
As shown in the image above, this evaluator needs the input, context and expectedOutput variables.
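Putting the two actions together, here is a minimal sketch in Python, using the logger API covered in the examples later in this guide (the evaluator name and variable values are illustrative):
# 1. Attach evaluators to a component (here, a generation)
generation.evaluate().with_evaluators("clarity")

# 2. Attach the variables those evaluators need; the evaluator runs
#    once every required variable has been received
generation.evaluate().with_variables(
    {"input": user_query, "output": model_answer},
    ["clarity"]
)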

Evaluation Framework Architecture

The Maxim SDK provides a robust evaluation framework built on several key components:

Core Evaluation Classes

BaseEvaluator: The foundation class for all evaluators in maxim-py
from typing import Dict

from maxim.evaluators import BaseEvaluator
from maxim.models import LocalEvaluatorResultParameter, LocalData, LocalEvaluatorReturn

class MyCustomEvaluator(BaseEvaluator):
    def evaluate(self, result: LocalEvaluatorResultParameter, data: LocalData) -> Dict[str, LocalEvaluatorReturn]:
        # Your evaluation logic here
        return {"evaluator_name": LocalEvaluatorReturn(score=0.85, reasoning="Evaluation reasoning")}
EvaluateContainer: Manages evaluators for specific entities (traces, spans, generations, retrievals)
# The evaluate() method returns an EvaluateContainer
evaluation_container = generation.evaluate()
PassFailCriteria: Defines success/failure thresholds for evaluations
from maxim.models import PassFailCriteria, PassFailCriteriaOnEachEntry, PassFailCriteriaForTestrunOverall

criteria = PassFailCriteria(
    on_each_entry_pass_if=PassFailCriteriaOnEachEntry(">=", 0.7),
    for_testrun_overall_pass_if=PassFailCriteriaForTestrunOverall(">=", 0.8, "average")
)

Evaluation Types

The SDK supports multiple evaluator types, which can be mixed on the same component (see the sketch after this list):
  • AI Evaluators: Use AI models to evaluate content (e.g., “clarity”, “toxicity”)
  • Programmatic Evaluators: Custom logic-based evaluations
  • Statistical Evaluators: Mathematical/statistical analysis
  • API Evaluators: External API-based evaluations
  • Human Evaluators: Manual human assessment
  • Local Evaluators: Client-side custom evaluators
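Platform evaluators (AI, programmatic, statistical, API, and human) are attached by name, while local evaluators are passed as instances; the two styles can be mixed in one call, as in this sketch (my_local_evaluator stands for an instance of a BaseEvaluator subclass like the ones shown on this page, and the evaluator names are illustrative):
# Platform evaluators are referenced by name; local evaluators are passed as instances
generation.evaluate().with_evaluators(
    "clarity",           # AI evaluator resolved on the platform
    "toxicity",          # another platform evaluator
    my_local_evaluator   # local evaluator that runs client-side
)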

Attaching Evaluators via Maxim SDK

We use the with_evaluators method (withEvaluators in the JS/TS SDK) to attach evaluators to any component within a trace, or to the trace itself. It is as simple as listing the names of the evaluators you want to attach, as they appear on the platform.
component.evaluate.withEvaluators("evaluator");

// example
generation.evaluate.withEvaluators("clarity", "toxicity");
If you list an evaluator that doesn’t exist in your workspace but is available in the store, we will auto-install it in your workspace for you. If the evaluator is not available in the store either, we will ignore it.

Advanced Evaluator Configuration

You can also attach custom local evaluators with specific pass/fail criteria:
from typing import Dict

from maxim.evaluators import BaseEvaluator
from maxim.models import (
    LocalEvaluatorResultParameter,
    LocalData,
    LocalEvaluatorReturn,
    PassFailCriteria,
    PassFailCriteriaOnEachEntry,
    PassFailCriteriaForTestrunOverall
)

class CustomQualityEvaluator(BaseEvaluator):
    def evaluate(self, result: LocalEvaluatorResultParameter, data: LocalData) -> Dict[str, LocalEvaluatorReturn]:
        # Custom evaluation logic
        score = self.analyze_quality(result.output)
        return {
            "quality_score": LocalEvaluatorReturn(
                score=score, 
                reasoning=f"Quality analysis based on {len(result.output)} characters"
            )
        }
    
    def analyze_quality(self, text: str) -> float:
        # Your custom quality analysis logic
        return 0.85

# Usage with custom criteria
custom_evaluator = CustomQualityEvaluator(
    pass_fail_criteria={
        "quality_score": PassFailCriteria(
            on_each_entry_pass_if=PassFailCriteriaOnEachEntry(">=", 0.7),
            for_testrun_overall_pass_if=PassFailCriteriaForTestrunOverall(">=", 0.8, "average")
        )
    }
)

generation.evaluate().with_evaluators("clarity", custom_evaluator)

Providing Variables to Evaluators

Once evaluators are attached to a component, variables can be passed to them via the with_variables method. This method accepts a map of variable names to their values. You also need to specify which evaluators these variables should be attached to by passing a list of evaluator names as the second argument.
component.evaluate.withVariables(
  { variableName: "value" }, // Key-value pair of variables
  ["evaluator"], // List of evaluators
);

// example
retrieval.evaluate.withVariables(
  { output: assistantResponse.choices[0].message.content },
  ["clarity", "toxicity"],
);
You can chain the with_variables method directly after attaching evaluators to a component, which lets you skip listing the evaluator names again.
trace.evaluate
    .withEvaluators("clarity", "toxicity")
    .withVariables({
        input: userInput,
    });

Viewing Evaluation Results on the Evaluations Tab

This is very similar to Making sense of evaluations on logs, except that the evaluations for each component appear on their own card, just as they do for the trace.
Screenshot of node level evaluation result

Comprehensive Code Examples

Basic Generation Evaluation

This example shows how to evaluate a simple text generation with multiple evaluators:
import maxim
from maxim.decorators import trace, generation

# Initialize logger
logger = maxim.Logger(api_key="your-api-key")

@logger.trace(name="customer_support_chat")
def handle_customer_query(user_message: str):
    # Create a generation for the AI response
    generation_config = {
        "id": "support-response-001",
        "provider": "openai",
        "model": "gpt-4",
        "messages": [
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": user_message}
        ],
        "model_parameters": {"temperature": 0.7, "max_tokens": 500},
        "name": "customer_support_response"
    }
    
    generation = logger.current_trace().generation(generation_config)
    
    # Attach evaluators to the generation
    generation.evaluate().with_evaluators("clarity", "toxicity", "helpfulness")
    
    # Provide input variable for all evaluators
    generation.evaluate().with_variables(
        {"input": user_message},
        ["clarity", "toxicity", "helpfulness"]
    )
    
    # Simulate AI response
    ai_response = "Thank you for contacting us. I understand your concern about..."
    generation.result({
        "choices": [{"message": {"content": ai_response, "role": "assistant"}}],
        "usage": {"total_tokens": 150}
    })
    
    # Provide output variable for evaluation
    generation.evaluate().with_variables(
        {"output": ai_response},
        ["clarity", "toxicity", "helpfulness"]
    )
    
    return ai_response

Advanced RAG System Evaluation

This example demonstrates evaluation in a Retrieval-Augmented Generation (RAG) system:
import maxim
from maxim.decorators import trace, generation, retrieval

logger = maxim.Logger(api_key="your-api-key")

@logger.trace(name="rag_question_answering")
def answer_question_with_rag(question: str, knowledge_base: list):
    # Step 1: Retrieve relevant documents
    @retrieval(name="document_retrieval", evaluators=["Ragas Context Relevancy"])
    def retrieve_documents(query: str):
        # Use a distinct name so we don't shadow the imported `retrieval` decorator
        doc_retrieval = maxim.current_retrieval()
        doc_retrieval.input(query)
        
        # Attach evaluation variables
        doc_retrieval.evaluate().with_variables(
            {"input": query},
            ["Ragas Context Relevancy"]
        )
        
        # Simulate document retrieval
        relevant_docs = [
            {"content": "Document 1 content...", "relevance_score": 0.9},
            {"content": "Document 2 content...", "relevance_score": 0.7}
        ]
        
        doc_retrieval.output(relevant_docs)
        
        # Provide context for evaluation
        doc_retrieval.evaluate().with_variables(
            {"context": str(relevant_docs)},
            ["Ragas Context Relevancy"]
        )
        
        return relevant_docs
    
    # Step 2: Generate answer using retrieved context
    retrieved_docs = retrieve_documents(question)
    
    generation_config = {
        "id": "rag-answer-generation",
        "provider": "openai", 
        "model": "gpt-4",
        "messages": [
            {
                "role": "system", 
                "content": "Answer the question using the provided context. Be accurate and helpful."
            },
            {"role": "user", "content": f"Question: {question}\nContext: {retrieved_docs}"}
        ],
        "name": "rag_answer_generation"
    }
    
    generation = logger.current_trace().generation(generation_config)
    
    # Attach multiple evaluators
    generation.evaluate().with_evaluators(
        "clarity", "accuracy", "completeness", "relevance"
    )
    
    # Provide input variables
    generation.evaluate().with_variables(
        {
            "input": question,
            "context": str(retrieved_docs)
        },
        ["clarity", "accuracy", "completeness", "relevance"]
    )
    
    # Generate answer
    answer = "Based on the provided context, the answer is..."
    generation.result({
        "choices": [{"message": {"content": answer, "role": "assistant"}}]
    })
    
    # Provide output for evaluation
    generation.evaluate().with_variables(
        {"output": answer},
        ["clarity", "accuracy", "completeness", "relevance"]
    )
    
    return answer

Multi-Agent System Evaluation

This example shows evaluation in a multi-agent system with different evaluation strategies:
import maxim
from maxim.decorators import trace, generation, span

logger = maxim.Logger(api_key="your-api-key")

@logger.trace(name="multi_agent_workflow")
def process_complex_request(user_request: str):
    
    # Agent 1: Research Agent
    @span(name="research_phase")
    def research_agent(query: str):
        research_generation = maxim.current_span().generation({
            "id": "research-agent",
            "provider": "openai",
            "model": "gpt-4",
            "messages": [
                {"role": "system", "content": "You are a research specialist. Gather comprehensive information."},
                {"role": "user", "content": query}
            ],
            "name": "research_agent"
        })
        
        # Evaluate research quality
        research_generation.evaluate().with_evaluators("completeness", "accuracy")
        research_generation.evaluate().with_variables(
            {"input": query},
            ["completeness", "accuracy"]
        )
        
        research_output = "Comprehensive research findings..."
        research_generation.result({"choices": [{"message": {"content": research_output}}]})
        
        research_generation.evaluate().with_variables(
            {"output": research_output},
            ["completeness", "accuracy"]
        )
        
        return research_output
    
    # Agent 2: Analysis Agent  
    @span(name="analysis_phase")
    def analysis_agent(research_data: str):
        analysis_generation = maxim.current_span().generation({
            "id": "analysis-agent",
            "provider": "openai",
            "model": "gpt-4",
            "messages": [
                {"role": "system", "content": "You are an analysis specialist. Provide deep insights."},
                {"role": "user", "content": f"Analyze this research: {research_data}"}
            ],
            "name": "analysis_agent"
        })
        
        # Evaluate analysis quality
        analysis_generation.evaluate().with_evaluators("clarity", "depth", "insightfulness")
        analysis_generation.evaluate().with_variables(
            {"input": research_data},
            ["clarity", "depth", "insightfulness"]
        )
        
        analysis_output = "Detailed analysis with insights..."
        analysis_generation.result({"choices": [{"message": {"content": analysis_output}}]})
        
        analysis_generation.evaluate().with_variables(
            {"output": analysis_output},
            ["clarity", "depth", "insightfulness"]
        )
        
        return analysis_output
    
    # Agent 3: Synthesis Agent
    @span(name="synthesis_phase") 
    def synthesis_agent(analysis_data: str):
        synthesis_generation = maxim.current_span().generation({
            "id": "synthesis-agent",
            "provider": "openai",
            "model": "gpt-4",
            "messages": [
                {"role": "system", "content": "You are a synthesis specialist. Create final recommendations."},
                {"role": "user", "content": f"Synthesize this analysis: {analysis_data}"}
            ],
            "name": "synthesis_agent"
        })
        
        # Evaluate synthesis quality
        synthesis_generation.evaluate().with_evaluators("coherence", "actionability", "completeness")
        synthesis_generation.evaluate().with_variables(
            {"input": analysis_data},
            ["coherence", "actionability", "completeness"]
        )
        
        final_output = "Final recommendations and action plan..."
        synthesis_generation.result({"choices": [{"message": {"content": final_output}}]})
        
        synthesis_generation.evaluate().with_variables(
            {"output": final_output},
            ["coherence", "actionability", "completeness"]
        )
        
        return final_output
    
    # Execute the multi-agent workflow
    research_result = research_agent(user_request)
    analysis_result = analysis_agent(research_result)
    final_result = synthesis_agent(analysis_result)
    
    return final_result

Custom Local Evaluator Implementation

This example shows how to create and use custom local evaluators:
from maxim.evaluators import BaseEvaluator
from maxim.models import (
    LocalEvaluatorResultParameter, 
    LocalData, 
    LocalEvaluatorReturn,
    PassFailCriteria,
    PassFailCriteriaOnEachEntry,
    PassFailCriteriaForTestrunOverall
)
import re
from typing import Dict

class CustomQualityEvaluator(BaseEvaluator):
    """Custom evaluator for assessing response quality"""
    
    def evaluate(self, result: LocalEvaluatorResultParameter, data: LocalData) -> Dict[str, LocalEvaluatorReturn]:
        output = result.output
        context = result.context_to_evaluate
        
        # Calculate quality score based on multiple factors
        clarity_score = self._assess_clarity(output)
        completeness_score = self._assess_completeness(output, context)
        relevance_score = self._assess_relevance(output, context)
        
        # Weighted average
        overall_score = (clarity_score * 0.4 + completeness_score * 0.3 + relevance_score * 0.3)
        
        reasoning = f"Quality assessment: Clarity={clarity_score:.2f}, Completeness={completeness_score:.2f}, Relevance={relevance_score:.2f}"
        
        return {
            "quality_score": LocalEvaluatorReturn(
                score=overall_score,
                reasoning=reasoning
            )
        }
    
    def _assess_clarity(self, text: str) -> float:
        """Assess text clarity based on readability metrics"""
        # Simple readability assessment
        sentences = len(re.split(r'[.!?]+', text))
        words = len(text.split())
        avg_sentence_length = words / max(sentences, 1)
        
        # Penalize very long sentences, reward moderate length
        if avg_sentence_length > 30:
            return 0.3
        elif avg_sentence_length > 20:
            return 0.6
        else:
            return 0.9
    
    def _assess_completeness(self, output: str, context: str) -> float:
        """Assess how completely the output addresses the context"""
        if not context:
            return 0.5  # Neutral if no context provided
        
        # Simple keyword overlap assessment
        output_words = set(output.lower().split())
        context_words = set(context.lower().split())
        
        if len(context_words) == 0:
            return 0.5
        
        overlap = len(output_words.intersection(context_words))
        return min(overlap / len(context_words), 1.0)
    
    def _assess_relevance(self, output: str, context: str) -> float:
        """Assess relevance of output to context"""
        if not context:
            return 0.5
        
        # Simple semantic relevance (in practice, you'd use more sophisticated methods)
        output_lower = output.lower()
        context_lower = context.lower()
        
        # Check for key terms from context appearing in output
        context_terms = context_lower.split()
        relevant_terms = [term for term in context_terms if term in output_lower]
        
        if len(context_terms) == 0:
            return 0.5
        
        return len(relevant_terms) / len(context_terms)

# Usage with custom evaluator
custom_evaluator = CustomQualityEvaluator(
    pass_fail_criteria={
        "quality_score": PassFailCriteria(
            on_each_entry_pass_if=PassFailCriteriaOnEachEntry(">=", 0.7),
            for_testrun_overall_pass_if=PassFailCriteriaForTestrunOverall(">=", 0.8, "average")
        )
    }
)

# Use in your application
@logger.trace(name="custom_quality_evaluation")
def evaluate_response_quality(user_input: str, ai_response: str, context: str):
    generation = logger.current_trace().generation({
        "id": "quality-eval-generation",
        "provider": "openai",
        "model": "gpt-4",
        "messages": [{"role": "user", "content": user_input}],
        "name": "quality_evaluation"
    })
    
    # Attach custom evaluator
    generation.evaluate().with_evaluators(custom_evaluator)
    
    # Provide variables
    generation.evaluate().with_variables(
        {
            "input": user_input,
            "context": context
        },
        ["quality_score"]
    )
    
    generation.result({"choices": [{"message": {"content": ai_response}}]})
    
    generation.evaluate().with_variables(
        {"output": ai_response},
        ["quality_score"]
    )
    
    return ai_response

Best Practices

1. Strategic Evaluator Selection

  • Use evaluators selectively to monitor key performance metrics; don’t overdo it by attaching too many evaluators.
  • Choose evaluators based on your use case (see the sketch after this list):
    • For customer service: helpfulness, clarity, toxicity
    • For content generation: coherence, relevance, creativity
    • For technical documentation: accuracy, completeness, clarity
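A customer-support flow, for example, would attach only the evaluators relevant to that use case. A minimal sketch (the evaluator names follow the list above and are assumed to exist in your workspace or the store):
# Customer service: focus on helpfulness, clarity, and safety
generation.evaluate().with_evaluators("helpfulness", "clarity", "toxicity")

# Technical documentation: focus on factual quality instead
generation.evaluate().with_evaluators("accuracy", "completeness", "clarity")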

2. Variable Management

  • Attach variables reliably so that no evaluation is left pending due to missing variables.
  • Use meaningful variable names that clearly indicate their purpose.
  • Provide context when available - many evaluators perform better with additional context.
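For instance, passing the retrieved context alongside the input gives context-aware evaluators more to work with. A sketch reusing the variable names from the examples above (user_question and retrieved_docs are illustrative):
generation.evaluate().with_variables(
    {
        "input": user_question,          # what the user asked
        "context": str(retrieved_docs)   # retrieved context, when available
    },
    ["clarity", "accuracy"]
)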

3. Performance Optimization

  • Set up sampling and filtering according to your needs so evaluations are processed accurately without driving up cost.
  • Use local evaluators for simple, fast evaluations that don’t require external API calls.
  • Batch evaluations when possible to reduce overhead.

4. Custom Evaluator Development

  • Inherit from BaseEvaluator for consistent behavior.
  • Define clear pass/fail criteria that align with your business objectives.
  • Include detailed reasoning in your evaluation results for better debugging.
  • Test your custom evaluators thoroughly before deploying to production.

5. Error Handling and Monitoring

  • Monitor evaluation results regularly to identify patterns and issues.
  • Set up alerts for evaluations that consistently fail or score poorly.
  • Review evaluation reasoning to understand why certain outputs are scored low.

6. Integration Patterns

  • Use decorators for simple, repetitive evaluation patterns (see the sketch after this list).
  • Use programmatic evaluation for complex, conditional evaluation logic.
  • Combine multiple evaluation strategies for comprehensive assessment.
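A minimal sketch of the decorator pattern, modeled on the RAG example above (it assumes the retrieval decorator accepts an evaluators argument as shown there; search_knowledge_base and the other names are illustrative):
@retrieval(name="kb_lookup", evaluators=["Ragas Context Relevancy"])
def lookup(query: str):
    current = maxim.current_retrieval()
    current.input(query)

    docs = search_knowledge_base(query)  # your own retrieval logic
    current.output(docs)

    # The attached evaluator runs once its required variables arrive
    current.evaluate().with_variables(
        {"input": query, "context": str(docs)},
        ["Ragas Context Relevancy"]
    )
    return docs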

Troubleshooting Common Issues

Missing Variables

If you see “Missing variables” messages:
  1. Check that all required variables are provided
  2. Verify variable names match exactly what the evaluator expects
  3. Ensure variables are attached to the correct evaluators
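For the evaluator shown in the screenshot earlier (which expects input, context, and expectedOutput), here is a sketch of providing every required variable with exactly matching names (the values and evaluator name are illustrative):
# Variable names must match the evaluator's required variables exactly
component.evaluate().with_variables(
    {
        "input": user_query,
        "context": retrieved_context,
        "expectedOutput": reference_answer
    },
    ["your_evaluator"]  # the evaluator(s) these variables are meant for
)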

Evaluation Not Running

If evaluations aren’t processing:
  1. Verify evaluators are properly attached with with_evaluators()
  2. Check that all required variables are provided
  3. Ensure the component (generation, retrieval, etc.) is properly ended
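A minimal known-good sequence to compare against, following the patterns used throughout this guide (the evaluator name, generation_config, question, and answer are illustrative):
generation = logger.current_trace().generation(generation_config)

# 1. Attach evaluators
generation.evaluate().with_evaluators("clarity")

# 2. Log the component's result so it is properly ended
generation.result({"choices": [{"message": {"content": answer, "role": "assistant"}}]})

# 3. Provide every variable the evaluator needs
generation.evaluate().with_variables({"input": question, "output": answer}, ["clarity"])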

Custom Evaluator Issues

If custom evaluators aren’t working:
  1. Verify the evaluator inherits from BaseEvaluator
  2. Check that pass_fail_criteria are properly defined
  3. Ensure the evaluate() method returns the correct format
  4. Test the evaluator in isolation before integration
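One way to test in isolation is to exercise the scoring helpers defined above directly, before touching the SDK plumbing. A sketch using the CustomQualityEvaluator from this page (the thresholds and sample strings are illustrative):
evaluator = CustomQualityEvaluator(
    pass_fail_criteria={
        "quality_score": PassFailCriteria(
            on_each_entry_pass_if=PassFailCriteriaOnEachEntry(">=", 0.7),
            for_testrun_overall_pass_if=PassFailCriteriaForTestrunOverall(">=", 0.8, "average")
        )
    }
)

# Exercise the scoring helpers directly with known inputs
assert 0.0 <= evaluator._assess_clarity("A short, clear answer.") <= 1.0
assert evaluator._assess_completeness("refund policy details", "refund policy") > 0.5
assert evaluator._assess_relevance("", "any context") == 0.0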

Advanced Use Cases

A/B Testing with Evaluations

# Compare different models or prompts using evaluations
@logger.trace(name="ab_test_evaluation")
def compare_models(user_input: str):
    # Model A
    model_a_response = generate_with_model_a(user_input)
    model_a_generation = logger.current_trace().generation({
        "id": "model-a-response",
        "provider": "openai",
        "model": "gpt-3.5-turbo",
        "name": "model_a"
    })
    model_a_generation.evaluate().with_evaluators("quality", "speed")
    model_a_generation.result({"choices": [{"message": {"content": model_a_response}}]})
    # Provide the variables the evaluators need so they actually run
    model_a_generation.evaluate().with_variables(
        {"input": user_input, "output": model_a_response},
        ["quality", "speed"]
    )
    
    # Model B  
    model_b_response = generate_with_model_b(user_input)
    model_b_generation = logger.current_trace().generation({
        "id": "model-b-response", 
        "provider": "openai",
        "model": "gpt-4",
        "name": "model_b"
    })
    model_b_generation.evaluate().with_evaluators("quality", "speed")
    model_b_generation.result({"choices": [{"message": {"content": model_b_response}}]})
    # Provide the variables the evaluators need so they actually run
    model_b_generation.evaluate().with_variables(
        {"input": user_input, "output": model_b_response},
        ["quality", "speed"]
    )
    
    return {"model_a": model_a_response, "model_b": model_b_response}