To start, navigate to Evaluators > Library in the Maxim dashboard and click the + Create Evaluator button.

Custom AI Evaluators (LLM-as-a-Judge)

Custom AI Evaluators use an LLM to “reason” about your agent’s outputs based on natural language instructions. This is ideal for subjective checks like tone, brand compliance, or complex safety guidelines.
1. Model Selection

Choose the specific model to act as the judge (e.g., GPT-4, Claude 3.5 Sonnet).
2. Evaluation Instructions

Write a system prompt defining how the judge should evaluate the data. You can inject dynamic values using the following variables:
  • {{input}}: The prompt sent to your agent.
  • {{output}}: The response generated by your agent.
  • {{context}}: Any retrieved context or RAG documents.
  • {{expected_output}}: The ground truth (if available in your dataset).
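As an illustration of how these placeholders are filled in, here is a minimal Python sketch of template substitution. Only the `{{variable}}` names come from the docs; the `render_instructions` helper and the sample prompt are hypothetical, not part of Maxim.

```python
# Hypothetical sketch: substitute test-run values into a judge instruction
# template. The {{...}} variable names match the documented list above.
def render_instructions(template: str, values: dict) -> str:
    # Replace each {{name}} placeholder with its value from the test run
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", value)
    return template

instructions = (
    "Rate the assistant's reply for politeness on a 1-5 scale.\n"
    "User prompt: {{input}}\n"
    "Assistant reply: {{output}}"
)
rendered = render_instructions(
    instructions,
    {"input": "Where is my order?", "output": "It ships tomorrow."},
)
```

At evaluation time, the judge model receives the fully rendered text, so every variable you reference must be available in the test run.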
3. Scoring Scales

Configure the format of the result:
  • Binary: Returns True/False (Pass/Fail).
  • Scale (1-5): Returns a numeric score (e.g., Likert scale).
  • Categorical: Returns specific string labels (e.g., “Safe”, “Risky”, “Toxic”).
4. Pass Criteria

In the Pass Criteria tab, define the threshold for success (e.g., “Score must be > 3” or “Must return ‘Safe’”).

Custom Programmatic Evaluators

Programmatic evaluators allow you to write deterministic logic in Python or JavaScript. This is best for strict validation rules, such as checking JSON schemas, verifying regex patterns, or detecting forbidden keywords. You must define a validate function that accepts standard arguments and returns a result matching your configured Response Type (Boolean, Number, or String).

Example: Python validator for sentence count
def validate(input, output, expected_output, context, **kwargs):
    # Split the output into sentences on periods
    sentences = output.split('.')
    # Filter out empty strings left over from the split
    valid_sentences = [s for s in sentences if s.strip()]

    # Fail (False) if the output contains fewer than 3 sentences
    if len(valid_sentences) < 3:
        return False

    # Pass (True) otherwise
    return True
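The same pattern covers the other rule types mentioned above. Here is a minimal sketch of a forbidden-keyword check; the keyword list is purely illustrative and not part of Maxim.

```python
# Illustrative validator: fail if the output contains any forbidden keyword.
FORBIDDEN = ["guarantee", "refund", "lawsuit"]  # example list, not from Maxim

def validate(input, output, expected_output, context, **kwargs):
    lowered = output.lower()
    # Return False (Fail) as soon as any forbidden keyword appears
    return not any(word in lowered for word in FORBIDDEN)
```

Because the logic is deterministic, the same input always yields the same score, which makes these evaluators well suited to regression-style checks.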

API-Based (Remote) Evaluators

If you have an existing evaluation pipeline or a proprietary scoring model hosted externally, you can connect it to Maxim using an API-Based Evaluator.
  • Endpoint Configuration: Provide your API URL, method (POST/GET), headers (e.g., Authorization tokens), and payload structure.
  • Integration: Maxim sends the test run data (inputs, outputs) to your endpoint and records the response as the evaluation score.
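For orientation, the exchange with a remote evaluator might look roughly like the sketch below. The payload and response field names are assumptions for illustration, not Maxim's documented schema; check the endpoint configuration screen for the actual structure.

```python
# Hypothetical request body Maxim might POST to your endpoint
# (field names are illustrative, not a documented schema).
request_body = {
    "input": "Summarize the ticket.",
    "output": "The customer reports a billing error.",
}

# A toy scoring function standing in for your hosted evaluation service.
def score_endpoint(body: dict) -> dict:
    # Example rule: pass if the output is non-empty and under 500 characters
    ok = 0 < len(body["output"]) < 500
    return {"score": 1 if ok else 0}

response = score_endpoint(request_body)
```

Whatever your service returns is recorded by Maxim as the evaluation score, so the response should match the Response Type you configured.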

Human Evaluators

For high-stakes workflows requiring manual oversight, you can configure Human Evaluators. These create a task queue for subject matter experts (SMEs) to review outputs.
  • Configuration: Define the rating interface (e.g., a 1-5 star rating or a text comment box) and provide guidelines for reviewers.
  • Workflow: Assign these evaluators to a test run to trigger a human-in-the-loop review process.

Testing Your Evaluator

Before saving, use the Playground on the right side of the evaluator editor:
1. Enter sample values for your variables ({{input}}, {{output}}).
2. Click Run to execute the evaluator immediately.
3. Review the Result and Reasoning trace to verify the logic works as expected.