Evaluators
Evaluators are tools or metrics used to assess the quality, accuracy, and effectiveness of AI model outputs. We offer various types of evaluators that can be customized and integrated into endpoints and test runs; the main categories are summarized below.

| Evaluator type | Description |
|---|---|
| AI | Uses AI models to assess outputs |
| Programmatic | Applies predefined rules or algorithms |
| Statistical | Utilizes statistical methods for evaluation |
| Human | Involves human judgment and feedback |
| API-based | Leverages external APIs for assessment |
Evaluator Store
A large set of pre-built evaluators is available for you to use directly. These can be found in the evaluator store and added to your workspace with a single click. At Maxim, our pre-built evaluators fall into two categories:
- Maxim-created Evaluators: These are evaluators created, benchmarked, and managed by Maxim. There are three kinds of Maxim-created evaluators:
  - AI Evaluators: These evaluators use other large language models to evaluate your application (LLM-as-a-Judge).
  - Statistical Evaluators: Traditional ML metrics such as BLEU, ROUGE, WER, TER, etc.
  - Programmatic Evaluators: JavaScript functions for common use cases like validJson, validURL, etc., that help validate your responses (a sketch of this kind of check follows this list).
- Third-party Evaluators: We have also enabled popular third-party libraries for evaluation, such as RAGAS, so you can use them in your evaluation endpoints with just a few clicks. If you have any custom integration requests, please feel free to drop us a note.
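To give a sense of what these programmatic checks do, here is a minimal sketch of a validJson-style check, assuming the check boils down to a plain JavaScript function that returns a score and reasoning (as every evaluator does). It is illustrative only, not Maxim's actual implementation:

```javascript
// Illustrative sketch of a validJson-style check; not Maxim's actual implementation.
// It returns a score and reasoning, as every evaluator is expected to do.
function validJson(output) {
  try {
    JSON.parse(output);
    return { score: 1, reasoning: "Output parses as valid JSON." };
  } catch (err) {
    return { score: 0, reasoning: `Output is not valid JSON: ${err.message}` };
  }
}
```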
If you want us to build a specific evaluator for your needs, please drop a line at [email protected].
Custom Evaluators
While we provide many evaluators for common use cases out of the box, we understand that some applications have specific requirements. Keeping that in mind, the platform allows for easy creation of custom evaluators of the following types:
AI Evaluators
These evaluators use other LLMs to evaluate your application. You can configure different prompts, models, and scoring strategies depending on your use case. Once tested in the playground, you can start using these evaluators in your endpoints.
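As an illustration, an LLM-as-a-Judge prompt for a clarity check could reference the reserved variables described later on this page. This is an example template, not a built-in Maxim prompt:

```
You are grading the response of an AI assistant.

Query: {{input}}
Response: {{output}}

Rate the clarity of the response on a scale of 1 to 5 and explain your reasoning.
```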
Programmatic Evaluators
These are JavaScript functions where you can write your own custom logic. You can use the {{input}}, {{output}}, and {{expectedOutput}} variables, which pull the relevant data from the dataset columns or from the response of the run when the evaluator executes.
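For example, a simple exact-match evaluator might look like the sketch below. The exact function signature and the way the {{output}} and {{expectedOutput}} values are exposed inside the editor are assumptions here; treat this as a sketch of the logic rather than a copy-paste template:

```javascript
// Sketch of a custom programmatic evaluator: exact match against the expected output.
// How {{output}} and {{expectedOutput}} are passed in is assumed for illustration.
function evaluate(output, expectedOutput) {
  const normalize = (text) => text.trim().toLowerCase();
  const passed = normalize(output) === normalize(expectedOutput);
  return {
    score: passed ? 1 : 0,
    reasoning: passed
      ? "Output matches the expected output."
      : "Output does not match the expected output.",
  };
}
```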
API-based Evaluators
If you have built your own evaluation model for specific use cases, you can expose the model using an HTTP endpoint and integrate it with Maxim for evaluation.
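A minimal sketch of such an endpoint is shown below, using Node's built-in http module. The request and response field names (input/output in, score/reasoning out) are assumptions for illustration; map them to whatever schema you configure in Maxim:

```javascript
// Minimal sketch of an HTTP endpoint wrapping your own evaluation model.
// Field names are illustrative; align them with your Maxim configuration.
const http = require("http");

http
  .createServer((req, res) => {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", () => {
      const { input, output } = JSON.parse(body);
      // Replace this stub with a call to your own evaluation model.
      const score = output && output.trim().length > 0 ? 1 : 0;
      const reasoning = score ? "Output is non-empty." : "Output is empty.";
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ score, reasoning }));
    });
  })
  .listen(3000);
```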
Human Evaluators
This allows for the last mile of evaluation with human annotators in the loop. You can create a Human Evaluator for specific criteria that you want annotators to assess. During a test run, simply attach the evaluators, add details of the raters, and choose the sample set for human annotation. Learn more about the human evaluation lifecycle here.
Every evaluator should return a score and reasoning, which are then analyzed and used to summarize results according to your criteria.
Evaluator Grading
Every evaluator’s grading configuration has two parts:
- Type of scale: Yes/No, Scale of 1-5, etc.
  - For AI evaluators, you choose the scale and provide an explanation of the grading logic.
  - For programmatic evaluators, you configure the relevant response type.
  - For API-based evaluators, you map the field to be used for scoring.
- Pass criteria: This includes configuration at two levels (see the sketch after this list):
  - The score at which an evaluator should pass for a given query.
  - The percentage of queries that need to pass for the evaluator to pass at the run level, across all dataset entries.
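To make the two levels concrete, here is a small sketch of how query-level scores roll up into a run-level result, assuming a 1-5 scale where a score of at least 4 passes and at least 80% of queries must pass. The thresholds are example values, not platform defaults:

```javascript
// Sketch of the two-level pass criteria; 4/5 and 80% are example thresholds.
function runPasses(scores, passScore = 4, requiredPassRate = 0.8) {
  const passedQueries = scores.filter((score) => score >= passScore).length;
  return passedQueries / scores.length >= requiredPassRate;
}

// 4 of 5 queries score >= 4 (80%), so the evaluator passes at the run level.
console.log(runPasses([5, 4, 4, 2, 5])); // true
```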
Maxim uses reserved variables with specific meanings:
- {{input}}: Input from the dataset
- {{expectedOutput}}: Expected output from the dataset
- {{expectedToolCalls}}: Expected tool calls from the dataset
- {{scenario}}: Scenario from the dataset
- {{expectedSteps}}: Expected steps from the dataset
- {{output}}: Generated output of the endpoint/prompt/no-code agent
- {{context}}: Context to evaluate
Evaluator Reasoning
To help you analyze why certain cases perform well or underperform, we provide clear reasoning for each evaluator score. This can be viewed for each entry within the evaluation tab on its details sheet.
Multimodal Datasets
Datasets in Maxim are multimodal and can be created directly on the platform or uploaded as existing CSV files. Datasets can have columns of the following types (entities), with a small example after the list:
- Input (compulsory entity): A column of this type is treated as the input query used to test your application.
- Expected Output: Data representing the desired response that the application should generate for the corresponding input.
- Output: This is for cases where you have run your queries elsewhere and have the outputs within your CSV that you want to evaluate directly.
- Image: You can upload images or provide an image URL.
- Variables: Any data that you want to dynamically change in prompts/endpoints during runtime.
- Expected Tool Calls: Prompt tools expected to be triggered for the corresponding input.
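For example, a small CSV upload could look like the sketch below, where Input and Expected Output map to their entity types and tone is a variable column. The column names and values are purely illustrative:

```csv
Input,Expected Output,tone
Write a one-line greeting for a new customer,Welcome aboard! We are glad to have you.,friendly
Write a one-line greeting for a new customer,Dear customer: welcome to our service.,formal
```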