Set up auto evaluation on logs
Evaluate captured logs automatically from the UI based on filters and sampling
Why evaluate logs?
Evaluation is a necessary step when building with LLMs, but because LLMs can be non-deterministic, pre-release testing can never cover every possible scenario; evaluating the LLM on the live system therefore becomes crucial as well.
Evaluating logs helps cover cases or scenarios that might not be covered by Test runs, ensuring that the LLM performs well under real-world conditions. It also allows potential issues to be identified early, so you can make the necessary adjustments and improve the LLM's overall performance in time.
Before you start
Before you can evaluate logs, you need logging set up to capture the interactions between your LLM and your users. To do so, integrate the Maxim SDK into your application.
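If you have not set up logging yet, the general shape looks something like the sketch below. This is a minimal illustration only: the class and method names (Maxim, Config, LoggerConfig, TraceConfig, set_input, set_output) and the placeholder keys are assumptions about the Python SDK, so refer to the SDK reference for the exact API.

```python
# Minimal sketch of capturing a trace with the Maxim Python SDK.
# NOTE: the identifiers below are assumptions for illustration;
# check the SDK reference for the exact classes and signatures.
from maxim import Maxim, Config
from maxim.logger import LoggerConfig, TraceConfig

# Initialize the SDK with your API key and point the logger at the
# log repository whose logs you want to evaluate.
maxim = Maxim(Config(api_key="YOUR_MAXIM_API_KEY"))
logger = maxim.logger(LoggerConfig(id="YOUR_LOG_REPOSITORY_ID"))

# Create a trace for one user interaction and record its input/output.
trace = logger.trace(TraceConfig(id="trace-1", name="chat-turn"))
trace.set_input("What is the refund policy?")
# ... call your LLM here ...
trace.set_output("Our refund policy allows returns within 30 days.")
trace.end()
```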
Setting up auto evaluation
Navigate to repository
Navigate to the repository where you want to evaluate your logs.
Access evaluation configuration
Click on Configure evaluation in the top right corner of the page and choose the Setup evaluation configuration option. This will open up the evaluation configuration sheet.
Configure auto evaluation settings
The sheet's Auto Evaluation section has three parts:
- Select evaluators: Choose the evaluators you want to use for your evaluation.
- Filters: Set up filters so that only logs meeting certain criteria are evaluated.
- Sampling: Choose a sampling rate. This controls how many logs are evaluated and prevents evaluating every single log, which could otherwise lead to very high costs (see the conceptual sketch after this list).
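To make the interplay between filters and sampling concrete, here is a conceptual sketch of the mechanism: a log is evaluated only if it matches the filter criteria and then falls within the sampling rate. The field names and the specific filter are illustrative assumptions, not how the platform implements this internally.

```python
import random

# Illustrative only: shows how filter criteria plus a sampling rate
# decide whether a given log gets evaluated. Field names are made up
# for this example and do not reflect the platform's internals.
SAMPLING_RATE = 0.10  # evaluate roughly 10% of matching logs

def should_evaluate(log: dict) -> bool:
    # Filters: only consider logs that meet certain criteria,
    # e.g. production traffic from a specific model.
    matches_filter = (
        log.get("environment") == "production"
        and log.get("model") == "gpt-4o"
    )
    if not matches_filter:
        return False
    # Sampling: evaluate only a fraction of the matching logs
    # to keep evaluation costs under control.
    return random.random() < SAMPLING_RATE
```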
The Human Evaluation section below is explained in the Set up human evaluation on logs section.
Save configuration
Finally, click the Save configuration button.
The configuration is now done, and your logs should start being evaluated automatically based on the filters and sampling rate you have set up! 🎉
Making sense of evaluations on logs
In the logs table view, each trace's row displays its evaluation scores towards the left end. You can also sort the logs by evaluation score by clicking on any evaluator's column header.
Click the trace to view detailed evaluation results. In the sheet, you will find the Evaluation tab, where you can see the evaluation in detail.
The Evaluation tab displays many details regarding the evaluation of the trace. Let us see how you can navigate through them and get more insight into how your LLM is performing.
Evaluation summary
Evaluation summary displays the following information (top to bottom, left to right):
- How many evaluators passed out of the total evaluators across the trace
- The total cost of all the evaluators' evaluations
- The number of tokens used across all the evaluators' evaluations
- The total time taken for the evaluation to process
Trace evaluation card
In each card, you will find a tab switcher in the top right corner, which is used to navigate through the evaluation's details. Here is what you can find in the different tabs:
Overview tab
This tab displays a table of all the evaluators run at the trace level, along with their scores and whether each evaluator passed or failed.
Individual evaluator’s tab
This tab contains the following sections:
- Result: Shows whether the evaluator passed or failed.
- Score: Shows the score of the evaluator.
- Reason (shown where applicable): Displays the reasoning behind the score of the evaluator, if given.
- Cost (shown where applicable): Shows the cost of the individual evaluator’s evaluation.
- Tokens used (shown where applicable): Shows the number of tokens used by the individual evaluator’s evaluation.
- Model latency (shown where applicable): Shows the time taken by the model to respond with a result for an evaluator.
- Time taken: Shows the time taken by the evaluator to evaluate.
- Variables used to evaluate: Shows the values that were substituted for the variables while processing the evaluator.
- Logs: These are logs that were generated during the evaluation process. They might be useful for debugging errors or issues that occurred during the evaluation.
Tree view on the left panel
This view is essential when you are evaluating each log at the node level, that is, on each component of the trace (such as a generation or a retrieval). It helps you understand which component's evaluation you are looking at on the right panel, as well as that component's place in the trace. We discuss Node Level Evaluation in more detail further down.