The need for human evaluation
While machine learning models can provide a baseline evaluation, they may not capture the nuances of human perception, simply because they lack the ability to understand the context and emotions behind some scenarios. Humans in these scenarios can also provide better comments and insights, which makes it essential to have humans be part of the evaluation process. Human evaluation on logs is very similar to how human annotation is done on test runs; in fact, the Human Evaluators used in test runs are also used here. Let's see how we can set up a human evaluation pipeline for our logs.

Before you start

You need to have your logging set up to capture interactions between your LLM and users before you can evaluate them. To do so, you need to integrate the Maxim SDK into your application.

Also, if you do not have a Human Evaluator created in your workspace, please create one by navigating to the Evaluators tab from the sidebar, as we will need it to set up the human evaluation pipeline.
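For reference, wiring up the SDK for logging typically looks like the minimal sketch below. This is an illustrative Python example only: the class and helper names used here (Maxim, Config, LoggerConfig, TraceConfig, and the trace input/output helpers) are assumptions, so check the Maxim SDK reference for the exact API in your language and SDK version.

```python
# Minimal sketch of capturing LLM interactions as logs with the Maxim SDK.
# All class/method names below are illustrative assumptions -- confirm the
# exact API in the Maxim SDK reference for your language and version.
from maxim import Maxim, Config
from maxim.logger import LoggerConfig, TraceConfig

# Authenticate and point the logger at the log repository you want to evaluate.
maxim = Maxim(Config(api_key="YOUR_MAXIM_API_KEY"))
logger = maxim.logger(LoggerConfig(id="YOUR_LOG_REPOSITORY_ID"))

# Wrap each user interaction in a trace so it shows up as a log entry.
trace = logger.trace(TraceConfig(id="trace-123", name="chat-turn"))
trace.set_input("What is human evaluation?")   # assumed helper for the user message
trace.set_output("Human evaluation is ...")    # assumed helper for the model response
trace.end()
```

Once interactions like this start flowing into your repository, they appear as logs that the human evaluation pipeline described below can pick up.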
Setting up human evaluation
1
Navigate to repository
Navigate to the repository where you want to set up human evaluation on logs.
2
Access evaluation configuration
Click on Configure evaluation in the top right corner of the page and choose the Setup evaluation configuration option. This will open the evaluation configuration sheet.
3
Select human evaluators
Focus on the Human Evaluation section. Under the Select evaluators dropdown, choose the Human Evaluators you want to use for the evaluation. This sets up what evaluation we want to run on our logs. Next, we need to set up filtering criteria to determine which logs should be evaluated, since evaluating every log by hand quickly becomes unmanageable.
We covered the Auto evaluation section above; you can learn more there about using other types of evaluators on your logs.
4
Save configuration
Before we set up the filtering criteria, we need to save this configuration. Do this by clicking on the Save configuration button.
5
Access annotation queue
To get to the filtering criteria, click on Configure evaluation in the top right corner of the page again, but choose the View annotation queue option this time. You will be taken to the annotation queue page.
6
Set up queue logic

You will see a Set up queue logic button. Click on it to set up the logic for the queue, then click on the Save queue logic button to save.
Manually add logs to the queue by:
- Selecting the logs you want to add to the queue by clicking the checkboxes at the left of each log
- Clicking on the <Icon icon="notebook-pen" /> Add to annotation queue button, and you're done!
Viewing annotations
There are 3 places where annotations can be viewed:

The annotation queue page
Here, each added log has its human evaluators' scores displayed. The score shown is the average of all the annotations done for that evaluator by different users. On editing a score, the individual score, comment, and rewritten output (if any) of the user editing it are shown, with the ability to edit all of them.
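As a quick illustration of that averaging (plain arithmetic, not the platform's actual implementation), suppose three users rate the same log for one evaluator:

```python
# Hypothetical per-user ratings for one log against a single Human Evaluator.
ratings = {"alice": 4, "bob": 5, "carol": 3}

# The annotation queue displays the mean of these ratings for that evaluator.
displayed_score = sum(ratings.values()) / len(ratings)
print(displayed_score)  # 4.0
```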
Annotating the logs
On opening the annotation queue page, you will see a list of logs that have been added to the queue, each with a Select rating dropdown beside it. Clicking the Select rating dropdown opens a modal where you can select a rating for the log and optionally add a comment or provide a rewritten output if necessary. Click the Save and next button to move to the next log/entry and score it.

The logs table
Similar to how evaluator scores are shown for auto evaluation, human evaluator scores are also shown in the logs table (again, the average score is shown here).
The trace details sheet (under the Evaluation tab)
On opening any trace, you will find a Details tab and an Evaluation tab. The Evaluation tab displays all the evaluations that happened on the trace. We will focus on the Human Evaluators here; to make sense of the other evaluators in this sheet, refer to Auto Evaluation -> Making sense of evaluations on logs.
The trace evaluation overview tab shows the average score of each Human Evaluator and the rewritten outputs, if present, from each individual user.

For each Human Evaluator, this tab shows the Score (avg.) and the Result (whether that particular evaluator's evaluation passed or failed). We also see a breakdown of the scores and their corresponding comments, if any, given by each user, giving you a granular view of the evaluation as well.
