Learn how to evaluate the quality of AI HR assistants using Maxim’s evaluation suite, ensuring accurate and efficient HR processes.
We start by creating a prompt for our HR assistant (HR_RAG_Assistant for our example). The system prompt contains the {{context}} variable together with instructions on how the assistant should answer: “Include at least one direct quote from the context, enclosed in quotation marks, and specify the section and page number where the quote can be found. Ensure the response is friendly and polite, adding “please” at the end to maintain a courteous tone.”

The context comes from our HR policy document (HR_policy.txt for our example), which is embedded using the text-embedding-ada-002 model. These embeddings enable the retrieval of context that is relevant to the user’s query. The retrieved context is added to the prompt (through the {{context}} variable), and the LLM generates a response for our input query using the information in the retrieved context.
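Conceptually, the retrieval-augmented generation loop behind the prompt looks like the sketch below. This is a minimal, hypothetical illustration using the openai Python client: the chunking strategy, the naive cosine-similarity search, and the gpt-4o generation model are assumptions made for the example, not Maxim’s implementation.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Read the HR policy and split it into fixed-size chunks (naive chunking,
# purely for illustration).
policy_text = open("HR_policy.txt").read()
chunks = [policy_text[i:i + 1000] for i in range(0, len(policy_text), 1000)]

def embed(texts):
    """Embed a list of texts with text-embedding-ada-002."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vectors = embed(chunks)

def retrieve(query, k=3):
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = embed([query])[0]
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query):
    # The retrieved chunks play the role of the {{context}} variable.
    context = "\n\n".join(retrieve(query))
    system = (
        "Answer the employee's question using only the context below. "
        "Include at least one direct quote from the context, enclosed in "
        "quotation marks, and specify the section and page number where the "
        "quote can be found. Ensure the response is friendly and polite, "
        "adding 'please' at the end to maintain a courteous tone.\n\n"
        f"Context:\n{context}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed generation model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content
```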
To evaluate the performance of our assistant, we’ll now create a test dataset: a collection of employee queries and corresponding expected responses. The expected responses serve as the reference against which we judge the quality of the responses generated by our assistant.
The dataset has two columns (a sample CSV is sketched after the list):

- `employee_query` as “Input” type, since these queries will be the input to our HR assistant
- `expected_response` as “Expected Output” type, since this is the reference for comparison of generated assistant responses
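If you prefer to prepare the dataset as a file before uploading it, a two-column CSV is enough. The rows below are hypothetical placeholders; replace them with real employee queries and the answers your HR policy actually supports.

```python
import csv

# Hypothetical example rows; replace with real employee queries and the
# answers you expect based on HR_policy.txt.
rows = [
    {
        "employee_query": "How many days of paid vacation do I get per year?",
        "expected_response": "Full-time employees receive 20 days of paid vacation per year.",
    },
    {
        "employee_query": "What is the notice period for resignation?",
        "expected_response": "The standard notice period is 30 days.",
    },
]

with open("HR_queries.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["employee_query", "expected_response"])
    writer.writeheader()
    writer.writerows(rows)
```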
Next, we select evaluators that measure different aspects of the assistant’s responses:

Evaluator | Type | Purpose |
---|---|---|
Context Relevance | LLM-as-a-judge | Evaluates how well your RAG pipeline’s retriever finds information relevant to the input |
Faithfulness | LLM-as-a-judge | Measures whether the output factually aligns with the contents of your context |
Context Precision | LLM-as-a-judge | Measures retriever accuracy by assessing the relevance of each node in the retrieved context |
Bias | LLM-as-a-judge | Determines whether output contains gender, racial, political, or geographical bias |
Semantic Similarity | Statistical | Checks whether the generated output is semantically similar to the expected output |
Tone check | Custom eval | Determines whether the output has a friendly and polite tone |
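To make the statistical evaluator concrete: semantic similarity is typically computed as the cosine similarity between embeddings of the generated output and the expected output. The sketch below shows one common way to implement such a check; it is an illustration, not necessarily the exact scoring Maxim uses.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def semantic_similarity(output: str, expected_output: str) -> float:
    """Cosine similarity between embeddings of the generated and expected text."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002", input=[output, expected_output]
    )
    a, b = (np.array(d.embedding) for d in resp.data)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```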
For the Tone check evaluator, we write a custom instruction that references the generated response through the {{output}} variable, for example: “{{output}}, determine if the response is friendly and polite?”
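Behind a custom LLM-based evaluator like this sits a judge call of roughly the shape below. The judge model and the PASS/FAIL parsing are assumptions made for the sketch, not Maxim’s evaluator implementation.

```python
from openai import OpenAI

client = OpenAI()

def tone_check(output: str) -> bool:
    """Ask a judge model whether the response is friendly and polite."""
    judge_prompt = (
        f"{output}\n\n"
        "Determine if the response above is friendly and polite. "
        "Answer with a single word: PASS or FAIL."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```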
To evaluate the assistant, we then trigger a test run on the prompt, selecting the dataset we created (HR_queries for our example) and the context source that holds the HR policy document (HR_policy in our example). Each query in the dataset is run through the assistant, and the generated responses are scored by the evaluators listed above.
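Outside Maxim, the same test run can be approximated conceptually: feed every dataset row to the assistant and score the result with the evaluators. The loop below reuses the hypothetical answer(), semantic_similarity(), and tone_check() helpers sketched earlier and is only meant to show how the pieces fit together.

```python
import csv

# Reuses answer(), semantic_similarity(), and tone_check() from the sketches above.
with open("HR_queries.csv") as f:
    for row in csv.DictReader(f):
        generated = answer(row["employee_query"])
        scores = {
            "semantic_similarity": semantic_similarity(generated, row["expected_response"]),
            "tone_check": tone_check(generated),
        }
        print(row["employee_query"], scores)
```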