
Evaluate retrieval at scale

While the playground experience lets you experiment and debug when retrieval is not working well, it is important to evaluate retrieval at scale across multiple inputs and against a defined set of metrics. Follow the steps below to run a test and evaluate context retrieval.
1

Initiate prompt testing

Click Test for a prompt that has an attached context (as explained in the previous section).
2

Select your test dataset

Select the dataset that contains the required inputs.
3

Choose context evaluation source

For the context to evaluate, select the dynamic Context Source.
4

Add retrieval quality evaluators

Select context-specific evaluators (e.g. context recall, context precision, or context relevance) and trigger the test. A simplified sketch of what these metrics measure is shown after the steps below.
5

Review retrieved context results

Once the run is complete, the retrieved context column is populated for every input.
6

Examine detailed chunk information

Click any entry to view the complete details of its retrieved chunks.
7

Analyze evaluator feedback

Evaluator scores and reasoning for each entry are available under the evaluation tab. Use these to debug retrieval issues.
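For intuition, the sketch below illustrates what the three retrieval evaluators measure on a single dataset row. It uses simple token overlap purely for illustration (production evaluators are typically LLM-judged), and the row shape, function names, and example values are hypothetical.

```python
# Illustrative sketch only: token-overlap approximations of the three metrics.
# Real evaluators are usually LLM-judged; all names and values here are hypothetical.

def _tokens(text: str) -> set[str]:
    return set(text.lower().split())

def context_precision(retrieved_chunks: list[str], reference_answer: str) -> float:
    """Share of retrieved chunks that contribute information used in the reference answer."""
    if not retrieved_chunks:
        return 0.0
    ref = _tokens(reference_answer)
    relevant = sum(1 for chunk in retrieved_chunks if _tokens(chunk) & ref)
    return relevant / len(retrieved_chunks)

def context_recall(retrieved_chunks: list[str], reference_answer: str) -> float:
    """Share of the reference answer that is covered by the retrieved chunks."""
    ref = _tokens(reference_answer)
    if not ref:
        return 0.0
    covered = _tokens(" ".join(retrieved_chunks)) & ref
    return len(covered) / len(ref)

def context_relevance(retrieved_chunks: list[str], query: str) -> float:
    """Average overlap between each retrieved chunk and the input query."""
    if not retrieved_chunks:
        return 0.0
    q = _tokens(query)
    overlaps = [len(_tokens(chunk) & q) / max(len(q), 1) for chunk in retrieved_chunks]
    return sum(overlaps) / len(overlaps)

# Hypothetical dataset row: input query, reference answer, and retrieved chunks.
row = {
    "input": "What is the refund window for annual plans?",
    "reference": "Annual plans can be refunded within 30 days of purchase.",
    "retrieved_context": [
        "Annual plans are eligible for a full refund within 30 days of purchase.",
        "Monthly plans renew automatically at the end of each billing cycle.",
    ],
}

print("precision:", context_precision(row["retrieved_context"], row["reference"]))
print("recall:   ", context_recall(row["retrieved_context"], row["reference"]))
print("relevance:", context_relevance(row["retrieved_context"], row["input"]))
```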
By running these experiments iteratively as you make changes to your AI application, you can catch regressions in the retrieval pipeline and continue to cover new test cases.
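One way to automate that regression check is to compare evaluator scores between a baseline run and a candidate run. The sketch below assumes each run can be exported as a mapping from input to metric scores; the export shape, tolerance, and values are hypothetical and not a specific platform API.

```python
# Minimal regression check between two evaluation runs.
# Assumed export shape: {input: {metric_name: score}}; values are made up for illustration.

BASELINE = {
    "What is the refund window for annual plans?": {"context_recall": 0.90, "context_precision": 0.75},
}
CANDIDATE = {
    "What is the refund window for annual plans?": {"context_recall": 0.60, "context_precision": 0.75},
}

TOLERANCE = 0.05  # allow small score noise between runs

def find_regressions(baseline: dict, candidate: dict, tolerance: float = TOLERANCE) -> list[tuple]:
    """Return (input, metric, baseline_score, candidate_score) for every metric that dropped."""
    regressions = []
    for input_text, base_scores in baseline.items():
        cand_scores = candidate.get(input_text, {})
        for metric, base_value in base_scores.items():
            cand_value = cand_scores.get(metric)
            if cand_value is not None and cand_value < base_value - tolerance:
                regressions.append((input_text, metric, base_value, cand_value))
    return regressions

for input_text, metric, before, after in find_regressions(BASELINE, CANDIDATE):
    print(f"REGRESSION {metric}: {before:.2f} -> {after:.2f} | {input_text}")
```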