- Session-level: Evaluate entire multi-turn conversations to assess overall quality, coherence, and user satisfaction
- Trace-level: Evaluate individual single-turn interactions to measure response quality, accuracy, and appropriateness
- Span-level (Node-level): Evaluate specific components within a trace (e.g., generations, retrievals, tool calls) to optimize individual parts of your workflow
- Automatically evaluate logs based on custom filters and sampling rules
- Configure session- and trace-level evaluators through the UI, and span-level (node-level) evaluators through the SDK
- Map evaluator variables to your trace data for flexible evaluation
- Combine automated evaluators with human review
- Curate datasets from evaluated logs for offline testing
- Set up alerts for both quality and performance issues
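
To make the filtering, sampling, and variable-mapping ideas above concrete, here is a minimal, library-independent sketch in Python. It is not the platform's SDK: `EvalRule`, `resolve`, `select_inputs`, and the log shape (`level`, `tags`, `input`, `output`) are all hypothetical names chosen for illustration. It only shows how a rule might pick logs at one level, keep a sampled fraction of them, and map evaluator variables onto fields of each log.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EvalRule:
    """Hypothetical rule: which logs to evaluate and how to feed the evaluator."""
    level: str                          # "session", "trace", or "span"
    log_filter: Callable[[dict], bool]  # custom filter applied to each log
    sample_rate: float                  # fraction of matching logs to keep (0.0 to 1.0)
    variable_map: dict[str, str]        # evaluator variable -> dotted path in the log


def resolve(log: dict, path: str) -> Any:
    """Follow a dotted path such as 'output.text' through nested dicts."""
    value: Any = log
    for key in path.split("."):
        value = value[key]
    return value


def select_inputs(rule: EvalRule, logs: list[dict], rng: random.Random) -> list[dict]:
    """Return evaluator inputs for the logs that pass level, filter, and sampling."""
    selected = []
    for log in logs:
        if log.get("level") != rule.level:
            continue  # rule targets a different evaluation level
        if not rule.log_filter(log):
            continue  # custom filter rejected this log
        if rng.random() >= rule.sample_rate:
            continue  # dropped by sampling
        selected.append(
            {var: resolve(log, path) for var, path in rule.variable_map.items()}
        )
    return selected


# Assumed log shape, for illustration only.
logs = [
    {"level": "trace", "tags": {"env": "prod"},
     "input": {"text": "Hi"}, "output": {"text": "Hello"}},
    {"level": "trace", "tags": {"env": "dev"},
     "input": {"text": "Ping"}, "output": {"text": "Pong"}},
    {"level": "span", "tags": {"env": "prod"},
     "input": {"text": "query"}, "output": {"text": "docs"}},
]

rule = EvalRule(
    level="trace",
    log_filter=lambda log: log["tags"]["env"] == "prod",
    sample_rate=1.0,  # evaluate every matching log
    variable_map={"question": "input.text", "answer": "output.text"},
)

inputs = select_inputs(rule, logs, random.Random(0))
```

With `sample_rate=1.0`, only the production trace is selected, and its fields arrive under the evaluator's own variable names (`question`, `answer`). Lowering `sample_rate` evaluates a random subset instead, which is the usual way to control evaluation cost on high-volume logs.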
