Evaluates whether an agent has successfully accomplished the intended goal of a task based on the complete interaction.
session
: Complete interaction log between user and agent showing all steps takenResult
: Binary score (0 or 1)Reasoning
: Detailed explanation