Evaluates whether an agent has completed all required steps, regardless of the order in which they were executed.
session
: Complete interaction log between user and agent showing all steps taken

expected_steps
: List of required steps (order flexible)

Result
: Binary score (0 or 1)

Reasoning
: Detailed explanation of step completion
: Detailed explanation of step completion
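
The input/output contract above can be illustrated with a minimal sketch. This is not the evaluator's actual implementation, which is not shown here and may well rely on a model-based judge. The names `check_steps` and `StepCheck` are hypothetical, and simple case-insensitive substring matching is assumed as a stand-in for real step detection. The sketch only shows the shape of the check: every expected step must appear somewhere in the session, in any order, and the result is a binary score plus an explanation.

```python
from dataclasses import dataclass


@dataclass
class StepCheck:
    result: int      # binary score: 1 if every expected step was found, else 0
    reasoning: str   # per-step explanation of what was or was not found


def check_steps(session: list[str], expected_steps: list[str]) -> StepCheck:
    """Hypothetical order-independent completion check (substring matching)."""
    log = [entry.lower() for entry in session]
    notes = []
    missing = 0
    for step in expected_steps:
        # Search the whole log for each step, so execution order does not matter.
        found_at = next(
            (i for i, entry in enumerate(log) if step.lower() in entry), None
        )
        if found_at is None:
            missing += 1
            notes.append(f"MISSING: {step}")
        else:
            notes.append(f"done (log entry {found_at}): {step}")
    return StepCheck(result=int(missing == 0), reasoning="; ".join(notes))


# Illustrative data only; the steps are listed in a different order than executed.
session = [
    "user: please reset my password",
    "agent: verify identity via security question",
    "agent: send reset link to email on file",
]
complete = check_steps(session, ["send reset link", "verify identity"])
incomplete = check_steps(session, ["verify identity", "log the ticket"])
```

Here `complete.result` is 1 even though the expected steps are listed in reverse of their execution order, while `incomplete.result` is 0 because one step never appears in the log. Literal substring matching is brittle against paraphrase, which is one reason a production evaluator would likely use a more tolerant matcher.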