Input
output (list): The model-generated list of tool calls.
expectedOutput (list): The reference list of expected tool calls (JSON-formatted objects with tool name and arguments).
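As an illustration of the input shape, the sketch below builds both lists in Python. The key names `name` and `arguments`, and the tool names themselves, are assumptions made for the example; the description above only says that each expected call is a JSON-formatted object with a tool name and arguments.

```python
import json

# Hypothetical schema: the "name" and "arguments" keys are assumed, not confirmed.
expected_output = [
    {"name": "search_flights", "arguments": {"origin": "SFO", "destination": "JFK"}},
    {"name": "book_flight", "arguments": {"flight_id": "UA123"}},
]

# The model made only the first of the two expected calls.
output = [
    {"name": "search_flights", "arguments": {"origin": "SFO", "destination": "JFK"}},
]

# Both lists serialize to JSON, in line with the "JSON-formatted objects" description.
payload = json.dumps({"output": output, "expectedOutput": expected_output}, indent=2)
print(payload)
```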
Output
Result (float): A score between 0 and 1.
Reasoning (str): Optional detailed feedback on the matching process.
Interpretation
- Higher scores (closer to 1): Most expected tool calls were made with the correct arguments and in the expected order.
- Lower scores (closer to 0): Few expected tool calls were matched correctly.
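To make this interpretation concrete, here is a minimal sketch of one way such a score could be computed. It is not the metric's actual formula, which is not given here. It assumes exact equality on tool name and arguments, accounts for order with a longest common subsequence, and divides by the number of expected calls. The function name `tool_call_match_score` and the `name`/`arguments` keys are hypothetical.

```python
from typing import Any


def tool_call_match_score(
    output: list[dict[str, Any]],
    expected_output: list[dict[str, Any]],
) -> float:
    """Illustrative order-aware match score in [0, 1] (an assumed rule, not the real one)."""
    if not expected_output:
        # Assumption: with nothing expected, only an empty output counts as a match.
        return 1.0 if not output else 0.0

    n, m = len(output), len(expected_output)
    # lcs[i][j] = length of the longest common subsequence of
    # output[:i] and expected_output[:j], comparing whole call objects.
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if output[i] == expected_output[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])

    # Fraction of expected calls matched in the correct relative order.
    return lcs[n][m] / m


search = {"name": "search_flights", "arguments": {"origin": "SFO"}}
book = {"name": "book_flight", "arguments": {"flight_id": "UA123"}}

print(tool_call_match_score([search, book], [search, book]))  # 1.0
print(tool_call_match_score([search], [search, book]))        # 0.5
print(tool_call_match_score([book, search], [search, book]))  # 0.5
```

Under these assumptions, a missing call and a pair of calls in swapped order both lower the score, which is consistent with the "proper parameters and order" wording above. A real implementation may weight partial argument matches differently.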
Formula
Use Cases
- Evaluating agent compliance with required tool sequences
- Assessing function-calling tasks that require specific arguments
- Measuring multi-step tool-use workflows end-to-end