Describe the bug
ToolCallAccuracy does not penalize extra tool calls. If the messages in user_input contain more ToolCalls than reference_tool_calls, the score is unaffected. In the reproduction below, the conversation makes two tool calls, the reference lists only one, and the score is still 1.
Ragas version: latest
Python version: 3.12
Code to Reproduce
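from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import ToolCallAccuracy

# Conversation in which the assistant makes two tool calls.
sample = [
    HumanMessage(content="What's the weather like in New York right now?"),
    AIMessage(content="The current temperature in New York is 75°F and it's partly cloudy.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
    HumanMessage(content="Can you translate that to Celsius?"),
    AIMessage(content="Let me convert that to Celsius for you.", tool_calls=[
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
    ]),
    ToolMessage(content="75°F is approximately 23.9°C."),
    AIMessage(content="75°F is approximately 23.9°C."),
]

# The reference lists only the first tool call.
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ],
)

# Scoring call as shown in the Ragas docs (assumed here); top-level await needs a notebook or async context.
scorer = ToolCallAccuracy()
await scorer.multi_turn_ascore(sample)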
Output:
1
Error trace
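Expected behavior
The score should be 0, because the conversation contains a tool call that is not in reference_tool_calls.
I realize opinions may differ on how extra tool calls should be scored.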
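To make the expectation concrete, here is a minimal sketch of one possible stricter rule: return 0 whenever the number of predicted calls differs from the number of reference calls, and otherwise return the fraction of positions that match exactly. `Call` and `strict_score` are hypothetical names used only for illustration. They are not part of Ragas, and this is not a description of how ToolCallAccuracy is currently implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """Hypothetical stand-in for a tool call (not the ragas ToolCall class)."""
    name: str
    args: dict = field(default_factory=dict)


def strict_score(predicted, reference):
    """Return 0.0 on a count mismatch, else the fraction of exact positional matches."""
    if len(predicted) != len(reference):
        # Extra or missing tool calls fail the sample outright.
        return 0.0
    if not reference:
        # Nothing was expected and nothing was called.
        return 1.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Same situation as the reproduction: two calls made, one expected.
predicted = [
    Call("weather_check", {"location": "New York"}),
    Call("temperature_conversion", {"temperature_fahrenheit": 75}),
]
reference = [Call("weather_check", {"location": "New York"})]

print(strict_score(predicted, reference))      # extra call -> 0.0
print(strict_score(predicted[:1], reference))  # exact match -> 1.0
```

A softer alternative would be to scale the score by the ratio of expected to actual calls instead of dropping straight to 0. Which rule is right probably depends on the use case, so it may be worth making this configurable.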