Incorrect calculation at ToolCallAccuracy #1893

Open
@licux

Description

Describe the bug
In ToolCallAccuracy, if the number of tool calls in user_input is greater than the number of reference_tool_calls, the extra calls do not affect the evaluation score. In other words, the score remains unchanged even when more tool calls occur than were expected.

Ragas version: latest
Python version: 3.12

Code to Reproduce

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import ToolCallAccuracy

conversation = [
    HumanMessage(content="What's the weather like in New York right now?"),
    AIMessage(content="The current temperature in New York is 75°F and it's partly cloudy.", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
    HumanMessage(content="Can you translate that to Celsius?"),
    AIMessage(content="Let me convert that to Celsius for you.", tool_calls=[
        ToolCall(name="temperature_conversion", args={"temperature_fahrenheit": 75})
    ]),
    ToolMessage(content="75°F is approximately 23.9°C."),
    AIMessage(content="75°F is approximately 23.9°C.")
]

sample = MultiTurnSample(
    user_input=conversation,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]
)

scorer = ToolCallAccuracy()
score = await scorer.multi_turn_ascore(sample)
print(score)

Output:

1

Expected behavior
The evaluation score should be 0, since the assistant made a tool call that is not in reference_tool_calls. That said, I think there are many possible opinions on how this case should be scored.
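One possible direction (a sketch only, not ragas's actual implementation): normalize the number of aligned tool calls by the length of the longer list, so surplus calls reduce the score instead of being ignored. The function name `tool_call_accuracy_penalized` and the `(name, args)` tuple representation below are hypothetical, introduced purely for illustration; under this scheme the example above would score 0.5 rather than 0 or 1, which is one of several reasonable choices.

```python
# Hypothetical sketch: count tool calls that align with the reference,
# but normalize by the longer of the two lists so that extra (unexpected)
# calls lower the score. Tool calls are modeled as (name, args) tuples.

def tool_call_accuracy_penalized(predicted, reference):
    if not predicted and not reference:
        return 1.0
    # Positionally aligned, exactly matching calls.
    aligned = sum(1 for p, r in zip(predicted, reference) if p == r)
    # Dividing by max(...) means surplus calls on either side count against the score.
    return aligned / max(len(predicted), len(reference))

predicted = [
    ("weather_check", (("location", "New York"),)),
    ("temperature_conversion", (("temperature_fahrenheit", 75),)),
]
reference = [("weather_check", (("location", "New York"),))]

print(tool_call_accuracy_penalized(predicted, reference))  # 0.5, not 1.0
```

Whether the extra call should yield partial credit (0.5) or a hard 0 is exactly the kind of design question mentioned above.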


Labels: bug (Something isn't working), module-metrics (this is part of metrics module)
