Added a GraphEval - custom faithfulness evaluator that uses knowledge graphs #17
base: main
Conversation
Commit: …to logging from print
Walkthrough
A new evaluator module, GraphEval, introduces a knowledge-graph-based faithfulness check (described in the PR description below).
Sequence Diagram(s)
sequenceDiagram
participant User
participant GraphEvalEvaluator
participant LLM
User->>GraphEvalEvaluator: evaluate(entry)
GraphEvalEvaluator->>LLM: Extract knowledge graph from output (prompt)
LLM-->>GraphEvalEvaluator: Return knowledge graph triples
GraphEvalEvaluator->>LLM: Compare triples to context (prompt)
LLM-->>GraphEvalEvaluator: Return verification results
GraphEvalEvaluator-->>User: Return evaluation result (score, passed, details)
Actionable comments posted: 2
🧹 Nitpick comments (3)
evaluators/langevals/langevals_langevals/grapheval.py (2)
268-272: Incorrect type annotation for knowledge_graph
knowledge_graph is a collection of triples, not plain strings. Annotate it accordingly to improve static analysis and IDE support:
- knowledge_graph: list[str],
+ knowledge_graph: list[list[str]] | list[tuple[str, str, str]],
Also consider Sequence instead of list if mutability is not required.
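Concretely, the suggested annotation could look like the sketch below. This is illustrative only: the class body is elided, and placing the parameter on _compare_knowledge_graph_with_contexts is an assumption based on the evaluate() snippet further down.

```python
from collections.abc import Sequence

# (entity_1, relationship, entity_2)
Triple = tuple[str, str, str]


class GraphEvalEvaluator:  # sketch only; the real class subclasses BaseEvaluator
    def _compare_knowledge_graph_with_contexts(
        self, knowledge_graph: Sequence[Triple], contexts: list[str]
    ):
        """Ask the LLM to verify each extracted triple against the contexts."""
        ...
```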
317-325: Return type too narrow – may also return list/None
_get_arguments can return a list (for "triples"), a boolean (for "result"), or the error string. Declaring -> str | bool hides the list case.
- def _get_arguments(self, response: ModelResponse, value: str) -> str | bool:
+ from typing import Any
+ def _get_arguments(self, response: ModelResponse, value: str) -> Any:
This avoids type-checker noise and documents the real behaviour.
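One way the widened helper could read is sketched below. The tool-call access path follows the OpenAI-style response shape that litellm's ModelResponse mirrors; the PR's actual parsing logic is not shown here, so treat this as an assumption.

```python
import json
from typing import Any

from litellm import ModelResponse


def _get_arguments(self, response: ModelResponse, value: str) -> Any:
    """Pull one key out of the first tool call's JSON arguments.

    Returns a list for value="triples", a bool for value="result",
    or an error string when the response cannot be parsed.
    """
    try:
        raw = response.choices[0].message.tool_calls[0].function.arguments
        return json.loads(raw)[value]
    except (AttributeError, IndexError, KeyError, TypeError, json.JSONDecodeError) as err:
        return f"Could not parse tool-call arguments: {err}"
```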
evaluators/langevals/tests/test_grapheval.py (1)
19-60: Replace equality checks with direct truthiness & simplify isinstance chains
Several tests compare booleans with == and repeat isinstance on the same object. Adopting idiomatic assertions makes the tests clearer and silences Ruff warnings.
- assert result.passed == True
+ assert result.passed
- assert result.passed == False
+ assert not result.passed
-if not (
-    isinstance(result, EvaluationResultError)
-    or isinstance(result, EvaluationResultSkipped)
-):
+if not isinstance(result, (EvaluationResultError, EvaluationResultSkipped)):
Apply the pattern throughout the file (lines 19-21, 31-33, 47-49, 59-60, 74-75, 120-124, 135-139, 157-162, 174-177). No functional change, but the code becomes more idiomatic and easier to read.
Also applies to: 74-77, 100-124, 135-139, 157-162, 174-178
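Put together, a test following this pattern might read as below. The test name and fixture setup are hypothetical; the import path is taken from the code-graph listing further down.

```python
from langevals_core.base_evaluator import (
    EvaluationResultError,
    EvaluationResultSkipped,
)


def test_grapheval_flags_unfaithful_output():  # hypothetical test name
    result = evaluator.evaluate(entry)  # evaluator and entry set up elsewhere
    if not isinstance(result, (EvaluationResultError, EvaluationResultSkipped)):
        assert not result.passed  # idiomatic; replaces `result.passed == False`
```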
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
evaluators/langevals/langevals_langevals/grapheval.py (1 hunks)
evaluators/langevals/tests/test_grapheval.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
evaluators/langevals/langevals_langevals/grapheval.py (1)
langevals_core/langevals_core/base_evaluator.py (3)
BaseEvaluator (196-330)
EvaluatorEntry (62-114)
LLMEvaluatorSettings (46-54)
🪛 Ruff (0.8.2)
evaluators/langevals/tests/test_grapheval.py
20-20: Avoid equality comparisons to True; use if result.passed: for truth checks. Replace with result.passed (E712)
32-32: Avoid equality comparisons to False; use if not result.passed: for false checks. Replace with not result.passed (E712)
48-48: Avoid equality comparisons to True; use if result.passed: for truth checks. Replace with result.passed (E712)
60-60: Avoid equality comparisons to False; use if not result.passed: for false checks. Replace with not result.passed (E712)
75-75: Avoid equality comparisons to False; use if not result.passed: for false checks. Replace with not result.passed (E712)
120-121: Multiple isinstance calls for result; merge into a single call (SIM101)
136-137: Multiple isinstance calls for result; merge into a single call (SIM101)
139-139: Avoid equality comparisons to False; use if not result.passed: for false checks. Replace with not result.passed (E712)
157-158: Multiple isinstance calls for result; merge into a single call (SIM101)
175-175: Avoid equality comparisons to True; use if result.passed: for truth checks. Replace with result.passed (E712)
evaluators/langevals/langevals_langevals/grapheval.py
82-82: Local variable details is assigned to but never used; remove the assignment (F841)
"name": "create_knowledge_graph", | ||
"description": "Create a knowledge graph from input text", | ||
"parameters": { | ||
"type": "object", | ||
"properties": { | ||
"triples": { | ||
"type": "array", | ||
"items": { | ||
"type": "object", | ||
"properties": { | ||
"entity_1": { | ||
"type": "string", | ||
"description": "First entity in the relationship", | ||
}, | ||
"relationship": { | ||
"type": "string", | ||
"description": "Relationship between entities", | ||
}, | ||
"entity_2": { | ||
"type": "string", | ||
"description": "Second entity in the relationship", | ||
}, | ||
}, | ||
"required": [ | ||
"entity_1", | ||
"relationship", | ||
"entity_2", | ||
], | ||
}, | ||
"description": "List of entity-relationship triples that construct a knowledge graph", | ||
} | ||
}, |
🛠️ Refactor suggestion
Schema/Example mismatch – triples defined as objects but examples return lists
The JSON schema for triples specifies an array of objects with entity_1, relationship, and entity_2 keys. However, the prompt examples (and downstream code/tests) expect each triple to be a simple 3-item list, e.g.:
["Italy", "had 3.6x times more cases of coronavirus than", "China"]
This inconsistency can confuse the LLM and break self._get_arguments (which assumes a list of lists). Consider aligning the schema with the expected structure:
- "triples": {
- "type": "array",
- "items": {
- "type": "object",
- "properties": {
- "entity_1": { "type": "string" },
- "relationship": { "type": "string" },
- "entity_2": { "type": "string" }
- },
- "required": ["entity_1","relationship","entity_2"],
- },
+ "triples": {
+ "type": "array",
+ "items": {
+ "type": "array",
+ "items": { "type": "string" },
+ "minItems": 3,
+ "maxItems": 3
+ },
"description": "List of [entity_1, relationship, entity_2] triples",
}
or, alternatively, update the rest of the pipeline/tests to consume objects instead of lists.
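To sanity-check the aligned schema, a quick validation sketch follows. It uses the third-party jsonschema package, which is an assumption on my part — it is not a dependency declared in this PR.

```python
from jsonschema import validate  # pip install jsonschema

triples_schema = {
    "type": "array",
    "items": {
        "type": "array",
        "items": {"type": "string"},
        "minItems": 3,
        "maxItems": 3,
    },
}

# A triple shaped like the prompt example passes validation; an
# object-shaped triple would raise jsonschema.ValidationError.
validate(
    instance=[["Italy", "had 3.6x times more cases of coronavirus than", "China"]],
    schema=triples_schema,
)
```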
def evaluate(self, entry: GraphEvalEntry) -> SingleEvaluationResult:
    details = None
    passed = None
    try:
        knowledge_graph_response = self._construct_knowledge_graph(entry.output)
        cost = completion_cost(knowledge_graph_response) or 0.0
        knowledge_graph = self._get_arguments(
            knowledge_graph_response, value="triples"
        )
    except Exception as e:
        logging.error("Caught an exception while creating a knowledge graph: ", e)

    try:
        if isinstance(knowledge_graph, list):
            passed_response = self._compare_knowledge_graph_with_contexts(
                knowledge_graph=knowledge_graph, contexts=entry.contexts
            )
            cost += completion_cost(passed_response) or 0.0
            passed = self._get_arguments(passed_response, value="result")
    except Exception as e:
        logging.error(
            "Caught an exception while comparing knowledge graph with contexts: ", e
        )

    if isinstance(passed, bool):
        return GraphEvalResult(
            passed=passed,
            details=f"The following entity_1-relationship->entity_2 triples were found in the output: {knowledge_graph}",
            cost=Money(amount=cost, currency="USD") if cost else None,
        )
    return GraphEvalResult(
        passed=False,
        details="We could not evaluate faithfulness of the output",
        cost=Money(amount=cost, currency="USD") if cost else None,
    )
Guard against un-initialised variables & improve logging
cost and knowledge_graph are first defined inside the try-block. If _construct_knowledge_graph throws, the next block (if isinstance(knowledge_graph, list):) will raise an UnboundLocalError, and the final Money(amount=cost …) access will do the same for cost. While fixing this, you can also drop the unused details variable and use logging.exception to capture the traceback.
- details = None
- passed = None
- try:
- knowledge_graph_response = self._construct_knowledge_graph(entry.output)
- cost = completion_cost(knowledge_graph_response) or 0.0
+ passed: bool | None = None
+ cost: float = 0.0
+ knowledge_graph: list[list[str]] | None = None
+ try:
+ knowledge_graph_response = self._construct_knowledge_graph(entry.output)
+ cost += completion_cost(knowledge_graph_response) or 0.0
knowledge_graph = self._get_arguments(
knowledge_graph_response, value="triples"
)
- except Exception as e:
- logging.error("Caught an exception while creating a knowledge graph: ", e)
+ except Exception:
+ logging.exception("Error while creating a knowledge graph")
Do the same in the second try-block:
- except Exception as e:
- logging.error(
- "Caught an exception while comparing knowledge graph with contexts: ", e
- )
+ except Exception:
+ logging.exception(
+ "Error while comparing knowledge graph with contexts"
+ )
This prevents hidden crashes and surfaces the real stack-trace during debugging.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def evaluate(self, entry: GraphEvalEntry) -> SingleEvaluationResult:
    passed: bool | None = None
    cost: float = 0.0
    knowledge_graph: list[list[str]] | None = None
    try:
        knowledge_graph_response = self._construct_knowledge_graph(entry.output)
        cost += completion_cost(knowledge_graph_response) or 0.0
        knowledge_graph = self._get_arguments(
            knowledge_graph_response, value="triples"
        )
    except Exception:
        logging.exception("Error while creating a knowledge graph")

    try:
        if isinstance(knowledge_graph, list):
            passed_response = self._compare_knowledge_graph_with_contexts(
                knowledge_graph=knowledge_graph, contexts=entry.contexts
            )
            cost += completion_cost(passed_response) or 0.0
            passed = self._get_arguments(passed_response, value="result")
    except Exception:
        logging.exception(
            "Error while comparing knowledge graph with contexts"
        )

    if isinstance(passed, bool):
        return GraphEvalResult(
            passed=passed,
            details=f"The following entity_1-relationship->entity_2 triples were found in the output: {knowledge_graph}",
            cost=Money(amount=cost, currency="USD") if cost else None,
        )
    return GraphEvalResult(
        passed=False,
        details="We could not evaluate faithfulness of the output",
        cost=Money(amount=cost, currency="USD") if cost else None,
    )
Created a new custom evaluator called GraphEval, inspired by the research paper https://arxiv.org/pdf/2407.10793. It uses an LLM to construct a knowledge graph of entities and relationships from the output, then checks whether those triples are faithful to the contexts. This evaluator can be used as an alternative to the RAGAS faithfulness and factual correctness evaluators.
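For illustration, invoking the evaluator might look like the sketch below. The entry fields output and contexts match the evaluate() snippet above; the settings class name and its defaults are assumptions, not confirmed by this PR.

```python
from langevals_langevals.grapheval import (
    GraphEvalEntry,
    GraphEvalEvaluator,
    GraphEvalSettings,  # hypothetical name for the evaluator's settings class
)

evaluator = GraphEvalEvaluator(settings=GraphEvalSettings())
result = evaluator.evaluate(
    GraphEvalEntry(
        output="Italy had 3.6x times more cases of coronavirus than China.",
        contexts=["In March 2020, Italy reported 3.6 times more cases than China."],
    )
)

# result.passed is True when every extracted triple is supported by the contexts.
print(result.passed, result.details)
```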