LLM-Eval is a two-phase research project that explores how Large Language Models (LLMs) can be used to automatically evaluate the quality of dialogues between humans and conversational agents.
The project investigates the use of the LLM-EVAL framework, testing its ability to reproduce human-like evaluations across different models and datasets.
- Phase 1: Evaluate four LLMs on a benchmark dataset (ConvAI2):
  - Claude 3
  - Claude 3.5
  - GPT-4o
  - GPT-4o-mini
- Phase 2: Evaluate how dataset structure affects performance (using Claude 3):
  - FED
  - PC
  - TC
  - DSTC9
- Metrics: Accuracy, Cohen's Kappa, and Pearson, Spearman, and Kendall-Tau correlations
- Evaluation schema follows the paper: LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations
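
As a rough illustration of the metrics listed above, the following dependency-free sketch compares human annotations with model outputs. It is not the project's own evaluation code: the function names and the toy data are assumptions made for this example. In practice, `sklearn.metrics` (`accuracy_score`, `cohen_kappa_score`) and `scipy.stats` (`pearsonr`, `spearmanr`, `kendalltau`) provide equivalent, better-tested implementations.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of items where the model label equals the human label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Expected agreement if both raters labeled independently.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def pearson(x, y):
    """Linear correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall tau-b, computed over all pairs (O(n^2), fine for small sets)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + only_x_tied) * (untied + only_y_tied)
    )


# Toy data (hypothetical): binary labels and 1-5 quality scores.
human_labels, model_labels = [1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1]
human_scores, model_scores = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]

print(round(accuracy(human_labels, model_labels), 3))     # 0.833
print(round(cohen_kappa(human_labels, model_labels), 3))  # 0.667
print(round(pearson(human_scores, model_scores), 3))      # 0.8
print(round(spearman(human_scores, model_scores), 3))     # 0.8
print(round(kendall_tau(human_scores, model_scores), 3))  # 0.6
```

Accuracy and Cohen's Kappa apply to categorical labels, while the three correlations apply to numeric scores, so the two toy datasets are kept separate.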
- Programming Language: Python 3
- API Access: OpenAI + Anthropic APIs
- Environment Management: `venv` + `.env` for keys
- Libraries: `json`, `os`, `tqdm`, `anthropic`, `openai`, `sklearn`, `pandas`, `matplotlib`
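
One possible way to read the API keys from a `.env` file using only the standard library is sketched below. This is an assumption about how the keys might be loaded, not the project's actual code: `load_env_file` and `require_keys` are hypothetical helpers, and the layout of the `.env` file is assumed to be plain `KEY=VALUE` lines. `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are the environment variables the official `openai` and `anthropic` Python clients read by default.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Variables already set in the environment are not overwritten.
    Returns the pairs that were parsed from the file.
    """
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def require_keys(names=("OPENAI_API_KEY", "ANTHROPIC_API_KEY")):
    """Fail early with a clear message if a required key is missing."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
```

Calling `load_env_file()` followed by `require_keys()` at the start of a script surfaces a missing key before any API request is made. The `.env` file itself should stay out of version control.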
LLM-Eval/
├── docs/                    ← Project report, presentation, paper
├── prog/
│   ├── dataset1/            ← Phase 1: Model-based evaluation (Claude, GPT)
│   │   ├── Claude3/
│   │   ├── Claude3-5/
│   │   ├── GPT-4o/
│   │   ├── GPT-4o-mini/
│   │   └── convai2_data.json
│   └── dataset2/            ← Phase 2: Dataset-based evaluation (FED, TC, etc.)
│       ├── DSTC9/
│       ├── FED/
│       ├── PC/
│       └── TC/
└── README.md                ← Project documentation (this file)
- LLM-Eval_Report.pdf: Full project report
- LLM-Eval_Paper.pdf: Original paper on LLM-Eval
- LLM-Eval_Presentation.pptx: Slide deck
- LLM-Eval_Guidelines.pdf: Project guidelines
All located inside docs/.
- Arcangeli Giovanni
- Ciancio Vittorio
- Marco Di Maio
Project presented for the Artificial Intelligence course, University of Salerno (2025).
This project is licensed under the CC BY-NC-SA 4.0 License.
You may share and adapt this work for non-commercial purposes only, as long as you give appropriate credit and distribute your contributions under the same license.
For commercial use, explicit permission from the authors is required.
