Marco210210/llm-eval-analysis

LLM-Eval – Automatic Evaluation of Dialogues with LLMs

LLM-Eval is a two-phase research project that explores how Large Language Models (LLMs) can be used to automatically evaluate the quality of dialogues between humans and conversational agents.

The project investigates the use of the LLM-EVAL framework, testing its ability to reproduce human-like evaluations across different models and datasets.


🌐 Project Overview

  • Phase 1: Evaluate four LLMs on a benchmark dataset (ConvAI2):
    • Claude 3
    • Claude 3.5
    • GPT-4o
    • GPT-4o-mini
  • Phase 2: Evaluate how dataset structure affects performance (using Claude 3):
    • FED
    • PC
    • TC
    • DSTC9
  • Metrics: Accuracy, Cohen's Kappa, and Pearson, Spearman, and Kendall-Tau correlations
  • Evaluation schema follows the paper: LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations

🛠️ Technologies & Tools

  • Programming Language: Python 3
  • API Access: OpenAI + Anthropic APIs
  • Environment Management: venv + .env for keys
  • Libraries: json, os, tqdm, anthropic, openai, sklearn, pandas, matplotlib
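Keeping API keys in a `.env` file means they never appear in the source code. The sketch below shows one way such a file could be read using only the standard library. It is an assumption about the setup, not the repository's actual loader: the helper name `load_env` is hypothetical, and the variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are assumed because they are the defaults the official OpenAI and Anthropic Python SDKs look for.

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Blank lines, comments, and lines without '=' are skipped.
    Variables already set in the environment are not overwritten.
    Returns the dict of pairs parsed from the file.
    """
    loaded = {}
    env_file = Path(path)
    if not env_file.exists():
        return loaded
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    load_env()
    # Assumed variable names; adjust to whatever the .env file defines.
    for name in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
        print(name, "is set" if os.environ.get(name) else "is missing")
```

The `.env` file itself should stay out of version control, for example by listing it in `.gitignore`.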

📁 Repository Structure

LLM-Eval/
├── docs/                    → Project report, presentation, paper
├── prog/
│   ├── dataset1/            → Phase 1: Model-based evaluation (Claude, GPT)
│   │   ├── Claude3/
│   │   ├── Claude3-5/
│   │   ├── GPT-4o/
│   │   ├── GPT-4o-mini/
│   │   └── convai2_data.json
│   └── dataset2/            → Phase 2: Dataset-based evaluation (FED, TC, etc.)
│       ├── DSTC9/
│       ├── FED/
│       ├── PC/
│       └── TC/
└── README.md                → Project documentation (this file)

📄 Documentation

All documentation (project report, presentation, and reference paper) is located in docs/.


👥 Contributors

Project presented for the Artificial Intelligence course – University of Salerno (2025)


📄 License

This project is licensed under the CC BY-NC-SA 4.0 License.

You may share and adapt this work for non-commercial purposes only, as long as you give appropriate credit and distribute your contributions under the same license.
For commercial use, explicit permission from the authors is required.

About

Automatic multi-metric evaluation of human-bot dialogues using LLMs (Claude, GPT-4o) across different datasets and settings. Built for the Artificial Intelligence course at the University of Salerno.
