This project demonstrates how to fine-tune a large language model using Low-Rank Adaptation (LoRA) for the task of academic paper summarization, and then leverage this fine-tuned model in a multi-agent autonomous research assistant built using LangGraph + LangChain.
It was developed as part of the Generative AI - Spring 2025 coursework under the supervision of Dr. Hajra Waheed.

Reading and understanding scientific literature is cognitively intensive. Our objective was to fine-tune an LLM using LoRA, a parameter-efficient training strategy, to generate structured, coherent, and factually correct summaries of research papers from the arXiv Summarization Dataset.
- Source: arXiv Summarization Dataset via HuggingFace Datasets
- Samples Used: 5,000
- Structure: Input = `article`, Target = `abstract`
- Split:
  - Train: 4,000
  - Validation: 500
  - Test: 500
- Base Model: `Qwen/Qwen2.5-3B-Instruct` (chat-oriented transformer)
- Tokenizer: Used the base model's tokenizer for consistent input-output formatting
- LoRA Integration: Applied to the attention projections `q_proj` and `v_proj`
| Parameter | Value |
|---|---|
| LoRA Rank (`r`) | 8 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.1 |
| Epochs | 4 |
| Batch Size (per device) | 3 |
| Gradient Accumulation Steps | 3 |
| Precision | FP16 |
| Max Input Length | 512 tokens |
| Optimizer | AdamW |
Trainable Parameters: ~1.84M (only ~0.06% of the 3.08B total)

📁 Output: Saved in `/qwen3b-lora-finetuned/`
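The configuration above maps directly onto a PEFT `LoraConfig`. The sketch below is illustrative (adapted from the table, not copied from the notebook) and assumes a recent `peft`/`transformers` install:

```python
# Sketch of the LoRA setup described above (illustrative, not verbatim from the notebook).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

lora_cfg = LoraConfig(
    r=8,                                   # LoRA rank
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # should report roughly 1.84M of 3.08B trainable
```

With per-device batch size 3 and 3 gradient accumulation steps, the effective batch size during training is 9.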
- Summaries generated for 10 unseen test articles
- Comparison across:
- Ground Truth Abstract
- Base Model Output
- Fine-Tuned Model Output
- File: `summary_comparisons.json`
Using:
- `evaluate.load("rouge")`
- `nltk.translate.bleu_score`
- `bert_score.score`
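For intuition about what these numbers mean, ROUGE-1 F1 reduces to clipped unigram overlap between a candidate summary and its reference. A minimal pure-Python sketch (the project itself uses the `evaluate` library, not this function):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, the idea behind ROUGE-1 (illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the model summarizes papers",
                  "the model summarizes research papers")
```

ROUGE-L extends this idea to longest common subsequences, and BERTScore replaces exact token matches with embedding similarity.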
| Metric | Base Model | Fine-Tuned |
|---|---|---|
| ROUGE-1 | ~0.37 | ~0.43 |
| ROUGE-L | ~0.32 | ~0.36 |
| BLEU | ~0.25 | ~0.31 |
| BERTScore-F1 | ~0.83 | ~0.89 |
📓 See `LLM-Finetuning.ipynb` for implementation.
Using LLaMA 3.3 70B Instruct Turbo via the Together AI API, each summary was rated on:
- Fluency
- Factuality
- Coverage
| Metric | Avg Score (out of 5) |
|---|---|
| Fluency | 5.00 |
| Factuality | 5.00 |
| Coverage | 4.70 |
Prompts and logic are also in `LLM-Finetuning.ipynb`.
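An LLM-as-judge evaluation like this comes down to assembling a rubric prompt per summary. The wording below is hypothetical (the real prompts live in `LLM-Finetuning.ipynb`); it only sketches the shape of such a prompt:

```python
def build_judge_prompt(article_excerpt: str, summary: str) -> str:
    """Assemble an LLM-as-judge prompt (hypothetical wording; the actual
    prompts used in this project are in LLM-Finetuning.ipynb)."""
    return (
        "You are an expert reviewer. Rate the summary below on a 1-5 scale "
        "for each criterion: Fluency, Factuality, Coverage.\n\n"
        f"Article excerpt:\n{article_excerpt}\n\n"
        f"Summary:\n{summary}\n\n"
        'Respond as JSON: {"fluency": int, "factuality": int, "coverage": int}'
    )

prompt = build_judge_prompt("...", "...")
```

Requesting a structured JSON response makes the judge's scores easy to parse and average across the test set.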
A fully autonomous system for conducting academic literature reviews.
| Agent | Purpose |
|---|---|
| `KeywordAgent` | Expands the initial user query using an LLM |
| `SearchAgent` | Retrieves relevant open-access papers (via API) |
| `RankAgent` | Scores papers by relevance, recency, and citations |
| `SummaryAgent` | Uses the LoRA-finetuned model to summarize selected papers |
| `CompareAgent` | Extracts insights, contradictions, and research gaps |
```
User Input
    ↓
expand_keywords
    ↓
search_papers
    ↓
rank_papers
    ↓
summarize_papers
    ↓
compare_papers
    ↓
Output: Structured PDF Research Summary
```
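Because the graph above is strictly linear, its control flow can be sketched without LangGraph as a chain of functions passing a shared state dict. The agent bodies below are stand-in stubs (the real agents call LLMs and external APIs); only the pipeline shape matches the diagram:

```python
# Illustrative pipeline sketch; each agent here is a stub, not the real implementation.
from functools import reduce

def expand_keywords(state):   # KeywordAgent: LLM-based query expansion (stubbed)
    state["keywords"] = state["query"].split() + ["survey"]
    return state

def search_papers(state):     # SearchAgent: open-access API lookup (stubbed)
    state["papers"] = [f"paper about {k}" for k in state["keywords"]]
    return state

def rank_papers(state):       # RankAgent: relevance/recency/citation scoring (stubbed)
    state["papers"] = sorted(state["papers"])
    return state

def summarize_papers(state):  # SummaryAgent: LoRA-finetuned model (stubbed)
    state["summaries"] = [p.upper() for p in state["papers"]]
    return state

def compare_papers(state):    # CompareAgent: insights, contradictions, gaps (stubbed)
    state["report"] = " | ".join(state["summaries"])
    return state

PIPELINE = [expand_keywords, search_papers, rank_papers,
            summarize_papers, compare_papers]

def run(query: str) -> dict:
    """Thread the state dict through each agent in order."""
    return reduce(lambda state, step: step(state), PIPELINE, {"query": query})

result = run("ai healthcare")
```

In the actual system, LangGraph manages this state passing and would also support branching or retries between nodes.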
🧾 Output Example: `ai_healthcare_report.pdf`
- `transformers` — Model loading and generation
- `peft` — LoRA integration
- `datasets` — Data loading
- `torch` — GPU training
- `langchain` + `langgraph` — Multi-agent orchestration
- `together` — External API for LLM-based evaluations
- `json`, `re`, `reportlab`, `PyMuPDF` — Utilities for data formatting, PDF parsing, and output
```
📁 qwen3b-lora-finetuned/
├── adapter_config.json
├── adapter_model.safetensors
├── tokenizer_config.json
├── tokenizer.json
└── ... (other model assets)
📄 ai_healthcare_report.pdf   # Agent-generated multi-paper summary
📄 Experimental-Report.pdf    # Final technical write-up
📄 Task-Manual.pdf            # Assignment instructions
📄 LLM-Finetuning.ipynb       # Model fine-tuning logic
📄 LLM-Evaluation.ipynb       # LLM-based evaluation
```
You must have access to a CUDA-enabled GPU with >16GB VRAM.
- Clone the repository.
- Run `LLM-Finetuning.ipynb` to reproduce the training.
- Evaluate using `LLM-Evaluation.ipynb`.
- Results are saved as `.json` and `.pdf` reports.
This project is submitted as part of academic coursework and is intended for educational and research purposes only.
For queries or collaboration: