This project provides an interactive web interface to semantically search scientific research papers using vector embeddings. It leverages Sentence Transformers, Qdrant, and Gradio to create an intuitive and powerful search experience tailored for academic texts.
You can try the app instantly without installing anything:
π Launch Live Demo on Hugging Face Spaces
Note: The demo uses a Qdrant-hosted vector store preloaded with 1,000 scientific papers from PubMed Central (PMC).
Topics covered in the papers of the live demo:
- Introduction
- Features
- Installation
- Usage
- Project Structure
- Configuration
- Examples
- Dependencies
- Limitations
This project enables semantic querying of scientific papers using natural language input. It retrieves the most relevant documents and highlights the most pertinent paragraphs. Behind the scenes, documents are embedded and indexed in Qdrant, a vector database, and retrieved using cosine similarity.
- Upload scientific articles from PMCIDs or datasets.
- Sentence-level chunking and indexing for fine-grained retrieval.
- Gradio-powered web UI with real-time document rendering.
- Highlighting of most relevant paragraphs.
Ensure you have Python 3.9+ installed, then install dependencies:
pip install -r requirements.txt
Set the following environment variables:
Variable | Description | Default |
---|---|---|
QDRANT_URL |
URL to the Qdrant instance | :memory: |
QDRANT_API_KEY |
API key for Qdrant (if required) | "" |
If you don't set QDRANT_URL
an in memory store will be used.
Start the Gradio app:
python app.py
Enter your query, set the number of documents to retrieve, and submit to see relevant articles and highlighted paragraphs.
Once running, click "Load Example" to search for:
venuous thrombosis
This will return the most relevant paragraphs and links to scientific articles.
See requirements.txt
for full list. Key libraries include:
gradio
torch
sentence-transformers
qdrant-client
datasets
tqdm
- PDF upload is not yet implemented (
fetch_pdf.py
is a stub). - Only supports EuropePMC via PMCID for external article ingestion.
- The live demo is limited to 1,000 papers from PMC (hosting is expensive
βΉοΈ ).