A full-stack text-based search engine that includes web crawling, indexing, PageRank scoring, and a frontend search interface. Built with Python and Docker, this project simulates a mini search engine pipeline with modular support for crawling, text normalization, document ranking, and querying.
```
Text-Search-Engine/
├── map-reduce-page-rank/        # PageRank using MapReduce
├── spark-page-rank/             # PageRank implemented using Apache Spark
├── search-frontend/             # React-based UI for querying the engine
├── canonicalizer.py             # URL normalization
├── crawler.py                   # Web crawler for link extraction and content retrieval
├── doc_writer.py                # Writes documents to the file system
├── docker-compose.yml           # Docker configuration
├── elasticsearch.yml            # Elasticsearch config
├── indexer.py                   # Builds the inverted index and pushes data to Elasticsearch
├── page_rank.py                 # PageRank implementation
├── frontier.py                  # Maintains the crawling frontier (queue of URLs)
├── stoplist.txt                 # Stopwords to exclude during indexing
├── document_related_terms.txt   # Precomputed document-topic relevance scores
├── crawled_pagerank_res.txt     # PageRank scores for crawled pages
├── wt2g_inlinks.txt             # Inlink data (used in PageRank)
├── wt2g_res.txt                 # Raw crawl results
├── topical_terms.txt            # Terms used for topical scoring
├── kibana.yml                   # Kibana configuration
└── README.md                    # This file
```
- Crawler: Recursively crawls web pages and stores page content, links, and metadata.
- Canonicalizer: Normalizes URLs to avoid redundant crawling (see the sketch after this list).
- Indexer: Creates inverted indexes and uploads them to Elasticsearch.
- PageRank: Supports classic PageRank, MapReduce-based, and Spark-based versions.
- Search UI: React-based frontend for querying indexed content.
- Topical Term Scoring: Weighs search results based on topic relevance.
- Dockerized Setup: Uses Docker Compose to spin up Elasticsearch and Kibana.
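The canonicalization idea is easy to sketch: lowercase the scheme and host, drop default ports and fragments, and resolve relative references. The snippet below is a minimal illustration using `urllib.parse`; it is not the actual rule set in `canonicalizer.py`.

```python
from urllib.parse import urljoin, urlparse, urlunparse

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url, base=None):
    """Normalize a URL so equivalent links map to one key (illustrative only)."""
    if base:
        url = urljoin(base, url)                      # resolve relative links
    parts = urlparse(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"                 # keep only non-default ports
    path = parts.path or "/"
    return urlunparse((scheme, host, path, "", parts.query, ""))  # drop the fragment

print(canonicalize("HTTP://Example.COM:80/docs/#top"))  # -> http://example.com/docs/
```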
To run the project you will need:
- Python 3.8+
- Docker & Docker Compose
- Node.js (for the frontend)
```bash
git clone https://github.com/harivilasp/Text-Search-Engine.git
cd Text-Search-Engine
docker-compose up --build
```
This will start Elasticsearch and Kibana instances as services.
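Before crawling or indexing, it can help to confirm both services are reachable. A quick check, assuming the compose file exposes the default ports (9200 for Elasticsearch, 5601 for Kibana):

```python
import requests

# Assumes docker-compose exposes the default ports.
es = requests.get("http://localhost:9200/_cluster/health", timeout=5)
es.raise_for_status()
print("Elasticsearch status:", es.json()["status"])      # green / yellow / red

kibana = requests.get("http://localhost:5601/api/status", timeout=5)
print("Kibana reachable:", kibana.ok)
```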
```bash
python crawler.py
```
This will crawl URLs from a seed list and save documents locally.
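`crawler.py` holds the project's actual logic; the sketch below only shows the general frontier-driven shape (seed list, BFS queue, BeautifulSoup link extraction). The seed URL, page limit, and in-memory storage are illustrative assumptions.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=50):
    """Breadth-first crawl: pop a URL, fetch it, extract links, repeat."""
    frontier = deque(seeds)            # the project keeps this queue in frontier.py
    seen = set(seeds)
    docs = {}
    while frontier and len(docs) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        docs[url] = soup.get_text(" ", strip=True)      # doc_writer.py persists these
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])              # canonicalize() in practice
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return docs

pages = crawl(["https://example.com/"], max_pages=5)
print(len(pages), "pages fetched")
```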
```bash
python indexer.py
```
This will index the crawled documents and push them to Elasticsearch.
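Roughly, indexing means filtering stopwords and pushing each document into an Elasticsearch index. The sketch below uses the plain REST API via `requests`; the index name `pages` and the document fields are assumptions, not necessarily what `indexer.py` uses.

```python
import requests

ES = "http://localhost:9200"
INDEX = "pages"                   # assumed index name; see indexer.py for the real one

with open("stoplist.txt") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def index_doc(doc_id, url, text):
    # Drop stopwords before indexing, mirroring the stoplist.txt filtering step
    tokens = [t for t in text.lower().split() if t not in stopwords]
    body = {"url": url, "content": " ".join(tokens)}
    resp = requests.put(f"{ES}/{INDEX}/_doc/{doc_id}", json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["result"]  # "created" or "updated"

print(index_doc(1, "http://example.com/", "An example page about search engines"))
```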
```bash
cd search-frontend
npm install
npm start
```
The React frontend will be available at http://localhost:3000.
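Behind the UI, queries ultimately reach Elasticsearch, so the index can also be tested directly. A hedged example of the kind of `match` query the frontend would issue (again assuming the index is named `pages`):

```python
import requests

query = {"query": {"match": {"content": "page rank algorithm"}}, "size": 10}
resp = requests.get("http://localhost:9200/pages/_search", json=query, timeout=10)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(f'{hit["_score"]:.3f}', hit["_source"]["url"])  # BM25 score and page URL
```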
The repository ships several data files:
- `stoplist.txt`: Stopwords for filtering uninformative terms.
- `topical_terms.txt`: Terms related to specific topics, used in topical scoring.
- `document_related_terms.txt`: Topic-document relevance scores.
- `wt2g_inlinks.txt`: Link structure for PageRank.
- `crawled_pagerank_res.txt`: Output PageRank scores.
Choose from 3 different implementations:
- `page_rank.py`: Pure Python iterative version.
- `map-reduce-page-rank/`: MapReduce-style version for large-scale computation.
- `spark-page-rank/`: Apache Spark-based scalable version.
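All three variants compute the same quantity; the pure-Python one is essentially the standard iterative formulation. A minimal sketch follows, where the damping factor, tolerance, and the tiny example graph are illustrative (`page_rank.py` handles the real `wt2g_inlinks.txt` input):

```python
def pagerank(inlinks, damping=0.85, tol=1e-6, max_iter=100):
    """inlinks: {page: [pages linking to it]}; returns {page: score}."""
    pages = set(inlinks) | {s for srcs in inlinks.values() for s in srcs}
    outdeg = {p: 0 for p in pages}
    for srcs in inlinks.values():
        for s in srcs:
            outdeg[s] += 1
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Mass from pages with no outlinks is spread evenly over all pages
        sink_mass = sum(rank[p] for p in pages if outdeg[p] == 0)
        new = {}
        for p in pages:
            incoming = sum(rank[q] / outdeg[q] for q in inlinks.get(p, []) if outdeg[q])
            new[p] = (1 - damping) / n + damping * (incoming + sink_mass / n)
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new
    return rank

# Edges A -> C, B -> A, C -> A, C -> B, expressed as inlinks
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```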
Search results are ranked using:
- BM25 (via Elasticsearch)
- PageRank scores
- Topical Relevance Scoring
These rankings are combined to provide more meaningful results.
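The exact blend is defined by the query-side code; conceptually it is a weighted combination like the sketch below, where the weights and the assumption of pre-normalized scores are illustrative rather than the project's tuned values.

```python
def combined_score(bm25, pagerank, topical, w_bm25=0.6, w_pr=0.3, w_topic=0.1):
    """Weighted blend of BM25, PageRank, and topical relevance (illustrative weights)."""
    return w_bm25 * bm25 + w_pr * pagerank + w_topic * topical

# Re-rank Elasticsearch hits using precomputed PageRank and topical scores
hits = [("http://a.example/", 0.82), ("http://b.example/", 0.74)]   # (url, normalized BM25)
pr = {"http://a.example/": 0.10, "http://b.example/": 0.45}
topic = {"http://a.example/": 0.30, "http://b.example/": 0.90}

ranked = sorted(hits, key=lambda h: combined_score(h[1], pr[h[0]], topic[h[0]]), reverse=True)
print(ranked)
```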
Configuration and filter files:
- `elasticsearch.yml`: Elasticsearch tuning.
- `kibana.yml`: Kibana dashboard settings.
- `ignore_urls.txt`: Patterns to ignore during crawling.
- BeautifulSoup Documentation
- StackOverflow: Extract Protocol & Host from URL
- StackOverflow: Absolute Path Resolution
- StackOverflow: Rename Key in Dictionary
Hari Vilas Panjwani
Feel free to reach out via GitHub for collaborations or suggestions!
This project is open-source and available under the MIT License.