ContextCrafter - Universal Content Ingestion Pipeline

A scalable content ingestion system that transforms content sources such as blogs, websites, and PDFs into a structured knowledge base format. It is built for production use, with an emphasis on reliability and a clean, polished user experience.

🚀 Features

Universal Content Extraction

Multi-source Support: Blogs, websites, PDFs, Substack, technical documentation
Intelligent Fallbacks: Multiple extraction strategies with automatic failover
Smart Content Detection: Automatically identifies content type and structure
Chapter-aware PDF Processing: Intelligently splits books and documents
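As an illustration of chapter-aware splitting, the sketch below cuts already-extracted plain text at chapter headings. It is a minimal example, not the project's implementation: the heading pattern and the names `CHAPTER_RE` and `split_into_chapters` are assumptions, and a real PDF would first need its text extracted (the tech stack lists PyMuPDF for that).

```python
import re
from typing import List, Tuple

# Assumed heading pattern: lines such as "Chapter 3" or "CHAPTER 12: Graphs".
CHAPTER_RE = re.compile(r"^\s*chapter\s+\d+\b.*$", re.IGNORECASE)


def split_into_chapters(text: str) -> List[Tuple[str, str]]:
    """Split plain text into (title, body) pairs at chapter headings.

    Sections without any body text are skipped.
    """
    chapters: List[Tuple[str, str]] = []
    title = "Front matter"  # hypothetical label for text before the first heading
    body: List[str] = []
    for line in text.splitlines():
        if CHAPTER_RE.match(line):
            # Close the previous section before starting a new one.
            if any(part.strip() for part in body):
                chapters.append((title, "\n".join(body).strip()))
            title, body = line.strip(), []
        else:
            body.append(line)
    if any(part.strip() for part in body):
        chapters.append((title, "\n".join(body).strip()))
    return chapters
```

Books that use other heading styles ("Part II", numbered sections) would need a different pattern or a table-of-contents lookup.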

Production-Ready Architecture

Scalable Design: Built on Next.js 15 with modern React patterns
Real-time Processing: Server-sent events for live progress updates
Error Resilience: Comprehensive error handling and recovery
Database Integration: PostgreSQL schema for knowledge base storage
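The real-time progress updates rely on the server-sent events wire format, in which each event is a block of `field: value` lines terminated by a blank line. The sketch below shows that format in Python for illustration only; the application itself runs on Next.js, and the event names (`progress`, `done`) and payload fields are assumptions.

```python
import json
from typing import Any, Dict, Iterator, List


def format_sse(event: str, data: Dict[str, Any]) -> str:
    """Encode one server-sent event: an event name, a JSON data line, and a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def progress_stream(urls: List[str]) -> Iterator[str]:
    """Yield a 'progress' event per source, then a final 'done' event."""
    total = len(urls)
    for index, url in enumerate(urls, start=1):
        # A real pipeline would extract the URL here before reporting progress.
        yield format_sse("progress", {"url": url, "completed": index, "total": total})
    yield format_sse("done", {"total": total})
```

A browser client would consume such a stream with `EventSource` and update the progress bar on each `progress` event.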

Enterprise UX

Clean Interface: Apple/Google-inspired design system
Progress Tracking: Real-time extraction progress with detailed status
Batch Processing: Handle multiple sources efficiently
Export Options: JSON download and clipboard integration

🏗️ Architecture

Core Components

Ingestion Pipeline
- URL Parser with site-type detection
- Content Extractor with pluggable strategies
- PDF Processor with chapter detection
- Metadata Enrichment with AI fallbacks

Smart Extractor Module
- Generic extractors: newspaper3k, readability, trafilatura
- Specialized extractors: Substack, interviewing.io, technical blogs
- Fallback hierarchy for maximum reliability
- AI-powered content cleaning

Normalization Layer
- Standardized JSON output format
- Markdown content formatting
- Metadata extraction and validation
- Word count and content analysis
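The fallback hierarchy in the Smart Extractor Module can be pictured as an ordered list of strategies that are tried until one returns usable text. The following is a minimal sketch under that assumption; `extract_with_fallback`, the `Strategy` type, and the `min_chars` threshold are hypothetical names, and the strategies are plain callables rather than calls into newspaper3k, readability, or trafilatura.

```python
from typing import Callable, List, Optional, Tuple

# A strategy takes raw HTML and returns extracted text, or None when it finds nothing.
Strategy = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str, strategies: List[Tuple[str, Strategy]], min_chars: int = 200
) -> Tuple[str, str]:
    """Try each strategy in order and return (strategy_name, text) from the first success."""
    errors: List[str] = []
    for name, strategy in strategies:
        try:
            text = strategy(html)
        except Exception as exc:  # one failing extractor must not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        # Treat very short results as failures (assumed heuristic).
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
        errors.append(f"{name}: result too short")
    raise ValueError("all strategies failed: " + "; ".join(errors))
```

Returning the strategy name alongside the text makes it possible to log which extractor succeeded for each source.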

Tech Stack

| Component | Technology |
| --- | --- |
| Frontend | Next.js 15, React, TypeScript |
| UI Framework | Tailwind CSS, shadcn/ui |
| Backend | Next.js API Routes, Server Actions |
| Database | PostgreSQL with optimized indexes |
| Content Processing | Python scripts with PyMuPDF, BeautifulSoup |
| Deployment | Vercel, Railway, or Docker |

📋 API Reference

Extract URL Content

```http
POST /api/extract-url
Content-Type: application/json

{
  "url": "https://interviewing.io/blog",
  "team_id": "aline123",
  "user_id": "user_001"
}
```
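A client call to this endpoint could look like the sketch below, which builds the POST request with the Python standard library. The helper name `build_extract_url_request` and the `base_url` parameter are assumptions for illustration; only the path, header, and body fields come from the request shown above.

```python
import json
import urllib.request


def build_extract_url_request(
    base_url: str, url: str, team_id: str, user_id: str
) -> urllib.request.Request:
    """Build (but do not send) the POST request for /api/extract-url."""
    payload = {"url": url, "team_id": team_id, "user_id": user_id}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/extract-url",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request needs a running server (assumed here to be a local dev server):
# request = build_extract_url_request(
#     "http://localhost:3000", "https://interviewing.io/blog", "aline123", "user_001"
# )
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```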

Extract PDF Content

```http
POST /api/extract-pdf
Content-Type: multipart/form-data

file: [PDF file]
team_id: "aline123"
user_id: "user_001"
```
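For clients without a multipart helper, the upload body can be assembled by hand. This is a minimal sketch of `multipart/form-data` encoding using only the standard library; the function name `encode_multipart` is hypothetical, while the field names `file`, `team_id`, and `user_id` follow the request above.

```python
import uuid
from typing import Dict, Tuple


def encode_multipart(
    fields: Dict[str, str], filename: str, pdf_bytes: bytes
) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data PDF upload."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields such as team_id and user_id.
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # The PDF itself goes in the "file" field.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode("utf-8")
        + pdf_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned content type, including the boundary, must be sent as the request's `Content-Type` header so the server can split the parts.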

Response Format

```json
{
  "team_id": "aline123",
  "items": [
    {
      "title": "Advanced System Design Patterns",
      "content": "# Advanced System Design Patterns\n\nSystem design is...",
      "content_type": "blog",
      "source_url": "https://example.com/article",
      "author": "Sarah Chen",
      "user_id": "user_001",
      "word_count": 245,
      "extracted_at": "2024-01-15T10:30:00Z"
    }
  ],
  "total_items": 15,
  "processing_time": 8,
  "sources_processed": ["https://interviewing.io/blog"]
}
```
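To show how one entry in `items` might be produced, here is a hedged sketch of a normalization step that fills the fields from the response above. `KnowledgeItem` and `normalize_item` are hypothetical names, and the word count rule (whitespace-separated tokens, so Markdown markers such as `#` are counted) is an assumption rather than the project's definition.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class KnowledgeItem:
    title: str
    content: str
    content_type: str
    source_url: str
    author: Optional[str]
    user_id: str
    word_count: int
    extracted_at: str


def normalize_item(
    title: str,
    markdown: str,
    content_type: str,
    source_url: str,
    user_id: str,
    author: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build one item dict with a derived word count and a UTC timestamp."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    item = KnowledgeItem(
        title=title.strip(),
        content=markdown.strip(),
        content_type=content_type,
        source_url=source_url,
        author=author,
        user_id=user_id,
        word_count=len(markdown.split()),  # whitespace-separated tokens
        extracted_at=moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
    )
    return asdict(item)
```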

🛠️ Installation & Setup

Local Development

Clone and Install

```bash
git clone https://github.com/prajwalun/contextcrafter
cd contextcrafter
npm install
```

Run Database Scripts

```bash
# Create tables
psql $DATABASE_URL -f scripts/create-knowledge-base.sql

# Seed sample data
psql $DATABASE_URL -f scripts/seed-sample-data.sql
```

Start Development Server

```bash
npm run dev
```

Docker Deployment

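```dockerfile
FROM node:18-alpine
WORKDIR /app

# Install dependencies first so this layer is cached between builds.
# devDependencies are included because the build step typically needs them
# (for example TypeScript and Tailwind CSS).
COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

EXPOSE 3000
CMD ["npm", "start"]
```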

🎯 Use Cases

Technical Content Curation

Engineering Blogs: Automatically ingest company engineering blogs
Documentation: Extract and organize technical documentation
Research Papers: Process academic papers and technical reports

Knowledge Base Building

Customer Support: Build comprehensive FAQ and help systems
Training Materials: Organize educational content and courses
Company Wiki: Centralize institutional knowledge

Content Analysis

Competitive Intelligence: Monitor competitor content and insights
Trend Analysis: Track industry trends and emerging topics
Content Audit: Analyze existing content for gaps and opportunities

🔧 Customization

Adding New Extractors

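```python
from typing import List

# ContentExtractor and ExtractedContent are the project's own base classes.


class CustomExtractor(ContentExtractor):
    def can_extract(self, url: str) -> bool:
        # Claim only URLs that belong to the target site.
        return 'custom-site.com' in url.lower()

    def extract(self, url: str, html: str) -> List[ExtractedContent]:
        # Custom extraction logic
        return [ExtractedContent(...)]
```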
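To try the extractor pattern outside the project, the sketch below supplies stand-in base classes and a dispatcher that selects the first extractor whose `can_extract` accepts the URL. Everything here is an assumption for illustration: the fields of `ExtractedContent`, the abstract `ContentExtractor`, and the `pick_extractor` helper may differ from the project's real definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractedContent:  # hypothetical shape; the real class may carry more fields
    title: str
    content: str
    source_url: str


class ContentExtractor(ABC):  # hypothetical stand-in for the project's base class
    @abstractmethod
    def can_extract(self, url: str) -> bool:
        """Return True when this extractor handles the given URL."""

    @abstractmethod
    def extract(self, url: str, html: str) -> List[ExtractedContent]:
        """Return the content items found in the page."""


class CustomExtractor(ContentExtractor):
    def can_extract(self, url: str) -> bool:
        return "custom-site.com" in url.lower()

    def extract(self, url: str, html: str) -> List[ExtractedContent]:
        # Placeholder logic: treat the whole page as one untitled item.
        return [ExtractedContent(title="Untitled", content=html, source_url=url)]


def pick_extractor(url: str, extractors: List[ContentExtractor]) -> ContentExtractor:
    """Return the first registered extractor that claims the URL."""
    for extractor in extractors:
        if extractor.can_extract(url):
            return extractor
    raise LookupError(f"no extractor registered for {url}")
```

Because the first match wins, specialized extractors should be registered ahead of generic ones.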

Custom Content Types

```sql
ALTER TABLE knowledge_base_items
  DROP CONSTRAINT knowledge_base_items_content_type_check;

ALTER TABLE knowledge_base_items
  ADD CONSTRAINT knowledge_base_items_content_type_check
  CHECK (content_type IN (
    'blog', 'podcast_transcript', 'call_transcript', 'linkedin_post',
    'reddit_comment', 'book', 'interview_guide', 'documentation',
    'research_paper', 'other'
  ));
```

📊 Performance & Scaling

Optimization Features

Concurrent Processing: Parallel extraction for multiple sources
Caching Layer: Redis integration for frequently accessed content
Rate Limiting: Respectful crawling with configurable delays
Content Deduplication: Automatic detection of duplicate content
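One simple way to approach content deduplication is to fingerprint each item's text after light normalization and keep only the first item per fingerprint. The sketch below assumes that approach; it detects exact duplicates up to case and whitespace only, and the names `content_fingerprint` and `deduplicate` are hypothetical.

```python
import hashlib
import re
from typing import Dict, Iterable, List, Set


def content_fingerprint(text: str) -> str:
    """Hash text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(items: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first item for each distinct content fingerprint."""
    seen: Set[str] = set()
    unique: List[Dict[str, str]] = []
    for item in items:
        fingerprint = content_fingerprint(item["content"])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(item)
    return unique
```

Near-duplicates, such as the same article with a different footer, would need a similarity measure instead of an exact hash.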

Monitoring & Analytics

Processing Metrics: Track extraction success rates and performance
Content Analytics: Word counts, content types, source analysis
Error Tracking: Comprehensive logging and error reporting
Usage Statistics: Team and user activity monitoring

Code Standards

TypeScript for type safety
ESLint + Prettier for code formatting
Jest for testing
Conventional commits for git history
