A robust, scalable content ingestion system that transforms content sources such as blogs, websites, and PDFs into a structured knowledge base format. It is built for production use, with fallback extraction strategies for reliability and a clean, Apple/Google-inspired interface.
- Multi-source Support: Blogs, websites, PDFs, Substack, technical documentation
- Intelligent Fallbacks: Multiple extraction strategies with automatic failover
- Smart Content Detection: Automatically identifies content type and structure
- Chapter-aware PDF Processing: Intelligently splits books and documents
- Scalable Design: Built on Next.js 15 with modern React patterns
- Real-time Processing: Server-sent events for live progress updates
- Error Resilience: Comprehensive error handling and recovery
- Database Integration: PostgreSQL schema for knowledge base storage
- Clean Interface: Apple/Google-inspired design system
- Progress Tracking: Real-time extraction progress with detailed status
- Batch Processing: Handle multiple sources efficiently
- Export Options: JSON download and clipboard integration
- Ingestion Pipeline
  - URL Parser with site-type detection
  - Content Extractor with pluggable strategies
  - PDF Processor with chapter detection
  - Metadata Enrichment with AI fallbacks
- Smart Extractor Module
  - Generic extractors: newspaper3k, readability, trafilatura
  - Specialized extractors: Substack, interviewing.io, technical blogs
  - Fallback hierarchy for maximum reliability
  - AI-powered content cleaning
- Normalization Layer
  - Standardized JSON output format
  - Markdown content formatting
  - Metadata extraction and validation
  - Word count and content analysis
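
The fallback hierarchy listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `extract_with_fallbacks`, the `min_words` threshold, and the two stand-in strategy functions are hypothetical names, and the stand-ins take the place of real extractors such as trafilatura or readability.

```python
from typing import Callable, List, Optional, Tuple

# A strategy takes raw HTML and returns extracted text, or None if it finds nothing.
Strategy = Callable[[str], Optional[str]]


def extract_with_fallbacks(
    html: str,
    strategies: List[Tuple[str, Strategy]],
    min_words: int = 5,  # hypothetical threshold for a "usable" result
) -> Tuple[Optional[str], Optional[str]]:
    """Try each strategy in order and return (name, text) for the first usable result."""
    for name, strategy in strategies:
        try:
            text = strategy(html)
        except Exception:
            # A failing extractor should not abort the pipeline; try the next one.
            continue
        if text and len(text.split()) >= min_words:
            return name, text
    return None, None


# Hypothetical stand-ins for real extractors.
def specialized(html: str) -> Optional[str]:
    raise ValueError("page layout not recognized")


def generic(html: str) -> Optional[str]:
    return "System design is the practice of defining components and interfaces."


name, text = extract_with_fallbacks(
    "<html>...</html>", [("specialized", specialized), ("generic", generic)]
)
print(name)  # generic
```

Ordering the list from most specific to most generic means a site-specific extractor gets the first attempt, while a generic one still produces output when it fails or returns too little text.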
| Component | Technology |
|---|---|
| Frontend | Next.js 15, React, TypeScript |
| UI Framework | Tailwind CSS, shadcn/ui |
| Backend | Next.js API Routes, Server Actions |
| Database | PostgreSQL with optimized indexes |
| Content Processing | Python scripts with PyMuPDF, BeautifulSoup |
| Deployment | Vercel, Railway, or Docker |
```http
POST /api/extract-url
Content-Type: application/json

{
  "url": "https://interviewing.io/blog",
  "team_id": "aline123",
  "user_id": "user_001"
}
```
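
As an illustration of calling this endpoint from a script, the sketch below builds the same request with Python's standard library. The base URL is an assumption (a local server on port 3000, the port exposed in the Dockerfile), and `build_extract_url_request` is a hypothetical helper name, not part of the project.

```python
import json
import urllib.request

# Assumption: the app is reachable locally on port 3000.
BASE_URL = "http://localhost:3000"


def build_extract_url_request(url: str, team_id: str, user_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for /api/extract-url."""
    payload = json.dumps({"url": url, "team_id": team_id, "user_id": user_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/extract-url",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_url_request("https://interviewing.io/blog", "aline123", "user_001")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.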
```http
POST /api/extract-pdf
Content-Type: multipart/form-data

file: [PDF file]
team_id: "aline123"
user_id: "user_001"
```
```json
{
  "team_id": "aline123",
  "items": [
    {
      "title": "Advanced System Design Patterns",
      "content": "# Advanced System Design Patterns\n\nSystem design is...",
      "content_type": "blog",
      "source_url": "https://example.com/article",
      "author": "Sarah Chen",
      "user_id": "user_001",
      "word_count": 245,
      "extracted_at": "2024-01-15T10:30:00Z"
    }
  ],
  "total_items": 15,
  "processing_time": 8,
  "sources_processed": ["https://interviewing.io/blog"]
}
```
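
To show how extracted content might be normalized into this shape, here is a minimal sketch. The field names come from the example response; everything else is an assumption: `KnowledgeItem`, `make_item`, and `build_response` are hypothetical names, the word count is a plain whitespace split (so Markdown markers such as `#` are counted), and `total_items` is taken to be the number of returned items.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class KnowledgeItem:
    title: str
    content: str  # Markdown body
    content_type: str
    source_url: Optional[str]
    author: Optional[str]
    user_id: str
    word_count: int
    extracted_at: str  # ISO 8601 timestamp in UTC


def make_item(title: str, content: str, content_type: str,
              source_url: Optional[str], author: Optional[str], user_id: str) -> KnowledgeItem:
    """Fill in the derived fields (word count, timestamp) for one extracted document."""
    return KnowledgeItem(
        title=title.strip(),
        content=content,
        content_type=content_type,
        source_url=source_url,
        author=author,
        user_id=user_id,
        word_count=len(content.split()),  # simple whitespace-based count
        extracted_at=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    )


def build_response(team_id: str, items: List[KnowledgeItem],
                   sources: List[str], seconds: int) -> dict:
    """Assemble the top-level response object."""
    return {
        "team_id": team_id,
        "items": [asdict(item) for item in items],
        "total_items": len(items),
        "processing_time": seconds,
        "sources_processed": sources,
    }


item = make_item(
    " Advanced System Design Patterns ",
    "# Advanced System Design Patterns\n\nSystem design is...",
    "blog", "https://example.com/article", "Sarah Chen", "user_001",
)
response = build_response("aline123", [item], ["https://example.com/article"], 8)
print(json.dumps(response, indent=2))
```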
- Clone and Install
  ```bash
  git clone https://github.com/prajwalun/contextcrafter
  cd contextcrafter
  npm install
  ```
- Run Database Scripts
  ```bash
  psql "$DATABASE_URL" -f scripts/create-knowledge-base.sql
  psql "$DATABASE_URL" -f scripts/seed-sample-data.sql
  ```
- Start Development Server
  ```bash
  npm run dev
  ```
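```dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
# Install all dependencies: `npm run build` typically needs devDependencies
# (for example TypeScript), so a production-only install can break the build.
RUN npm ci
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]
```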
- Engineering Blogs: Automatically ingest company engineering blogs
- Documentation: Extract and organize technical documentation
- Research Papers: Process academic papers and technical reports
- Customer Support: Build comprehensive FAQ and help systems
- Training Materials: Organize educational content and courses
- Company Wiki: Centralize institutional knowledge
- Competitive Intelligence: Monitor competitor content and insights
- Trend Analysis: Track industry trends and emerging topics
- Content Audit: Analyze existing content for gaps and opportunities
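```python
from typing import List

# ContentExtractor and ExtractedContent are the project's own base classes.


class CustomExtractor(ContentExtractor):
    def can_extract(self, url: str) -> bool:
        # Claim only URLs that belong to the target site.
        return 'custom-site.com' in url.lower()

    def extract(self, url: str, html: str) -> List[ExtractedContent]:
        # Custom extraction logic
        return [ExtractedContent(...)]
```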
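
One way such an extractor could be selected at runtime is sketched below. The stub `ContentExtractor` and `ExtractedContent` definitions are assumptions made so the example runs on its own (the project's real classes may have different fields), and `GenericExtractor` and `pick_extractor` are hypothetical names.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractedContent:
    # Assumed fields; the project's real class may differ.
    title: str
    content: str
    source_url: str


class ContentExtractor(ABC):
    @abstractmethod
    def can_extract(self, url: str) -> bool: ...

    @abstractmethod
    def extract(self, url: str, html: str) -> List[ExtractedContent]: ...


class CustomExtractor(ContentExtractor):
    def can_extract(self, url: str) -> bool:
        return "custom-site.com" in url.lower()

    def extract(self, url: str, html: str) -> List[ExtractedContent]:
        return [ExtractedContent(title="Custom page", content=html, source_url=url)]


class GenericExtractor(ContentExtractor):
    """Catch-all extractor, registered last so specific extractors win."""

    def can_extract(self, url: str) -> bool:
        return True

    def extract(self, url: str, html: str) -> List[ExtractedContent]:
        return [ExtractedContent(title="Untitled", content=html, source_url=url)]


def pick_extractor(url: str, extractors: List[ContentExtractor]) -> Optional[ContentExtractor]:
    """Return the first registered extractor that claims the URL, or None."""
    return next((e for e in extractors if e.can_extract(url)), None)


registry = [CustomExtractor(), GenericExtractor()]
chosen = pick_extractor("https://Custom-Site.com/post/1", registry)
print(type(chosen).__name__)  # CustomExtractor
```

Because selection stops at the first match, registration order decides precedence: site-specific extractors go first and the catch-all goes last.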
```sql
ALTER TABLE knowledge_base_items
  DROP CONSTRAINT knowledge_base_items_content_type_check;

ALTER TABLE knowledge_base_items
  ADD CONSTRAINT knowledge_base_items_content_type_check
  CHECK (content_type IN (
    'blog', 'podcast_transcript', 'call_transcript', 'linkedin_post',
    'reddit_comment', 'book', 'interview_guide', 'documentation',
    'research_paper', 'other'
  ));
```
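
Application code may want to apply the same list before inserting rows, so that an unexpected type does not surface as a constraint violation. The sketch below is one possible approach, not the project's code: `ALLOWED_CONTENT_TYPES` and `normalize_content_type` are hypothetical names, and mapping unknown values to `'other'` is an assumed policy.

```python
# Mirrors the values in the CHECK constraint above; the two lists must be kept in sync.
ALLOWED_CONTENT_TYPES = frozenset({
    "blog", "podcast_transcript", "call_transcript", "linkedin_post",
    "reddit_comment", "book", "interview_guide", "documentation",
    "research_paper", "other",
})


def normalize_content_type(value: str) -> str:
    """Return a value the constraint accepts, falling back to 'other'."""
    candidate = value.strip().lower()
    return candidate if candidate in ALLOWED_CONTENT_TYPES else "other"


print(normalize_content_type(" Research_Paper "))  # research_paper
print(normalize_content_type("newsletter"))        # other
```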
- Concurrent Processing: Parallel extraction for multiple sources
- Caching Layer: Redis integration for frequently accessed content
- Rate Limiting: Respectful crawling with configurable delays
- Content Deduplication: Automatic detection of duplicate content
- Processing Metrics: Track extraction success rates and performance
- Content Analytics: Word counts, content types, source analysis
- Error Tracking: Comprehensive logging and error reporting
- Usage Statistics: Team and user activity monitoring
- TypeScript for type safety
- ESLint + Prettier for code formatting
- Jest for testing
- Conventional commits for git history
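
As an illustration of the content deduplication listed among the performance features above, the following sketch hashes normalized text and keeps the first item per hash. It is an assumption about one workable approach, not a description of how the project does it: `content_fingerprint` and `deduplicate` are hypothetical names, and this catches only copies that are identical after whitespace and case normalization, not near-duplicates.

```python
import hashlib
import re
from typing import Dict, Iterable, List


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(items: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first item for each distinct fingerprint, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        fingerprint = content_fingerprint(item["content"])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(item)
    return unique


items = [
    {"title": "A", "content": "System design is hard."},
    {"title": "A (mirror)", "content": "  system   design is HARD.\n"},
    {"title": "B", "content": "Caching layers reduce load."},
]
print([item["title"] for item in deduplicate(items)])  # ['A', 'B']
```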