@AayushSaini101 AayushSaini101 commented Aug 4, 2025

Description

This PR introduces the GitHub Pages connector as a new connector type in the Onyx platform. The GitHub Pages connector allows users to index and search content from GitHub Pages websites by connecting to GitHub repositories and processing their content.

New Feature: GitHub Pages Connector

The GitHub Pages connector provides the following capabilities:

Core Functionality:

  • Repository Integration: Connects to GitHub repositories via the GitHub API
  • Multi-format Support: Indexes HTML, Markdown, reStructuredText, AsciiDoc, and plain text files
  • Smart Filtering: Filters by file type, directory depth, and file size
  • Incremental Updates: Supports polling based on file modification dates
  • Rate Limiting: Handles GitHub API rate limits with exponential backoff
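The rate-limit handling mentioned above can be illustrated with a minimal sketch of exponential backoff. This is not the connector's code: `RateLimitedError` and `call_with_backoff` are hypothetical names, and the delays and jitter are illustrative assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical stand-in for a rate-limit error raised by an API client."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitedError with exponentially growing waits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Wait 1s, 2s, 4s, ... capped at max_delay, plus up to 10% jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

The `sleep` parameter is injectable so the retry schedule can be tested without actually waiting.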

Configuration Options:

  • Repository Owner: GitHub username or organization
  • Repository Name: Name of the repository containing GitHub Pages
  • Branch: Branch to scan (default: gh-pages)
  • Root Directory: Optional subdirectory to index
  • Max Files: Maximum number of files to index (default: 1000)
  • Max Depth: Maximum directory depth for crawling
  • Timeout: Request timeout in seconds
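As a rough illustration, the options above could be grouped into a single validated settings object. The class name `GitHubPagesConfig` and its field names are hypothetical. Only the `gh-pages` and `1000` defaults come from the description; the `max_depth` and `timeout_seconds` defaults are placeholder assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GitHubPagesConfig:
    """Hypothetical container for the connector's configuration options."""

    repo_owner: str
    repo_name: str
    branch: str = "gh-pages"             # default stated in the description
    root_directory: Optional[str] = None
    max_files: int = 1000                # default stated in the description
    max_depth: Optional[int] = None      # assumption: None means "no limit"
    timeout_seconds: float = 30.0        # assumption: placeholder value

    def __post_init__(self) -> None:
        # Reject obviously invalid settings early, before any API call is made
        if not self.repo_owner or not self.repo_name:
            raise ValueError("repo_owner and repo_name are required")
        if self.max_files <= 0:
            raise ValueError("max_files must be positive")
        if self.max_depth is not None and self.max_depth < 0:
            raise ValueError("max_depth must be non-negative")
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
```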

Supported File Types:

  • .html, .htm - HTML files (processed with BeautifulSoup)
  • .md, .markdown - Markdown files (converted to HTML then processed)
  • .txt - Plain text files
  • .rst - reStructuredText files
  • .asciidoc, .adoc - AsciiDoc files
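One way to express the extension list above is a lookup table from file suffix to content kind. This is a sketch only: `EXTENSION_KINDS` and `classify_file` are hypothetical names, and the connector may dispatch on file type differently.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extensions taken from the list above; the kind labels are illustrative
EXTENSION_KINDS = {
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".markdown": "markdown",
    ".txt": "text",
    ".rst": "rst",
    ".asciidoc": "asciidoc",
    ".adoc": "asciidoc",
}


def classify_file(path: str) -> Optional[str]:
    """Return the content kind for a repository path, or None if unsupported."""
    # Repository paths use forward slashes, so PurePosixPath is used on any OS
    return EXTENSION_KINDS.get(PurePosixPath(path).suffix.lower())
```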

fixes #2282
/claim #2282


Summary by cubic

Added a new GitHub Pages connector that lets users index and search content from GitHub Pages sites by connecting to GitHub repositories and processing their files. This addresses the requirements in issue #2282.

  • New Features
    • Supports indexing HTML, Markdown, reStructuredText, and text files from a specified repository and branch.
    • Allows filtering by file type, directory depth, and file size.
    • Handles incremental updates using file modification dates and manages GitHub API rate limits.
    • Includes configuration options for repository owner, name, branch, root directory, max files, max depth, and timeout.
    • Added UI and type support for the new connector in the web app.

@AayushSaini101 AayushSaini101 requested a review from a team as a code owner August 4, 2025 05:03

vercel bot commented Aug 4, 2025

@AayushSaini101 is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.

@AayushSaini101
Author

I will add a demo of the connector.

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR introduces a comprehensive GitHub Pages connector that enables indexing content from GitHub Pages websites via the GitHub API. The implementation follows established connector patterns by extending LoadConnector and PollConnector base classes, supporting both full indexing and incremental updates based on file modification dates.

The connector supports multiple file formats including HTML, Markdown, reStructuredText, and plain text files, with intelligent content processing using BeautifulSoup for HTML and markdown conversion. It includes robust GitHub API integration with rate limiting, exponential backoff, and proper authentication handling via GitHub access tokens.

Key architectural components include:

  • Backend connector implementation (GitHubPagesConnector) with comprehensive file filtering, URL building, and document creation
  • Frontend integration through connector configuration forms and credential management
  • Proper enum additions (ValidSources.GitHubPages, DocumentSource.GITHUB_PAGES) to register the new connector type
  • Test coverage with unit tests validating core functionality
  • Factory pattern integration to enable connector instantiation

The connector provides extensive configuration options including repository targeting (owner/name/branch), file filtering (max files, directory depth, file size limits), and polling intervals. The implementation reuses existing GitHub connector utilities for rate limiting while adding GitHub Pages-specific functionality for URL construction and content processing.

Confidence score: 1/5

  • This PR has critical configuration issues that will prevent the connector from working properly in production
  • Score reflects duplicate connector configurations, mismatched form parameters, and inconsistent document source mapping that could cause runtime failures
  • Pay close attention to web/src/lib/connectors/connectors.tsx which has duplicate github_pages entries with conflicting configurations, and the connector implementation which doesn't match the form configuration parameters

19 files reviewed, 9 comments

 shapely==2.0.6
 stripe==10.12.0
-urllib3==2.2.3
+urllib3==1.26.18

logic: Major version downgrade from urllib3 2.2.3 to 1.26.18 introduces compatibility risks. Verify this doesn't break existing HTTP functionality and document the specific dependency conflict that requires this downgrade.

Comment on lines +156 to +161
# Test with valid settings
try:
    github_pages_connector.validate_connector_settings()
except Exception:
    # This might fail in test environment, which is expected
    pass

logic: Overly broad exception handling masks potential issues. Consider testing specific validation scenarios or checking for expected exception types.
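A sketch of what a narrower test might look like, using only the standard library. `ConnectorValidationError` and `validate_settings` are hypothetical stand-ins, since the exception type actually raised by `validate_connector_settings()` is not shown in this thread.

```python
import unittest


class ConnectorValidationError(Exception):
    """Hypothetical exception type; the real connector may raise something else."""


def validate_settings(repo_owner: str, repo_name: str) -> None:
    """Hypothetical stand-in for validate_connector_settings()."""
    if not repo_owner:
        raise ConnectorValidationError("repo_owner is required")
    if not repo_name:
        raise ConnectorValidationError("repo_name is required")


class ValidateSettingsTest(unittest.TestCase):
    def test_valid_settings_do_not_raise(self) -> None:
        # No try/except here: any unexpected exception fails the test
        validate_settings("octocat", "docs")

    def test_missing_owner_raises_specific_error(self) -> None:
        # Assert both the exception type and that the message names the field
        with self.assertRaisesRegex(ConnectorValidationError, "repo_owner"):
            validate_settings("", "docs")
```

Asserting on a specific exception type means that an unrelated failure, such as a typo in the connector code, would surface as a test failure instead of being swallowed.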

Comment on lines +340 to +358
def _get_file_last_modified(
    self, file_info: GitHubPagesFileInfo
) -> Optional[datetime]:
    """Get the last modification date of a file via GitHub API."""
    try:
        repo = self.github_client.get_repo(f"{self.repo_owner}/{self.repo_name}")

        # Get the commits that modified this file
        commits = repo.get_commits(path=file_info.original_path, sha=self.branch)

        # Take the most recent commit
        if commits.totalCount > 0:
            latest_commit = commits[0]
            return latest_commit.commit.committer.date.replace(tzinfo=timezone.utc)

    except Exception as e:
        logger.debug(f"Couldn't get modification date for {file_info.path}: {e}")

    return None

logic: Making individual API calls to get commit history for each file could quickly exhaust GitHub rate limits and cause significant performance issues. Consider batching these requests or using the repository's commit history more efficiently.
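One possible shape for the batching suggested here: walk the branch's commit history once (for a poll, only the commits in the poll window) and build a path-to-date map, instead of querying the history per file. The sketch below works on a simplified, hypothetical `CommitRecord` type rather than real API objects. Whether this saves requests depends on how the commit file lists are fetched, which is an assumption left open here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CommitRecord:
    """Hypothetical, simplified view of a commit: its date and touched paths."""

    committed_at: datetime
    paths: List[str]


def latest_modified_by_path(commits: Iterable[CommitRecord]) -> Dict[str, datetime]:
    """Build a path -> latest commit date map in one pass over the history."""
    latest: Dict[str, datetime] = {}
    for commit in commits:
        for path in commit.paths:
            current = latest.get(path)
            # Keep the newest date seen, regardless of iteration order
            if current is None or commit.committed_at > current:
                latest[path] = commit.committed_at
    return latest
```

Each file's modification date then becomes a dictionary lookup, and files absent from the map were not touched in the scanned range.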

Comment on lines +201 to +204
if not file_path.startswith(self.root_directory + "/"):
    continue
# Adjust relative path for processing
file_path = file_path[len(self.root_directory) + 1 :]

logic: The root directory filtering logic assumes the path starts with root_directory + "/" but doesn't handle the case where a file might be exactly at the root directory level without a trailing slash.
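A minimal sketch of a more defensive version of this filter. `relative_to_root` is a hypothetical helper, and returning `None` for out-of-scope paths is an illustrative convention. It normalizes leading and trailing slashes and treats a path equal to the root itself as in scope.

```python
from typing import Optional


def relative_to_root(file_path: str, root_directory: Optional[str]) -> Optional[str]:
    """Return file_path relative to root_directory, or None if it lies outside."""
    # Normalize so "docs", "docs/" and "/docs/" all behave the same
    root = (root_directory or "").strip("/")
    path = file_path.lstrip("/")
    if not root:
        return path  # no root configured: every path is in scope
    if path == root:
        return ""  # the path is the root itself
    prefix = root + "/"
    if path.startswith(prefix):
        return path[len(prefix):]
    return None  # e.g. "docs-old/x.md" must not match root "docs"
```

Comparing against `root + "/"` rather than `root` alone keeps sibling directories that share a name prefix out of scope.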

github_pages: {
  icon: GithubIcon,
  displayName: "GitHub Pages",
  category: SourceCategory.Other,

style: Consider changing category to SourceCategory.CodeRepository to match the existing GitHub connector for better organization

AayushSaini101 and others added 4 commits August 4, 2025 11:23
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…connector.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@AayushSaini101 AayushSaini101 marked this pull request as draft August 4, 2025 06:01
@AayushSaini101
Author

Working on improvements based on the suggestions, thanks.

Successfully merging this pull request may close these issues.

Github Pages Connector