feat: add github_pages connector #5149
base: main
@AayushSaini101 is attempting to deploy a commit to the Danswer Team on Vercel. A member of the Team first needs to authorize it.
I will add a demo of the connector.
Greptile Summary
This PR introduces a comprehensive GitHub Pages connector that enables indexing content from GitHub Pages websites via the GitHub API. The implementation follows established connector patterns by extending LoadConnector and PollConnector base classes, supporting both full indexing and incremental updates based on file modification dates.
The connector supports multiple file formats including HTML, Markdown, reStructuredText, and plain text files, with intelligent content processing using BeautifulSoup for HTML and markdown conversion. It includes robust GitHub API integration with rate limiting, exponential backoff, and proper authentication handling via GitHub access tokens.
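The exponential backoff mentioned above could look roughly like the following. This is a minimal sketch, not the PR's code: `RateLimitedError` and `call_with_backoff` are hypothetical names, and the PR itself is described as reusing the existing GitHub connector's rate-limiting utilities.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical error raised when the API reports a rate limit."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitedError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the retry schedule can be checked in a test without actually waiting.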
Key architectural components include:
- Backend connector implementation (`GitHubPagesConnector`) with comprehensive file filtering, URL building, and document creation
- Frontend integration through connector configuration forms and credential management
- Proper enum additions (`ValidSources.GitHubPages`, `DocumentSource.GITHUB_PAGES`) to register the new connector type
- Test coverage with unit tests validating core functionality
- Factory pattern integration to enable connector instantiation
The connector provides extensive configuration options including repository targeting (owner/name/branch), file filtering (max files, directory depth, file size limits), and polling intervals. The implementation reuses existing GitHub connector utilities for rate limiting while adding GitHub Pages-specific functionality for URL construction and content processing.
Confidence score: 1/5
- This PR has critical configuration issues that will prevent the connector from working properly in production
- Score reflects duplicate connector configurations, mismatched form parameters, and inconsistent document source mapping that could cause runtime failures
- Pay close attention to `web/src/lib/connectors/connectors.tsx`, which has duplicate `github_pages` entries with conflicting configurations, and to the connector implementation, which doesn't match the form configuration parameters
19 files reviewed, 9 comments
shapely==2.0.6
stripe==10.12.0
urllib3==2.2.3
urllib3==1.26.18
logic: Major version downgrade from urllib3 2.2.3 to 1.26.18 introduces compatibility risks. Verify this doesn't break existing HTTP functionality and document the specific dependency conflict that requires this downgrade.
# Test with valid settings
try:
    github_pages_connector.validate_connector_settings()
except Exception:
    # This might fail in test environment, which is expected
    pass
logic: Overly broad exception handling masks potential issues. Consider testing specific validation scenarios or checking for expected exception types.
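One way to act on this suggestion is to catch only the exception type the validator is expected to raise and let everything else propagate. The sketch below is illustrative: `ConnectorValidationError`, `StubConnector`, and `validation_error_for` are hypothetical stand-ins, not names taken from the PR.

```python
from typing import Optional


class ConnectorValidationError(Exception):
    """Hypothetical stand-in for the project's validation error type."""


class StubConnector:
    """Hypothetical minimal connector used only to illustrate the test shape."""

    def __init__(self, repo_owner: str, repo_name: str) -> None:
        self.repo_owner = repo_owner
        self.repo_name = repo_name

    def validate_connector_settings(self) -> None:
        if not self.repo_owner or not self.repo_name:
            raise ConnectorValidationError("repo_owner and repo_name are required")


def validation_error_for(connector: StubConnector) -> Optional[ConnectorValidationError]:
    """Return the validation error raised, or None if the settings are valid.

    Only the expected error type is caught, so unrelated failures
    (typos, network errors, bugs) still fail the test loudly.
    """
    try:
        connector.validate_connector_settings()
    except ConnectorValidationError as exc:
        return exc
    return None
```

A test can then assert both directions explicitly: valid settings return `None`, and invalid settings return the specific error.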
backend/tests/daily/connectors/github_pages/test_github_pages_connector.py
def _get_file_last_modified(
    self, file_info: GitHubPagesFileInfo
) -> Optional[datetime]:
    """Get the last modification date of a file via GitHub API."""
    try:
        repo = self.github_client.get_repo(f"{self.repo_owner}/{self.repo_name}")

        # Get the commits that modified this file
        commits = repo.get_commits(path=file_info.original_path, sha=self.branch)

        # Take the most recent commit
        if commits.totalCount > 0:
            latest_commit = commits[0]
            return latest_commit.commit.committer.date.replace(tzinfo=timezone.utc)

    except Exception as e:
        logger.debug(f"Couldn't get modification date for {file_info.path}: {e}")

    return None
logic: Making individual API calls to get commit history for each file could quickly exhaust GitHub rate limits and cause significant performance issues. Consider batching these requests or using the repository's commit history more efficiently.
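A possible shape for the batched approach is to walk the commit history once and build a path-to-date index, instead of querying per file. The sketch below assumes the caller can already supply `(commit_time, changed_paths)` pairs from some cheap source (for example a local clone's log); `build_last_modified_index` and that input format are hypothetical and not part of the PR.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


def build_last_modified_index(
    commits: Iterable[Tuple[datetime, List[str]]],
) -> Dict[str, datetime]:
    """Map each file path to the time of the latest commit that touched it.

    `commits` is an assumed input: one (commit_time, changed_paths) pair
    per commit, in any order.
    """
    index: Dict[str, datetime] = {}
    for committed_at, paths in commits:
        for path in paths:
            current = index.get(path)
            # Keep the newest timestamp seen for this path.
            if current is None or committed_at > current:
                index[path] = committed_at
    return index


history = [
    (datetime(2024, 1, 1, tzinfo=timezone.utc), ["index.html", "docs/a.md"]),
    (datetime(2024, 3, 1, tzinfo=timezone.utc), ["docs/a.md"]),
]
index = build_last_modified_index(history)
```

Lookups during indexing then become dictionary reads. Whether this saves requests overall depends on how the changed-path lists are obtained, since fetching them per commit from the API has its own cost.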
if not file_path.startswith(self.root_directory + "/"):
    continue
# Adjust relative path for processing
file_path = file_path[len(self.root_directory) + 1 :]
logic: The root directory filtering logic assumes the path starts with `root_directory + "/"` but doesn't handle the case where a file might be exactly at the root directory level without a trailing slash.
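One reading of this comment is that the configured root should be normalized before the prefix check, so that values like `docs`, `docs/`, and an empty root all behave predictably. A minimal sketch under that assumption follows; `relative_to_root` is a hypothetical helper, not code from the PR.

```python
from typing import Optional


def relative_to_root(file_path: str, root_directory: str) -> Optional[str]:
    """Return file_path relative to root_directory, or None if it lies outside.

    Leading and trailing slashes on root_directory are ignored, and an
    empty root means the repository root, so every path is kept unchanged.
    """
    root = root_directory.strip("/")
    if not root:
        return file_path
    prefix = root + "/"
    if file_path.startswith(prefix):
        return file_path[len(prefix):]
    # Paths such as "docs-old/x" must not match root "docs".
    return None
```

Keeping the `"/"` in the prefix check preserves the original guard against sibling directories that merely share a name prefix.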
github_pages: {
  icon: GithubIcon,
  displayName: "GitHub Pages",
  category: SourceCategory.Other,
style: Consider changing category to `SourceCategory.CodeRepository` to match the existing GitHub connector for better organization
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…connector.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Working on the suggestions, thanks.
Description
This PR introduces the GitHub Pages connector as a new connector type in the Onyx platform. The GitHub Pages connector allows users to index and search content from GitHub Pages websites by connecting to GitHub repositories and processing their content.
New Feature: GitHub Pages Connector
The GitHub Pages connector provides the following capabilities:
Core Functionality:
Configuration Options:
Supported File Types:
fixes #2282
/claim #2282
Summary by cubic
Added a new GitHub Pages connector that lets users index and search content from GitHub Pages sites by connecting to GitHub repositories and processing their files. This addresses the requirements in issue #2282.