melmathari
Contributor

@melmathari melmathari commented Sep 9, 2025

GitHub Pages connector

Description

This PR introduces a new GitHub Pages connector and integrates it into both the backend and frontend of Onyx.

Test

  • ✅ Prettier applied on web files
  • ✅ Pre-commit hooks (black, reorder-python-imports, autoflake, ruff, prettier) all passed
  • ✅ mypy type checks passed on modified backend files

Demo

Watch the video

Related Issue / Claim

Closes #2282

Creating a GitHub PAT for the GitHub Pages connector

  1. Generate a fine-grained personal access token.
  2. Configure:
    • Token name: Onyx GitHub Pages
    • Expiration: No expiration (recommended for connectors)
    • Resource owner: user/org that owns the repo
    • Repository access: All repositories (or select specific repos)
  3. Permissions:
    • Contents → Read-only
    • Metadata → Read-only
  4. Copy and store the token securely.

Using the token in Onyx

  • In the GitHub Pages connector config, paste the PAT into the GitHub access token field.
  • Provide:
    • repo_owner (e.g. melmathari)
    • repo_name (e.g. GitHub-pages)
  • Save and validate the connector.
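To check a token before saving the connector, a request along these lines could be used. This is a minimal sketch, not the connector's own validation code: the helper name `build_repo_request` is hypothetical, and it assumes the public GitHub REST endpoint `GET /repos/{owner}/{repo}` with the standard bearer-token headers.

```python
import urllib.request

GITHUB_API = "https://api.github.com"


def build_repo_request(
    token: str, repo_owner: str, repo_name: str
) -> urllib.request.Request:
    """Build (but do not send) an authenticated request for the repository.

    Hypothetical helper for illustration; the connector in this PR may
    validate credentials differently.
    """
    if not token or not repo_owner or not repo_name:
        raise ValueError("token, repo_owner and repo_name are all required")
    return urllib.request.Request(
        f"{GITHUB_API}/repos/{repo_owner}/{repo_name}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )


req = build_repo_request("<your-token>", "melmathari", "GitHub-pages")
```

Sending it with `urllib.request.urlopen(req)` should return HTTP 200 if the token can read the repository; a 401 or 404 usually points to a wrong token or missing repository access.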

/claim #2282

  • This PR should be backported
  • [Optional] Override Linear Check

Summary by cubic

Adds a GitHub Pages connector that indexes HTML/Markdown from a repo’s Pages site via the GitHub API and exposes it as a load-state connector in the app. Implements the flow requested in Linear #2282.

  • New Features

    • Backend GitHub Pages connector with checkpointing, rate-limit handling, and credential validation
    • Supports gh-pages, configured Pages branch, or default branch; converts repo paths to Pages URLs
    • Parses HTML/Markdown using existing file processing utilities; includes title extraction and metadata
    • New enum, factory mapping, and Slack icon for DocumentSource.GITHUB_PAGES
  • Frontend

    • New connector config with fields: repo_owner, repo_name; advanced option: include_readme
    • Uses existing GitHub access token credential template
    • Added icon, source metadata, types, and inclusion in load-state and auto-sync sources
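The branch selection and path-to-URL conversion summarized above might look roughly like the following. This is a sketch under stated assumptions, not the code from this PR: the function names are hypothetical, the branch preference order (configured Pages branch, then `gh-pages`, then default) is a guess, custom domains are ignored, and Markdown files are assumed to be published as `.html`.

```python
import posixpath
from typing import Optional, Set


def choose_branch(
    available: Set[str], pages_branch: Optional[str], default_branch: str
) -> str:
    """Pick the first candidate branch that exists in the repository."""
    # Assumed preference order; the real connector may order these differently.
    for candidate in (pages_branch, "gh-pages", default_branch):
        if candidate and candidate in available:
            return candidate
    raise ValueError("no usable branch found")


def pages_url(owner: str, repo: str, path: str) -> str:
    """Map a repository file path to its likely github.io URL."""
    owner_l = owner.lower()
    base = f"https://{owner_l}.github.io/"
    # A repo named <owner>.github.io is served at the root;
    # any other repo is served under /<repo>/.
    if repo.lower() != f"{owner_l}.github.io":
        base += f"{repo}/"
    path = path.lstrip("/")
    stem, ext = posixpath.splitext(path)
    # Assumption: Markdown sources are rendered to .html pages.
    if ext.lower() in {".md", ".markdown"}:
        path = stem + ".html"
    # index.html is served at its directory URL.
    if path == "index.html":
        path = ""
    elif path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return base + path
```

For example, `pages_url("melmathari", "GitHub-pages", "docs/guide.md")` yields `https://melmathari.github.io/GitHub-pages/docs/guide.html`.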

@melmathari melmathari requested a review from a team as a code owner September 9, 2025 18:29
@algora-pbc algora-pbc bot mentioned this pull request Sep 9, 2025

vercel bot commented Sep 9, 2025

Someone is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR adds a comprehensive GitHub Pages connector to Onyx that enables indexing of GitHub Pages websites through GitHub's API rather than web scraping. The implementation follows established patterns by creating a new GithubPagesConnector class that extends both LoadConnector and CheckpointedConnector interfaces.
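To illustrate what checkpointing means for a connector like this, here is a minimal sketch of resumable batch processing. `PagesCheckpoint` and `process_batch` are hypothetical names and do not reflect the actual `CheckpointedConnector` interface in Onyx; the sketch only shows the general idea of recording how far indexing has progressed so a later run can resume.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PagesCheckpoint:
    """Hypothetical checkpoint: index of the next file to process."""

    next_index: int = 0
    has_more: bool = True


def process_batch(
    paths: List[str], checkpoint: PagesCheckpoint, batch_size: int = 2
) -> Tuple[List[str], PagesCheckpoint]:
    """Return the next batch of paths and the checkpoint to resume from."""
    start = checkpoint.next_index
    batch = paths[start : start + batch_size]
    next_index = start + len(batch)
    return batch, PagesCheckpoint(
        next_index=next_index, has_more=next_index < len(paths)
    )


# Usage: loop until the checkpoint reports no more work.
paths = ["index.html", "docs/guide.md", "about.md"]
cp = PagesCheckpoint()
seen: List[str] = []
while cp.has_more:
    batch, cp = process_batch(paths, cp)
    seen.extend(batch)
```

If a run is interrupted, persisting the last returned checkpoint lets the next run continue from `next_index` rather than starting over.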

Backend Changes:

  • New connector implementation (backend/onyx/connectors/github_pages/connector.py): The main connector class fetches source files directly from GitHub repositories using the GitHub API, processes various file types (HTML, Markdown, etc.), and creates documents with GitHub Pages-style URLs. It intelligently handles scenarios where GitHub Pages is enabled (discovering published URLs) and falls back to processing source files directly when it's not.
  • Constants and factory integration (backend/onyx/configs/constants.py, backend/onyx/connectors/factory.py): Added GITHUB_PAGES enum to DocumentSource and integrated the connector into the factory mapping system.
  • Slack integration (backend/onyx/onyxbot/slack/icons.py): Added icon mapping for the new source type to maintain consistency in Slack bot displays.
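The enum-plus-factory integration described above follows a common registry pattern, sketched below. Everything here is a simplified stand-in: the enum value `"github_pages"`, the `CONNECTOR_CLASSES` mapping, the `instantiate_connector` helper, and the default for `include_readme` are assumptions for illustration and are not taken from the Onyx codebase.

```python
from enum import Enum
from typing import Any, Dict, Type


class DocumentSource(str, Enum):
    # Stand-in for the real enum; the string value is an assumption.
    GITHUB_PAGES = "github_pages"


class GithubPagesConnector:
    """Stub with the config fields named in this PR."""

    def __init__(
        self, repo_owner: str, repo_name: str, include_readme: bool = True
    ) -> None:
        # The default for include_readme is assumed, not confirmed by the PR.
        self.repo_owner = repo_owner
        self.repo_name = repo_name
        self.include_readme = include_readme


# Hypothetical registry mapping each source to its connector class.
CONNECTOR_CLASSES: Dict[DocumentSource, Type[Any]] = {
    DocumentSource.GITHUB_PAGES: GithubPagesConnector,
}


def instantiate_connector(source: DocumentSource, **config: Any) -> Any:
    """Look up the connector class for a source and build it from config."""
    try:
        connector_cls = CONNECTOR_CLASSES[source]
    except KeyError:
        raise ValueError(f"no connector registered for {source!r}") from None
    return connector_cls(**config)


connector = instantiate_connector(
    DocumentSource.GITHUB_PAGES,
    repo_owner="melmathari",
    repo_name="GitHub-pages",
)
```

With a registry like this, adding a new source amounts to one enum member and one mapping entry, which matches the small footprint of the constants and factory changes listed above.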

Frontend Changes:

  • UI components: Added GithubPagesIcon component reusing the existing GitHub icon for visual consistency.
  • Type definitions (web/src/lib/types.ts): Added GitHubPages to ValidSources enum and included it in validAutoSyncSources for automatic synchronization support.
  • Source metadata (web/src/lib/sources.ts): Added GitHub Pages to the source metadata mapping under the CodeRepository category.
  • Connector configuration (web/src/lib/connectors/connectors.tsx): Implemented comprehensive form configuration with required fields for repository owner/name and optional README inclusion setting, along with TypeScript interface definition.
  • Credentials setup (web/src/lib/connectors/credentials.ts): Configured credential template reusing the existing GithubCredentialJson interface.

The connector addresses use cases where GitHub Pages sites are behind authentication or firewalls by accessing source files through the authenticated GitHub API. It includes proper error handling, rate limiting, checkpointing, and follows the established connector patterns throughout the Onyx codebase.
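As an illustration of the rate-limit handling mentioned here, the sketch below computes how long to wait from GitHub's documented response headers (`retry-after`, `x-ratelimit-remaining`, `x-ratelimit-reset`). The function name `seconds_until_retry` is hypothetical, and the connector in this PR may implement its backoff differently.

```python
from typing import Mapping


def seconds_until_retry(headers: Mapping[str, str], now: float) -> float:
    """Return how many seconds to wait before the next request.

    `now` is the current time in epoch seconds, passed in so the
    function stays deterministic and easy to test.
    """
    # Header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in headers.items()}
    # An explicit retry-after value takes priority.
    if "retry-after" in h:
        return max(0.0, float(h["retry-after"]))
    # When the quota is used up, wait until the reset time (epoch seconds).
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        return max(0.0, float(h["x-ratelimit-reset"]) - now)
    return 0.0
```

A caller would typically pass `time.time()` as `now` and sleep for the returned duration before retrying the request.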

Confidence score: 4/5

  • This PR is safe to merge with moderate confidence, requiring standard review attention
  • Score reflects comprehensive implementation following established patterns, though some error handling could be more specific
  • Pay closer attention to the main connector implementation file for error handling and the factory integration

10 files reviewed, no comments

melmathari and others added 2 commits September 9, 2025 20:33
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 10 files

@melmathari
Contributor Author

@Weves Open to feedback, and I appreciate you looking into this. I am not sure whether this PR covers all the requirements, so I might need some assistance.

Successfully merging this pull request may close these issues.

Github Pages Connector