-
Notifications
You must be signed in to change notification settings - Fork 2.1k
feat: adding support for markdown files for github #5057
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR adds support for indexing Markdown files (.md) to the GitHub connector, extending its capabilities beyond just pull requests and issues. The implementation follows the existing architectural pattern by introducing a new FILES_MD
stage in the connector's checkpoint-based processing pipeline.
The changes include:
- Frontend: Added a new checkbox option "Include Markdown files?" (
include_files_md
) to the GitHub connector configuration UI, following the same pattern as the existing PR and issue inclusion options - Backend: Implemented a new processing stage that recursively traverses repository contents using GitHub's content API to find and index .md files, with proper document conversion and metadata handling
- Testing: Updated both unit and integration tests to validate the new functionality, including mock file creation and proper assertions for the new document type
The feature integrates seamlessly into the existing workflow where repositories are processed in stages: Pull Requests → Issues → Markdown Files → Next Repository. This allows organizations to search through their documentation files (README.md, CONTRIBUTING.md, etc.) alongside code-related discussions, significantly expanding the searchable knowledge base from GitHub repositories.
Confidence score: 2/5
- This PR has significant implementation issues that could cause production problems, including potential infinite loops and performance bottlenecks
- The stage transition logic contains a critical bug that could prevent the connector from progressing properly through repositories
- The recursive directory traversal lacks pagination and could cause timeouts or memory issues with large repositories containing many markdown files
- Files that need more attention:
backend/onyx/connectors/github/connector.py
(lines 747-780 for stage transition logic, lines 525-547 for recursive traversal)
4 files reviewed, 4 comments
def _convert_file_to_document(file: ContentFile) -> Document: | ||
return Document( | ||
id=file.html_url, | ||
sections=[TextSection(link=file.html_url, text=file.content or "")], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: ContentFile.content may be base64 encoded for binary files or None for large files. Should decode content and handle these cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cubic analysis
1 issue found across 4 files • Review in cubic
React with 👍 or 👎 to teach cubic. You can also tag @cubic-dev-ai
to give feedback, ask questions, or re-run the review.
New field include_files_md is declared as required, but older connector records won’t contain it, so the type should be optional to avoid undefined values at runtime Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
logic: After processing FILES_MD, the stage is reset to PRS instead of moving to the next repository. This will cause an infinite loop Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
) | ||
assert len(docs) > 1 # We expect at least one PR and one Issue to exist | ||
|
||
def get_issue_doc(docs: list[Document]) -> Document | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need to add a docstring here as well
Description
The PR adds support to index markdown files for the Github connector.
Addresses https://linear.app/danswer/issue/DAN-2231/adding-md-support-for-github-connector
How Has This Been Tested?
[Describe the tests you ran to verify your changes]
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Added support for indexing Markdown (.md) files from GitHub repositories in the connector.