Skip to content

Conversation

trial-danswer
Copy link
Collaborator

@trial-danswer trial-danswer commented Jul 22, 2025

Description

The PR adds support to index markdown files for the Github connector.

Addresses https://linear.app/danswer/issue/DAN-2231/adding-md-support-for-github-connector

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

  • manual testing
  • working on adding the unit tests

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Added support for indexing Markdown (.md) files from GitHub repositories in the connector.

  • New Features
    • Option to include Markdown files when configuring the GitHub connector.
    • Markdown files are now fetched and indexed alongside issues and pull requests.

@trial-danswer trial-danswer requested a review from evan-onyx July 22, 2025 20:42
Copy link

vercel bot commented Jul 22, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 23, 2025 0:47am

@trial-danswer trial-danswer changed the title adding support for markdown files for github feat: adding support for markdown files for github Jul 22, 2025
@trial-danswer trial-danswer marked this pull request as ready for review July 22, 2025 22:58
@trial-danswer trial-danswer requested a review from a team as a code owner July 22, 2025 22:58
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR adds support for indexing Markdown files (.md) to the GitHub connector, extending its capabilities beyond just pull requests and issues. The implementation follows the existing architectural pattern by introducing a new FILES_MD stage in the connector's checkpoint-based processing pipeline.

The changes include:

  • Frontend: Added a new checkbox option "Include Markdown files?" (include_files_md) to the GitHub connector configuration UI, following the same pattern as the existing PR and issue inclusion options
  • Backend: Implemented a new processing stage that recursively traverses repository contents using GitHub's content API to find and index .md files, with proper document conversion and metadata handling
  • Testing: Updated both unit and integration tests to validate the new functionality, including mock file creation and proper assertions for the new document type

The feature integrates seamlessly into the existing workflow where repositories are processed in stages: Pull Requests → Issues → Markdown Files → Next Repository. This allows organizations to search through their documentation files (README.md, CONTRIBUTING.md, etc.) alongside code-related discussions, significantly expanding the searchable knowledge base from GitHub repositories.

Confidence score: 2/5

  • This PR has significant implementation issues that could cause production problems, including potential infinite loops and performance bottlenecks
  • The stage transition logic contains a critical bug that could prevent the connector from progressing properly through repositories
  • The recursive directory traversal lacks pagination and could cause timeouts or memory issues with large repositories containing many markdown files
  • Files that need more attention: backend/onyx/connectors/github/connector.py (lines 747-780 for stage transition logic, lines 525-547 for recursive traversal)

4 files reviewed, 4 comments

Edit Code Review Bot Settings | Greptile

def _convert_file_to_document(file: ContentFile) -> Document:
return Document(
id=file.html_url,
sections=[TextSection(link=file.html_url, text=file.content or "")],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: ContentFile.content may be base64 encoded for binary files or None for large files. Should decode content and handle these cases

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cubic analysis

1 issue found across 4 files • Review in cubic

React with 👍 or 👎 to teach cubic. You can also tag @cubic-dev-ai to give feedback, ask questions, or re-run the review.

trial-danswer and others added 2 commits July 22, 2025 16:05
New field include_files_md is declared as required, but older connector records won’t contain it, so the type should be optional to avoid undefined values at runtime

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
logic: After processing FILES_MD, the stage is reset to PRS instead of moving to the next repository. This will cause an infinite loop

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
)
assert len(docs) > 1 # We expect at least one PR and one Issue to exist

def get_issue_doc(docs: list[Document]) -> Document | None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might need to add a docstring here as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants