Orbital-Web
Contributor

Description

  • Clustering no longer loads every staging entity, relationship, etc. into memory before clustering them
  • Clustering now works in batches, for memory efficiency and crash robustness
  • Fixed an issue where a parent-child relationship was not created if a staging entity's parent was in the normalized table but not the staging table
  • Made parent-child relationship generation parallel
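
As a rough illustration of the batching idea above, here is a minimal sketch. It is not the code from this PR: `batched`, `cluster_in_batches`, and the `transfer_batch` callback are hypothetical names, and the callback stands in for whatever clusters and commits one batch.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls only the next batch, so the full input is never materialized.
    while batch := list(islice(iterator, batch_size)):
        yield batch


def cluster_in_batches(
    staging_ids: Iterable[str],
    transfer_batch: Callable[[list[str]], None],
    batch_size: int = 100,
) -> int:
    """Process staging IDs one batch at a time; return how many were handled."""
    processed = 0
    for batch in batched(staging_ids, batch_size):
        # Hypothetical callback: cluster this batch and commit it, so a crash
        # costs at most the batch that was in flight.
        transfer_batch(batch)
        processed += len(batch)
    return processed
```

Because each batch is handed off before the next is read, peak memory depends on `batch_size` rather than on the total number of staging rows.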

How Has This Been Tested?

Locally, on GitHub, Jira, Linear, and Salesforce, both incrementally and in one go.
Checked Vespa to ensure the right information is there.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@Orbital-Web Orbital-Web requested a review from a team as a code owner June 8, 2025 17:59
vercel bot commented Jun 8, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jun 8, 2025 7:54pm

Contributor

@greptile-apps greptile-apps bot left a comment

PR Summary

Major refactor of Knowledge Graph clustering to improve memory efficiency and crash resilience through batch processing and parallelization.

  • Introduces batch processing in backend/onyx/kg/clustering/clustering.py so that not all entities are loaded into memory at once
  • Adds a parent_key column in backend/onyx/db/models.py for explicit tracking of parent-child relationships
  • Parallelizes parent-child relationship generation in backend/onyx/kg/clustering/clustering.py for better performance
  • Changes relationship handling in backend/onyx/db/relationships.py to work directly with the normalized tables instead of the staging tables
  • Fixes an edge case in parent-child relationship creation where the parent exists in the normalized table but not the staging table
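
The parent-child edge case and the parallel generation can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the function names are hypothetical, the two tables are modeled as plain sets of keys, and a thread pool stands in for whatever parallelism the real code uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def parent_child_edge(
    child_key: str,
    parent_key: Optional[str],
    staging_keys: set[str],
    normalized_keys: set[str],
) -> Optional[tuple[str, str]]:
    """Return (parent, child) if the parent is known in either table."""
    if parent_key is None:
        return None
    # The edge case: the parent may exist only in the normalized table.
    if parent_key in staging_keys or parent_key in normalized_keys:
        return (parent_key, child_key)
    return None


def build_parent_child_edges(
    children: dict[str, Optional[str]],
    staging_keys: set[str],
    normalized_keys: set[str],
    max_workers: int = 4,
) -> list[tuple[str, str]]:
    """Resolve every child's parent in parallel and keep the valid edges."""

    def resolve(item: tuple[str, Optional[str]]) -> Optional[tuple[str, str]]:
        child_key, parent_key = item
        return parent_child_edge(child_key, parent_key, staging_keys, normalized_keys)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the children.
        edges = pool.map(resolve, children.items())
        return [edge for edge in edges if edge is not None]
```

Checking only `staging_keys` would drop the edge for a child whose parent was already normalized in an earlier run, which is the situation the last bullet describes.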

5 file(s) reviewed, 2 comment(s)

Contributor

@joachim-danswer joachim-danswer left a comment

Tested locally

@Orbital-Web Orbital-Web added this pull request to the merge queue Jun 8, 2025
Merged via the queue into main with commit 2b812b7 Jun 8, 2025
11 checks passed
@Orbital-Web Orbital-Web deleted the kg-batch-cluster branch June 8, 2025 22:13
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025
* super genius kg_entity parent migration

* feat: batched clustering

* fix: nit