@evan-onyx evan-onyx commented Apr 5, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-1684/handle-date-pills-and-other-similar-things-in-google-docs

Index smart chips in Google Docs. We know of three ways to get content from Google Docs; here are the pros and cons of each:
a) Docs advanced API v1
Pros: allows structured retrieval, e.g. headings can be extracted
Cons: misses most smart chips (dates, timers, calendar events, etc.), though it DOES extract people and doc links
b) Drive file retrieval
Pros: gets ALL smart chips and all text content
Cons: no structure
c) Apps Script
Pros: structured retrieval, DOES get dates
Cons: misses all other smart chips and requires users to enable a number of extra scopes
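
For readers unfamiliar with the two API-based approaches, here is a minimal sketch of how (a) and (b) are typically called through the Google API Python client. It is not the code in this PR: the function names are hypothetical, the service objects are assumed to be built elsewhere (e.g. with `googleapiclient.discovery.build("docs", "v1", ...)` and `build("drive", "v3", ...)`), and using a `text/plain` export for (b) is an assumption about what "Drive file retrieval" means here.

```python
from typing import Any


def fetch_structured_doc(docs_service: Any, doc_id: str) -> dict:
    """Approach (a): the Docs API v1 returns the document as a JSON tree.

    The tree carries structure (headings, tables), but per the description
    above it does not include most smart chips.
    """
    return docs_service.documents().get(documentId=doc_id).execute()


def fetch_plain_text(drive_service: Any, file_id: str) -> str:
    """Approach (b): export the file through the Drive API as plain text.

    Assumption: a text/plain export is used. The result is flat text with
    no heading or table structure.
    """
    raw = (
        drive_service.files()
        .export(fileId=file_id, mimeType="text/plain")
        .execute()
    )
    # The export normally arrives as bytes; "utf-8-sig" also tolerates a
    # leading byte-order mark if one is present.
    return raw.decode("utf-8-sig") if isinstance(raw, bytes) else raw
```

Passing the service objects in as parameters keeps the sketch independent of how credentials and scopes are set up.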

This PR fixes two existing issues with the advanced retrieval (a): it skipped the first section and did not extract tables. It also detects when a doc contains smart chips and, if so, uses (b) to get the full file content, then makes a best-effort merge with the section information from (a), producing full-content docs with reasonable section structure.
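
The best-effort combination described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the implementation in `doc_conversion.py`: the `Section` type and `align_sections` function are hypothetical, and the strategy shown (find each heading from (a) in the flat text from (b), then slice the flat text between headings) is only one plausible way to do the alignment.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class Section:
    heading: str  # heading text from the structured pass (a); "" if none
    text: str     # section body


def align_sections(sections: list[Section], full_text: str) -> list[Section]:
    """Re-slice full_text (from b) using the headings found by (a).

    Best effort: a heading that cannot be located is skipped with a
    warning, and its content stays inside the preceding section. A heading
    string that also occurs earlier in body text can be mis-anchored.
    """
    anchors: list[tuple[int, int, int]] = []  # (section index, start, end)
    cursor = 0
    for i, sec in enumerate(sections):
        heading = sec.heading.strip()
        # Search forward only, so headings are matched in document order.
        pos = full_text.find(heading, cursor) if heading else -1
        if pos == -1:
            logger.warning("Could not align heading %r with full text", heading)
            continue
        anchors.append((i, pos, pos + len(heading)))
        cursor = pos + len(heading)

    if not anchors:
        # No structure could be recovered; keep the full content in one piece.
        return [Section(heading="", text=full_text.strip())]

    result: list[Section] = []
    preamble = full_text[: anchors[0][1]].strip()
    if preamble:
        # Content before the first heading becomes its own section.
        result.append(Section(heading="", text=preamble))
    for n, (i, _start, end) in enumerate(anchors):
        next_start = anchors[n + 1][1] if n + 1 < len(anchors) else len(full_text)
        body = full_text[end:next_start].strip()
        result.append(Section(heading=sections[i].heading.strip(), text=body))
    return result
```

For example, with full text `"Intro line\nPlan\nDue Apr 7, 2025\nNotes\nAsk Sam\n"` and structured headings `Plan` and `Notes`, the date that the structured pass missed ends up in the `Plan` section, and the text before the first heading is kept as an untitled leading section.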

We switched away from (c) upon realizing that (b) had more information than previously thought, but the code from that approach is available in the drive-pill-indexing branch.

How Has This Been Tested?

Tested in the UI. An integration test should be added at some point.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@evan-onyx evan-onyx requested a review from a team as a code owner April 5, 2025 00:04
vercel bot commented Apr 5, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Apr 7, 2025 9:02pm

@evan-onyx evan-onyx changed the title from "Drive pill indexing2" to "Drive smart chip indexing" Apr 5, 2025
@greptile-apps greptile-apps bot left a comment

PR Summary

The PR enhances the Google Drive connector's extraction and indexing of smart chips by integrating multiple retrieval methods and refining section alignment.

  • /backend/onyx/connectors/google_drive/appsscript.json: New configuration enabling the advanced Docs service in a V8 environment.
  • /backend/onyx/chat/process_message.py: Streamlined deduplication logic using a clearer conditional expression.
  • /backend/onyx/connectors/google_drive/section_extraction.py: Improved tab handling and structured processing of headings and tables.
  • /backend/onyx/connectors/google_drive/doc_conversion.py: Introduced best-effort alignment between basic and advanced extraction with warnings for misalignments.
  • /backend/onyx/connectors/google_drive/smart_chip_retrieval.gs: New Apps Script for smart chip extraction with potential Map usage concerns.

6 file(s) reviewed, 1 comment(s)

@Weves Weves added this pull request to the merge queue Apr 7, 2025
Merged via the queue into main with commit 9c73099 Apr 7, 2025
11 checks passed
@evan-onyx evan-onyx deleted the drive-pill-indexing2 branch April 7, 2025 23:35
aronszanto pushed a commit to aronszanto/onyx that referenced this pull request Apr 26, 2025
* WIP

* WIP almost done, but realized we can just do basic retrieval

* rebased and added scripts

* improved approach to extracting smart chips

* remove files from previous branch

* fix connector tests

* fix test
AnkitTukatek pushed a commit to TukaTek/onyx that referenced this pull request Sep 23, 2025