feat: add remove_by_selector feature to filter unnecessary elements when scraping #5131
@razvanMiu is attempting to deploy a commit to the Danswer Team on Vercel. A member of the Team first needs to authorize it.
Greptile Summary
This PR adds a remove_by_selector feature to the web connector that filters out unwanted HTML elements during web scraping using CSS selectors. The implementation spans three files:
Frontend Changes: The web connector configuration (connectors.tsx) gains a new optional list field in the advanced settings that lets users specify CSS selectors for elements to exclude during scraping.
Backend Integration: The WebConnector (connector.py) now accepts a remove_by_selector parameter as a list of CSS selectors and integrates it with the HTML processing pipeline. The filtering occurs after BeautifulSoup parsing but before the existing HTML cleanup routines.
Core Implementation: A new remove_by_selector function in html_utils.py handles the actual element removal. This function supports two configuration sources: explicit selectors passed from the connector configuration, and selectors specified in <meta name="remove_by_selector"> tags within the scraped HTML itself. The function processes comma-separated selectors and uses BeautifulSoup's decompose() method to permanently remove matching elements.
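The removal step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function body, the META_NAME constant, the return value, and the sample HTML are all assumptions made for the example. Only the remove_by_selector name, the meta tag name, the comma-separated format, and the use of decompose() come from the summary.

```python
from bs4 import BeautifulSoup

# Name of the meta tag, as given in the PR description.
META_NAME = "remove_by_selector"


def remove_by_selector(soup, selectors=None):
    """Remove elements matched by explicit selectors or by meta-tag selectors.

    Hypothetical sketch: returns the number of removed elements, which the
    real function may not do.
    """
    # Start with the selectors passed in from the connector configuration.
    collected = list(selectors or [])

    # Add selectors declared in the page itself, split on commas.
    for meta in soup.find_all("meta", attrs={"name": META_NAME}):
        content = meta.get("content") or ""
        collected.extend(content.split(","))

    removed = 0
    for selector in collected:
        selector = selector.strip()
        if not selector:
            continue  # ignore empty entries such as a trailing comma
        for element in soup.select(selector):
            element.decompose()  # permanently drops the element from the tree
            removed += 1
    return removed


html = (
    '<html><head><meta name="remove_by_selector" content="#header, #footer">'
    '</head><body><div id="header">H</div><nav class="menu">N</nav>'
    '<p>Body</p><div id="footer">F</div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
removed = remove_by_selector(soup, [".menu"])
```

Here the connector supplies `.menu` and the page supplies `#header` and `#footer`, so all three elements are removed and only the paragraph text remains.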
This feature addresses a common web scraping need where certain page elements (headers, footers, navigation menus, advertisements) should be excluded from indexed content to improve document quality. The implementation fits naturally into the existing HTML processing pipeline and follows established patterns in the codebase for connector configuration.
PR Description Notes:
- Minor typo: "containe" should be "contain"
Confidence score: 3/5
- This PR introduces useful functionality but has several implementation issues that could cause problems in production
- The core logic is sound, but it lacks proper error handling for invalid CSS selectors, has variable naming inconsistencies, and is missing input validation
- Files needing attention: backend/onyx/file_processing/html_utils.py for error handling and validation, backend/onyx/connectors/web/connector.py for parameter consistency
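One way the missing error handling for invalid selectors could look is sketched below. It is an assumption about a possible fix, not something proposed in the PR: the helper name remove_valid_selectors and its return value are hypothetical. It relies on BeautifulSoup's select() delegating to the soupsieve package, which raises SelectorSyntaxError for malformed selectors.

```python
import logging

from bs4 import BeautifulSoup
from soupsieve import SelectorSyntaxError

logger = logging.getLogger(__name__)


def remove_valid_selectors(soup, selectors):
    """Remove matches for each valid selector; return the selectors skipped.

    Hypothetical helper: a malformed selector is logged and skipped instead
    of aborting the whole scrape.
    """
    skipped = []
    for raw in selectors:
        selector = raw.strip()
        if not selector:
            continue  # blank entries are not treated as errors
        try:
            matches = soup.select(selector)
        except SelectorSyntaxError:
            logger.warning("Skipping invalid CSS selector: %r", selector)
            skipped.append(selector)
            continue
        for element in matches:
            element.decompose()
    return skipped


soup2 = BeautifulSoup('<div id="a">A</div><p>keep</p>', "html.parser")
skipped = remove_valid_selectors(soup2, ["div[", "#a", "  "])
```

Skipping rather than raising keeps one bad entry in the connector settings, or in a page's meta tag, from failing the indexing run, at the cost of that selector silently having no effect unless the log is checked.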
3 files reviewed, 4 comments
Description
This PR enables a new field on the web connector, remove_by_selector, a list of selectors that can be used to filter unnecessary elements from the web page when scraping. It also allows adding the remove_by_selector as a meta tag in the page itself using something like this:
<meta name="remove_by_selector" content="#header,#footer" />
Note: the remove_by_selector added as meta tag can containe multiple selectors separated by comma.
How Has This Been Tested?
By adding a CSS selector to the remove_by_selector field of a web connector and checking the indexed documents.
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.
Summary by cubic
Added a remove_by_selector option to let users exclude specific HTML elements from web scraping using CSS selectors, either via connector settings or a meta tag in the page.