Conversation

@razvanMiu razvanMiu commented Jul 30, 2025

Description

This PR enables a new field on the web connector, remove_by_selector, a list of selectors that can be used to filter unnecessary elements from the web page when scraping. It also allows adding the remove_by_selector as a meta tag in the page itself using something like this: <meta name="remove_by_selector" content="#header,#footer" />.
Note: the remove_by_selector added as meta tag can containe multiple selectors separated by comma.
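
As an illustration of the meta tag format described above, here is a minimal sketch that reads the comma-separated selectors out of a page using only the Python standard library. The names `MetaSelectorParser` and `selectors_from_meta` are hypothetical and are not taken from this PR, which does the actual parsing with BeautifulSoup.

```python
from html.parser import HTMLParser


class MetaSelectorParser(HTMLParser):
    """Collects selectors from <meta name="remove_by_selector"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.selectors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("name") != "remove_by_selector":
            return
        content = attr_map.get("content") or ""
        # Split on commas and drop empty entries such as a trailing comma.
        self.selectors.extend(
            part.strip() for part in content.split(",") if part.strip()
        )


def selectors_from_meta(html: str) -> list:
    """Return every selector declared in the page's meta tags."""
    parser = MetaSelectorParser()
    parser.feed(html)
    return parser.selectors


page = (
    '<html><head>'
    '<meta name="remove_by_selector" content="#header, #footer" />'
    '</head></html>'
)
print(selectors_from_meta(page))  # ['#header', '#footer']
```

One caveat of a plain comma split: a selector that itself contains a comma, such as `:is(nav, aside)`, would be broken into two pieces.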

How Has This Been Tested?

By adding a CSS selector to the remove_by_selector field of a web connector and checking the indexed documents.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

Summary by cubic

Added a remove_by_selector option to let users exclude specific HTML elements from web scraping using CSS selectors, either via connector settings or a meta tag in the page.

  • New Features
    • Supports a list of CSS selectors to remove unwanted elements before scraping.
    • Allows selectors to be set in the connector or as a meta tag in the HTML.

@razvanMiu razvanMiu requested a review from a team as a code owner July 30, 2025 13:37

vercel bot commented Jul 30, 2025

@razvanMiu is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR adds a remove_by_selector feature to the web connector that enables filtering out unwanted HTML elements during web scraping using CSS selectors. The implementation spans three files:

Frontend Changes: The web connector configuration (connectors.tsx) gains a new optional list field in the advanced settings that allows users to specify CSS selectors for elements to exclude during scraping.

Backend Integration: The WebConnector (connector.py) now accepts a remove_by_selector parameter as a list of CSS selectors and integrates this with the HTML processing pipeline. The filtering occurs after BeautifulSoup parsing but before existing HTML cleanup routines.

Core Implementation: A new remove_by_selector function in html_utils.py handles the actual element removal. This function supports dual configuration approaches: explicit selectors passed from the connector configuration and selectors specified in <meta name="remove_by_selector"> tags within the scraped HTML itself. The function processes comma-separated selectors and uses BeautifulSoup's decompose() method to permanently remove matching elements.
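
The behavior described here can be sketched roughly as follows. This is not the code from html_utils.py: the function name `remove_by_selector_sketch` and its signature are assumptions made for illustration, and only the general approach (merge configured and meta tag selectors, then call `decompose()` on each match) comes from the summary.

```python
from typing import List, Optional

from bs4 import BeautifulSoup


def remove_by_selector_sketch(
    soup: BeautifulSoup, selectors: Optional[List[str]] = None
) -> None:
    """Remove every element matching the configured or meta tag selectors."""
    combined = list(selectors or [])

    # Selectors declared by the page itself, comma-separated in `content`.
    for meta in soup.find_all("meta", attrs={"name": "remove_by_selector"}):
        content = meta.get("content") or ""
        combined.extend(
            part.strip() for part in content.split(",") if part.strip()
        )

    for selector in combined:
        for element in soup.select(selector):
            # decompose() detaches the element and destroys it and its children.
            element.decompose()


html = """
<html><head><meta name="remove_by_selector" content="#footer" /></head>
<body><div id="header">Menu</div><p>Body text</p><div id="footer">Legal</div></body></html>
"""
soup = BeautifulSoup(html, "html.parser")
remove_by_selector_sketch(soup, ["#header"])
print(soup.get_text(strip=True))
```

In this example the header is removed through the explicit list and the footer through the meta tag, so only the paragraph text should remain.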

This feature addresses a common web scraping need where certain page elements (headers, footers, navigation menus, advertisements) should be excluded from indexed content to improve document quality. The implementation fits naturally into the existing HTML processing pipeline and follows established patterns in the codebase for connector configuration.

PR Description Notes:

  • Minor typo: "containe" should be "contain"

Confidence score: 3/5

  • This PR introduces useful functionality but has several implementation issues that could cause problems in production
  • The core logic is sound, but it lacks proper error handling for invalid CSS selectors, has variable naming inconsistencies, and is missing input validation
  • Files needing attention: backend/onyx/file_processing/html_utils.py for error handling and validation, backend/onyx/connectors/web/connector.py for parameter consistency
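
One possible way to address the error handling and validation concerns is sketched below. It is an assumption about what a fix could look like, not a change made in this PR: `remove_valid_selectors` is a hypothetical helper that skips blank entries and catches the `SelectorSyntaxError` that soupsieve (the selector engine behind BeautifulSoup's `select()`) raises for malformed selectors.

```python
import logging
from typing import List

from bs4 import BeautifulSoup
from soupsieve import SelectorSyntaxError

logger = logging.getLogger(__name__)


def remove_valid_selectors(soup: BeautifulSoup, selectors: List[str]) -> List[str]:
    """Apply each selector independently and return the ones that were rejected."""
    rejected = []
    for raw in selectors:
        selector = raw.strip() if isinstance(raw, str) else ""
        if not selector:
            continue  # ignore blank or non-string entries
        try:
            matches = soup.select(selector)
        except SelectorSyntaxError:
            # One bad selector should not abort scraping of the whole page.
            logger.warning("Skipping invalid CSS selector: %r", selector)
            rejected.append(selector)
            continue
        for element in matches:
            element.decompose()
    return rejected


soup = BeautifulSoup('<div id="ad">Ad</div><p>Content</p>', "html.parser")
rejected = remove_valid_selectors(soup, ["div[", "  ", "#ad"])
print(rejected)
```

Returning the rejected selectors lets the caller decide whether to surface them to the user or only log them.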

3 files reviewed, 4 comments

@razvanMiu razvanMiu changed the title from "Add remove_by_selector feature to filter unnecessary elements when scraping" to "feat: add remove_by_selector feature to filter unnecessary elements when scraping" Sep 22, 2025