Better handle duplicates, trailing slashes, and http/https #392

@maxachis

Description

Currently, we have four variants of https://www.alacourt.gov/:

  • One which came from a Google source collector and has "https" and a trailing slash (ID 3470)
  • The root URL for that, which simply removes the trailing slash (ID 14645)
  • One which was manually added from a separate collector and has "http" and a trailing slash (ID 4217)
  • The root URL for that, which simply removes the trailing slash (ID 13719)

These are not the only examples -- using the ChatGPT-provided query below, I identified 893 URLs that meet these criteria.

Such URLs add clutter and slow down meaningful data labeling, because we effectively have to label the same entity four times.

The challenge is this:

  • Very occasionally, a URL with a trailing slash yields different content from the same URL without one
  • Some sites allow http but not https, or vice versa.

Fortunately, we already have some means of identifying duplicates: when we extract HTML content from a web page, we hash the compressed HTML. Combined with testing http/https and with/without a trailing slash, that hash can tell us whether a variant of a URL is a duplicate of another variant, or an invalid one.

(We can't rely on hashes alone to identify duplicates, unfortunately: in some cases hashes match only because our current extractor doesn't load JavaScript or other content that meaningfully changes the page.)
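
For illustration, a minimal sketch of that hashing step in Python -- assuming zlib for the compression and SHA-256 for the digest, since the extractor's actual primitives aren't specified here:

import hashlib
import zlib

def content_hash(html: str) -> str:
    """Hash the compressed HTML so two pages can be compared cheaply.

    Hypothetical helper: the real pipeline's compression and hash
    functions may differ, but equality semantics are the same.
    """
    compressed = zlib.compress(html.encode("utf-8"))
    return hashlib.sha256(compressed).hexdigest()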

So we have a few things we probably want to do:

  • For any URL that uses http, test whether the extracted HTML has the same hash as the https variant, and switch the URL to https if so
  • For any two URLs that differ only by a trailing slash, run the same test on both variants, and keep the version without the trailing slash if they're identical (see the sketch after this list)
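
A rough sketch of both tests, assuming a simple requests-based fetcher and the hypothetical content_hash() helper above. The real version would live in the Duplicate Detection task and reuse our existing extractor:

import requests

def fetch_html(url: str) -> str | None:
    """Fetch a page, returning None if the variant doesn't resolve
    (e.g., a site that rejects https)."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None

def prefer_https(url: str) -> str:
    """Switch http -> https when both variants hash identically."""
    if not url.startswith("http://"):
        return url
    https_url = "https://" + url[len("http://"):]
    a, b = fetch_html(url), fetch_html(https_url)
    if a is not None and b is not None and content_hash(a) == content_hash(b):
        return https_url
    return url

def prefer_bare(url: str) -> str:
    """Drop the trailing slash when both variants hash identically."""
    if not url.endswith("/") or url.endswith("://"):
        return url
    bare = url[:-1]
    a, b = fetch_html(url), fetch_html(bare)
    if a is not None and b is not None and content_hash(a) == content_hash(b):
        return bare
    return url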

Note that this won't catch every such duplicate: there are likely cases where a page's content changes even between requests made milliseconds apart. But it would help us remove a lot of this glut.
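
One possible hedge against that instability (a suggestion, not existing behavior): fetch the same URL twice before comparing variants, and skip any URL whose own content isn't stable between requests. Using the hypothetical helpers above:

def is_stable(url: str) -> bool:
    """Fetch twice; if a page's own hash differs between requests,
    variant comparison would be meaningless, so skip it."""
    first, second = fetch_html(url), fetch_html(url)
    return (
        first is not None
        and second is not None
        and content_hash(first) == content_hash(second)
    )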

This probably requires augmenting our Duplicate Detection task to run these sorts of tests.

ChatGPT-generated SQL Query

WITH norm AS (
    SELECT
        id,
        url,
        -- 1) remove the http/https scheme
        -- 2) remove one trailing slash
        regexp_replace(
            regexp_replace(url, '^https?://', '', 'i'),
            '/$',
            ''
        ) AS url_norm
    FROM urls
)
SELECT
    url_norm,
    count(*) AS n,
    array_agg(id ORDER BY id) AS ids,
    array_agg(url ORDER BY url) AS samples
FROM norm
GROUP BY url_norm
HAVING count(*) > 1
ORDER BY n DESC;
