The news crawler currently relies exclusively on RSS/Atom feeds and news sitemaps to find links to news articles. However, some news sites do not provide feeds or sitemaps. To follow these sites, the crawler should be able to monitor HTML pages manually marked as seeds and extract links from them:
- add a parser class to the topology which
  - exclusively parses URLs marked as verified HTML seeds (e.g. by a metadata key isHtmlSeed)
  - extracts links from the HTML and sends them to the status index as DISCOVERED
  - (optionally) filters the outlinks: same host or domain, or configurable URL patterns stored in the status index for the HTML seed
- the (adaptive) scheduler must be configured to schedule the refetch of HTML seeds
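
The parsing and filtering steps could look roughly like the sketch below. It is written in plain Python to illustrate the logic only and is not the crawler's actual parser class. The function name `extract_outlinks`, the helper `LinkCollector`, the `"true"` value of the `isHtmlSeed` key and the `url_patterns` argument are all assumptions made for illustration. The filter here is same-host only; a same-domain filter would need a public suffix list.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href values of <a> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_outlinks(seed_url, html, metadata, url_patterns=None):
    """Return (url, status) pairs for the outlinks of a verified HTML seed."""
    # Only pages flagged as HTML seeds are parsed (metadata value is an assumption).
    if metadata.get("isHtmlSeed") != "true":
        return []

    collector = LinkCollector()
    collector.feed(html)
    collector.close()

    seed_host = urlsplit(seed_url).hostname
    patterns = [re.compile(p) for p in (url_patterns or [])]

    seen = set()
    outlinks = []
    for href in collector.hrefs:
        url = urljoin(seed_url, href)  # resolve relative links against the seed
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parts.hostname != seed_host:
            continue  # same-host filter
        if patterns and not any(p.search(url) for p in patterns):
            continue  # optional per-seed URL patterns
        if url not in seen:
            seen.add(url)
            outlinks.append((url, "DISCOVERED"))
    return outlinks
```

Calling `extract_outlinks(seed_url, html, {"isHtmlSeed": "true"}, [r"/news/"])` would keep only same-host links whose URL contains `/news/`; pages without the seed flag yield no outlinks, so ordinary article pages are never used for link discovery.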