fix(scrapers): Use RequestQueue directly to avoid excessive RQ writes on runs with large startUrls list #393

Merged
janbuchar merged 10 commits into master from use-request-queue-directly on May 21, 2025

Conversation

janbuchar (Contributor) commented on May 14, 2025:

  • affects all generic scrapers
  • closes #392 (Increased RQ write usage in generic scrapers)
  • the change means that we will skip the RQ+RL tandem behavior from BasicCrawler
  • an alternative solution would be to initialize the tandem lazily, but this would be tricky due to 1) the amount of non-trivial code that depends on requestQueue being present and 2) input containing the requestQueueName option that we'd also need to take into account

Caveats

  • with large startUrls, there will be a lot of RQ writes billed at the start of the Actor
    • this should not be a huge deal
    • however, the same bunch of writes will happen again on a resume/migration/whatever - maybe we should add a useState flag so that we only insert the startUrls once (a rough sketch of this idea follows below)
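
A minimal sketch of the useState-flag idea mentioned above (not part of this PR), assuming the Actor.useState() helper from the Apify SDK, which persists its state object to the default key-value store across migrations; the state name and shape are illustrative:

```typescript
import { Actor } from 'apify';
import { RequestQueue } from 'crawlee';

await Actor.init();

const requestQueue = await RequestQueue.open();
// Hypothetical persisted flag so the startUrls batch is only written once,
// even after the run is migrated or resurrected.
const state = await Actor.useState('crawl-state', { startUrlsAdded: false });

if (!state.startUrlsAdded) {
    // In the scraper this would be the requests parsed from the startUrls input.
    await requestQueue.addRequests([{ url: 'https://example.com' }]);
    state.startUrlsAdded = true;
}

await Actor.exit();
```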

@janbuchar janbuchar added the t-tooling label (Issues with this label are in the ownership of the tooling team) on May 14, 2025
@janbuchar janbuchar requested review from B4nan and barjin May 14, 2025 13:42
@github-actions github-actions bot added this to the 114th sprint - Tooling team milestone May 14, 2025
// RequestQueue
this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {

Member:

do we even want to go through the RL?

janbuchar (author):

We only use it here to parse the input conveniently. We could pass the input directly to addRequests, but then we couldn't pick only the first N requests, because startUrls can contain links to files that themselves contain multiple links.

for await (const request of await RequestList.open(null, startUrls)) {
if (
this.input.maxResultsPerCrawl > 0 &&
requests.length >= 1.5 * this.input.maxResultsPerCrawl

Member:

When I mentioned the over-enqueuing to 150% of the limit, I was mainly concerned about the requests added via enqueueLinks; I don't think it's common that people provide more start URLs than the limit itself - the issue is what we enqueue on those pages. I guess we'd have to handle this at the Crawlee level.

janbuchar (author):

> I don't think it's common that people provide more start URLs than the limit itself

Bold claim. I can imagine, for example:

  1. someone setting a limit to get a quick sample of the results while developing their pageFunction
  2. parsing URLs from an external source that's outside the user's control - it would make sense to add a limit in that case to prevent mayhem

The 150% is there because the crawl may fail for some URLs.

Member:

I am not saying we shouldn't handle this here, I am saying the important part is elsewhere.

barjin (Contributor) commented on May 15, 2025:

+1 for processing this in Crawlee, see apify/crawlee#2728.

Also, this.input.maxResultsPerCrawl is slightly confusing naming here - even if pageFunction returns null (or failedRequestHandler is called), it generates an "empty" dataset record (see #353).

IIRC maxResultsPerCrawl here = maxRequestsPerCrawl in Crawlee.

@janbuchar janbuchar marked this pull request as ready for review May 15, 2025 09:17

barjin (Contributor) left a comment:

Let's see how it goes 🚀

) {
break;
}
requests.push(request);

Contributor:

I'm thinking this (initializing all of the Request instances preemptively) could potentially cause some OOM issues (especially with Cheerio crawler, which users tend to run with smaller RAM allocation). We are basically reverting this PR.

Either way, there are no breaking changes in the API. Let's release it - reverting is always an option 🤷

janbuchar (author):

You raise a good point here.

janbuchar (author):

If we do this - apify/crawlee#2980 - we can make this thingy more efficient.

@barjin barjin self-requested a review May 19, 2025 12:46

barjin (Contributor) left a comment:

> however, the same bunch of writes will happen on a resume/migration/whatever - maybe we should add a useState flag so that we can only insert the startUrls once

Oh yes, we might want to take care of this. I recall similar logic in WCC. Not having this raised the RQ write count significantly for some longer runs with migrations and resurrects.

janbuchar (author):

> however, the same bunch of writes will happen on a resume/migration/whatever - maybe we should add a useState flag so that we can only insert the startUrls once
>
> Oh yes, we might want to take care of this. I recall similar logic in WCC. Not having this raised the RQ write count significantly for some longer runs with migrations and resurrects.

Yeah, it surely is an eyesore 😁 Do you think it makes sense to handle this flag in Crawlee?

barjin (Contributor) commented on May 20, 2025:

> handle this flag in Crawlee

Yes, preferably we could handle this inside crawler.run([...list of requests...]) here, because looking at the code, we have the exact same issue there. But still, I'd support only this native interface - if any user (including us) decides to transfer requests from one storage to another before running the crawler, like we do here, that's IMHO on them (us).
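
A user-level sketch of that idea, not the actual Crawlee-side fix: it assumes crawler.useState() and crawler.addRequests() from crawlee, and the flag name is illustrative:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, log }) => log.info(`Processing ${request.url}`),
});

// Hypothetical persisted flag: write the initial request list to the request
// queue only on the first run, not again after a migration or resurrection.
const state = await crawler.useState({ initialRequestsAdded: false });

if (!state.initialRequestsAdded) {
    await crawler.addRequests(['https://example.com']);
    state.initialRequestsAdded = true;
}

// run() without arguments processes whatever is already in the queue.
await crawler.run();
```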

Contributor:

The state maintenance looks valid 👍

I can think of a situation with requestQueueName input that will behave differently after these changes (two runs, shared RQ, not shared KVS), but it's IMO such an edge-case I wouldn't even consider this breaking.

@janbuchar janbuchar changed the title fix: Use RequestQueue directly to avoid excessive RQ writes on runs with large startUrls list fix(scrapers): Use RequestQueue directly to avoid excessive RQ writes on runs with large startUrls list May 21, 2025
@janbuchar janbuchar requested a review from barjin May 21, 2025 09:54
@janbuchar janbuchar merged commit 3388d13 into master May 21, 2025
9 checks passed
@janbuchar janbuchar deleted the use-request-queue-directly branch May 21, 2025 09:57