fix(scrapers): Use RequestQueue directly to avoid excessive RQ writes on runs with large startUrls list #393

Merged
janbuchar merged 10 commits into master from use-request-queue-directly on May 21, 2025

Conversation

janbuchar (Contributor) commented on May 14, 2025:

  • affects all generic scrapers
  • closes #392 (Increased RQ write usage in generic scrapers)
  • the change means that we will skip the RQ+RL tandem behavior from BasicCrawler
  • an alternative solution would be to initialize the tandem lazily, but this would be tricky due to 1) the amount of non-trivial code that depends on requestQueue being present and 2) input containing the requestQueueName option that we'd also need to take into account

Caveats

  • with large startUrls, there will be a lot of RQ writes billed at the start of the Actor
    • this should not be a huge deal
    • however, the same bunch of writes will happen again on a resume/migration/whatever - maybe we should add a useState flag so that we only insert the startUrls once (a rough sketch of this idea follows below)
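
A minimal sketch of the useState-flag idea mentioned above (not part of this PR), assuming the Actor.useState() helper from the Apify SDK, which persists its state object to the default key-value store across migrations; the state name and shape are illustrative:

```typescript
import { Actor } from 'apify';
import { RequestQueue } from 'crawlee';

await Actor.init();

const requestQueue = await RequestQueue.open();
// Hypothetical persisted flag so the startUrls batch is only written once,
// even after the run is migrated or resurrected.
const state = await Actor.useState('crawl-state', { startUrlsAdded: false });

if (!state.startUrlsAdded) {
    // In the scraper this would be the requests parsed from the startUrls input.
    await requestQueue.addRequests([{ url: 'https://example.com' }]);
    state.startUrlsAdded = true;
}

await Actor.exit();
```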

@janbuchar janbuchar added the t-tooling label (Issues with this label are in the ownership of the tooling team) on May 14, 2025
@janbuchar janbuchar requested review from B4nan and barjin May 14, 2025 13:42
@github-actions github-actions bot added this to the 114th sprint - Tooling team milestone May 14, 2025
// RequestQueue
this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {

Member:

do we even want to go through the RL?

janbuchar (author):

We only use it here to parse the input conveniently. We could pass the input directly to addRequests, but then we couldn't pick only the first N requests, because startUrls can contain links to files that themselves contain multiple links.

for await (const request of await RequestList.open(null, startUrls)) {
if (
this.input.maxResultsPerCrawl > 0 &&
requests.length >= 1.5 * this.input.maxResultsPerCrawl

Member:

When I mentioned the over-enqueuing to 150% of the limit, I was mainly concerned about the requests added via enqueueLinks; I don't think it's common that people provide more start URLs than the limit itself - the issue is what we enqueue on those pages. I guess we'd have to handle this at the Crawlee level.

janbuchar (author):

> I don't think it's common that people provide more start URLs than the limit itself

Bold claim. I can imagine, for example:

  1. someone setting a limit to get a quick sample of the results while developing their pageFunction
  2. parsing URLs from an external source that's outside the user's control - it would make sense to add a limit in that case to prevent mayhem

The 150% is there because the crawl may fail for some URLs.

Member:

I am not saying we shouldn't handle this here, I am saying the important part is elsewhere.

barjin (Contributor) commented on May 15, 2025:

+1 for processing this in Crawlee, see apify/crawlee#2728.

Also, this.input.maxResultsPerCrawl is slightly confusing naming here - even if pageFunction returns null (or failedRequestHandler is called), it generates an "empty" dataset record (see #353).

IIRC maxResultsPerCrawl here = maxRequestsPerCrawl in Crawlee.

@janbuchar janbuchar marked this pull request as ready for review May 15, 2025 09:17

barjin (Contributor) left a comment:

Let's see how it goes 🚀

) {
break;
}
requests.push(request);

Contributor:

I'm thinking this (initializing all of the Request instances preemptively) could potentially cause some OOM issues (especially with Cheerio crawler, which users tend to run with smaller RAM allocation). We are basically reverting this PR.

Either way, there are no breaking changes in the API. Let's release it - reverting is always an option 🤷

janbuchar (author):

You raise a good point here.

janbuchar (author):

If we do this - apify/crawlee#2980 - we can make this thingy more efficient.

@barjin barjin self-requested a review May 19, 2025 12:46

barjin (Contributor) left a comment:

> however, the same bunch of writes will happen on a resume/migration/whatever - maybe we should add a useState flag so that we can only insert the startUrls once

Oh yes, we might want to take care of this. I recall similar logic in WCC. Not having this raised the RQ write count significantly for some longer runs with migrations and resurrects.

janbuchar (author):

> however, the same bunch of writes will happen on a resume/migration/whatever - maybe we should add a useState flag so that we can only insert the startUrls once
>
> Oh yes, we might want to take care of this. I recall similar logic in WCC. Not having this raised the RQ write count significantly for some longer runs with migrations and resurrects.

Yeah, it surely is an eyesore 😁 Do you think it makes sense to handle this flag in Crawlee?

barjin (Contributor) commented on May 20, 2025:

> handle this flag in Crawlee

Yes, preferably we could handle this inside crawler.run([...list of requests...]) here, because looking at the code, we have the exact same issue there. But still, I'd support only this native interface - if any user (including us) decides to transfer requests from one storage to another before running the crawler, like we do here, that's IMHO on them (us).
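
A user-level sketch of that idea, not the actual Crawlee-side fix: it assumes crawler.useState() and crawler.addRequests() from crawlee, and the flag name is illustrative:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, log }) => log.info(`Processing ${request.url}`),
});

// Hypothetical persisted flag: write the initial request list to the request
// queue only on the first run, not again after a migration or resurrection.
const state = await crawler.useState({ initialRequestsAdded: false });

if (!state.initialRequestsAdded) {
    await crawler.addRequests(['https://example.com']);
    state.initialRequestsAdded = true;
}

// run() without arguments processes whatever is already in the queue.
await crawler.run();
```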

Contributor:

The state maintenance looks valid 👍

I can think of a situation with requestQueueName input that will behave differently after these changes (two runs, shared RQ, not shared KVS), but it's IMO such an edge-case I wouldn't even consider this breaking.

@janbuchar janbuchar changed the title fix: Use RequestQueue directly to avoid excessive RQ writes on runs with large startUrls list fix(scrapers): Use RequestQueue directly to avoid excessive RQ writes on runs with large startUrls list May 21, 2025
@janbuchar janbuchar requested a review from barjin May 21, 2025 09:54
@janbuchar janbuchar merged commit 3388d13 into master May 21, 2025
9 checks passed
@janbuchar janbuchar deleted the use-request-queue-directly branch May 21, 2025 09:57