fix(scrapers): Use RequestQueue directly to avoid excessive RQ writes on runs with large startUrls list #393
Conversation
// RequestQueue
this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {
Do we even want to go through the RL?
We only use it here to parse the input conveniently. We could pass the input directly to `addRequests`, but then we couldn't pick only the first N requests, because `startUrls` can contain links to files that contain multiple links.
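A minimal sketch of that idea, assuming Crawlee's `RequestList`/`RequestQueue` APIs; the `Input` shape and function name below are hypothetical, made up for illustration:

```ts
import { RequestList, RequestQueue, Request } from 'crawlee';

// Hypothetical input shape for illustration; `requestsFromUrl` entries point to
// remote files that expand into many URLs, which is why RequestList is handy here.
interface Input {
    startUrls: { url?: string; requestsFromUrl?: string }[];
}

async function addStartRequests(input: Input, requestQueue: RequestQueue): Promise<void> {
    const requests: Request[] = [];
    // RequestList.open() parses the startUrls input, downloading and expanding
    // any linked URL files, and yields ready-made Request instances.
    for await (const request of await RequestList.open(null, input.startUrls)) {
        requests.push(request);
    }
    // A single addRequests() call then writes the parsed requests to the queue.
    await requestQueue.addRequests(requests);
}
```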
for await (const request of await RequestList.open(null, startUrls)) {
    if (
        this.input.maxResultsPerCrawl > 0 &&
        requests.length >= 1.5 * this.input.maxResultsPerCrawl
When I mentioned the over-enqueueing to 150% of the limit, I was mainly concerned about the requests added via `enqueueLinks`. I don't think it's common that people provide more start URLs than the limit itself; the issue is what we enqueue on those pages. I guess we'd have to handle this at the Crawlee level.
> I don't think it's common that people provide more start URLs than the limit itself

Bold claim. I can imagine, for example:
- someone setting a limit to get a quick sample of the results when developing their `pageFunction`
- parsing URLs from an external source that's out of the user's control - it would make sense to add a limit in that case to prevent mayhem

The 150% is there because the crawl may fail for some URLs.
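For example, with `maxResultsPerCrawl` set to 100, the 1.5 factor caps the enqueued start requests at 150, so the run can still reach 100 results even if up to 50 of those requests fail.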
I am not saying we shouldn't handle this here, I am saying the important part is elsewhere.
+1 for processing this in Crawlee, see apify/crawlee#2728.

Also, `this.input.maxResultsPerCrawl` is slightly confusing naming here - even if `pageFunction` returns `null` (or `failedRequestHandler` is called), it generates an "empty" dataset record (see #353). IIRC `maxResultsPerCrawl` here = `maxRequestsPerCrawl` in Crawlee.
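As a rough illustration of that naming overlap (a sketch only, with a made-up `input` object; this is not how the actor actually wires its options):

```ts
import { CheerioCrawler } from 'crawlee';

const input = { maxResultsPerCrawl: 100 }; // hypothetical actor input

const crawler = new CheerioCrawler({
    // Crawlee's own option is called maxRequestsPerCrawl and limits how many
    // requests are processed - a request whose pageFunction returns null still
    // counts, which is what makes the actor-level name maxResultsPerCrawl misleading.
    maxRequestsPerCrawl: input.maxResultsPerCrawl > 0 ? input.maxResultsPerCrawl : undefined,
    requestHandler: async () => {
        // the actor's pageFunction would run here
    },
});
```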
Let's see how it goes 🚀
    ) {
        break;
    }
    requests.push(request);
I'm thinking this (initializing all of the `Request` instances preemptively) could potentially cause some OOM issues (especially with the Cheerio crawler, which users tend to run with smaller RAM allocation). We are basically reverting this PR.

Either way, there are no breaking changes in the API. Let's release it - reverting is always an option 🤷
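One way to keep memory bounded would be to flush to the queue in chunks instead of materializing the whole array first - a sketch only (not what this PR does), using Crawlee's `addRequests` on a queue passed in by the caller:

```ts
import { RequestList, RequestQueue, Request } from 'crawlee';

// Sketch: add start requests in fixed-size chunks so the full Request array
// never has to sit in memory at once (relevant for low-RAM Cheerio runs).
async function enqueueStartUrlsChunked(
    requestQueue: RequestQueue,
    startUrls: { url?: string; requestsFromUrl?: string }[],
    chunkSize = 1000,
): Promise<void> {
    let chunk: Request[] = [];
    for await (const request of await RequestList.open(null, startUrls)) {
        chunk.push(request);
        if (chunk.length >= chunkSize) {
            await requestQueue.addRequests(chunk);
            chunk = [];
        }
    }
    if (chunk.length > 0) {
        await requestQueue.addRequests(chunk);
    }
}
```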
You raise a good point here.
If we do this - apify/crawlee#2980 - we can make this thingy more efficient.
However, the same bunch of writes will happen on a resume/migration/whatever - maybe we should add a `useState` flag so that we can only insert the `startUrls` once.

Oh yes, we might want to take care of this. I recall similar logic in WCC. Not having this raised the RQ write count significantly for some longer runs with migrations and resurrects.

Yeah, it surely is an eyesore 😁 Do you think it makes sense to handle this flag in Crawlee?

Yes, preferably we could handle this inside of Crawlee.
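A sketch of the suggested flag, assuming the Apify SDK's `Actor.useState()` (which persists the state object to the run's default key-value store); the state key and field names are made up for illustration:

```ts
import { Actor } from 'apify';
import { RequestQueue, Request } from 'crawlee';

// Sketch: remember that the start URLs were already inserted, so a migrated or
// resurrected run does not re-add (and re-bill) the same requests to the RQ.
async function addStartRequestsOnce(requestQueue: RequestQueue, requests: Request[]): Promise<void> {
    const state = await Actor.useState('scraper-state', { startUrlsAdded: false });
    if (state.startUrlsAdded) return;

    await requestQueue.addRequests(requests);
    // The mutation is picked up by Actor.useState's automatic persistence,
    // so the flag survives migrations and resurrects.
    state.startUrlsAdded = true;
}
```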
The state maintenance looks valid 👍

I can think of a situation with the `requestQueueName` input that will behave differently after these changes (two runs, shared RQ, not shared KVS), but it's IMO such an edge case I wouldn't even consider this breaking.
Caveats
- With a large `startUrls` list, there will be a lot of RQ writes billed at the start of the Actor.
- A `useState` flag is used so that we only insert the `startUrls` once.