fix(scrapers): Use RequestQueue directly to avoid excessive RQ writes on runs with large startUrls list #393
@@ -76,7 +76,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {
     requestQueueName?: string;

     crawler!: PlaywrightCrawler;
-    requestList!: RequestList;
     dataset!: Dataset;
     pagesOutputted!: number;
     private initPromise: Promise<void>;
@@ -167,7 +166,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {

         // Initialize async operations.
         this.crawler = null!;
-        this.requestList = null!;
         this.requestQueue = null!;
         this.dataset = null!;
         this.keyValueStore = null!;
@@ -182,14 +180,19 @@ export class CrawlerSetup implements CrawlerSetupOptions {
             return req;
         });

-        this.requestList = await RequestList.open(
-            'PLAYWRIGHT_SCRAPER',
-            startUrls,
-        );
-
         // RequestQueue
         this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

+        const requests: Request[] = [];
+        for await (const request of await RequestList.open(null, startUrls)) {
+            if (this.input.maxResultsPerCrawl > 0 && requests.length >= 1.5 * this.input.maxResultsPerCrawl) {
+                break
+            }
+            requests.push(request);
+        }
+
+        await this.requestQueue.addRequestsBatched(requests);
+
         // Dataset
         this.dataset = await Dataset.open(this.datasetName);
         const info = await this.dataset.getInfo();

Review comment: I'm thinking this (initializing all of the […]). Either way, there are no breaking changes in the API. Let's release it - reverting is always an option 🤷

Reply: You raise a good point here.

Reply: If we do this - apify/crawlee#2980 - we can make this thingy more efficient.
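For context on the batching call introduced above, here is a minimal sketch of the pattern, assuming Crawlee v3's RequestQueue.addRequestsBatched; the URL list and the option values shown are illustrative and not taken from this PR:

```ts
import { RequestQueue } from 'crawlee';

// Sketch only: enqueue a large list of URLs with batched writes instead of one
// write per request. Option names reflect Crawlee v3 and may differ in other versions.
const requestQueue = await RequestQueue.open();

const urls = ['https://example.com/a', 'https://example.com/b']; // hypothetical input

await requestQueue.addRequestsBatched(
    urls.map((url) => ({ url })),
    {
        batchSize: 1000, // how many requests go into a single write batch
        waitForAllRequestsToBeAdded: false, // resolve early; remaining batches are added in the background
    },
);
```

Because requests are flushed in chunks, a run with a very large startUrls list no longer needs a separate queue write per URL, which is the excessive-write problem the PR title refers to.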
@@ -207,7 +210,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {

         const options: PlaywrightCrawlerOptions = {
             requestHandler: this._requestHandler.bind(this),
-            requestList: this.requestList,
             requestQueue: this.requestQueue,
             requestHandlerTimeoutSecs: this.devtools
                 ? DEVTOOLS_TIMEOUT_SECS
Review comment: The state maintenance looks valid 👍 I can think of a situation with […]
Review comment: Do we even want to go through the RL (RequestList)?

Reply: We only use it here to parse the input conveniently. We could pass the input directly to addRequests, but then we couldn't pick only the first N requests, because the startUrls can contain links to files that themselves contain multiple links.
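To make that last point concrete, here is a minimal sketch of why the RequestList pass is kept, using a hypothetical remote URL file: startUrls entries may use requestsFromUrl, which RequestList.open expands into individual requests, so iterating the list is what allows capping how many requests reach the queue.

```ts
import { Request, RequestList } from 'crawlee';

// Sketch only: startUrls mixes a plain URL with a requestsFromUrl source, i.e. a
// remote text file that itself contains many links (one per line).
const startUrls = [
    { url: 'https://example.com/start' },
    { requestsFromUrl: 'https://example.com/url-list.txt' }, // hypothetical file
];

// RequestList.open(null, ...) resolves the file into concrete requests, so we
// can stop after the first N of them instead of enqueueing the whole file.
const limit = 100; // stand-in for 1.5 * maxResultsPerCrawl in the PR
const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {
    if (requests.length >= limit) break;
    requests.push(request);
}
```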