fix(scrapers): Use RequestQueue directly to avoid excessive RQ writes on runs with large startUrls list #393


Merged: 10 commits, May 21, 2025
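
Summary of the change: each scraper previously opened a named RequestList for startUrls next to its RequestQueue; now the startUrls sources are parsed once and written to the queue in a single batched call, avoiding a queue write per request on large inputs. A condensed sketch of the new flow (assuming `startUrls`, `requestQueueName`, and `input` are already read from the actor input; the names mirror the diff below):

```ts
import { RequestList, RequestQueueV2, type Request } from '@crawlee/cheerio';

// Open the (optionally named) queue once.
const requestQueue = await RequestQueueV2.open(requestQueueName);

// RequestList.open(null, sources) is used purely to parse the startUrls
// input; iterating it yields the fully expanded Request objects.
const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {
    // Stop early: queueing far more than maxResultsPerCrawl is wasted work.
    if (
        input.maxResultsPerCrawl > 0 &&
        requests.length >= 1.5 * input.maxResultsPerCrawl
    ) {
        break;
    }
    requests.push(request);
}

// One batched write instead of a queue write per request.
await requestQueue.addRequestsBatched(requests);
```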
192 changes: 96 additions & 96 deletions package-lock.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion package.json
@@ -77,7 +77,7 @@
"@types/tough-cookie": "^4.0.5",
"@types/ws": "^8.5.12",
"commitlint": "^19.3.0",
"crawlee": "^3.13.0",
"crawlee": "^3.13.4",
"eslint": "^9.23.0",
"eslint-config-prettier": "^10.1.1",
"fs-extra": "^11.2.0",
@@ -76,7 +76,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {
requestQueueName?: string;

crawler!: PlaywrightCrawler;
requestList!: RequestList;
dataset!: Dataset;
pagesOutputted!: number;
private initPromise: Promise<void>;
@@ -167,7 +166,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {

// Initialize async operations.
this.crawler = null!;
this.requestList = null!;
this.requestQueue = null!;
this.dataset = null!;
this.keyValueStore = null!;
@@ -182,14 +180,19 @@
return req;
});

this.requestList = await RequestList.open(
'PLAYWRIGHT_SCRAPER',
startUrls,
);

// RequestQueue
this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {
Member: do we even want to go through the RL?

Contributor Author: We only use it here to parse the input conveniently. We could pass the input directly to addRequests, but then we couldn't pick only the first N requests, because startUrls can contain links to files that themselves contain multiple links.
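
For illustration (a hypothetical input, not taken from this PR), a startUrls array can mix plain URLs with `requestsFromUrl` entries pointing at remote files of URLs, so the total request count is unknown until the sources are parsed, which is why the enqueue loop in this diff caps the count while iterating:

```ts
// Hypothetical startUrls input. The second entry may expand into
// thousands of requests once the remote file is fetched and parsed.
const startUrls = [
    { url: 'https://example.com/start' },
    { requestsFromUrl: 'https://example.com/url-list.txt' },
];
```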

if (this.input.maxResultsPerCrawl > 0 && requests.length >= 1.5 * this.input.maxResultsPerCrawl) {
break;
}
requests.push(request);
Contributor: I'm thinking this (initializing all of the Request instances preemptively) could potentially cause some OOM issues, especially with the Cheerio crawler, which users tend to run with a smaller RAM allocation. We are basically reverting this PR.

Either way, there are no breaking changes in the API. Let's release it; reverting is always an option 🤷

Contributor Author: You raise a good point here.

Contributor Author: If we do this (apify/crawlee#2980), we can make this thingy more efficient.
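
A possible mitigation for the memory concern (a sketch only, not part of this PR; `CHUNK_SIZE` is a hypothetical knob): flush to the queue in fixed-size chunks instead of materializing every Request first, so peak memory stays bounded:

```ts
import { RequestList, type Request } from '@crawlee/cheerio';

// Assumes `requestQueue`, `startUrls`, and `input` exist as in the
// surrounding CrawlerSetup code.
const CHUNK_SIZE = 1000;
const limit =
    input.maxResultsPerCrawl > 0 ? 1.5 * input.maxResultsPerCrawl : Infinity;

let chunk: Request[] = [];
let enqueued = 0;
for await (const request of await RequestList.open(null, startUrls)) {
    if (enqueued + chunk.length >= limit) break;
    chunk.push(request);
    if (chunk.length >= CHUNK_SIZE) {
        // Flush periodically so we never hold every Request in memory.
        await requestQueue.addRequestsBatched(chunk);
        enqueued += chunk.length;
        chunk = [];
    }
}
// Flush the remainder.
if (chunk.length > 0) await requestQueue.addRequestsBatched(chunk);
```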

}

await this.requestQueue.addRequestsBatched(requests);

// Dataset
this.dataset = await Dataset.open(this.datasetName);
const info = await this.dataset.getInfo();
@@ -207,7 +210,6 @@

const options: PlaywrightCrawlerOptions = {
requestHandler: this._requestHandler.bind(this),
requestList: this.requestList,
requestQueue: this.requestQueue,
requestHandlerTimeoutSecs: this.devtools
? DEVTOOLS_TIMEOUT_SECS
2 changes: 1 addition & 1 deletion packages/actor-scraper/cheerio-scraper/package.json
@@ -6,7 +6,7 @@
"type": "module",
"dependencies": {
"@apify/scraper-tools": "^1.1.4",
"@crawlee/cheerio": "^3.13.2",
"@crawlee/cheerio": "^3.13.4",
"apify": "^3.2.6"
},
"devDependencies": {
Contributor: The state maintenance looks valid 👍

I can think of a situation with the requestQueueName input that will behave differently after these changes (two runs sharing the RQ but not the KVS), but it's IMO such an edge case that I wouldn't even consider this breaking.

@@ -20,7 +20,7 @@ import {
RequestList,
RequestQueueV2,
} from '@crawlee/cheerio';
import type { ApifyEnv } from 'apify';
import type { ApifyClient, ApifyEnv } from 'apify';
import { Actor } from 'apify';
import { load } from 'cheerio';

@@ -70,7 +70,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {
requestQueueName?: string;

crawler!: CheerioCrawler;
requestList!: RequestList;
dataset!: Dataset;
pagesOutputted!: number;
proxyConfiguration?: ProxyConfiguration;
@@ -151,7 +150,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {

// Initialize async operations.
this.crawler = null!;
this.requestList = null!;
this.requestQueue = null!;
this.dataset = null!;
this.keyValueStore = null!;
@@ -167,11 +165,22 @@
return req;
});

this.requestList = await RequestList.open('CHEERIO_SCRAPER', startUrls);

// RequestQueue
this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {
if (
this.input.maxResultsPerCrawl > 0 &&
requests.length >= 1.5 * this.input.maxResultsPerCrawl
) {
break;
}
requests.push(request);
}

await this.requestQueue.addRequestsBatched(requests);

// Dataset
this.dataset = await Dataset.open(this.datasetName);
const info = await this.dataset.getInfo();
@@ -197,7 +206,6 @@
requestHandler: this._requestHandler.bind(this),
preNavigationHooks: [],
postNavigationHooks: [],
requestList: this.requestList,
requestQueue: this.requestQueue,
navigationTimeoutSecs: this.input.pageLoadTimeoutSecs,
requestHandlerTimeoutSecs: this.input.pageFunctionTimeoutSecs,
@@ -69,7 +69,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {
requestQueueName?: string;

crawler!: JSDOMCrawler;
requestList!: RequestList;
dataset!: Dataset;
pagesOutputted!: number;
proxyConfiguration?: ProxyConfiguration;
@@ -150,7 +149,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {

// Initialize async operations.
this.crawler = null!;
this.requestList = null!;
this.requestQueue = null!;
this.dataset = null!;
this.keyValueStore = null!;
@@ -166,11 +164,19 @@
return req;
});

this.requestList = await RequestList.open('JSDOM_SCRAPER', startUrls);

// RequestQueue
this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {
if (this.input.maxResultsPerCrawl > 0 && requests.length >= 1.5 * this.input.maxResultsPerCrawl) {
break;
}
requests.push(request);
}

await this.requestQueue.addRequestsBatched(requests);

// Dataset
this.dataset = await Dataset.open(this.datasetName);
const info = await this.dataset.getInfo();
@@ -198,7 +204,6 @@
runScripts: this.input.runScripts ?? true,
hideInternalConsole: !(this.input.showInternalConsole ?? false),
postNavigationHooks: [],
requestList: this.requestList,
requestQueue: this.requestQueue,
navigationTimeoutSecs: this.input.pageLoadTimeoutSecs,
requestHandlerTimeoutSecs: this.input.pageFunctionTimeoutSecs,
@@ -80,7 +80,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {
requestQueueName?: string;

crawler!: PlaywrightCrawler;
requestList!: RequestList;
dataset!: Dataset;
pagesOutputted!: number;
private initPromise: Promise<void>;
@@ -198,7 +197,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {

// Initialize async operations.
this.crawler = null!;
this.requestList = null!;
this.requestQueue = null!;
this.dataset = null!;
this.keyValueStore = null!;
@@ -213,14 +211,19 @@
return req;
});

this.requestList = await RequestList.open(
'PLAYWRIGHT_SCRAPER',
startUrls,
);

// RequestQueue
this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {
if (this.input.maxResultsPerCrawl > 0 && requests.length >= 1.5 * this.input.maxResultsPerCrawl) {
break;
}
requests.push(request);
}

await this.requestQueue.addRequestsBatched(requests);

// Dataset
this.dataset = await Dataset.open(this.datasetName);
const info = await this.dataset.getInfo();
@@ -241,7 +244,6 @@

const options: PlaywrightCrawlerOptions = {
requestHandler: this._requestHandler.bind(this),
requestList: this.requestList,
requestQueue: this.requestQueue,
requestHandlerTimeoutSecs: this.devtools
? DEVTOOLS_TIMEOUT_SECS
@@ -79,7 +79,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {
requestQueueName?: string;

crawler!: PuppeteerCrawler;
requestList!: RequestList;
dataset!: Dataset;
pagesOutputted!: number;
private initPromise: Promise<void>;
@@ -195,7 +194,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {

// Initialize async operations.
this.crawler = null!;
this.requestList = null!;
this.requestQueue = null!;
this.dataset = null!;
this.keyValueStore = null!;
@@ -210,14 +208,19 @@
return req;
});

this.requestList = await RequestList.open(
'PUPPETEER_SCRAPER',
startUrls,
);

// RequestQueue
this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {
if (this.input.maxResultsPerCrawl > 0 && requests.length >= 1.5 * this.input.maxResultsPerCrawl) {
break;
}
requests.push(request);
}

await this.requestQueue.addRequestsBatched(requests);

// Dataset
this.dataset = await Dataset.open(this.datasetName);
const info = await this.dataset.getInfo();
@@ -238,7 +241,6 @@

const options: PuppeteerCrawlerOptions = {
requestHandler: this._requestHandler.bind(this),
requestList: this.requestList,
requestQueue: this.requestQueue,
requestHandlerTimeoutSecs: this.devtools
? DEVTOOLS_TIMEOUT_SECS
15 changes: 10 additions & 5 deletions packages/actor-scraper/web-scraper/src/internals/crawler_setup.ts
@@ -108,7 +108,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {
requestQueueName?: string;

crawler!: PuppeteerCrawler;
requestList!: RequestList;
dataset!: Dataset;
pagesOutputted!: number;
private initPromise: Promise<void>;
@@ -219,7 +218,6 @@ export class CrawlerSetup implements CrawlerSetupOptions {

// Initialize async operations.
this.crawler = null!;
this.requestList = null!;
this.requestQueue = null!;
this.dataset = null!;
this.keyValueStore = null!;
@@ -234,11 +232,19 @@
return req;
});

this.requestList = await RequestList.open('WEB_SCRAPER', startUrls);

// RequestQueue
this.requestQueue = await RequestQueueV2.open(this.requestQueueName);

const requests: Request[] = [];
for await (const request of await RequestList.open(null, startUrls)) {
if (this.input.maxResultsPerCrawl > 0 && requests.length >= 1.5 * this.input.maxResultsPerCrawl) {
break;
}
requests.push(request);
}

await this.requestQueue.addRequestsBatched(requests);

// Dataset
this.dataset = await Dataset.open(this.datasetName);
const info = await this.dataset.getInfo();
@@ -262,7 +268,6 @@

const options: PuppeteerCrawlerOptions = {
requestHandler: this._requestHandler.bind(this),
requestList: this.requestList,
requestQueue: this.requestQueue,
requestHandlerTimeoutSecs: this.isDevRun
? DEVTOOLS_TIMEOUT_SECS