Skip to content

refactor: use ContextPipeline to initialize BasicCrawler's context idiomatically#3388

Open
barjin wants to merge 12 commits intov4from
chore/more-context-pipeline
Open

refactor: use ContextPipeline to initialize BasicCrawler's context idiomatically#3388
barjin wants to merge 12 commits intov4from
chore/more-context-pipeline

Conversation

@barjin
Copy link
Member

@barjin barjin commented Feb 5, 2026

Extracts all CrawlingContext initialization to ContextPipeline steps to tighten the control over the CrawlingContext contents.

Blocks #3380

@barjin barjin self-assigned this Feb 5, 2026
@barjin barjin added the adhoc Ad-hoc unplanned task added during the sprint. label Feb 5, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request refactors the context initialization logic in Crawlee's crawler architecture by moving all CrawlingContext setup into the ContextPipeline. This change provides tighter control over context construction and prepares the codebase for the upcoming session pool exclusivity changes in PR #3380.

Changes:

  • Introduces a new buildContextPipeline() method in BasicCrawler that handles all core context initialization (helpers, request fetching, session management, etc.)
  • Moves context pipeline invocation from runRequestHandler() to the runTaskFunction level in AutoscaledPool
  • Updates subclasses (HttpCrawler, BrowserCrawler, FileDownload) to call super.buildContextPipeline() and extend the pipeline idiomatically

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
packages/basic-crawler/src/internals/basic-crawler.ts Adds buildContextPipeline() method for idiomatic context initialization; refactors runTaskFunction to invoke the pipeline at a higher level with improved error handling
packages/browser-crawler/src/internals/browser-crawler.ts Updates to call super.buildContextPipeline() and adds override keyword for type safety
packages/http-crawler/src/internals/http-crawler.ts Updates to call super.buildContextPipeline() instead of creating a new pipeline; moves ContextPipeline import to type-only import
packages/http-crawler/src/internals/file-download.ts Updates to call this.buildContextPipeline() for consistency with the new architecture
packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts Updates to apply result-bound helpers after pipeline execution to avoid being overwritten by base crawler helpers

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@barjin barjin marked this pull request as ready for review February 6, 2026 15:50
@barjin barjin requested review from janbuchar February 6, 2026 15:50
Copy link
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is more of a refactor, I'd say...

…extHelpers

The enqueueLinks helper was accidentally removed from the
resultBoundContextHelpers, causing links to not be enqueued correctly
through the RequestHandlerResult in the adaptive crawler.
…line building

Start context pipelines from {} instead of lying about an empty object
being a CrawlingContext. The pipeline gradually extends the type through
compose() calls until it reaches the final CrawlingContext shape.
@barjin barjin changed the title chore: use ContextPipeline to initialize BasicCrawler's context idiomatically refactor: use ContextPipeline to initialize BasicCrawler's context idiomatically Feb 9, 2026
Copy link
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a bunch of nits, good stuff overall!

@barjin barjin requested a review from janbuchar February 13, 2026 12:18
Copy link
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only three comments, two of them are fairly important.

})
.compose({
action: async (context) => {
// AdaptivePlaywrightCrawler passes edited request directly into the pipeline, we don't want to override that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. It feels somewhat wrong to mention a subclass here. I think we should be more upfront about the contextPipeline understanding some properties. Couldn't it start with {request?: Request}?

Speaking of AdaptivePlaywrightCrawler, it also wants to override context helpers that touch storages, and that might be worth adding to the initial context type of the ContextPipeline too...

async (finalContext) => await this.requestHandler(finalContext),
);
await this.staticContextPipeline.call(subCrawlerContext, async (finalContext) => {
Object.assign(finalContext, resultBoundContextHelpers);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that pre-navigation hooks, for example, will have direct access to the storage, which is a regression.

Ultimately, I'd like to implement the storage interception on a different level, without relying on shady monkey patching, so maybe we can get away with leaving this. Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looking deeper into the AdaptivePlaywrightCrawler impl, it could use some clean up (using ContextPipeline or not).

I can almost see the last contextPipelineBuilder step "idiomatically" overriding request and the helpers... but not quite (at least not without some refactoring).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brother, I already cleaned it up. But of course, there is always more room for improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants