refactor: use ContextPipeline to initialize BasicCrawler's context idiomatically #3388
Pull request overview
This pull request refactors the context initialization logic in Crawlee's crawler architecture by moving all CrawlingContext setup into the ContextPipeline. This change provides tighter control over context construction and prepares the codebase for the upcoming session pool exclusivity changes in PR #3380.
Changes:
- Introduces a new `buildContextPipeline()` method in `BasicCrawler` that handles all core context initialization (helpers, request fetching, session management, etc.)
- Moves context pipeline invocation from `runRequestHandler()` to the `runTaskFunction` level in `AutoscaledPool`
- Updates subclasses (`HttpCrawler`, `BrowserCrawler`, `FileDownload`) to call `super.buildContextPipeline()` and extend the pipeline idiomatically
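The extension pattern described in the bullets above can be sketched as follows. This is a minimal illustration, not the real Crawlee `ContextPipeline` API: the class shape, the step signatures, and the `*Sketch` names are invented for the example, and the real `buildContextPipeline()` is protected rather than public.

```typescript
// Minimal sketch of a composable context pipeline (not Crawlee's real API).
class ContextPipeline<Initial, Current> {
    private constructor(private readonly steps: Array<(context: any) => Promise<any>>) {}

    static create<T extends object>(): ContextPipeline<T, T> {
        return new ContextPipeline<T, T>([]);
    }

    // Each compose() call appends a step and widens the context type it produces.
    compose<Out extends Current>(step: (context: Current) => Promise<Out>): ContextPipeline<Initial, Out> {
        return new ContextPipeline<Initial, Out>([...this.steps, step]);
    }

    // Runs every step in order, then hands the fully built context to the consumer.
    async call(initial: Initial, consumer: (finalContext: Current) => Promise<void>): Promise<void> {
        let context: any = initial;
        for (const step of this.steps) {
            context = await step(context);
        }
        await consumer(context);
    }
}

// The base crawler owns the core initialization steps (public here for the
// sketch; the PR makes the real method protected on BasicCrawler)...
class BasicCrawlerSketch {
    buildContextPipeline() {
        return ContextPipeline.create<{}>()
            .compose(async (ctx) => ({ ...ctx, request: { url: 'https://example.com' } }))
            .compose(async (ctx) => ({ ...ctx, session: { id: 'session-1' } }));
    }
}

// ...and subclasses extend the inherited pipeline instead of rebuilding it.
class HttpCrawlerSketch extends BasicCrawlerSketch {
    override buildContextPipeline() {
        return super.buildContextPipeline()
            .compose(async (ctx) => ({ ...ctx, response: { statusCode: 200 } }));
    }
}
```

Because each subclass composes onto `super.buildContextPipeline()` rather than constructing a fresh pipeline, the base class keeps tight control over the core steps while subclasses only declare what they add on top.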
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/basic-crawler/src/internals/basic-crawler.ts | Adds buildContextPipeline() method for idiomatic context initialization; refactors runTaskFunction to invoke the pipeline at a higher level with improved error handling |
| packages/browser-crawler/src/internals/browser-crawler.ts | Updates to call super.buildContextPipeline() and adds override keyword for type safety |
| packages/http-crawler/src/internals/http-crawler.ts | Updates to call super.buildContextPipeline() instead of creating a new pipeline; moves ContextPipeline import to type-only import |
| packages/http-crawler/src/internals/file-download.ts | Updates to call this.buildContextPipeline() for consistency with the new architecture |
| packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts | Updates to apply result-bound helpers after pipeline execution to avoid being overwritten by base crawler helpers |
janbuchar left a comment:
this is more of a refactor, I'd say...
…extHelpers

The `enqueueLinks` helper was accidentally removed from the `resultBoundContextHelpers`, causing links not to be enqueued correctly through the `RequestHandlerResult` in the adaptive crawler.
…line building

Start context pipelines from `{}` instead of lying about an empty object being a `CrawlingContext`. The pipeline gradually extends the type through `compose()` calls until it reaches the final `CrawlingContext` shape.
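The typing change in this commit message can be illustrated with a small sketch. The shapes below are invented for the example, not Crawlee's real `CrawlingContext`:

```typescript
// Illustrative context shape, invented for this sketch.
interface CrawlingContext {
    request: { url: string };
    session: { id: string };
}

// Before: an empty object is cast to the final type, so the compiler cannot
// catch a step that forgets to fill a property in.
const pretended = {} as CrawlingContext; // pretended.request is undefined at runtime

// After: start from {} and let each step widen the type with what it adds.
async function addRequest<T extends object>(ctx: T) {
    return { ...ctx, request: { url: 'https://example.com' } };
}

async function addSession<T extends object>(ctx: T) {
    return { ...ctx, session: { id: 'session-1' } };
}

async function buildContext(): Promise<CrawlingContext> {
    // Only the fully composed result satisfies CrawlingContext.
    return addSession(await addRequest({}));
}
```

With the honest starting type, forgetting a step becomes a compile error instead of an `undefined` property at runtime.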
janbuchar left a comment:
Just a bunch of nits, good stuff overall!
janbuchar left a comment:
Only three comments, two of them are fairly important.
```ts
    })
    .compose({
        action: async (context) => {
            // AdaptivePlaywrightCrawler passes edited request directly into the pipeline, we don't want to override that
```
Huh. It feels somewhat wrong to mention a subclass here. I think we should be more upfront about the `contextPipeline` understanding some properties. Couldn't it start with `{ request?: Request }`?

Speaking of `AdaptivePlaywrightCrawler`, it also wants to override context helpers that touch storages, and that might be worth adding to the initial context type of the `ContextPipeline` too...
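The reviewer's suggestion can be sketched as follows. Everything here is hypothetical (the `InitialContext` type, the `fetchNextRequest` and `resolveRequest` names, and the URLs are invented to show the idea, not taken from the PR):

```typescript
// Hypothetical request shape for the sketch.
interface Request { url: string }

// The pipeline's initial type admits an optional pre-filled request,
// so no subclass needs to be special-cased in the base step.
type InitialContext = { request?: Request };

// Stand-in for fetching the next request from the queue.
async function fetchNextRequest(): Promise<Request> {
    return { url: 'https://example.com/from-queue' };
}

// The base step keeps a caller-provided request and only fetches otherwise.
async function resolveRequest(ctx: InitialContext): Promise<{ request: Request }> {
    return { ...ctx, request: ctx.request ?? (await fetchNextRequest()) };
}
```

This way the "a caller may pass a request in" contract lives in the pipeline's type instead of in a comment naming a specific subclass.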
```ts
        async (finalContext) => await this.requestHandler(finalContext),
    );
    await this.staticContextPipeline.call(subCrawlerContext, async (finalContext) => {
        Object.assign(finalContext, resultBoundContextHelpers);
```
This means that pre-navigation hooks, for example, will have direct access to the storage, which is a regression.
Ultimately, I'd like to implement the storage interception on a different level, without relying on shady monkey patching, so maybe we can get away with leaving this. Thoughts?
Yeah, looking deeper into the AdaptivePlaywrightCrawler impl, it could use some clean up (using ContextPipeline or not).
I can almost see the last contextPipelineBuilder step "idiomatically" overriding request and the helpers... but not quite (at least not without some refactoring).
Brother, I already cleaned it up. But of course, there is always more room for improvement.
Extracts all `CrawlingContext` initialization to `ContextPipeline` steps to tighten the control over the `CrawlingContext` contents.

Blocks #3380