Skip to content

feat: share SessionPool instance between crawlers#3499

Merged
barjin merged 26 commits into
v4from
feat/pass-session-pool
Apr 17, 2026
Merged

feat: share SessionPool instance between crawlers#3499
barjin merged 26 commits into
v4from
feat/pass-session-pool

Conversation

@barjin
Copy link
Copy Markdown
Member

@barjin barjin commented Mar 18, 2026

BasicCrawler and subclasses now accept sessionPool option for passing a pre-initialized SessionPool instance. This is mutually exclusive with sessionPoolOptions.

SessionPool instances now do not share the persistence keys (similar to e.g. Statistics).

Closes #3445

barjin added 3 commits March 18, 2026 11:25
Each SessionPool now gets an auto-incrementing `id` used to derive a unique `persistStateKey` (`SDK_SESSION_POOL_STATE_{id}`), preventing multiple instances from overwriting each other's persisted state. An explicit `id` or `persistStateKey` can still be provided to control persistence across restarts. BasicCrawler forwards its crawler ID to the SessionPool when available.
Adds a `sessionPool` option to BasicCrawlerOptions that accepts an already-initialized SessionPool, allowing multiple crawlers to share the same pool. When a shared pool is injected, the crawler skips creating a new one and does not tear it down or reset its store — the caller owns its lifecycle. Mutually exclusive with `sessionPoolOptions`.
Covers: using a shared pool instance, verifying it's not torn down by
the crawler, sharing across sequential crawlers, and the mutual
exclusivity guard with sessionPoolOptions.
@barjin barjin self-assigned this Mar 18, 2026
@barjin barjin requested a review from Copilot March 18, 2026 10:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves session pool isolation by default (unique persistence keys per pool) and adds support for injecting an existing SessionPool into BasicCrawler to enable session sharing across crawlers.

Changes:

  • Add SessionPoolOptions.id and change the default persistStateKey format to include the pool id for isolation.
  • Add BasicCrawlerOptions.sessionPool (mutually exclusive with sessionPoolOptions) and ensure injected pools are not reset/teardown by the crawler.
  • Update/add tests to cover persistence isolation and injected/shared session pool behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
packages/core/src/session_pool/session_pool.ts Introduces id and derives a unique default persistStateKey per pool instance.
packages/basic-crawler/src/internals/basic-crawler.ts Allows passing an existing SessionPool and tracks ownership to avoid tearing down shared pools.
test/core/session_pool/session_pool.test.ts Adjusts persistence tests for the new persist key behavior and adds isolation coverage.
test/core/crawlers/basic_crawler.test.ts Adds tests for accepting an injected SessionPool, lifecycle behavior, and sharing across crawlers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread packages/core/src/session_pool/session_pool.ts Outdated
Comment thread packages/basic-crawler/src/internals/basic-crawler.ts
Comment thread test/core/crawlers/basic_crawler.test.ts
Comment thread test/core/crawlers/basic_crawler.test.ts
barjin added 2 commits March 18, 2026 12:20
Fix "it's" → "its" typo in JSDoc, validate `sessionPool` option as
`instanceof SessionPool`, add missing teardown in test, and assert
actual session reuse in the shared pool test.
@barjin barjin requested review from janbuchar and l2ysho March 18, 2026 11:44
Comment on lines +849 to +851
throw new Error(
'Cannot use both `sessionPool` and `sessionPoolOptions` — pass either a pre-built SessionPool instance or options to create one, not both.',
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just abolish sessionPoolOptions?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but then we cannot determine whether the crawler owns the SessionPool instance or not... which means, no automatic teardown() (maybe not as problematic, as mentioned here), but also no automatic .resetStore() on crawler restart (which, maybe that's not as problematic either, who knows?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I tried dropping sessionPoolOptions and keeping only sessionPool in this commit.

Tbf, I didn't mind the sessionPoolOptions that much. Constructing the SessionPool explicitly each time shifts the responsibility to the user (see problems with resetStore and teardown above), which might feel burdensome.

I'm soft-leaning towards keeping both sessionPoolOptions (for BC and low-effort use-cases) and sessionPool (for SessionPool sharing and poweruser scenarios). Wdyt?

Copy link
Copy Markdown
Contributor

@janbuchar janbuchar Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, you made me check how the Python version does it. Key findings:

  1. No session pool options, if you want to fine tune it, pass in a configured instance
  2. init can be called multiple times, but the first teardown just... tears it down, which is not ideal 😁
  3. EDIT init can only be called once, subsequent initialization attempts throw an error. There is also an active property on the session pool (and other components that require async init/teardown, e.g., Statistics). BasicCrawler.run checks this active property.

I think the Python way is the right direction - it takes care of the init/teardown logic for the user, without providing two ways to customize the session pool.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can't we require Node.js 24+ so that we can use using and Symbol.dispose?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it takes care of the init/teardown logic for the user

I'm all for this, but we cannot safely call .teardown() / .resetStore() on a shared pool, potentially used by multiple crawler instances at once (which is what the original issue is asking for).

I made the SessionPool lazy-initialized, i.e., any call to its async methods checks whether it has been initialized - and runs init() if not.

The teardown calls are trickier, though. I'm now tracking whether this.sessionPool is owned by the crawler, or if it has been passed from the outside. .teardown() is then called only if it's owned. Imo this is the best compromise we can have here, but feel free to change my mind :)

Also, can't we require Node.js 24+ so that we can use using and Symbol.dispose?

I suppose we can implement Symbol.dispose to call teardown(), providing a more idiomatic interface for disposal, but I'm failing to see how it should help us with the shared pool instance 🤔

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're not using the Symbol.dispose approach anywhere else yet (and the implementation would be just calling this.teardown() anyway), I decided against adding it right now. Imo it would just prompt more questions (why does SessionPool have it and other classes do not?).

That being said, I'd be all for adding support for the using / Symbol.dispose API across the project in some later PR.

Comment thread test/core/crawlers/basic_crawler.test.ts
Comment thread docs/upgrading/upgrading_v4.md
@barjin barjin changed the base branch from v4 to feat/lazy-init-session-pool March 24, 2026 16:20
barjin added a commit that referenced this pull request Mar 24, 2026
This PR removes the static `SessionPool.open()` method. Until the `new
SessionPool()` instance has been interacted with, it doesn't access any
storage. Once the user (or another Crawlee component) calls one of the
`SessionPool` async methods, the `SessionPool` is automatically
initialized.

This simplifies usage in e.g. `AdaptivePlaywrightCrawler` (where parts
of the context pipeline are called separately before the `.initialize()`
call), makes things clearer for users, and does away with potentially
confusing temporal coupling (`SessionPool.initialize()` had to be called
before any other operation, otherwise SessionPool would throw
exceptions).

Prerequisite for #3499

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Base automatically changed from feat/lazy-init-session-pool to v4 March 24, 2026 16:36
@barjin barjin changed the title feat: accept pre-initialized SessionPool instance feat: share SessionPool instance between crawlers Mar 24, 2026
@barjin barjin requested review from janbuchar and vladfrangu April 16, 2026 12:10
Copy link
Copy Markdown
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this!

@barjin barjin merged commit bd7943d into v4 Apr 17, 2026
6 checks passed
@barjin barjin deleted the feat/pass-session-pool branch April 17, 2026 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants