feat: share SessionPool instance between crawlers#3499
Conversation
Each SessionPool now gets an auto-incrementing `id` used to derive a unique `persistStateKey` (`SDK_SESSION_POOL_STATE_{id}`), preventing multiple instances from overwriting each other's persisted state. An explicit `id` or `persistStateKey` can still be provided to control persistence across restarts. BasicCrawler forwards its crawler ID to the SessionPool when available.
Adds a `sessionPool` option to BasicCrawlerOptions that accepts an already-initialized SessionPool, allowing multiple crawlers to share the same pool. When a shared pool is injected, the crawler skips creating a new one and does not tear it down or reset its store — the caller owns its lifecycle. Mutually exclusive with `sessionPoolOptions`.
Covers: using a shared pool instance, verifying it's not torn down by the crawler, sharing across sequential crawlers, and the mutual exclusivity guard with sessionPoolOptions.
There was a problem hiding this comment.
Pull request overview
This PR improves session pool isolation by default (unique persistence keys per pool) and adds support for injecting an existing SessionPool into BasicCrawler to enable session sharing across crawlers.
Changes:
- Add
SessionPoolOptions.idand change the defaultpersistStateKeyformat to include the pool id for isolation. - Add
BasicCrawlerOptions.sessionPool(mutually exclusive withsessionPoolOptions) and ensure injected pools are not reset/teardown by the crawler. - Update/add tests to cover persistence isolation and injected/shared session pool behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
packages/core/src/session_pool/session_pool.ts |
Introduces id and derives a unique default persistStateKey per pool instance. |
packages/basic-crawler/src/internals/basic-crawler.ts |
Allows passing an existing SessionPool and tracks ownership to avoid tearing down shared pools. |
test/core/session_pool/session_pool.test.ts |
Adjusts persistence tests for the new persist key behavior and adds isolation coverage. |
test/core/crawlers/basic_crawler.test.ts |
Adds tests for accepting an injected SessionPool, lifecycle behavior, and sharing across crawlers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Fix "it's" → "its" typo in JSDoc, validate `sessionPool` option as `instanceof SessionPool`, add missing teardown in test, and assert actual session reuse in the shared pool test.
| throw new Error( | ||
| 'Cannot use both `sessionPool` and `sessionPoolOptions` — pass either a pre-built SessionPool instance or options to create one, not both.', | ||
| ); |
There was a problem hiding this comment.
Can we just abolish sessionPoolOptions?
There was a problem hiding this comment.
Possibly, but then we cannot determine whether the crawler owns the SessionPool instance or not... which means, no automatic teardown() (maybe not as problematic, as mentioned here), but also no automatic .resetStore() on crawler restart (which, maybe that's not as problematic either, who knows?)
There was a problem hiding this comment.
Anyway, I tried dropping sessionPoolOptions and keeping only sessionPool in this commit.
Tbf, I didn't mind the sessionPoolOptions that much. Constructing the SessionPool explicitly each time shifts the responsibility to the user (see problems with resetStore and teardown above), which might feel burdensome.
I'm soft-leaning towards keeping both sessionPoolOptions (for BC and low-effort use-cases) and sessionPool (for SessionPool sharing and poweruser scenarios). Wdyt?
There was a problem hiding this comment.
Okay, you made me check how the Python version does it. Key findings:
- No session pool options, if you want to fine tune it, pass in a configured instance
init can be called multiple times, but the first teardown just... tears it down, which is not ideal 😁- EDIT init can only be called once, subsequent initialization attempts throw an error. There is also an
activeproperty on the session pool (and other components that require async init/teardown, e.g.,Statistics).BasicCrawler.runchecks thisactiveproperty.
I think the Python way is the right direction - it takes care of the init/teardown logic for the user, without providing two ways to customize the session pool.
There was a problem hiding this comment.
Also, can't we require Node.js 24+ so that we can use using and Symbol.dispose?
There was a problem hiding this comment.
it takes care of the init/teardown logic for the user
I'm all for this, but we cannot safely call .teardown() / .resetStore() on a shared pool, potentially used by multiple crawler instances at once (which is what the original issue is asking for).
I made the SessionPool lazy-initialized, i.e., any call to its async methods checks whether it has been initialized - and runs init() if not.
The teardown calls are trickier, though. I'm now tracking whether this.sessionPool is owned by the crawler, or if it has been passed from the outside. .teardown() is then called only if it's owned. Imo this is the best compromise we can have here, but feel free to change my mind :)
Also, can't we require Node.js 24+ so that we can use
usingandSymbol.dispose?
I suppose we can implement Symbol.dispose to call teardown(), providing a more idiomatic interface for disposal, but I'm failing to see how it should help us with the shared pool instance 🤔
There was a problem hiding this comment.
Since we're not using the Symbol.dispose approach anywhere else yet (and the implementation would be just calling this.teardown() anyway), I decided against adding it right now. Imo it would just prompt more questions (why does SessionPool have it and other classes do not?).
That being said, I'd be all for adding support for the using / Symbol.dispose API across the project in some later PR.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR removes the static `SessionPool.open()` method. Until the `new SessionPool()` instance has been interacted with, it doesn't access any storage. Once the user (or another Crawlee component) calls one of the `SessionPool` async methods, the `SessionPool` is automatically initialized. This simplifies usage in e.g. `AdaptivePlaywrightCrawler` (where parts of the context pipeline are called separately before the `.initialize()` call), makes things clearer for users, and does away with potentially confusing temporal coupling (`SessionPool.initialize()` had to be called before any other operation, otherwise SessionPool would throw exceptions). Prerequisite for #3499 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
SessionPool instanceSessionPool instance between crawlers
BasicCrawlerand subclasses now acceptsessionPooloption for passing a pre-initializedSessionPoolinstance. This is mutually exclusive withsessionPoolOptions.SessionPoolinstances now do not share the persistence keys (similar to e.g.Statistics).Closes #3445