Skip to content

refactor: Introduce new Apify storage client #470

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 38 commits into
base: master
Choose a base branch
from

Conversation

vdusek
Copy link
Contributor

@vdusek vdusek commented May 10, 2025

Description

Issues

Testing

  • The current test set covers the changes.

@vdusek vdusek self-assigned this May 10, 2025
@github-actions github-actions bot added this to the 114th sprint - Tooling team milestone May 10, 2025
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label May 10, 2025
@vdusek vdusek changed the title New apify storage clients refactor: Introduce new Apify storage client May 10, 2025
@vdusek vdusek force-pushed the new-apify-storage-clients branch from d27c080 to 82efd3e Compare June 12, 2025 12:44
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Jun 18, 2025
@vdusek vdusek force-pushed the new-apify-storage-clients branch 2 times, most recently from 067b793 to 104a168 Compare June 23, 2025 09:12
@vdusek vdusek marked this pull request as ready for review June 26, 2025 13:04
@vdusek vdusek requested a review from Pijukatel June 26, 2025 13:05
@janbuchar janbuchar self-requested a review June 26, 2025 13:27
@@ -11,14 +11,14 @@ async def main() -> None:
await dataset.export_to(
content_type='csv',
key='data.csv',
to_key_value_store_name='my-cool-key-value-store',
to_kvs_name='my-cool-key-value-store',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this BC break worth it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's evaluate all the potential BCs at the end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I thought we are nearing that now 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're just re-exporting the storages from crawlee here, there will be many more cases than this one. I'm not saying we have to rename this particular argument (and I will undo it if you insist—just I don't like those long identifiers, especially when we can use the KVS abbreviation).

vdusek added a commit to apify/crawlee-python that referenced this pull request Jul 1, 2025
## Description

- I consolidated all commits from
#1107 into this new PR.
- The previous storage-clients implementation was completely replaced
with a redesigned clients, including:
    - new storage-client interface,
    - in-memory storage client,
    - file-system storage client,
- Apify storage client (implemented in the SDK; see
apify/apify-sdk-python#470),
    - and various adjustments elsewhere in the codebase.
- The old "memory plus persist" client has been split into separate
memory and file-system implementations.
- The `Configuration.persist_storage` and
`Configuration.persist_metadata` options were removed.
- All old collection clients have been removed, they're no longer
needed.
- Each storage client now prints warnings if you pass method arguments
it does not support.
- The creation management modules in the storage clients and storages
were removed.
- Storage client parameters (e.g. `purge_on_start`, or `token` and
`base_api_url` for the Apify client) are configured via the
`Configuration`.
- Every storage, and its corresponding client, now provides both a
`purge` method (which clears all items but preserves the storage and
metadata) and a `drop` method (which removes the entire storage,
metadata included).
- All unused types, models, and helper utilities have been removed.
- The detailed, per-storage/client changes are listed below...

### Dataset

- Properties:
    - `id`
    - `name`
    - `metadata`
- Methods:
    - `open`
    - `purge` (new method)
    - `drop`
    - `push_data`
    - `get_data`
    - `iterate_items`
    - `list_items` (new method)
    - `export_to`
- Breaking changes:
- `from_storage_object` method has been removed - Use the `open` method
with `name` or `id` instead.
    - `get_info` -> `metadata` property
    - `storage_object` -> `metadata` property
- `set_metadata` method has been removed (it wasn't propage to clients)
        - Do we want to support it (e.g. for renaming)?
- `write_to_json` -> method has been removed, use `export_to` instead
    - `write_to_csv` -> method has been removed, use `export_to` instead

```python
import asyncio

from crawlee.storage_clients import FileSystemStorageClient
from crawlee.storages import Dataset


async def main() -> None:
    dataset = await Dataset.open(storage_client=FileSystemStorageClient())
    print(f'default dataset - ID: {dataset.id}, name: {dataset.name}')

    await dataset.push_data({'name': 'John'})
    await dataset.push_data({'name': 'John', 'age': 20})
    await dataset.push_data({})

    dataset_with_name = await Dataset.open(
        name='my_dataset',
        storage_client=FileSystemStorageClient(),
    )
    print(f'named dataset - ID: {dataset_with_name.id}, name: {dataset_with_name.name}')

    await dataset_with_name.push_data([{'age': 30}, {'age': 25}])

    print('Default dataset items:')
    async for item in dataset.iterate_items(skip_empty=True):
        print(item)

    print('Named dataset items:')
    async for item in dataset_with_name.iterate_items():
        print(item)

    items = await dataset.get_data()
    print(items)

    dataset_by_id = await Dataset.open(id=dataset_with_name.id)
    print(f'dataset by ID - ID: {dataset_by_id.id}, name: {dataset_by_id.name}')


if __name__ == '__main__':
    asyncio.run(main())
```

### Key-value store

- Properties:
    - `id`
    - `name`
    - `metadata`
- Methods:
    - `open`
    - `purge` (new method)
    - `drop`
    - `get_value`
    - `set_value`
- `delete_value` (new method, Apify platform's set_value support setting
an empty value to a key, so having a separate method for deleting is
useful)
    - `iterate_keys`
    - `list_keys` (new method)
    - `get_public_url`
    - `get_auto_saved_value`
    - `persist_autosaved_values`
- Breaking changes:
- `from_storage_object` method has been removed - Use the `open` method
with `name` or `id` instead.
    - `get_info` -> `metadata` property
    - `storage_object` -> `metadata` property
- `set_metadata` method has been removed (it wasn't propage to clients)
        - Do we want to support it (e.g. for renaming)?

```python
import asyncio

import requests

from crawlee.storage_clients import FileSystemStorageClient
from crawlee.storages import KeyValueStore


async def main() -> None:
    print('Opening key-value store "my_kvs"...')
    storage_client = FileSystemStorageClient()
    kvs = await KeyValueStore.open(name='my_kvs', storage_client=storage_client)

    print('Setting value to "file.json"...')
    await kvs.set_value('file.json', {'key': 'value'})

    print('Setting value to "file.jpg"...')
    response = requests.get('https://avatars.githubusercontent.com/u/25082181')
    await kvs.set_value('file.jpg', response.content)

    print('Iterating over keys:')
    async for key in kvs.iterate_keys():
        print(f'Key: {key}')

    print('Listing keys:')
    keys = [key.key for key in await kvs.list_keys()]
    print(f'Keys: {keys}')

    for key in keys:
        print(f'Getting value of {key}...')
        value = await kvs.get_value(key)
        print(f'Value: {str(value)[:100]}')

    print('Deleting value of "file.json"...')
    await kvs.delete_value('file.json')

    kvs_default = await KeyValueStore.open(storage_client=storage_client)

    special_key = 'key with spaces/and/slashes!@#$%^&*()'
    test_value = 'Special key value'

    await kvs_default.set_value(key=special_key, value=test_value)

    record = await kvs_default.get_value(key=special_key)
    assert record is not None
    assert record == test_value

    result = await kvs_default.list_keys()
    print(f'kvs_default list keys = {result}')

    kvs_2 = await KeyValueStore.open()
    result = await kvs_2.list_keys()
    print(f'kvs_2 list keys = {result}')


if __name__ == '__main__':
    asyncio.run(main())
```

### Request queue

- Properties:
    - `id`
    - `name`
    - `metadata`
- Methods:
    - `open`
    - `purge` (new method)
    - `drop`
    - `add_request`
    - `add_requests_batched` -> `add_requests`
    - `fetch_next_request`
    - `get_request`
    - `mark_request_as_handled`
    - `reclaim_request`
    - `is_empty`
    - `is_finished`
- Breaking changes:
- `from_storage_object` method has been removed - Use the `open` method
with `name` or `id` instead.
    - `get_info` -> `metadata` property
    - `storage_object` -> `metadata` property
- `set_metadata` method has been removed (it wasn't propage to clients)
        - Do we want to support it (e.g. for renaming)?
- `get_handled_count` method had been removed - Use
`metadata.handled_request_count` instead.
- `get_total_count` method has been removed - Use
`metadata.total_request_count` instead.
- `resource_directory` from the `RequestQueueMetadata` was removed, use
`path_to...` property instead.
- `RequestQueueHead` model has been removed - Use
`RequestQueueHeadWithLocks` instead.
- Notes:
- New RQ `add_requests` contain `forefront` arg (Apify API supports it)

```python
import asyncio

from crawlee import Request
from crawlee.configuration import Configuration
from crawlee.storage_clients import FileSystemStorageClient
from crawlee.storages import RequestQueue


async def main() -> None:
    rq = await RequestQueue.open(
        name='my-queue',
        storage_client=FileSystemStorageClient(),
        configuration=Configuration(purge_on_start=True),
    )

    print(f'RequestQueue: {rq}')
    print(f'RequestQueue client: {rq._client}')

    await rq.add_requests(
        requests=[
            Request.from_url('https://example.com', use_extended_unique_key=True),
            Request.from_url('https://crawlee.dev', use_extended_unique_key=True),
            Request.from_url('https://apify.com', use_extended_unique_key=True),
        ],
    )

    print('Requests were added to the queue')

    is_empty = await rq.is_empty()
    is_finished = await rq.is_finished()

    print(f'Is empty: {is_empty}')
    print(f'Is finished: {is_finished}')

    request = await rq.fetch_next_request()
    print(f'Fetched request: {request}')

    await rq.add_request('https://facebook.com', forefront=True)

    request = await rq.fetch_next_request()
    print(f'Fetched request: {request}')

    rq_default = await RequestQueue.open(
        storage_client=FileSystemStorageClient(),
        configuration=Configuration(purge_on_start=True),
    )

    await rq_default.add_request('https://example.com/1')
    await rq_default.add_requests(
        [
            'https://example.com/priority-1',
            'https://example.com/priority-2',
            'https://example.com/priority-3',
        ]
    )
    await rq_default.add_request('https://example.com/2')


if __name__ == '__main__':
    asyncio.run(main())
```

### BaseDatasetClient

- Properties:
    - `metadata`
- Methods:
    - `open`
    - `purge`
    - `drop`
    - `push_data`
    - `get_data`
    - `iterate_items`

### BaseKeyValueStoreClient

- Properties:
    - `metadata`
- Methods:
    - `open`
    - `purge`
    - `drop`
    - `get_value`
    - `set_value`
    - `delete_value`
    - `iterate_keys`
    - `get_public_url`

### BaseRequestQueueClient

- Properties:
    - `metadata`
- Methods:
    - `open`
    - `purge`
    - `drop`
- `add_requests_batch` -> `add_batch_of_requests` (one backend method
for 2 frontend methods)
    - `get_request`
    - `fetch_next_request`
    - `mark_request_as_handled`
    - `reclaim_request`
    - `is_empty`
- Models
    - `RequestQueueHeadWithLocks` -> `RequestQueueHead`
    - `BatchRequestsOperationResponse` -> `AddRequestsResponse`
- Notes:
- Old file system (memory) version didn't persist the in-progress
requests
- Old file system (memory) version didn't persist the forefront values
(now there is a FS-specific `_sequence` field in the FS Request)
- The methods manipulating locks and listing heads are now only internal
methods of Apify RQ client.

## Issues

- Closes: #92
- Closes: #147
- Closes: #783
- Closes: #1247
- Relates: #1175
- Relates: #1191

## Testing

- The original tests were mostly removed and replaced with a new ones.
- Each storage-client implementation now has its own dedicated tests at
the client level (more targeted/edge-case coverage).
- On top of that, there are storage-level tests that use a parametrized
fixture for each storage client (`file-system` and `memory`), ensuring
every storage test runs against every client implementation.

## Checklist

- [x] CI passed
@vdusek vdusek requested a review from Pijukatel July 2, 2025 10:54
@vdusek
Copy link
Contributor Author

vdusek commented Jul 2, 2025

Probably let's wait for @janbuchar, and also I'll add an upgrading guide for v2.

Copy link
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't made it through the request queue client, but here's a batch of comments.

"""

@override
async def purge(self) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this override mean that Crawlee does not treat INPUT.json differently from other files anymore and this logic is only added in the SDK? I'm a fan, but I wonder if the rest of the team is on board with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly - because Crawlee does not use INPUT.json for anything. We are in the same boat with @Pijukatel here.

@B4nan, would you be OK with this change? TLDR; so you don't have to study the code: As Honza wrote, Crawlee now treats INPUT.json just like any other file. Here in SDK, I've added a minimal override of FileSystemStorageClient that only customizes the purge method, preventing it from deleting INPUT.json. That's it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we talking about the input.json in the project root? since that is implemented on crawlee level (in JS) and i think it should stay this way. i am not talking about the purging logic, in JS, if there is an INPUT.json file in the root, KVS.getInput() will prefer it over anything in the storage folder (the local/memory implementation, with apify client it is ignored completely). not sure if we even do this in the python version (and I am open to discuss if we want to remove this from v4, but it feels like a good thing to me)

if this is just about purging, i am not entirely sure i understand what it means, we dont want to autopurge inputs on crawlee level, crawlee shouldnt behave differently because you init the SDK (not locally).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the "Crawlee is a web scraping framework" point of view, handling input feels a bit redundant here - if you intend to only run the crawler outside of Apify, you'll probably want to accept input some other way - CLI arguments, environment variables, some domain-speciffic config file... Providing a helper for loading a JSON file from two possible locations doesn't add much value IMO.

Asking Apifiers about this doesn't make much sense in this situation, because basically everything we write is intended to run on Apify 🙂

@Pijukatel Pijukatel self-requested a review July 17, 2025 13:23
Copy link
Contributor

@Pijukatel Pijukatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing the change on platform I used modified benchmark tests. They are specific, because they scrape locally hosted site and so the speed of scraping is unusually fast - no network related delays. This exposes RequestQueue race condition that can happen when checking if the RQ is empty or not.
When running the test using this branch, I saw that the crawler premature finished even though RQ was still quite full. The hypothesis about premature exit was confirmed when running same test on branch modified with extra log and sleep.

Please see following tiny commit based on this branch and attached log:
5b25728

image

From this run: https://console.apify.com/actors/UdJ0NQp2j8140G9db/runs/4yfZqLvLo1xhrk2Cb#log

@vdusek vdusek force-pushed the new-apify-storage-clients branch from 2f39b35 to ca72313 Compare July 20, 2025 18:49
# Prepare requests for API by converting to dictionaries
requests_dict = [request.model_dump(by_alias=True) for request in requests]

# Remove 'id' fields from requests as the API doesn't accept them
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be achieved during model dump by request.model_dump(by_alias=True, exclude={"id"})

@Pijukatel Pijukatel self-requested a review July 22, 2025 13:45
self._api_client = api_client
"""The Apify request queue client for API operations."""

self._api_public_base_url = api_public_base_url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere if the client is already created?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce new Apify storage client
4 participants