download and write by fetch info instead of by term #256

assafvayner · 2025-04-17T21:20:38Z

Fix XET-493.

This change makes the standard parallelized download process operate on the fetch infos within a segment rather than term by term.

Why? Terms that are adjacent or overlapping in a xorb share a single fetch info instance (a presigned URL to the CDN/S3). Downloading term by term can therefore download the same fetch info URL repeatedly. We rarely see this in practice because the cache and single flight guard against it. However, the cache does not help when it is disabled or slow, and with this change we also avoid the latency of a cache hit (a file read into memory). Note that a local cache hit is still better than a CDN download, so we still check the cache first.
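To illustrate the idea, here is a minimal sketch of grouping terms by the fetch info that covers them. It is not the code in this PR: `Term`, `FetchInfo`, `group_terms_by_fetch_info`, and `urls_to_download` are hypothetical, simplified names, and the sketch assumes each term is fully covered by one fetch info.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified term: a chunk range inside one xorb.
#[derive(Debug, Clone, PartialEq)]
struct Term {
    xorb: String,
    chunk_start: u32,
    chunk_end: u32, // exclusive
}

/// Hypothetical, simplified fetch info: one presigned URL covering
/// a contiguous chunk range of a xorb.
#[derive(Debug, Clone, PartialEq)]
struct FetchInfo {
    xorb: String,
    chunk_start: u32,
    chunk_end: u32, // exclusive
    url: String,
}

/// Map each fetch info index to the indices of the terms it covers.
/// Terms not covered by any fetch info are skipped in this sketch.
fn group_terms_by_fetch_info(
    terms: &[Term],
    fetch_infos: &[FetchInfo],
) -> BTreeMap<usize, Vec<usize>> {
    let mut groups: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for (term_idx, term) in terms.iter().enumerate() {
        let covering = fetch_infos.iter().position(|fi| {
            fi.xorb == term.xorb
                && fi.chunk_start <= term.chunk_start
                && term.chunk_end <= fi.chunk_end
        });
        if let Some(fi_idx) = covering {
            groups.entry(fi_idx).or_default().push(term_idx);
        }
    }
    groups
}

/// One download per group: each URL appears once, however many terms use it.
fn urls_to_download<'a>(
    groups: &BTreeMap<usize, Vec<usize>>,
    fetch_infos: &'a [FetchInfo],
) -> Vec<&'a str> {
    groups.keys().map(|&i| fetch_infos[i].url.as_str()).collect()
}
```

With this grouping, a download task is scheduled per fetch info, and the downloaded bytes are then written out for every term in that group, instead of issuing one request per term.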

cas_client/src/remote_client.rs

cas_client/src/download_utils.rs

seanses · 2025-04-23T22:47:41Z

cas_client/src/remote_client.rs

+                        terms,
+                        offset_into_first_range,
+                        segment_size,
+                        total_len,


BUG: this parameter is called "remaining_total_len" but it is never updated, so ranges will be written at incorrect offsets when there are 2+ segments.

It turns out that total_len is what needs to be passed in here; the calculation ensures we do not write past the end of the requested range.

There was an additional bug here, though: each segment began writing at byte 0.

If you derive remaining_total_len correctly, then the start offset for each segment is total_len - remaining_total_len.
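A minimal sketch of that relationship, assuming segments are processed in order and each one writes at most its own size. `segment_write_ranges` is a hypothetical helper for illustration, not a function in this PR:

```rust
/// For each segment, compute (start_offset, bytes_to_write) in the output.
/// `remaining_total_len` shrinks as segments are consumed, so the start
/// offset of a segment is `total_len - remaining_total_len`.
fn segment_write_ranges(total_len: u64, segment_sizes: &[u64]) -> Vec<(u64, u64)> {
    let mut remaining_total_len = total_len;
    let mut ranges = Vec::with_capacity(segment_sizes.len());
    for &segment_size in segment_sizes {
        if remaining_total_len == 0 {
            break; // the requested range is already fully covered
        }
        let start = total_len - remaining_total_len;
        // Never write past the end of the requested range.
        let len = segment_size.min(remaining_total_len);
        ranges.push((start, len));
        remaining_total_len -= len;
    }
    ranges
}
```

For example, with total_len = 10 and three segments of size 4, the segments start at offsets 0, 4, and 8, and the last one writes only 2 bytes. If remaining_total_len is never decremented, every segment computes a start offset of 0, which matches the bug described above.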

cas_client/src/download_utils.rs

seanses

There's a bug: the write offset is not updated correctly. Some memcpy calls can also be avoided.

seanses and others added 18 commits April 14, 2025 11:27

segmented download and refresh fetch info on 403

f00f9a9

switch download and parallel write to segmented download

1e5eb93

refactor scheduler

db844af

fix range header bug and make Range<T> aliases type safe

0c323b6

fix futures queue polling

70ace7c

fix bugs

52c2a10

Merge branch 'main' into di/segmented-download

b210ed1

fix format

970e193

temporarily delete the test due to API change, will rewrite

841a35b

fix linting

b761c00

fix test compilation

4c26096

use JoinSet instead of FuturesUnordered for parallel write case

76e95c3

remove lock and wait in sequential mode

f04d86a

address PR comments, more tests

cdd3446

download_utils mods draft

7926775

compiles?

d52623b

channel to queue

2392d6e

lint

ad78a4e

assafvayner force-pushed the assaf/download-write-by-fetch-info branch from 2f83f2c to ad78a4e April 18, 2025 18:21

assafvayner added 2 commits April 21, 2025 14:28

Merge branch 'main' of github.com:huggingface/xet-core into assaf/dow…

26a7fa6

…nload-write-by-fetch-info

fix based on tests

b132eee

assafvayner requested review from seanses and jgodlew April 21, 2025 22:40

assafvayner marked this pull request as ready for review April 21, 2025 22:40

assafvayner added 6 commits April 21, 2025 17:19

edits

ef2342c

fix lint

5498704

use unauthed client

84249c1

fmt

6319502

Update remote_client.rs (#261)

d83be50

fmt

798608b