Skip to content

download and write by fetch info instead of by term #256

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 32 commits into
base: main
Choose a base branch
from

Conversation

assafvayner
Copy link
Contributor

Fix XET-493.

This change changes the standard parallelized download process to run with respect to fetch info's within a segment rather than on a term by term basis.

Why? terms which are adjacent or overlapping in a xorb utilize a single fetch info instance or a presigned url to CDN/S3. Downloading term by term can cause repeat downloads of the same fetch info url. We do not often see this happen because the cache & single flight are used to combat this happening. However the cache does not help if it is disabled/slow, and in general we can get rid of the latency even for a cache hit (file read into memory) with this change. Notably local cache hit is still better than a CDN download so we still check the cache first.

@assafvayner assafvayner force-pushed the assaf/download-write-by-fetch-info branch from 2f83f2c to ad78a4e Compare April 18, 2025 18:21
@assafvayner assafvayner requested review from seanses and jgodlew April 21, 2025 22:40
@assafvayner assafvayner marked this pull request as ready for review April 21, 2025 22:40
terms,
offset_into_first_range,
segment_size,
total_len,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: this parameter is called "remaining_total_len" but this never gets updated, i.e. ranges will be written at incorrect offset for 2+ segments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this turns out to need to be passed in total_len in this case, the calculation ensures not writing past the end of the requested range.

There was an additional bug here though that each segment began writing at byte 0.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you derive the remaining_total_len correctly, then the start offset for each segment is total_len - remaining_total_len

Copy link
Collaborator

@seanses seanses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bug that write offset is not updated correctly. Some memcpy can also be avoided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants