download and write by fetch info instead of by term #256
Conversation
Force-pushed the …nload-write-by-fetch-info branch from 2f83f2c to ad78a4e
Code under review:

    terms,
    offset_into_first_range,
    segment_size,
    total_len,
BUG: this parameter is called "remaining_total_len", but it never gets updated, i.e. ranges will be written at incorrect offsets for 2+ segments.
It turns out total_len is what needs to be passed in here; the calculation ensures we do not write past the end of the requested range.
There was an additional bug here, though: each segment began writing at byte 0.
If you derive remaining_total_len correctly, then the start offset for each segment is total_len - remaining_total_len.
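A minimal sketch of the fix being described, assuming hypothetical names (write_segments and its signature are illustrative, not the actual helpers in this PR): carry remaining_total_len across segments, derive each segment's start offset as total_len - remaining_total_len, and clamp the write so it cannot run past the requested range.

    // Sketch only: names and signature are illustrative, not this PR's code.
    // Invariant from the review: a segment's start offset
    //   = total_len - remaining_total_len.
    fn write_segments(segments: &[Vec<u8>], total_len: u64, out: &mut [u8]) {
        let mut remaining_total_len = total_len;
        for seg in segments {
            // Where this segment starts in the output buffer.
            let offset = (total_len - remaining_total_len) as usize;
            // Clamp so we never write past the end of the requested range.
            let n = seg.len().min(remaining_total_len as usize);
            out[offset..offset + n].copy_from_slice(&seg[..n]);
            // The reported bug: skipping this update leaves every segment at offset 0.
            remaining_total_len -= n as u64;
        }
    }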
There's a bug where the write offset is not updated correctly. Some memcpy can also be avoided.
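One way the extra memcpy could be avoided, sketched under the assumption that the target is a regular file on Unix (write_all_at is std's FileExt positional write, not necessarily what this crate uses): write each downloaded range directly at its computed file offset instead of staging it through an intermediate buffer.

    use std::fs::File;
    use std::io;
    use std::os::unix::fs::FileExt; // provides write_all_at on Unix

    // Sketch: write a downloaded chunk straight to its final file offset,
    // skipping the copy into a contiguous staging buffer.
    fn write_chunk_at(file: &File, chunk: &[u8], file_offset: u64) -> io::Result<()> {
        file.write_all_at(chunk, file_offset)
    }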
Fix XET-493.
This change switches the standard parallelized download process to operate on fetch infos within a segment rather than on a term-by-term basis.
Why? Terms that are adjacent or overlapping in a xorb share a single fetch info instance, i.e. a presigned URL to CDN/S3. Downloading term by term can therefore fetch the same fetch info URL repeatedly. We do not often see this happen because the cache and single-flight deduplication combat it, but the cache does not help if it is disabled or slow, and with this change we also remove the latency of a cache hit (a file read into memory). Notably, a local cache hit is still faster than a CDN download, so we still check the cache first.
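A rough sketch of the grouping idea described above; the Term struct, its fields, and group_by_fetch_info are hypothetical stand-ins, not this repo's API. Terms of a segment are bucketed by the fetch-info URL they resolve to, so one ranged download per URL can serve every term in the bucket.

    use std::collections::HashMap;

    // Hypothetical stand-in for the real term/fetch-info types in this crate.
    struct Term {
        url: String, // presigned CDN/S3 URL from the term's fetch info
        start: u64,  // byte range of this term within the xorb
        end: u64,
    }

    // Group terms by fetch-info URL so adjacent or overlapping terms in a
    // xorb trigger a single download instead of one download per term.
    fn group_by_fetch_info(terms: Vec<Term>) -> HashMap<String, Vec<Term>> {
        let mut groups: HashMap<String, Vec<Term>> = HashMap::new();
        for t in terms {
            groups.entry(t.url.clone()).or_default().push(t);
        }
        groups
    }

Each bucket can then be fetched with one request covering the minimum start to the maximum end of its terms, with the local cache still checked first as noted above.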