Description
My program wants to compress some large cached strings and decompress them later. I have no particular requirements on the form of the compressed data, so I used ZstdCompressionChunker to do the compression to avoid repeated reallocation of the output buffer. I would like to process the decompressed data in chunks to reduce peak memory usage. However there is no obvious efficient way to decompress chunks to chunks:
- The `ZstdCompressionChunker` round-trip tests all concatenate the chunks with `bytes.join` for one-shot decompression. (Fine, they're tests.)
- I tried `chain.from_iterable(dctx.read_to_iter(c) for c in chunks)`. This doesn't work because each `read_to_iter` iterator expects to process a full stream. (I expected it to hold state in the `ZstdDecompressor` it was obtained from.)
- `ZstdDecompressionObj`'s documentation says it isn't efficient: "Because calls to `decompress()` may need to perform multiple memory (re)allocations, this streaming decompression API isn't as efficient as other APIs."
- `read_to_iter`'s documentation says "`read_to_iter()` accepts an object with a `read(size)` method that will return compressed bytes or an object conforming to the buffer protocol", so I wrote a class with a `read` method that returns memoryviews over the chunks (to avoid copying slices). The documentation is grammatically ambiguous; it turns out that `read_to_iter` segfaults (!) when given an object whose `read` method returns a buffer-protocol object that is not exactly `bytes` (reduced test case below).
My feature request is to provide an efficient way to decompress a sequence of chunks compressed with `ZstdCompressionChunker` (or to document an existing method as the efficient way, if there is one).
```python
import zstandard as zstd

b = b'AB' * 1000
d = zstd.compress(b)
assert zstd.decompress(memoryview(d)) == b  # passes

class Whatever:
    def __init__(self, data):
        self.data = data

    def read(self, size):
        assert len(self.data) <= size
        return memoryview(self.data)  # buffer-protocol object, not bytes

dctx = zstd.ZstdDecompressor()
assert b''.join(dctx.read_to_iter(Whatever(d))) == b  # segfault
```
Segfaults using Arch Linux's python 3.13.2-1 and python-zstandard 0.23.0-2.