
Provide an efficient way to decompress a sequence of chunks compressed with ZstdCompressionChunker #259

@jbosboom

My program wants to compress some large cached strings and decompress them later. I have no particular requirements on the form of the compressed data, so I used ZstdCompressionChunker to do the compression, which avoids repeated reallocation of the output buffer. I would like to process the decompressed data in chunks to reduce peak memory usage. However, there is no obvious, efficient way to decompress chunks to chunks.
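
For concreteness, the compression side looks roughly like this (a sketch only; parts stands in for my real cached strings):

import zstandard as zstd

parts = [b'AB' * 1000, b'CD' * 1000]  # stand-in for my real cached strings
cctx = zstd.ZstdCompressor()
chunker = cctx.chunker(chunk_size=32768)  # ZstdCompressionChunker
chunks = []
for part in parts:
    chunks.extend(chunker.compress(part))
chunks.extend(chunker.finish())

Here is what I tried or considered for decompressing those chunks back into chunks: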

  • The ZstdCompressionChunker round-trip tests all concatenate the chunks with bytes.join for one-shot decompression. (Fine, they're tests.)

  • I tried chain.from_iterable(dctx.read_to_iter(c) for c in chunks). This doesn't work because each read_to_iter iterator expects to process a full stream. (I expected it to hold state in the ZstdDecompressor it was obtained from.)

  • ZstdDecompressionObj's documentation says it isn't efficient (see the sketch just after this list):

    Because calls to decompress() may need to perform multiple memory (re)allocations, this streaming decompression API isn’t as efficient as other APIs.

  • read_to_iter's documentation says

    read_to_iter() accepts an object with a read(size) method that will return compressed bytes or an object conforming to the buffer protocol.

    so I wrote a class with a read method that returns memoryviews over the chunks (to avoid copying slices). The documentation is grammatically ambiguous about whether read(size) may return any buffer-protocol object; it turns out that read_to_iter segfaults (!) when read() returns a buffer-protocol object that is not exactly bytes (reduced test case below).
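
For reference, this is roughly how decompressobj() would be driven chunk-to-chunk (a sketch only; chunks and process() are the stand-ins from above); it does yield output incrementally, but it is the API whose documentation carries the efficiency caveat quoted above:

import zstandard as zstd

dctx = zstd.ZstdDecompressor()
dobj = dctx.decompressobj()
for chunk in chunks:  # the compressed chunks from the compression sketch
    out = dobj.decompress(chunk)  # may be b'' until enough input has arrived
    if out:
        process(out)  # process() stands in for my real consumer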

My feature request is to provide an efficient way to decompress a sequence of chunks compressed with ZstdCompressionChunker (or to document an existing method as the efficient way, if there is one).
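
For what it's worth, the closest existing route I can see is to wrap the chunk sequence in a minimal file-like object and hand it to stream_reader(); a rough, unverified sketch follows (ChunkSource is a name I made up; chunks and process() are the stand-ins from above). I don't know whether this counts as "the efficient way", which is part of why I'm filing this.

import zstandard as zstd

class ChunkSource:
    """Minimal file-like wrapper that hands out the stored compressed chunks."""
    def __init__(self, chunks):
        self._chunks = iter(chunks)
    def read(self, size):
        # Return one whole chunk per call (the reader appears to tolerate reads
        # longer than size, though I have not audited that); b'' signals end of stream.
        return next(self._chunks, b'')
    def close(self):
        pass  # present so the reader can close its source cleanly

dctx = zstd.ZstdDecompressor()
with dctx.stream_reader(ChunkSource(chunks)) as reader:
    while True:
        out = reader.read(65536)  # decompressed data in bounded pieces
        if not out:
            break
        process(out)  # process() stands in for my real consumer

And here is the reduced test case for the read_to_iter segfault: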


import zstandard as zstd
b = b'AB' * 1000
d = zstd.compress(b)
assert zstd.decompress(memoryview(d)) == b # passes
class Whatever:
    def __init__(self, data):
        self.data = data
    def read(self, size):
        assert len(self.data) <= size
        return memoryview(self.data)
dctx = zstd.ZstdDecompressor()
assert b''.join(dctx.read_to_iter(Whatever(d))) == b # segfault

Segfaults using Arch Linux's python 3.13.2-1 and python-zstandard 0.23.0-2.
