Skip to content

Conversation

shasheene
Copy link

Hi again Roberto,

I was looking to integrate gztool with an external application as a library, so this PR splits the project up enabling this.

It builds on the non-controversial initial cleanups (#22) which I'm sure will be merged without issue.

This PR exposes many key structures and functions in the public header and usable by other applications. But some key functions (notably the build and extract functions) are currently only exposed in the private (non-stabilized) API header file, because those two functions still have the pretty tight integration to action enums which I want to consider private/internal.

It would take a little bit more refactoring to delineate out the two keyfunctions related to building the index and extracting the bytes by factoring out the action enums that are used in the main file.

I was excited about doing this, and I think the direction of making gztool more usable as a library is still useful, but I'm currently less interested in pursuing it. For my primary goal of these changes, I ended up switching from gztool to the "zran.c". I was looking to use gztool as it's had more development than zran and eg, comes with serialization functions with endianness handling all ready to go. But unfortunately much of gztool's excellent advanced functionality (handling growing gzip files, line-oriented indexing etc) was causing significant extra complexity and getting in the way of my primary goal, and the advanced functionality isn't immediately useful my specific use case.

I'm happy to close this PR if it's not the direction you're planning on taking the project. I suspect others coming across the project would be interested in a libgztool along the lines of this PR.

Thoughts?

Enable all warnings, and treat them like errors.
Remove the obsolete macro (without autoupdate), and add autoreconf arguments
for missing auxillary files, because on Debian 12 Bookworm, the autoreconf instruction
fails with:

$ autoreconf
configure.ac:4: warning: The macro `AC_PROG_CC_C99' is obsolete.
configure.ac:4: You should run autoupdate.
./lib/autoconf/c.m4:1659: AC_PROG_CC_C99 is expanded from...
configure.ac:4: the top level
configure.ac:3: error: required file './compile' not found
configure.ac:3:   'automake --add-missing' can install 'compile'
configure.ac:2: error: required file './install-sh' not found
configure.ac:2:   'automake --add-missing' can install 'install-sh'
configure.ac:2: error: required file './missing' not found
configure.ac:2:   'automake --add-missing' can install 'missing'
Makefile.am: error: required file './depcomp' not found
Makefile.am:   'automake --add-missing' can install 'depcomp'
parallel-tests: error: required file './test-driver' not found
parallel-tests:   'automake --add-missing' can install 'test-driver'
autoreconf: error: automake failed with exit status: 1
Fix a typo in the names of the endianness functions ahead of a refactor.
Fixes the inconsistent use of camelCasing with function "giveMeSIUnits", and
its similarly misnamed  #define constant.
Very minor consistency fix up ahead of the files being split/refactor.
Fix another inconsistent use of camelCase, to be with the rest of the codebase
ahead of a refactor.
Removes ambiguity in name, as it was not clear whether was when using the
library in some contexts it was not clear whether it was index file size or the
uncompressed file size.
@shasheene
Copy link
Author

Oh interesting, I didn't realize to retarget the base in a PR (to the non-controversial-tidyups branch) it has to be branch in the same repository not a forked repository.

Well, only the final commit "Split "gztool.c" for usage as library" is what matters here. The rest are part of #22

@circulosmeos
Copy link
Owner

Hi @shasheene !,

I do really appreciate your PRs: I thought about making the code compatible with use as a library, but didn't advanced that way both because I lacked a use case and also lacked time. Sorry to hear this code is too complex for the project, but also totally understandable.

If you could advance in the refactoring to expose those two keyfunctions it would be great, but if not your work is great and I hope to find time to take a deeper look and integrate it in my repository. Unfortunately time is scarce lately, so no promises of dates, but again, great work.

@shasheene
Copy link
Author

shasheene commented Aug 8, 2025

It's really the action_create_index() I don't want to expose in the public API gztools.h:

int action_create_index(
    char *file_name, struct access **index,
    char *index_filename, enum INDEX_AND_EXTRACTION_OPTIONS indx_n_extraction_opts,
    uint64_t offset, uint64_t line_number_offset, uint64_t span_between_points, int write_index_to_disk,
    int end_on_first_proper_gzip_eof, int always_create_a_complete_index,
    int waiting_time, int force_action, int wait_for_file_creation,
    int extend_index_with_lines, uint64_t expected_first_byte, int gzip_stream_may_be_damaged,
    bool lazy_gzip_stream_patching_at_eof,
    uint64_t range_number_of_bytes, uint64_t range_number_of_lines,
    bool adjust_index_points_to_byte_boundary,
    int compression_factor, bool continue_tailing_on_eof );

I consider its function signature very big, and it would additionally pull the entire INDEX_AND_EXTRACTION_OPTIONS enum into the gztools.h public header file:

enum INDEX_AND_EXTRACTION_OPTIONS {
    JUST_CREATE_INDEX = 1, SUPERVISE_DO,
    SUPERVISE_DO_AND_EXTRACT_FROM_TAIL, EXTRACT_FROM_BYTE,
    EXTRACT_TAIL, COMPRESS_AND_CREATE_INDEX, DECOMPRESS,
    EXTRACT_FROM_LINE };

As of the current state of this PR they both are in gztools_internal.h, which as a private API can be considered by other developers to rapidly change within versions with no stability guarantees.

Part of what I dislike about the current interface is it forms a (for lack of a better term) modal interface, where an enum greatly modifies the internal behavior of the function. Which is fine when everything is main.c, but harder to provide function guarantees as an API that are easy to reason about.

Whatever is placed in gztools.h should be expected to be somewhat stable and any breaking changes should be accompanied with a semantic version style major version bump.

I see three approaches:

  1. Not worry about a major internal refactoring and just expose the enum and functions as-is. (Which I think is fine, and almost no work). And if there is a major refactor of the API, there can always be a major version bump (semantic version style) to indicate this to other developers

  2. Don't make any changes beyond the current state of the PR, and require developers to knowingly link against the _internal private API, to self-document that some future work may expose a stable public API. (this is the approach I was taking)

  3. Majorily refactor the action_create_index() to reduce the exposure of a "modal" function interface. (the optimal approach, but by far the most work)

At this stage, I'm definitely not inclined to conduct the significant refactor in number 3.

I can definitely assist with updating the much simpler option 1. Exposing the INDEX_AND_EXTRACTION_OPTIONS enum and its modal complexity to other developers is not ideal, but it's not the worst thing and it can be improved in the future.

If you concur that option number 1 is sufficient (ie, minimal changes to get a complete public API exposed), I'm happy to update this PR.

@jnorthrup
Copy link

the line code is a significant inroad to reducing context and page overheads, it is matched to JSONL and CSV pretty good. i use zran for early stage projects and gztool for terabyte line corpus with thunking range requests.

@jnorthrup
Copy link

I submitted jdk-compatible patch to zran upstream to avoid 7 of the 8 blocks that land on odd bits. I think he refactored something else in zran and closed PR but if you were looking for ancient zlib compat gztool has(d) a working downgrade to zlib chunk boundaries

@circulosmeos
Copy link
Owner

If you concur that option number 1 is sufficient (ie, minimal changes to get a complete public API exposed), I'm happy to update this PR.

Yeah, of course, that would be great! 👍🏻

@circulosmeos
Copy link
Owner

circulosmeos commented Aug 8, 2025

I consider its function signature very big

Yes, in fact it is... :_(
I've been thinking about creating some struct to make it lighter, but still it doesn't seem to me the perfect solution.

@shasheene
Copy link
Author

shasheene commented Aug 8, 2025

the line code is a significant inroad to reducing context and page overheads, it is matched to JSONL and CSV pretty good. i use zran for early stage projects and gztool for terabyte line corpus with thunking range requests.

Oh yes absolutely, I'm not knocking the line-based indexing feature of gztool at all. It's a great feature introduced by gztool that is incredibly valuable for those who use the tool on line-orientated gzipped files such as very large log files. This PR of course intends on continuing to supporting it. The concern was about how to expose most cleanly expose the feature with the least effort since time is so scarce, and some pain I had when adapting some of the code implementing it.

I submitted jdk-compatible patch to zran upstream to avoid 7 of the 8 blocks that land on odd bits. I think he refactored something else in zran and closed PR but if you were looking for ancient zlib compat gztool has(d) a working downgrade to zlib chunk boundaries

Thanks for the information @jnorthrup, I have found your PR here madler/zlib#801 and took a quick look to better understand your comment.

It's getting a bit off-topic, but the standard span interval used by default by both gztool and zran appears sufficient for right now. I have yet tried with the JDK or noticed any issues with the zlib chunk boundaries claimed in that PR and the push back in that PR from the zlib coauthor about the JDK zlib being incomplete or old is concerning to me. I haven't yet delved very deeply into the INFLATEPRIME polyfill defines and the JDK implementation to validate.

I will mention the basic zran.c library refactor (that's similar to this PR) was made in a zlib branch (that I made to achieve my primary goal) is based on the latest zran.c v1.6 from 2024-08-02, so it does includes those polyfills that Mark Adler suggested will fix the issues with the JDK.

While we're off topic, I'll also mention gztool.c is based on zran.c v1.1 from 2012-09-29 and there's a few changes since then:

/* Version History:
 1.0  29 May 2005  First version
 1.1  29 Sep 2012  Fix memory reallocation error
 1.2  14 Oct 2018  Handle gzip streams with multiple members
                   Add a header file to facilitate usage in applications
 1.3  18 Feb 2023  Permit raw deflate streams as well as zlib and gzip
                   Permit crossing gzip member boundaries when extracting
                   Support a size_t size when extracting (was an int)
                   Do a binary search over the index for an access point
                   Expose the access point type to enable save and load
 1.4  13 Apr 2023  Add a NOPRIME define to not use inflatePrime()
 1.5   4 Feb 2024  Set returned index to NULL on an index build error
                   Stop decoding once request is satisfied
                   Provide a reusable inflate engine in the index
                   Allocate the dictionaries to reduce memory usage
 1.6   2 Aug 2024  Remove unneeded dependency on limits.h

It's perhaps worth upgrading gztool to some of these latest "upstream" features, bugfixes and speedups.

But certainly out-of-scope for this PR.

@circulosmeos
Copy link
Owner

I cannot remember the reason why I didn't use version 1.2 back then, but probably was because I implemented that feature "Handle gzip streams with multiple members" myself, and needed to also implement it in a way compatible with bgzip blocks and other formats which gztool is able to treat.

Later zran versions are probably worth importing but the problem is that zran was the initial template, but now the code is not anymore compatible with it (as far as I remember), so the porting may require some excruciating examination with a required time that I cannot invest right now.

Splits "gztool.c" into components for use in external libraries.

1. gztool.h: Public API including structs and definitions that end-users of the
library would need to us. Modifications to API cause major version number bump.

2. gztool.c: Implementation of the public functions that IDEALLY users of
'gztool.h' would not need to read

3. gztool_internal.h:  Internal API, which can change between versions

4. gztool_internal.c: Implementation of the internal API

5. main.c: Existing CLI application

Unfortunately some key functions (relating to creating index, decompressing and
compressing) to the public API are far from ideal to be exposed in a public API
(especially around the modal enum that modifies function behavior), but this
minor change was concurred as acceptable by the author of gztool [1].

[1] circulosmeos#23 (comment)
@shasheene shasheene force-pushed the split-up-for-use-as-library branch from d2bdb4d to fc3ee53 Compare August 16, 2025 20:12
@shasheene
Copy link
Author

Ok @circulosmeos I have exposed the other key functions as public as-is without any refactoring to clean up the API.

As suggested earleir I recommend the previous PR (#22) be approved and merged, since the cleanups in its 7 trivial commits in it are completely non-controversial and can be merged without issue (and it makes this PR appear much tidier when the base commits are merged).

@circulosmeos
Copy link
Owner

Ok, thanks @shasheene for your hard work!

Hope I can scrub some time to integrate your changes. I'll probably make a number version upgrade too... 👍🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants