-
Notifications
You must be signed in to change notification settings - Fork 14
Splitting "gztool.c" to make it more useful as a library ("libgztool") #23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Splitting "gztool.c" to make it more useful as a library ("libgztool") #23
Conversation
Enable all warnings, and treat them like errors.
Remove the obsolete macro (without autoupdate), and add autoreconf arguments for missing auxillary files, because on Debian 12 Bookworm, the autoreconf instruction fails with: $ autoreconf configure.ac:4: warning: The macro `AC_PROG_CC_C99' is obsolete. configure.ac:4: You should run autoupdate. ./lib/autoconf/c.m4:1659: AC_PROG_CC_C99 is expanded from... configure.ac:4: the top level configure.ac:3: error: required file './compile' not found configure.ac:3: 'automake --add-missing' can install 'compile' configure.ac:2: error: required file './install-sh' not found configure.ac:2: 'automake --add-missing' can install 'install-sh' configure.ac:2: error: required file './missing' not found configure.ac:2: 'automake --add-missing' can install 'missing' Makefile.am: error: required file './depcomp' not found Makefile.am: 'automake --add-missing' can install 'depcomp' parallel-tests: error: required file './test-driver' not found parallel-tests: 'automake --add-missing' can install 'test-driver' autoreconf: error: automake failed with exit status: 1
Fix a typo in the names of the endianness functions ahead of a refactor.
Fixes the inconsistent use of camelCasing with function "giveMeSIUnits", and its similarly misnamed #define constant.
Very minor consistency fix up ahead of the files being split/refactor.
Fix another inconsistent use of camelCase, to be with the rest of the codebase ahead of a refactor.
Removes ambiguity in name, as it was not clear whether was when using the library in some contexts it was not clear whether it was index file size or the uncompressed file size.
Oh interesting, I didn't realize to retarget the base in a PR (to the Well, only the final commit "Split "gztool.c" for usage as library" is what matters here. The rest are part of #22 |
Hi @shasheene !, I do really appreciate your PRs: I thought about making the code compatible with use as a library, but didn't advanced that way both because I lacked a use case and also lacked time. Sorry to hear this code is too complex for the project, but also totally understandable. If you could advance in the refactoring to expose those two keyfunctions it would be great, but if not your work is great and I hope to find time to take a deeper look and integrate it in my repository. Unfortunately time is scarce lately, so no promises of dates, but again, great work. |
It's really the
I consider its function signature very big, and it would additionally pull the entire
As of the current state of this PR they both are in Part of what I dislike about the current interface is it forms a (for lack of a better term) modal interface, where an enum greatly modifies the internal behavior of the function. Which is fine when everything is Whatever is placed in I see three approaches:
At this stage, I'm definitely not inclined to conduct the significant refactor in number 3. I can definitely assist with updating the much simpler option 1. Exposing the If you concur that option number 1 is sufficient (ie, minimal changes to get a complete public API exposed), I'm happy to update this PR. |
the line code is a significant inroad to reducing context and page overheads, it is matched to JSONL and CSV pretty good. i use zran for early stage projects and gztool for terabyte line corpus with thunking range requests. |
I submitted jdk-compatible patch to zran upstream to avoid 7 of the 8 blocks that land on odd bits. I think he refactored something else in zran and closed PR but if you were looking for ancient zlib compat gztool has(d) a working downgrade to zlib chunk boundaries |
Yeah, of course, that would be great! 👍🏻 |
Yes, in fact it is... :_( |
Oh yes absolutely, I'm not knocking the line-based indexing feature of
Thanks for the information @jnorthrup, I have found your PR here madler/zlib#801 and took a quick look to better understand your comment. It's getting a bit off-topic, but the standard span interval used by default by both I will mention the basic While we're off topic, I'll also mention
It's perhaps worth upgrading But certainly out-of-scope for this PR. |
I cannot remember the reason why I didn't use version 1.2 back then, but probably was because I implemented that feature "Handle gzip streams with multiple members" myself, and needed to also implement it in a way compatible with Later |
Splits "gztool.c" into components for use in external libraries. 1. gztool.h: Public API including structs and definitions that end-users of the library would need to us. Modifications to API cause major version number bump. 2. gztool.c: Implementation of the public functions that IDEALLY users of 'gztool.h' would not need to read 3. gztool_internal.h: Internal API, which can change between versions 4. gztool_internal.c: Implementation of the internal API 5. main.c: Existing CLI application Unfortunately some key functions (relating to creating index, decompressing and compressing) to the public API are far from ideal to be exposed in a public API (especially around the modal enum that modifies function behavior), but this minor change was concurred as acceptable by the author of gztool [1]. [1] circulosmeos#23 (comment)
d2bdb4d
to
fc3ee53
Compare
Ok @circulosmeos I have exposed the other key functions as public as-is without any refactoring to clean up the API. As suggested earleir I recommend the previous PR (#22) be approved and merged, since the cleanups in its 7 trivial commits in it are completely non-controversial and can be merged without issue (and it makes this PR appear much tidier when the base commits are merged). |
Ok, thanks @shasheene for your hard work! Hope I can scrub some time to integrate your changes. I'll probably make a number version upgrade too... 👍🏻 |
Hi again Roberto,
I was looking to integrate gztool with an external application as a library, so this PR splits the project up enabling this.
It builds on the non-controversial initial cleanups (#22) which I'm sure will be merged without issue.
This PR exposes many key structures and functions in the public header and usable by other applications. But some key functions (notably the build and extract functions) are currently only exposed in the private (non-stabilized) API header file, because those two functions still have the pretty tight integration to action enums which I want to consider private/internal.
It would take a little bit more refactoring to delineate out the two keyfunctions related to building the index and extracting the bytes by factoring out the action enums that are used in the main file.
I was excited about doing this, and I think the direction of making gztool more usable as a library is still useful, but I'm currently less interested in pursuing it. For my primary goal of these changes, I ended up switching from gztool to the "zran.c". I was looking to use
gztool
as it's had more development thanzran
and eg, comes with serialization functions with endianness handling all ready to go. But unfortunately much of gztool's excellent advanced functionality (handling growing gzip files, line-oriented indexing etc) was causing significant extra complexity and getting in the way of my primary goal, and the advanced functionality isn't immediately useful my specific use case.I'm happy to close this PR if it's not the direction you're planning on taking the project. I suspect others coming across the project would be interested in a libgztool along the lines of this PR.
Thoughts?