-
Notifications
You must be signed in to change notification settings - Fork 54
Auto File Management Part 1: Introducing a Datafile Management Tool #235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
b2cc8d4
to
f91413c
Compare
I also added documentation and integrated the tool into the testing framework.
We plan to eventually install the grdata tool as a standalone command line program. Essentially the build-system will perform some substitutions (the CMake build system uses CMake's built-in ``configure_file`` command while the classic build system uses the analogous ``configure_file.py`` script) This commit introduces a few minor tweaks to grdata.py so that it can more easily be consumed by the ``configure_file.py`` script. - The ``configure_file.py`` script, itself, will ultimately require a few more tweaks so that it doesn't report occurences of python's decorator-syntax as errors - However, this commit minimizes the number of required changes
1. A bug is fixed in the Fortran Bindings 2. All references have been removed about the ``GRACKLE_DATA_DIR`` env variable. An upcoming PR (grackle-project#235) is going to overhaul this machinery, but it probably won't be merged the 3.4 release
This was inspired by something I noticed in the standard Python documentation. Essentially, we use sphinx's builtin [sphinx.ext.extlinks](https://www.sphinx-doc.org/en/master/usage/extensions/extlinks.html#module-sphinx.ext.extlinks) extension for adding custom roles to automatically expand external links in a simple manner. For example: - ``:source:`src/example` `` is replaced with the text "src/example" that is hyperlinked to [here](https://github.com/grackle-project/grackle/tree/main/src/example) - ``:code-example:`c_example.c` `` is replaced with the text "c_example.c" that is hyperlinked to [here](https://github.com/grackle-project/grackle/blob/main/src/example/c_example.c) - ``:gh-pr:`235` `` is replaced with the text "gh-pr#235" that is hyperlinked to [here](grackle-project#235)
This was inspired by something I noticed in the standard Python documentation. Essentially, we use sphinx's builtin [sphinx.ext.extlinks](https://www.sphinx-doc.org/en/master/usage/extensions/extlinks.html#module-sphinx.ext.extlinks) extension for adding custom roles to automatically expand external links in a simple manner. For example: - ``:source:`src/example` `` is replaced with the text "src/example" that is hyperlinked to [here](https://github.com/grackle-project/grackle/tree/main/src/example) - ``:code-example:`c_example.c` `` is replaced with the text "c_example.c" that is hyperlinked to [here](https://github.com/grackle-project/grackle/blob/main/src/example/c_example.c) - ``:gh-pr:`235` `` is replaced with the text "gh-pr#235" that is hyperlinked to [here](grackle-project#235)
Motivation ---------- At present, we provide several extremely useful resources for teaching Grackle. But, I've long felt that there's initially something of a learning curve. While all of the examples and the ["Interacting with Grackle in Your Simulation Code"](https://grackle.readthedocs.io/en/latest/Interaction.html) section are extremely rich, detailed, and informative, I always feel like they may include a little **too much** information... **Every** usage of Grackle, boils down to the following flow: 1. We have some data about parcels of gas & we want to know something about it (mostly how chemistry/cooling cause it to evolve) 2. We need to encode information about the gas and the physics we care about in a format that Grackle understands.[^1] 3. We package up the data about the gas parcels into a format that Grackle understands (the ``grackle_field_data`` struct). 4. We call the appropriate function. Unfortunately, this flow is not very obvious from the existing resources (i.e. there's just a little too much going on). But, once you understand the flow ("the big picture"), I think our existing resources are extremely useful. Overview -------- This patch tries to bridge this gap. I came up with the shortest and simplest possible example. (I probably could save a line or 2 if I used the global API, but that was a conscious choice). Then, I wrote a tutorial to try to walk the reader through the example chunk by chunk. To help annotate the example, I wrote a really simple sphinx extension. Future Thoughts --------------- Honestly, this is far from perfect, but I think its a good start. Some thoughts for the future: - It would be nice if Grackle could operate with CGS units. The tutorial could then gloss over a number of details. * Since we already use double-precision within Grackle, I think don't think there are any problems with doing this. But clearly, it needs to wait * I think it would also be nice for teaching the pygrackle bindings - It would be better if the example were even shorter: * the `gr_fields` type proposed within grackle-project#271 will help * a number of my other "experimental APIs" will also help * in the 4.0 release, we may want to revisit the default values for the ``use_grackle`` and ``with_radiative_cooling`` parameters (the current defaults are pretty silly). - if we merge in grackle-project#235, grackle-project#237, and grackle-project#246, then we could easily add a section describing how to try out the example without git commands (those PRs are essential for crafting a sample CMake configuration that can automatically install data-files) [^1]: Of course, Grackle is designed to let callers avoid repeating step 2 over and over again.
I removed some commands, combined the Config classes, and refactored/consolidated some argparse-related logic
if prefix is None and envvar == "HOME": | ||
prefix = os.path.expanduser("~") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplification opportunity: If we treat this case as an error, we could probably clip ~70 lines of C code from PR #237
raise GrdataError(f"server can't be reached to fetch {url}: {e.code}") | ||
|
||
|
||
def _get_data_dir(system_str=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to be able to directly access these files in C, this function could not be replaced by pooch
data_dir_chain = [ # tuple fmt: (envvar, suffix_path, fatal_if_rel, is_enabled) | ||
("GRACKLE_DATA_DIR", None, True, True), | ||
("XDG_DATA_HOME", "grackle", False, True), | ||
("HOME", "Library/Application Support/grackle", True, is_darwin), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplification opportunity: we could cut the ("HOME", "Library/Application Support/grackle", True, is_darwin),
line.
Cutting this lines means that we would treat the default directory on macOS in the exact same manner as for any other unix-like OS.
def _progress_bar(total_bytes): | ||
"""provides a function for drawing/updating progress bars""" | ||
ncols = shutil.get_terminal_size()[0] - 1 | ||
power_div_3 = int(log10(total_bytes) // 3) if total_bytes > 0 else 0 | ||
factor, unit = 1000.0**power_div_3, (" B", "KB", "MB", "GB")[power_div_3] | ||
# the output line has the form: '[<progress-bar>] <size>/<size> <unit>' | ||
fmt = "\r[{bar:{barlen}.{nfill}}] {size:.2f}" + f"/{total_bytes/factor:.2f} {unit}" | ||
barlen = ncols - 19 # for context, 15 <= (len(fmt.format(...)) - barlen) <= 19 | ||
bar = (barlen * "=") if (barlen >= 1) and sys.stdout.isatty() else None | ||
|
||
def _update(size): | ||
nonlocal bar | ||
if size is None and bar is not None: | ||
print(flush=True) | ||
bar = None | ||
elif bar is not None: | ||
nfill = barlen * int(size / total_bytes) | ||
val = fmt.format(bar=bar, barlen=barlen, nfill=nfill, size=size / factor) | ||
print(val, end="", flush=True) | ||
|
||
return _update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplification opportunity: this is not strictly necessary and could be clipped (or simplified)
A simpler alternative would might just print the amount of data that was downloaded
def _download_status(tot_bytes):
"""provides information about the current download"""
power_div_3 = int(log10(tot_bytes) // 3) if tot_bytes > 0 else 0
div, unit = 1000.0**power_div_3, (" B", "KB", "MB", "GB")[power_div_3]
done = not sys.stdout.isatty()
def _update(size):
nonlocal done
m = "\n" if size is None else f"\r{size / div:.2f}/{tot_bytes / div:.2f} {unit}"
if not done:
print(m, end="", flush=True)
done = size is None
return _update
Prior to this PR, our test suite would fail unless we had an editable install. In reality, all test-cases outside of test_models infer the location of the data files based on the location of the test-files, which are never part of the wheel.[^1] Therefore, this PR introduces a few small tweaks to support testing all test-cases outside of test_models. Overall, this will be really useful for providing a more rigorous check of the correctness of our precompiled wheels (see PR grackle-project#320). [^1]: I know that some packages do ship their test-suites as part of the wheel, but that currently makes absolutely no sense for us unless we provide users with an easy way to get the datafiles. We can always revisit this choice in the future after we merge PRs grackle-project#235 & grackle-project#237 (or come up with an equivalent)
def _progress_bar(total_bytes, silent=False): | ||
"""provides a function for drawing/updating progress bars""" | ||
ncols = shutil.get_terminal_size()[0] - 1 | ||
power_div_3 = int(log10(total_bytes) // 3) if total_bytes > 0 else 0 | ||
factor, unit = 1000.0**power_div_3, (" B", "KB", "MB", "GB")[power_div_3] | ||
# the output line has the form: '[<progress-bar>] <size>/<size> <unit>' | ||
fmt = "\r[{bar:{barlen}.{nfill}}] {size:.2f}" + f"/{total_bytes/factor:.2f} {unit}" | ||
barlen = ncols - 19 # for context, 15 <= (len(fmt.format(...)) - barlen) <= 19 | ||
suppress = (barlen < 1) or silent or not sys.stdout.isatty() | ||
bar = None if suppress else (barlen * "=") | ||
|
||
def _update(size): | ||
nonlocal bar | ||
if size is None and bar is not None: | ||
print(flush=True) | ||
bar = None | ||
elif bar is not None: | ||
nfill = int(barlen * (size / total_bytes)) | ||
val = fmt.format(bar=bar, barlen=barlen, nfill=nfill, size=size / factor) | ||
print(val, end="", flush=True) | ||
|
||
return _update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplification opportunity: we could replace this with the following snippet (or entirely delete the function -- it's just telling us about our current progress)
def _progress_bar(tot_bytes, *, silent=False):
"""provides information about the current download"""
power_div_3 = int(log10(tot_bytes) // 3) if tot_bytes > 0 else 0
div, unit = 1000.0**power_div_3, (" B", "KB", "MB", "GB")[power_div_3]
done = tot_bytes <= 0 or silent or not sys.stdout.isatty()
def _update(size):
nonlocal done
m = "\n" if size is None else f"\r{size / div:.2f}/{tot_bytes / div:.2f} {unit}"
if not done:
print(m, end="", flush=True)
done = size is None
return _update
EDIT: @brittonsmith and I had a discussion about this PR about a month ago. The whole discussion was very instructive. He made some good points about maintainability. Since then, I have significantly overhauled this PR. (I still need to propagate these changes forward to #237 and #246 -- but that shouldn't be hard at all. I also have some ideas for shrinking #237).
The overhaul cut the size of the PR in half (originally it added ~2850 lines, now it adds ~1400 lines). 1
Overview
This PR is part of a sequence for adding the
grdata
tool to download/cache data files and let the core Grackle library access the cached files (without specifying the full file path).grdata
tool (currently only usable as part of pygrackle)grdata.py
so it can be used without pygrackle (and so it can easily be used through the build-system of downstream applications). If<prefix>/lib
holdslibgrackle.so
and<prefix>/include
holds headers, thengrdata
will be installed as<prefix>/bin/grdata
.Motivation
The benefits for pygrackle are somewhat self-evident. I think it's pretty clear we will want something like this functionality as soon pygrackle can be installed from PyPI.
I'm going to focus on why this sort of functionality is useful for the core Grackle library! The primary motivation for this functionality (both fetching and allowing Grackle to directly access the cached files without specifying a full path) has to do with using Grackle in a simulation code.
Primary Motivation I: Parameter File Non-Portability in Simulation Codes
Any time somebody wants to run a simulation with Grackle, they will need to modify parameter files with the path to a Grackle parameter file. While this may sound like a trivial task the CONSEQUENCES are significant.
This creates 2 significant pain points for the maintainers of simulation codes; I have experienced both first-hand with Enzo-E. In both cases, these problems are exacerbated by automatic testing
Automated testing:
In either case, I need to write documentation instructing people what they need to do.
Writing Examples/Tutorials:
Both approaches are undesirable, because the example won't work out of the box and you are introduce the chance for users to make silly mistakes. In either case, you need to write extra documentation instructing people what they need to do (if the example is part of a tutorial, the you may want the documentation to be extra detailed for less technically savvy users)
How the proposed functionality helps: Essentially code-maintainers would write all of their examples to avoid using hard-coded paths and have grackle read from the cached location. Then they have 2 options:
grdata
tool during the build to download the necessary data-files (PR Auto File Management Part 3: Install data-management tool as command line program #246 makes this easy!).grdata
tool to fetch the data files (and link to our documentation for more instructions)Primary Motivation II: Grackle Distribution
Historically, we always assumed that people would download the full git repository, install Grackle as a standalone library and all dependencies would directly link against that copy of Grackle. And when they would download the git repository, they would "get the data files for free." But that's no longer the case.
Let's consider a concrete scenario:
https://github.com/grackle-project/grackle/archive/<SHA1-hash>.zip
& then use thegrdata
tool to download the data-filesgrdata
to download the data-files, the tool is smart enough to know that we can skip another download<SHA1-hash>
should be replaced with a commit hash. For example using0a1493839b1279c438cc3ffcfa2f7bc248e49def
will download a 500 KB zip file (more than 200 times smaller than the full git-repository)grdata
tool from the build-systemMore generally, if Grackle is ever distributed through a package manager like apt, dnf, or brew (not a crazy prospect), we won't be allowed to directly distribute the data files with the repository.
Secondary Motivations
The fact that this functionality would support portable Grackle configuration would have some potential advantages from an end-user perspective:
There are also a number of ways that this functionality would help Grackle, itself. This list is roughly orderred from most-important to least-important:
grdata
to download the data filesCMakeLists.txt
file into 2 new files (basically users would be able to run the example without a single git command)scripts/ci/record-answers.py
script #306 for auto-generating answer-test answers could be simplifiedgrdata
program works out well, we could eventually consider dropping the use of git submodules. (definitely not the most important thing)The proposed functionality could help in a few hypothetical scenarios in the future :
Overview of the
grdata
toolThe proposed
grdata
will works in 2 ways:This PR makes the
grdata
tool accessible throughpygrackle
. (#246 will make it accessible a standalone tool). In this discussion we use<launcher>
as placeholder forpython -m pygrackle
orpath/to/grdata
.We go through the commands.
fetch
command (this is the primary command!)<launcher> fetch
downloads all files to the cache location. Files will be skipped if they are already downloaded<launcher> fetch <name1> [<name2> ...]
downloads the specified subset of files to the cache location--untracked-dest-dir=<path>
flag is a convenience to let people download files to an arbitrary location (specified by<path>
.calc-reg
command is provided to help us generate a new file-registry with the proper formatting (in the event that we ever add/change datafiles)show-grackle-version
specifies the associated grackle versionshow-protocol-version
specifies the associated protocol version (more on that below)show-checksum-kind
specifies the associated Checksum Algorithm (SHA-1)show-data-dir
shows the associated data-dir (more on that below)show-help
shows detailed documentation (these docs also appear on the website)show-known-reg
shows the associated file registryCache Location
Invoking
<launcher> fetch
will fetch and download the files to the cache location. The cache is organized in a small hierarchy:The cache location is stored 2 levels deep within the data-directory.
${GRACKLE_DATA_DIR}
env variable. If this isn't provided, we fall back to a platform-specific default.<launcher> show-data-dir
The data-store directory is named after the protocol-version and it contains all of the cached data files
<launcher> show-protocol-version
When we invoke
<launcher> fetch
:The cache location is organized in the following hierarchy
Maintainability
I took the comment about maintainability to heart. Currently, the
src/python/pygrackle/utilities/grdata.py
script is 774 lines. The first ~185 lines hold documentation, this is used in the website. We can remove it if you prefer.Consequently, we have ~590 lines of code.
Some thoughts on further reducing the size:
_progress_bar
and replace_retrieve_url
with this function, we would save another ~40 linesIf we adopted
pooch
that would save at most another ~80 lines.2 This only counts stuff we could delete. We would probably need ~10 lines to call pooch._get_data_dir
&_parse_file_registry
, and retain most of_setup_and_get_dirs
as long as we want the Grackle library to be able to directly files without full paths.grdata
as a standalone tool if we want to use pooch:zipapp
module to packagegrdata.py
,pooch
, andpooch
's dependencies into a file calledgrdata.pyz
. We would need to add some logic to CMake to generate this file (with some special handling for systems where pip isn't installed).grdata
alongsidegrdata.pyz
with the contents:pooch
(hopefully, we can assume the system-python always meetspooch
's minimum required version).Footnotes
Back when I first made the PR, I realized I could probably shave off ~150 lines. But most of the shrinking of this PR comes from reworking the deduplication machinery in grdata. The original version used a very robust, albeit more complex, approach that allowed us to deduplicate data files even if we renamed them. Because of this added complexity, we grdata needed provide subcommands for inspecting the cached data & for deleting the cached data. The newer deduplication approach is MUCH simpler. While it can't deduplicate renamed files, it let us remove a ton of extra functionality (and extra tests). ↩
Aside: shifting to pooch requires us to drop deduplication, ↩