

@mabruzzo mabruzzo commented Aug 29, 2024

EDIT: @brittonsmith and I had a discussion about this PR about a month ago. The whole discussion was very instructive. He made some good points about maintainability. Since then, I have significantly overhauled this PR. (I still need to propagate these changes forward to #237 and #246 -- but that shouldn't be hard at all. I also have some ideas for shrinking #237).

The overhaul cut the size of the PR in half (originally it added ~2850 lines, now it adds ~1400 lines).¹


Overview

This PR is part of a sequence for adding the grdata tool to download/cache data files and let the core Grackle library access the cached files (without specifying the full file path).

Motivation

The benefits for pygrackle are somewhat self-evident. I think it's pretty clear we will want something like this functionality as soon as pygrackle can be installed from PyPI.

I'm going to focus on why this sort of functionality is useful for the core Grackle library! The primary motivation for this functionality (both fetching and allowing Grackle to directly access the cached files without specifying a full path) has to do with using Grackle in a simulation code.

Primary Motivation I: Parameter File Non-Portability in Simulation Codes

Any time somebody wants to run a simulation with Grackle, they will need to modify their parameter files with the path to a Grackle data file. While this may sound like a trivial task, the consequences are significant.

This creates 2 significant pain points for the maintainers of simulation codes; I have experienced both first-hand with Enzo-E. In both cases, these problems are exacerbated by automated testing.

  1. Automated testing:

    • Scenario: say I just added Grackle to my simulation code. I'm going to want to set up an automated test-problem that makes sure that my code properly communicates with Grackle. More importantly, these tests are in place to make sure that future changes to my code or to Grackle won't break things.
    • Goal: The tests should be trivial for any developer/user to run.
    • Currently, there are effectively 2 unattractive ways to accomplish this:
      1. A simple approach is to set up the parameter files in my test-case to assume that Grackle's files are in a fixed location relative to the directory where the test-suite is invoked (this obviously makes the test-suite more brittle -- it would be easy for a developer/user to make a mistake).
      2. Alternatively, I can have the developer/user tell the testing framework (by argument/env-variable) where to find the data-files on their machine and then encode logic in the framework to copy the files or modify the runtime-parameters.

    In either case, I need to write documentation instructing people what they need to do.

  2. Writing Examples/Tutorials:

    • Scenario: after adding Grackle to my simulation code, I want to add an example of how to use my code with Grackle. If Grackle provides the only implementation of radiative cooling in my code (as in Enzo-E), I probably want to include the example in my user tutorial.
    • Goal: Ideally, the examples will work out of the box. I want this example/tutorial to be as frictionless as possible, so that it's as easy as possible for new users to start using my code (especially for less technically savvy users, like brand-new grad students).
    • There are again 2 unattractive ways to accomplish this:
      1. A simple approach is to have the parameter file in my example assume that Grackle's files are in a fixed location relative to the directory where the example is invoked.
      2. Alternatively, I can instruct my users to modify the paths in my parameter files themselves.

    Both approaches are undesirable: the example won't work out of the box, and you introduce the chance for users to make silly mistakes. In either case, you need to write extra documentation instructing people what they need to do (if the example is part of a tutorial, you may want the documentation to be extra detailed for less technically savvy users).

How the proposed functionality helps: Essentially, code maintainers would write all of their examples to avoid hard-coded paths and have Grackle read from the cached location. Then they have 2 options (a tiny illustrative sketch of option 1 follows this list):

  1. the build-system of the external code can directly invoke the grdata tool during the build to download the necessary data-files (PR #246, Auto File Management Part 3, makes this easy!).
  2. they can instruct their users to use the grdata tool to fetch the data files (and link to our documentation for more instructions)
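
To give a flavor of option 1 (purely illustrative: the function name is made up, and a real build-system would more likely call the standalone grdata executable from #246), the fetch step from a Python-based build or test helper boils down to:

import subprocess
import sys

def ensure_grackle_data():
    # fetch Grackle's data files into the cache; files that are already
    # cached are skipped, so this is cheap to run on every build
    subprocess.check_call([sys.executable, "-m", "pygrackle", "fetch"])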

Primary Motivation II: Grackle Distribution

Historically, we always assumed that people would download the full git repository, install Grackle as a standalone library, and that dependent codes would link directly against that copy of Grackle. When they downloaded the git repository, they would "get the data files for free." But that's no longer the case.

Let's consider a concrete scenario:

  • Enzo-E now automatically downloads Grackle (by default) as part of configuring a fresh build. Every time we create a new build-directory it performs a recursive git clone (to get the data files). There are 2 points of friction:
    1. this is a large download (over 100 Megabytes) and during development it can make sense to maintain multiple build-directories (especially if you are jumping between branches)
    2. we have to write documentation telling users where exactly they can look for the data files, or we need to move the data-files during the build (and we need to make it clear that this behavior only applies during automatic downloads)
  • How does the proposed functionality help?
    • rather than recursively cloning Grackle for each build and checking out the target hash, Enzo-E's build system could (i) download the zip file at https://github.com/grackle-project/grackle/archive/<SHA1-hash>.zip and (ii) use the grdata tool to download the data-files
    • if we previously used grdata to download the data-files, the tool is smart enough to know that we can skip another download
    • note: <SHA1-hash> should be replaced with a commit hash. For example using 0a1493839b1279c438cc3ffcfa2f7bc248e49def will download a 500 KB zip file (more than 200 times smaller than the full git-repository)
    • Reminder: PR #246 (Auto File Management Part 3) makes it really easy to invoke the grdata tool from the build-system

More generally, if Grackle is ever distributed through a package manager like apt, dnf, or brew (not a crazy prospect), we won't be allowed to directly distribute the data files with the repository.

Secondary Motivations

Because this functionality supports portable Grackle configurations, it has some potential advantages from an end-user perspective:

  1. First, some end-users run simulations and do analysis on different machines. Having Grackle access data-files without requiring full hardcoded paths would be very useful in that situation, particularly if you are using pygrackle for analysis.
  2. Less importantly, having Grackle access data-files without requiring full hardcoded paths would make sharing of simulation run-files between end-users easier.

There are also a number of ways that this functionality would help Grackle itself. This list is roughly ordered from most-important to least-important:

  1. we could tweak our code-examples so that they can run from anywhere. I think this is HUGELY IMPORTANT
    • currently it's really easy to mess up by putting the build-directory in an unexpected location, or invoking the example from the wrong directory. We want these to always "just work."
    • as an added bonus we can simplify some of the logic in the scripts that test the code-examples
  2. We can discard a bunch of logic from the core-library test-suite. I'm talking about all this source code and this highlighted CMake logic
  3. PR #312 introduces a simplified tutorial. With the proposed functionality, we could make it into a more convenient quick-start guide by doing the following:
    • tweak the example so that it reads the data-file from the cache
    • add a sample CMakeLists.txt file that is written to automatically download Grackle and subsequently invoke grdata to download the data files
    • add instructions telling people that they can copy the contents of the example program and example CMakeLists.txt file into 2 new files (basically users would be able to run the example without a single git command)
  4. The script proposed by PR #306 (Introduce the scripts/ci/record-answers.py script) for auto-generating answer-test answers could be simplified
  5. If the grdata program works out well, we could eventually consider dropping the use of git submodules. (definitely not the most important thing)

The proposed functionality could also help in a few hypothetical future scenarios:

  • what if a user needs to have multiple versions of grackle installed and they have incompatible data files? Without the proposed functionality, they need to maintain multiple source repositories or come up with a manual solution
  • what if a user wants to delete the source directory after they installed Grackle?

Overview of the grdata tool

The proposed grdata tool works in 2 ways:

python -m pygrackle COMMAND [ARGS...]  # added by this PR
path/to/grdata COMMAND [ARGS...]       # to be added by PR #246

This PR makes the grdata tool accessible through pygrackle (#246 will make it accessible as a standalone tool). In this discussion, we use <launcher> as a placeholder for python -m pygrackle or path/to/grdata.

We go through the commands.

  • fetch command (this is the primary command!)
    • <launcher> fetch downloads all files to the cache location. Files will be skipped if they are already downloaded
    • <launcher> fetch <name1> [<name2> ...] downloads the specified subset of files to the cache location
    • the optional --untracked-dest-dir=<path> flag is a convenience to let people download files to an arbitrary location (specified by <path>).
    • In all cases, a download is considered an error if the checksum doesn't match the registry (a rough sketch of this check appears after this list). When the optional flag isn't used, the tool does some basic deduplication after the download.
  • the calc-reg command is provided to help us generate a new file-registry with the proper formatting (in the event that we ever add/change datafiles)
  • the rest of the commands just query basic information and exit:
    • show-grackle-version specifies the associated grackle version
    • show-protocol-version specifies the associated protocol version (more on that below)
    • show-checksum-kind specifies the associated Checksum Algorithm (SHA-1)
    • show-data-dir shows the associated data-dir (more on that below)
    • show-help shows detailed documentation (these docs also appear on the website)
    • show-known-reg shows the associated file registry
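
For concreteness, here is a rough sketch of the checksum check mentioned above. It is not the actual implementation in grdata.py (the helper name and the way the registry entry is passed in are made up); it just illustrates the kind of validation that happens after each download:

import hashlib

def _matches_registry(path, expected_sha1_hexdigest):
    # compute the file's SHA-1 digest in chunks (the data files can be large)
    hasher = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    # the download is treated as an error if this returns False
    return hasher.hexdigest() == expected_sha1_hexdigest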

Cache Location

Invoking <launcher> fetch downloads the files to the cache location. The cache is organized in a small hierarchy: the cached files are stored 2 levels deep within the data-directory.

  • the data-directory is set by the ${GRACKLE_DATA_DIR} env variable. If this isn't provided, we fall back to a platform-specific default (a minimal sketch of this fallback logic appears below).
  • it can be queried with <launcher> show-data-dir
  • it contains a user-data directory (reserved for future use by users) and the data-store

The data-store directory is named after the protocol-version and it contains all of the cached data files.

  • the protocol-version is hard-coded to "1" and will probably never need to change (but we allow ourselves the option to change it in the future if we decide we want to reorganize the files)
  • the protocol-version can be queried with <launcher> show-protocol-version
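
For illustration, here is a minimal sketch of the data-directory fallback logic. This is not the real _get_data_dir (which handles more corner cases); in particular, the final ~/.local/share default shown here is an assumption on my part:

import os
import sys

def _sketch_data_dir():
    # an explicit ${GRACKLE_DATA_DIR} always wins
    explicit = os.getenv("GRACKLE_DATA_DIR")
    if explicit:
        return explicit
    # next, respect XDG_DATA_HOME if it is set
    xdg = os.getenv("XDG_DATA_HOME")
    if xdg:
        return os.path.join(xdg, "grackle")
    # platform-specific defaults (the last line is an assumed fallback)
    if sys.platform == "darwin":
        return os.path.expanduser("~/Library/Application Support/grackle")
    return os.path.expanduser("~/.local/share/grackle")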

When we invoke <launcher> fetch:

  • we create a directory within the data-store directory named after the Grackle-version associated with the launcher (if it doesn't already exist). This is a version-dir.
  • We download all files into the version-dir (we don't redownload files that already exist).
  • For each downloaded file, we subsequently try to deduplicate it: we check whether it duplicates another file within the data-store with the same name & contents (but in a different version-dir). If it has a duplicate, we replace the downloaded file with a hardlink to the pre-existing file (a sketch of this step appears below).
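
As a sketch of this deduplication step (illustrative only: the function and argument names here are made up, and the real logic in grdata.py differs in its details), the core of it looks something like:

import filecmp
import os

def _sketch_deduplicate(fname, version_dir, data_store_dir):
    # path of the file we just downloaded
    new_path = os.path.join(version_dir, fname)
    for other_version in os.listdir(data_store_dir):
        old_path = os.path.join(data_store_dir, other_version, fname)
        if os.path.abspath(old_path) == os.path.abspath(new_path):
            continue  # skip the copy we just downloaded
        if not os.path.isfile(old_path):
            continue  # that version-dir has no file with this name
        if filecmp.cmp(old_path, new_path, shallow=False):
            # same name & same contents -> share a single copy on disk
            os.remove(new_path)
            os.link(old_path, new_path)
            return True
    return False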

The cache location is organized in the following hierarchy:

<data-dir>
 ├── data-store-v<protocol-version>/      # <- the data-store
 │    ├── 3.4.1-dev/                     # <- a version-dir
 │    │    ├── CloudyData_UVB=FG2011.h5
 │    │    ├── ...
 │    │    └── cloudy_metals_2008_3D.h5
 │    └── 3.5.0/                         # <- another version-dir
 │         ├── CloudyData_UVB=FG2011.h5
 │         ├── ...
 │         └── cloudy_metals_2008_3D.h5
 └── user-data/                          # <- reserved for user data
      ├── ...
      └── ...

Maintainability

I took the comment about maintainability to heart. Currently, the src/python/pygrackle/utilities/grdata.py script is 774 lines. The first ~185 lines hold documentation that is used on the website; we can remove it if you prefer.

Consequently, we have ~590 lines of code.

Some thoughts on further reducing the size:

  • removing deduplication logic would clip ~37 lines of code/comments/function-docstrings (with some light refactoring, maybe 40 lines)
  • if we remove _progress_bar and replace _retrieve_url with this function, we would save another ~40 lines

If we adopted pooch, that would save at most another ~80 lines.² This only counts code we could delete; we would probably need ~10 lines to call pooch.

  • Maybe with a little refactoring we could clip another 20-30 lines?
  • we need to retain functions like _get_data_dir & _parse_file_registry, and retain most of _setup_and_get_dirs, as long as we want the Grackle library to be able to directly access files without full paths.
  • keep in mind, if we want to use pooch, there is a bunch of additional hidden complexity in running grdata as a standalone tool:
    • We could use the built-in zipapp module to package grdata.py, pooch, and pooch's dependencies into a file called grdata.pyz (a rough sketch of this packaging step follows this list). We would need to add some logic to CMake to generate this file (with some special handling for systems where pip isn't installed).
    • However, we would probably want to install a file called grdata alongside grdata.pyz with the contents:
    #!/bin/sh
    exec /usr/bin/env python3 "$(dirname "$0")/grdata.pyz" "$@"
    and we would probably need to add some additional logic somewhere to make sure we only ever invoke the script with a python version compatible with pooch (hopefully, we can assume the system-python always meets pooch's minimum required version).
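
For what it's worth, generating the archive itself is pretty simple with the standard-library zipapp module. Here is a rough sketch (the paths, the pip invocation, and the grdata:main entry point are all assumptions on my part; the real CMake logic would need more care):

import subprocess
import sys
import zipapp

def build_grdata_pyz(staging_dir, out_path="grdata.pyz"):
    # vendor pooch and its dependencies next to grdata.py in the staging dir
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--target", staging_dir, "pooch"]
    )
    # bundle everything into a single executable archive
    zipapp.create_archive(
        staging_dir,
        target=out_path,
        interpreter="/usr/bin/env python3",
        main="grdata:main",  # assumes grdata.py exposes a main() entry point
    )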

Footnotes

  1. Back when I first made the PR, I realized I could probably shave off ~150 lines. But most of the shrinking of this PR comes from reworking the deduplication machinery in grdata. The original version used a very robust, albeit more complex, approach that allowed us to deduplicate data files even if we renamed them. Because of this added complexity, grdata needed to provide subcommands for inspecting the cached data & for deleting the cached data. The newer deduplication approach is MUCH simpler. While it can't deduplicate renamed files, it let us remove a ton of extra functionality (and extra tests).

  2. Aside: shifting to pooch requires us to drop deduplication.

@mabruzzo mabruzzo force-pushed the auto-data branch 3 times, most recently from b2cc8d4 to f91413c on August 29, 2024 at 20:41
@mabruzzo mabruzzo changed the title Introducing a Datafile Management Tool Auto File Management Part 1: Introducing a Datafile Management Tool Sep 4, 2024
@mabruzzo mabruzzo added this to the 3.4 milestone Sep 9, 2024
We plan to eventually install the grdata tool as a standalone command
line program. Essentially the build-system will perform some
substitutions (the CMake build system uses CMake's built-in
``configure_file`` command while the classic build system uses the
analogous ``configure_file.py`` script).

This commit introduces a few minor tweaks to grdata.py so that it can
more easily be consumed by the ``configure_file.py`` script.
- The ``configure_file.py`` script, itself, will ultimately require a
  few more tweaks so that it doesn't report occurrences of python's
  decorator-syntax as errors
- However, this commit minimizes the number of required changes
mabruzzo added a commit to mabruzzo/grackle that referenced this pull request Mar 18, 2025
1. A bug is fixed in the Fortran Bindings
2. All references to the ``GRACKLE_DATA_DIR`` env variable have been
   removed. An upcoming PR (grackle-project#235) is going to overhaul this
   machinery, but it probably won't be merged before the 3.4 release
@mabruzzo mabruzzo modified the milestones: 3.4, 3.5 Mar 18, 2025
@mabruzzo mabruzzo changed the base branch from main to newchem-cpp March 29, 2025 04:02
@mabruzzo mabruzzo changed the base branch from newchem-cpp to main March 29, 2025 04:02
mabruzzo added a commit to mabruzzo/grackle that referenced this pull request Apr 4, 2025
This was inspired by something I noticed in the standard Python documentation. Essentially, we use sphinx's builtin [sphinx.ext.extlinks](https://www.sphinx-doc.org/en/master/usage/extensions/extlinks.html#module-sphinx.ext.extlinks) extension for adding custom roles to automatically expand external links in a simple manner.

For example:
- ``:source:`src/example` `` is replaced with the text "src/example" that is hyperlinked to [here](https://github.com/grackle-project/grackle/tree/main/src/example)
- ``:code-example:`c_example.c` `` is replaced with the text "c_example.c" that is hyperlinked to [here](https://github.com/grackle-project/grackle/blob/main/src/example/c_example.c)
- ``:gh-pr:`235` `` is replaced with the text "gh-pr#235" that is hyperlinked to [here](grackle-project#235)
mabruzzo added a commit to mabruzzo/grackle that referenced this pull request Apr 15, 2025
Motivation
----------

At present, we provide several extremely useful resources for teaching
Grackle. But, I've long felt that there's initially something of a
learning curve. While all of the examples and the ["Interacting with
Grackle in Your Simulation Code"](https://grackle.readthedocs.io/en/latest/Interaction.html)
section are extremely rich, detailed, and informative, I always
feel like they may include a little **too much** information...

**Every** usage of Grackle boils down to the following flow:

1. We have some data about parcels of gas & we want to know something
   about it (mostly how chemistry/cooling cause it to evolve)

2. We need to encode information about the gas and the physics we care
   about in a format that Grackle understands.[^1]

3. We package up the data about the gas parcels into a format that
   Grackle understands (the ``grackle_field_data`` struct).

4. We call the appropriate function.

Unfortunately, this flow is not very obvious from the existing resources
(i.e. there's just a little too much going on). But, once you understand
the flow ("the big picture"), I think our existing resources are extremely
useful.

Overview
--------

This patch tries to bridge this gap. I came up with the shortest and
simplest possible example. (I could probably save a line or 2 if I used
the global API, but avoiding it was a conscious choice). Then, I wrote a
tutorial to try to walk the reader through the example chunk by chunk.

To help annotate the example, I wrote a really simple sphinx extension.

Future Thoughts
---------------

Honestly, this is far from perfect, but I think it's a good start. Some
thoughts for the future:

- It would be nice if Grackle could operate with CGS units. The tutorial
  could then gloss over a number of details.

  * Since we already use double-precision within Grackle, I don't
    think there are any problems with doing this. But clearly, it needs
    to wait

  * I think it would also be nice for teaching the pygrackle bindings

- It would be better if the example were even shorter:

  * the `gr_fields` type proposed within grackle-project#271 will help

  * a number of my other "experimental APIs" will also help

  * in the 4.0 release, we may want to revisit the default values for
    the ``use_grackle`` and ``with_radiative_cooling`` parameters
    (the current defaults are pretty silly).

- if we merge in grackle-project#235, grackle-project#237, and grackle-project#246, then we could easily add a
  section describing how to try out the example without git commands
  (those PRs are essential for crafting a sample CMake configuration
  that can automatically install data-files)

[^1]: Of course, Grackle is designed to let callers avoid repeating step 2 over and over again.
Comment on lines 312 to 313
if prefix is None and envvar == "HOME":
    prefix = os.path.expanduser("~")

Simplification opportunity: If we treat this case as an error, we could probably clip ~70 lines of C code from PR #237

raise GrdataError(f"server can't be reached to fetch {url}: {e.code}")


def _get_data_dir(system_str=None):

If we want to be able to directly access these files in C, this function could not be replaced by pooch

data_dir_chain = [  # tuple fmt: (envvar, suffix_path, fatal_if_rel, is_enabled)
    ("GRACKLE_DATA_DIR", None, True, True),
    ("XDG_DATA_HOME", "grackle", False, True),
    ("HOME", "Library/Application Support/grackle", True, is_darwin),

Simplification opportunity: we could cut the ("HOME", "Library/Application Support/grackle", True, is_darwin), line.

Cutting this line means that we would treat the default directory on macOS in the exact same manner as for any other unix-like OS.

Comment on lines 232 to 252
def _progress_bar(total_bytes):
    """provides a function for drawing/updating progress bars"""
    ncols = shutil.get_terminal_size()[0] - 1
    power_div_3 = int(log10(total_bytes) // 3) if total_bytes > 0 else 0
    factor, unit = 1000.0**power_div_3, (" B", "KB", "MB", "GB")[power_div_3]
    # the output line has the form: '[<progress-bar>] <size>/<size> <unit>'
    fmt = "\r[{bar:{barlen}.{nfill}}] {size:.2f}" + f"/{total_bytes/factor:.2f} {unit}"
    barlen = ncols - 19  # for context, 15 <= (len(fmt.format(...)) - barlen) <= 19
    bar = (barlen * "=") if (barlen >= 1) and sys.stdout.isatty() else None

    def _update(size):
        nonlocal bar
        if size is None and bar is not None:
            print(flush=True)
            bar = None
        elif bar is not None:
            nfill = barlen * int(size / total_bytes)
            val = fmt.format(bar=bar, barlen=barlen, nfill=nfill, size=size / factor)
            print(val, end="", flush=True)

    return _update

Simplification opportunity: this is not strictly necessary and could be clipped (or simplified)

A simpler alternative might just print the amount of data that was downloaded:

def _download_status(tot_bytes):
    """provides information about the current download"""
    power_div_3 = int(log10(tot_bytes) // 3) if tot_bytes > 0 else 0
    div, unit = 1000.0**power_div_3, (" B", "KB", "MB", "GB")[power_div_3]
    done = not sys.stdout.isatty()

    def _update(size):
        nonlocal done
        m = "\n" if size is None else f"\r{size / div:.2f}/{tot_bytes / div:.2f} {unit}"
        if not done:
            print(m, end="", flush=True)
            done = size is None

    return _update

mabruzzo added a commit to mabruzzo/grackle that referenced this pull request May 19, 2025
Prior to this PR, our test suite would fail unless we had an editable
install. In reality, all test-cases outside of test_models infer the
location of the data files based on the location of the test-files,
which are never part of the wheel.[^1] Therefore, this PR introduces a
few small tweaks to support testing all test-cases outside of
test_models.

Overall, this will be really useful for providing a more rigorous check
of the correctness of our precompiled wheels (see PR grackle-project#320).

[^1]: I know that some packages do ship their test-suites as part of
      the wheel, but that currently makes absolutely no sense for us
      unless we provide users with an easy way to get the datafiles.
      We can always revisit this choice in the future after we merge
      PRs grackle-project#235 & grackle-project#237 (or come up with an equivalent)
Comment on lines 232 to 253
def _progress_bar(total_bytes, silent=False):
    """provides a function for drawing/updating progress bars"""
    ncols = shutil.get_terminal_size()[0] - 1
    power_div_3 = int(log10(total_bytes) // 3) if total_bytes > 0 else 0
    factor, unit = 1000.0**power_div_3, (" B", "KB", "MB", "GB")[power_div_3]
    # the output line has the form: '[<progress-bar>] <size>/<size> <unit>'
    fmt = "\r[{bar:{barlen}.{nfill}}] {size:.2f}" + f"/{total_bytes/factor:.2f} {unit}"
    barlen = ncols - 19  # for context, 15 <= (len(fmt.format(...)) - barlen) <= 19
    suppress = (barlen < 1) or silent or not sys.stdout.isatty()
    bar = None if suppress else (barlen * "=")

    def _update(size):
        nonlocal bar
        if size is None and bar is not None:
            print(flush=True)
            bar = None
        elif bar is not None:
            nfill = int(barlen * (size / total_bytes))
            val = fmt.format(bar=bar, barlen=barlen, nfill=nfill, size=size / factor)
            print(val, end="", flush=True)

    return _update

Simplification opportunity: we could replace this with the following snippet (or entirely delete the function -- it's just telling us about our current progress)

 def _progress_bar(tot_bytes, *, silent=False):
    """provides information about the current download"""
    power_div_3 = int(log10(tot_bytes) // 3) if tot_bytes > 0 else 0
    div, unit = 1000.0**power_div_3, (" B", "KB", "MB", "GB")[power_div_3]
    done = tot_bytes <= 0 or silent or not sys.stdout.isatty()

    def _update(size):
        nonlocal done
        m = "\n" if size is None else f"\r{size / div:.2f}/{tot_bytes / div:.2f} {unit}"
        if not done:
            print(m, end="", flush=True)
            done = size is None

    return _update
