Skip to content

Use pooch #513

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Jul 15, 2025
Merged

Use pooch #513

merged 33 commits into from
Jul 15, 2025

Conversation

Zeitsperre
Copy link
Member

@Zeitsperre Zeitsperre commented Jun 10, 2025

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
  • (If applicable) Documentation has been added / updated (for bug fixes / features).
  • (If applicable) Tests have been added.
  • CHANGELOG.rst has been updated (with summary of main changes).
    • Link to issue (:issue:number) and pull request (:pull:number) has been added.

What kind of change does this PR introduce?

  • Removes the older, cobbled-together testing data management system for one based on pooch
  • Updates several environment variables to use a more consistent naming system, used by similar projects
  • The pooch helper function (called "yangtze"; not married to the name) can now be used to fetch() local or remote datasets.
  • The main.yml workflow now uses caching for testing data to reduce the synonymous download requests from the various builds.
  • A new workflow (testdata-version.yml) has been adapted to ensure that the raven-testdata tag is kept up-to-date.

Does this PR introduce a breaking change?

Absolutely. The testing data management system is completely changed, and environment variables have been modified. The new system uses a pooch registry that will validate that files have the proper checksum on pytest call, which is much faster than checking remote servers. The downside is that changes need to be ported to RavenPy when testing data is changed (the testdata-version.yml workflow will help with this).

SKIP_TEST_DATA is now unused and obsolete.

Other information:

The changes here will break the current raven-testdata structure (for the better). The migration will need to happen progressively as multiple projects (and PAVICS) are reliant on the current raven-testdata structure, so changes will proceed as follows:

  • master remains the default branch and is now fixed.
  • main is currently a copy of master.
  • new-system will be merged to main and tagged using a calendar-based versioning system.
  • The new tagged version of raven-testdata will be set in main.yml::env::RAVEN_TESTDATA_BRANCH and in raven.testing.utils.default_testdata_version

Then, after changes have been made to all repositories, main will become the new default branch. I'm not certain if we need to inform users, but the default RavenPy behaviour will be to refer to main and vYYYY.MM.DD, set by us.

See also: Ouranosinc/raven-testdata#43

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@Zeitsperre
Copy link
Member Author

One of the major issues I'm having is that the testing data comprises lots of large files. Around 45 MB are used in the regular testing of RavenPy, but ~210 MB are downloaded by default at the moment.

Once I've finalized a version of raven-testdata using the versioning system, I can reduce this amount by limiting the registry.txt only to the relevant files.

Copy link

github-actions bot commented Jun 10, 2025

Note

It appears that this Pull Request modifies the main.yml workflow.

On inspection, the RAVEN_TESTDATA_BRANCH environment variable is set to the most recent tag (v2025.6.12).

No further action is required.

@Zeitsperre
Copy link
Member Author

I was running into a race condition where multiple workers were attempting to download and write the remote registry file at the same time. I've adjusted the logic to make better use of lock files so that we don't run into these situations. These changes could/should be ported to other implementations this system (miranda/xhydro/xclim/etc.)

Zeitsperre added a commit to Ouranosinc/raven-testdata that referenced this pull request Jun 12, 2025
### Description

The modifications here completely change the way that testing data is managed in order to provide a versioning system as well as reduce the number of remote requests to GitHub.

Projects that rely on `raven-testdata` will now use `pooch` alongside file registries to track the file and checksums, as well as manage the caches. This repo will also now be using tagged commits to ensure that testing data is kept up-to-date while at the same time facilitating easier backtracking if needed.

### Changes

- All `md5` files have been removed. 
- The `README.md` has been updated to detail the new process for fetching data with `pooch`
- The `report_check_sums.py` script now creates new registries as well as updates the registry table in the `README.md`
- The return characters in the text files have been converted to `LF` to ensure that checksums remain the same when opened on Linux and macOS; lots of redundant whitespace has also been removed from tables.

### What's next

Once this PR is approved, `main` will be tagged using calendar versioning as the first stable version. Once all projects reliant on `raven-testdata` have been updated to the new system, the default branch will change from `master` to `main`.

The `master` branch will be kept for backwards-compatibility.

See also: CSHS-CWRA/RavenPy#513
@Zeitsperre Zeitsperre requested a review from huard June 12, 2025 18:21
@coveralls
Copy link

coveralls commented Jun 12, 2025

Coverage Status

coverage: 80.752% (-1.6%) from 82.35%
when pulling dc35082 on use-pooch
into 1a119bf on master.

@huard
Copy link
Collaborator

huard commented Jun 20, 2025

I thought we had agreed that the name yangtze was not descriptive of what that piece of code is doing. I get that you've redefined get_file to partially hide it from users, but it still appears in the import of every notebook. Sorry, but you'll have to come up with a boring, descriptive name.

Zeitsperre added a commit to Ouranosinc/raven that referenced this pull request Jul 15, 2025
## Overview

Changes:

* Replaced testdata fetching mechanism with `pooch`-based caching system
* Updated tooling to replace a few checks with `ruff`, `Makefile`
linting checks are now more precise
* Synchronized dependencies and added a few base versions for
consistency
* Removed obsolete `setup.cfg`
* `raven` now uses Trusted Publisher for TestPyPI and PyPI releases

## Related Issue / Discussion

This PR is part of an effort to move the `raven-testdata` repo to a
versioned scheme that will make it much easier to track changes between
major releases of `RavenWPS` and `RavenPy`. It also provides much more
thread-safety and prevents needless requests to GitHub when running
`pytest`.

I'm pinning things so that this should be the last release to support
Python 3.9. We should drop that version in the next significant release
(if we do another one).

## Additional Information

CSHS-CWRA/RavenPy#513
Ouranosinc/raven-testdata#43
@Zeitsperre Zeitsperre merged commit 6344b88 into master Jul 15, 2025
19 checks passed
@Zeitsperre Zeitsperre deleted the use-pooch branch July 15, 2025 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use Pooch to handle testing data retrieval
3 participants