Conversation

@bartgol (Contributor) commented Aug 1, 2025

Adds a unit test showcasing how to hook up a pytorch emulator to an eamxx atm process.

[BFB]


Notes

  • The cld_fraction emulator is quite rudimentary and therefore does not perform well, but that's beside the point here. The test simply showcases how to hook up a python ML emulator to an eamxx process (a sketch of such a module follows below).
  • So far, we have only tested the python bindings with a CPU backend. Since torch can also work with GPU backends, I want to experiment with enabling the test on GPUs.
  • I have not added a baseline test for the pyml emulator.
  • I am also planning to add a purely C++ test where the ML emulator is implemented via a LAPIS-generated conversion of the pth model currently in the folder, perhaps in a follow-up PR.

Credit to @mahf708 for creating the pytorch model.
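
For context, a minimal sketch of what such a python ML emulator module could look like. This is an assumption for illustration, not the PR's actual code or the eamxx hook API: the CldFracNet name and the nlevs-in/nlevs-out shape mirror the test, but the layer sizes and the init/run entry points are made up.

# Hypothetical emulator module: load the trained .pth weights on CPU and
# expose init/run entry points that the C++ atm process could call through
# the eamxx python bindings.
import torch
import torch.nn as nn

class CldFracNet(nn.Module):
    # Assumed architecture: a small fully-connected net mapping nlevs inputs
    # (e.g. ice cloud fraction per level) to nlevs cloud-fraction outputs.
    def __init__(self, nin, nout):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(nin, 64), nn.ReLU(), nn.Linear(64, nout))

    def forward(self, x):
        return torch.sigmoid(self.net(x))

_model = None

def init(model_file, nlevs):
    # Load the trained weights once, pinned to the CPU backend.
    global _model
    _model = CldFracNet(nlevs, nlevs)
    _model.load_state_dict(torch.load(model_file, map_location=torch.device('cpu')))
    _model.eval()

def run(qi):
    # Emulate cloud fraction from a (ncols, nlevs) numpy array of inputs.
    with torch.no_grad():
        out = _model(torch.from_numpy(qi).float())
    return out.numpy()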

@bartgol bartgol self-assigned this Aug 1, 2025
@bartgol bartgol added BFB PR leaves answers BFB EAMxx Issues related to EAMxx labels Aug 1, 2025
@bartgol bartgol requested a review from mahf708 August 1, 2025 19:16
@mahf708 mahf708 requested a review from ndkeen August 1, 2025 19:22
@mahf708 (Contributor) commented Aug 1, 2025

Looks good, I will review this carefully in a second. I added @ndkeen for awareness (no need to review, but of course welcome to do so).

Review comment (Contributor):

This is "data" (not code), and as such it likely doesn't belong here at all. Let's either put it on the inputdata repo, or create a dedicated public place on GitHub for toy models that we can just wget.

Reply (Contributor Author):

I thought about the inputdata server, but a) it's a relatively small file (<100k), and b) I want to wait until the feature is "stable" before starting to put stuff on the data server. I feel that once data is on the server, it's doomed to stay there (since folks could check out older versions of master). I'd like to give the feature/test some "probation" time...

Comment on lines +54 to +55
model = CldFracNet(nlevs,nlevs)
model.load_state_dict(torch.load(model_file,map_location=torch.device('cpu')))
Review comment (Contributor):

In the design (which we can revisit), I was thinking of a few things:

  1. We actually don't need the model class defined above, fwiw; we can save it along with the weights and instantiate it without the code above (see the sketch right after this list).
  2. I was hoping we would have options for users to run different versions, say:
    1. option 1: run the C++ stuff we have by default
    2. option 2: run regular python stuff that just reproduces the C++ (since we have that)
    3. option 3: run the pytorch python stuff (which is being added in this PR)
  3. I would do some guarding (try ... except type of stuff) to help give informative errors/messages to users (this maybe should be done below).
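
As an illustration of points 1 and 3 above (a sketch under assumptions, not code from this PR): one way to avoid redefining the class is to export a TorchScript version of the model, which torch.jit.load can restore without the CldFracNet definition, and the load can be guarded to give an informative error. The function names and paths below are made up.

# Sketch only: TorchScript bundles the architecture with the weights, so the
# loading side does not need the python class definition.
import torch

def export_scripted(model, path):
    # e.g. path = "cld_frac_scripted.pt" (name is illustrative)
    torch.jit.script(model).save(path)

def load_emulator(path):
    try:
        return torch.jit.load(path, map_location='cpu')
    except Exception as e:
        raise RuntimeError(
            f"Could not load the pyml emulator from '{path}'. "
            "Check that the file exists and that a compatible torch is installed.") from e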

Reply (Contributor Author):

I agree on all counts. In more detail:

  1. Yes, I agree. It's just that I had never used pytorch (/* noob mode on */) and was getting torch errors when loading the full model. That was in the early cycles of the implementation, so maybe I have fixed the core issue and can use the full-model pth file now...
  2. We do that for this test: there are 3 tests that do precisely that. The _py and _cpp tests are already verified to be BFB with each other, while the torch one is of course not BFB with them, but it runs on the same input file.
  3. I do have some try/catch blocks in the C++, but maybe there are other places that I need to guard.

@rljacob (Member) commented Aug 1, 2025

Why don't I see any build or runtime mods pointing to a python install?

@bartgol (Contributor Author) commented Aug 1, 2025

Why don't I see any build or runtime mods pointing to a python install?

The python hooks for eamxx require:

  • EAMXX_ENABLE_PYTHON=ON in the cmake configuration: this is OFF by default, and we're just enabling it in some CI testing for now. As the feature matures, we can think about how to make this CIME-configurable (or maybe even ON by default).
  • The pybind11 package must be available on the system. Right now, we do have it installed on the ghci-snl-cpu runner, so that works. As with the previous point, as the feature matures we can think about adding some pip-install mechanism to ensure its presence (or modules logic in the cime config).

Edit: of course, I forgot to install the torch module...
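
For illustration, a small pre-flight check that a test driver could run to fail early with a clear message when torch is missing. This is a hypothetical helper, not part of the PR; the CI instead relies on the container image having the packages preinstalled, and pybind11 itself is a build-time dependency checked by cmake.

# Hypothetical helper: verify the python-side requirements of the pyml test.
import importlib.util
import sys

def check_python_requirements(modules=("torch",)):
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    if missing:
        sys.exit("pyml test requires python modules that are not installed: " + ", ".join(missing))

if __name__ == "__main__":
    check_python_requirements()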

@bartgol bartgol force-pushed the bartgol/eamxx/cld-fraction-pyml branch from 9268585 to 349884c Compare August 7, 2025 22:08
@bartgol (Contributor Author) commented Aug 8, 2025

Needs to wait for ghci-snl-cpu to switch to new container image with pytorch installed.

@mahf708 (Contributor) commented Aug 8, 2025

Needs to wait for ghci-snl-cpu to switch to new container image with pytorch installed.

Wanna discuss how you're thinking of adding it, or are you cool on your own? Note that gpu-enabled pytorch will be a bit more involved than the cpu one. The pytorch team has good documentation on pypi-like installation, but there are other, more stable ways to handle these difficult python packages.

@bartgol (Contributor Author) commented Aug 8, 2025

For now, the python hooks are only enabled on our CPU machines, and not tested at all on GPU (that's on the todo list). For the cpu CI, I already tested the new patched image, and the test passes. I just need to update the systemctl job on mappy to run the new image, then regenerate baselines (since I'm upping the gcc version from 12 to 13).
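
For when GPU testing is eventually enabled, the usual torch device-selection idiom would apply (purely illustrative; the PR currently exercises only the CPU backend):

# Illustrative only: pick a GPU if torch was built with CUDA support and a
# device is visible, otherwise fall back to the CPU path used today.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# The model and input tensors would then be moved with .to(device).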

@bartgol bartgol marked this pull request as draft August 15, 2025 00:26
@bartgol bartgol marked this pull request as ready for review August 15, 2025 00:26
@bartgol bartgol requested a review from mahf708 August 15, 2025 01:40
@bartgol bartgol force-pushed the bartgol/eamxx/cld-fraction-pyml branch from 349884c to 269e89b Compare August 21, 2025 00:02
- Add (empty) init method to the module
- Use try/catch blocks when calling methods (for debugging)
Note: the emulator is NOT good, it just showcases the capability
@bartgol bartgol force-pushed the bartgol/eamxx/cld-fraction-pyml branch from 269e89b to bb66f88 Compare August 21, 2025 00:45
@mahf708 (Contributor) left a comment:

Air kiss

@bartgol bartgol marked this pull request as draft September 8, 2025 16:04
@bartgol bartgol marked this pull request as ready for review September 8, 2025 16:04
@bartgol (Contributor Author) commented Sep 8, 2025

Retriggering tests, since it's been a while. If all passes, I will integrate.

bartgol added a commit that referenced this pull request Sep 8, 2025
Adds a unit test showcasing how to hook up a pytorch emulator to an eamxx atm process.

[BFB]
@bartgol bartgol merged commit d7436f4 into master Sep 8, 2025
30 checks passed
@bartgol bartgol deleted the bartgol/eamxx/cld-fraction-pyml branch September 8, 2025 18:00