@mmccrackan (Contributor) commented May 29, 2025

The update_det_match process on Prefect is failing with the error:

State message: Flow run encountered an exception. FileNotFoundError: [Errno 2] No such file or directory: '/so/data/lati6/obs/17268/obs_1726844446_lati6_001/M_index.yaml'

I believe this is related to some detsets failing due to only a handful of detectors having detcal. In the update_det_match output, I see:

Loaded obs_id obs_1698366808_lati1_010. Running matches for detsets:
2025-05-15 08:53:17,925 INFO update_det_match :     - ufm_mv24_1698361003_tune
  0%|                                                                                                       | 0/27 [00:00<?, ?it/s]
2025-05-15 08:53:18,190 ERROR update_det_match : deset ufm_mv24_1698361003_tune failed with float division by zero

This is one of the obs_ids that is failing. It is preceded by:

WARNING: sotodlib.core.metadata.loader: Only 4 of 1794 detectors have data for metadata specified by spec={'db': '/global/cfs/cdirs/sobs/metadata/lat/manifests//det_cal/v0/det_cal_local.sqlite', 'name': 'det_cal'}. Trimming.

This branch just wraps a try/except around the match function and makes sure the detset in question is added to the failed list. This doesn't directly address the underlying problem (older files being missing), but once update_det_match is re-run on NERSC with this change it should skip these files. It will still fail if files for newer obs_ids are missing, which I think is the behavior we want.
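The change described above can be sketched roughly like this. Names such as `run_matches`, `run_match`, and `failed_detsets` are illustrative placeholders, not the actual update_det_match internals:

```python
import logging

# Hypothetical sketch: wrap the per-detset match in a try/except so that
# one failing detset is recorded in the failed list instead of aborting
# the whole run. Function/variable names are illustrative only.
def run_matches(obs_id, detsets, run_match, failed_detsets, logger):
    for detset in detsets:
        try:
            run_match(obs_id, detset)
        except Exception as e:
            logger.error("detset %s failed with %s", detset, e)
            failed_detsets.append(detset)  # recorded so re-runs skip it

# Stub match that reproduces the float-division-by-zero seen in the logs
def fake_match(obs_id, detset):
    if detset == "ufm_mv24_1698361003_tune":
        raise ZeroDivisionError("float division by zero")

failed = []
run_matches("obs_1698366808_lati1_010",
            ["ufm_mv24_1698361003_tune", "ufm_mv19_other_tune"],
            fake_match, failed, logging.getLogger("update_det_match"))
# failed now contains only the detset that raised
```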

@mmccrackan commented:
Okay, this appears to have avoided the error based on the latest LAT det match run:
https://prefect.simonsobs.org/flow-runs/flow-run/cab27d4a-a28c-44a5-922e-c4ae50019732

@mmccrackan requested a review from mhasself, May 30, 2025 17:16
@mhasself (Member) left a comment:

That error message suggests to me that it's trying to analyze a book that is missing. (It's from Sept. 2024, so no surprise.) But I guess you're saying it would not have tried to do that if this were truly a new detset that needs processing. Regardless...

The docstring advertises "if match fails for a *known* reason ..." (my emphasis :) ), and a general try/except is at odds with that, I think. A specific test should be written here.
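One way to narrow the handler as suggested is to catch only the failure modes that have actually been diagnosed. A sketch (the exception tuple is a guess based on the logged errors, not the real update_det_match code):

```python
# Only swallow failures we have actually diagnosed; anything unexpected
# still propagates and fails the flow run. KNOWN_MATCH_ERRORS is a guess
# from the logged "float division by zero", not the real code.
KNOWN_MATCH_ERRORS = (ZeroDivisionError,)

def try_match(run_match, obs_id, detset, failed_detsets):
    try:
        run_match(obs_id, detset)
        return True
    except KNOWN_MATCH_ERRORS:
        failed_detsets.append(detset)  # known failure: record and move on
        return False
```

This keeps the "known reason" contract: a surprising error type still surfaces loudly rather than being silently added to the failed list.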

The obs_id = obs_ids[0] is suspicious, too. I can see how that could fail due to a Book having been auto-cleared. (I.e. obs_ids[0] is gone, but obs_ids[-1] is still around and looking good.)

With automated processes we need to be careful not to just give up at the first signs of trouble -- that can lead to big problems going unnoticed for long periods, because the pipeline so gracefully recovers.

Apologies if I'm not sufficiently understanding the issue and the fix!
