Description
This topic has been discussed extensively on Slack and in the site-pipeline call, but I don't think we have made an issue yet to catalog h5 writing problems, or other similar h5 file problems, with preprocessing. There are ongoing failures when writing files in the cleanup_mandb function in the preprocess_tod and make_atomic_filterbin_map flows on Prefect. They occur because another flow (generally record_qa) has opened the preprocess archive h5 file that these flows want to write to, which locks the file. The error is:
  File "/usr/local/lib/python3.10/site-packages/sotodlib/preprocess/preprocess_util.py", line 1111, in cleanup_mandb
    with h5py.File(dest_file,'a') as f_dest:
  File "/usr/local/lib/python3.10/site-packages/h5py/_hl/files.py", line 564, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/usr/local/lib/python3.10/site-packages/h5py/_hl/files.py", line 250, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 56, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 57, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
BlockingIOError: [Errno 11] Unable to synchronously open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
Example failed flows:
https://prefect.simonsobs.org/flow-runs/flow-run/1de16673-2fc7-4a94-a47c-b862ff573338
https://prefect.simonsobs.org/flow-runs/flow-run/e5ca0262-0837-4406-952e-948227557bf5
@mhasself has proposed a fix for this: use a jobdb across the different functions that read from and write to these archives.
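Separately from the jobdb approach, a possible stopgap is to retry the open when the lock is held. The sketch below is only an illustration under assumptions: open_with_retry and its parameters are hypothetical names that do not exist in sotodlib, and it does not coordinate readers and writers the way a jobdb would. It takes a callable so that the retry logic does not depend on how the file is opened.

```python
import random
import time


def open_with_retry(opener, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call opener() until it succeeds or max_attempts is exhausted.

    Hypothetical helper, not part of sotodlib. Only BlockingIOError
    (the lock-contention error shown above) is retried; any other
    exception propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return opener()
        except BlockingIOError:
            if attempt == max_attempts:
                raise  # give up and surface the original error
            # Exponential backoff with jitter so that competing flows
            # do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

In cleanup_mandb this could wrap the existing call, for example `with open_with_retry(lambda: h5py.File(dest_file, 'a')) as f_dest:`. It only reduces how often the flow fails; a long-running reader can still outlast every retry.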
A separate problem I just encountered, also involving the archive files, occurs when a Slurm job runs out of time while a temp file is being written. This causes the save_group_and_cleanup step at the start of preprocess_tod to fail with:
  File "/global/homes/m/mmccrack/so/git/sotodlib/20250910_lat_catchup/sotodlib/sotodlib/preprocess/preprocess_util.py", line 1118, in cleanup_mandb
    f_src.copy(f_src[f'{dts}'], f_dest, f'{dts}')
  File "/global/u1/m/mmccrack/so/env/20250710_0.2.1_soconda/lib/python3.12/site-packages/h5py/_hl/group.py", line 600, in copy
    h5o.copy(source.id, self._e(source_path), dest.id, self._e(dest_path),
  File "h5py/_objects.pyx", line 56, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 57, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 284, in h5py.h5o.copy
RuntimeError: Unable to synchronously copy object (len not positive after adjustment for EOA)
The fix is straightforward: delete the problematic temp file when the copy fails with this error.
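A minimal sketch of that cleanup, under stated assumptions: copy_or_discard, TRUNCATED_MARKER, and the copy_fn callable are hypothetical names for illustration, not sotodlib functions, and matching on the error message text is an assumption based on the traceback above. The copy step is passed in as a callable so the sketch does not guess at the internals of cleanup_mandb.

```python
import os

# Message fragment taken from the RuntimeError in the traceback above.
# Matching on this text is an assumption and may need adjusting.
TRUNCATED_MARKER = "len not positive after adjustment for EOA"


def copy_or_discard(src_path, copy_fn, remove=os.remove):
    """Run copy_fn(src_path); delete src_path if it looks truncated.

    Hypothetical helper. Returns True if the copy succeeded and False
    if the temp file was discarded.
    """
    try:
        copy_fn(src_path)
    except RuntimeError as err:
        if TRUNCATED_MARKER not in str(err):
            raise  # unrelated failure: keep the file for inspection
        remove(src_path)  # incomplete temp file from a timed-out job
        return False
    return True
```

Only RuntimeError with this specific message triggers the delete, so a lock failure such as the BlockingIOError above would not remove a temp file that is otherwise intact. Any observations held in the discarded file would need to be reprocessed.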