Preprocess h5 read/write problems #1376

@mmccrackan

Description

This topic has been discussed extensively on Slack and in the site-pipeline call, but I don't think we have opened an issue yet to catalog h5 writing and other similar h5 file problems in preprocessing. There are ongoing failures when writing files in the cleanup_mandb function in the preprocess_tod and make_atomic_filterbin_map flows on Prefect. They occur because another flow (generally record_qa) has opened, and therefore locked, the preprocess archive h5 file that these flows want to write to. The error is:

File "/usr/local/lib/python3.10/site-packages/sotodlib/preprocess/preprocess_util.py", line 1111, in cleanup_mandb
    with h5py.File(dest_file,'a') as f_dest:
  File "/usr/local/lib/python3.10/site-packages/h5py/_hl/files.py", line 564, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/usr/local/lib/python3.10/site-packages/h5py/_hl/files.py", line 250, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 56, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 57, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
BlockingIOError: [Errno 11] Unable to synchronously open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

Example failed flows:
https://prefect.simonsobs.org/flow-runs/flow-run/1de16673-2fc7-4a94-a47c-b862ff573338
https://prefect.simonsobs.org/flow-runs/flow-run/e5ca0262-0837-4406-952e-948227557bf5

@mhasself has proposed a fix for this, which is to use a jobdb across the different functions that read from and write to these archives.
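Separately from the jobdb approach, a possible stopgap is to retry the open when the lock is held. The following is only a sketch, not what sotodlib does: open_with_retry and its parameters are hypothetical names, and it assumes the lock is released within a few seconds. It relies only on the fact, visible in the traceback above, that the failed open raises BlockingIOError.

```python
import time


def open_with_retry(open_fn, retries=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call open_fn(), retrying with exponential backoff on BlockingIOError.

    open_fn is any zero-argument callable that opens the file, for example
    lambda: h5py.File(dest_file, 'a'). Hypothetical helper, not part of
    sotodlib or h5py.
    """
    for attempt in range(retries):
        try:
            return open_fn()
        except BlockingIOError:
            # Another process holds the file lock. Give up on the last attempt.
            if attempt == retries - 1:
                raise
            sleep(delay)
            delay *= backoff
```

In cleanup_mandb this might be used as `with open_with_retry(lambda: h5py.File(dest_file, 'a')) as f_dest:`. Retrying only hides short-lived contention; it does not replace proper coordination between the flows.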

A separate problem I just encountered, also involving the archive files, occurs when a Slurm job runs out of time while a temp file is being written out. This causes the save_group_and_cleanup step at the start of preprocess_tod to fail with:

 File "/global/homes/m/mmccrack/so/git/sotodlib/20250910_lat_catchup/sotodlib/sotodlib/preprocess/preprocess_util.py", line 1118, in cleanup_mandb
    f_src.copy(f_src[f'{dts}'], f_dest, f'{dts}')
  File "/global/u1/m/mmccrack/so/env/20250710_0.2.1_soconda/lib/python3.12/site-packages/h5py/_hl/group.py", line 600, in copy
    h5o.copy(source.id, self._e(source_path), dest.id, self._e(dest_path),
  File "h5py/_objects.pyx", line 56, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 57, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 284, in h5py.h5o.copy
RuntimeError: Unable to synchronously copy object (len not positive after adjustment for EOA)

The fix is straightforward: delete the problematic file when the copy fails with this error.
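A minimal sketch of that fix, under stated assumptions: copy_or_discard, copy_fn, and the remove parameter are hypothetical names, and the check matches on the "EOA" text from the RuntimeError above, which is fragile if the HDF5 message changes. Any other RuntimeError is re-raised so unrelated failures are not silently dropped.

```python
import os


def copy_or_discard(src_path, copy_fn, remove=os.remove):
    """Run copy_fn(src_path); delete src_path if it looks truncated.

    copy_fn is a callable that copies the groups from the temp file into
    the archive. Returns True if the copy succeeded, False if the source
    file was deleted. Hypothetical helper, not part of sotodlib.
    """
    try:
        copy_fn(src_path)
        return True
    except RuntimeError as err:
        # Only handle the truncated-file error seen in the traceback.
        if "EOA" not in str(err):
            raise
        remove(src_path)
        return False
```

The observation whose temp file was removed would then need to be reprocessed on a later run.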
