Description
TLDR Summary: We have been trying to use the develop branch of reframe-hpc on our system (which runs PBS Pro 2022.1.7), but ran into some issues with pbs.py, possibly due to differences between the PBS version that pbs.py was originally developed for and ours. We wrote a patch for pbs.py to make it work with PBS Pro 2022.1.7 on our system: https://github.com/reframe-hpc/reframe/compare/develop...colleeneb:reframe:develop?expand=1 . Is this useful to contribute back to reframe as a PR and, if so, what should I tag it as? A new feature "PBS Pro Scheduler" or something else?
Details:
The main changes we needed to make were in poll in pbs.py. The comparison at https://github.com/reframe-hpc/reframe/compare/develop...colleeneb:reframe:develop?expand=1 shows the changes we made to get it to work reliably on our system. We would like to contribute them back if they are useful for you, either in pbs.py or as a new "PBS Pro" scheduler type, whichever reframe prefers. Let us know and we can submit this as a PR however you want; I'm not sure whether it counts as a new feature for PBS Pro or a bugfix.
To add more details, a few of the issues we had were:
- The call to "qstat -f {' '.join(job.jobid for job in jobs)}" (reframe/reframe/core/schedulers/pbs.py, line 232 in 36afc5c) returns 35 if any of the jobs in the job list are not in the queue (i.e. if they have finished running). Thus, if one job has left the queue while others are still running, the call still returns 35 and all jobs are marked as completed when they are not actually done.
- output_ready (reframe/reframe/core/schedulers/pbs.py, line 217 in 36afc5c: "def output_ready(job):") is also used as a check on whether the job is done. However, output_ready is true as soon as the output and error files exist, and for PBS Pro at least, these files exist when the job begins to run, not when it ends. It is therefore true from the moment the job starts running and can't be used to know when the job is done. [0]
- In PBS Pro, the state that jobs go into when they are complete is "F" for Finished, not "C" (Table 2-25: Job States in https://help.altair.com/2022.1.0/PBS%20Professional/PBS2022.1.pdf#M54.9.38036.PBSHeading1.151.qstat). Thus the check "if job.state == 'COMPLETED':" (reframe/reframe/core/schedulers/pbs.py, line 296 in 36afc5c) is never true for us since there's no "C" state.
With that in mind, we replaced poll with a poll function for PBS Pro that keeps the same structure, but queries "qstat -xf -F json" to get the output for all the job IDs. Since we use "-x", this list includes the job IDs that have already finished, and we ask for JSON format so it's a bit easier to parse. Then we loop over the jobs and can rely on just checking the state to determine whether each one is finished.
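As a rough illustration of that approach (not the code from the patch itself), here is a minimal sketch of the state check. It assumes that the JSON output of "qstat -x -f -F json" lists jobs under a top-level "Jobs" object keyed by job ID, each with a "job_state" field. The helper names parse_job_states, finished_jobs and query_qstat are hypothetical and do not exist in reframe.

```python
import json
import subprocess

# Assumption: PBS Pro reports finished jobs as "F"; "C" is kept only for
# PBS variants that report completed jobs with that state instead.
FINISHED_STATES = {"F", "C"}


def parse_job_states(qstat_json):
    """Map each job ID found in the qstat JSON output to its job_state."""
    data = json.loads(qstat_json)
    # Assumption: jobs sit under a top-level "Jobs" object keyed by job ID.
    jobs = data.get("Jobs", {})
    return {jobid: info.get("job_state") for jobid, info in jobs.items()}


def finished_jobs(jobids, qstat_json):
    """Return the job IDs whose reported state marks them as finished."""
    states = parse_job_states(qstat_json)
    # Jobs missing from the output are left alone rather than assumed done.
    return [jobid for jobid in jobids if states.get(jobid) in FINISHED_STATES]


def query_qstat(jobids):
    """Run qstat on the given job IDs and return its raw standard output."""
    completed = subprocess.run(
        ["qstat", "-x", "-f", "-F", "json", *jobids],
        capture_output=True, text=True, check=False,
    )
    return completed.stdout


if __name__ == "__main__":
    # Hand-written sample in the assumed layout; real output has more fields.
    sample = '{"Jobs": {"1.srv": {"job_state": "F"}, "2.srv": {"job_state": "R"}}}'
    print(finished_jobs(["1.srv", "2.srv"], sample))
```

The point of this design is that a job is only marked as done when its own state says so, so one finished job no longer causes the others to be treated as completed.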
Thanks!
Credit also to @TApplencourt, @thilinarmtb, @bensallen!
[0] @bensallen has clarified that, on our system, the files are written as soon as the job runs because of how it's set up (we're using $usecp in the pbs_mom configuration, so PBS knows we have shared filesystems and thus writes directly to them instead of copying the files back at the end of the job). However, the clearest way of knowing a job is finished is to check its status via qstat, as that works regardless of the state of the files.