How GridMap works under the hood
To run a function on a bunch of arguments using GridMap, you use either the grid_map or the process_jobs function. grid_map automatically creates a bunch of Job objects and sends them to process_jobs, so in the end, you're always using process_jobs (albeit indirectly with grid_map). When you create a Job object, it saves your function, its arguments, and the path to the module containing the function (since that path needs to be in sys.path for unpickling later).
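To make the relationship between the two entry points concrete, here is a minimal, standard-library-only sketch. It is not GridMap's actual code: SketchJob, sketch_grid_map, and sketch_process_jobs are hypothetical names, the jobs run locally instead of on a cluster, and the way the module path is looked up is just one plausible approach.

```python
import os
import sys


class SketchJob:
    """Hypothetical stand-in for a Job: stores what a remote worker would need."""

    def __init__(self, func, args):
        self.func = func
        self.args = args
        # Directory of the module that defines func. A remote worker would add
        # this to sys.path before unpickling. It is None for modules that have
        # no file on disk (for example, builtins).
        module = sys.modules.get(func.__module__)
        module_file = getattr(module, "__file__", None)
        self.path = os.path.dirname(os.path.abspath(module_file)) if module_file else None


def sketch_process_jobs(jobs):
    """Run each job locally and in order (assumption: no cluster involved)."""
    return [job.func(*job.args) for job in jobs]


def sketch_grid_map(func, args_list):
    """Wrap each argument in a job, then delegate to sketch_process_jobs."""
    jobs = [SketchJob(func, [arg]) for arg in args_list]
    return sketch_process_jobs(jobs)
```

Calling sketch_grid_map(len, ["a", "bb"]) builds two jobs and hands them to sketch_process_jobs, which is the same "always ends up in process_jobs" shape described above.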
process_jobs does the following with the Job objects:
- It creates a JobMonitor instance that uses ØMQ to communicate with the heartbeat and runner processes that will be running on the different machines. The JobMonitor keeps track of what the inputs and outputs are for all of the jobs, and kills/resubmits jobs that have stalled. It also sends error email reports when things go awry.
- It submits a bunch of Grid Engine command-line jobs that call python -m gridmap.runner HOME_ADDRESS JOB_PATH, where HOME_ADDRESS is the URL of the ØMQ JobMonitor and JOB_PATH is the path to the module that the job's function belongs to. When these jobs get executed on the cluster, they:
  - Immediately add JOB_PATH to sys.path.
  - Spawn a separate heartbeat process that repeatedly monitors CPU/memory usage and reports those back to the JobMonitor.
  - Request the job's function and input from the JobMonitor, which are sent as bz2-compressed pickles.
  - Execute the function inside a try/except that catches all exceptions.
  - Send the return value of the function back to the JobMonitor as a bz2-compressed pickle. If the job encountered an exception, that is considered the return value. The text of the stack trace is also sent back to aid in debugging in these cases.
  - Kill the heartbeat process after completion.
- It waits until the JobMonitor has either received valid output from all of the jobs, or any one of the jobs has encountered an exception. If there was an exception, it is re-raised and all jobs are killed.
- It tears down the JobMonitor (and its local heartbeat process) and returns the outputs from all the jobs in a list.
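The runner-side steps (receive a compressed pickle, run the function inside a try/except, send back either the result or the exception plus its stack trace) can be sketched with only the standard library. This is an illustration, not GridMap's implementation: pack, unpack, and run_job are hypothetical helpers, and the ØMQ transport and heartbeat process are left out entirely.

```python
import bz2
import pickle
import traceback


def pack(obj):
    """Pickle an object and compress the bytes with bz2."""
    return bz2.compress(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))


def unpack(blob):
    """Inverse of pack: decompress, then unpickle."""
    return pickle.loads(bz2.decompress(blob))


def run_job(func, args):
    """Run func(*args); on failure, the exception itself becomes the result."""
    try:
        return func(*args), None
    except Exception as exc:
        # Deliberately broad: any ordinary error is returned together with the
        # text of its stack trace. (Exception does not cover KeyboardInterrupt
        # or SystemExit; that is a choice made for this sketch.)
        return exc, traceback.format_exc()


# Simulate one round trip: the monitor sends the job, the runner replies.
request = pack((divmod, (7, 2)))
func, args = unpack(request)
result, trace = unpack(pack(run_job(func, args)))
```

If the call fails instead, for example run_job(divmod, (1, 0)), the first element of the reply is the ZeroDivisionError instance and the second is the formatted stack trace, which is the pair a monitor would need to re-raise the error and report it.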
As a side note, I should also mention that the JobMonitor class is a context manager, so if any exception is raised while we're inside a with statement that instantiates one, we can detect it and automatically try to kill all of the jobs. This includes the KeyboardInterrupt exceptions raised when people hit CTRL-C.
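Here is a rough sketch of that context-manager pattern. SketchMonitor is a hypothetical class, and "killing" a job is modelled by recording its ID; the real JobMonitor does considerably more work on entry and exit.

```python
class SketchMonitor:
    """Hypothetical monitor that kills its jobs if the with-block fails."""

    def __init__(self, job_ids):
        self.job_ids = list(job_ids)
        self.killed = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is not None:
            # Reached for any exception raised in the with-block,
            # including KeyboardInterrupt from CTRL-C.
            self.killed = list(self.job_ids)
        return False  # do not swallow the exception; let it propagate


def demo(exc):
    """Run a with-block that optionally raises exc; return the killed job IDs."""
    monitor = SketchMonitor([1, 2, 3])
    try:
        with monitor:
            if exc is not None:
                raise exc
    except BaseException:
        pass  # only so the demo can inspect the monitor afterwards
    return monitor.killed
```

Because __exit__ receives the exception type no matter what was raised, the cleanup does not need a separate handler for KeyboardInterrupt.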