Description
Hello, is there any documentation on how to effectively use num_workers and use_batched_sampling? I am running into very long run times with run_sbc and I am not sure what's going wrong. Here's how I'm calling the function:
ranks, dap_samples = run_sbc(
    thetas=thetas,
    xs=xs,
    posterior=posterior,
    num_posterior_samples=nsamples,
    show_progress_bar=True,
    num_workers=ncpus,
)
I have 1000 simulations and I set nsamples to be 1000. When I toggle the above between use_batched_sampling=False and use_batched_sampling=True (default) in the function call, the former at least gives me a progress update, although it still doesn't finish.
Looking through the code, I think the bottleneck might be max_sampling_batch_size, which is set to 10,000? The parameter is not exposed though (at least when you build a posterior via inference.build_posterior). I did set simulation_batch_size in simulate_for_sbi (to int(nsims/ncpus)), but I don't think that gets communicated to the DirectPosterior object.
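For scale, here is a minimal sketch of the numbers involved. It assumes ncpus equals the 35 CPUs requested via sbatch, and it only compares sizes; whether max_sampling_batch_size is actually applied to the total number of draws is not something I have confirmed.

```python
# Values taken from the post; ncpus = 35 is an assumption based on cpus-per-task=35.
nsims = 1000                      # number of simulations (thetas, xs pairs)
nsamples = 1000                   # posterior samples requested per simulation
ncpus = 35                        # assumed to match the sbatch request
max_sampling_batch_size = 10_000  # value read from the code, as described above

# Total number of posterior draws that an SBC run over all simulations needs.
total_draws = nsims * nsamples

# The simulation batch size passed to simulate_for_sbi.
simulation_batch_size = int(nsims / ncpus)

# Pure size comparison: how many times larger the total is than the cap.
ratio_to_cap = total_draws / max_sampling_batch_size

print(f"total posterior draws: {total_draws}")
print(f"simulation_batch_size: {simulation_batch_size}")
print(f"total draws / max_sampling_batch_size: {ratio_to_cap}")
```

So the run needs one million posterior draws in total, which is 100 times the 10,000 cap, while simulation_batch_size works out to 28.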
I run into the same issue with run_tarp, which doesn't have use_batched_sampling exposed (although #1321 should enable that once it's merged).
I use cpus-per-task=35 in my sbatch script and have confirmed that 35 CPUs are indeed available. With the default use_batched_sampling=True, the run_sbc call seems to be stuck at 1/1000 even after 5 hours. With use_batched_sampling=False, it barely passes 100/1000 after 12 hours (even though the progress bar's time estimate suggests otherwise).
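To put the observed progress in perspective, here is a rough linear extrapolation. The helper name extrapolate_total_hours is hypothetical, and the estimate assumes the rate stays constant, which may not hold.

```python
def extrapolate_total_hours(done: int, total: int, elapsed_hours: float) -> float:
    """Linearly extrapolate total run time from the progress observed so far."""
    if done <= 0:
        raise ValueError("need at least one completed item to extrapolate")
    return elapsed_hours * total / done


# Observed with use_batched_sampling=False: about 100 of 1000 after 12 hours.
estimate = extrapolate_total_hours(done=100, total=1000, elapsed_hours=12.0)
print(f"estimated total: {estimate:.0f} hours")
```

At that rate the full run would take about 120 hours, or roughly 7 minutes per observation across all workers.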
I'd really appreciate some help. I am starting to unpack run_sbc since I can't think of anything else, but thought I'd ask here in case I'm missing something. My understanding is that my call never makes it past get_posterior_samples_on_batch (which calls posterior.sample_batched).
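One way to narrow this down would be to time a single sampling call outside run_sbc. Below is a minimal sketch of such a timing helper; time_call and fake_sample are hypothetical names, and fake_sample is only a stand-in so the snippet runs without a trained posterior.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for a single-observation sampling call. In the real setup this
# would presumably be something like:
#     lambda: posterior.sample((nsamples,), x=xs[0])
def fake_sample(n):
    return [0.0] * n


samples, seconds = time_call(fake_sample, 1000)
print(f"{len(samples)} samples in {seconds:.6f} s")
```

If one observation alone already takes minutes, the slowdown is in sampling itself rather than in the parallelization or batching around it.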
Thank you!