Replies: 4 comments 1 reply
-
Hey! Thanks for reaching out!
Hope this helps!
-
Great, thanks a lot for the answer! I could imagine that the memory issue is solved by avoiding tracking of the computation graph here (i.e. by wrapping this in a …).
Michael
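Something along these lines (a minimal sketch, assuming the mechanism meant above is PyTorch's no-grad context):

```python
import torch

# Minimal sketch: inside torch.no_grad() no computation graph is
# recorded, so intermediate activations are freed right away instead
# of being kept around for a backward pass.
net = torch.nn.Linear(10, 1)
x = torch.randn(4096, 10)

with torch.no_grad():
    out = net(x)  # forward pass only; nothing is retained for backward
assert not out.requires_grad
```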
-
Hi @Aranka-S, did you find a solution to this problem in the end?
-
Hi all,
I've been working on using NPE with an embedding net, instead of summary statistics, as input to the NDE. Additionally, my dataset (multivariate time series), which is now the input to the embedding net, is too large to keep in memory all at once.
With the new training interface, I got training up and running with my own dataloader (I am using sbi release v0.23.3). However, I still have questions about aspects of the pipeline:
1. Z-scoring: by default, the z-scoring parameters are computed from the data passed when the density estimator is constructed (e.g. via `build_maf`). But I cannot pass all my data, of course. Is there any elegant method to still incorporate z-scoring in the sbi pipeline based on what's in the toolbox? Or should I perform z-scoring as a preprocessing step beforehand and then make sure I keep that info so I can z-score all observations on which I want to run inference later (can I save the parameters in the MAF somehow after pre-computing them)? Do you have any insight into how large the impact would be if z-scoring were computed from just a small subset of the data? (I guess passing only a subset the size of what fits into memory during construction would be a solution, though not the preferred one, as the z-score parameters would then be based on very little data.) A possible workaround I have in mind is sketched right after this list.
2. MAP: the runtime and memory usage of the `.map` procedure are also very high. I'm guessing the slowness is just a "normal" consequence of the larger network (embedding net + NDE) that has to be evaluated? But I find the increase in memory usage harder to grasp; could anybody help me understand it? I did find that the peak memory usage seems to decrease if I lower `num_init_samples`, but I can't really explain why this has such a large influence (is the complete neural network saved for each initial sample somehow?). It might be related to the answer in this discussion? Any thoughts on this? And maybe even better: is there anything I can do to mitigate either the runtime or the memory-usage increase? I could compute the MAP on GPU if there is a way around the memory usage? What I am currently trying is sketched at the end of this post.
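For question 1, this is the kind of workaround I have in mind (a sketch only; `streaming_mean_std` is a helper I made up, not an sbi utility): compute the z-scoring parameters in one streaming pass over the dataloader, save them, and apply the standardization as preprocessing both during training and to later observations.

```python
import torch

def streaming_mean_std(dataloader):
    """Per-feature mean/std in one pass over an (x, theta) dataloader,
    using Chan et al.'s pairwise update, so the full dataset never has
    to be in memory at once. (Helper name made up for this sketch.)"""
    count, mean, m2 = 0, None, None
    for x, _theta in dataloader:
        x = x.reshape(x.shape[0], -1).double()  # flatten per-sample features
        n_b = x.shape[0]
        mean_b = x.mean(dim=0)
        m2_b = ((x - mean_b) ** 2).sum(dim=0)
        if mean is None:
            count, mean, m2 = n_b, mean_b, m2_b
        else:
            delta = mean_b - mean
            total = count + n_b
            mean = mean + delta * n_b / total
            m2 = m2 + m2_b + delta.pow(2) * count * n_b / total
            count = total
    std = (m2 / max(count - 1, 1)).sqrt()
    return mean.float(), std.float()

# mean, std = streaming_mean_std(train_loader)  # train_loader: my own dataloader
# torch.save({"mean": mean, "std": std}, "zscore_params.pt")
```

The dataset's `__getitem__` would then return `(x - mean) / std`, and, if I read the net-builder signatures correctly, `build_maf` could be constructed with its z-scoring arguments disabled so that the small batch passed at construction only fixes the shapes. The same saved parameters would be applied to every observation at inference time.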
It seems this became quite an overload of questions, but I would be grateful for any insights or help this community can provide.
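For question 2, this is the call I am experimenting with to shrink the search that `.map` performs (a sketch, not a verified fix; the argument names follow the `posterior.map()` signature in v0.23.x, and `posterior` and `x_o` stand for my trained posterior and my observation):

```python
# As far as I understand, .map() first scores num_init_samples candidate
# points and then refines the best num_to_optimize of them with gradient
# steps, so shrinking both should lower the peak memory of the search.
map_estimate = posterior.map(
    x=x_o,                  # observation to condition on
    num_init_samples=200,   # fewer candidates scored at once (default is larger)
    num_to_optimize=20,     # candidates kept for gradient-based refinement
    num_iter=500,           # optimization steps
)
```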