Is your feature request related to a problem? Please describe.
In my experience, the same training code runs faster in pure C++ than in Python. My workflows rely on distributed compute for training, so a C++ solution for distributed training would be very useful.
Describe the solution you'd like
A library similar to torch.distributed, but usable from C++.