Make htod_memcpy_async available on CPU #4640

base: development
Conversation
The reason we have never added CPU versions of these functions is that it might result in unnecessary copies. In many cases, one could simply copy the pointer for CPU cases.
And amrex::Gpu::copy can be used to copy for both GPU and CPU cases. But this issue/complaint keeps coming up. My thought is that we should make it hard for users to write non-optimal code. To me, htod_memcpy etc. are GPU-only concepts. But I can be convinced.
The code I want to clean up in hipace packs and unpacks buffers for MPI. Even on CPU, a memory copy is necessary here. As it is dealing with raw memory, the iterator interface of copyAsync doesn't fit very well; additionally, I find that function more verbose to use due to the double …

For the cases where data is copied to be used in a GPU kernel, which seems to be very common, it is currently quite difficult to do this efficiently on both CPU and GPU. For this, I want to eventually add a container type that has both a CPU and a GPU vector internally when a GPU is used, but only one vector on unified-memory systems or CPUs:

```C++
template<class T>
struct HostDeviceVector {
    amrex::Gpu::DeviceVector<T> m_d_data;
    amrex::Gpu::PinnedVector<T> m_h_data;
    bool m_unified_memory = true;

    T* dataPtr () {
        if (m_unified_memory) {
            return m_h_data.dataPtr();
        } else {
            return m_d_data.dataPtr();
        }
    }

    T& operator[] (std::size_t i) {
        return m_h_data[i];
    }

    void resize (std::size_t i) {
        m_h_data.resize(i);
        if (!m_unified_memory) {
            m_d_data.resize(i);
        }
    }

    void async_copy_d2h () {
        if (!m_unified_memory) {
            amrex::Gpu::dtoh_memcpy_async(m_h_data.dataPtr(), m_d_data.dataPtr(),
                                          m_d_data.size() * sizeof(T));
        }
    }

    void async_copy_h2d () {
        if (!m_unified_memory) {
            amrex::Gpu::htod_memcpy_async(m_d_data.dataPtr(), m_h_data.dataPtr(),
                                          m_h_data.size() * sizeof(T));
        }
    }

    void sync () {
        if (!m_unified_memory) {
            amrex::Gpu::streamSynchronize();
        }
    }
};
```
We have something that is somewhat like that, with one "vector" for CPU builds and two for GPU builds: amrex::Gpu::Buffer.
Very interesting, I didn't know this existed. One flaw I see with the current version of it is that to use the …
Agreed. We need to add more methods for initialization.
## Summary

This PR expands the capabilities of `amrex::Gpu::Buffer<T>` so that it can first be default constructed, then modified by the CPU, and lastly copied over to the GPU, instead of everything happening in the constructor. Example:

```C++
amrex::Gpu::Buffer<int> buf;
buf.resize(n);
for (int i = 0; i < n; ++i) { buf[i] = i*i; }
buf.copyToDeviceAsync();
int* ptr = buf.data();
// Use ptr inside ParallelFor
// optional:
// Change values of ptr inside ParallelFor
buf.copyToHost();
// Use buf.hostData() or buf[] on the CPU
```

## Additional background

Follow-up to discussion in #4640
Summary

This PR makes the 6 GPU memory copy functions available in CPU code, where they all call std::memcpy. This helps to avoid the typical pattern of using #ifdef in user code: