Commit 24f908e

add hapi docs

1 parent 46afa36

2 files changed: +80 -0 lines changed

docs/gpus.rst

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
====
GPUs
====

.. .. contents::


GPUs are supported in Charm4py via the Charm++ HAPI (Hybrid API) interface.
Presently, this support allows asynchronous completion detection of GPU kernels via Charm4py futures,
using the function ``charm.hapiAddCudaCallback``.

The HAPI Charm4py API is:

.. code-block:: python

    def hapiAddCudaCallback(stream, future)

.. note::

    For now, ``charm.hapiAddCudaCallback`` only supports numba and torch streams as input. This function inserts a callback
    into the stream such that when the callback is reached, the corresponding Charm4py future is set.


Examples
--------

.. code-block:: python

    from charm4py import charm
    import time
    import numba.cuda as cuda
    import numpy as np

    @cuda.jit
    def elementwise_sum_kernel(x_in, x_out):
        idx = cuda.grid(1)
        if idx < x_in.shape[0]:
            x_out[idx] = x_in[idx] + x_in[idx]

    def main(args):
        N = 1_000_000
        array_size = (N,)

        s = cuda.stream()
        stream_handle = s.handle.value

        A_host = np.arange(N, dtype=np.float32)

        A_gpu = cuda.device_array(array_size, dtype=np.float32, stream=s)
        B_gpu = cuda.device_array(array_size, dtype=np.float32, stream=s)
        A_gpu.copy_to_device(A_host, stream=s)

        threads_per_block = 128
        blocks_per_grid = (N + (threads_per_block - 1)) // threads_per_block

        print("Launching kernel and inserting callback...")
        start_time = time.perf_counter()
        elementwise_sum_kernel[blocks_per_grid, threads_per_block, s](A_gpu, B_gpu)

        return_fut = charm.Future()
        charm.hapiAddCudaCallback(stream_handle, return_fut)
        return_fut.get()
        kernel_done_time = time.perf_counter()
        print(f"Callback received, kernel finished in {kernel_done_time - start_time:.6f} seconds.")

        B_host = B_gpu.copy_to_host(stream=s)

        s.synchronize()

        sum_result = np.sum(B_host)
        print(f"Sum of result is {sum_result}")

        charm.exit()

    charm.start(main)


The above example demonstrates how to use the Charm4py HAPI interface to insert a callback into a CUDA stream and track
completion of a numba kernel launch.
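
A similar pattern should apply to a torch stream, which the note above lists as the other supported input. The following is a minimal sketch, not taken from the Charm4py docs: it assumes the raw handle exposed by ``torch.cuda.Stream.cuda_stream`` can be passed to ``charm.hapiAddCudaCallback``, mirroring how the numba example passes ``s.handle.value``.

.. code-block:: python

    from charm4py import charm
    import torch

    def main(args):
        s = torch.cuda.Stream()

        A = torch.arange(1_000_000, dtype=torch.float32, device='cuda')
        with torch.cuda.stream(s):
            B = A + A  # elementwise sum enqueued on stream s

        # Insert the callback and block on the future, as in the numba example.
        done_fut = charm.Future()
        charm.hapiAddCudaCallback(s.cuda_stream, done_fut)  # assumes the raw handle is accepted
        done_fut.get()

        print(f"Sum of result is {B.sum().item()}")
        charm.exit()

    charm.start(main)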

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ to the largest supercomputers.
    sections
    pool
    rules
+   gpus

 .. toctree::
    :maxdepth: 2
