Fill in website #4

Merged Jul 8, 2024 (2 commits)
5 changes: 1 addition & 4 deletions content/en/_index.md
@@ -17,15 +17,12 @@ title: Portable Data-Parallel Python Extensions with oneAPI
</div>
<div class="lead text-center">
<div class="mx-auto mb-5">
<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://IntelPython.github.io/portable-data-parallel-extensions-scipy-2024/docs/">
First<i class="fa-solid fa-question ms-2 "></i>
</a>
<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://github.com/google/docsy-example">
Demonstration<i class="fab fa-github ms-2 "></i>
</a>
<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://github.com/google/docsy-example">
About<i class="fa-solid fa-address-card ms-2 "></i>
</a>
</div>
</div>
</div>
99 changes: 93 additions & 6 deletions content/en/docs/first-app.md
@@ -1,15 +1,102 @@
---
title: First DPC++ app
description: A SYCL and DPC++ "Hello, World!" example.
date: 2024-07-02
weight: 2
---

{{% pageinfo %}}
For an in-depth introduction to SYCL and to accelerator programming, please refer to the "[Data Parallel C++](https://link.springer.com/book/10.1007/978-1-4842-9691-2)" open access e-book.

A SYCL application runs on a SYCL platform: a host connected to one or more heterogeneous devices. The application is structured in three scopes: application scope, command group scope, and kernel scope. The kernel scope specifies a single kernel function that is compiled by the device
compiler and executed on the device. The command group scope specifies a unit of work, which includes the kernel function, the preparation of
its arguments, and execution-ordering information. The application scope comprises all other code outside of command group scope.
Execution of a SYCL application begins in the application scope.

{{% /pageinfo %}}
```cpp
// Compile: icpx -fsycl first.cpp -o first
#include <iostream>
#include <sycl/sycl.hpp>

int main(void) {
  // queue to enqueue work to the default-selected device
  sycl::queue q{sycl::default_selector_v};

  // device allocation
  size_t data_size = 256;
  int *data = sycl::malloc_device<int>(data_size, q);

  // submit a task to populate the device allocation
  sycl::event e_fill =
      q.fill<int>(data, 42, data_size); // built-in kernel

  // submit a kernel to modify the device allocation
  sycl::event e_comp =
      q.submit([&](sycl::handler &cgh) { // command-group scope
        // order execution after the fill task completes
        cgh.depends_on(e_fill);

        sycl::range<1> global_iter_range{data_size};
        cgh.parallel_for(
            global_iter_range,
            [=](sycl::item<1> it) { // kernel scope
              int i = it.get_id(0);
              data[i] += i;
            }
        );
      });

  // copy from device to host,
  // ordered after the modification task completes
  int *host_data = new int[data_size];

  q.copy<int>( // built-in kernel
       data, host_data, data_size, {e_comp}
  ).wait();
  sycl::free(data, q);

  // output the content of the array
  for (size_t i = 0; i < data_size; ++i) {
    std::cout << host_data[i] << " ";
  }
  std::cout << "\n";
  delete[] host_data;

  return 0;
}
```

The device on which the kernel functions execute is controlled by a device selector function, here ``sycl::default_selector_v``.
The default selector assigns a score to every device recognized by the runtime and selects the one with the highest score.
A list of devices recognized by the DPC++ runtime can be obtained by running the ``sycl-ls`` command.

A user of a SYCL application compiled with DPC++ may restrict the set of devices discoverable by the runtime using the
``ONEAPI_DEVICE_SELECTOR`` environment variable. For example:

```bash
# execute on GPU
ONEAPI_DEVICE_SELECTOR=*:gpu ./first
# execute on CPU
ONEAPI_DEVICE_SELECTOR=*:cpu ./first
```

By default, the DPC++ compiler generates offload code for the [SPIR64](https://www.khronos.org/spir/) SYCL target, which is supported by
Intel GPUs as well as by x86_64 CPU devices. Attempting to execute a SYCL program while
selecting only devices that do not support the SPIR language results in an error.

### Targeting other GPUs

DPC++ supports generating offload sections for multiple targets. For example, to compile for both the SPIR and NVPTX targets (the oneAPI for NVIDIA(R) GPUs plugin is assumed to be installed):

```bash
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown first.cpp -o first.out
```

To compile for both the SPIR and AMD GCN targets (the oneAPI for AMD GPUs plugin is assumed to be installed):

```bash
icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa,spir64-unknown-unknown first.cpp -o first.out
```

It is possible to pass additional arguments to a specific SYCL target backend. For example, to target a specific architecture, use:

- ``-Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx1030`` for AMD GPUs
- ``-Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80`` for NVIDIA GPUs
197 changes: 122 additions & 75 deletions content/en/docs/kde-cpp.md
@@ -1,81 +1,128 @@
---
title: KDE DPC++ example
description: KDE (kernel density estimation) example using SYCL and DPC++.
date: 2024-07-02
weight: 2
---

Given a sample of \\(n\\) observations \\(x_i\\) drawn from an unknown underlying continuous distribution \\(f(x)\\),
the kernel density estimate of that density function is computed as follows, for some kernel
smoothing parameter \\(h \in \mathbb{R}\\):

$$
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left(\frac{x - x_i}{h}\right)
$$

An example of NumPy code performing the estimation, for the common choice of the kernel function as the standard
\\(d\\)-dimensional Gaussian density:

<!-- See https://stackoverflow.com/questions/5319754/cross-reference-named-anchor-in-markdown //-->
<a id="kde_numpy" href=""></a>
```python
import numpy as np

def kde(poi: np.ndarray, sample: np.ndarray, h: float) -> np.ndarray:
    """Given a sample from an underlying continuous distribution and
    a smoothing parameter `h`, evaluate the density estimate at each
    point of interest `poi`.
    """
    assert sample.ndim == 2
    assert poi.ndim == 2
    m, d1 = poi.shape
    n, d2 = sample.shape
    assert d1 == d2
    assert h > 0
    dm = np.sum(np.square(poi[:, np.newaxis, ...] - sample[np.newaxis, ...]), axis=-1)
    return np.mean(np.exp(dm / (-2 * h * h)), axis=-1) / np.power(np.sqrt(2 * np.pi) * h, d1)
```

The code above evaluates \\(\hat{f}\\) at \\(m\\) points of interest \\(y_t\\):

$$
\hat{f}(y_t) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left( \frac{1}{h^2} \left\lVert y_t - x_i \right\rVert^{2} \right), \;\;\; \forall \, 0 \leq t < m
$$
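As a quick sanity check of the estimator, the NumPy implementation can be compared against a direct per-term evaluation of the same sum. The sample, the point of interest, and `h` below are arbitrary choices for illustration, not part of the original example:

```python
import numpy as np

def kde(poi, sample, h):
    # same estimator as defined above
    dm = np.sum(np.square(poi[:, np.newaxis, ...] - sample[np.newaxis, ...]), axis=-1)
    return np.mean(np.exp(dm / (-2 * h * h)), axis=-1) / np.power(np.sqrt(2 * np.pi) * h, poi.shape[1])

rng = np.random.default_rng(42)
sample = rng.normal(size=(500, 2))  # n = 500 observations in d = 2
poi = np.zeros((1, 2))              # evaluate the density at the origin
h = 0.5

est = kde(poi, sample, h)

# direct evaluation of the same Gaussian-kernel sum, one term at a time
direct = sum(
    np.exp(-np.sum((poi[0] - x) ** 2) / (2 * h * h)) for x in sample
) / (len(sample) * (np.sqrt(2 * np.pi) * h) ** 2)

assert np.isclose(est[0], direct)
```

The vectorized version and the explicit loop agree term by term; the broadcasting over `poi[:, np.newaxis, ...]` simply materializes all \\(m \times n\\) pairwise squared distances at once.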

Evaluating such an expression can be done in parallel: the evaluation is independent for each \\(t\\).
Furthermore, the summation over \\(i\\) can be partitioned among work-items, each summing \\(n_{wi}\\) distinct terms.
Such work partitioning generates \\(m \cdot \left\lceil {n}/{n_{wi}}\right\rceil\\) independent tasks.
Each work-item can write its partial sum into a dedicated temporary memory location, avoiding a race condition,
for further summation by another kernel operating in a similar fashion.

```cpp
parallel_for(
range<2>(m, ((n + n_wi - 1) / n_wi)),
[=](sycl::item<2> it) {
auto t = it.get_id(0);
auto i_block = it.get_id(1);

T local_partial_sum = ...;

partial_sums[t * ((n + n_wi - 1) / n_wi) + i_block] = local_partial_sum;
}
);
```

Such an approach, known as tree reduction, is implemented in the ``kernel_density_estimation_temps`` function found in
``"steps/kernel_density_estimation_cpp/kde.hpp"``.

The temporary allocation can be avoided if each work-item atomically adds its local sum to the
appropriate zero-initialized location in the output array, as in the implementation ``kernel_density_estimation_atomic_ref``
in the same header file:

```cpp
parallel_for(
range<2>(m, ((n + n_wi - 1) / n_wi)),
[=](sycl::item<2> it) {
auto t = it.get_id(0);
auto i_block = it.get_id(1);

T local_partial_sum = ...;

sycl::atomic_ref<...> f_aref(f[t]);
f_aref += local_partial_sum;
}
);
```

Multiple work-items concurrently updating the same location in global memory still produce the correct result thanks to
the use of ``sycl::atomic_ref``, but at the expense of an increased number of update attempts, a phenomenon known as atomic pressure.
Atomic pressure leads to thread divergence and degrades performance.

To reduce the atomic pressure, work-items can be organized into work-groups. Every work-item in a work-group has access
to local shared memory: dedicated on-chip memory which can be used to cooperatively combine values held by the work-items
of the work-group without accessing global memory. This can be done efficiently by calling the group function
``sycl::reduce_over_group``. To be able to call it, we must specify the iteration range using ``sycl::nd_range`` rather than
``sycl::range`` as we did earlier.

```cpp
auto wg = 256; // work-group-size
auto n_data_per_wg = n_wi * wg;
auto n_groups = ((n + n_data_per_wg - 1) / n_data_per_wg);

range<2> gRange(m, n_groups * wg);
range<2> lRange(1, wg);

parallel_for(
nd_range<2>(gRange, lRange),
[=](sycl::nd_item<2> it) {
auto t = it.get_global_id(0);

T local_partial_sum = ...;

auto work_group = it.get_group();
T sum_over_wg = sycl::reduce_over_group(work_group, local_partial_sum, sycl::plus<>());

if (work_group.leader()) {
sycl::atomic_ref<...> f_aref(f[t]);
f_aref += sum_over_wg;
}
}
);
```
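The work-group arithmetic above can likewise be emulated on the host. In the NumPy sketch below (arbitrary sizes, not taken from the original), one loop iteration plays the role of one work-group: a `reduce_over_group`-style combination of the group's elements is followed by a single leader-only update of the accumulator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_wi, wg = 1000, 4, 8            # arbitrary sizes for illustration
vals = rng.random(n)                # summands for one point of interest

n_data_per_wg = n_wi * wg           # elements consumed by one work-group
n_groups = (n + n_data_per_wg - 1) // n_data_per_wg

total = 0.0
for g in range(n_groups):           # one iteration per work-group
    chunk = vals[g * n_data_per_wg:(g + 1) * n_data_per_wg]
    # reduce_over_group: combine the work-items' partial sums
    sum_over_wg = chunk.sum()
    # only the group leader performs the atomic update
    total += sum_over_wg

assert np.isclose(total, vals.sum())
```

Compared with the per-work-item atomics of the previous variant, only `n_groups` atomic updates reach global memory instead of `m * ceil(n / n_wi)`, which is what relieves the atomic pressure.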

The complete implementation can be found in the ``kernel_density_estimation_work_group_reduce_and_atomic_ref`` function
in ``"steps/kernel_density_estimation_cpp/kde.hpp"``.

These implementations are called from the C++ application ``"steps/kernel_density_estimation_cpp/app.cpp"``, which
samples data uniformly distributed over a unit cuboid and estimates the density using kernel density estimation
with a spherically symmetric multivariate Gaussian probability density function as the kernel.

The application can be built using `CMake` or `Meson`; please refer to the [README](steps/kernel_density_estimation_cpp/README.md) document in that folder.