Fill in website #4

Merged Jul 8, 2024 (2 commits)
5 changes: 1 addition & 4 deletions content/en/_index.md
@@ -17,15 +17,12 @@ title: Portable Data-Parallel Python Extensions with oneAPI
</div>
<div class="lead text-center">
<div class="mx-auto mb-5">
<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://IntelPython.github.io/portable-data-parallel-extensions-scipy-2024/docs/">
First<i class="fa-solid fa-question ms-2 "></i>
</a>
<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://github.com/google/docsy-example">
Demonstration<i class="fab fa-github ms-2 "></i>
</a>
<a class="btn btn-lg btn-secondary me-3 mb-4" href="https://github.com/google/docsy-example">
About<i class="fa-solid fa-address-card ms-2 "></i>
</a>
</div>
</div>
</div>
99 changes: 93 additions & 6 deletions content/en/docs/first-app.md
@@ -1,15 +1,102 @@
---
title: First DPC++ app
description: A SYCL and DPC++ "Hello, World!" example.
date: 2024-07-02
weight: 2
---

{{% pageinfo %}}
For an in-depth introduction to SYCL and to accelerator programming, please refer to the "[Data Parallel C++](https://link.springer.com/book/10.1007/978-1-4842-9691-2)" open access e-book.

A SYCL application runs on a SYCL platform: a host connected to one or more heterogeneous devices. The application is structured in three scopes: application scope, command group scope, and kernel scope. The kernel scope specifies a single kernel function that is compiled by the device
compiler and executed on the device. The command group scope specifies a unit of work, which includes the kernel function, the preparation of
its arguments, and execution-ordering information. The application scope comprises all other code outside of command group scope.
Execution of a SYCL application begins in the application scope.

{{% /pageinfo %}}
```cpp
// Compile: icpx -fsycl first.cpp -o first
#include <iostream>
#include <sycl/sycl.hpp>

int main(void) {
  // queue to enqueue work to the default-selected device
  sycl::queue q{sycl::default_selector_v};

  // device allocation
  size_t data_size = 256;
  int *data = sycl::malloc_device<int>(data_size, q);

  // submit a task to populate the device allocation
  sycl::event e_fill =
      q.fill<int>(data, 42, data_size); // built-in kernel

  // submit a kernel to modify the device allocation
  sycl::event e_comp =
      q.submit([&](sycl::handler &cgh) { // command-group scope
        // order execution after the fill task completes
        cgh.depends_on(e_fill);

        sycl::range<1> global_iter_range{data_size};
        cgh.parallel_for(
            global_iter_range,
            [=](sycl::item<1> it) { // kernel scope
              int i = it.get_id(0);
              data[i] += i;
            }
        );
      });

  // copy from device to host,
  // ordered after the modification task completes
  int *host_data = new int[data_size];

  q.copy<int>( // built-in kernel
       data, host_data, data_size, {e_comp}
  ).wait();
  sycl::free(data, q);

  // output the content of the array
  for (size_t i = 0; i < data_size; ++i) {
    std::cout << host_data[i] << " ";
  }
  std::cout << "\n";
  delete[] host_data;

  return 0;
}
```

The device on which the kernel functions execute is controlled by a device selector function, here ``sycl::default_selector_v``.
The default selector assigns a score to every device recognized by the runtime and selects the one with the highest score.
A list of devices recognized by the DPC++ runtime can be obtained by running the ``sycl-ls`` command.

A user of a SYCL application compiled with DPC++ may restrict the set of devices discoverable by the runtime using the
``ONEAPI_DEVICE_SELECTOR`` environment variable. For example:

```bash
# execute on GPU
ONEAPI_DEVICE_SELECTOR=*:gpu ./first
# execute on CPU
ONEAPI_DEVICE_SELECTOR=*:cpu ./first
```

By default, the DPC++ compiler generates offload code for the [SPIR64](https://www.khronos.org/spir/) SYCL target, which is supported by
Intel GPUs as well as by x86_64 CPU devices. Attempting to execute a SYCL program while
selecting only devices that do not support the SPIR language results in an error.

### Targeting other GPUs

DPC++ supports generating offload sections for multiple targets. For example, to compile for both the SPIR and NVPTX targets (the oneAPI for NVIDIA(R) GPUs plugin is assumed to be installed):

```bash
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown first.cpp -o first.out
```

To compile for both the SPIR and AMD GCN targets (the oneAPI for AMD GPUs plugin is assumed to be installed):

```bash
icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa,spir64-unknown-unknown first.cpp -o first.out
```

It is possible to pass additional arguments to a specific SYCL target backend. For example, to target a specific architecture, use:

- ``-Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx1030`` for AMD GPUs
- ``-Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80`` for NVIDIA GPUs
197 changes: 122 additions & 75 deletions content/en/docs/kde-cpp.md
@@ -1,81 +1,128 @@
---
title: KDE DPC++ example
description: KDE (kernel density estimation) example using SYCL and DPC++.
date: 2024-07-02
weight: 2
---

Given a sample of \\(n\\) observations \\(x_i\\) drawn from an unknown underlying continuous distribution \\(f(x)\\),
the kernel density estimate of that density function is computed as follows, for some kernel
smoothing parameter \\(h \in \mathbb{R}\\):

$$
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left(\frac{x - x_i}{h}\right)
$$

An example of NumPy code performing the estimation, for the common choice of the kernel function as the standard
\\(d\\)-dimensional Gaussian density:

<!-- See https://stackoverflow.com/questions/5319754/cross-reference-named-anchor-in-markdown //-->
<a id="kde_numpy" href=""></a>
```python
import numpy as np

def kde(poi: np.ndarray, sample: np.ndarray, h: float) -> np.ndarray:
    """Given a sample from an underlying continuous distribution and
    a smoothing parameter `h`, evaluate the density estimate at each
    point of interest `poi`.
    """
    assert sample.ndim == 2
    assert poi.ndim == 2
    m, d1 = poi.shape
    n, d2 = sample.shape
    assert d1 == d2
    assert h > 0
    dm = np.sum(np.square(poi[:, np.newaxis, ...] - sample[np.newaxis, ...]), axis=-1)
    return np.mean(np.exp(dm / (-2 * h * h)), axis=-1) / np.power(np.sqrt(2 * np.pi) * h, d1)
```

The code above evaluates \\(\hat{f}\\) at \\(m\\) points of interest \\(y_t\\):

$$
\hat{f}(y_t) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left( \frac{1}{h^2} \left\lVert y_t - x_i \right\rVert^{2} \right), \;\;\; \forall \, 0 \leq t < m
$$
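As a quick sanity check of the estimator, the NumPy implementation can be compared against a direct per-term evaluation of the same sum. The sample, the point of interest, and `h` below are arbitrary choices for illustration, not part of the original example:

```python
import numpy as np

def kde(poi, sample, h):
    # same estimator as defined above
    dm = np.sum(np.square(poi[:, np.newaxis, ...] - sample[np.newaxis, ...]), axis=-1)
    return np.mean(np.exp(dm / (-2 * h * h)), axis=-1) / np.power(np.sqrt(2 * np.pi) * h, poi.shape[1])

rng = np.random.default_rng(42)
sample = rng.normal(size=(500, 2))  # n = 500 observations in d = 2
poi = np.zeros((1, 2))              # evaluate the density at the origin
h = 0.5

est = kde(poi, sample, h)

# direct evaluation of the same Gaussian-kernel sum, one term at a time
direct = sum(
    np.exp(-np.sum((poi[0] - x) ** 2) / (2 * h * h)) for x in sample
) / (len(sample) * (np.sqrt(2 * np.pi) * h) ** 2)

assert np.isclose(est[0], direct)
```

The vectorized version and the explicit loop agree term by term; the broadcasting over `poi[:, np.newaxis, ...]` simply materializes all \\(m \times n\\) pairwise squared distances at once.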

Evaluating such an expression can be done in parallel: the evaluation is independent for each \\(t\\).
Furthermore, the summation over \\(i\\) can be partitioned among work-items, each summing \\(n_{wi}\\) distinct terms.
Such work partitioning generates \\(m \cdot \left\lceil {n}/{n_{wi}}\right\rceil\\) independent tasks.
Each work-item can write its partial sum into a dedicated temporary memory location, avoiding a race condition,
for further summation by another kernel operating in a similar fashion.

```cpp
parallel_for(
range<2>(m, ((n + n_wi - 1) / n_wi)),
[=](sycl::item<2> it) {
auto t = it.get_id(0);
auto i_block = it.get_id(1);

T local_partial_sum = ...;

partial_sums[t * ((n + n_wi - 1) / n_wi) + i_block] = local_partial_sum;
}
);
```

Such an approach, known as tree reduction, is implemented in the ``kernel_density_estimation_temps`` function found in
``"steps/kernel_density_estimation_cpp/kde.hpp"``.

The temporary allocation can be avoided if each work-item atomically adds its local sum to the
appropriate zero-initialized location in the output array, as in the implementation ``kernel_density_estimation_atomic_ref``
in the same header file:

```cpp
parallel_for(
range<2>(m, ((n + n_wi - 1) / n_wi)),
[=](sycl::item<2> it) {
auto t = it.get_id(0);
auto i_block = it.get_id(1);

T local_partial_sum = ...;

sycl::atomic_ref<...> f_aref(f[t]);
f_aref += local_partial_sum;
}
);
```

Multiple work-items concurrently updating the same location in global memory still produce the correct result thanks to
the use of ``sycl::atomic_ref``, but at the expense of an increased number of update attempts, a phenomenon known as atomic pressure.
Atomic pressure leads to thread divergence and degrades performance.

To reduce the atomic pressure, work-items can be organized into work-groups. Every work-item in a work-group has access
to local shared memory: dedicated on-chip memory which can be used to cooperatively combine values held by the work-items
of the work-group without accessing global memory. This can be done efficiently by calling the group function
``sycl::reduce_over_group``. To be able to call it, we must specify the iteration range using ``sycl::nd_range`` rather than
``sycl::range`` as we did earlier.

```cpp
auto wg = 256; // work-group-size
auto n_data_per_wg = n_wi * wg;
auto n_groups = ((n + n_data_per_wg - 1) / n_data_per_wg);

range<2> gRange(m, n_groups * wg);
range<2> lRange(1, wg);

parallel_for(
nd_range<2>(gRange, lRange),
[=](sycl::nd_item<2> it) {
auto t = it.get_global_id(0);

T local_partial_sum = ...;

auto work_group = it.get_group();
T sum_over_wg = sycl::reduce_over_group(work_group, local_partial_sum, sycl::plus<>());

if (work_group.leader()) {
sycl::atomic_ref<...> f_aref(f[t]);
f_aref += sum_over_wg;
}
}
);
```
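The work-group arithmetic above can likewise be emulated on the host. In the NumPy sketch below (arbitrary sizes, not taken from the original), one loop iteration plays the role of one work-group: a `reduce_over_group`-style combination of the group's elements is followed by a single leader-only update of the accumulator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_wi, wg = 1000, 4, 8            # arbitrary sizes for illustration
vals = rng.random(n)                # summands for one point of interest

n_data_per_wg = n_wi * wg           # elements consumed by one work-group
n_groups = (n + n_data_per_wg - 1) // n_data_per_wg

total = 0.0
for g in range(n_groups):           # one iteration per work-group
    chunk = vals[g * n_data_per_wg:(g + 1) * n_data_per_wg]
    # reduce_over_group: combine the work-items' partial sums
    sum_over_wg = chunk.sum()
    # only the group leader performs the atomic update
    total += sum_over_wg

assert np.isclose(total, vals.sum())
```

Compared with the per-work-item atomics of the previous variant, only `n_groups` atomic updates reach global memory instead of `m * ceil(n / n_wi)`, which is what relieves the atomic pressure.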

The complete implementation can be found in the ``kernel_density_estimation_work_group_reduce_and_atomic_ref`` function
in ``"steps/kernel_density_estimation_cpp/kde.hpp"``.

These implementations are called from the C++ application ``"steps/kernel_density_estimation_cpp/app.cpp"``, which
samples data uniformly distributed over a unit cuboid and estimates the density using kernel density estimation
with a spherically symmetric multivariate Gaussian probability density function as the kernel.

The application can be built using `CMake` or `Meson`; please refer to the [README](steps/kernel_density_estimation_cpp/README.md) document in that folder.