v0.4.0 - Delightful Daikon
We are back with a major release that touches all aspects of Celerity, bringing considerable improvements to its APIs, usability and performance.
Thanks to everybody who contributed to this release: @almightyvats @BlackMark29A @facuMH @fknorr @PeterTh @psalz!
HIGHLIGHTS
- Celerity 0.4.0 uses a fully distributed scheduling model, replacing the old master-worker approach. This improves the scheduling complexity of applications with all-to-all communication from O(N^2) to O(N), solving a central scaling bottleneck for many Celerity applications (#186).
- Objects shared between multiple `host_task`s, such as file handles for I/O operations, can now be explicitly managed by the runtime through a new experimental declarative API: a `host_object` encapsulates an arbitrary host-side object, while `side_effect`s are used to read and/or mutate it, analogously to `buffer` and `accessor`. Embracing this new pattern guarantees correct lifetimes and synchronization around these objects (#68).
- The new experimental `fence` API allows accessing buffer and host-object data from the main thread without manual synchronization, reimagining SYCL's host accessors in a way that is more compatible with Celerity's asynchronous execution model (#151).
- The new CMake option `CELERITY_ACCESSOR_BOUNDARY_CHECK` can be set to enable out-of-bounds buffer access detection at runtime inside device kernels, catching errors such as incorrectly specified range mappers at the cost of some runtime overhead. This check is enabled by default for debug builds of Celerity (#178).
- Celerity now expects buffers (and the new host objects) to be captured by reference into command group functions, where it previously required by-value captures. This is in accordance with SYCL 2020 and removes a common source of user errors (#173).
- Last but not least, several significant performance improvements make Celerity even more competitive for real-world HPC applications (#100, #111, #112, #115, #133, #137, #138, #145, #184).
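To give a feel for the new experimental APIs from the highlights above, here is a minimal sketch combining `host_object`/`side_effect` with a `fence`. This is illustrative only: the exact constructor and `celerity::experimental::fence` signatures are assumptions based on the 0.4.0 experimental namespace and may change in later releases, and `output.log` is a made-up file name.

```cpp
#include <fstream>
#include <celerity.h>

int main() {
    celerity::distr_queue q;
    celerity::buffer<int, 1> buf{celerity::range<1>{16}};

    // A host_object wraps an arbitrary host-side object (here: a log file).
    celerity::experimental::host_object<std::ofstream> log{std::ofstream{"output.log"}};

    q.submit([&](celerity::handler& cgh) {
        celerity::accessor acc{buf, cgh, celerity::access::one_to_one{},
                               celerity::write_only, celerity::no_init};
        cgh.parallel_for(celerity::range<1>{16},
                         [=](celerity::item<1> it) { acc[it] = static_cast<int>(it[0]); });
    });

    q.submit([&](celerity::handler& cgh) {
        // A side_effect declares that this host task mutates the host_object,
        // so the runtime orders it against other tasks touching the same object.
        celerity::experimental::side_effect log_se{log, cgh};
        cgh.host_task(celerity::on_master_node, [=] { *log_se << "kernel submitted\n"; });
    });

    // fence: obtain a consistent snapshot of the buffer on the main thread,
    // without host accessors or manual synchronization.
    auto snapshot = celerity::experimental::fence(q, buf).get();
    return 0;
}
```

Note how the runtime infers all ordering from the declared accesses and side effects; no explicit waits are needed.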
Changelog
We recommend using the following SYCL versions with this release:
- DPC++: 61e51015 or newer
- hipSYCL: 24980221 or newer
See our platform support guide for a complete list of all officially supported configurations.
Added
- Introduce new experimental `host_object` and `side_effect` APIs to express non-buffer dependencies between host tasks (#68, 7a5326a)
- Add new `CELERITY_GRAPH_PRINT_MAX_VERTS` config option (#80, d3dd722)
- Named threads for better debugging (#98, 25d769d, #131, ff5fbec)
- Add support for passing device selectors to `distr_queue` constructor (#113, 556b6f2)
- Add new `CELERITY_DRY_RUN_NODES` environment variable to simulate the scheduling of an application on a large number of nodes (without execution or data transfers) (#125, 299ebbf)
- Add ability to name buffers for debugging (#132, 1076522)
- Introduce experimental `fence` API for accessing buffer and host-object data from the main thread (#151, 6b803f8)
- Introduce backend system for vendor-specific code paths (#162, 750f32a)
- Add `CELERITY_USE_MIMALLOC` CMake configuration option to use the mimalloc allocator (enabled by default) (#170, 234e3d2)
- Support 0-dimensional buffers, accessors and kernels (#163, 0685d94)
- Introduce new diagnostics utility for detecting erroneous reference captures into kernel functions, as well as unused accessors (#173, ff7ed02)
- Introduce `CELERITY_ACCESSOR_BOUNDARY_CHECK` CMake option to detect out-of-bounds buffer accesses inside device kernels (enabled by default for debug builds) (#178, 2c738c8)
- Print more helpful error message when buffer allocations exceed available device memory (#179, 79f97c2)
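As a quick sketch of how the new build and runtime knobs fit together (the project name and `my_celerity_app` binary are placeholders, and the option values shown merely restate the defaults):

```shell
# Configure a debug build with accessor bounds checking and mimalloc
# (both already default to ON in this configuration)
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug \
      -DCELERITY_ACCESSOR_BOUNDARY_CHECK=ON \
      -DCELERITY_USE_MIMALLOC=ON

# Simulate scheduling on 256 nodes, without executing kernels or transfers
CELERITY_DRY_RUN_NODES=256 ./my_celerity_app
```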
Changed
- Update spdlog to 1.9.2 (#80, a178828)
- Overhaul logging mechanism (#80, 1b19bfc)
- Improve graph dependency tracking performance (#100, c9dab18)
- Improve task lookup performance (#112, 5139256)
- Introduce epochs as a mechanism for in-graph synchronization (#86, 61dd07e)
- Miscellaneous performance improvements (#115, 9a099d2, #137, b0254fd, #138, 02258c0, #145, f0b53ce)
- Improve scheduler performance by reducing lock contention (#111, 4547b5f)
- Improve graph generation and printing performance (#133, 8122798)
- Use libenvpp to validate all `CELERITY_*` environment variables (#158, b2ced9b)
- Use native ("USM") pointers instead of SYCL buffers for backing buffer allocations (#162, 44497b3)
- Implement `range` and `id` types instead of aliasing SYCL types (#163, 0685d94)
- Disallow in-source builds (#176, 0a96d15)
- Lift restrictions on reductions for DPC++ (#175, efff21b)
- Remove multi-pass mechanism to allow reference capture of buffers and host-objects into command group functions, in alignment with the SYCL 2020 API (#173, 0a743c7)
- Drastically improve performance of buffer data location tracking (#184, adff79e)
- Switch to distributed scheduling model (#186, 0970bff)
Deprecated
- Passing `sycl::device` to `distr_queue` constructor (use a device selector instead) (#113, 556b6f2)
- Capturing buffers and host objects by value into command group functions (capture by reference instead) (#173, 0a743c7)
- `allow_by_ref` is no longer required to capture references into command group functions (#173, 0a743c7)
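The capture change above can be summarized in one sketch (a minimal illustration, not taken verbatim from the Celerity examples; the kernel body is arbitrary):

```cpp
#include <celerity.h>

void zero_fill(celerity::distr_queue& q, celerity::buffer<float, 1>& buf) {
    // Since 0.4.0: capture the buffer by REFERENCE into the command group
    // function ([&]) -- by-value captures are deprecated, and allow_by_ref
    // is no longer needed.
    q.submit([&](celerity::handler& cgh) {
        celerity::accessor acc{buf, cgh, celerity::access::one_to_one{},
                               celerity::write_only, celerity::no_init};
        // The kernel itself still captures its accessors by VALUE ([=]),
        // as in SYCL 2020.
        cgh.parallel_for(buf.get_range(),
                         [=](celerity::item<1> it) { acc[it] = 0.f; });
    });
}
```

This mirrors the SYCL 2020 convention: reference capture at the command-group level, value capture at the kernel level.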
Removed
- Removed support for ComputeCpp (discontinued) (#167, 68367dd)
- Removed deprecated `host_memory_layout` (use `buffer_allocation_window` instead) (#187, f5e6510)
- Removed deprecated kernel dimension template parameter on `one_to_one`, `fixed` and `all` range mappers (#187, 40a12a4)
- Kernels can no longer receive `sycl::item` (use `celerity::item` instead); this was already broken in 0.3.2 (#163, 67ccacc)
Fixed
- Improve performance for buffer transfers on IBM Spectrum MPI (#114, c60527f)
- Increase size limit on individual buffer transfer operations from 2 GiB to 128 GiB (#153, 972682f)
- Fix race between creating collective groups and submitting host tasks (#152, 0a4fca5)
- Align read-accessor `operator[]` with SYCL 2020 spec by returning const-reference instead of value (#156, 5011ded)