
Releases: tracel-ai/cubecl

v0.6.0

18 Jul 16:12

Summary

CubeCL 0.6.0 introduces significant enhancements to performance, functionality, and compatibility across backends. Key features include n-dimensional convolution, multi-stage matrix multiplication (matmul), and dynamic shared memory support for CUDA. Performance optimizations, such as the reworked into_contiguous and double buffering, improve efficiency, while new functionality such as random number generation, fp8/fp6 support, and recursive profiling extends the library's capabilities.
Bug fixes address issues in the Metal, HIP, Vulkan, and WASM backends, as well as memory alignment and deadlocks.

What's New

Features

Performance Improvements

Bug Fixes

Refactorings

Documentation & Testing

Dependencies & Maintenance


Thank you to all contributors for making CubeCL 0.6.0 possible!

v0.5.0

23 Apr 20:16
59fc96b

CubeCL Release Notes

Features

  • Autotune Rework: Enhanced autotuning with type magic and persistent cache support. (#430, #567, #598, #604, #630, #635)
  • Fast Float Math: Added fast floating-point math operations for SPIR-V. (#432)
  • Tensor Memory Accelerator (TMA): Introduced TMA for faster matmul and im2col convolution. (#533, #584, #572)
  • Uniformity Analysis: Implemented for SPIR-V to optimize kernel execution. (#460)
  • Full Atomic Sum: Added support for full atomic sum operations. (#448)
  • Pipeline API for CUDA: New API to streamline CUDA pipeline operations. (#422)
  • Block-Wise Quantization: Initial support for per-tensor and block-wise quantization in matmul. (#536, #578)
  • Clustering Support: Basic clustering with metadata for distributed workloads. (#560)
  • Min/Max Reduction: Added min/max reduction operations. (#594)
  • Double Buffering Multi-Tasks: Enhanced double buffering for multi-task matmul. (#626)
  • CubeCL Standard Library: Introduced cubecl-std for common utilities. (#431)
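The TMA-accelerated im2col convolution listed above builds on the standard im2col transform, which unrolls each convolution window into a row so the convolution becomes a plain matmul. A minimal CPU reference sketch in Rust, for illustration only (this is not CubeCL's implementation, and the function names are ours):

```rust
/// Unroll a single-channel `h` x `w` input into im2col rows for a
/// `kh` x `kw` kernel (stride 1, no padding). Each output row holds one
/// flattened window, so convolution reduces to a matrix product with the
/// flattened kernel.
fn im2col(input: &[f32], h: usize, w: usize, kh: usize, kw: usize) -> Vec<f32> {
    let out_h = h - kh + 1;
    let out_w = w - kw + 1;
    let mut cols = Vec::with_capacity(out_h * out_w * kh * kw);
    for oy in 0..out_h {
        for ox in 0..out_w {
            for ky in 0..kh {
                for kx in 0..kw {
                    cols.push(input[(oy + ky) * w + (ox + kx)]);
                }
            }
        }
    }
    cols
}

fn main() {
    // 3x3 input, 2x2 kernel -> 4 windows of 4 elements each.
    let input = [1., 2., 3., 4., 5., 6., 7., 8., 9.];
    let cols = im2col(&input, 3, 3, 2, 2);
    assert_eq!(cols.len(), 16);
    assert_eq!(&cols[..4], &[1., 2., 4., 5.]); // first window
}
```

The GPU versions avoid materializing the full im2col matrix by streaming windows through shared memory (and, with TMA, through hardware-driven tile copies), but the indexing logic is the same.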

Performance Improvements

  • Matmul Optimizations:
  • Convolution: Refactored for Burn and added conv2d benchmark. (#500, #531, #631)
  • Reduce Operations: Optimized reduce kernels with stride 0 and bound checks. (#534, #580, #594)
  • Memory Management: Streamlined memory handling and ExclusivePages allocator improvements. (#419, #445, #512, #529)
  • Fusion: Improved kernel fusion for better performance. (#463, #484, #499)

Bug Fixes

  • Matmul:
  • Reduce: Fixed shared sum test and general reduce issues. (#467, #554)
  • SPIR-V: Resolved spirv-dump and mixed kernel feature registration. (#466)
  • WASM: Fixed compilation and Arc-related issues. (#454, #559, #592)
  • HIP: Corrected shuffle intrinsics, bf16 reduce, and ROCm 6.4.0 updates. (#450, #601, #614, #617, #627)
  • Metal: Fixed simdgroup instructions, mulhi, ffs, and cmma synchronization. (#540, #566, #591, #606, #607, #612, #624)
  • Reinterpret Operations: Fixed slice and read/write issues. (#561, #568, #569, #570, #603)
  • Cache and Autotune: Addressed cache file issues and autotune timing/locking. (#517, #521, #598, #604, #630)
  • Miscellaneous: Fixed bitwise unary ops, path issues on Arch Linux, and debug print macro. (#421, #428, #462, #475)

Platform Support

  • ARM64 Compilation: Fixed compilation issues for ARM64. (#413)
  • HIP Bindings: Improved bindings and documentation. (#427, #588)
  • WGPU: Upgraded to versions 24 and 25, with dynamic compiler selection. (#436, #470, #589)
  • Metal MSL CPP Compiler: Added support with WGPU runtime. (#540)
  • Rust: Updated to Rust 1.85.1 and edition 2024. (#532)

Refactorings

  • IR Refactor:
    • Separated IR into its own crate with reflection and semantic categories. (#435, #442)
    • Made IR compatible with no_std. (#456)
  • Matmul:
    • Unified multi-buffer and single-buffer algorithms. (#587)
    • Refactored loaders, stage matmul, and job configurations. (#528, #548, #593, #597, #613)
  • Memory Management:
    • Merged CubeContext and Scope. (#452)
    • Replaced Arcs and improved deallocation. (#443, #454, #512)
  • Error Handling: Consolidated error types into a single type. (#453)
  • CubeLaunch and CubeType: Major refactor of derive macros. (#530)
  • Runtime: Refactored CUDA, backend arguments, and binding passing. (#522, #526, #543)

Developer Experience

  • Debugging: Improved debug symbols, print macro, and general debug tools. (#462, #474, #562)
  • Testing: Added flexible matmul tests, TMA tests, and conv2d benchmarks. (#476, #572, #631)
  • Documentation: Updated README example and added pull request template. (#490, #605)
  • Dependencies:
    • Upgraded rand to 0.9.0 and cudarc to 0.13.9. (#473, #503)
    • Fixed getrandom for no_std. (#477)
  • Macros: Replaced return with terminate macro and added CubeOption. (#449, #494)

Miscellaneous

  • New Operations: Added leading_zeros, find_first_set, plane_ballot, and inclusive/exclusive sum/prod. (#446, #461)
  • Type System: Moved to custom typehash implementation. (#455)
  • Hardware Properties: Added max cube count and dimension to hardware metadata. (#515)
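The bit-manipulation and scan operations listed above have simple CPU counterparts. A hedged Rust sketch of their semantics (find_first_set is shown with the 1-based ffs convention; these helpers mirror, rather than reuse, CubeCL's API):

```rust
/// CPU reference semantics for the listed operations (illustrative only).
/// ffs convention: 1-based index of the lowest set bit, 0 if none is set.
fn find_first_set(x: u32) -> u32 {
    if x == 0 { 0 } else { x.trailing_zeros() + 1 }
}

/// Inclusive prefix sum: element i is the sum of inputs 0..=i.
fn inclusive_sum(xs: &[u32]) -> Vec<u32> {
    xs.iter()
        .scan(0, |acc, &x| { *acc += x; Some(*acc) })
        .collect()
}

/// Exclusive prefix sum: element i is the sum of inputs 0..i.
fn exclusive_sum(xs: &[u32]) -> Vec<u32> {
    xs.iter()
        .scan(0, |acc, &x| { let out = *acc; *acc += x; Some(out) })
        .collect()
}

fn main() {
    assert_eq!(8u32.leading_zeros(), 28); // leading_zeros exists on Rust integers
    assert_eq!(find_first_set(0b1000), 4);
    assert_eq!(inclusive_sum(&[1, 2, 3]), vec![1, 3, 6]);
    assert_eq!(exclusive_sum(&[1, 2, 3]), vec![0, 1, 3]);
}
```

plane_ballot has no scalar analogue: it collects one predicate bit from each lane of a plane (subgroup/warp) into a mask, so it only makes sense inside a kernel.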

v0.4.0

14 Jan 20:36

Matrix Multiplication (Matmul) Improvements:

Refactored configuration for better kernel selection and performance tuning. Added support for batch operations, double buffering, and pipelined processing to enhance throughput and efficiency. Implemented customizable dispatch for non-square matrices and introduced heuristics for kernel selection.
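Double buffering, mentioned above, hides memory latency by staging the next tile into one buffer while computing on the other, then swapping roles each iteration. A schematic CPU sketch of the ping-pong pattern (on the GPU the staging copy runs asynchronously; this only demonstrates the alternating-buffer indexing, and the function is ours, not CubeCL's):

```rust
/// Ping-pong double buffering: while tile i is consumed from one buffer,
/// tile i + 1 is staged into the other. The `% 2` index alternates the
/// roles of the two buffers each iteration.
fn double_buffered_sum(tiles: &[Vec<f32>]) -> f32 {
    let mut buffers: [Vec<f32>; 2] = [Vec::new(), Vec::new()];
    let mut total = 0.0;
    if tiles.is_empty() {
        return total;
    }
    buffers[0] = tiles[0].clone(); // prologue: stage the first tile
    for i in 0..tiles.len() {
        let cur = i % 2;
        if i + 1 < tiles.len() {
            buffers[1 - cur] = tiles[i + 1].clone(); // stage the next tile
        }
        total += buffers[cur].iter().sum::<f32>(); // compute on the current tile
    }
    total
}

fn main() {
    let tiles = vec![vec![1.0, 2.0], vec![3.0], vec![4.0, 5.0]];
    assert_eq!(double_buffered_sum(&tiles), 15.0);
}
```

Pipelined processing generalizes the same idea to more than two in-flight stages.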

New Crate for Reduce Kernels

This release introduces a new crate (cubecl-reduce) that contains optimized reduce kernels working on all platforms.

Compiler and Runtime Optimizations:

Refactored SPIR-V and HIP compilers with support for new features like WMMA intrinsics and improved debug information. Enhanced WebGPU support with better sync mechanisms and hardware property queries. Added support for compile-time constants and improved code generation for various architectures.

New Functionalities:

Added support for additional instructions and broadened type coverage.

Bug Fixes

Fixed various issues with autotuning, particularly in WASM and CUDA environments. Resolved visibility issues with implementation functions in macros, addressed multiple synchronization and compilation bugs across runtime environments, and corrected the handling of specific data types and operations in SPIR-V, WGSL, and CUDA.


v0.3.0

28 Oct 15:47

CubeCL v0.3.0 Release Notes

This release introduces major advancements across platform compatibility, language capabilities, and performance. Key improvements include expanded runtime support, now featuring AMD GPUs via ROCm/HIP and a SPIR-V compiler to boost wgpu performance on Vulkan. The CubeCL language also sees substantial updates, adopting more Rust syntax, compile-time constants, improved generics, enums, and a refined macro system.

Language Features

Runtime Improvements

CUDA

WGPU

HIP/ROCm

SPIR-V

Optimization & Performance

Infrastructure

Math & Operations

Documentation & Examples

Bug Fixes & Maintenance