
Releases: tracel-ai/cubecl

v0.6.0

18 Jul 16:12

Summary

CubeCL 0.6.0 introduces significant enhancements to performance, functionality, and compatibility across backends. Key features include n-dimensional convolution, multi-stage matrix multiplication (matmul), and dynamic shared memory support for CUDA. Performance optimizations, such as the reworked into_contiguous and double buffering, improve efficiency, while new functionality such as random number generation, fp8/fp6 support, and recursive profiling extends the library's capabilities.
Bug fixes address issues in the Metal, HIP, Vulkan, and WASM backends, as well as memory alignment and deadlocks.

What's New

Features

Performance Improvements

Bug Fixes

Refactorings

Documentation & Testing

Dependencies & Maintenance


Thank you to all contributors for making CubeCL 0.6.0 possible!

v0.5.0

23 Apr 20:16
59fc96b

CubeCL Release Notes

Features

  • Autotune Rework: Enhanced autotuning with type magic and persistent cache support. (#430, #567, #598, #604, #630, #635)
  • Fast Float Math: Added fast floating-point math operations for SPIR-V. (#432)
  • Tensor Memory Accelerator (TMA): Introduced TMA for faster matmul and im2col convolution. (#533, #584, #572)
  • Uniformity Analysis: Implemented for SPIR-V to optimize kernel execution. (#460)
  • Full Atomic Sum: Added support for full atomic sum operations. (#448)
  • Pipeline API for CUDA: New API to streamline CUDA pipeline operations. (#422)
  • Block-Wise Quantization: Initial support for per-tensor and block-wise quantization in matmul. (#536, #578)
  • Clustering Support: Basic clustering with metadata for distributed workloads. (#560)
  • Min/Max Reduction: Added min/max reduction operations. (#594)
  • Double Buffering Multi-Tasks: Enhanced double buffering for multi-task matmul. (#626)
  • CubeCL Standard Library: Introduced cubecl-std for common utilities. (#431)
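The TMA-accelerated im2col convolution listed above builds on the standard im2col transform, which unrolls each convolution window into a row so the convolution becomes a plain matmul. A minimal CPU reference sketch in Rust, for illustration only (this is not CubeCL's implementation, and the function names are ours):

```rust
/// Unroll a single-channel `h` x `w` input into im2col rows for a
/// `kh` x `kw` kernel (stride 1, no padding). Each output row holds one
/// flattened window, so convolution reduces to a matrix product with the
/// flattened kernel.
fn im2col(input: &[f32], h: usize, w: usize, kh: usize, kw: usize) -> Vec<f32> {
    let out_h = h - kh + 1;
    let out_w = w - kw + 1;
    let mut cols = Vec::with_capacity(out_h * out_w * kh * kw);
    for oy in 0..out_h {
        for ox in 0..out_w {
            for ky in 0..kh {
                for kx in 0..kw {
                    cols.push(input[(oy + ky) * w + (ox + kx)]);
                }
            }
        }
    }
    cols
}

fn main() {
    // 3x3 input, 2x2 kernel -> 4 windows of 4 elements each.
    let input = [1., 2., 3., 4., 5., 6., 7., 8., 9.];
    let cols = im2col(&input, 3, 3, 2, 2);
    assert_eq!(cols.len(), 16);
    assert_eq!(&cols[..4], &[1., 2., 4., 5.]); // first window
}
```

The GPU versions avoid materializing the full im2col matrix by streaming windows through shared memory (and, with TMA, through hardware-driven tile copies), but the indexing logic is the same.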

Performance Improvements

  • Matmul Optimizations:
  • Convolution: Refactored for Burn and added conv2d benchmark. (#500, #531, #631)
  • Reduce Operations: Optimized reduce kernels with stride 0 and bound checks. (#534, #580, #594)
  • Memory Management: Streamlined memory handling and ExclusivePages allocator improvements. (#419, #445, #512, #529)
  • Fusion: Improved kernel fusion for better performance. (#463, #484, #499)

Bug Fixes

  • Matmul:
  • Reduce: Fixed shared sum test and general reduce issues. (#467, #554)
  • SPIR-V: Resolved spirv-dump and mixed kernel feature registration. (#466)
  • WASM: Fixed compilation and Arc-related issues. (#454, #559, #592)
  • HIP: Corrected shuffle intrinsics, bf16 reduce, and ROCm 6.4.0 updates. (#450, #601, #614, #617, #627)
  • Metal: Fixed simdgroup instructions, mulhi, ffs, and cmma synchronization. (#540, #566, #591, #606, #607, #612, #624)
  • Reinterpret Operations: Fixed slice and read/write issues. (#561, #568, #569, #570, #603)
  • Cache and Autotune: Addressed cache file issues and autotune timing/locking. (#517, #521, #598, #604, #630)
  • Miscellaneous: Fixed bitwise unary ops, path issues on Arch Linux, and debug print macro. (#421, #428, #462, #475)

Platform Support

  • ARM64 Compilation: Fixed compilation issues for ARM64. (#413)
  • HIP Bindings: Improved bindings and documentation. (#427, #588)
  • WGPU: Upgraded to versions 24 and 25, with dynamic compiler selection. (#436, #470, #589)
  • Metal MSL CPP Compiler: Added support with WGPU runtime. (#540)
  • Rust: Updated to Rust 1.85.1 and edition 2024. (#532)

Refactorings

  • IR Refactor:
    • Separated IR into its own crate with reflection and semantic categories. (#435, #442)
    • Made IR compatible with no_std. (#456)
  • Matmul:
    • Unified multi-buffer and single-buffer algorithms. (#587)
    • Refactored loaders, stage matmul, and job configurations. (#528, #548, #593, #597, #613)
  • Memory Management:
    • Merged CubeContext and Scope. (#452)
    • Replaced Arcs and improved deallocation. (#443, #454, #512)
  • Error Handling: Consolidated error types into a single type. (#453)
  • CubeLaunch and CubeType: Major refactor of derive macros. (#530)
  • Runtime: Refactored CUDA, backend arguments, and binding passing. (#522, #526, #543)

Developer Experience

  • Debugging: Improved debug symbols, print macro, and general debug tools. (#462, #474, #562)
  • Testing: Added flexible matmul tests, TMA tests, and conv2d benchmarks. (#476, #572, #631)
  • Documentation: Updated README example and added pull request template. (#490, #605)
  • Dependencies:
    • Upgraded rand to 0.9.0 and cudarc to 0.13.9. (#473, #503)
    • Fixed getrandom for no_std. (#477)
  • Macros: Replaced return with terminate macro and added CubeOption. (#449, #494)

Miscellaneous

  • New Operations: Added leading_zeros, find_first_set, plane_ballot, and inclusive/exclusive sum/prod. (#446, #461)
  • Type System: Moved to custom typehash implementation. (#455)
  • Hardware Properties: Added max cube count and dimension to hardware metadata. (#515)
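The bit-manipulation and scan operations listed above have simple CPU counterparts. A hedged Rust sketch of their semantics (find_first_set is shown with the 1-based ffs convention; these helpers mirror, rather than reuse, CubeCL's API):

```rust
/// CPU reference semantics for the listed operations (illustrative only).
/// ffs convention: 1-based index of the lowest set bit, 0 if none is set.
fn find_first_set(x: u32) -> u32 {
    if x == 0 { 0 } else { x.trailing_zeros() + 1 }
}

/// Inclusive prefix sum: element i is the sum of inputs 0..=i.
fn inclusive_sum(xs: &[u32]) -> Vec<u32> {
    xs.iter()
        .scan(0, |acc, &x| { *acc += x; Some(*acc) })
        .collect()
}

/// Exclusive prefix sum: element i is the sum of inputs 0..i.
fn exclusive_sum(xs: &[u32]) -> Vec<u32> {
    xs.iter()
        .scan(0, |acc, &x| { let out = *acc; *acc += x; Some(out) })
        .collect()
}

fn main() {
    assert_eq!(8u32.leading_zeros(), 28); // leading_zeros exists on Rust integers
    assert_eq!(find_first_set(0b1000), 4);
    assert_eq!(inclusive_sum(&[1, 2, 3]), vec![1, 3, 6]);
    assert_eq!(exclusive_sum(&[1, 2, 3]), vec![0, 1, 3]);
}
```

plane_ballot has no scalar analogue: it collects one predicate bit from each lane of a plane (subgroup/warp) into a mask, so it only makes sense inside a kernel.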

v0.4.0

14 Jan 20:36

Matrix Multiplication (Matmul) Improvements:

Refactored configuration for better kernel selection and performance tuning. Added support for batch operations, double buffering, and pipelined processing to enhance throughput and efficiency. Implemented customizable dispatch for non-square matrices and introduced heuristics for kernel selection.
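Double buffering, mentioned above, hides memory latency by staging the next tile into one buffer while computing on the other, then swapping roles each iteration. A schematic CPU sketch of the ping-pong pattern (on the GPU the staging copy runs asynchronously; this only demonstrates the alternating-buffer indexing, and the function is ours, not CubeCL's):

```rust
/// Ping-pong double buffering: while tile i is consumed from one buffer,
/// tile i + 1 is staged into the other. The `% 2` index alternates the
/// roles of the two buffers each iteration.
fn double_buffered_sum(tiles: &[Vec<f32>]) -> f32 {
    let mut buffers: [Vec<f32>; 2] = [Vec::new(), Vec::new()];
    let mut total = 0.0;
    if tiles.is_empty() {
        return total;
    }
    buffers[0] = tiles[0].clone(); // prologue: stage the first tile
    for i in 0..tiles.len() {
        let cur = i % 2;
        if i + 1 < tiles.len() {
            buffers[1 - cur] = tiles[i + 1].clone(); // stage the next tile
        }
        total += buffers[cur].iter().sum::<f32>(); // compute on the current tile
    }
    total
}

fn main() {
    let tiles = vec![vec![1.0, 2.0], vec![3.0], vec![4.0, 5.0]];
    assert_eq!(double_buffered_sum(&tiles), 15.0);
}
```

Pipelined processing generalizes the same idea to more than two in-flight stages.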

New Crate for Reduce Kernels

This release introduces a new crate (cubecl-reduce) that contains optimized reduce kernels working on all platforms.

Compiler and Runtime Optimizations:

Refactored SPIR-V and HIP compilers with support for new features like WMMA intrinsics and improved debug information. Enhanced WebGPU support with better sync mechanisms and hardware property queries. Added support for compile-time constants and improved code generation for various architectures.

New Functionalities:

Added support for additional instructions and broadened type coverage.

Bug Fixes

Fixed various issues with autotuning, particularly in WASM and CUDA environments. Resolved visibility issues with implementation functions in macros, addressed multiple synchronization and compilation bugs across runtime environments, and corrected the handling of specific data types and operations in SPIR-V, WGSL, and CUDA.


v0.3.0

28 Oct 15:47

CubeCL v0.3.0 Release Notes

This release introduces major advancements across platform compatibility, language capabilities, and performance. Key improvements include expanded runtime support, now featuring AMD GPUs via ROCm/HIP and a SPIR-V compiler to boost wgpu performance on Vulkan. The CubeCL language also sees substantial updates, adopting more Rust syntax, compile-time constants, improved generics, enums, and a refined macro system.

Language Features

Runtime Improvements

CUDA

WGPU

HIP/ROCm

SPIR-V

Optimization & Performance

Infrastructure

Math & Operations

Documentation & Examples

Bug Fixes & Maintenance