Overview
This release marks a significant step forward in performance, reliability, and optimization, delivering a more robust and efficient system for our users. We've expanded our CI testing suite to cover multi-threading, lazy evaluation, and async execution issues, ensuring consistent performance across a growing number of supported platforms.
Matrix Multiplication Improvements
Optimized matrix multiplication kernels with specialized implementations for:
- Matrix-vector (mat@vec)
- Vector-matrix (vec@mat)
- Inner product
- Outer product
We also enhanced the flexibility of the matrix multiplication kernel generation engine, going beyond traditional GEMM (General Matrix Multiply) approaches.
For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post [1].
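To make the four specialized cases concrete, here is a minimal plain-Rust reference sketch showing the shapes each one targets; the matmul_ref helper below is purely illustrative and is not part of Mabor's kernel API.

// Reference matmul over row-major slices: (m x k) @ (k x n) -> (m x n).
// Purely illustrative; not Mabor's kernel API.
fn matmul_ref(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                out[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    out
}

fn main() {
    let a = vec![1.0f32; 4 * 3]; // 4 x 3 matrix
    let b = vec![1.0f32; 3 * 4]; // 3 x 4 matrix
    let x = vec![1.0f32; 3];     // length-3 vector

    let mat_vec = matmul_ref(&a, &x, 4, 3, 1); // mat@vec: (4 x 3) @ (3 x 1) -> 4 x 1
    let vec_mat = matmul_ref(&x, &b, 1, 3, 4); // vec@mat: (1 x 3) @ (3 x 4) -> 1 x 4
    let inner   = matmul_ref(&x, &x, 1, 3, 1); // inner product: (1 x 3) @ (3 x 1) -> scalar
    let outer   = matmul_ref(&x, &x, 3, 1, 3); // outer product: (3 x 1) @ (1 x 3) -> 3 x 3
    println!("{:?} {:?} {:?} {:?}", mat_vec, vec_mat, inner, outer);
}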
Fusion Enhancements
- Improved reliability and performance of Mabor Fusion through advanced optimizations.
- Added support for basic dead code elimination.
- Introduced a new search engine that optimally reorders operations to maximize optimization opportunities, improving resilience to tensor operation ordering.
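As a rough illustration of what dead code elimination means in this context, the toy pass below drops operations whose results never feed a requested output; the types and function names are hypothetical stand-ins, not Mabor Fusion's internal representation.

use std::collections::HashSet;

// A toy operation in a lazily-built graph: it produces `output` from `inputs`.
// Hypothetical types for illustration only; not Mabor Fusion's IR.
#[derive(Debug, Clone)]
struct Op {
    output: usize,
    inputs: Vec<usize>,
}

// Keep only the ops that (transitively) contribute to the requested outputs.
fn eliminate_dead_code(ops: &[Op], required_outputs: &[usize]) -> Vec<Op> {
    let mut live: HashSet<usize> = required_outputs.iter().copied().collect();
    let mut kept = Vec::new();
    // Walk the graph backwards: an op is live only if its output is needed.
    for op in ops.iter().rev() {
        if live.contains(&op.output) {
            live.extend(op.inputs.iter().copied());
            kept.push(op.clone());
        }
    }
    kept.reverse();
    kept
}

fn main() {
    let ops = vec![
        Op { output: 1, inputs: vec![0] }, // used
        Op { output: 2, inputs: vec![0] }, // dead: tensor 2 is never consumed
        Op { output: 3, inputs: vec![1] }, // used (final output)
    ];
    let optimized = eliminate_dead_code(&ops, &[3]);
    assert_eq!(optimized.len(), 2); // the dead op producing tensor 2 was removed
}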
Multi-Threading and Memory Management
- Resolved critical multi-threading issues by adopting a new approach to support multiple concurrent streams.
- Mabor Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management.
- Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
- Fixed bugs related to premature memory deallocation, enhancing memory management stability.
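For context, the kind of check added to CI can be sketched as follows: a test performs some work, drops every handle, and then asserts that the runtime's bookkeeping is empty. The Runtime and Handle types here are hypothetical stand-ins, not Mabor's actual internals.

use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for a runtime that tracks allocated handles;
// not Mabor's actual types, just the shape of the CI leak check.
#[derive(Default)]
struct Runtime {
    live_handles: Arc<Mutex<usize>>,
}

struct Handle {
    counter: Arc<Mutex<usize>>,
}

impl Runtime {
    fn allocate(&self) -> Handle {
        *self.live_handles.lock().unwrap() += 1;
        Handle { counter: Arc::clone(&self.live_handles) }
    }
    fn live_handles(&self) -> usize {
        *self.live_handles.lock().unwrap()
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        *self.counter.lock().unwrap() -= 1;
    }
}

#[test]
fn no_leaked_handles_after_work() {
    let runtime = Runtime::default();
    {
        // Simulate work on concurrent streams that allocates handles.
        let _a = runtime.allocate();
        let _b = runtime.allocate();
    } // handles dropped here
    // The leak check: the runtime's internal state must be fully cleaned up.
    assert_eq!(runtime.live_handles(), 0);
}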
CubeCL Config
By default, CubeCL loads its configuration from a TOML file (cubecl.toml or CubeCL.toml) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults. A typical cubecl.toml file might look like this:
[profiling]
logger = { level = "basic", stdout = true }
[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }
[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
Each section configures a different aspect of CubeCL:
- profiling: Controls performance profiling and logging.
- autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
- compilation: Manages kernel compilation logging and cache.
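As an aside, the lookup behavior described above (current directory, then each parent) can be pictured with a small sketch like the one below; this is just an illustration of the rule, not CubeCL's actual loader code.

use std::path::{Path, PathBuf};

// Illustration of the lookup rule described above: search the current
// directory and each parent for cubecl.toml or CubeCL.toml.
// Not CubeCL's actual loader, just a sketch of the behavior.
fn find_config(start: &Path) -> Option<PathBuf> {
    let mut dir = Some(start);
    while let Some(current) = dir {
        for name in ["cubecl.toml", "CubeCL.toml"] {
            let candidate = current.join(name);
            if candidate.is_file() {
                return Some(candidate);
            }
        }
        dir = current.parent();
    }
    None // no config file found; fall back to defaults
}

fn main() {
    let cwd = std::env::current_dir().expect("no current dir");
    match find_config(&cwd) {
        Some(path) => println!("using config at {}", path.display()),
        None => println!("no config found; using defaults"),
    }
}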
For more info, check out the CubeCL book [2].
Changelog
Breaking: the default stride(s) for pooling modules now match the kernel size instead of defaulting to strides of 1. This will affect output shapes if strides were not explicitly set. To preserve the previous behavior, set the stride(s) explicitly, as shown in the examples below.
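As a concrete illustration of the shape change, using the standard pooling output formula out = floor((in + 2*pad - kernel) / stride) + 1: a 1D input of length 8 with kernel_size = 2 and no padding previously produced an output of length floor((8 - 2) / 1) + 1 = 7, while the new default stride of 2 yields floor((8 - 2) / 2) + 1 = 4.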
MaxPool2dConfig
let pool = MaxPool2dConfig::new(kernel_size)
+ .with_strides([1, 1])
.with_padding(PaddingConfig2d::Same)
.init();
MaxPool1dConfig
let pool = MaxPool1dConfig::new(kernel_size)
+ .with_stride(1)
.with_padding(PaddingConfig1d::Same)
.init();
AvgPool2dConfig
let pool = AvgPool2dConfig::new(kernel_size)
+ .with_strides([1, 1])
.with_padding(PaddingConfig2d::Same)
.init();
AvgPool1dConfig
let pool = AvgPool1dConfig::new(kernel_size)
+ .with_stride(1)
.with_padding(PaddingConfig1d::Same)
.init();