Overview
This release marks a significant step forward in performance, reliability, and optimization, delivering a more robust and efficient system for our users. We've also expanded our CI testing suite to cover multi-threading, lazy evaluation, and async execution issues, ensuring robust behavior across a growing number of supported platforms.
Matrix Multiplication Improvements
Optimized matrix multiplication kernels with specialized implementations for:
- Matrix-vector (mat@vec)
- Vector-matrix (vec@mat)
- Inner product
- Outer product
We also made the matrix multiplication kernel generation engine more flexible, allowing it to go beyond traditional GEMM (General Matrix Multiply) approaches (the specialized cases are illustrated below).
For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post [1].
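As a rough illustration of the shapes these specializations target, here is a minimal sketch assuming a Burn-style tensor API; the mabor::tensor paths, Backend, Tensor, Distribution, and matmul names are assumptions rather than code taken from this release.
use mabor::tensor::{backend::Backend, Distribution, Tensor};

// Sketch: the same matmul entry point covers every specialized case; the
// operand shapes determine which kernel is selected.
fn matmul_specializations<B: Backend>(device: &B::Device) {
    let (m, k, n) = (64usize, 128, 32);

    let mat = Tensor::<B, 2>::random([m, k], Distribution::Default, device);
    let mat2 = Tensor::<B, 2>::random([k, n], Distribution::Default, device);
    let col = Tensor::<B, 2>::random([k, 1], Distribution::Default, device);
    let row = Tensor::<B, 2>::random([1, k], Distribution::Default, device);

    let _mat_vec = mat.matmul(col.clone());  // [m, k] x [k, 1] -> [m, 1]
    let _vec_mat = row.clone().matmul(mat2); // [1, k] x [k, n] -> [1, n]
    let _inner = row.matmul(col);            // [1, k] x [k, 1] -> [1, 1]
    let _outer = Tensor::<B, 2>::random([m, 1], Distribution::Default, device)
        .matmul(Tensor::<B, 2>::random([1, n], Distribution::Default, device)); // [m, 1] x [1, n] -> [m, n]
}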
Fusion Enhancements
- Improved reliability and performance of Mabor Fusion through advanced optimizations.
- Added support for basic dead code elimination.
- Introduced a new search engine that reorders operations to expose more optimization opportunities, making fusion less sensitive to the order in which tensor operations are issued (sketched below).
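To make the lazy-evaluation angle concrete, here is a rough sketch of the kind of recorded graph these optimizations operate on; it assumes a Burn-style tensor API, and the names below are illustrative rather than taken from the release.
use mabor::tensor::{backend::Backend, Distribution, Tensor};

// Sketch: with fusion enabled, element-wise operations are recorded lazily and
// compiled into fused kernels only when a result is materialized.
fn fused_graph<B: Backend>(device: &B::Device) {
    let a = Tensor::<B, 2>::random([256, 256], Distribution::Default, device);
    let b = Tensor::<B, 2>::random([256, 256], Distribution::Default, device);

    // Never read afterwards: a candidate for dead code elimination.
    let _unused = a.clone().exp();

    // Independent element-wise operations like these can be reordered and
    // fused into a single kernel.
    let result = (a * b).abs() + 1.0;

    // Reading the data forces execution of whatever the optimizer kept.
    let _host = result.into_data();
}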
Multi-Threading and Memory Management
- Resolved critical multi-threading issues by adopting a new approach to supporting multiple concurrent streams (see the sketch after this list).
- Mabor Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management.
- Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
- Fixed bugs related to premature memory deallocation, enhancing memory management stability.
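For context, here is a minimal sketch of the usage pattern these fixes target; it assumes a Burn-style tensor API in which tensors, devices, and tensor data can be sent across threads (the bounds are spelled out explicitly for the sketch), so treat it as an illustration rather than a supported recipe.
use std::thread;

use mabor::tensor::{backend::Backend, Tensor};

// Sketch: each thread records its own operations, which map onto separate
// concurrent streams; all intermediate handles must be released once the
// results are materialized.
fn concurrent_streams<B: Backend + 'static>(device: B::Device)
where
    B::Device: Clone + Send + 'static,
{
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let device = device.clone();
            thread::spawn(move || {
                let x = Tensor::<B, 2>::ones([128, 128], &device);
                let y = x.clone() * x + 1.0;
                y.into_data() // forces execution of the ops recorded on this thread
            })
        })
        .collect();

    for worker in workers {
        let _data = worker.join().expect("worker thread panicked");
    }
}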
CubeCL Config
By default, CubeCL loads its configuration from a TOML file (cubecl.toml or CubeCL.toml) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults. A typical cubecl.toml file might look like this:
[profiling]
logger = { level = "basic", stdout = true }
[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }
[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
Each section configures a different aspect of CubeCL:
- profiling: Controls performance profiling and logging.
- autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
- compilation: Manages kernel compilation logging and cache.
For more info, check out the CubeCL book [2].
Changelog
Breaking: the default stride(s) for pooling modules now match the kernel size instead of defaulting to 1. This will affect output shapes if strides were not explicitly set. To preserve the previous behavior, set the stride(s) explicitly, as shown in the snippets below (the lines prefixed with + mark the additions).
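For reference, the output length of a pooling layer follows the standard formula output_size = floor((input_size + 2 * padding - kernel_size) / stride) + 1 (this is general pooling arithmetic, not something specific to this release). For an unpadded input of length 8 with kernel_size = 2, the old default (stride 1) produces an output of length 7, while the new default (stride 2) produces an output of length 4.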
MaxPool2dConfig
let pool = MaxPool2dConfig::new(kernel_size)
+ .with_strides([1, 1])
.with_padding(PaddingConfig2d::Same)
.init();
MaxPool1dConfig
let pool = MaxPool1dConfig::new(kernel_size)
+ .with_stride(1)
.with_padding(PaddingConfig1d::Same)
.init();
AvgPool2dConfig
let pool = AvgPool2dConfig::new(kernel_size)
+ .with_strides([1, 1])
.with_padding(PaddingConfig2d::Same)
.init();
AvgPool1dConfig
let pool = AvgPool1dConfig::new(kernel_size)
+ .with_stride(1)
.with_padding(PaddingConfig1d::Same)
.init();