Overview
This release marks a significant step forward in performance, reliability, and optimization, delivering a more robust and efficient system for our users. We've also expanded our CI testing suite to cover multi-threading, lazy evaluation, and async execution issues, ensuring robust behavior across a growing number of supported platforms.
Matrix Multiplication Improvements
Optimized matrix multiplication kernels with specialized implementations for:
- Matrix-vector (mat@vec)
- Vector-matrix (vec@mat)
- Inner product
- Outer product
We also made the matrix multiplication kernel generation engine more flexible, allowing it to go beyond traditional GEMM (General Matrix Multiply) approaches (the specialized cases are illustrated below).
For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post [1].
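As a rough illustration of the shapes these specializations target, here is a minimal sketch assuming a Burn-style tensor API; the mabor::tensor paths, Backend, Tensor, Distribution, and matmul names are assumptions rather than code taken from this release.
use mabor::tensor::{backend::Backend, Distribution, Tensor};

// Sketch: the same matmul entry point covers every specialized case; the
// operand shapes determine which kernel is selected.
fn matmul_specializations<B: Backend>(device: &B::Device) {
    let (m, k, n) = (64usize, 128, 32);

    let mat = Tensor::<B, 2>::random([m, k], Distribution::Default, device);
    let mat2 = Tensor::<B, 2>::random([k, n], Distribution::Default, device);
    let col = Tensor::<B, 2>::random([k, 1], Distribution::Default, device);
    let row = Tensor::<B, 2>::random([1, k], Distribution::Default, device);

    let _mat_vec = mat.matmul(col.clone());  // [m, k] x [k, 1] -> [m, 1]
    let _vec_mat = row.clone().matmul(mat2); // [1, k] x [k, n] -> [1, n]
    let _inner = row.matmul(col);            // [1, k] x [k, 1] -> [1, 1]
    let _outer = Tensor::<B, 2>::random([m, 1], Distribution::Default, device)
        .matmul(Tensor::<B, 2>::random([1, n], Distribution::Default, device)); // [m, 1] x [1, n] -> [m, n]
}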
Fusion Enhancements
- Improved reliability and performance of Mabor Fusion through advanced optimizations.
- Added support for basic dead code elimination.
- Introduced a new search engine that reorders operations to expose more optimization opportunities, making fusion less sensitive to the order in which tensor operations are issued (sketched below).
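To make the lazy-evaluation angle concrete, here is a rough sketch of the kind of recorded graph these optimizations operate on; it assumes a Burn-style tensor API, and the names below are illustrative rather than taken from the release.
use mabor::tensor::{backend::Backend, Distribution, Tensor};

// Sketch: with fusion enabled, element-wise operations are recorded lazily and
// compiled into fused kernels only when a result is materialized.
fn fused_graph<B: Backend>(device: &B::Device) {
    let a = Tensor::<B, 2>::random([256, 256], Distribution::Default, device);
    let b = Tensor::<B, 2>::random([256, 256], Distribution::Default, device);

    // Never read afterwards: a candidate for dead code elimination.
    let _unused = a.clone().exp();

    // Independent element-wise operations like these can be reordered and
    // fused into a single kernel.
    let result = (a * b).abs() + 1.0;

    // Reading the data forces execution of whatever the optimizer kept.
    let _host = result.into_data();
}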
Multi-Threading and Memory Management
- Resolved critical multi-threading issues by adopting a new approach to supporting multiple concurrent streams (see the sketch after this list).
- Mabor Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management.
- Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
- Fixed bugs related to premature memory deallocation, enhancing memory management stability.
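For context, here is a minimal sketch of the usage pattern these fixes target; it assumes a Burn-style tensor API in which tensors, devices, and tensor data can be sent across threads (the bounds are spelled out explicitly for the sketch), so treat it as an illustration rather than a supported recipe.
use std::thread;

use mabor::tensor::{backend::Backend, Tensor};

// Sketch: each thread records its own operations, which map onto separate
// concurrent streams; all intermediate handles must be released once the
// results are materialized.
fn concurrent_streams<B: Backend + 'static>(device: B::Device)
where
    B::Device: Clone + Send + 'static,
{
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let device = device.clone();
            thread::spawn(move || {
                let x = Tensor::<B, 2>::ones([128, 128], &device);
                let y = x.clone() * x + 1.0;
                y.into_data() // forces execution of the ops recorded on this thread
            })
        })
        .collect();

    for worker in workers {
        let _data = worker.join().expect("worker thread panicked");
    }
}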
CubeCL Config
By default, CubeCL loads its configuration from a TOML file (cubecl.toml or CubeCL.toml) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults. A typical cubecl.toml file might look like this:
[profiling]
logger = { level = "basic", stdout = true }
[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }
[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
Each section configures a different aspect of CubeCL:
- profiling: Controls performance profiling and logging.
- autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
- compilation: Manages kernel compilation logging and cache.
For more info, check out the CubeCL book [2].
Changelog
Breaking: the default stride(s) for pooling modules now match the kernel size instead of defaulting to 1. This will affect output shapes if strides were not explicitly set. To preserve the previous behavior, set the stride(s) explicitly, as shown in the snippets below (the lines prefixed with + mark the additions).
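For reference, the output length of a pooling layer follows the standard formula output_size = floor((input_size + 2 * padding - kernel_size) / stride) + 1 (this is general pooling arithmetic, not something specific to this release). For an unpadded input of length 8 with kernel_size = 2, the old default (stride 1) produces an output of length 7, while the new default (stride 2) produces an output of length 4.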
MaxPool2dConfig
let pool = MaxPool2dConfig::new(kernel_size)
+ .with_strides([1, 1])
.with_padding(PaddingConfig2d::Same)
.init();
MaxPool1dConfig
let pool = MaxPool1dConfig::new(kernel_size)
+ .with_stride(1)
.with_padding(PaddingConfig1d::Same)
.init();
AvgPool2dConfig
let pool = AvgPool2dConfig::new(kernel_size)
+ .with_strides([1, 1])
.with_padding(PaddingConfig2d::Same)
.init();
AvgPool1dConfig
let pool = AvgPool1dConfig::new(kernel_size)
+ .with_stride(1)
.with_padding(PaddingConfig1d::Same)
.init();