Overview
Welcome to the CubeCL Book 👋
This book will help you get started with your high-performance computing project using CubeCL, making sure you get the most out of any hardware.
Why CubeCL is Important
In the current landscape of high-performance computing, developers face several significant challenges that CubeCL aims to address:
Complexity in Performance Optimization
Optimizing compute kernels for different hardware is inherently complex. Developers must understand the intricacies of each platform’s architecture and API, leading to a steep learning curve and the risk of suboptimal performance. The need for manual tuning and platform-specific adjustments often results in code that is difficult to maintain and extend.
The simplest way to solve this problem is to provide high-level abstractions that can be composed in a variety of ways. All of those variations can be autotuned to select the best settings for the current hardware and problem at hand.
Lack of Portability
Portability remains a significant issue. Code written for one API or even for a single GPU architecture often cannot be easily transferred to another, hindering the ability to develop software that can run across diverse hardware environments.
The GPU computing ecosystem is fragmented, with different hardware platforms like CUDA, ROCm, Metal, and Vulkan requiring their own specialized codebases. This fragmentation forces developers to write and maintain separate implementations for each platform, increasing both development time and complexity.
Conclusion
In essence, these challenges underscore the need for a more unified and developer-friendly approach to high-performance computing. CubeCL seeks to bridge these gaps by addressing the core issues within the current ecosystem, offering a new direction for high-performance and portable computing solutions.
Getting Started
In this section, we’ll walk through the installation process for CubeCL, explain the process of writing an optimized CubeCL kernel, and provide some examples to help you start experimenting with CubeCL. By the end, you'll have a solid understanding of how CubeCL integrates into your workflow and how to leverage its features for high-performance computing.
Tutorial
- Installation
- Simple Reduction
- Benchmark
- Parallel Reduction
- Vectorized Reduction
- Parallel Reduction 3D
- Examples
Installation
Installing CubeCL is straightforward. It’s available as a Rust crate, and you can add it to your project by updating your Cargo.toml:
[dependencies]
# Replace with the latest version from crates.io and enable the desired
# runtime features (e.g., wgpu, cuda, hip).
cubecl = { version = "0.6.0", features = ["wgpu"] }
The more challenging aspect is ensuring that you have the necessary drivers to run the selected runtime.
CubeCL supports multiple GPU runtimes, each requiring specific drivers or frameworks. Enable the appropriate feature flag in Cargo.toml and ensure the corresponding drivers are installed.
Platform | Runtime | Supported OS | Requirements | Installation/Notes | Feature Flag |
---|---|---|---|---|---|
WebGPU | wgpu | Linux, Windows, macOS, wasm | Vulkan drivers (typically pre-installed on modern OSes) | On Linux, install the Vulkan driver via your package manager. | wgpu |
CUDA | CUDA | Linux, Windows | NVIDIA CUDA drivers and toolkit | Download and install from the NVIDIA CUDA Downloads page. Verify installation with nvidia-smi. | cuda |
ROCm | HIP | Linux, Windows | AMD ROCm framework | Linux: Follow the ROCm Linux Quick Start. Windows: See the ROCm Windows Installation Guide. | hip |
Metal | wgpu | macOS | Apple device with Metal support (macOS 10.13 or later) | No additional drivers needed; Metal is built into macOS. | wgpu-msl |
Vulkan | wgpu | Linux, Windows | Vulkan drivers | On Linux, install via your package manager; on Windows, Vulkan is typically included with the GPU drivers (NVIDIA/AMD). | wgpu-spirv |
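For example, to target NVIDIA GPUs through the CUDA runtime instead of WebGPU, you would enable the corresponding feature flag (a minimal sketch; check crates.io for the current version):
[dependencies]
cubecl = { version = "0.6.0", features = ["cuda"] }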
Simple Reduction
To get started with CubeCL, we will implement a simple reduction operation on a multidimensional array (tensor). This example will help you understand the basic concepts of CubeCL and how to use it to perform parallel computations on tensors.
An example of a CPU reduction in Rust without CubeCL
This example demonstrates how to perform a simple reduction operation on a multidimensional array (tensor) using pure Rust. The code is designed to be easy to understand and serves as a starting point for more complex operations that can be parallelized with CubeCL. It is not optimized for performance, but it illustrates the basic concepts of working with tensors and performing reductions.
CpuTensor Struct
Tensors are the basic data structure used in CubeCL to represent multidimensional arrays. Here is a simple implementation of a tensor in pure Rust. It is not optimized for performance, but it is easy to understand.
/// Example of a naive multidimensional tensor in pure Rust
#[derive(Debug, Clone)]
pub struct CpuTensor {
/// Raw contiguous value buffer
pub data: Vec<f32>,
/// Number of elements between consecutive indices of each dimension
pub strides: Vec<usize>,
/// Size of each dimension of the tensor
pub shape: Vec<usize>,
}
/// Function to compute strides in a compact layout
fn compact_strides(shape: &[usize]) -> Vec<usize> {
let rank = shape.len();
let mut strides = vec![1; rank];
for i in (0..rank - 1).rev() {
strides[i] = strides[i + 1] * shape[i + 1];
}
strides
}
impl CpuTensor {
/// Create a CpuTensor with the given shape, filled with increasing numbers starting from 0
pub fn arange(shape: Vec<usize>) -> Self {
let size = shape.iter().product();
let data = (0..size).map(|i| i as f32).collect();
let strides = compact_strides(&shape);
Self {
data,
strides,
shape,
}
}
/// Create a zero-filled CpuTensor with the given shape
pub fn empty(shape: Vec<usize>) -> Self {
let size = shape.iter().product();
let data = vec![0.0; size];
let strides = compact_strides(&shape);
Self {
data,
strides,
shape,
}
}
/// Read the inner data
pub fn read(self) -> Vec<f32> {
self.data
}
}
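For instance, a compact 3x4 matrix has strides [4, 1]: moving one step along the first dimension skips a whole row of 4 elements, while moving along the last dimension skips a single element. Here is a quick illustrative check (not part of the book's code) using the compact_strides function above:
fn main() {
    // One row of a 3x4 matrix spans 4 elements, one column spans 1.
    assert_eq!(compact_strides(&[3, 4]), vec![4, 1]);
}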
Reduce function
The following function is a naive implementation of a reduction operation on a matrix. It sums the values of each row and stores the result in a new tensor. The input tensor is expected to be a 2D matrix, and the output tensor will be a 1D vector containing the sum of each row.
use cubecl_example::cpu_tensor::CpuTensor; // Change to the path of your own module containing the CpuTensor
/// This function reduces the last dimension of a 2D matrix by summing over each row:
/// [0 1 2] [0 + 1 + 2] [3 ]
/// [3 4 5] -> [3 + 4 + 5] -> [12]
/// [6 7 8] [6 + 7 + 8] [21]
fn reduce_matrix(input: &CpuTensor, output: &mut CpuTensor) {
for i in 0..input.shape[0] {
let mut acc = 0.0f32;
for j in 0..input.shape[1] {
acc += input.data[i * input.strides[0] + j];
}
output.data[i] = acc;
}
}
Launching code
The following code creates a 3x3 matrix, initializes the input tensor, and calls the reduce_matrix
function to perform the reduction. The result is printed to the console.
use cubecl_example::cpu_tensor::CpuTensor; // Change to the path of your own module containing the CpuTensor
/// This function reduces the last dimension of a 2D matrix by summing over each row:
/// [0 1 2] [0 + 1 + 2] [3 ]
/// [3 4 5] -> [3 + 4 + 5] -> [12]
/// [6 7 8] [6 + 7 + 8] [21]
fn reduce_matrix(input: &CpuTensor, output: &mut CpuTensor) {
for i in 0..input.shape[0] {
let mut acc = 0.0f32;
for j in 0..input.shape[1] {
acc += input.data[i * input.strides[0] + j];
}
output.data[i] = acc;
}
}
fn launch() {
let input_shape = vec![3, 3];
let output_shape = vec![3];
let input = CpuTensor::arange(input_shape);
let mut output = CpuTensor::empty(output_shape);
reduce_matrix(&input, &mut output);
println!("Executed reduction => {:?}", output.read());
}
fn main() {
launch();
}
A first example of a GPU reduction with CubeCL
This example demonstrates how to perform a simple reduction operation on a multidimensional array (tensor) using CubeCL. It is a simple implementation that will be used as a starting point to show how to use CubeCL in the next chapters.
GpuTensor struct
The GpuTensor
struct is a representation of a tensor that resides on the GPU. It contains the data handle, shape, strides, and marker types for the runtime and floating-point type. The GpuTensor
struct provides methods to create tensors, read data from the GPU, and convert them into tensor arguments for kernel execution. Please note that it is generic over the runtime and floating-point type, allowing it to work with different CubeCL runtimes and floating-point types (e.g., f16
, f32
). Also, the strides can be computed using the compact_strides
function from the cubecl::std::tensor
module, which returns the strides of a compact (contiguous) layout for a given shape.
Another important concept is the ComputeClient
trait, which defines what a runtime must implement to be able to run kernels. Each runtime has its own implementation of the ComputeClient
trait, which provides methods to create tensors and read data from the GPU. The ComputeClient
can send compute tasks to a Server
that schedules the tasks and runs the kernels on the GPU.
use std::marker::PhantomData;
use cubecl::{prelude::*, server::Handle, std::tensor::compact_strides};
/// Simple GpuTensor
#[derive(Debug)]
pub struct GpuTensor<R: Runtime, F: Float + CubeElement> {
data: Handle,
shape: Vec<usize>,
strides: Vec<usize>,
_r: PhantomData<R>,
_f: PhantomData<F>,
}
impl<R: Runtime, F: Float + CubeElement> Clone for GpuTensor<R, F> {
fn clone(&self) -> Self {
Self {
data: self.data.clone(), // Handle is a pointer to the data, so cloning it is cheap
shape: self.shape.clone(),
strides: self.strides.clone(),
_r: PhantomData,
_f: PhantomData,
}
}
}
impl<R: Runtime, F: Float + CubeElement> GpuTensor<R, F> {
/// Create a GpuTensor with the given shape, filled with increasing numbers starting from 0
pub fn arange(shape: Vec<usize>, client: &ComputeClient<R::Server, R::Channel>) -> Self {
let size = shape.iter().product();
let data: Vec<F> = (0..size).map(|i| F::from_int(i as i64)).collect();
let data = client.create(F::as_bytes(&data));
let strides = compact_strides(&shape);
Self {
data,
shape,
strides,
_r: PhantomData,
_f: PhantomData,
}
}
/// Create an empty GpuTensor with the given shape
pub fn empty(shape: Vec<usize>, client: &ComputeClient<R::Server, R::Channel>) -> Self {
let size: usize = shape.iter().product();
let data = client.empty(size * core::mem::size_of::<F>()); // the size passed to empty() is in bytes
let strides = compact_strides(&shape);
Self {
data,
shape,
strides,
_r: PhantomData,
_f: PhantomData,
}
}
/// Create a TensorArg to pass to a kernel
pub fn into_tensor_arg(&self, line_size: u8) -> TensorArg<'_, R> {
unsafe { TensorArg::from_raw_parts::<F>(&self.data, &self.strides, &self.shape, line_size) }
}
/// Return the data from the client
pub fn read(self, client: &ComputeClient<R::Server, R::Channel>) -> Vec<F> {
let bytes = client.read_one(self.data.binding());
F::from_bytes(&bytes).to_vec()
}
}
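As a quick usage sketch (assuming the wgpu feature is enabled; the values follow from how arange fills the buffer), a tensor can be created and read back like this:
use cubecl::prelude::*;
use cubecl::wgpu::WgpuRuntime;
fn demo() {
    // Create a client for the default wgpu device and build a 2x3 tensor on the GPU.
    let client = WgpuRuntime::client(&Default::default());
    let tensor = GpuTensor::<WgpuRuntime, f32>::arange(vec![2, 3], &client);
    // arange fills the buffer with 0, 1, 2, ..., so reading it back is predictable.
    assert_eq!(tensor.read(&client), vec![0.0, 1.0, 2.0, 3.0, 4.0, 5.0]);
}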
Reduce function
Compared to the previous example, this function is similar but uses CubeCL's cube
macro to define the kernel. The kernel performs the same reduction operation, summing the values of each row and storing the result in a new tensor. The variable F
is a generic type that implements the Float
trait, allowing the function to work with different floating-point types (e.g., f32
, f64
). The tensor is provided by cubecl::prelude, which includes the necessary traits and types for using CubeCL.
use cubecl::prelude::*;
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
for i in 0..input.shape(0) {
let mut acc = F::new(0.0f32);
for j in 0..input.shape(1) {
acc += input[i * input.stride(0) + j];
}
output[i] = acc;
}
}
Launching code
Once the kernel is defined, we can launch it using CubeCL's runtime. The following code creates a 3x3 matrix, initializes the input tensor, and calls the reduce_matrix
function to perform the reduction. The result is printed to the console. Note that this code uses the cubecl::wgpu::WgpuRuntime
runtime, which is a CubeCL runtime for WebGPU. You can replace it with any other CubeCL runtime that you prefer.
use cubecl::prelude::*;
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
for i in 0..input.shape(0) {
let mut acc = F::new(0.0f32);
for j in 0..input.shape(1) {
acc += input[i * input.stride(0) + j];
}
output[i] = acc;
}
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(device);
let input = GpuTensor::<R, F>::arange(vec![3, 3], &client);
let output = GpuTensor::<R, F>::empty(vec![3], &client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
input.into_tensor_arg(1),
output.into_tensor_arg(1),
)
};
println!(
"Executed reduction with runtime {:?} => {:?}",
R::name(&client),
output.read(&client)
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
Benchmarking reduction
Now that we have a basic understanding of how to perform a reduction operation, let's benchmark it to see how it performs in terms of speed and efficiency.
Benchmarking struct
For benchmarking, we will create a struct that holds the necessary information for the benchmark, such as the input shape and the compute client. This struct will be used to run the benchmark tests and configure the benchmarking environment. Please note that the Runtime
and Float
traits are used to make the benchmark generic over different CubeCL runtimes and floating-point types. A PhantomData
is used to indicate that the struct holds a type parameter F
without actually storing a value of that type; in our case this lets us easily change the floating-point type used in the benchmark. For more information, see the Rust documentation on PhantomData.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
Implementing the benchmark trait
To benchmark a CubeCL kernel, it is recommended to implement the Benchmark
trait, which defines the methods needed to prepare, execute, and synchronize the benchmark. Synchronization matters because GPUs execute asynchronously: without an explicit sync, most benchmarking tools would only measure the time to submit the kernel, not the time to finish executing it.
/// Benchmark trait.
pub trait Benchmark {
/// Benchmark input arguments.
type Input: Clone;
/// The benchmark output.
type Output;
/// Prepare the benchmark, run anything that is essential for the benchmark, but shouldn't
/// count as included in the duration.
///
/// # Notes
///
/// This should not include warmup, the benchmark will be run at least one time without
/// measuring the execution time.
fn prepare(&self) -> Self::Input;
/// Execute the benchmark and returns the logical output of the task executed.
///
/// It is important to return the output since otherwise deadcode optimization might optimize
/// away code that should be benchmarked.
fn execute(&self, input: Self::Input) -> Self::Output;
/// Name of the benchmark, should be short and it should match the name
/// defined in the crate Cargo.toml
fn name(&self) -> String;
/// Wait for computation to complete.
fn sync(&self);
}
In the prepare
method, we will create the input data and return a GpuTensor
that will be used in the execute
method. The execute
method will launch the kernel and the sync
method will wait for the GPU to finish executing the kernel before the duration is measured. Don't forget to include the kernel function that we want to benchmark.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
input.into_tensor_arg(1),
output.into_tensor_arg(1),
);
}
output
}
}
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
for i in 0..input.shape(0) {
let mut acc = F::new(0.0f32);
for j in 0..input.shape(1) {
acc += input[i * input.stride(0) + j];
}
output[i] = acc;
}
}
Running the benchmark
Now that we have implemented the Benchmark
trait, we can run the benchmark using the Benchmark::run
method. This method will execute the benchmark and return the time it took to complete it.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
input.into_tensor_arg(1),
output.into_tensor_arg(1),
);
}
output
}
}
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
for i in 0..input.shape(0) {
let mut acc = F::new(0.0f32);
for j in 0..input.shape(1) {
acc += input[i * input.stride(0) + j];
}
output[i] = acc;
}
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![512, 8 * 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![128, 32 * 1024],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
The Results
When you run the above code, it will execute the reduction benchmark for two different input shapes and print the results to the console. The output will look something like this:
wgpu<wgsl>-reduction-[512, 8192]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 240.730ms
Variance 1.595µs
Median 240.310ms
Min 239.974ms
Max 244.374ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[128, 32768]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 241.018ms
Variance 1.068µs
Median 240.943ms
Min 239.734ms
Max 243.782ms
―――――――――――――――――――――――――
As we will see in the next chapter, these timings are not great, but that is expected: we are using a very simple kernel that does not take advantage of GPU parallelism. In the next chapter, we will optimize the kernel to exploit that parallelism and improve the performance of our reduction operation.
Parallel Reduction
Before we dive into the code to implement parallel reduction, let's take a look at the constants that CubeCL provides to help us write efficient parallel reduction kernels.
CubeCL Constants
CubeCL is designed around - you guessed it - Cubes! More specifically, it's based on cuboids, because not all axes are the same size. Since all compute APIs map work onto the hardware as tiles addressed in a 3D representation, our topology can easily be mapped to concepts from other APIs.
A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions along the x, y, and z axes. Similarly, a hyper-cube is composed of cubes, just as a cube is composed of units. Each cube in the hyper-cube can be accessed by its position relative to the hyper-cube along the x, y, and z axes. Hence, a hyper-cube of 3x3x3 will have 27 cubes. In this example, the total number of working units would be 27 x 27 = 729.
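To make this concrete, here is a small sketch (not from the book) of how that 3x3x3 example maps onto launch parameters: the hyper-cube is described by CubeCount (how many cubes to launch) and each cube by CubeDim (how many units per cube).
use cubecl::prelude::*;
fn example_topology() -> (CubeCount, CubeDim) {
    // A hyper-cube of 3x3x3 cubes, each cube holding 3x3x3 units:
    // 27 cubes * 27 units = 729 working units in total.
    (CubeCount::Static(3, 3, 3), CubeDim::new(3, 3, 3))
}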
Topology Equivalence
Since all topology variables are constant within the kernel entry point, we chose to use the Rust
constant syntax with capital letters. When creating kernels, we often don't care about the relative position of a unit within a cube along each axis, only about its overall position. Therefore, each kind of variable also has its own axis-independent variant, which is
often not present in other languages, except WebGPU with local_invocation_index
.
CubeCL | CUDA | WebGPU |
---|---|---|
CUBE_COUNT | N/A | N/A |
CUBE_COUNT_X | gridDim.x | num_workgroups.x |
CUBE_COUNT_Y | gridDim.y | num_workgroups.y |
CUBE_COUNT_Z | gridDim.z | num_workgroups.z |
CUBE_POS | N/A | N/A |
CUBE_POS_X | blockIdx.x | workgroup.x |
CUBE_POS_Y | blockIdx.y | workgroup.y |
CUBE_POS_Z | blockIdx.z | workgroup.z |
CUBE_DIM | N/A | N/A |
CUBE_DIM_X | blockDim.x | workgroup_size.x |
CUBE_DIM_Y | blockDim.y | workgroup_size.y |
CUBE_DIM_Z | blockDim.z | workgroup_size.z |
UNIT_POS | N/A | local_invocation_index |
UNIT_POS_X | threadIdx.x | local_invocation_id.x |
UNIT_POS_Y | threadIdx.y | local_invocation_id.y |
UNIT_POS_Z | threadIdx.z | local_invocation_id.z |
PLANE_DIM | warpSize | subgroup_size |
ABSOLUTE_POS | N/A | N/A |
ABSOLUTE_POS_X | N/A | global_id.x |
ABSOLUTE_POS_Y | N/A | global_id.y |
ABSOLUTE_POS_Z | N/A | global_id.z |
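As an illustration of how these constants combine, the sketch below (an illustrative example, not from the book) computes a unit's global x position from its cube's position and the cube dimensions; ABSOLUTE_POS_X provides the same value directly.
use cubecl::prelude::*;
#[cube(launch_unchecked)]
fn write_global_index(output: &mut Array<u32>) {
    // Global x position of a unit: offset of its cube plus its position inside the cube.
    let global_x = CUBE_POS_X * CUBE_DIM_X + UNIT_POS_X;
    output[global_x] = global_x;
}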
Parallel Reduction Example
Building on the previous example, we will now implement a parallel reduction using CubeCL. The goal is still to reduce a 2D matrix into a 1D vector by summing the elements of each row. Where can we parallelize this operation? Each row can be reduced independently, so each invocation can compute the sum of one row. We need to change the launch parameters, setting the CubeDim to launch multiple invocations in parallel, and we can simply remove the outer loop and use UNIT_POS_X to select the row of the input tensor. Be mindful of data races when parallelizing operations: because invocations run in parallel, each one must write to a different position in the output tensor.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(self.input_shape[0] as u32, 1, 1), // Add parallelization on the first dimension
input.into_tensor_arg(1),
output.into_tensor_arg(1),
);
}
output
}
}
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
let mut acc = F::new(0.0f32);
for i in 0..input.shape(1) {
acc += input[UNIT_POS_X * input.stride(0) + i];
}
output[UNIT_POS_X] = acc;
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![512, 8 * 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![128, 32 * 1024],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
The Results
Now that we have improved parallelism, the kernel is up to 70x faster for the first shape and up to 25x faster for the second. Why is the first shape faster than the second? Even though the two shapes have the same number of elements, the first has more rows, so more invocations can compute the reduction in parallel. The second shape has fewer rows, so fewer invocations are available to share the work.
wgpu<wgsl>-reduction-[512, 8192]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 3.369ms
Variance 48.000ns
Median 3.321ms
Min 3.177ms
Max 4.011ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[128, 32768]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 8.713ms
Variance 5.069µs
Median 8.507ms
Min 5.963ms
Max 12.301ms
―――――――――――――――――――――――――
Vectorized Reduction
In this section, we will explore how to implement a vectorized reduction operation using CubeCL. Vectorization is a powerful technique that allows us to process multiple data elements simultaneously, significantly improving performance for certain types of computations, especially I/O-bound ones.
What is vectorization?
Vectorization is the process of converting scalar operations (which operate on single data elements) into vector operations (which operate on multiple data elements simultaneously). This is typically done using SIMD (Single Instruction, Multiple Data) instructions available in modern CPUs and GPUs. By leveraging vectorization, we can achieve significant performance improvements for operations that can be vectorized. For more information on vectorization in CubeCL, you can refer to this section.
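As a small illustration (an assumption-level sketch, not the book's kernel), a tensor launched with a line size of 4 appears inside the kernel as a Tensor<Line<f32>>, and one operation on a Line applies to all four lanes at once:
use cubecl::prelude::*;
#[cube(launch_unchecked)]
fn add_vectorized(a: &Tensor<Line<f32>>, b: &Tensor<Line<f32>>, out: &mut Tensor<Line<f32>>) {
    // One addition here processes `line_size` (e.g., 4) f32 values at once.
    out[UNIT_POS_X] = a[UNIT_POS_X] + b[UNIT_POS_X];
}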
Application to the reduction problem
To apply vectorization to the reduction problem, we will modify our reduction kernel to process multiple elements at once. Instead of summing one element at a time, we will sum several elements per operation, which can lead to substantial performance gains. The number of elements processed at a time is the line size. So to add vectorization, we just need to pass the LINE_SIZE to the TensorArg and reduce the number of iterations of reduce_matrix.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
const LINE_SIZE: u32 = 4;
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(self.input_shape[0] as u32, 1, 1),
input.into_tensor_arg(LINE_SIZE as u8),
output.into_tensor_arg(LINE_SIZE as u8),
);
}
output
}
}
// Note the addition of the [Line] struct inside the tensor to guarantee that the data is contiguous and can be parallelized.
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<Line<F>>, output: &mut Tensor<Line<F>>) {
let mut acc = Line::new(F::new(0.0f32)); // A [Line] is also necessary here
for i in 0..input.shape(1) / LINE_SIZE {
acc = acc + input[UNIT_POS_X * input.stride(0) + i];
}
output[UNIT_POS_X] = acc;
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![512, 8 * 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![128, 32 * 1024],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
The Result
The result of adding vectorization is an average 3x speedup compared to the previous parallel reduction implementation. This is because each invocation now processes multiple elements at a time, which reduces the time spent in each invocation.
wgpu<wgsl>-reduction-[512, 8192]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 1.085ms
Variance 14.000ns
Median 1.045ms
Min 998.981µs
Max 1.375ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[128, 32768]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 3.124ms
Variance 37.000ns
Median 3.061ms
Min 3.009ms
Max 3.670ms
―――――――――――――――――――――――――
Parallel reduction 3D
The purpose of this example is to demonstrate how to perform a parallel reduction operation on a 3D tensor using CubeCL. The reduction will sum the elements along the last dimension (depth) of the tensor, resulting in a 2D tensor.
A first try
We will start with a simple implementation of a parallel reduction on a 3D tensor. The goal is to reduce the tensor along the last dimension (depth) by summing the elements. This will result in a 2D tensor where each element is the sum of the corresponding elements in the depth dimension.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
const LINE_SIZE: u32 = 4;
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(self.input_shape[0] as u32, self.input_shape[1] as u32, 1),
input.into_tensor_arg(LINE_SIZE as u8),
output.into_tensor_arg(LINE_SIZE as u8),
);
}
output
}
}
// Note the addition of the [Line] struct inside the tensor to guarantee that the data is contiguous and can be parallelized.
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<Line<F>>, output: &mut Tensor<Line<F>>) {
let mut acc = Line::new(F::new(0.0f32)); // A [Line] is also necessary here
for i in 0..input.shape(2) / LINE_SIZE {
acc = acc + input[UNIT_POS_X * input.stride(0) + UNIT_POS_Y * input.stride(1) + i];
}
output[UNIT_POS_X * input.stride(0) + UNIT_POS_Y] = acc;
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![64, 256, 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![64, 64, 4096],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
Let's try to run this code.
wgpu error: Validation Error
Caused by:
In Device::create_compute_pipeline, label = 'reduce_matrix_f32'
Error matching shader requirements against the pipeline
Shader entry point's workgroup size [64, 256, 1] (16384 total invocations) must be less or equal to the per-dimension limit [1024, 1024, 1024] and the total invocation limit 1024
What happened? The error message indicates that the workgroup size exceeds the limits imposed by the WebGPU backend. The total number of invocations (64 * 256 * 1 = 16384) exceeds the maximum allowed invocations per workgroup, which is 1024. In other words, the CubeDim is too large for the GPU to handle. We need to find another way to parallelize the reduction without exceeding these limits.
A better approach
To address the issue, we will also parallelize using the CUBE_COUNT
and CUBE_POS
variables, which let us launch more invocations in parallel without exceeding the CUBE_DIM
limits. The CUBE_COUNT
passed at launch determines how many cubes are dispatched, and the CUBE_POS
variable gives each cube's position, which the kernel uses to locate its part of the 3D tensor.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
const LINE_SIZE: u32 = 4;
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(self.input_shape[0] as u32, 1, 1),
CubeDim::new(self.input_shape[1] as u32, 1, 1),
input.into_tensor_arg(LINE_SIZE as u8),
output.into_tensor_arg(LINE_SIZE as u8),
);
}
output
}
}
// Note the addition of the [Line] struct inside the tensor to guarantee that the data is contiguous and can be parallelized.
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<Line<F>>, output: &mut Tensor<Line<F>>) {
let mut acc = Line::new(F::new(0.0f32)); // A [Line] is also necessary here
for i in 0..input.shape(2) / LINE_SIZE {
acc = acc + input[CUBE_POS_X * input.stride(0) + UNIT_POS_X * input.stride(1) + i];
}
output[CUBE_POS_X * output.stride(0) + UNIT_POS_X] = acc;
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![64, 256, 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![64, 64, 4096],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
Now, let's run the code again.
wgpu<wgsl>-reduction-[64, 256, 1024]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 1.483ms
Variance 27.000ns
Median 1.535ms
Min 1.239ms
Max 1.808ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[64, 64, 4096]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 924.409µs
Variance 189.000ns
Median 945.270µs
Min 600.110µs
Max 2.098ms
―――――――――――――――――――――――――
It runs, and it is fast! The reduction is now parallelized across multiple cubes, and performance is significantly improved compared to the previous implementation. The results show that the reduction is efficient and can handle larger tensors without exceeding the GPU limits. It is also almost as fast as the 2D reduction, even though there are more elements to reduce, because the work is now spread across both cubes and the units within each cube. See the parallel reduction chapter if you need a refresher on the different parallelization levels used in CubeCL. It is also worth noting that the performance and the optimal CUBE_COUNT
and CUBE_DIM
values may vary depending on the GPU architecture and the specific workload. You may need to experiment with different values to find the best configuration for your use case.
Examples
For now, we only have a limited number of examples, listed in the table below. Note that you can also look at how matmul is implemented. Don't hesitate to contribute more examples to the CubeCL repository!
Example | Description |
---|---|
GeLU | Implement the GeLU activation function using CubeCL. |
Sum Things | Sum some numbers using many different variations leveraging the CubeCL core features and trait support. |
Normalization | Show how to use normalization on vectorized elements. |
Device Sharing | Share a WGPU device between CubeCL and other services. |
Fusing | Use comptime to select operations. |
Core Features
In this section, we'll explore the core features of CubeCL and what sets it apart from other high-performance computing languages like CUDA, OpenCL, and HIP.
Comptime
CubeCL isn't just a new compute language: though it feels like you are writing GPU kernels, you are, in fact, writing compiler plugins that you can fully customize! Comptime is a way to modify the compiler IR at runtime when compiling a kernel for the first time.
This enables a lot of optimizations and flexibility without having to write many separate variants of the same kernels to ensure maximal performance.
Loop Unrolling
You can easily unroll loops in CubeCL using the unroll
attribute on top of a for loop.
#[cube(launch)]
fn sum<F: Float>(input: &Array<F>, output: &mut Array<F>, #[comptime] end: Option<u32>) {
    let unroll = end.is_some();
    let end = end.unwrap_or_else(|| input.len());
    let mut sum = F::new(0.0);
    #[unroll(unroll)]
    for i in 0..end {
        sum += input[i];
    }
    output[ABSOLUTE_POS] = sum;
}
Note that if you provide a variable end
that can't be determined at compile time, a panic will
arise when trying to execute that kernel.
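For comparison, here is a hedged sketch (not from the book) where the bound is a compile-time literal, so the loop can be unrolled unconditionally:
use cubecl::prelude::*;
#[cube]
fn sum_first_four<F: Float>(input: &Array<F>) -> F {
    let mut sum = F::new(0.0);
    // The bound is a literal, so the loop body is repeated four times at compile time.
    #[unroll]
    for i in 0..4u32 {
        sum += input[i];
    }
    sum
}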
Feature Specialization
You could also achieve the sum using plane operations. We will write a kernel that uses that instruction when available based on a comptime feature flag. When it isn't available, it will fall back on the previous implementation, essentially making it portable.
#[cube(launch)]
fn sum_plane<F: Float>(
    input: &Array<F>,
    output: &mut Array<F>,
    #[comptime] plane: bool,
    #[comptime] end: Option<u32>,
) {
    if plane {
        output[UNIT_POS] = plane_sum(input[UNIT_POS]);
    } else {
        sum_basic(input, output, end);
    }
}
Note that no branching will actually occur on the GPU, since three different kernels can be generated from the last code snippet. You can also use the trait system to achieve a similar behavior.
Vectorization
High-performance kernels should rely on SIMD instructions whenever possible, but doing so can quickly get pretty complicated! With CubeCL, you can specify the vectorization factor of each input variable when launching a kernel. Inside the kernel code, you still use only one type, which is dynamically vectorized and supports automatic broadcasting. The runtimes are able to compile kernels and have all the necessary information to use the best instructions! However, since the algorithmic behavior may depend on the vectorization factor, CubeCL allows you to access it directly in the kernel when needed, without any performance loss, using the comptime system!
Autotune
Autotuning drastically simplifies kernel selection by running small benchmarks at runtime to figure out the best kernels with the best configurations to run on the current hardware; an essential feature for portability. This feature combines gracefully with comptime to test the effect of different comptime values on performance; sometimes it can be surprising!
Even if the benchmarks may add some overhead when running the application for the first time, the information gets cached on the device and will be reused. It is usually a no-brainer trade-off for throughput-oriented programs such as deep learning models. You can even ship the autotune cache with your program, reducing cold start time when you have more control over the deployment target.
Querying hardware features
Some features and datatypes are only supported on some hardware or some backends. They can be queried with:
client.properties().feature_enabled(feature)
Also see Feature
.
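For example, here is a hedged sketch of checking for plane support before choosing a kernel variant (the Feature path and variant name are assumptions; check the API docs for the exact items):
use cubecl::prelude::*;
fn plane_supported<R: Runtime>(client: &ComputeClient<R::Server, R::Channel>) -> bool {
    // Query the device properties for the plane (subgroup/warp) feature.
    client.properties().feature_enabled(cubecl::Feature::Plane)
}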
Overview
Features
Also requires device support
Feature | CUDA | ROCm | WGPU (WGSL) | WGPU (SPIR-V) |
---|---|---|---|---|
Plane | ✔️ | ✔️ | ✔️ | ✔️ |
CMMA | ✔️ | ✔️ | ❌ | ✔️ |
Datatypes
flex32 is represented as f32 everywhere except SPIR-V, with no reduced precision. f64 is not supported for all operations.
Type | CUDA | ROCm | WGPU (WGSL) | WGPU (SPIR-V) |
---|---|---|---|---|
u8 | ✔️ | ✔️ | ❌ | ✔️ |
u16 | ✔️ | ✔️ | ❌ | ✔️ |
u32 | ✔️ | ✔️ | ✔️ | ✔️ |
u64 | ✔️ | ✔️ | ❌ | ✔️ |
i8 | ✔️ | ✔️ | ❌ | ✔️ |
i16 | ✔️ | ✔️ | ❌ | ✔️ |
i32 | ✔️ | ✔️ | ✔️ | ✔️ |
i64 | ✔️ | ✔️ | ❌ | ✔️ |
f16 | ✔️ | ✔️ | ❌ | ✔️ |
bf16 | ✔️ | ✔️ | ❌ | ❌ |
flex32 | ❔ | ❔ | ❔ | ✔️ |
tf32 | ✔️ | ❌ | ❌ | ❌ |
f32 | ✔️ | ✔️ | ✔️ | ✔️ |
f64 | ❔ | ❔ | ❌ | ❔ |
bool | ✔️ | ✔️ | ✔️ | ✔️ |
Datatype Details
Flex32
Relaxed-precision 32-bit float. Minimum range and precision are equivalent to f16
, but may be
higher. Defaults to f32
when relaxed precision isn't supported.
Tensor-Float32
19-bit CUDA-only type that should only be used as a CMMA matrix type. May be able to reinterpret
from f32
, but officially undefined. Use Cast::cast_from
to safely convert.
Feature Details
Plane
Plane-level operations, e.g. plane_sum, plane_elect.
Cooperative Matrix Multiply-Add (CMMA)
Plane-level cooperative matrix multiply-add operations. Maps to wmma
in CUDA and
CooperativeMatrixMultiply
in SPIR-V. Features are registered for each size and datatype that is
supported by the hardware. For supported functions, see
cmma
.
Language Support
In this section, we will highlight key language features of CubeCL and demonstrate how to use them in your kernels to enhance performance, portability, and maintainability.
Trait Support
CubeCL partially supports traits to modularize your kernel code without any overhead. For now most features are supported except stateful functions.
#[cube]
trait MyTrait {
    /// Supported
    fn my_function(x: &Array<f32>) -> f32;
    /// Unsupported
    fn my_function_2(&self, x: &Array<f32>) -> f32;
}
The trait system allows you to do specialization quite easily. Let's take the same example as in the comptime section.
First you can define your trait. Note that if you use your trait from the launch function, you will
need to add 'static + Send + Sync
.
#[cube]
trait SumKind: 'static + Send + Sync {
    fn sum<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F;
}
Then we can define some implementations:
struct SumBasic;
struct SumPlane;
#[cube]
impl SumKind for SumBasic {
    fn sum<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F {
        let unroll = end.is_some();
        let end = end.unwrap_or_else(|| input.len());
        let mut sum = F::new(0.0);
        #[unroll(unroll)]
        for i in 0..end {
            sum += input[i];
        }
        sum
    }
}
#[cube]
impl SumKind for SumPlane {
    fn sum<F: Float>(input: &Slice<F>, #[comptime] _end: Option<u32>) -> F {
        plane_sum(input[UNIT_POS])
    }
}
Associated types are also supported. Let's say you want to create a series from a sum.
#[cube]
trait CreateSeries: 'static + Send + Sync {
    type SumKind: SumKind;
    fn execute<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F;
}
You may want to define what kind of series you want to create using an implementation.
struct SumThenMul<K: SumKind> {
    _p: PhantomData<K>,
}
#[cube]
impl<K: SumKind> CreateSeries for SumThenMul<K> {
    type SumKind = K;
    fn execute<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F {
        let val = Self::SumKind::sum(input, end);
        val * input[UNIT_POS]
    }
}
It's actually not the best example of using associated types, but it shows how they are totally supported with CubeCL.
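To tie it together, a kernel can take the series implementation as a generic parameter; the sketch below is an assumption-level example (the to_slice call and exact signatures are assumptions, not confirmed by this book) showing compile-time specialization through the trait:
use cubecl::prelude::*;
#[cube(launch)]
fn series_kernel<F: Float, S: CreateSeries>(
    input: &Array<F>,
    output: &mut Array<F>,
    #[comptime] end: Option<u32>,
) {
    // The concrete series (and therefore the sum kind) is chosen at compile time.
    output[UNIT_POS] = S::execute(&input.to_slice(), end);
}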
Enum Support
CubeCL provides robust support for Rust enums, enabling you to express variant-based logic in your GPU kernels. Enums can be used as kernel arguments, returned from kernels, or as intermediate types within your GPU code. This allows you to write expressive, idiomatic Rust code that maps efficiently to GPU kernels.
Defining enums
To use an enum in a CubeCL kernel, simply derive the required traits on the enum you want to use:
- CubeType enables the enum to be used as a CubeCL type in a kernel.
- CubeLaunch allows the enum to be used as a kernel argument or return type.
Enums can also have data associated with their variants, as long as all fields implement the required CubeCL traits, here's an example that is available in cubecl-std:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub enum CubeOption<T: CubeLaunch + CubeType> {
Some(T),
None,
}
Using enums in kernels
Enums can be passed as kernel arguments, returned from kernels, or used as local variables:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch, Clone, Copy)]
pub enum Function {
AffineTransformation { a: f32, b: f32 },
Cos,
DivideScalar(f32),
}
#[cube(launch_unchecked)]
pub fn kernel_enum_example(
input: &Array<Line<f32>>,
output: &mut Array<Line<f32>>,
function: Function,
) {
output[UNIT_POS] = match function {
Function::AffineTransformation { a, b } => Line::new(a) * input[UNIT_POS] + Line::new(b),
Function::Cos => Line::cos(input[UNIT_POS]),
Function::DivideScalar(coef) => input[UNIT_POS] / Line::new(coef),
}
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let input = client.create(f32::as_bytes(&[1.0, -2.0, 0.5]));
let output = client.empty(3 * core::mem::size_of::<f32>());
unsafe {
kernel_enum_example::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(3, 1, 1),
ArrayArg::from_raw_parts::<f32>(&input, 3, 2),
ArrayArg::from_raw_parts::<f32>(&output, 3, 2),
FunctionArgs::AffineTransformation {
a: ScalarArg::new(1.0),
b: ScalarArg::new(2.0),
},
)
};
println!(
"Executed kernel_enum_example with runtime {:?} => {:?}",
R::name(&client),
f32::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
You can also use enums with data in pattern matching:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub enum CubeOption<T: CubeType> {
Some(T),
None,
}
#[cube(launch_unchecked)]
pub fn kernel_enum_option(input: &Array<f32>, output: &mut Array<f32>, maybe: CubeOption<Array<f32>>) {
output[UNIT_POS] = match maybe {
CubeOption::Some(val) => input[UNIT_POS] + val[UNIT_POS],
CubeOption::None => input[UNIT_POS],
};
}
Adding methods to enums
You can implement methods for enums using the #[cube]
attribute on the impl
block:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch, Clone, Copy)]
pub enum Function {
AffineTransformation { a: f32, b: f32 },
Cos,
DivideScalar(f32),
}
#[cube]
impl Function {
pub fn apply(self, x: Line<f32>) -> Line<f32> {
match self {
Function::AffineTransformation { a, b } => Line::new(a) * x + Line::new(b),
Function::Cos => Line::cos(x),
Function::DivideScalar(coef) => x / Line::new(coef),
}
}
}
#[cube(launch_unchecked)]
pub fn kernel_enum_example(
input: &Array<Line<f32>>,
output: &mut Array<Line<f32>>,
function: Function,
) {
output[UNIT_POS] = function.apply(input[UNIT_POS]);
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let input = client.create(f32::as_bytes(&[1.0, -2.0, 0.5]));
let output = client.empty(3 * core::mem::size_of::<f32>());
unsafe {
kernel_enum_example::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(3, 1, 1),
ArrayArg::from_raw_parts::<f32>(&input, 3, 2),
ArrayArg::from_raw_parts::<f32>(&output, 3, 2),
FunctionArgs::AffineTransformation {
a: ScalarArg::new(1.0),
b: ScalarArg::new(2.0),
},
)
};
println!(
"Executed kernel_enum_example with runtime {:?} => {:?}",
R::name(&client),
f32::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
Struct Support
CubeCL provides robust support for Rust structs, allowing you to organize and modularize your kernel code with zero-cost abstractions. Structs can be used as kernel arguments, returned from kernels, or as intermediate types within your GPU code. This enables you to write idiomatic, maintainable Rust code that maps efficiently to GPU kernels.
Defining structs
To use a struct in a CubeCL kernel, simply derive the required traits on the struct that you want to use:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct Pair<T: CubeLaunch> {
pub left: T,
pub right: T,
}
- CubeType enables the struct to be used as a CubeCL type in a kernel.
- CubeLaunch allows the struct to be used as a kernel argument or return type.
Structs can contain other structs, arrays, or generic parameters, as long as all fields implement the required CubeCL traits. Generics are also supported, allowing you to create reusable types that can be instantiated with different types.
Using structs in kernels
Structs can be passed as kernel arguments if annotated with CubeLaunch
, returned from kernels, or used as local variables:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct Pair<T: CubeLaunch> {
pub left: T,
pub right: T,
}
#[cube(launch_unchecked)]
pub fn kernel_struct_example(pair: &Pair<Array<f32>>, output: &mut Array<f32>) {
output[UNIT_POS] = pair.left[UNIT_POS] + pair.right[UNIT_POS];
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let left = [f32::from_int(1)];
let left = client.create(f32::as_bytes(&left));
let right = [f32::from_int(1)];
let right = client.create(f32::as_bytes(&right));
let output = client.empty(core::mem::size_of::<f32>());
unsafe {
kernel_struct_example::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
PairLaunch::new(
ArrayArg::from_raw_parts::<f32>(&left, 1, 1),
ArrayArg::from_raw_parts::<f32>(&right, 1, 1),
),
ArrayArg::from_raw_parts::<f32>(&output, 1, 1),
)
};
println!(
"Executed kernel_struct_example with runtime {:?} => {:?}",
R::name(&client),
f32::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
You can also mutate struct fields if the struct is passed as a mutable reference:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct Pair<T: CubeLaunch> {
pub left: T,
pub right: T,
}
#[cube(launch_unchecked)]
pub fn kernel_struct_mut(output: &mut Pair<Array<f32>>) {
output.left[UNIT_POS] = 42.0;
output.right[UNIT_POS] = 3.14;
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let left = [f32::from_int(1)];
let left = client.create(f32::as_bytes(&left));
let right = [f32::from_int(1)];
let right = client.create(f32::as_bytes(&right));
unsafe {
kernel_struct_mut::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
PairLaunch::new(
ArrayArg::from_raw_parts::<f32>(&left, 1, 1),
ArrayArg::from_raw_parts::<f32>(&right, 1, 1),
),
)
};
println!(
"Executed kernel_struct_mut with runtime {:?} => ({:?}, {:?})",
R::name(&client),
f32::from_bytes(&client.read_one(left.binding())),
f32::from_bytes(&client.read_one(right.binding())),
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
Comptime fields
You can mark struct fields as comptime, which means their values are known at kernel compilation time and can be used for specialization:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct TaggedArray {
pub array: Array<f32>,
#[cube(comptime)]
pub tag: String,
}
#[cube(launch_unchecked)]
pub fn kernel_with_tag(output: &mut TaggedArray) {
if UNIT_POS == 0 {
if comptime! {&output.tag == "zero"} {
output.array[0] = 0.0;
} else {
output.array[0] = 1.0;
}
}
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(device);
let output = client.empty(core::mem::size_of::<F>());
unsafe {
kernel_with_tag::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
TaggedArrayLaunch::new(
ArrayArg::<R>::from_raw_parts::<F>(&output, 1, 1),
&"not_zero".to_string(),
),
)
};
println!(
"Executed kernel_with_tag with runtime {:?} => {:?}",
R::name(&client),
F::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
Adding methods to struct
You can implement methods for structs using the #[cube]
attribute. Please note that the #[cube]
attribute must be on the impl block. Here's an example:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct Pair<T: CubeLaunch> {
pub left: T,
pub right: T,
}
#[cube]
impl Pair<Array<f32>> {
pub fn sum(&self, index: u32) -> f32 {
self.left[index] + self.right[index]
}
}
#[cube(launch_unchecked)]
pub fn kernel_struct_example(pair: &Pair<Array<f32>>, output: &mut Array<f32>) {
output[UNIT_POS] = pair.sum(UNIT_POS);
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let left = [f32::from_int(1)];
let left = client.create(f32::as_bytes(&left));
let right = [f32::from_int(1)];
let right = client.create(f32::as_bytes(&right));
let output = client.empty(core::mem::size_of::<f32>());
unsafe {
kernel_struct_example::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
PairLaunch::new(
ArrayArg::from_raw_parts::<f32>(&left, 1, 1),
ArrayArg::from_raw_parts::<f32>(&right, 1, 1),
),
ArrayArg::from_raw_parts::<f32>(&output, 1, 1),
)
};
println!(
"Executed kernel_struct_example with runtime {:?} => {:?}",
R::name(&client),
f32::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
Advanced usage
This section contains useful information on advanced features of CubeCL.
Configuration
CubeCL provides a flexible and powerful configuration system to control logging, autotuning, profiling, and compilation behaviors.
Overview
By default, CubeCL loads its configuration from a TOML file (cubecl.toml
or CubeCL.toml
) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.
You can also override configuration options using environment variables, which is useful for CI, debugging, or deployment scenarios.
Configuration File Structure
A typical cubecl.toml
file might look like this:
[profiling]
logger = { level = "basic", stdout = true }
[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }
[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
Each section configures a different aspect of CubeCL:
- profiling: Controls performance profiling and logging.
- autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
- compilation: Manages kernel compilation logging and cache.
Configuration Options
Profiling
The [profiling] section controls how CubeCL logs profiling information.
Log Levels:
- disabled: No profiling logs.
- minimal: Only logs which kernels run.
- basic: Adds basic profiling info.
- medium: More detailed profiling.
- full: Maximum detail.
Example:
[profiling]
logger = { level = "basic", stdout = true }
Autotune
The [autotune] section configures how aggressively CubeCL autotunes kernels and where it stores autotune results.
Autotune Levels:
- minimal: Fastest, least thorough.
- balanced: Good trade-off (default).
- extensive: More thorough.
- full: Most thorough, slowest.
Log Levels: disabled, minimal, full
Example:
[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }
Cache Location (if enabled):
- local: Current directory
- target: Project's target directory (default)
- global: System config directory
- file: Custom path
Compilation
The [compilation] section manages logging and caching for kernel compilation.
Log Levels:
- disabled: No logs.
- basic: Logs when kernels are compiled.
- full: Logs full details, including source code.
Example:
[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
Environment Variable Overrides
CubeCL supports several environment variables to override configuration at runtime:
- CUBECL_DEBUG_LOG: Controls logging output.
  - "stdout": Log to stdout.
  - "stderr": Log to stderr.
  - "1" / "true": Log to /tmp/cubecl.log.
  - "0" / "false": Disable logging.
  - Any other value: Treated as a file path.
- CUBECL_DEBUG_OPTION: Sets log verbosity.
  - "debug": Full compilation and autotune logs, medium profiling.
  - "debug-full": Full logs for all.
  - "profile", "profile-medium", "profile-full": Set profiling log level.
- CUBECL_AUTOTUNE_LEVEL: Sets autotune level.
  - "minimal" / "0"
  - "balanced" / "1"
  - "extensive" / "2"
  - "full" / "3"
Example (Linux/macOS):
export CUBECL_DEBUG_LOG=stdout
export CUBECL_AUTOTUNE_LEVEL=full
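On Windows, the same variables can be set in a PowerShell session, for example:
$env:CUBECL_DEBUG_LOG = "stdout"
$env:CUBECL_AUTOTUNE_LEVEL = "full"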
Programmatic Configuration
You can also set the global configuration from Rust code before CubeCL is initialized:
let config = cubecl::config::GlobalConfig {
    profiling: ...,
    autotune: ...,
    compilation: ...,
};
cubecl::config::GlobalConfig::set(config);
Note: You must call GlobalConfig::set before any CubeCL operations, and only once per process.
Logging
CubeCL supports logging to multiple destinations simultaneously:
- File (with append/overwrite)
- Stdout
- Stderr
- Rust log crate (for integration with other logging frameworks)
You can configure these in the logger field for each section.
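For instance, a profiling logger that writes to stdout and to a log file at the same time could look like the sketch below; it simply combines the logger fields already used in the examples in this chapter, so adjust it to your CubeCL version if needed.
[profiling]
logger = { level = "basic", stdout = true, file = "cubecl.log", append = true }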
Saving the Default Configuration
To generate a default configuration file:
cubecl::config::GlobalConfig::save_default("cubecl.toml").unwrap();
Example: Full Configuration
[profiling]
logger = { level = "medium", stdout = true }
[autotune]
level = "extensive"
logger = { level = "full", file = "autotune.log", append = false }
[compilation]
logger = { level = "full", file = "compile.log", append = true }
Algorithm reference
In this section, we introduce different algorithms provided by CubeCL. This is a best-effort list, and we focus first on nontrivial algorithms that deserve more explanation than is reasonable to put in the API documentation. This section is also more technical than the others, as it serves two purposes. First, it is a reference for users interested in the lower-level details of CubeCL. Second, it is a reference for developers who want to update the implementations, as the algorithms often get obscured by optimization details.
Quantized matrix multiplication
To make matrix multiplication faster, we replace floating-point arithmetic using f32 with integer arithmetic using a mix of u8, u16 and i32.
The benefits are twofold.
First, we replace Tensor<f32> with Tensor<u8> to reduce memory cost by a factor of 4, which leads to faster read and write operations in global memory.
Second, integer operations are often faster than their floating-point counterparts.
In this section, we start by presenting a more mathematical overview of the algorithm, before discussing implementation.
Mathematical formulation
Scalar quantization
A real number \(a\) can be approximated by an integer \(q\) using the formula
\[
a \approx s(q - z).
\]
In this equation \(s\) is a scaling factor and is also a real number,
while \(z\) is called the zero-offset and is an integer.
In theory,
with this approximation,
we can represent exactly all real numbers that are integral multiples of \(s\).
All other real numbers are rounded to the closest representable value.
However, in practice, the range of \(q\) is limited by its representation (e.g. u8, i32).
Hence, the zero-offset \(z\) allows us to slide the interval of representable numbers toward the interval of interest for a particular application.
Also, by using the same type for \(q\) and \(z\), we ensure that 0 is exactly representable.
The multiplication of two real numbers is equivalent to \[ a b = s_a s_b (q_a - z_a) (q_b - z_b). \] However, we are more interested in the quantized version \(q_c\) of \(c = ab \). Given we want to approximate \(c\) with scaling \(s_c\) and zero-offset \(z_c\), we have \[ q_c = z_c + \frac{s_a s_b}{s_c} (q_a - z_a) (q_b - z_b). \] Except for the factor \( (s_a s_b) / s_c \), the above equation involves only integer arithmetic. However, we can always find two integers \(u, v\) such that \[ \frac uv \approx \frac{s_a s_b}{s_c} \] is a satisfying approximation. This leads to the final approximation for quantized multiplication \[ q_c \approx z_c + \frac uv (q_a - z_a)(q_b - z_b) \] requiring only integer arithmetic.
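To make this concrete, here is a small, self-contained Rust sketch (plain CPU code, independent of the CubeCL API; the helper names and the scale/offset values are chosen purely for illustration) that quantizes two scalars, forms the product (q_a - z_a)(q_b - z_b) with integer arithmetic, and compares the dequantized result with the floating-point product:
// Quantize a real value a as q = round(a / s) + z, clamped to the u8 range.
fn quantize(a: f32, s: f32, z: u8) -> u8 {
    ((a / s).round() as i32 + z as i32).clamp(0, u8::MAX as i32) as u8
}

// Dequantize: a is approximately s * (q - z).
fn dequantize(q: u8, s: f32, z: u8) -> f32 {
    s * (q as i32 - z as i32) as f32
}

fn main() {
    // Illustrative scales and zero-offsets for a, b and c = a * b.
    let (s_a, z_a) = (0.02_f32, 10_u8);
    let (s_b, z_b) = (0.05_f32, 20_u8);
    let (s_c, z_c) = (0.02_f32, 0_u8);

    let (a, b) = (1.3_f32, 2.2_f32);
    let (q_a, q_b) = (quantize(a, s_a, z_a), quantize(b, s_b, z_b));

    // The core product (q_a - z_a)(q_b - z_b) uses integer arithmetic only.
    let int_prod = (q_a as i32 - z_a as i32) * (q_b as i32 - z_b as i32);

    // For clarity, sigma = s_a * s_b / s_c is applied in floating point here;
    // the integer u/v trick for this step is described later in this section.
    let sigma = s_a * s_b / s_c;
    let q_c = ((z_c as f32 + sigma * int_prod as f32).round() as i32)
        .clamp(0, u8::MAX as i32) as u8;

    println!("float product:       {}", a * b); // about 2.86
    println!("dequantized product: {}", dequantize(q_c, s_c, z_c)); // also about 2.86
}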
Matrix quantization
The same idea holds for matrix multiplication. To distinguish matrices from scalars, we use capital letters for the former and lowercase letters for the latter.
A real matrix \( A \) is approximated by an integer matrix \( Q \) using \[ A \approx s (Q - z N). \] Here \( N \) is a matrix of ones the same size as \( A \). For two matrices \(A \) and \( B \) with respective shape \(m \times k\) and \(k \times n\) and their product \( C \) of shape \( m \times n \), we have, similar to the scalar case that \[ Q_c \approx z_c N_c + \frac uv (Q_a - z_a N_a)(Q_b - z_b N_b). \]
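For completeness, the intermediate step is the same as in the scalar case: equating the two approximations of \( C = AB \) and solving for \( Q_c \) gives
\[
s_c (Q_c - z_c N_c) \approx s_a s_b (Q_a - z_a N_a)(Q_b - z_b N_b)
\quad\Longrightarrow\quad
Q_c \approx z_c N_c + \frac{s_a s_b}{s_c} (Q_a - z_a N_a)(Q_b - z_b N_b),
\]
where the ratio \( \frac{s_a s_b}{s_c} \) is again approximated by \( \frac uv \) with integers \( u \) and \( v \).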
Implementation
As an example,
we describe how to implement the quantized matrix multiplication
where the elements of \(Q_a\), \(Q_b\) and \(Q_c\) and the zero-offsets are represented as u8.
To compute \(Q_a - z_a N_a \), we first convert the values to i16 before performing the subtraction.
Then, we can compute the product \((Q_a - z_a N_a)(Q_b - z_b N_b)\) by converting the values to i32 before multiplying.
Of course,
in practice, we perform all these conversions on-the-fly to avoid wastefully allocating new matrices.
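As a rough illustration (plain Rust rather than CubeCL code, with hypothetical helper names), accumulating a single output element with these on-the-fly casts looks like this:
// One element of (Q_a - z_a N_a)(Q_b - z_b N_b), with casts done on the fly.
fn dot_i32(row_a: &[u8], col_b: &[u8], z_a: u8, z_b: u8) -> i32 {
    let mut acc: i32 = 0;
    for (&q_a, &q_b) in row_a.iter().zip(col_b.iter()) {
        // u8 -> i16 so the zero-offset subtraction cannot overflow.
        let a = q_a as i16 - z_a as i16;
        let b = q_b as i16 - z_b as i16;
        // i16 -> i32 for the multiply-accumulate.
        acc += a as i32 * b as i32;
    }
    acc
}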
Now, suppose that \(x\) is a single element in the resulting matrix and \(y\) is the element at the same position in \(Q_c\).
We still need to compute the following
\[
y = z_c + \frac uv \cdot x.
\]
The tricky part here is the product.
First,
we impose that \( v \) is a power of 2 so that dividing by \( v \)
is equivalent to right-shifting the product \( u x \).
Then, we need to find the best values \( u \) and \( v \)
for the scaling factor \( \sigma = \frac{s_a s_b}{s_c} \).
The trick is to cleverly multiply \( \sigma \) by 1, to get a form that allows us to work with powers of 2:
\[
\sigma = \frac{2^{31 - f}}{2^{31 - f}} \sigma
\]
where \(2^f\) is the smallest power of 2 larger than \(\sigma\).
For example, if \(\sigma = 0.3\), then \(f = -1\) as \(2^{-1} = 0.5 > 0.3 \)
and \(2^{-2} = 0.25 < 0.3\).
From this, we deduce that we can use \(u = 2^{31 - f} \sigma\) rounded to the nearest i64 value and \(v = 2^{31 - f}\).
This gives us a 31-bit approximation for multiplying by \(\sigma\), which is the best we can achieve when the other multiplicand is an i32.
Indeed, we need to keep one bit for the sign.
To properly round the product,
one can add \(\frac v 2\) to the product before right shifting.
A naive implementation of the above algorithm looks like the following.
fn scaling_ratio(sigma: f32) -> (i64, u32) {
    // 2^log is the smallest power of 2 larger than sigma.
    let log = sigma.log2().ceil() as i32;
    // u approximates sigma * 2^(31 - log); v = 2^(31 - log) becomes a right shift.
    let u = (sigma * 2.0_f32.powi(31 - log)).round() as i64;
    let v_shift = (31 - log) as u32;
    (u, v_shift)
}

fn approx_mul(x: i32, u: i64, v_shift: u32) -> i32 {
    let prod = (x as i64) * u;
    // Add v / 2 before shifting to round to the nearest value instead of truncating.
    let rounding: i64 = 1 << (v_shift - 1);
    let prod_with_rounding = prod + rounding;
    (prod_with_rounding >> v_shift) as i32
}

fn clamp_to_u8(x: i32) -> u8 {
    if x < 0 {
        0
    } else if x > u8::MAX as i32 {
        u8::MAX
    } else {
        x as u8
    }
}

struct Matrix {
    scaling: f32,
    zero_offset: u8,
    // ... other fields to store the matrix elements.
}

impl Matrix {
    fn quantized_mul(&self, other: &Self, output: &mut Self) {
        // Assume the shapes of the matrices match.
        let sigma = self.scaling * other.scaling / output.scaling;
        let (u, v_shift) = scaling_ratio(sigma);
        for row in 0..self.row_count() {
            for col in 0..other.col_count() {
                let mut sum: i32 = 0;
                for middle in 0..self.col_count() {
                    // Cast to i16 for the zero-offset subtraction, then to i32 for the product.
                    let a = self.get(row, middle) as i16 - self.zero_offset as i16;
                    let b = other.get(middle, col) as i16 - other.zero_offset as i16;
                    sum += (a as i32) * (b as i32);
                }
                sum = approx_mul(sum, u, v_shift);
                output.update(row, col, clamp_to_u8(sum + output.zero_offset as i32));
            }
        }
    }

    // Return the value at (row, col).
    fn get(&self, row: usize, col: usize) -> u8 { /* ... */ }

    // Replace the value at (row, col) with the given value.
    fn update(&mut self, row: usize, col: usize, value: u8) { /* ... */ }

    // Return the number of rows of the matrix.
    fn row_count(&self) -> usize { /* ... */ }

    // Return the number of columns of the matrix.
    fn col_count(&self) -> usize { /* ... */ }
}
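As a quick sanity check of the rescaling helpers above (our own test, not part of CubeCL), multiplying 1000 by \(\sigma = 0.3\) through the integer path should give 300:
fn main() {
    let (u, v_shift) = scaling_ratio(0.3);
    let approx = approx_mul(1000, u, v_shift);
    // 1000 * 0.3 = 300; the integer approximation rounds to the same value.
    println!("approx_mul(1000, u, v_shift) = {approx}"); // prints 300
}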
Of course, in CubeCL, we strive to provide the fastest implementation for GPU devices. As such, the example above emphasizes the correct type casting to demonstrate how this is achieved in CubeCL.