Overview
Welcome to the CubeCL Book 👋
This book will help you get started with your high-performance computing project using CubeCL, making sure you get the most out of any hardware.
Why CubeCL is Important
In the current landscape of high-performance computing, developers face several significant challenges that CubeCL aims to address:
Complexity in Performance Optimization
Optimizing compute kernels for different hardware is inherently complex. Developers must understand the intricacies of each platform’s architecture and API, leading to a steep learning curve and the risk of suboptimal performance. The need for manual tuning and platform-specific adjustments often results in code that is difficult to maintain and extend.
The simplest way to solve this problem is to provide high-level abstractions that can be composed in a variety of ways. All of those variations can be autotuned to select the best settings for the current hardware and problem at hand.
Lack of Portability
Portability remains a significant issue. Code written for one API or even for a single GPU architecture often cannot be easily transferred to another, hindering the ability to develop software that can run across diverse hardware environments.
The GPU computing ecosystem is fragmented, with different hardware platforms like CUDA, ROCm, Metal, and Vulkan requiring their own specialized codebases. This fragmentation forces developers to write and maintain separate implementations for each platform, increasing both development time and complexity.
Conclusion
In essence, these challenges underscore the need for a more unified and developer-friendly approach to high-performance computing. CubeCL seeks to bridge these gaps by addressing the core issues within the current ecosystem, offering a new direction for high-performance and portable computing solutions.
Getting Started
In this section, we’ll walk through the installation process for CubeCL, explain the process of writing an optimized CubeCL kernel, and provide some examples to help you start experimenting with CubeCL. By the end, you'll have a solid understanding of how CubeCL integrates into your workflow and how to leverage its features for high-performance computing.
Tutorial
- Installation
- Simple Reduction
- Benchmark
- Parallel Reduction
- Vectorized Reduction
- Parallel Reduction 3D
- Examples
Installation
Installing CubeCL is straightforward. It’s available as a Rust crate, and you can add it to your project by updating your Cargo.toml:
[dependencies]
# Replace with the latest version from crates.io and enable the desired
# runtime features (e.g., wgpu, cuda, hip).
cubecl = { version = "0.6.0", features = ["wgpu"] }
The more challenging aspect is ensuring that you have the necessary drivers to run the selected runtime.
CubeCL supports multiple GPU runtimes, each requiring specific drivers or frameworks. Enable the appropriate feature flag in Cargo.toml and ensure the corresponding drivers are installed.
Platform | Runtime | Supported OS | Requirements | Installation/Notes | Feature Flag |
---|---|---|---|---|---|
WebGPU | wgpu | Linux, Windows, macOS, wasm | Vulkan drivers (typically pre-installed on modern OSes) | On Linux, install the Vulkan driver via your package manager. | wgpu |
CUDA | CUDA | Linux, Windows | NVIDIA CUDA drivers and toolkit | Download and install from the NVIDIA CUDA Downloads page. Verify installation with nvidia-smi. | cuda |
ROCm | HIP | Linux, Windows | AMD ROCm framework | Linux: Follow the ROCm Linux Quick Start. Windows: See the ROCm Windows Installation Guide. | hip |
Metal | wgpu | macOS | Apple device with Metal support (macOS 10.13 or later) | No additional drivers needed; Metal is built into macOS. | wgpu-msl |
Vulkan | wgpu | Linux, Windows | Vulkan drivers | On Linux, install via your package manager; on Windows, Vulkan is typically included with the GPU drivers (NVIDIA/AMD). | wgpu-spirv |
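For example, to target NVIDIA GPUs through the CUDA runtime instead of WebGPU, you would enable the corresponding feature flag (a minimal sketch; check crates.io for the current version):
[dependencies]
cubecl = { version = "0.6.0", features = ["cuda"] }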
Simple Reduction
To get started with CubeCL, we will implement a simple reduction operation on a multidimensional array (tensor). This example will help you understand the basic concepts of CubeCL and how to use it to perform parallel computations on tensors.
An example of a CPU reduction in Rust without CubeCL
This example demonstrates how to perform a simple reduction operation on a multidimensional array (tensor) using pure Rust. The code is designed to be easy to understand and serves as a starting point for more complex operations that can be parallelized with CubeCL. It is not optimized for performance, but it illustrates the basic concepts of working with tensors and performing reductions.
CpuTensor Struct
Tensors are the basic data structure used in CubeCL to represent multidimensional arrays. Here is a simple implementation of a tensor in pure Rust. It is not optimized for performance, but it is easy to understand.
/// Example of a naive multidimensional tensor in pure Rust
#[derive(Debug, Clone)]
pub struct CpuTensor {
/// Raw contiguous value buffer
pub data: Vec<f32>,
/// Number of elements between consecutive indices of each dimension
pub strides: Vec<usize>,
/// Size of each dimension of the tensor
pub shape: Vec<usize>,
}
/// Function to compute strides in a compact layout
fn compact_strides(shape: &[usize]) -> Vec<usize> {
let rank = shape.len();
let mut strides = vec![1; rank];
for i in (0..rank - 1).rev() {
strides[i] = strides[i + 1] * shape[i + 1];
}
strides
}
impl CpuTensor {
/// Create a CpuTensor with the given shape, filled with increasing numbers starting from 0
pub fn arange(shape: Vec<usize>) -> Self {
let size = shape.iter().product();
let data = (0..size).map(|i| i as f32).collect();
let strides = compact_strides(&shape);
Self {
data,
strides,
shape,
}
}
/// Create a zero-filled CpuTensor with the given shape
pub fn empty(shape: Vec<usize>) -> Self {
let size = shape.iter().product();
let data = vec![0.0; size];
let strides = compact_strides(&shape);
Self {
data,
strides,
shape,
}
}
/// Read the inner data
pub fn read(self) -> Vec<f32> {
self.data
}
}
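For instance, a compact 3x4 matrix has strides [4, 1]: moving one step along the first dimension skips a whole row of 4 elements, while moving along the last dimension skips a single element. Here is a quick illustrative check (not part of the book's code) using the compact_strides function above:
fn main() {
    // One row of a 3x4 matrix spans 4 elements, one column spans 1.
    assert_eq!(compact_strides(&[3, 4]), vec![4, 1]);
}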
Reduce function
The following function is a naive implementation of a reduction operation on a matrix. It sums the values of each row and stores the result in a new tensor. The input tensor is expected to be a 2D matrix, and the output tensor will be a 1D vector containing the sum of each row.
use cubecl_example::cpu_tensor::CpuTensor; // Change to the path of your own module containing the CpuTensor
/// This function reduces the last dimension of a 2D matrix by summing over each row:
/// [0 1 2] [0 + 1 + 2] [3 ]
/// [3 4 5] -> [3 + 4 + 5] -> [12]
/// [6 7 8] [6 + 7 + 8] [21]
fn reduce_matrix(input: &CpuTensor, output: &mut CpuTensor) {
for i in 0..input.shape[0] {
let mut acc = 0.0f32;
for j in 0..input.shape[1] {
acc += input.data[i * input.strides[0] + j];
}
output.data[i] = acc;
}
}
Launching code
The following code creates a 3x3 matrix, initializes the input tensor, and calls the reduce_matrix
function to perform the reduction. The result is printed to the console.
use cubecl_example::cpu_tensor::CpuTensor; // Change to the path of your own module containing the CpuTensor
/// This function reduces the last dimension of a 2D matrix by summing over each row:
/// [0 1 2] [0 + 1 + 2] [3 ]
/// [3 4 5] -> [3 + 4 + 5] -> [12]
/// [6 7 8] [6 + 7 + 8] [21]
fn reduce_matrix(input: &CpuTensor, output: &mut CpuTensor) {
for i in 0..input.shape[0] {
let mut acc = 0.0f32;
for j in 0..input.shape[1] {
acc += input.data[i * input.strides[0] + j];
}
output.data[i] = acc;
}
}
fn launch() {
let input_shape = vec![3, 3];
let output_shape = vec![3];
let input = CpuTensor::arange(input_shape);
let mut output = CpuTensor::empty(output_shape);
reduce_matrix(&input, &mut output);
println!("Executed reduction => {:?}", output.read());
}
fn main() {
launch();
}
A first example of a GPU reduction with CubeCL
This example demonstrates how to perform a simple reduction operation on a multidimensional array (tensor) using CubeCL. It is a simple implementation that will be used as a starting point to show how to use CubeCL in the next chapters.
GpuTensor struct
The GpuTensor
struct is a representation of a tensor that resides on the GPU. It contains the data handle, shape, strides, and marker types for the runtime and floating-point type. The GpuTensor
struct provides methods to create tensors, read data from the GPU, and convert them into tensor arguments for kernel execution. Please note that it is generic over the runtime and floating-point type, allowing it to work with different CubeCL runtimes and floating-point types (e.g., f16
, f32
). Also, the strides can be computed using the compact_strides
function from the cubecl::std::tensor
module, which returns the strides of a compact (contiguous) layout for a given shape.
Another important concept is the ComputeClient
trait, which defines what a runtime must implement to be able to run kernels. Each runtime has its own implementation of the ComputeClient
trait, which provides methods to create tensors and read data from the GPU. The ComputeClient
can send compute tasks to a Server
that schedules the tasks and runs the kernels on the GPU.
use std::marker::PhantomData;
use cubecl::{prelude::*, server::Handle, std::tensor::compact_strides};
/// Simple GpuTensor
#[derive(Debug)]
pub struct GpuTensor<R: Runtime, F: Float + CubeElement> {
data: Handle,
shape: Vec<usize>,
strides: Vec<usize>,
_r: PhantomData<R>,
_f: PhantomData<F>,
}
impl<R: Runtime, F: Float + CubeElement> Clone for GpuTensor<R, F> {
fn clone(&self) -> Self {
Self {
data: self.data.clone(), // Handle is a pointer to the data, so cloning it is cheap
shape: self.shape.clone(),
strides: self.strides.clone(),
_r: PhantomData,
_f: PhantomData,
}
}
}
impl<R: Runtime, F: Float + CubeElement> GpuTensor<R, F> {
/// Create a GpuTensor with the given shape, filled with increasing numbers starting from 0
pub fn arange(shape: Vec<usize>, client: &ComputeClient<R::Server, R::Channel>) -> Self {
let size = shape.iter().product();
let data: Vec<F> = (0..size).map(|i| F::from_int(i as i64)).collect();
let data = client.create(F::as_bytes(&data));
let strides = compact_strides(&shape);
Self {
data,
shape,
strides,
_r: PhantomData,
_f: PhantomData,
}
}
/// Create an empty GpuTensor with the given shape
pub fn empty(shape: Vec<usize>, client: &ComputeClient<R::Server, R::Channel>) -> Self {
let size: usize = shape.iter().product();
let data = client.empty(size * core::mem::size_of::<F>()); // the size passed to empty() is in bytes
let strides = compact_strides(&shape);
Self {
data,
shape,
strides,
_r: PhantomData,
_f: PhantomData,
}
}
/// Create a TensorArg to pass to a kernel
pub fn into_tensor_arg(&self, line_size: u8) -> TensorArg<'_, R> {
unsafe { TensorArg::from_raw_parts::<F>(&self.data, &self.strides, &self.shape, line_size) }
}
/// Return the data from the client
pub fn read(self, client: &ComputeClient<R::Server, R::Channel>) -> Vec<F> {
let bytes = client.read_one(self.data.binding());
F::from_bytes(&bytes).to_vec()
}
}
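As a quick usage sketch (assuming the wgpu feature is enabled; the values follow from how arange fills the buffer), a tensor can be created and read back like this:
use cubecl::prelude::*;
use cubecl::wgpu::WgpuRuntime;
fn demo() {
    // Create a client for the default wgpu device and build a 2x3 tensor on the GPU.
    let client = WgpuRuntime::client(&Default::default());
    let tensor = GpuTensor::<WgpuRuntime, f32>::arange(vec![2, 3], &client);
    // arange fills the buffer with 0, 1, 2, ..., so reading it back is predictable.
    assert_eq!(tensor.read(&client), vec![0.0, 1.0, 2.0, 3.0, 4.0, 5.0]);
}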
Reduce function
Compared to the previous example, this function is similar but uses CubeCL's cube
macro to define the kernel. The kernel performs the same reduction operation, summing the values of each row and storing the result in a new tensor. The variable F
is a generic type that implements the Float
trait, allowing the function to work with different floating-point types (e.g., f32
, f64
). The tensor is provided by cubecl::prelude, which includes the necessary traits and types for using CubeCL.
use cubecl::prelude::*;
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
for i in 0..input.shape(0) {
let mut acc = F::new(0.0f32);
for j in 0..input.shape(1) {
acc += input[i * input.stride(0) + j];
}
output[i] = acc;
}
}
Launching code
Once the kernel is defined, we can launch it using CubeCL's runtime. The following code creates a 3x3 matrix, initializes the input tensor, and calls the reduce_matrix
function to perform the reduction. The result is printed to the console. Note that this code uses the cubecl::wgpu::WgpuRuntime
runtime, which is a CubeCL runtime for WebGPU. You can replace it with any other CubeCL runtime that you prefer.
use cubecl::prelude::*;
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
for i in 0..input.shape(0) {
let mut acc = F::new(0.0f32);
for j in 0..input.shape(1) {
acc += input[i * input.stride(0) + j];
}
output[i] = acc;
}
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(device);
let input = GpuTensor::<R, F>::arange(vec![3, 3], &client);
let output = GpuTensor::<R, F>::empty(vec![3], &client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
input.into_tensor_arg(1),
output.into_tensor_arg(1),
)
};
println!(
"Executed reduction with runtime {:?} => {:?}",
R::name(&client),
output.read(&client)
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
Benchmarking reduction
Now that we have a basic understanding of how to perform a reduction operation, let's benchmark it to see how it performs in terms of speed and efficiency.
Benchmarking struct
For benchmarking, we will create a struct that holds the necessary information for the benchmark, such as the input shape and the compute client. This struct will be used to run the benchmark tests and configure the benchmarking environment. Please note that the Runtime
and Float
traits are used to make the benchmark generic over different CubeCL runtimes and floating-point types. A PhantomData
is used to indicate that the struct holds a type parameter F
without actually storing a value of that type; in our case this lets us easily change the floating-point type used in the benchmark. For more information, see the Rust documentation on PhantomData.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
Implementing the benchmark trait
To benchmark a CubeCL kernel, it is recommended to implement the Benchmark
trait, which defines the methods needed to prepare, execute, and synchronize the benchmark. Synchronization matters because GPUs execute asynchronously: without an explicit sync, most benchmarking tools would only measure the time to submit the kernel, not the time to finish executing it.
/// Benchmark trait.
pub trait Benchmark {
/// Benchmark input arguments.
type Input: Clone;
/// The benchmark output.
type Output;
/// Prepare the benchmark, run anything that is essential for the benchmark, but shouldn't
/// count as included in the duration.
///
/// # Notes
///
/// This should not include warmup, the benchmark will be run at least one time without
/// measuring the execution time.
fn prepare(&self) -> Self::Input;
/// Execute the benchmark and returns the logical output of the task executed.
///
/// It is important to return the output since otherwise deadcode optimization might optimize
/// away code that should be benchmarked.
fn execute(&self, input: Self::Input) -> Self::Output;
/// Name of the benchmark, should be short and it should match the name
/// defined in the crate Cargo.toml
fn name(&self) -> String;
/// Wait for computation to complete.
fn sync(&self);
}
In the prepare
method, we will create the input data and return a GpuTensor
that will be used in the execute
method. The execute
method will launch the kernel and the sync
method will wait for the GPU to finish executing the kernel before the duration is measured. Don't forget to include the kernel function that we want to benchmark.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
input.into_tensor_arg(1),
output.into_tensor_arg(1),
);
}
output
}
}
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
for i in 0..input.shape(0) {
let mut acc = F::new(0.0f32);
for j in 0..input.shape(1) {
acc += input[i * input.stride(0) + j];
}
output[i] = acc;
}
}
Running the benchmark
Now that we have implemented the Benchmark
trait, we can run the benchmark using the Benchmark::run
method. This method will execute the benchmark and return the time it took to complete it.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
input.into_tensor_arg(1),
output.into_tensor_arg(1),
);
}
output
}
}
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
for i in 0..input.shape(0) {
let mut acc = F::new(0.0f32);
for j in 0..input.shape(1) {
acc += input[i * input.stride(0) + j];
}
output[i] = acc;
}
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![512, 8 * 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![128, 32 * 1024],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
The Results
When you run the above code, it will execute the reduction benchmark for two different input shapes and print the results to the console. The output will look something like this:
wgpu<wgsl>-reduction-[512, 8192]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 240.730ms
Variance 1.595µs
Median 240.310ms
Min 239.974ms
Max 244.374ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[128, 32768]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 241.018ms
Variance 1.068µs
Median 240.943ms
Min 239.734ms
Max 243.782ms
―――――――――――――――――――――――――
As we will see in the next chapter, these timings are not great, but that is expected: we are using a very simple kernel that does not take advantage of GPU parallelism. In the next chapter, we will optimize the kernel to exploit that parallelism and improve the performance of our reduction operation.
Parallel Reduction
Before we dive into the code to implement parallel reduction, let's take a look at the constants that CubeCL provides to help us write efficient parallel reduction kernels.
CubeCL Constants
CubeCL is designed around - you guessed it - Cubes! More specifically, it's based on cuboids, because not all axes are the same size. Since all compute APIs map work onto the hardware as tiles addressed in a 3D representation, our topology can easily be mapped to concepts from other APIs.
A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions along the x, y, and z axes. Similarly, a hyper-cube is composed of cubes, just as a cube is composed of units. Each cube in the hyper-cube can be accessed by its position relative to the hyper-cube along the x, y, and z axes. Hence, a hyper-cube of 3x3x3 will have 27 cubes. In this example, the total number of working units would be 27 x 27 = 729.
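To make this concrete, here is a small sketch (not from the book) of how that 3x3x3 example maps onto launch parameters: the hyper-cube is described by CubeCount (how many cubes to launch) and each cube by CubeDim (how many units per cube).
use cubecl::prelude::*;
fn example_topology() -> (CubeCount, CubeDim) {
    // A hyper-cube of 3x3x3 cubes, each cube holding 3x3x3 units:
    // 27 cubes * 27 units = 729 working units in total.
    (CubeCount::Static(3, 3, 3), CubeDim::new(3, 3, 3))
}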
Topology Equivalence
Since all topology variables are constant within the kernel entry point, we chose to use the Rust
constant syntax with capital letters. When creating kernels, we often don't care about the relative position of a unit within a cube along each axis, only about its overall position. Therefore, each kind of variable also has its own axis-independent variant, which is
often not present in other languages, except WebGPU with local_invocation_index
.
CubeCL | CUDA | WebGPU |
---|---|---|
CUBE_COUNT | N/A | N/A |
CUBE_COUNT_X | gridDim.x | num_workgroups.x |
CUBE_COUNT_Y | gridDim.y | num_workgroups.y |
CUBE_COUNT_Z | gridDim.z | num_workgroups.z |
CUBE_POS | N/A | N/A |
CUBE_POS_X | blockIdx.x | workgroup.x |
CUBE_POS_Y | blockIdx.y | workgroup.y |
CUBE_POS_Z | blockIdx.z | workgroup.z |
CUBE_DIM | N/A | N/A |
CUBE_DIM_X | blockDim.x | workgroup_size.x |
CUBE_DIM_Y | blockDim.y | workgroup_size.y |
CUBE_DIM_Z | blockDim.z | workgroup_size.z |
UNIT_POS | N/A | local_invocation_index |
UNIT_POS_X | threadIdx.x | local_invocation_id.x |
UNIT_POS_Y | threadIdx.y | local_invocation_id.y |
UNIT_POS_Z | threadIdx.z | local_invocation_id.z |
PLANE_DIM | warpSize | subgroup_size |
ABSOLUTE_POS | N/A | N/A |
ABSOLUTE_POS_X | N/A | global_id.x |
ABSOLUTE_POS_Y | N/A | global_id.y |
ABSOLUTE_POS_Z | N/A | global_id.z |
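As an illustration of how these constants combine, the sketch below (an illustrative example, not from the book) computes a unit's global x position from its cube's position and the cube dimensions; ABSOLUTE_POS_X provides the same value directly.
use cubecl::prelude::*;
#[cube(launch_unchecked)]
fn write_global_index(output: &mut Array<u32>) {
    // Global x position of a unit: offset of its cube plus its position inside the cube.
    let global_x = CUBE_POS_X * CUBE_DIM_X + UNIT_POS_X;
    output[global_x] = global_x;
}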
Parallel Reduction Example
Building on the previous example, we will now implement a parallel reduction using CubeCL. The goal is still to reduce a 2D matrix into a 1D vector by summing the elements of each row. Where can we parallelize this operation? Each row can be reduced independently, so each invocation can compute the sum of one row. We need to change the launch parameters, setting the CubeDim to launch multiple invocations in parallel, and we can simply remove the outer loop and use UNIT_POS_X to select the row of the input tensor. Be mindful of data races when parallelizing operations: because invocations run in parallel, each one must write to a different position in the output tensor.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(self.input_shape[0] as u32, 1, 1), // Add parallelization on the first dimension
input.into_tensor_arg(1),
output.into_tensor_arg(1),
);
}
output
}
}
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
let mut acc = F::new(0.0f32);
for i in 0..input.shape(1) {
acc += input[UNIT_POS_X * input.stride(0) + i];
}
output[UNIT_POS_X] = acc;
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![512, 8 * 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![128, 32 * 1024],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
The Results
Now that we have improved parallelism, the kernel is up to 70x faster for the first shape and up to 25x faster for the second. Why is the first shape faster than the second? Even though the two shapes have the same number of elements, the first has more rows, so more invocations can compute the reduction in parallel. The second shape has fewer rows, so fewer invocations are available to share the work.
wgpu<wgsl>-reduction-[512, 8192]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 3.369ms
Variance 48.000ns
Median 3.321ms
Min 3.177ms
Max 4.011ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[128, 32768]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 8.713ms
Variance 5.069µs
Median 8.507ms
Min 5.963ms
Max 12.301ms
―――――――――――――――――――――――――
Vectorized Reduction
In this section, we will explore how to implement a vectorized reduction operation using CubeCL. Vectorization is a powerful technique that allows us to process multiple data elements simultaneously, significantly improving performance for certain types of computations, especially I/O-bound ones.
What is vectorization?
Vectorization is the process of converting scalar operations (which operate on single data elements) into vector operations (which operate on multiple data elements simultaneously). This is typically done using SIMD (Single Instruction, Multiple Data) instructions available in modern CPUs and GPUs. By leveraging vectorization, we can achieve significant performance improvements for operations that can be vectorized. For more information on vectorization in CubeCL, you can refer to this section.
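As a small illustration (an assumption-level sketch, not the book's kernel), a tensor launched with a line size of 4 appears inside the kernel as a Tensor<Line<f32>>, and one operation on a Line applies to all four lanes at once:
use cubecl::prelude::*;
#[cube(launch_unchecked)]
fn add_vectorized(a: &Tensor<Line<f32>>, b: &Tensor<Line<f32>>, out: &mut Tensor<Line<f32>>) {
    // One addition here processes `line_size` (e.g., 4) f32 values at once.
    out[UNIT_POS_X] = a[UNIT_POS_X] + b[UNIT_POS_X];
}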
Application to the reduction problem
To apply vectorization to the reduction problem, we will modify our reduction kernel to process multiple elements at once. Instead of summing one element at a time, we will sum several elements per operation, which can lead to substantial performance gains. The number of elements processed at a time is the line size. So to add vectorization, we just need to pass the LINE_SIZE to the TensorArg and reduce the number of iterations of reduce_matrix.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
const LINE_SIZE: u32 = 4;
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(self.input_shape[0] as u32, 1, 1),
input.into_tensor_arg(LINE_SIZE as u8),
output.into_tensor_arg(LINE_SIZE as u8),
);
}
output
}
}
// Note the addition of the [Line] struct inside the tensor to guarantee that the data is contiguous and can be parallelized.
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<Line<F>>, output: &mut Tensor<Line<F>>) {
let mut acc = Line::new(F::new(0.0f32)); // A [Line] is also necessary here
for i in 0..input.shape(1) / LINE_SIZE {
acc = acc + input[UNIT_POS_X * input.stride(0) + i];
}
output[UNIT_POS_X] = acc;
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![512, 8 * 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![128, 32 * 1024],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
The Result
The result of adding vectorization is an average 3x speedup compared to the previous parallel reduction implementation. This is because each invocation now processes multiple elements at a time, which reduces the time spent in each invocation.
wgpu<wgsl>-reduction-[512, 8192]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 1.085ms
Variance 14.000ns
Median 1.045ms
Min 998.981µs
Max 1.375ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[128, 32768]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 3.124ms
Variance 37.000ns
Median 3.061ms
Min 3.009ms
Max 3.670ms
―――――――――――――――――――――――――
Parallel reduction 3D
The purpose of this example is to demonstrate how to perform a parallel reduction operation on a 3D tensor using CubeCL. The reduction will sum the elements along the last dimension (depth) of the tensor, resulting in a 2D tensor.
A first try
We will start with a simple implementation of a parallel reduction on a 3D tensor. The goal is to reduce the tensor along the last dimension (depth) by summing the elements. This will result in a 2D tensor where each element is the sum of the corresponding elements in the depth dimension.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
const LINE_SIZE: u32 = 4;
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(1, 1, 1),
CubeDim::new(self.input_shape[0] as u32, self.input_shape[1] as u32, 1),
input.into_tensor_arg(LINE_SIZE as u8),
output.into_tensor_arg(LINE_SIZE as u8),
);
}
output
}
}
// Note the addition of the [Line] struct inside the tensor to guarantee that the data is contiguous and can be parallelized.
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<Line<F>>, output: &mut Tensor<Line<F>>) {
let mut acc = Line::new(F::new(0.0f32)); // A [Line] is also necessary here
for i in 0..input.shape(2) / LINE_SIZE {
acc = acc + input[UNIT_POS_X * input.stride(0) + UNIT_POS_Y * input.stride(1) + i];
}
output[UNIT_POS_X * input.stride(0) + UNIT_POS_Y] = acc;
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![64, 256, 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![64, 64, 4096],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
Let's try to run this code.
wgpu error: Validation Error
Caused by:
In Device::create_compute_pipeline, label = 'reduce_matrix_f32'
Error matching shader requirements against the pipeline
Shader entry point's workgroup size [64, 256, 1] (16384 total invocations) must be less or equal to the per-dimension limit [1024, 1024, 1024] and the total invocation limit 1024
What happened? The error message indicates that the workgroup size exceeds the limits imposed by the WebGPU backend. The total number of invocations (64 * 256 * 1 = 16384) exceeds the maximum allowed invocations per workgroup, which is 1024. In other words, the CubeDim is too large for the GPU to handle. We need to find another way to parallelize the reduction without exceeding these limits.
A better approach
To address the issue, we will also parallelize using the CUBE_COUNT
and CUBE_POS
variables, which let us launch more invocations in parallel without exceeding the CUBE_DIM
limits. The CUBE_COUNT
passed at launch determines how many cubes are dispatched, and the CUBE_POS
variable gives each cube's position, which the kernel uses to locate its part of the 3D tensor.
use std::marker::PhantomData;
use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
input_shape: Vec<usize>,
client: ComputeClient<R::Server, R::Channel>,
_f: PhantomData<F>,
}
const LINE_SIZE: u32 = 4;
impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
type Input = GpuTensor<R, F>;
type Output = GpuTensor<R, F>;
fn prepare(&self) -> Self::Input {
GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
}
fn name(&self) -> String {
format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
}
fn sync(&self) {
future::block_on(self.client.sync())
}
fn execute(&self, input: Self::Input) -> Self::Output {
let output_shape: Vec<usize> = vec![self.input_shape[0]];
let output = GpuTensor::<R, F>::empty(output_shape, &self.client);
unsafe {
reduce_matrix::launch_unchecked::<F, R>(
&self.client,
CubeCount::Static(self.input_shape[0] as u32, 1, 1),
CubeDim::new(self.input_shape[1] as u32, 1, 1),
input.into_tensor_arg(LINE_SIZE as u8),
output.into_tensor_arg(LINE_SIZE as u8),
);
}
output
}
}
// Note the addition of the [Line] struct inside the tensor to guarantee that the data is contiguous and can be parallelized.
#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<Line<F>>, output: &mut Tensor<Line<F>>) {
let mut acc = Line::new(F::new(0.0f32)); // A [Line] is also necessary here
for i in 0..input.shape(2) / LINE_SIZE {
acc = acc + input[CUBE_POS_X * input.stride(0) + UNIT_POS_X * input.stride(1) + i];
}
output[CUBE_POS_X * output.stride(0) + UNIT_POS_X] = acc;
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(&device);
let bench1 = ReductionBench::<R, F> {
input_shape: vec![64, 256, 1024],
client: client.clone(),
_f: PhantomData,
};
let bench2 = ReductionBench::<R, F> {
input_shape: vec![64, 64, 4096],
client: client.clone(),
_f: PhantomData,
};
for bench in [bench1, bench2] {
println!("{}", bench.name());
println!("{}", bench.run(TimingMethod::System));
}
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
Now, let's run the code again.
wgpu<wgsl>-reduction-[64, 256, 1024]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 1.483ms
Variance 27.000ns
Median 1.535ms
Min 1.239ms
Max 1.808ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[64, 64, 4096]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 924.409µs
Variance 189.000ns
Median 945.270µs
Min 600.110µs
Max 2.098ms
―――――――――――――――――――――――――
It runs, and it is fast! The reduction is now parallelized across multiple cubes, and performance is significantly improved compared to the previous implementation. The results show that the reduction is efficient and can handle larger tensors without exceeding the GPU limits. It is also almost as fast as the 2D reduction, even though there are more elements to reduce, because the work is now spread across both cubes and the units within each cube. See the parallel reduction chapter if you need a refresher on the different parallelization levels used in CubeCL. It is also worth noting that the performance and the optimal CUBE_COUNT
and CUBE_DIM
values may vary depending on the GPU architecture and the specific workload. You may need to experiment with different values to find the best configuration for your use case.
Examples
For now, we only have a limited number of examples, listed in the table below. Note that you can also look at how matmul is implemented. Don't hesitate to contribute more examples to the CubeCL repository!
Example | Description |
---|---|
GeLU | Implement the GeLU activation function using CubeCL. |
Sum Things | Sum some numbers using many different variations leveraging the CubeCL core features and trait support. |
Normalization | Show how to use normalization on vectorized elements. |
Device Sharing | Share a WGPU device between CubeCL and other services. |
Fusing | Use comptime to select operations. |
Core Features
In this section, we'll explore the core features of CubeCL and what sets it apart from other high-performance computing languages like CUDA, OpenCL, and HIP.
Comptime
CubeCL isn't just a new compute language: though it feels like you are writing GPU kernels, you are, in fact, writing compiler plugins that you can fully customize! Comptime is a way to modify the compiler IR at runtime when compiling a kernel for the first time.
This enables a lot of optimizations and flexibility without having to write many separate variants of the same kernels to ensure maximal performance.
Loop Unrolling
You can easily unroll loops in CubeCL using the unroll
attribute on top of a for loop.
#[cube(launch)]
fn sum<F: Float>(input: &Array<F>, output: &mut Array<F>, #[comptime] end: Option<u32>) {
    let unroll = end.is_some();
    let end = end.unwrap_or_else(|| input.len());
    let mut sum = F::new(0.0);
    #[unroll(unroll)]
    for i in 0..end {
        sum += input[i];
    }
    output[ABSOLUTE_POS] = sum;
}
Note that if you provide a variable end
that can't be determined at compile time, a panic will
arise when trying to execute that kernel.
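For comparison, here is a hedged sketch (not from the book) where the bound is a compile-time literal, so the loop can be unrolled unconditionally:
use cubecl::prelude::*;
#[cube]
fn sum_first_four<F: Float>(input: &Array<F>) -> F {
    let mut sum = F::new(0.0);
    // The bound is a literal, so the loop body is repeated four times at compile time.
    #[unroll]
    for i in 0..4u32 {
        sum += input[i];
    }
    sum
}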
Feature Specialization
You could also achieve the sum using plane operations. We will write a kernel that uses that instruction when available based on a comptime feature flag. When it isn't available, it will fall back on the previous implementation, essentially making it portable.
#[cube(launch)]
fn sum_plane<F: Float>(
    input: &Array<F>,
    output: &mut Array<F>,
    #[comptime] plane: bool,
    #[comptime] end: Option<u32>,
) {
    if plane {
        output[UNIT_POS] = plane_sum(input[UNIT_POS]);
    } else {
        sum_basic(input, output, end);
    }
}
Note that no branching will actually occur on the GPU, since three different kernels can be generated from the last code snippet. You can also use the trait system to achieve a similar behavior.
Vectorization
High-performance kernels should rely on SIMD instructions whenever possible, but doing so can quickly get pretty complicated! With CubeCL, you can specify the vectorization factor of each input variable when launching a kernel. Inside the kernel code, you still use only one type, which is dynamically vectorized and supports automatic broadcasting. The runtimes are able to compile kernels and have all the necessary information to use the best instructions! However, since the algorithmic behavior may depend on the vectorization factor, CubeCL allows you to access it directly in the kernel when needed, without any performance loss, using the comptime system!
Autotune
Autotuning drastically simplifies kernel selection by running small benchmarks at runtime to figure out the best kernels with the best configurations to run on the current hardware; an essential feature for portability. This feature combines gracefully with comptime to test the effect of different comptime values on performance; sometimes it can be surprising!
Even if the benchmarks may add some overhead when running the application for the first time, the information gets cached on the device and will be reused. It is usually a no-brainer trade-off for throughput-oriented programs such as deep learning models. You can even ship the autotune cache with your program, reducing cold start time when you have more control over the deployment target.
Querying hardware features
Some features and datatypes are only supported on some hardware or some backends. They can be queried with:
client.properties().feature_enabled(feature)
Also see Feature
.
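For example, here is a hedged sketch of checking for plane support before choosing a kernel variant (the Feature path and variant name are assumptions; check the API docs for the exact items):
use cubecl::prelude::*;
fn plane_supported<R: Runtime>(client: &ComputeClient<R::Server, R::Channel>) -> bool {
    // Query the device properties for the plane (subgroup/warp) feature.
    client.properties().feature_enabled(cubecl::Feature::Plane)
}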
Overview
Features
Also requires device support
Feature | CUDA | ROCm | WGPU (WGSL) | WGPU (SPIR-V) |
---|---|---|---|---|
Plane | ✔️ | ✔️ | ✔️ | ✔️ |
CMMA | ✔️ | ✔️ | ❌ | ✔️ |
Datatypes
flex32 is represented as f32 everywhere except SPIR-V, with no reduced precision. f64 is not supported for all operations.
Type | CUDA | ROCm | WGPU (WGSL) | WGPU (SPIR-V) |
---|---|---|---|---|
u8 | ✔️ | ✔️ | ❌ | ✔️ |
u16 | ✔️ | ✔️ | ❌ | ✔️ |
u32 | ✔️ | ✔️ | ✔️ | ✔️ |
u64 | ✔️ | ✔️ | ❌ | ✔️ |
i8 | ✔️ | ✔️ | ❌ | ✔️ |
i16 | ✔️ | ✔️ | ❌ | ✔️ |
i32 | ✔️ | ✔️ | ✔️ | ✔️ |
i64 | ✔️ | ✔️ | ❌ | ✔️ |
f16 | ✔️ | ✔️ | ❌ | ✔️ |
bf16 | ✔️ | ✔️ | ❌ | ❌ |
flex32 | ❔ | ❔ | ❔ | ✔️ |
tf32 | ✔️ | ❌ | ❌ | ❌ |
f32 | ✔️ | ✔️ | ✔️ | ✔️ |
f64 | ❔ | ❔ | ❌ | ❔ |
bool | ✔️ | ✔️ | ✔️ | ✔️ |
Datatype Details
Flex32
Relaxed-precision 32-bit float. Minimum range and precision are equivalent to f16
, but may be
higher. Defaults to f32
when relaxed precision isn't supported.
Tensor-Float32
19-bit CUDA-only type that should only be used as a CMMA matrix type. May be able to reinterpret
from f32
, but officially undefined. Use Cast::cast_from
to safely convert.
Feature Details
Plane
Plane-level operations, e.g. plane_sum, plane_elect.
Cooperative Matrix Multiply-Add (CMMA)
Plane-level cooperative matrix multiply-add operations. Maps to wmma
in CUDA and
CooperativeMatrixMultiply
in SPIR-V. Features are registered for each size and datatype that is
supported by the hardware. For supported functions, see
cmma
.
Language Support
In this section, we will highlight key language features of CubeCL and demonstrate how to use them in your kernels to enhance performance, portability, and maintainability.
Trait Support
CubeCL partially supports traits to modularize your kernel code without any overhead. For now most features are supported except stateful functions.
#[cube]
trait MyTrait {
    /// Supported
    fn my_function(x: &Array<f32>) -> f32;
    /// Unsupported
    fn my_function_2(&self, x: &Array<f32>) -> f32;
}
The trait system allows you to do specialization quite easily. Let's take the same example as in the comptime section.
First you can define your trait. Note that if you use your trait from the launch function, you will
need to add 'static + Send + Sync
.
#[cube]
trait SumKind: 'static + Send + Sync {
    fn sum<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F;
}
Then we can define some implementations:
struct SumBasic;
struct SumPlane;
#[cube]
impl SumKind for SumBasic {
    fn sum<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F {
        let unroll = end.is_some();
        let end = end.unwrap_or_else(|| input.len());
        let mut sum = F::new(0.0);
        #[unroll(unroll)]
        for i in 0..end {
            sum += input[i];
        }
        sum
    }
}
#[cube]
impl SumKind for SumPlane {
    fn sum<F: Float>(input: &Slice<F>, #[comptime] _end: Option<u32>) -> F {
        plane_sum(input[UNIT_POS])
    }
}
Associated types are also supported. Let's say you want to create a series from a sum.
#[cube]
trait CreateSeries: 'static + Send + Sync {
    type SumKind: SumKind;
    fn execute<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F;
}
You may want to define what kind of series you want to create using an implementation.
struct SumThenMul<K: SumKind> {
    _p: PhantomData<K>,
}
#[cube]
impl<K: SumKind> CreateSeries for SumThenMul<K> {
    type SumKind = K;
    fn execute<F: Float>(input: &Slice<F>, #[comptime] end: Option<u32>) -> F {
        let val = Self::SumKind::sum(input, end);
        val * input[UNIT_POS]
    }
}
It's actually not the best example of using associated types, but it shows how they are totally supported with CubeCL.
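To tie it together, a kernel can take the series implementation as a generic parameter; the sketch below is an assumption-level example (the to_slice call and exact signatures are assumptions, not confirmed by this book) showing compile-time specialization through the trait:
use cubecl::prelude::*;
#[cube(launch)]
fn series_kernel<F: Float, S: CreateSeries>(
    input: &Array<F>,
    output: &mut Array<F>,
    #[comptime] end: Option<u32>,
) {
    // The concrete series (and therefore the sum kind) is chosen at compile time.
    output[UNIT_POS] = S::execute(&input.to_slice(), end);
}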
Enum Support
CubeCL provides robust support for Rust enums, enabling you to express variant-based logic in your GPU kernels. Enums can be used as kernel arguments, returned from kernels, or as intermediate types within your GPU code. This allows you to write expressive, idiomatic Rust code that maps efficiently to GPU kernels.
Defining enums
To use an enum in a CubeCL kernel, simply derive the required traits on the enum you want to use:
- CubeType enables the enum to be used as a CubeCL type in a kernel.
- CubeLaunch allows the enum to be used as a kernel argument or return type.
Enums can also have data associated with their variants, as long as all fields implement the required CubeCL traits, here's an example that is available in cubecl-std:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub enum CubeOption<T: CubeLaunch + CubeType> {
Some(T),
None,
}
Using enums in kernels
Enums can be passed as kernel arguments, returned from kernels, or used as local variables:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch, Clone, Copy)]
pub enum Function {
AffineTransformation { a: f32, b: f32 },
Cos,
DivideScalar(f32),
}
#[cube(launch_unchecked)]
pub fn kernel_enum_example(
input: &Array<Line<f32>>,
output: &mut Array<Line<f32>>,
function: Function,
) {
output[UNIT_POS] = match function {
Function::AffineTransformation { a, b } => Line::new(a) * input[UNIT_POS] + Line::new(b),
Function::Cos => Line::cos(input[UNIT_POS]),
Function::DivideScalar(coef) => input[UNIT_POS] / Line::new(coef),
}
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let input = client.create(f32::as_bytes(&[1.0, -2.0, 0.5]));
let output = client.empty(3 * core::mem::size_of::<f32>());
unsafe {
kernel_enum_example::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(3, 1, 1),
ArrayArg::from_raw_parts::<f32>(&input, 3, 2),
ArrayArg::from_raw_parts::<f32>(&output, 3, 2),
FunctionArgs::AffineTransformation {
a: ScalarArg::new(1.0),
b: ScalarArg::new(2.0),
},
)
};
println!(
"Executed kernel_enum_example with runtime {:?} => {:?}",
R::name(&client),
f32::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
You can also use enums with data in pattern matching:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub enum CubeOption<T: CubeType> {
Some(T),
None,
}
#[cube(launch_unchecked)]
pub fn kernel_enum_option(input: &Array<f32>, output: &mut Array<f32>, maybe: CubeOption<Array<f32>>) {
output[UNIT_POS] = match maybe {
CubeOption::Some(val) => input[UNIT_POS] + val[UNIT_POS],
CubeOption::None => input[UNIT_POS],
};
}
Adding methods to enums
You can implement methods for enums using the #[cube]
attribute on the impl
block:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch, Clone, Copy)]
pub enum Function {
AffineTransformation { a: f32, b: f32 },
Cos,
DivideScalar(f32),
}
#[cube]
impl Function {
pub fn apply(self, x: Line<f32>) -> Line<f32> {
match self {
Function::AffineTransformation { a, b } => Line::new(a) * x + Line::new(b),
Function::Cos => Line::cos(x),
Function::DivideScalar(coef) => x / Line::new(coef),
}
}
}
#[cube(launch_unchecked)]
pub fn kernel_enum_example(
input: &Array<Line<f32>>,
output: &mut Array<Line<f32>>,
function: Function,
) {
output[UNIT_POS] = function.apply(input[UNIT_POS]);
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let input = client.create(f32::as_bytes(&[1.0, -2.0, 0.5]));
let output = client.empty(3 * core::mem::size_of::<f32>());
unsafe {
kernel_enum_example::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(3, 1, 1),
ArrayArg::from_raw_parts::<f32>(&input, 3, 2),
ArrayArg::from_raw_parts::<f32>(&output, 3, 2),
FunctionArgs::AffineTransformation {
a: ScalarArg::new(1.0),
b: ScalarArg::new(2.0),
},
)
};
println!(
"Executed kernel_enum_example with runtime {:?} => {:?}",
R::name(&client),
f32::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
Struct Support
CubeCL provides robust support for Rust structs, allowing you to organize and modularize your kernel code with zero-cost abstractions. Structs can be used as kernel arguments, returned from kernels, or as intermediate types within your GPU code. This enables you to write idiomatic, maintainable Rust code that maps efficiently to GPU kernels.
Defining structs
To use a struct in a CubeCL kernel, simply derive the required traits on the struct that you want to use:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct Pair<T: CubeLaunch> {
pub left: T,
pub right: T,
}
- CubeType enables the struct to be used as a CubeCL type in a kernel.
- CubeLaunch allows the struct to be used as a kernel argument or return type.
Structs can contain other structs, arrays, or generic parameters, as long as all fields implement the required CubeCL traits. Generics are also supported, allowing you to create reusable types that can be instantiated with different types.
Using structs in kernels
Structs can be passed as kernel arguments if annotated with CubeLaunch
, returned from kernels, or used as local variables:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct Pair<T: CubeLaunch> {
pub left: T,
pub right: T,
}
#[cube(launch_unchecked)]
pub fn kernel_struct_example(pair: &Pair<Array<f32>>, output: &mut Array<f32>) {
output[UNIT_POS] = pair.left[UNIT_POS] + pair.right[UNIT_POS];
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let left = [f32::from_int(1)];
let left = client.create(f32::as_bytes(&left));
let right = [f32::from_int(1)];
let right = client.create(f32::as_bytes(&right));
let output = client.empty(core::mem::size_of::<f32>());
unsafe {
kernel_struct_example::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
PairLaunch::new(
ArrayArg::from_raw_parts::<f32>(&left, 1, 1),
ArrayArg::from_raw_parts::<f32>(&right, 1, 1),
),
ArrayArg::from_raw_parts::<f32>(&output, 1, 1),
)
};
println!(
"Executed kernel_struct_example with runtime {:?} => {:?}",
R::name(&client),
f32::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
You can also mutate struct fields if the struct is passed as a mutable reference:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct Pair<T: CubeLaunch> {
pub left: T,
pub right: T,
}
#[cube(launch_unchecked)]
pub fn kernel_struct_mut(output: &mut Pair<Array<f32>>) {
output.left[UNIT_POS] = 42.0;
output.right[UNIT_POS] = 3.14;
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let left = [f32::from_int(1)];
let left = client.create(f32::as_bytes(&left));
let right = [f32::from_int(1)];
let right = client.create(f32::as_bytes(&right));
unsafe {
kernel_struct_mut::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
PairLaunch::new(
ArrayArg::from_raw_parts::<f32>(&left, 1, 1),
ArrayArg::from_raw_parts::<f32>(&right, 1, 1),
),
)
};
println!(
"Executed kernel_struct_mut with runtime {:?} => ({:?}, {:?})",
R::name(&client),
f32::from_bytes(&client.read_one(left.binding())),
f32::from_bytes(&client.read_one(right.binding())),
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
Comptime fields
You can mark struct fields as comptime, which means their values are known at kernel compilation time and can be used for specialization:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct TaggedArray {
pub array: Array<f32>,
#[cube(comptime)]
pub tag: String,
}
#[cube(launch_unchecked)]
pub fn kernel_with_tag(output: &mut TaggedArray) {
if UNIT_POS == 0 {
if comptime! {&output.tag == "zero"} {
output.array[0] = 0.0;
} else {
output.array[0] = 1.0;
}
}
}
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
let client = R::client(device);
let output = client.empty(core::mem::size_of::<F>());
unsafe {
kernel_with_tag::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
TaggedArrayLaunch::new(
ArrayArg::<R>::from_raw_parts::<F>(&output, 1, 1),
&"not_zero".to_string(),
),
)
};
println!(
"Executed kernel_with_tag with runtime {:?} => {:?}",
R::name(&client),
F::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
Adding methods to struct
You can implement methods for structs using the #[cube]
attribute. Please note that the #[cube]
attribute must be on the impl block. Here's an example:
use cubecl::prelude::*;
#[derive(CubeType, CubeLaunch)]
pub struct Pair<T: CubeLaunch> {
pub left: T,
pub right: T,
}
#[cube]
impl Pair<Array<f32>> {
pub fn sum(&self, index: u32) -> f32 {
self.left[index] + self.right[index]
}
}
#[cube(launch_unchecked)]
pub fn kernel_struct_example(pair: &Pair<Array<f32>>, output: &mut Array<f32>) {
output[UNIT_POS] = pair.sum(UNIT_POS);
}
pub fn launch<R: Runtime>(device: &R::Device) {
let client = R::client(device);
let left = [f32::from_int(1)];
let left = client.create(f32::as_bytes(&left));
let right = [f32::from_int(1)];
let right = client.create(f32::as_bytes(&right));
let output = client.empty(core::mem::size_of::<f32>());
unsafe {
kernel_struct_example::launch_unchecked::<R>(
&client,
CubeCount::Static(1, 1, 1),
CubeDim::new(1, 1, 1),
PairLaunch::new(
ArrayArg::from_raw_parts::<f32>(&left, 1, 1),
ArrayArg::from_raw_parts::<f32>(&right, 1, 1),
),
ArrayArg::from_raw_parts::<f32>(&output, 1, 1),
)
};
println!(
"Executed kernel_struct_example with runtime {:?} => {:?}",
R::name(&client),
f32::from_bytes(&client.read_one(output.binding()))
);
}
fn main() {
launch::<cubecl::wgpu::WgpuRuntime>(&Default::default());
}
Advanced usage
This section contains useful information on advanced features of CubeCL.
Configuration
CubeCL provides a flexible and powerful configuration system to control logging, autotuning, profiling, and compilation behaviors.
Overview
By default, CubeCL loads its configuration from a TOML file (cubecl.toml
or CubeCL.toml
) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.
You can also override configuration options using environment variables, which is useful for CI, debugging, or deployment scenarios.
Configuration File Structure
A typical cubecl.toml
file might look like this:
[profiling]
logger = { level = "basic", stdout = true }
[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }
[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
Each section configures a different aspect of CubeCL:
- profiling: Controls performance profiling and logging.
- autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
- compilation: Manages kernel compilation logging and cache.
Configuration Options
Profiling
The [profiling] section controls how CubeCL logs profiling information.
Log Levels:
- disabled: No profiling logs.
- minimal: Only logs which kernels run.
- basic: Adds basic profiling info.
- medium: More detailed profiling.
- full: Maximum detail.
Example:
[profiling]
logger = { level = "basic", stdout = true }
Autotune
The [autotune] section configures how aggressively CubeCL autotunes kernels and where it stores autotune results.
Autotune Levels:
- minimal: Fastest, least thorough.
- balanced: Good trade-off (default).
- extensive: More thorough.
- full: Most thorough, slowest.
Log Levels: disabled, minimal, full
Example:
[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }
Cache Location (if enabled):
- local: Current directory
- target: Project's target directory (default)
- global: System config directory
- file: Custom path
Compilation
The [compilation] section manages logging and caching for kernel compilation.
Log Levels:
- disabled: No logs.
- basic: Logs when kernels are compiled.
- full: Logs full details, including source code.
Example:
[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
Environment Variable Overrides
CubeCL supports several environment variables to override configuration at runtime:
- CUBECL_DEBUG_LOG: Controls logging output.
  - "stdout": Log to stdout.
  - "stderr": Log to stderr.
  - "1" / "true": Log to /tmp/cubecl.log.
  - "0" / "false": Disable logging.
  - Any other value: Treated as a file path.
- CUBECL_DEBUG_OPTION: Sets log verbosity.
  - "debug": Full compilation and autotune logs, medium profiling.
  - "debug-full": Full logs for all.
  - "profile", "profile-medium", "profile-full": Set profiling log level.
- CUBECL_AUTOTUNE_LEVEL: Sets autotune level.
  - "minimal" / "0"
  - "balanced" / "1"
  - "extensive" / "2"
  - "full" / "3"
Example (Linux/macOS):
export CUBECL_DEBUG_LOG=stdout
export CUBECL_AUTOTUNE_LEVEL=full
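On Windows, the same variables can be set in a PowerShell session, for example:
$env:CUBECL_DEBUG_LOG = "stdout"
$env:CUBECL_AUTOTUNE_LEVEL = "full"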
Programmatic Configuration
You can also set the global configuration from Rust code before CubeCL is initialized:
let config = cubecl::config::GlobalConfig {
    profiling: ...,
    autotune: ...,
    compilation: ...,
};
cubecl::config::GlobalConfig::set(config);
Note: You must call GlobalConfig::set before any CubeCL operations, and only once per process.
Logging
CubeCL supports logging to multiple destinations simultaneously:
- File (with append/overwrite)
- Stdout
- Stderr
- Rust log crate (for integration with other logging frameworks)
You can configure these in the logger field for each section.
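For instance, a profiling logger that writes to stdout and to a log file at the same time could look like the sketch below; it simply combines the logger fields already used in the examples in this chapter, so adjust it to your CubeCL version if needed.
[profiling]
logger = { level = "basic", stdout = true, file = "cubecl.log", append = true }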
Saving the Default Configuration
To generate a default configuration file:
cubecl::config::GlobalConfig::save_default("cubecl.toml").unwrap();
Example: Full Configuration
[profiling]
logger = { level = "medium", stdout = true }
[autotune]
level = "extensive"
logger = { level = "full", file = "autotune.log", append = false }
[compilation]
logger = { level = "full", file = "compile.log", append = true }
Algorithm reference
In this section, we introduce different algorithms provided by CubeCL. This is a best-effort list, and we focus first on nontrivial algorithms that deserve more explanation than is reasonable to put in the API documentation. This section is also more technical than the others, as it serves two purposes. First, it is a reference for users interested in the lower-level details of CubeCL. Second, it is a reference for developers who want to update the implementations, as the algorithms often get obscured by optimization details.
Quantized matrix multiplication
To make matrix multiplication faster, we replace floating-point arithmetic using f32 with integer arithmetic using a mix of u8, u16 and i32.
The benefits are twofold.
First, we replace Tensor<f32> with Tensor<u8> to reduce memory cost by a factor of 4, which leads to faster read and write operations in global memory.
Second, integer operations are often faster than their floating-point counterparts.
In this section, we start by presenting a more mathematical overview of the algorithm, before discussing implementation.
Mathematical formulation
Scalar quantization
A real number \(a\) can be approximated by an integer \(q\) using the formula
\[
a \approx s(q - z).
\]
In this equation \(s\) is a scaling factor and is also a real number,
while \(z\) is called the zero-offset and is an integer.
In theory,
with this approximation,
we can represent exactly all real numbers that are integral multiples of \(s\).
All other real numbers are rounded to the closest representable value.
However, in practice, the range of \(q\) is limited by its representation (e.g. u8, i32).
Hence, the zero-offset \(z\) allows us to slide the interval of representable numbers toward the interval of interest for a particular application.
Also, by using the same type for \(q\) and \(z\), we ensure that 0 is exactly representable.
The multiplication of two real numbers is equivalent to \[ a b = s_a s_b (q_a - z_a) (q_b - z_b). \] However, we are more interested in the quantized version \(q_c\) of \(c = ab \). Given we want to approximate \(c\) with scaling \(s_c\) and zero-offset \(z_c\), we have \[ q_c = z_c + \frac{s_a s_b}{s_c} (q_a - z_a) (q_b - z_b). \] Except for the factor \( (s_a s_b) / s_c \), the above equation involves only integer arithmetic. However, we can always find two integers \(u, v\) such that \[ \frac uv \approx \frac{s_a s_b}{s_c} \] is a satisfying approximation. This leads to the final approximation for quantized multiplication \[ q_c \approx z_c + \frac uv (q_a - z_a)(q_b - z_b) \] requiring only integer arithmetic.
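To make this concrete, here is a small, self-contained Rust sketch (plain CPU code, independent of the CubeCL API; the helper names and the scale/offset values are chosen purely for illustration) that quantizes two scalars, forms the product (q_a - z_a)(q_b - z_b) with integer arithmetic, and compares the dequantized result with the floating-point product:
// Quantize a real value a as q = round(a / s) + z, clamped to the u8 range.
fn quantize(a: f32, s: f32, z: u8) -> u8 {
    ((a / s).round() as i32 + z as i32).clamp(0, u8::MAX as i32) as u8
}

// Dequantize: a is approximately s * (q - z).
fn dequantize(q: u8, s: f32, z: u8) -> f32 {
    s * (q as i32 - z as i32) as f32
}

fn main() {
    // Illustrative scales and zero-offsets for a, b and c = a * b.
    let (s_a, z_a) = (0.02_f32, 10_u8);
    let (s_b, z_b) = (0.05_f32, 20_u8);
    let (s_c, z_c) = (0.02_f32, 0_u8);

    let (a, b) = (1.3_f32, 2.2_f32);
    let (q_a, q_b) = (quantize(a, s_a, z_a), quantize(b, s_b, z_b));

    // The core product (q_a - z_a)(q_b - z_b) uses integer arithmetic only.
    let int_prod = (q_a as i32 - z_a as i32) * (q_b as i32 - z_b as i32);

    // For clarity, sigma = s_a * s_b / s_c is applied in floating point here;
    // the integer u/v trick for this step is described later in this section.
    let sigma = s_a * s_b / s_c;
    let q_c = ((z_c as f32 + sigma * int_prod as f32).round() as i32)
        .clamp(0, u8::MAX as i32) as u8;

    println!("float product:       {}", a * b); // about 2.86
    println!("dequantized product: {}", dequantize(q_c, s_c, z_c)); // also about 2.86
}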
Matrix quantization
The same idea holds for matrix multiplication. To distinguish matrices from scalars, we use capital letters for the former and lowercase letters for the latter.
A real matrix \( A \) is approximated by an integer matrix \( Q \) using \[ A \approx s (Q - z N). \] Here \( N \) is a matrix of ones the same size as \( A \). For two matrices \(A \) and \( B \) with respective shape \(m \times k\) and \(k \times n\) and their product \( C \) of shape \( m \times n \), we have, similar to the scalar case that \[ Q_c \approx z_c N_c + \frac uv (Q_a - z_a N_a)(Q_b - z_b N_b). \]
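For completeness, the intermediate step is the same as in the scalar case: equating the two approximations of \( C = AB \) and solving for \( Q_c \) gives
\[
s_c (Q_c - z_c N_c) \approx s_a s_b (Q_a - z_a N_a)(Q_b - z_b N_b)
\quad\Longrightarrow\quad
Q_c \approx z_c N_c + \frac{s_a s_b}{s_c} (Q_a - z_a N_a)(Q_b - z_b N_b),
\]
where the ratio \( \frac{s_a s_b}{s_c} \) is again approximated by \( \frac uv \) with integers \( u \) and \( v \).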
Implementation
As an example,
we describe how to implement the quantized matrix multiplication
where the elements of \(Q_a\), \(Q_b\) and \(Q_c\) and the zero-offsets are represented as u8.
To compute \(Q_a - z_a N_a \), we first convert the values to i16 before performing the subtraction.
Then, we can compute the product \((Q_a - z_a N_a)(Q_b - z_b N_b)\) by converting the values to i32 before multiplying.
Of course,
in practice, we perform all these conversions on-the-fly to avoid wastefully allocating new matrices.
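As a rough illustration (plain Rust rather than CubeCL code, with hypothetical helper names), accumulating a single output element with these on-the-fly casts looks like this:
// One element of (Q_a - z_a N_a)(Q_b - z_b N_b), with casts done on the fly.
fn dot_i32(row_a: &[u8], col_b: &[u8], z_a: u8, z_b: u8) -> i32 {
    let mut acc: i32 = 0;
    for (&q_a, &q_b) in row_a.iter().zip(col_b.iter()) {
        // u8 -> i16 so the zero-offset subtraction cannot overflow.
        let a = q_a as i16 - z_a as i16;
        let b = q_b as i16 - z_b as i16;
        // i16 -> i32 for the multiply-accumulate.
        acc += a as i32 * b as i32;
    }
    acc
}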
Now, suppose that \(x\) is a single element in the resulting matrix and \(y\) is the element at the same position in \(Q_c\).
We still need to compute the following
\[
y = z_c + \frac uv \cdot x.
\]
The tricky part here is the product.
First,
we impose that \( v \) is a power of 2 so that dividing by \( v \)
is equivalent to right-shifting the product \( u x \).
Then, we need to find the best values \( u \) and \( v \)
for the scaling factor \( \sigma = \frac{s_a s_b}{s_c} \).
The trick is to cleverly multiply \( \sigma \) by 1, to get a form that allows us to work with powers of 2:
\[
\sigma = \frac{2^{31 - f}}{2^{31 - f}} \sigma
\]
where \(2^f\) is the smallest power of 2 larger than \(\sigma\).
For example, if \(\sigma = 0.3\), then \(f = -1\) as \(2^{-1} = 0.5 > 0.3 \)
and \(2^{-2} = 0.25 < 0.3\).
From this, we deduce that we can use \(u = 2^{31 - f} \sigma\) rounded to the nearest i64 value and \(v = 2^{31 - f}\).
This gives us a 31-bit approximation for multiplying by \(\sigma\), which is the best we can achieve when the other multiplicand is an i32.
Indeed, we need to keep one bit for the sign.
To properly round the product,
one can add \(\frac v 2\) to the product before right shifting.
A naive implementation of the above algorithm looks like the following.
fn scaling_ratio(sigma: f32) -> (i64, u32) {
    // 2^log is the smallest power of 2 larger than sigma.
    let log = sigma.log2().ceil() as i32;
    // u approximates sigma * 2^(31 - log); v = 2^(31 - log) becomes a right shift.
    let u = (sigma * 2.0_f32.powi(31 - log)).round() as i64;
    let v_shift = (31 - log) as u32;
    (u, v_shift)
}

fn approx_mul(x: i32, u: i64, v_shift: u32) -> i32 {
    let prod = (x as i64) * u;
    // Add v / 2 before shifting to round to the nearest value instead of truncating.
    let rounding: i64 = 1 << (v_shift - 1);
    let prod_with_rounding = prod + rounding;
    (prod_with_rounding >> v_shift) as i32
}

fn clamp_to_u8(x: i32) -> u8 {
    if x < 0 {
        0
    } else if x > u8::MAX as i32 {
        u8::MAX
    } else {
        x as u8
    }
}

struct Matrix {
    scaling: f32,
    zero_offset: u8,
    // ... other fields to store the matrix elements.
}

impl Matrix {
    fn quantized_mul(&self, other: &Self, output: &mut Self) {
        // Assume the shapes of the matrices match.
        let sigma = self.scaling * other.scaling / output.scaling;
        let (u, v_shift) = scaling_ratio(sigma);
        for row in 0..self.row_count() {
            for col in 0..other.col_count() {
                let mut sum: i32 = 0;
                for middle in 0..self.col_count() {
                    // Cast to i16 for the zero-offset subtraction, then to i32 for the product.
                    let a = self.get(row, middle) as i16 - self.zero_offset as i16;
                    let b = other.get(middle, col) as i16 - other.zero_offset as i16;
                    sum += (a as i32) * (b as i32);
                }
                sum = approx_mul(sum, u, v_shift);
                output.update(row, col, clamp_to_u8(sum + output.zero_offset as i32));
            }
        }
    }

    // Return the value at (row, col).
    fn get(&self, row: usize, col: usize) -> u8 { /* ... */ }

    // Replace the value at (row, col) with the given value.
    fn update(&mut self, row: usize, col: usize, value: u8) { /* ... */ }

    // Return the number of rows of the matrix.
    fn row_count(&self) -> usize { /* ... */ }

    // Return the number of columns of the matrix.
    fn col_count(&self) -> usize { /* ... */ }
}
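As a quick sanity check of the rescaling helpers above (our own test, not part of CubeCL), multiplying 1000 by \(\sigma = 0.3\) through the integer path should give 300:
fn main() {
    let (u, v_shift) = scaling_ratio(0.3);
    let approx = approx_mul(1000, u, v_shift);
    // 1000 * 0.3 = 300; the integer approximation rounds to the same value.
    println!("approx_mul(1000, u, v_shift) = {approx}"); // prints 300
}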
Of course, in CubeCL, we strive to provide the fastest implementation for GPU devices. As such, the example above emphasizes the correct type casting to demonstrate how this is achieved in CubeCL.