Simple Reduction

To get started with CubeCL, we will implement a simple reduction operation on a multidimensional array (tensor). This example will help you understand the basic concepts of CubeCL and how to use it to perform parallel computations on tensors.

An example of a CPU reduction in Rust without CubeCL

This example demonstrates how to perform a simple reduction operation on a multidimensional array (tensor) using pure Rust. The code is designed to be easy to understand and serves as a starting point for more complex operations that can be parallelized with CubeCL. It is not optimized for performance, but it illustrates the basic concepts of working with tensors and performing reductions.

CpuTensor Struct

Tensors are the basic data structure used in CubeCL to represent multidimensional arrays. Here is a simple implementation of a tensor in pure Rust. It is not optimized for performance, but it is easy to understand.

/// Example of a naive multidimensional tensor in pure Rust
#[derive(Debug, Clone)]
pub struct CpuTensor {
    /// Raw contiguous value buffer
    pub data: Vec<f32>,
    /// Number of elements to skip in `data` to advance by one along each dimension
    pub strides: Vec<usize>,
    /// Size of each dimension of the tensor
    pub shape: Vec<usize>,
}

/// Function to compute strides in a compact layout
fn compact_strides(shape: &[usize]) -> Vec<usize> {
    let rank = shape.len();
    let mut strides = vec![1; rank];
    // saturating_sub avoids an underflow when the shape is empty (rank 0)
    for i in (0..rank.saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

impl CpuTensor {
    /// Create a CpuTensor of the given shape filled with 0, 1, 2, ... in order
    pub fn arange(shape: Vec<usize>) -> Self {
        let size = shape.iter().product();
        let data = (0..size).map(|i| i as f32).collect();
        let strides = compact_strides(&shape);
        Self {
            data,
            strides,
            shape,
        }
    }

    /// Create a zero-filled CpuTensor of the given shape
    pub fn empty(shape: Vec<usize>) -> Self {
        let size = shape.iter().product();
        let data = vec![0.0; size];
        let strides = compact_strides(&shape);
        Self {
            data,
            strides,
            shape,
        }
    }

    /// Read the inner data
    pub fn read(self) -> Vec<f32> {
        self.data
    }
}
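
To make the role of the strides concrete, here is a minimal sketch of how a multidimensional index maps to an offset in the flat `data` buffer. The `linear_index` helper is hypothetical (it is not part of CubeCL or of the `CpuTensor` above), and `compact_strides` is repeated only so the snippet compiles on its own.

```rust
/// Row-major (compact) strides, same logic as `compact_strides` above.
fn compact_strides(shape: &[usize]) -> Vec<usize> {
    let rank = shape.len();
    let mut strides = vec![1; rank];
    for i in (0..rank.saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Hypothetical helper: flatten a multidimensional index into a buffer offset.
fn linear_index(index: &[usize], strides: &[usize]) -> usize {
    index.iter().zip(strides).map(|(i, s)| i * s).sum()
}

fn main() {
    let shape = [2, 3, 4];
    let strides = compact_strides(&shape);
    // Moving by one along dimension 0 skips 3 * 4 = 12 elements, and so on.
    assert_eq!(strides, vec![12, 4, 1]);
    // Element [1, 2, 3] lives at 1 * 12 + 2 * 4 + 3 * 1 = 23.
    assert_eq!(linear_index(&[1, 2, 3], &strides), 23);
    println!("strides = {:?}", strides);
}
```

In a compact layout the last stride is always 1, which is why code that only walks the last dimension can add the column index directly.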

Reduce function

The following function is a naive implementation of a reduction operation on a matrix. It sums the values of each row and stores the result in a new tensor. The input tensor is expected to be a 2D matrix, and the output tensor will be a 1D vector containing the sum of each row.

use cubecl_example::cpu_tensor::CpuTensor; // Change to the path of your own module containing the CpuTensor

/// Reduces the last dimension of a 2D matrix by summing the values of each row:
/// [0 1 2]    [0 + 1 + 2]    [3 ]
/// [3 4 5] -> [3 + 4 + 5] -> [12]
/// [6 7 8]    [6 + 7 + 8]    [21]
fn reduce_matrix(input: &CpuTensor, output: &mut CpuTensor) {
    for i in 0..input.shape[0] {
        let mut acc = 0.0f32;
        for j in 0..input.shape[1] {
            acc += input.data[i * input.strides[0] + j];
        }
        output.data[i] = acc;
    }
}
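
The function above adds `j` directly to the row offset, which assumes the last stride is 1 (a compact layout). As a sketch of what changes when that assumption does not hold, the variant below uses both strides. The name `reduce_matrix_strided` and the transposed-view setup are illustrative assumptions, and the struct is a minimal stand-in for `CpuTensor`.

```rust
/// Minimal stand-in for the `CpuTensor` above, so the snippet compiles alone.
struct CpuTensor {
    data: Vec<f32>,
    strides: Vec<usize>,
    shape: Vec<usize>,
}

/// Sum over the last dimension, using both strides to locate each element.
fn reduce_matrix_strided(input: &CpuTensor, output: &mut CpuTensor) {
    for i in 0..input.shape[0] {
        let mut acc = 0.0f32;
        for j in 0..input.shape[1] {
            acc += input.data[i * input.strides[0] + j * input.strides[1]];
        }
        output.data[i * output.strides[0]] = acc;
    }
}

fn main() {
    // 3x3 matrix holding 0..9 in row-major order.
    let data: Vec<f32> = (0..9).map(|i| i as f32).collect();
    // Transposed view: same buffer with the strides swapped, no data copied.
    let transposed = CpuTensor { data, strides: vec![1, 3], shape: vec![3, 3] };
    let mut output = CpuTensor { data: vec![0.0; 3], strides: vec![1], shape: vec![3] };

    reduce_matrix_strided(&transposed, &mut output);

    // Rows of the transposed view are the columns of the original matrix:
    // 0 + 3 + 6, 1 + 4 + 7, 2 + 5 + 8.
    assert_eq!(output.data, vec![9.0, 12.0, 15.0]);
    println!("{:?}", output.data);
}
```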

Launching code

The following code creates a 3x3 matrix, initializes the input tensor, and calls the reduce_matrix function to perform the reduction. The result is printed to the console.

use cubecl_example::cpu_tensor::CpuTensor; // Change to the path of your own module containing the CpuTensor

/// Reduces the last dimension of a 2D matrix by summing the values of each row:
/// [0 1 2]    [0 + 1 + 2]    [3 ]
/// [3 4 5] -> [3 + 4 + 5] -> [12]
/// [6 7 8]    [6 + 7 + 8]    [21]
fn reduce_matrix(input: &CpuTensor, output: &mut CpuTensor) {
    for i in 0..input.shape[0] {
        let mut acc = 0.0f32;
        for j in 0..input.shape[1] {
            acc += input.data[i * input.strides[0] + j];
        }
        output.data[i] = acc;
    }
}

fn launch() {
    let input_shape = vec![3, 3];
    let output_shape = vec![3];
    let input = CpuTensor::arange(input_shape);
    let mut output = CpuTensor::empty(output_shape);

    reduce_matrix(&input, &mut output);

    println!("Executed reduction => {:?}", output.read());
}

fn main() {
    launch();
}

A first example of a GPU reduction with CubeCL

This example demonstrates how to perform a simple reduction operation on a multidimensional array (tensor) using CubeCL. It is a simple implementation that will be used as a starting point to show how to use CubeCL in the next chapters.

GpuTensor struct

The GpuTensor struct is a representation of a tensor that resides on the GPU. It contains the data handle, shape, strides, and marker types for the runtime and floating-point type. The GpuTensor struct provides methods to create tensors, read data from the GPU, and convert them into tensor arguments for kernel execution. Please note that it is generic over the runtime and floating-point type, allowing it to work with different CubeCL runtimes and floating-point types (e.g., f16, f32). Also, the strides can be computed using the compact_strides function from the cubecl::std::tensor module, which will compute the strides for a given shape with a compact representation.

Another important concept is the ComputeClient, the interface through which a runtime runs kernels. Each runtime provides its own ComputeClient, which exposes methods to create tensors and read data back from the GPU. The ComputeClient sends compute tasks to a Server, which schedules them and runs the kernels on the GPU.

If you need a tensor library rather than defining your own kernels and tensors, use Mabor directly.

use std::marker::PhantomData;

use cubecl::{prelude::*, server::Handle, std::tensor::compact_strides};

/// Simple GpuTensor
#[derive(Debug)]
pub struct GpuTensor<R: Runtime, F: Float + CubeElement> {
    data: Handle,
    shape: Vec<usize>,
    strides: Vec<usize>,
    _r: PhantomData<R>,
    _f: PhantomData<F>,
}

impl<R: Runtime, F: Float + CubeElement> Clone for GpuTensor<R, F> {
    fn clone(&self) -> Self {
        Self {
            data: self.data.clone(), // Handle is a pointer to the data, so cloning it is cheap
            shape: self.shape.clone(),
            strides: self.strides.clone(),
            _r: PhantomData,
            _f: PhantomData,
        }
    }
}

impl<R: Runtime, F: Float + CubeElement> GpuTensor<R, F> {
    /// Create a GpuTensor of the given shape filled with 0, 1, 2, ... in order
    pub fn arange(shape: Vec<usize>, client: &ComputeClient<R::Server, R::Channel>) -> Self {
        let size = shape.iter().product();
        let data: Vec<F> = (0..size).map(|i| F::from_int(i as i64)).collect();
        let data = client.create(F::as_bytes(&data));

        let strides = compact_strides(&shape);
        Self {
            data,
            shape,
            strides,
            _r: PhantomData,
            _f: PhantomData,
        }
    }

    /// Create an empty GpuTensor with a shape
    pub fn empty(shape: Vec<usize>, client: &ComputeClient<R::Server, R::Channel>) -> Self {
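        // The buffer size is given in bytes, so scale the element count by the element size
        let size = shape.iter().product::<usize>() * core::mem::size_of::<F>();
        let data = client.empty(size);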

        let strides = compact_strides(&shape);
        Self {
            data,
            shape,
            strides,
            _r: PhantomData,
            _f: PhantomData,
        }
    }

    /// Create a TensorArg to pass to a kernel
    pub fn into_tensor_arg(&self, line_size: u8) -> TensorArg<'_, R> {
        unsafe { TensorArg::from_raw_parts::<F>(&self.data, &self.strides, &self.shape, line_size) }
    }

    /// Return the data from the client
    pub fn read(self, client: &ComputeClient<R::Server, R::Channel>) -> Vec<F> {
        let bytes = client.read_one(self.data.binding());
        F::from_bytes(&bytes).to_vec()
    }
}
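
As `client.create(F::as_bytes(&data))` and `F::from_bytes(&bytes)` suggest, data crosses the host/GPU boundary as raw bytes. The following plain-Rust sketch illustrates that round trip for `f32` without any CubeCL dependency. `to_bytes` and `from_bytes` are hypothetical stand-ins written for illustration, not the CubeCL implementations.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for `F::as_bytes`: serialize f32 values into raw bytes.
fn to_bytes(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_ne_bytes()).collect()
}

/// Hypothetical stand-in for `F::from_bytes`: decode raw bytes back into f32 values.
fn from_bytes(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(size_of::<f32>())
        .map(|c| f32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    let shape = [3usize, 3];
    let len: usize = shape.iter().product();
    let values: Vec<f32> = (0..len).map(|i| i as f32).collect();

    let bytes = to_bytes(&values);
    // A buffer for `len` elements needs `len * size_of::<f32>()` bytes.
    assert_eq!(bytes.len(), len * size_of::<f32>());
    // Decoding the bytes gives back the original values.
    assert_eq!(from_bytes(&bytes), values);
    println!("{} elements -> {} bytes", len, bytes.len());
}
```

The point to keep in mind is that an element count and a byte count differ by a factor of the element size, which matters whenever a buffer is allocated by size rather than from existing data.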

Reduce function

This function is similar to the previous example, but it uses CubeCL's cube macro to define a kernel. The kernel performs the same reduction, summing the values of each row and storing the result in a new tensor. The type parameter F is a generic type that implements the Float trait, allowing the function to work with different floating-point types (e.g., f32, f64). The Tensor type is provided by cubecl::prelude, which includes the traits and types needed to use CubeCL.

use cubecl::prelude::*;
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor

#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
    for i in 0..input.shape(0) {
        let mut acc = F::new(0.0f32);
        for j in 0..input.shape(1) {
            acc += input[i * input.stride(0) + j];
        }
        output[i] = acc;
    }
}

Launching code

Once the kernel is defined, we can launch it using CubeCL's runtime. The following code creates a 3x3 input tensor and an output tensor, then launches the reduce_matrix kernel to perform the reduction. The kernel is launched with CubeCount::Static(1, 1, 1) and CubeDim::new(1, 1, 1), that is, a single cube containing a single unit, so the loops run sequentially on the GPU. The result is printed to the console. Note that this code uses the cubecl::wgpu::WgpuRuntime runtime, which is a CubeCL runtime for WebGPU. You can replace it with any other CubeCL runtime that you prefer.

use cubecl::prelude::*;
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor

#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
    for i in 0..input.shape(0) {
        let mut acc = F::new(0.0f32);
        for j in 0..input.shape(1) {
            acc += input[i * input.stride(0) + j];
        }
        output[i] = acc;
    }
}

pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
    let client = R::client(device);

    let input = GpuTensor::<R, F>::arange(vec![3, 3], &client);
    // One sum per row, so the output is a vector of length 3
    let output = GpuTensor::<R, F>::empty(vec![3], &client);

    unsafe {
        reduce_matrix::launch_unchecked::<F, R>(
            &client,
            CubeCount::Static(1, 1, 1),
            CubeDim::new(1, 1, 1),
            input.into_tensor_arg(1),
            output.into_tensor_arg(1),
        )
    };

    println!(
        "Executed reduction with runtime {:?} => {:?}",
        R::name(&client),
        output.read(&client)
    );
}

fn main() {
    launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
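
When experimenting with the launch code, it can help to compare the values read back from the GPU against a CPU reference. The sketch below is one way to do that in plain Rust. `expected_row_sums` and `all_close` are hypothetical helpers, and `gpu_output` is a hard-coded placeholder for the `Vec<f32>` that `output.read(&client)` would return.

```rust
/// Expected row sums for a `rows x cols` matrix filled with 0, 1, 2, ... in row-major order.
fn expected_row_sums(rows: usize, cols: usize) -> Vec<f32> {
    (0..rows)
        .map(|i| (0..cols).map(|j| (i * cols + j) as f32).sum::<f32>())
        .collect()
}

/// Hypothetical helper: element-wise comparison with an absolute tolerance.
fn all_close(actual: &[f32], expected: &[f32], tol: f32) -> bool {
    actual.len() == expected.len()
        && actual.iter().zip(expected).all(|(a, e)| (a - e).abs() <= tol)
}

fn main() {
    let expected = expected_row_sums(3, 3);
    assert_eq!(expected, vec![3.0, 12.0, 21.0]);

    // Placeholder for the values returned by `output.read(&client)`.
    let gpu_output = vec![3.0f32, 12.0, 21.0];
    assert!(all_close(&gpu_output, &expected, 1e-5));
    println!("GPU output matches the CPU reference");
}
```

A tolerance is used instead of exact equality because other floating-point types or a different summation order may round slightly differently.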
