Simple Reduction
To get started with CubeCL, we will implement a simple reduction operation on a multidimensional array (tensor). This example will help you understand the basic concepts of CubeCL and how to use it to perform parallel computations on tensors.
An example of a CPU reduction in Rust without CubeCL
This example demonstrates how to perform a simple reduction operation on a multidimensional array (tensor) using pure Rust. The code is designed to be easy to understand and serves as a starting point for more complex operations that can be parallelized with CubeCL. It is not optimized for performance, but it illustrates the basic concepts of working with tensors and performing reductions.
CpuTensor struct
Tensors are the basic data structure used in CubeCL to represent multidimensional arrays. Here is a simple implementation of a tensor in pure Rust. It is not optimized for performance, but it is easy to understand.
/// Example of a naive multidimensional tensor in pure Rust
#[derive(Debug, Clone)]
pub struct CpuTensor {
    /// Raw contiguous value buffer
    pub data: Vec<f32>,
    /// Number of elements to skip in `data` to advance by one along each dimension
    pub strides: Vec<usize>,
    /// Size of each dimension of the tensor
    pub shape: Vec<usize>,
}
/// Compute the strides of a compact (row-major) layout for the given shape
fn compact_strides(shape: &[usize]) -> Vec<usize> {
    let rank = shape.len();
    let mut strides = vec![1; rank];
    // `saturating_sub` avoids an underflow when the shape is empty (rank 0)
    for i in (0..rank.saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}
impl CpuTensor {
    /// Create a CpuTensor of the given shape filled with 0, 1, 2, ... in order
    pub fn arange(shape: Vec<usize>) -> Self {
        let size: usize = shape.iter().product();
        let data = (0..size).map(|i| i as f32).collect();
        let strides = compact_strides(&shape);
        Self {
            data,
            strides,
            shape,
        }
    }

    /// Create a zero-filled CpuTensor of the given shape
    pub fn empty(shape: Vec<usize>) -> Self {
        let size: usize = shape.iter().product();
        let data = vec![0.0; size];
        let strides = compact_strides(&shape);
        Self {
            data,
            strides,
            shape,
        }
    }

    /// Consume the tensor and return the inner data
    pub fn read(self) -> Vec<f32> {
        self.data
    }
}
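To make the role of the strides concrete, the following minimal sketch recomputes the compact strides of a shape and uses them to turn a multidimensional index into an offset in the flat buffer. The helper flat_offset is a hypothetical name introduced only for this illustration; it is not part of CubeCL.

```rust
/// Compact (row-major) strides for a shape, as in the listing above.
fn compact_strides(shape: &[usize]) -> Vec<usize> {
    let rank = shape.len();
    let mut strides = vec![1; rank];
    // Walk from the second-to-last dimension towards the first.
    for i in (0..rank.saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Hypothetical helper: offset of a multidimensional index in the flat buffer.
fn flat_offset(index: &[usize], strides: &[usize]) -> usize {
    index.iter().zip(strides).map(|(i, s)| i * s).sum()
}

fn main() {
    let shape = [2, 3, 4];
    let strides = compact_strides(&shape);
    println!("strides = {:?}", strides); // strides = [12, 4, 1]
    // 1 * 12 + 2 * 4 + 3 * 1 = 23
    println!("offset = {}", flat_offset(&[1, 2, 3], &strides)); // offset = 23
}
```

In a compact layout the last stride is always 1, which is why the reduction in the next section can add the column index directly to the row offset.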
Reduce function
The following function is a naive implementation of a reduction operation on a matrix. It sums the values of each row and stores the result in a new tensor. The input tensor is expected to be a 2D matrix, and the output tensor will be a 1D vector containing the sum of each row.
use cubecl_example::cpu_tensor::CpuTensor; // Change to the path of your own module containing the CpuTensor

/// Reduce the last dimension of a 2D matrix by summing each row:
/// [0 1 2]    [0 + 1 + 2]    [ 3]
/// [3 4 5] -> [3 + 4 + 5] -> [12]
/// [6 7 8]    [6 + 7 + 8]    [21]
fn reduce_matrix(input: &CpuTensor, output: &mut CpuTensor) {
    for i in 0..input.shape[0] {
        let mut acc = 0.0f32;
        for j in 0..input.shape[1] {
            acc += input.data[i * input.strides[0] + j];
        }
        output.data[i] = acc;
    }
}
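The inner loop above adds j directly to the row offset, which is only valid when the last dimension is contiguous (stride 1), as it is for tensors built with compact_strides. As a hedged sketch, the variant below multiplies each index by its stride, so it also works on a non-compact layout such as a transposed view of the same buffer. It operates on plain slices rather than CpuTensor to stay self-contained, and reduce_rows_strided is a hypothetical name.

```rust
/// Sum over the last dimension of a 2D view described by `shape` and `strides`.
fn reduce_rows_strided(data: &[f32], shape: [usize; 2], strides: [usize; 2]) -> Vec<f32> {
    let mut output = vec![0.0f32; shape[0]];
    for i in 0..shape[0] {
        let mut acc = 0.0f32;
        for j in 0..shape[1] {
            // Use both strides instead of assuming the last one is 1.
            acc += data[i * strides[0] + j * strides[1]];
        }
        output[i] = acc;
    }
    output
}

fn main() {
    let data: Vec<f32> = (0..9).map(|i| i as f32).collect();
    // Compact layout: sums each row of the 3x3 matrix.
    let rows = reduce_rows_strided(&data, [3, 3], [3, 1]);
    // Swapped strides: the same buffer viewed as its transpose,
    // so this sums each column of the original matrix.
    let cols = reduce_rows_strided(&data, [3, 3], [1, 3]);
    println!("{:?}", rows); // [3.0, 12.0, 21.0]
    println!("{:?}", cols); // [9.0, 12.0, 15.0]
}
```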
Launching code
The following code creates a 3x3 matrix, initializes the input tensor, and calls the reduce_matrix function to perform the reduction. The result is printed to the console.
use cubecl_example::cpu_tensor::CpuTensor; // Change to the path of your own module containing the CpuTensor
fn launch() {
    let input_shape = vec![3, 3];
    let output_shape = vec![3];
    let input = CpuTensor::arange(input_shape);
    let mut output = CpuTensor::empty(output_shape);

    reduce_matrix(&input, &mut output);

    println!("Executed reduction => {:?}", output.read());
}

fn main() {
    launch();
}
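Each row's sum is independent of the others, which is what makes this reduction a candidate for parallel execution. As a CPU-only sketch of that idea, the following computes one row per scoped thread using only the standard library. It does not use CubeCL, reduce_rows_parallel is a hypothetical name, and it assumes a compact row-major buffer with at least one column.

```rust
use std::thread;

/// Sum each row of a compact row-major matrix, one scoped thread per row.
/// Assumes `cols > 0` and `data.len() == rows * cols`.
fn reduce_rows_parallel(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut output = vec![0.0f32; rows];
    thread::scope(|scope| {
        // Pair each input row with the single output slot it writes to,
        // so no two threads ever touch the same element.
        for (row, out) in data.chunks_exact(cols).zip(output.iter_mut()) {
            scope.spawn(move || {
                *out = row.iter().sum();
            });
        }
        // All spawned threads are joined when the scope ends.
    });
    output
}

fn main() {
    let data: Vec<f32> = (0..9).map(|i| i as f32).collect();
    let sums = reduce_rows_parallel(&data, 3, 3);
    println!("Executed parallel reduction => {:?}", sums); // [3.0, 12.0, 21.0]
}
```

Spawning a thread per row is far too costly for a matrix this small; the point is only that the rows can be processed without any coordination between workers.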
A first example of a GPU reduction with CubeCL
This example demonstrates how to perform a simple reduction operation on a multidimensional array (tensor) using CubeCL. It is a simple implementation that will be used as a starting point to show how to use CubeCL in the next chapters.
GpuTensor struct
The GpuTensor struct is a representation of a tensor that resides on the GPU. It contains the data handle, shape, strides, and marker types for the runtime and floating-point type. It provides methods to create tensors, read data back from the GPU, and convert a tensor into a tensor argument for kernel execution. It is generic over the runtime and the floating-point type, allowing it to work with different CubeCL runtimes and floating-point types (e.g., f16, f32). The strides are computed with the compact_strides function from the cubecl::std::tensor module, which returns the strides of a compact layout for a given shape.
Another important concept is the ComputeClient, the interface through which kernels are run on a runtime. Each runtime provides its own ComputeClient, parameterized by that runtime's Server and Channel types, with methods to allocate buffers and read data back from the GPU. The ComputeClient sends compute tasks to a Server, which schedules them and runs the kernels on the GPU.
use std::marker::PhantomData;
use cubecl::{prelude::*, server::Handle, std::tensor::compact_strides};
/// Simple GpuTensor
#[derive(Debug)]
pub struct GpuTensor<R: Runtime, F: Float + CubeElement> {
    data: Handle,
    shape: Vec<usize>,
    strides: Vec<usize>,
    _r: PhantomData<R>,
    _f: PhantomData<F>,
}

impl<R: Runtime, F: Float + CubeElement> Clone for GpuTensor<R, F> {
    fn clone(&self) -> Self {
        Self {
            data: self.data.clone(), // Handle is a pointer to the data, so cloning it is cheap
            shape: self.shape.clone(),
            strides: self.strides.clone(),
            _r: PhantomData,
            _f: PhantomData,
        }
    }
}
impl<R: Runtime, F: Float + CubeElement> GpuTensor<R, F> {
    /// Create a GpuTensor of the given shape filled with 0, 1, 2, ... in order
    pub fn arange(shape: Vec<usize>, client: &ComputeClient<R::Server, R::Channel>) -> Self {
        let size = shape.iter().product();
        let data: Vec<F> = (0..size).map(|i| F::from_int(i as i64)).collect();
        let data = client.create(F::as_bytes(&data));
        let strides = compact_strides(&shape);
        Self {
            data,
            shape,
            strides,
            _r: PhantomData,
            _f: PhantomData,
        }
    }

    /// Create an empty (uninitialized) GpuTensor of the given shape
    pub fn empty(shape: Vec<usize>, client: &ComputeClient<R::Server, R::Channel>) -> Self {
        // `client.empty` takes a size in bytes, not a number of elements
        let size = shape.iter().product::<usize>() * core::mem::size_of::<F>();
        let data = client.empty(size);
        let strides = compact_strides(&shape);
        Self {
            data,
            shape,
            strides,
            _r: PhantomData,
            _f: PhantomData,
        }
    }

    /// Create a TensorArg to pass to a kernel
    pub fn into_tensor_arg(&self, line_size: u8) -> TensorArg<'_, R> {
        // Safety: the handle, strides and shape all describe the same buffer of `F`
        unsafe {
            TensorArg::from_raw_parts::<F>(&self.data, &self.strides, &self.shape, line_size)
        }
    }

    /// Read the data back from the GPU through the client
    pub fn read(self, client: &ComputeClient<R::Server, R::Channel>) -> Vec<F> {
        let bytes = client.read_one(self.data.binding());
        F::from_bytes(&bytes).to_vec()
    }
}
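The client works with raw bytes: arange uploads the byte representation of the values, and read reinterprets the returned bytes as elements, so a buffer holding n elements of type F occupies n * size_of::<F>() bytes. The following standard-library sketch mimics that round trip for f32 without a GPU. The functions to_bytes and from_bytes are hypothetical stand-ins written for this illustration; they are not the CubeCL methods used above.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the upload step: flatten f32 values into native-endian bytes.
fn to_bytes(values: &[f32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_ne_bytes()).collect()
}

/// Hypothetical stand-in for the read step: rebuild f32 values from raw bytes.
fn from_bytes(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(size_of::<f32>())
        .map(|c| f32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    let shape = [3usize, 3];
    let len: usize = shape.iter().product();
    let values: Vec<f32> = (0..len).map(|i| i as f32).collect();

    let bytes = to_bytes(&values);
    // 9 elements of 4 bytes each.
    assert_eq!(bytes.len(), len * size_of::<f32>());
    // Reading the bytes back yields the original values.
    assert_eq!(from_bytes(&bytes), values);
    println!("{} elements occupy {} bytes", len, bytes.len()); // 9 elements occupy 36 bytes
}
```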
Reduce function
Compared to the previous example, this function is similar but uses CubeCL's cube macro to define the kernel. The kernel performs the same reduction operation, summing the values of each row and storing the result in a new tensor. The type parameter F is a generic type that implements the Float trait, allowing the function to work with different floating-point types (e.g., f32, f64). The Tensor type is provided by cubecl::prelude, which includes the necessary traits and types for using CubeCL.
use cubecl::prelude::*;
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor

#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
    for i in 0..input.shape(0) {
        let mut acc = F::new(0.0f32);
        for j in 0..input.shape(1) {
            acc += input[i * input.stride(0) + j];
        }
        output[i] = acc;
    }
}
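To illustrate what being generic over the floating-point type means without requiring a GPU, here is a plain-Rust analogue of the kernel body. SimpleFloat is a hypothetical minimal trait invented for this sketch; it is not CubeCL's Float trait, and the function works on slices rather than on CubeCL's Tensor.

```rust
use std::ops::AddAssign;

/// Hypothetical minimal float abstraction (not CubeCL's `Float`).
trait SimpleFloat: Copy + AddAssign {
    fn new(value: f32) -> Self;
}

impl SimpleFloat for f32 {
    fn new(value: f32) -> Self {
        value
    }
}

impl SimpleFloat for f64 {
    fn new(value: f32) -> Self {
        value as f64
    }
}

/// Same loop structure as the kernel, written against plain slices.
/// `stride0` is the number of elements between two consecutive rows.
fn reduce_matrix_generic<F: SimpleFloat>(
    input: &[F],
    shape: [usize; 2],
    stride0: usize,
    output: &mut [F],
) {
    for i in 0..shape[0] {
        let mut acc = F::new(0.0);
        for j in 0..shape[1] {
            acc += input[i * stride0 + j];
        }
        output[i] = acc;
    }
}

fn main() {
    // The same function body is instantiated once for f32 and once for f64.
    let input32: Vec<f32> = (0..9).map(|i| i as f32).collect();
    let mut out32 = vec![0.0f32; 3];
    reduce_matrix_generic(&input32, [3, 3], 3, &mut out32);

    let input64: Vec<f64> = (0..9).map(|i| i as f64).collect();
    let mut out64 = vec![0.0f64; 3];
    reduce_matrix_generic(&input64, [3, 3], 3, &mut out64);

    println!("{:?} {:?}", out32, out64); // [3.0, 12.0, 21.0] [3.0, 12.0, 21.0]
}
```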
Launching code
Once the kernel is defined, we can launch it using CubeCL's runtime. The following code creates a 3x3 matrix, initializes the input tensor, and calls the reduce_matrix function to perform the reduction. The result is printed to the console. Note that this code uses the cubecl::wgpu::WgpuRuntime runtime, which is a CubeCL runtime for WebGPU. You can replace it with any other CubeCL runtime that you prefer.
use cubecl::prelude::*;
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor
pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
    let client = R::client(device);

    let input = GpuTensor::<R, F>::arange(vec![3, 3], &client);
    // One sum per row, so the output is a 1D tensor with 3 elements
    let output = GpuTensor::<R, F>::empty(vec![3], &client);

    unsafe {
        reduce_matrix::launch_unchecked::<F, R>(
            &client,
            // A single cube with a single unit: the kernel runs serially
            CubeCount::Static(1, 1, 1),
            CubeDim::new(1, 1, 1),
            input.into_tensor_arg(1),
            output.into_tensor_arg(1),
        )
    };

    println!(
        "Executed reduction with runtime {:?} => {:?}",
        R::name(&client),
        output.read(&client)
    );
}

fn main() {
    launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}