Parallel Reduction
Before we dive into the code to implement parallel reduction, let's take a look at the constants that CubeCL provides to help us write efficient parallel reduction kernels.
CubeCL Constants
CubeCL is designed around - you guessed it - Cubes! More specifically, it's based on cuboids, because not all axes need to be the same size. Since all compute APIs need to map to the hardware, which consists of tiles that can be addressed with a 3D representation, our topology maps easily onto the concepts used by other APIs.
A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions along the x, y, and z axes. Similarly, a hyper-cube is composed of cubes, just as a cube is composed of units. Each cube in the hyper-cube can be accessed by its position relative to the hyper-cube along the x, y, and z axes. Hence, a hyper-cube of 3x3x3 will have 27 cubes. In this example, the total number of working units would be 27 x 27 = 729.
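To make this topology concrete, the hypothetical launch parameters below describe exactly that example: a 3x3x3 hyper-cube of 3x3x3 cubes. CubeCount and CubeDim are the same types used in the benchmark later in this chapter; this is only a sketch of how they relate to the counts above.

```rust
use cubecl::prelude::*;

fn main() {
    // A 3x3x3 hyper-cube of cubes: 27 cubes in total.
    let _cube_count = CubeCount::Static(3, 3, 3);
    // Each cube is 3x3x3 units: 27 units per cube.
    let _cube_dim = CubeDim::new(3, 3, 3);
    // Total working units: 27 cubes * 27 units = 729.
    println!("total units: {}", 27 * 27);
}
```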
Topology Equivalence
Since all topology variables are constant within the kernel entry point, we chose to use the Rust constant syntax with capital letters. When creating kernels, we often don't care about the relative position of a unit within a cube along each axis; we only care about its position overall. Therefore, each kind of variable also has an axis-independent counterpart, which is often not present in other languages, except WebGPU with local_invocation_index. The table below lists the equivalences, and a short example follows it.
| CubeCL | CUDA | WebGPU |
|---|---|---|
| CUBE_COUNT | N/A | N/A |
| CUBE_COUNT_X | gridDim.x | num_workgroups.x |
| CUBE_COUNT_Y | gridDim.y | num_workgroups.y |
| CUBE_COUNT_Z | gridDim.z | num_workgroups.z |
| CUBE_POS | N/A | N/A |
| CUBE_POS_X | blockIdx.x | workgroup.x |
| CUBE_POS_Y | blockIdx.y | workgroup.y |
| CUBE_POS_Z | blockIdx.z | workgroup.z |
| CUBE_DIM | N/A | N/A |
| CUBE_DIM_X | blockDim.x | workgroup_size.x |
| CUBE_DIM_Y | blockDim.y | workgroup_size.y |
| CUBE_DIM_Z | blockDim.z | workgroup_size.z |
| UNIT_POS | N/A | local_invocation_index |
| UNIT_POS_X | threadIdx.x | local_invocation_id.x |
| UNIT_POS_Y | threadIdx.y | local_invocation_id.y |
| UNIT_POS_Z | threadIdx.z | local_invocation_id.z |
| PLANE_DIM | warpSize | subgroup_size |
| ABSOLUTE_POS | N/A | N/A |
| ABSOLUTE_POS_X | N/A | global_id.x |
| ABSOLUTE_POS_Y | N/A | global_id.y |
| ABSOLUTE_POS_Z | N/A | global_id.z |
Parallel Reduction Example
Building on the previous example, we will now implement a parallel reduction using CubeCL. The goal is to reduce a 2D matrix into a 1D vector by summing the elements of each row. Where can we parallelize this operation? Each row can be reduced independently, so each invocation can compute the sum of one row on its own. We need to change the launch parameters, setting the CubeDim so that multiple invocations run in parallel, and then we can simply remove the outer loop and use UNIT_POS_X to select the row of the input tensor. Note that data races become a concern when parallelizing operations: because all invocations run in parallel, each one must write to a different position in the output tensor.
use std::marker::PhantomData;

use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor

pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
    input_shape: Vec<usize>,
    client: ComputeClient<R::Server, R::Channel>,
    _f: PhantomData<F>,
}

impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
    type Input = GpuTensor<R, F>;
    type Output = GpuTensor<R, F>;

    fn prepare(&self) -> Self::Input {
        GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
    }

    fn name(&self) -> String {
        format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
    }

    fn sync(&self) {
        future::block_on(self.client.sync())
    }

    fn execute(&self, input: Self::Input) -> Self::Output {
        // One output element per row of the input.
        let output_shape: Vec<usize> = vec![self.input_shape[0]];
        let output = GpuTensor::<R, F>::empty(output_shape, &self.client);

        unsafe {
            reduce_matrix::launch_unchecked::<F, R>(
                &self.client,
                CubeCount::Static(1, 1, 1),
                CubeDim::new(self.input_shape[0] as u32, 1, 1), // Add parallelization on the first dimension
                input.into_tensor_arg(1),
                output.into_tensor_arg(1),
            );
        }

        output
    }
}

#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
    // Each invocation sums one row, selected by its position along the x axis.
    let mut acc = F::new(0.0f32);
    for i in 0..input.shape(1) {
        acc += input[UNIT_POS_X * input.stride(0) + i];
    }
    // Every invocation writes to a distinct output index, so there is no data race.
    output[UNIT_POS_X] = acc;
}

pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
    let client = R::client(&device);

    let bench1 = ReductionBench::<R, F> {
        input_shape: vec![512, 8 * 1024],
        client: client.clone(),
        _f: PhantomData,
    };
    let bench2 = ReductionBench::<R, F> {
        input_shape: vec![128, 32 * 1024],
        client: client.clone(),
        _f: PhantomData,
    };

    for bench in [bench1, bench2] {
        println!("{}", bench.name());
        println!("{}", bench.run(TimingMethod::System));
    }
}

fn main() {
    launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}
The Results
Now that we have improved parallelism, the kernel is up to 70x faster for the first shape and up to 25x faster for the second. Why is the first shape faster than the second? Both shapes contain the same number of elements (512 x 8192 = 128 x 32768 = 4,194,304), but the first shape has more rows, so more invocations can work on the reduction in parallel: 512 invocations for the first shape versus only 128 for the second.
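As a quick sanity check (a standalone snippet, not part of the benchmark), the two shapes hold the same number of elements but expose different amounts of row-level parallelism, which also changes how many additions each invocation must perform:

```rust
fn main() {
    for shape in [[512usize, 8 * 1024], [128, 32 * 1024]] {
        let elements = shape[0] * shape[1]; // 4_194_304 in both cases
        println!(
            "{:?}: {} elements, {} parallel invocations, {} additions per invocation",
            shape, elements, shape[0], shape[1]
        );
    }
}
```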
wgpu<wgsl>-reduction-[512, 8192]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 3.369ms
Variance 48.000ns
Median 3.321ms
Min 3.177ms
Max 4.011ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[128, 32768]
―――――――― Result ―――――――――
Timing system
Samples 10
Mean 8.713ms
Variance 5.069µs
Median 8.507ms
Min 5.963ms
Max 12.301ms
―――――――――――――――――――――――――