Parallel Reduction

Before we dive into the code to implement parallel reduction, let's take a look at the constants that CubeCL provides to help us write efficient parallel reduction kernels.

CubeCL Constants

CubeCL is designed around - you guessed it - Cubes! More specifically, it's based on cuboids, because not all axes are necessarily the same size. Since all compute APIs need to map to the hardware, which is organized into tiles addressable with a 3D representation, our topology maps easily onto concepts from other APIs.


A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions along the x, y, and z axes. Similarly, a hyper-cube is composed of cubes, just as a cube is composed of units. Each cube in the hyper-cube can be accessed by its position relative to the hyper-cube along the x, y, and z axes. Hence, a hyper-cube of 3x3x3 will have 27 cubes. In this example, the total number of working units would be 27 x 27 = 729.
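For instance, the 3x3x3 example above maps directly onto the CubeCount and CubeDim launch types used later in this chapter. The snippet below is only a sizing sketch: it belongs inside a launch function and assumes the cubecl prelude is in scope.

// A hyper-cube of 3x3x3: 27 cubes in the launch.
let cube_count = CubeCount::Static(3, 3, 3);
// Each cube is 3x3x3: 27 units per cube.
let cube_dim = CubeDim::new(3, 3, 3);
// Total working units: 27 cubes * 27 units = 729.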

Topology Equivalence

Since all topology variables are constant within the kernel entry point, we chose to use the Rust constant syntax with capital letters. When creating kernels, we often don't care about the position of a unit within the cube along each individual axis; we only care about its overall position. Therefore, each kind of variable also has an axis-independent variant, which is often absent in other languages, except for WebGPU's local_invocation_index.


CubeCL          CUDA          WebGPU
CUBE_COUNT      N/A           N/A
CUBE_COUNT_X    gridDim.x     num_workgroups.x
CUBE_COUNT_Y    gridDim.y     num_workgroups.y
CUBE_COUNT_Z    gridDim.z     num_workgroups.z
CUBE_POS        N/A           N/A
CUBE_POS_X      blockIdx.x    workgroup.x
CUBE_POS_Y      blockIdx.y    workgroup.y
CUBE_POS_Z      blockIdx.z    workgroup.z
CUBE_DIM        N/A           N/A
CUBE_DIM_X      blockDim.x    workgroup_size.x
CUBE_DIM_Y      blockDim.y    workgroup_size.y
CUBE_DIM_Z      blockDim.z    workgroup_size.z
UNIT_POS        N/A           local_invocation_index
UNIT_POS_X      threadIdx.x   local_invocation_id.x
UNIT_POS_Y      threadIdx.y   local_invocation_id.y
UNIT_POS_Z      threadIdx.z   local_invocation_id.z
PLANE_DIM       warpSize      subgroup_size
ABSOLUTE_POS    N/A           N/A
ABSOLUTE_POS_X  N/A           global_id.x
ABSOLUTE_POS_Y  N/A           global_id.y
ABSOLUTE_POS_Z  N/A           global_id.z
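
To make the relationship between the per-axis and axis-independent variables concrete, here is a small illustrative kernel sketch (the name flat_index_demo is ours and is not part of the reduction example) that computes a global index by hand:

#[cube(launch_unchecked)]
fn flat_index_demo<F: Float>(output: &mut Tensor<F>) {
    // The per-axis variables compose into a global index along x.
    let global_x = CUBE_POS_X * CUBE_DIM_X + UNIT_POS_X;
    // For a 1D launch this matches the axis-independent ABSOLUTE_POS,
    // so every invocation writes to its own slot.
    output[global_x] = F::new(1.0f32);
}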

Parallel Reduction Example

Recalling the previous example, we will now implement a parallel reduction with CubeCL. The goal is to reduce a 2D matrix into a 1D vector by summing the elements of each row. Where can we parallelize this operation? We can parallelize over the rows, letting each invocation compute the sum of one row independently. To do so, we change the launch parameters, setting the CubeDim to launch multiple invocations in parallel, remove the outer loop, and use UNIT_POS_X to select the row of the input tensor. Note that data races are a real concern when parallelizing operations: since the invocations run in parallel, each one must write to a different position in the output tensor.

use std::marker::PhantomData;

use cubecl::benchmark::{Benchmark, TimingMethod};
use cubecl::{future, prelude::*};
use cubecl_example::gpu_tensor::GpuTensor; // Change to the path of your own module containing the GpuTensor

pub struct ReductionBench<R: Runtime, F: Float + CubeElement> {
    input_shape: Vec<usize>,
    client: ComputeClient<R::Server, R::Channel>,
    _f: PhantomData<F>,
}

impl<R: Runtime, F: Float + CubeElement> Benchmark for ReductionBench<R, F> {
    type Input = GpuTensor<R, F>;
    type Output = GpuTensor<R, F>;

    fn prepare(&self) -> Self::Input {
        GpuTensor::<R, F>::arange(self.input_shape.clone(), &self.client)
    }

    fn name(&self) -> String {
        format!("{}-reduction-{:?}", R::name(&self.client), self.input_shape).to_lowercase()
    }

    fn sync(&self) {
        future::block_on(self.client.sync())
    }

    fn execute(&self, input: Self::Input) -> Self::Output {
        let output_shape: Vec<usize> = vec![self.input_shape[0]];
        let output = GpuTensor::<R, F>::empty(output_shape, &self.client);

        unsafe {
            reduce_matrix::launch_unchecked::<F, R>(
                &self.client,
                CubeCount::Static(1, 1, 1),
                CubeDim::new(self.input_shape[0] as u32, 1, 1), // Add parallelization on the first dimension
                input.into_tensor_arg(1),
                output.into_tensor_arg(1),
            );
        }

        output
    }
}

#[cube(launch_unchecked)]
fn reduce_matrix<F: Float>(input: &Tensor<F>, output: &mut Tensor<F>) {
    // Each invocation reduces one row: UNIT_POS_X selects the row it owns.
    let mut acc = F::new(0.0f32);
    for i in 0..input.shape(1) {
        acc += input[UNIT_POS_X * input.stride(0) + i];
    }
    // Every invocation writes to a distinct output slot, so there is no data race.
    output[UNIT_POS_X] = acc;
}

pub fn launch<R: Runtime, F: Float + CubeElement>(device: &R::Device) {
    let client = R::client(&device);

    let bench1 = ReductionBench::<R, F> {
        input_shape: vec![512, 8 * 1024],
        client: client.clone(),
        _f: PhantomData,
    };
    let bench2 = ReductionBench::<R, F> {
        input_shape: vec![128, 32 * 1024],
        client: client.clone(),
        _f: PhantomData,
    };

    for bench in [bench1, bench2] {
        println!("{}", bench.name());
        println!("{}", bench.run(TimingMethod::System));
    }
}

fn main() {
    launch::<cubecl::wgpu::WgpuRuntime, f32>(&Default::default());
}

The Results

Now that we have improved parallelism, the kernel is up to 70x faster for the first shape and up to 25x faster for the second. Why is the first shape faster than the second? Even though the two shapes contain the same number of elements, the first has more rows, so more invocations can compute the reduction in parallel. The second shape has fewer rows, so fewer invocations are available and each one does more sequential work.
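Concretely, since the launch uses one invocation per row:

[512, 8192]:  512 x 8192  = 4,194,304 elements, 512 parallel invocations, each summing 8,192 values
[128, 32768]: 128 x 32768 = 4,194,304 elements, 128 parallel invocations, each summing 32,768 values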

wgpu<wgsl>-reduction-[512, 8192]

―――――――― Result ―――――――――
  Timing      system
  Samples     10
  Mean        3.369ms
  Variance    48.000ns
  Median      3.321ms
  Min         3.177ms
  Max         4.011ms
―――――――――――――――――――――――――
wgpu<wgsl>-reduction-[128, 32768]

―――――――― Result ―――――――――
  Timing      system
  Samples     10
  Mean        8.713ms
  Variance    5.069µs
  Median      8.507ms
  Min         5.963ms
  Max         12.301ms
―――――――――――――――――――――――――