Compute Shaders

Caution

This guide is made for LÖVE 12.0!

Compute shaders are a more general way to do calculations on the GPU than vertex and fragment shaders, which are a lot more constrained.

To edit data outside of the shader, we can use SSBOs (shader storage buffer objects) and textures. Textures need to be marked as writable from compute shaders with the computewrite flag, but we'll get to that.

SSBOs are a way to store a large amount of data. A compute shader can perform read and write operations on these buffers.

One of the things we'll sometimes run into when writing compute shaders is issues with memory read and write operations. This is because we don't have complete control over when each thread accesses data, so two threads may read and write the same memory in an unpredictable order.

Buffer types

Let's go over some of the different types of buffers. These can be combined if need be.

When defining `buffer` blocks in GLSL, LÖVE automatically adds the std430 layout qualifier, which allows for tighter packing of data, so we don't have to add it ourselves.
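
To get a feel for what std430 packing means, here is a minimal sketch in plain Lua that computes member offsets for a struct made of float types. The `std430` table and the `layoutStruct` helper are hypothetical names made up for this illustration; they are not part of LÖVE, and only the float, vec2, vec3 and vec4 rules are covered.

```lua
-- std430 base alignment and size in bytes for a few float types.
local std430 = {
    float = { align = 4,  size = 4  },
    vec2  = { align = 8,  size = 8  },
    vec3  = { align = 16, size = 12 },
    vec4  = { align = 16, size = 16 },
}

-- Round `value` up to the next multiple of `alignment`.
local function alignUp(value, alignment)
    return math.ceil(value / alignment) * alignment
end

-- Hypothetical helper: returns the byte offset of every member and the
-- array stride of the struct. `members` lists type names in declaration order.
local function layoutStruct(members)
    local offsets, cursor, structAlign = {}, 0, 0
    for i, typeName in ipairs(members) do
        local info = assert(std430[typeName], "unsupported type: " .. typeName)
        cursor = alignUp(cursor, info.align)
        offsets[i] = cursor
        cursor = cursor + info.size
        structAlign = math.max(structAlign, info.align)
    end
    -- The struct itself is padded to a multiple of its largest member alignment.
    return offsets, alignUp(cursor, structAlign)
end

-- A struct with two vec2 members followed by a vec4:
-- the members sit at offsets 0, 8 and 16, and one element takes 32 bytes.
local offsets, stride = layoutStruct({ "vec2", "vec2", "vec4" })
print(table.concat(offsets, ", "), stride)
```

Note how a vec3 is aligned like a vec4: a float followed by a vec3 also ends up taking 32 bytes, which is why vec3 members are easy to get wrong when filling a buffer by hand.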

Thread groups

Compute shaders are executed in three-dimensional thread groups. Each group has a fixed number of threads, which we define later in the compute shader using local_size_x, local_size_y and local_size_z; the number of threads in a group is the product of the three. A local size amounting to 64 threads per thread group usually works well. Threads within a thread group can communicate with each other using shared variables. They can also be synced, meaning every thread in the group has to reach the same point in execution before any of them continues, though this should be done sparingly.

Built-in variables

The local position in the thread group is stored in the gl_LocalInvocationID uvec3.
The position of the entire group is stored in the gl_WorkGroupID uvec3.
And finally, the global position (group position * group size + local position in group) is stored in the gl_GlobalInvocationID uvec3 input variable.
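
The relationship between these three variables can be sketched on the CPU in plain Lua. The `globalInvocationID` function is a hypothetical helper written for this illustration, not a LÖVE or GLSL function; the IDs are zero-based, like in the shader.

```lua
-- Hypothetical helper: mirrors how GLSL derives gl_GlobalInvocationID
-- from gl_WorkGroupID, the local size and gl_LocalInvocationID.
local function globalInvocationID(workGroupID, workGroupSize, localInvocationID)
    local id = {}
    for axis = 1, 3 do
        id[axis] = workGroupID[axis] * workGroupSize[axis] + localInvocationID[axis]
    end
    return id
end

-- Corresponds to local_size_x = 64, local_size_y = 1, local_size_z = 1
local groupSize = { 64, 1, 1 }

-- Thread 5 of group 2 handles element 2 * 64 + 5 = 133.
local id = globalInvocationID({ 2, 0, 0 }, groupSize, { 5, 0, 0 })
print(id[1], id[2], id[3])
```

This is why a one-dimensional shader can use gl_GlobalInvocationID.x directly as an array index.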

Particles

Let's start with a small compute shader for moving particles around on the screen. We have two shader files:
updateParticles.glsl is the compute shader, which edits the particle data stored in our SSBO by moving the particles around.
drawParticles.glsl contains the vertex and fragment shaders, which draw the particles to the screen.
Finally, our main.lua file will tell the GPU how to update our particles and where they spawn initially.

updateParticles.glsl

// A final local size amounting to 64 is optimal.
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// Let's define a struct for our particles
struct Particle {
    vec2 Position;
    vec2 Velocity;
    vec4 Color;
};

// The buffer will be called "Particles" when sending it from the CPU
restrict buffer Particles {
    // An array of the `Particle` struct with an unknown size.
    Particle particles[];
};

uniform mediump float DeltaTime;
uniform mediump uint ParticleCount;

// Min-X, min-Y, max-X, max-Y
uniform mediump vec4 WorldSize;

void computemain() {
    // get the ID of this thread, which we'll use as the index of the particle to simulate.
    uint index = gl_GlobalInvocationID.x;

    // Since this compute shader has a group size bigger than 1 (which we should always use),
    // the particle count might not be evenly divisible by the group size,
    // causing us to launch a few extra threads that have no particle to simulate.
    if (index >= ParticleCount)
        return;

    // Move the particle
    particles[index].Position += particles[index].Velocity * DeltaTime;

    // Let's make the particles bounce around the screen.

    vec2 Position = particles[index].Position;

    if (Position.x < WorldSize[0]) particles[index].Velocity.x = abs(particles[index].Velocity.x);
    if (Position.x > WorldSize[2]) particles[index].Velocity.x = -abs(particles[index].Velocity.x);

    if (Position.y < WorldSize[1]) particles[index].Velocity.y = abs(particles[index].Velocity.y);
    if (Position.y > WorldSize[3]) particles[index].Velocity.y = -abs(particles[index].Velocity.y);
}

drawParticles.glsl

#pragma language glsl4
// Define our particles again
struct Particle {
    vec2 Position;
    vec2 Velocity;
    vec4 Color;
};

// The restrict keyword allows the compiler to optimize the buffer access better.
// Readonly means we won't be writing to the buffer (which we want anyway, since that's faster).
// But it will also cause an error if we don't use the buffer as readonly in the shader.
restrict readonly buffer Particles {
    Particle particles[];
};

#ifdef VERTEX
out vec4 vColor;
vec4 position(mat4 transform_projection, vec4 vertex_position) {
    gl_PointSize = 2.0;
    // love_VertexID is a signed int, so convert it explicitly.
    uint index = uint(love_VertexID);
    vColor = particles[index].Color;

    // Ignore the input vertex position and use the particle position instead.
    return transform_projection * vec4(particles[index].Position, 0.0, 1.0);
}
#endif
#ifdef PIXEL
in vec4 vColor;
vec4 effect(vec4 color, Image tex, vec2 texture_coords, vec2 screen_coords) {
    return vColor;
}
#endif

main.lua

local drawShader = love.graphics.newShader("drawParticles.glsl")

local particleShader = love.graphics.newComputeShader("updateParticles.glsl")

local particleFormat = {
    -- name doesn't do anything but it's nicer to read
    { name = "Position", format = "floatvec2" },
    { name = "Velocity", format = "floatvec2" },
    { name = "Color",    format = "floatvec4" },
}

local particleCount = 1000000
local buffer = love.graphics.newBuffer(particleFormat, particleCount, { shaderstorage = true })

local worldSize = { 0, 0, love.graphics.getWidth(), love.graphics.getHeight() }
particleShader:send("WorldSize", worldSize)
particleShader:send("ParticleCount", particleCount)
particleShader:send("Particles", buffer)
drawShader:send("Particles", buffer)

-- FYI, If we want to update particles from the cpu every frame, or make it start faster,
-- it's better to use ByteData.

local particles = {}
local width, height = love.graphics.getDimensions()

for i = 1, particleCount do
    table.insert(particles, {
        love.math.random(width), love.math.random(height),
        love.math.randomNormal(100), love.math.randomNormal(100),
        love.math.random(), love.math.random(), love.math.random(), love.math.random()
    })
end

buffer:setArrayData(particles)

-- Create a mesh to run the vertex shader
local format = { { name = 'VertexPosition', location = 0, format = 'float' } }
local mesh = love.graphics.newMesh(format, particleCount, 'points',
    'static')

local function updateParticles(dt)
    -- Update the delta time
    particleShader:send("DeltaTime", dt)

    -- Get the local thread group size along X and divide the amount of particles we have by it,
    -- since every thread group will edit that amount of particles.
    local localSizeX = particleShader:getLocalThreadgroupSize()
    local groupCountX = math.ceil(particleCount / localSizeX)

    -- Use this function to dispatch the compute shader.
    -- The particles are laid out along X only, so we need a single group along Y and Z.
    love.graphics.dispatchThreadgroups(particleShader, groupCountX, 1, 1)
end

function love.update(dt)
    updateParticles(dt)
end

function love.draw()
    love.graphics.setShader(drawShader)
    love.graphics.draw(mesh)
    love.graphics.setShader()
    love.graphics.print("Simulating " .. particleCount .. " Particles at " .. love.timer.getFPS() .. " FPS")
end
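
The rounding up in updateParticles is the reason the compute shader checks the index against ParticleCount. A minimal sketch in plain Lua, using a hypothetical `threadGroupsFor` helper that is not part of LÖVE, shows how many threads end up with nothing to do:

```lua
-- Hypothetical helper: how many thread groups are needed so that every item
-- gets a thread, and how many of the launched threads have no item to process.
local function threadGroupsFor(itemCount, localSize)
    local groups = math.ceil(itemCount / localSize)
    local idleThreads = groups * localSize - itemCount
    return groups, idleThreads
end

-- 1,000,000 particles divide evenly into groups of 64: 15625 groups, no idle threads.
print(threadGroupsFor(1000000, 64))

-- 1,000 particles need 16 groups (1024 threads), so 24 threads return early.
print(threadGroupsFor(1000, 64))
```

Without the bounds check in the shader, those extra threads would read and write past the end of the particle array.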

Average of pixels

`Image` and `image2D` are two different things: the first defines a 2D sampler (a read-only texture), the other a 2D image (a read/write texture).

This compute shader takes any image whose width and height are multiples of 8, calculates the average of the 64 pixels in each 8x8 block, and stores that average in another image.
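
When debugging a shader like this, it can help to have a CPU version to compare against. The following is a minimal sketch in plain Lua, not the method used by the guide: `blockAverage` is a hypothetical helper, it works on a single channel stored as `pixels[y][x]`, and it assumes the width and height are multiples of the block size.

```lua
-- Hypothetical CPU reference for the block average.
local function blockAverage(pixels, width, height, blockSize)
    local result = {}
    for y = 1, height do result[y] = {} end

    for blockY = 1, height, blockSize do
        for blockX = 1, width, blockSize do
            -- Sum every pixel in this block.
            local sum = 0
            for y = blockY, blockY + blockSize - 1 do
                for x = blockX, blockX + blockSize - 1 do
                    sum = sum + pixels[y][x]
                end
            end
            local average = sum / (blockSize * blockSize)

            -- Every pixel of the block receives the same average.
            for y = blockY, blockY + blockSize - 1 do
                for x = blockX, blockX + blockSize - 1 do
                    result[y][x] = average
                end
            end
        end
    end
    return result
end

-- A 4x2 image averaged in 2x2 blocks: the left block averages to 3, the right one to 15.
local averaged = blockAverage({ { 0, 2, 10, 10 }, { 4, 6, 10, 30 } }, 4, 2, 2)
print(averaged[1][1], averaged[2][4])
```

On the GPU, each block is handled by one thread group, and the two inner loops are replaced by the threads of that group running in parallel.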

AveragingShader.glsl

// 8*8*1 = 64 threads
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Input Texture
uniform Image InputImage;

// This line has way too many qualifiers :O
// Let's break it down!
// layout(rgba8): the format of an image needs to be declared beforehand, which we do like so,
// uniform: this can be set from the CPU,
// mediump: we want medium precision,
// restrict: allows the compiler to optimise read and write operations better,
// writeonly: tells the compiler we only want to write to this image
layout(rgba8) uniform mediump restrict writeonly image2D OutputImage;

// Our first shared variable, every thread within the thread group can read and write to this!
shared vec4[8][8] Colors;
shared vec4 Average;

void computemain() {
    ivec2 position = ivec2(gl_GlobalInvocationID.xy);

    ivec2 size = textureSize(InputImage, 0);

    // barrier() has to be reached by every thread in the thread group, so we can't return early here.
    // Instead, remember whether this thread is inside the image and skip the texture access if it isn't.
    bool inside = position.x < size.x && position.y < size.y;

    // Fetch the texel at the desired position and mip 0
    vec4 CurrentColor = inside ? texelFetch(InputImage, position, 0) : vec4(0.0);

    Colors[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = CurrentColor;

    // If we were to calculate the average right away, some threads might not have written their color yet
    // and we'd be reading garbage (shared variables aren't initialized to a default value).

    // barrier() makes every thread WITHIN THE LOCAL THREAD GROUP wait until all of them have reached it.
    // The memory barrier functions below don't make threads wait; they only control the order in which
    // memory writes become visible to other threads:
    /*
        groupMemoryBarrier
        memoryBarrier
        memoryBarrierAtomicCounter
        memoryBarrierBuffer
        memoryBarrierImage
        memoryBarrierShared
    */
    barrier();

    // Let's let the first thread compute the Average

    if (gl_LocalInvocationID.x == 0u && gl_LocalInvocationID.y == 0u)
    {
        vec4 sum = vec4(0.0);
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < 8; y++)
                sum += Colors[x][y];

        Average = sum * (1.0 / 64.0);
    }

    // Wait for the first thread to compute the average
    barrier();

    if (inside)
        imageStore(OutputImage, position, Average);
}

main.lua

local shader = love.graphics.newComputeShader("AveragingShader.glsl")

local img = love.graphics.newImage("YourImage.png")
local width, height = img:getDimensions()

-- The output texture has the same size as the input image
local blurred = love.graphics.newTexture(width, height, { computewrite = true })

shader:send("InputImage", img)
shader:send("OutputImage", blurred)

-- One thread group per 8x8 block of pixels
love.graphics.dispatchThreadgroups(shader, math.ceil(width / 8), math.ceil(height / 8), 1)

function love.draw()
    love.graphics.draw(blurred)
end