Compute Shaders
Caution
This guide is made for LÖVE 12.0!
Compute shaders are a more general way to do calculations on the GPU than vertex and fragment shaders, which are a lot more constrained.
To edit data outside of the shader, we can use SSBOs (shader storage buffer objects) and textures. Textures need to be marked as writable from compute shaders with the `computewrite` setting, but we'll get to that.
SSBOs are a way to store large amounts of data. A compute shader can both read from and write to these buffers.
One issue we'll sometimes run into when writing compute shaders is conflicting memory reads and writes. This happens because we don't control the order in which threads access data: one thread may read a value before another thread has finished writing it.
Buffer types
Let's go over the different buffer usage types. These can be combined if need be.
- `shaderstorage`: allows the buffer to be read from and written to in a shader (we can read it in vertex, fragment and compute shaders, but only write to it in compute shaders). One thing to keep in mind is the 4-byte memory alignment.
- `vertex`: allows the buffer to be used as input to a vertex shader; this is what is used when creating a new mesh. Memory alignment isn't as strict, which allows for more compact data storage. If we combine this with `shaderstorage`, the 4-byte alignment is forced again, which is something that needs to be accounted for.
- `index`: allows the buffer to be used as an index buffer, which stores the indices we send with the `setVertexMap` method of meshes. The format can only be `uint16` or `uint32`.
- `indirectarguments`: allows the buffer to be used as the parameters of a draw command, effectively allowing the GPU to generate its own work without needing readbacks.
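Since index buffers only support `uint16` and `uint32`, the choice usually comes down to how many vertices the mesh has. Here is a minimal plain-Lua sketch, using a hypothetical `pickIndexFormat` helper that is not part of LÖVE: a `uint16` index can address at most 65536 vertices (indices 0 to 65535), so anything larger needs `uint32`.

```lua
-- Hypothetical helper (not a LÖVE function): picks the smallest index
-- format that can address every vertex of a mesh.
local function pickIndexFormat(vertexCount)
    if vertexCount <= 65536 then
        -- uint16 indices range from 0 to 65535.
        return "uint16"
    elseif vertexCount <= 4294967296 then
        -- uint32 indices range from 0 to 4294967295.
        return "uint32"
    end
    error("too many vertices for a 32-bit index buffer")
end

print(pickIndexFormat(4))       -- a quad: uint16
print(pickIndexFormat(1000000)) -- one million vertices: uint32
```

Using the smaller format halves the memory the index buffer needs, so `uint32` is only worth it once the vertex count requires it.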
When declaring `buffer` blocks in GLSL, LÖVE automatically adds the `std430` layout qualifier, which allows for tighter packing of data, so we don't have to add it ourselves.
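Because of `std430`, each member of a `buffer` block has an alignment, and the data we upload from Lua has to line up with the byte offsets GLSL expects. Below is a minimal plain-Lua sketch of the std430 rules, restricted to float scalars and vectors (no arrays or nested structs); the `std430Layout` helper is hypothetical. It shows that the `Particle` struct used later packs without any padding, and that a `vec3` is aligned like a `vec4`:

```lua
-- Sketch of std430 alignment rules for float-based members only.
-- Assumption: no arrays, matrices or nested structs.
local FLOAT = 4 -- bytes
local rules = {
    float = { align = 1 * FLOAT, size = 1 * FLOAT },
    vec2  = { align = 2 * FLOAT, size = 2 * FLOAT },
    vec3  = { align = 4 * FLOAT, size = 3 * FLOAT }, -- aligned like a vec4
    vec4  = { align = 4 * FLOAT, size = 4 * FLOAT },
}

local function roundUp(value, multiple)
    return math.ceil(value / multiple) * multiple
end

-- Returns the byte offset of each member and the array stride of the struct.
-- In std430 the struct alignment is the largest member alignment.
local function std430Layout(members)
    local offset, structAlign = 0, 0
    local offsets = {}
    for i, member in ipairs(members) do
        local rule = assert(rules[member.type], "unsupported type " .. tostring(member.type))
        offset = roundUp(offset, rule.align)
        offsets[i] = offset
        offset = offset + rule.size
        structAlign = math.max(structAlign, rule.align)
    end
    return offsets, roundUp(offset, structAlign)
end

-- The Particle struct: offsets 0, 8 and 16, with a stride of 32 bytes.
local offsets, stride = std430Layout({
    { name = "Position", type = "vec2" },
    { name = "Velocity", type = "vec2" },
    { name = "Color",    type = "vec4" },
})
print(table.concat(offsets, ", "), stride)
```

A `vec3` followed by a `float` still fits in 16 bytes, but two `vec3`s in a row leave a 4-byte gap between them; ordering members by size or using `vec4` avoids surprises.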
Thread groups
Compute shaders are executed in three-dimensional thread groups. Each group has a fixed number of threads, which we define in the compute shader with `layout(local_size_x = X, local_size_y = Y, local_size_z = Z) in;`, as we'll see later.
A local size of 64 threads in total per thread group (for example 64×1×1 or 8×8×1) is usually a good choice. Threads within a thread group can communicate with each other through `shared` variables.
They can also be synchronized, meaning every thread has to reach the same point in execution before any of them can continue, though this should be done sparingly.
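The local size also decides how many thread groups we need to dispatch. As a minimal plain-Lua sketch (the helper names are hypothetical, not part of LÖVE), the group count in each dimension is the amount of work divided by the local size, rounded up, which can leave some idle threads in the last group:

```lua
-- Hypothetical helpers (not part of LÖVE) for sizing a dispatch.
-- Number of thread groups needed in one dimension so that
-- groups * localSize >= items.
local function threadgroupCount(items, localSize)
    return math.ceil(items / localSize)
end

-- Threads launched in that dimension that have no item to work on.
local function idleThreads(items, localSize)
    return threadgroupCount(items, localSize) * localSize - items
end

-- 1000 particles with a 64x1x1 local size: 16 groups, 1024 threads, 24 idle.
print(threadgroupCount(1000, 64), idleThreads(1000, 64))
-- An 800x600 image with an 8x8x1 local size: 100 x 75 groups.
print(threadgroupCount(800, 8), threadgroupCount(600, 8))
```

Those idle threads are why the particle shader below checks `index >= ParticleCount` before touching the buffer.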
Built-in variables
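- The local position within the thread group is stored in the `gl_LocalInvocationID` (`uvec3`) input variable.
- The position of the entire group is stored in the `gl_WorkGroupID` (`uvec3`) input variable.
- Finally, the global position is stored in the `gl_GlobalInvocationID` (`uvec3`) input variable. It is the group position multiplied by the local size, plus the local position in the group: `gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID`.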
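To make the relationship between these variables concrete, here is a plain-Lua mirror of the global-position formula (the function name is hypothetical, and vectors are Lua tables `{ x, y, z }`). With an 8×8×1 local size, thread (3, 5, 0) of group (2, 1, 0) ends up at global position (19, 13, 0):

```lua
-- Plain-Lua mirror of how gl_GlobalInvocationID is formed from
-- gl_WorkGroupID, gl_WorkGroupSize (the local size) and gl_LocalInvocationID.
-- Hypothetical helper; vectors are tables of the form { x, y, z }.
local function globalInvocationID(workGroupID, workGroupSize, localID)
    return {
        workGroupID[1] * workGroupSize[1] + localID[1],
        workGroupID[2] * workGroupSize[2] + localID[2],
        workGroupID[3] * workGroupSize[3] + localID[3],
    }
end

-- Thread (3, 5, 0) of group (2, 1, 0) with an 8x8x1 local size.
local id = globalInvocationID({ 2, 1, 0 }, { 8, 8, 1 }, { 3, 5, 0 })
print(id[1], id[2], id[3]) -- x = 19, y = 13, z = 0
```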
Particles
Let's start with a small compute shader that moves particles around the screen.
We have two shader files:
- `updateParticles.glsl` is the compute shader, which edits the particle data stored in our SSBO by moving the particles around.
- `drawParticles.glsl` contains the vertex and fragment shaders that draw the particles to the screen.
Finally, our `main.lua` file tells the GPU how to update our particles and where they initially spawn.
updateParticles.glsl
// A total local size of 64 threads is usually a good choice.
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
// Let's define a struct for our particles.
struct Particle {
    vec2 Position;
    vec2 Velocity;
    vec4 Color;
};
// The buffer will be called "Particles" when sending it from the CPU.
restrict buffer Particles {
    // An array of the `Particle` struct with an unknown size.
    Particle particles[];
};
uniform mediump float DeltaTime;
// highp, because a mediump uint is only guaranteed to hold 16 bits on platforms
// that honor precision qualifiers, which is too small for a million particles.
uniform highp uint ParticleCount;
// Min-X, min-Y, max-X, max-Y
uniform mediump vec4 WorldSize;
void computemain() {
    // Get the ID of this thread, which we'll use as the index of the particle to simulate.
    uint index = gl_GlobalInvocationID.x;
    // Since this compute shader has a group size bigger than 1 (which we should always use),
    // the particle count might not be evenly divisible by the group size,
    // causing us to launch a few extra threads that must not do anything.
    if (index >= ParticleCount)
        return;
    // Move the particle.
    particles[index].Position += particles[index].Velocity * DeltaTime;
    // Let's make the particles bounce around the screen.
    vec2 Position = particles[index].Position;
    if (Position.x < WorldSize[0]) particles[index].Velocity.x = abs(particles[index].Velocity.x);
    if (Position.x > WorldSize[2]) particles[index].Velocity.x = -abs(particles[index].Velocity.x);
    if (Position.y < WorldSize[1]) particles[index].Velocity.y = abs(particles[index].Velocity.y);
    if (Position.y > WorldSize[3]) particles[index].Velocity.y = -abs(particles[index].Velocity.y);
}
drawParticles.glsl
#pragma language glsl4
// Define our particles again.
struct Particle {
    vec2 Position;
    vec2 Velocity;
    vec4 Color;
};
// The restrict keyword allows the compiler to optimize buffer accesses better.
// readonly means we won't be writing to the buffer (which we want anyway, since it can be faster),
// and the compiler will report an error if this shader tries to write to it.
restrict readonly buffer Particles {
    Particle particles[];
};
#ifdef VERTEX
out vec4 vColor;
vec4 position(mat4 transform_projection, vec4 vertex_position) {
    gl_PointSize = 2.0;
    // One vertex per particle, so the vertex index is also the particle index.
    uint index = uint(love_VertexID);
    vColor = particles[index].Color;
    // Ignore the input vertex position and use the particle position instead.
    return transform_projection * vec4(particles[index].Position, 0.0, 1.0);
}
#endif
#ifdef PIXEL
in vec4 vColor;
vec4 effect(vec4 color, Image tex, vec2 texture_coords, vec2 screen_coords) {
    return vColor;
}
#endif
main.lua
local drawShader = love.graphics.newShader("drawParticles.glsl")
local particleShader = love.graphics.newComputeShader("updateParticles.glsl")
local particleFormat = {
    -- The names don't do anything here, but they make the format nicer to read.
    { name = "Position", format = "floatvec2" },
    { name = "Velocity", format = "floatvec2" },
    { name = "Color", format = "floatvec4" },
}
local particleCount = 1000000
local buffer = love.graphics.newBuffer(particleFormat, particleCount, { shaderstorage = true })
local worldSize = { 0, 0, love.graphics.getWidth(), love.graphics.getHeight() }
particleShader:send("WorldSize", worldSize)
particleShader:send("ParticleCount", particleCount)
particleShader:send("Particles", buffer)
drawShader:send("Particles", buffer)
-- FYI: if we want to update the particles from the CPU every frame, or make startup faster,
-- it's better to use ByteData than a table of tables.
local particles = {}
local width, height = love.graphics.getDimensions()
for i = 1, particleCount do
    -- Position, velocity, then RGBA color: the same order as particleFormat.
    table.insert(particles, {
        love.math.random(width), love.math.random(height),
        love.math.randomNormal(100), love.math.randomNormal(100),
        love.math.random(), love.math.random(), love.math.random(), love.math.random()
    })
end
buffer:setArrayData(particles)
-- Create a mesh with one vertex per particle, just so there is something to run the vertex shader on.
local format = { { name = 'VertexPosition', location = 0, format = 'float' } }
local mesh = love.graphics.newMesh(format, particleCount, 'points', 'static')
local function updateParticles(dt)
    -- Update the delta time.
    particleShader:send("DeltaTime", dt)
    -- Get the local thread group size and divide the number of particles by it,
    -- since every thread group updates that many particles.
    -- The local size is 64x1x1, so a single group is enough in the y and z dimensions.
    local localSizeX = particleShader:getLocalThreadgroupSize()
    local groupsX = math.ceil(particleCount / localSizeX)
    -- Use this function to dispatch the compute shader.
    love.graphics.dispatchThreadgroups(particleShader, groupsX, 1, 1)
end
function love.update(dt)
    updateParticles(dt)
end
function love.draw()
    love.graphics.setShader(drawShader)
    love.graphics.draw(mesh)
    love.graphics.setShader()
    love.graphics.print("Simulating " .. particleCount .. " Particles at " .. love.timer.getFPS() .. " FPS")
end
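When a compute shader misbehaves, it can help to check its logic on the CPU first. Below is a plain-Lua sketch that mirrors what one thread of `updateParticles.glsl` does to one particle. The `updateParticle` function and the `{ x, y, vx, vy }` particle table are hypothetical, and Lua tables are 1-based, so `WorldSize[0]` in GLSL becomes `worldSize[1]` here:

```lua
-- CPU mirror of one thread of updateParticles.glsl (hypothetical helper).
-- worldSize = { minX, minY, maxX, maxY }, the same order as the WorldSize uniform,
-- but 1-based: GLSL's WorldSize[0] is worldSize[1] here.
local function updateParticle(p, dt, worldSize)
    -- Move the particle.
    p.x = p.x + p.vx * dt
    p.y = p.y + p.vy * dt
    -- Bounce: point the velocity back inside the world on each axis.
    if p.x < worldSize[1] then p.vx = math.abs(p.vx) end
    if p.x > worldSize[3] then p.vx = -math.abs(p.vx) end
    if p.y < worldSize[2] then p.vy = math.abs(p.vy) end
    if p.y > worldSize[4] then p.vy = -math.abs(p.vy) end
end

-- A particle heading right near the edge of an 800x600 world.
local p = { x = 790, y = 300, vx = 100, vy = 0 }
updateParticle(p, 0.5, { 0, 0, 800, 600 })
print(p.x, p.vx) -- p.x is now 840 and p.vx is -100
```

Because only the velocity is flipped, a particle can end up slightly outside the bounds for a frame before heading back; clamping the position as well would prevent that.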
Average of pixels
`Image` and `image2D` are two different things: the first declares a 2D sampler (a read-only texture), the other a 2D image (a read/write texture).
This compute shader takes an image whose width and height are multiples of 8, calculates the average color of each 8×8 block of 64 pixels, and writes that average to every pixel of the same block in another image.
AveragingShader.glsl
// 8*8*1 = 64 threads
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;
// Input texture
uniform Image InputImage;
// This line has way too many qualifiers :O
// Let's break it down!
// layout(rgba8): the pixel format of an image needs to be declared up front, which we do like so.
// uniform: this can be set from the CPU.
// mediump: we want medium precision.
// restrict: allows the compiler to optimize read and write operations better.
// writeonly: tells the compiler we only want to write to this image.
layout(rgba8) uniform mediump restrict writeonly image2D OutputImage;
// Our first shared variables: every thread within the thread group can read and write to them!
shared vec4[8][8] Colors;
shared vec4 Average;
void computemain() {
    ivec2 position = ivec2(gl_GlobalInvocationID.xy);
    ivec2 size = textureSize(InputImage, 0);
    // Valid texel coordinates go from 0 to size - 1, hence >= rather than >.
    // Returning early is only safe because the image size is a multiple of 8, so a whole
    // thread group is either inside or outside the image: every thread in a group must
    // reach the same barrier() calls.
    if (position.x >= size.x || position.y >= size.y)
        return;
    // Fetch the texel at the desired position from mip level 0.
    vec4 CurrentColor = texelFetch(InputImage, position, 0);
    Colors[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = CurrentColor;
    // If we tried to calculate the average right now, some threads might not have stored their color yet,
    // and we'd be reading undefined values (shared variables aren't initialized to a default value).
    // To sync our threads (WITHIN THE LOCAL THREAD GROUP) we use barrier(), which makes every thread
    // wait until all threads in the group have reached it.
    // GLSL also has memory barrier functions, which order memory accesses but don't make threads wait:
    /*
    groupMemoryBarrier
    memoryBarrier
    memoryBarrierAtomicCounter
    memoryBarrierBuffer
    memoryBarrierImage
    memoryBarrierShared
    */
    memoryBarrierShared();
    barrier();
    // Let's have the first thread compute the average.
    if (gl_LocalInvocationID.x == 0u && gl_LocalInvocationID.y == 0u)
    {
        vec4 sum = vec4(0.0);
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < 8; y++)
                sum += Colors[x][y];
        Average = sum * (1.0 / 64.0);
    }
    // Wait for the first thread to compute the average.
    memoryBarrierShared();
    barrier();
    imageStore(OutputImage, position, Average);
}
main.lua
local shader = love.graphics.newComputeShader("AveragingShader.glsl")
local img = love.graphics.newImage("YourImage.png")
local width, height = img:getDimensions()
-- The output must be compute-writable, and its format must match layout(rgba8) in the shader.
local averaged = love.graphics.newTexture(width, height, { format = "rgba8", computewrite = true })
shader:send("InputImage", img)
shader:send("OutputImage", averaged)
-- One 8x8 thread group per 8x8 block of pixels in the input image.
love.graphics.dispatchThreadgroups(shader, math.ceil(width / 8), math.ceil(height / 8), 1)
function love.draw()
    love.graphics.draw(averaged)
end
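The same result can be computed on the CPU as a reference, for example to verify the shader's output on a small test image. This plain-Lua sketch works on a single channel stored as `pixels[y][x]` (a hypothetical layout, not LÖVE's ImageData) and assumes the width and height are multiples of 8. Summing a block stands in for the first thread's loop over the shared `Colors` array, and writing the average to every pixel of the block mirrors each thread's `imageStore`:

```lua
-- CPU reference for AveragingShader.glsl, one color channel at a time.
-- Hypothetical layout: pixels[y][x] with 1-based x and y.
-- Assumes width and height are multiples of BLOCK.
local BLOCK = 8

local function blockAverage(pixels, width, height)
    local out = {}
    for y = 1, height do out[y] = {} end
    for by = 0, height / BLOCK - 1 do
        for bx = 0, width / BLOCK - 1 do
            -- Plays the role of the first thread summing the shared Colors array.
            local sum = 0
            for y = by * BLOCK + 1, by * BLOCK + BLOCK do
                for x = bx * BLOCK + 1, bx * BLOCK + BLOCK do
                    sum = sum + pixels[y][x]
                end
            end
            local average = sum / (BLOCK * BLOCK)
            -- Like imageStore in every thread: the whole block gets the average.
            for y = by * BLOCK + 1, by * BLOCK + BLOCK do
                for x = bx * BLOCK + 1, bx * BLOCK + BLOCK do
                    out[y][x] = average
                end
            end
        end
    end
    return out
end

-- A 16x8 test image: the left block is black, the right block is a 0/1 checkerboard.
local width, height = 16, 8
local pixels = {}
for y = 1, height do
    pixels[y] = {}
    for x = 1, width do
        pixels[y][x] = (x > 8 and (x + y) % 2 == 0) and 1 or 0
    end
end
local out = blockAverage(pixels, width, height)
print(out[1][1], out[1][16]) -- left block averages to 0, right block to 0.5
```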