Description
runComputeOnCpu is a function that simulates a GPU-like compute environment on the CPU. It organizes work into workgroups and invocations, similar to how compute shaders operate on GPUs.
Warning:
The thread pool size must be at least MaxConcurrentWorkGroups * (ceilDiv(workgroupSizeX * workgroupSizeY * workgroupSizeZ, SubgroupSize) + 1), that is, one thread per subgroup plus one extra thread for each concurrently running workgroup. Compile with -d:ThreadPoolSize=N, where N meets this requirement.
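The formula in the warning above can be checked by hand before choosing a value for -d:ThreadPoolSize. The following is a minimal sketch: the values given to SubgroupSize and MaxConcurrentWorkGroups are hypothetical placeholders rather than the emulator's actual settings, and requiredThreadPoolSize and the local ceilDiv helper are names introduced here for illustration only.

```nim
# Hypothetical values chosen for illustration; substitute the constants
# that your build of the emulator actually uses.
const
  SubgroupSize = 8
  MaxConcurrentWorkGroups = 2

func ceilDiv(a, b: int): int =
  ## Integer division rounded up, for positive operands.
  (a + b - 1) div b

func requiredThreadPoolSize(x, y, z: int): int =
  ## Minimum thread pool size for a workgroup of size (x, y, z),
  ## following the formula stated in the warning.
  let invocations = x * y * z
  MaxConcurrentWorkGroups * (ceilDiv(invocations, SubgroupSize) + 1)

echo requiredThreadPoolSize(256, 1, 1)  # 2 * (32 + 1) = 66
```

Under these assumed constants, a 256 x 1 x 1 workgroup would need the program to be compiled with at least -d:ThreadPoolSize=66.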
Warning:
Calling barrier() inside a conditional branch may lead to undefined behavior. The emulator models synchronization with a single barrier, which must be accessible from all threads within a workgroup.
Parameters
- numWorkGroups (UVec3): The number of workgroups in each dimension (x, y, z).
- workGroupSize (UVec3): The size of each workgroup in each dimension (x, y, z).
- compute (ThreadGenerator[A, B, C]): The compute shader procedure to execute.
- ssbo (A): Storage buffer object(s) containing the data to process.
- smem (B): Shared memory for each workgroup.
- args (C): Additional arguments passed to the compute shader.
Compute Function Signature
The compute shader procedure can be written in two ways:
- With shared memory:
```nim
proc computeFunction[A, B, C](
  buffers: A,     # Storage buffer (typically ptr T)
  shared: ptr B,  # Shared memory for workgroup-local data
  args: C         # Additional arguments
) {.computeShader.}
```
- Without shared memory:
```nim
proc computeFunction[A, C](
  buffers: A,  # Storage buffer (typically ptr T)
  args: C      # Additional arguments
) {.computeShader.}
```
Example
```nim
type
  Buffers = object
    input, output: seq[float32]
  Shared = seq[float32]
  Args = object
    factor: int32

proc myComputeShader(
    buffers: ptr Buffers,
    shared: ptr Shared,
    args: Args) {.computeShader.} =
  # Computation logic here

let numWorkGroups = uvec3(4, 1, 1)
let workGroupSize = uvec3(256, 1, 1)
var buffers: Buffers
let coarseFactor = 4'i32

runComputeOnCpu(
  numWorkGroups, workGroupSize,
  myComputeShader,
  addr buffers,
  newSeq[float32](workGroupSize.x),
  Args(factor: coarseFactor)
)
```
GLSL Built-in Variables
GLSL Variable | Type | Description |
---|---|---|
gl_WorkGroupID | UVec3 | ID of the current workgroup [0..gl_NumWorkGroups) |
gl_WorkGroupSize | UVec3 | Size of the workgroup (x, y, z) |
gl_NumWorkGroups | UVec3 | Total number of workgroups (x, y, z) |
gl_NumSubgroups | uint32 | Number of subgroups in the workgroup |
gl_SubgroupID | uint32 | ID of the current subgroup [0..gl_NumSubgroups) |
gl_GlobalInvocationID | UVec3 | Global ID of the current invocation [0..gl_NumWorkGroups * gl_WorkGroupSize) |
gl_LocalInvocationID | UVec3 | Local ID within the workgroup [0..gl_WorkGroupSize) |
gl_SubgroupSize | uint32 | Size of subgroups (constant across all subgroups) |
gl_SubgroupInvocationID | uint32 | ID of the invocation within the subgroup [0..gl_SubgroupSize) |
gl_SubgroupEqMask | UVec4 | Mask with bit set only at current invocation's index |
gl_SubgroupGeMask | UVec4 | Mask with bits set at and above current invocation's index |
gl_SubgroupGtMask | UVec4 | Mask with bits set above current invocation's index |
gl_SubgroupLeMask | UVec4 | Mask with bits set at and below current invocation's index |
gl_SubgroupLtMask | UVec4 | Mask with bits set below current invocation's index |
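The five subgroup masks in the table are all derived from the invocation's index within its subgroup. The sketch below models them on a single 32-bit word, which would correspond to only the x component of the UVec4. It assumes a hypothetical subgroup size of 8 and assumes that the "above" masks are limited to bits below the subgroup size; neither is confirmed for this emulator. The function names are illustrative and are not part of the module.

```nim
# Hypothetical subgroup size; this single-word sketch requires it to be <= 32.
const SubgroupSize = 8'u32

func lowBits(n: uint32): uint32 =
  ## Word with the lowest n bits set (n in 0..32).
  if n >= 32'u32: high(uint32) else: (1'u32 shl n) - 1'u32

func eqMask(id: uint32): uint32 = 1'u32 shl id           # only bit `id`
func ltMask(id: uint32): uint32 = lowBits(id)            # bits below `id`
func leMask(id: uint32): uint32 = lowBits(id + 1'u32)    # bits at and below `id`
func geMask(id: uint32): uint32 =
  ## Bits at and above `id`, limited to the subgroup (assumption).
  lowBits(SubgroupSize) and not ltMask(id)
func gtMask(id: uint32): uint32 =
  ## Bits above `id`, limited to the subgroup (assumption).
  lowBits(SubgroupSize) and not leMask(id)

echo geMask(3'u32)  # 248, i.e. 0b11111000
```

For any index, the Lt and Ge masks are complementary within the subgroup, as are Le and Gt, which is a quick sanity check when working with ballot results.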
CUDA to GLSL Translation Table
CUDA Concept | GLSL Equivalent | Description |
---|---|---|
blockDim | gl_WorkGroupSize | The size of a thread block (CUDA) or work group (GLSL) |
gridDim | gl_NumWorkGroups | The size of the grid (CUDA) or the number of work groups (GLSL) |
blockIdx | gl_WorkGroupID | The index of the current block (CUDA) or work group (GLSL) |
threadIdx | gl_LocalInvocationID | The index of the current thread within its block (CUDA) or work group (GLSL) |
blockIdx * blockDim + threadIdx | gl_GlobalInvocationID | The global index of the current thread (CUDA) or invocation (GLSL) |
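The last row of the table can be written out per component. This is a minimal sketch that uses a plain tuple, Dim3, as a hypothetical stand-in for UVec3 so that it compiles without the module; globalInvocationId is likewise a name introduced here for illustration.

```nim
type Dim3 = tuple[x, y, z: uint32]  # hypothetical stand-in for UVec3

func globalInvocationId(workGroupId, workGroupSize, localId: Dim3): Dim3 =
  ## Component-wise workGroupId * workGroupSize + localId, mirroring
  ## CUDA's blockIdx * blockDim + threadIdx.
  (x: workGroupId.x * workGroupSize.x + localId.x,
   y: workGroupId.y * workGroupSize.y + localId.y,
   z: workGroupId.z * workGroupSize.z + localId.z)

let gid = globalInvocationId(
  (x: 2'u32, y: 0'u32, z: 0'u32),    # third workgroup along x
  (x: 256'u32, y: 1'u32, z: 1'u32),  # 256 invocations per workgroup
  (x: 5'u32, y: 0'u32, z: 0'u32))    # sixth invocation in the workgroup
echo gid.x  # 2 * 256 + 5 = 517
```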
Types
ThreadGenerator[A; B; C] = proc (buffers: A; shared: ptr B; args: C): ThreadClosure {.nimcall.}
Templates
template runComputeOnCpu(numWorkGroups, workGroupSize: UVec3; compute, ssbo, smem, args: typed)
template runComputeOnCpu(numWorkGroups, workGroupSize: UVec3; compute, ssbo, args: typed)
Exports
- DVec4, dvec4, IVec3, bvec4, w=, BVec2, bvec2, uvec3, ivec3, z=, vec3, w, vec4, $, uvec4, dvec2, dvec3, TVec4, x=, TVec2, UVec3, uvec2, DVec3, BVec3, y=, ivec2, vec2, UVec2, ivec4, DVec2, bvec3, Vec3, IVec4, z, UVec4, TVec, TVec3, []=, Vec4, y, BVec4, x, Vec2, IVec2, [], optimizeReconvergePoints, computeShader, subgroupAny, gl_SubgroupGtMask, subgroupBallot, atomicAdd, atomicCompSwap, subgroupShuffleXor, subgroupAllEqual, subgroupBallotFindLSB, subgroupElect, gl_SubgroupLeMask, subgroupMemoryBarrier, subgroupAdd, gl_SubgroupLtMask, gl_SubgroupSize, atomicOr, subgroupMin, subgroupInverseBallot, atomicXor, atomicExchange, AtomicInt, subgroupExclusiveAdd, subgroupShuffleDown, subgroupBallotInclusiveBitCount, subgroupBallotBitCount, subgroupBroadcastFirst, atomicAnd, subgroupAll, subgroupBallotExclusiveBitCount, subgroupShuffle, subgroupMax, barrier, memoryBarrier, gl_SubgroupEqMask, subgroupInclusiveAdd, subgroupBallotFindMSB, subgroupBroadcast, subgroupBallotBitExtract, subgroupBarrier, subgroupShuffleUp, groupMemoryBarrier, gl_SubgroupGeMask, SubgroupSize