Description

runComputeOnCpu is a template that simulates a GPU-like compute environment on the CPU. It organizes work into workgroups and invocations, similar to how compute shaders operate on GPUs.

Warning: The thread pool size must be at least MaxConcurrentWorkGroups * (ceilDiv(workgroupSizeX * workgroupSizeY * workgroupSizeZ, SubgroupSize) + 1). Compile with: -d:ThreadPoolSize=N where N meets this requirement.
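
The formula above can be evaluated ahead of time to pick a value for N. The following is a minimal sketch, not part of the library: requiredThreadPoolSize is a hypothetical helper, and the values used for MaxConcurrentWorkGroups (2) and SubgroupSize (8) are assumptions chosen only for illustration, so substitute the values your build actually uses.

```nim
import std/math  # provides ceilDiv (Nim 1.6 or later)

# Hypothetical helper: evaluates the thread pool size formula from the warning.
proc requiredThreadPoolSize(maxConcurrentWorkGroups,
                            sizeX, sizeY, sizeZ,
                            subgroupSize: int): int =
  let invocations = sizeX * sizeY * sizeZ          # invocations per workgroup
  let subgroups = ceilDiv(invocations, subgroupSize)  # subgroups per workgroup
  maxConcurrentWorkGroups * (subgroups + 1)

# Assumed values: 2 concurrent workgroups, subgroup size 8, workgroup 256x1x1.
echo requiredThreadPoolSize(2, 256, 1, 1, 8)  # prints 66
```

With these assumed values the program would need to be compiled with at least -d:ThreadPoolSize=66.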

Warning: Using barrier() within conditional branches may lead to undefined behavior. The emulator is modeled using a single barrier that must be accessible from all threads within a workgroup.

Parameters

numWorkGroups: UVec3 - The number of workgroups in each dimension (x, y, z).
workGroupSize: UVec3 - The size of each workgroup in each dimension (x, y, z).
compute: ThreadGenerator[A, B, C] - The compute shader procedure to execute.
ssbo: A - Storage buffer object(s) containing the data to process.
smem: B - Shared memory for each workgroup.
args: C - Additional arguments passed to the compute shader.

Compute Function Signature

The compute shader procedure can be written in two ways:

With shared memory:

proc computeFunction[A, B, C](
  buffers: A,     # Storage buffer (typically ptr T)
  shared: ptr B,  # Shared memory for workgroup-local data
  args: C         # Additional arguments
) {.computeShader.}

Without shared memory:

proc computeFunction[A, C](
  buffers: A,     # Storage buffer (typically ptr T)
  args: C         # Additional arguments
) {.computeShader.}

Example

type
  Buffers = object
    input, output: seq[float32]
  Shared = seq[float32]
  Args = object
    factor: int32

proc myComputeShader(
    buffers: ptr Buffers,
    shared: ptr Shared,
    args: Args) {.computeShader.} =
  # Computation logic here
  discard  # placeholder so the empty body compiles

let numWorkGroups = uvec3(4, 1, 1)
let workGroupSize = uvec3(256, 1, 1)
var buffers: Buffers
let coarseFactor = 4'i32

runComputeOnCpu(
  numWorkGroups, workGroupSize,
  myComputeShader,
  addr buffers,
  newSeq[float32](workGroupSize.x.int),  # newSeq expects a signed length
  Args(factor: coarseFactor)
)

GLSL Built-in Variables

GLSL Constant	Type	Description
gl_WorkGroupID	UVec3	ID of the current workgroup [0..gl_NumWorkGroups)
gl_WorkGroupSize	UVec3	Size of the workgroup (x, y, z)
gl_NumWorkGroups	UVec3	Total number of workgroups (x, y, z)
gl_NumSubgroups	uint32	Number of subgroups in the workgroup
gl_SubgroupID	uint32	ID of the current subgroup [0..gl_NumSubgroups)
gl_GlobalInvocationID	UVec3	Global ID of the current invocation [0..gl_NumWorkGroups * gl_WorkGroupSize)
gl_LocalInvocationID	UVec3	Local ID within the workgroup [0..gl_WorkGroupSize)
gl_SubgroupSize	uint32	Size of subgroups (constant across all subgroups)
gl_SubgroupInvocationID	uint32	ID of the invocation within the subgroup [0..gl_SubgroupSize)
gl_SubgroupEqMask	UVec4	Mask with bit set only at current invocation's index
gl_SubgroupGeMask	UVec4	Mask with bits set at and above current invocation's index
gl_SubgroupGtMask	UVec4	Mask with bits set above current invocation's index
gl_SubgroupLeMask	UVec4	Mask with bits set at and below current invocation's index
gl_SubgroupLtMask	UVec4	Mask with bits set below current invocation's index
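
The relationship between the five subgroup masks can be illustrated with plain bit arithmetic. This is a sketch under stated assumptions, not the emulator's implementation: it models only the low 32 bits of each UVec4 mask, uses an assumed subgroup size of 8 (the real SubgroupSize constant may differ), and assumes that only bits below the subgroup size are set. SubgroupMasks and subgroupMasks are hypothetical names.

```nim
import std/strutils  # toBin, used only for printing

const subgroupSize = 8  # assumed value, must be below 32 in this sketch

type SubgroupMasks = tuple[eq, ge, gt, le, lt: uint32]

# Hypothetical helper: masks for one invocation index within its subgroup.
proc subgroupMasks(invocationId: uint32): SubgroupMasks =
  let active = (1'u32 shl subgroupSize) - 1'u32  # bits 0 ..< subgroupSize
  let eqMask = 1'u32 shl invocationId            # only the current index
  let ltMask = eqMask - 1'u32                    # all indices below
  let leMask = ltMask or eqMask                  # current index and below
  result = (eq: eqMask,
            ge: active and not ltMask,           # current index and above
            gt: active and not leMask,           # all indices above
            le: leMask,
            lt: ltMask)

let m = subgroupMasks(3)
echo "eq ", toBin(BiggestInt(m.eq), subgroupSize)  # eq 00001000
echo "ge ", toBin(BiggestInt(m.ge), subgroupSize)  # ge 11111000
echo "gt ", toBin(BiggestInt(m.gt), subgroupSize)  # gt 11110000
echo "le ", toBin(BiggestInt(m.le), subgroupSize)  # le 00001111
echo "lt ", toBin(BiggestInt(m.lt), subgroupSize)  # lt 00000111
```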

CUDA to GLSL Translation Table

CUDA Concept	GLSL Equivalent	Description
`blockDim`	`gl_WorkGroupSize`	The size of a thread block (CUDA) or work group (GLSL)
`gridDim`	`gl_NumWorkGroups`	The size of the grid (CUDA) or the number of work groups (GLSL)
`blockIdx`	`gl_WorkGroupID`	The index of the current block (CUDA) or work group (GLSL)
`threadIdx`	`gl_LocalInvocationID`	The index of the current thread within its block (CUDA) or work group (GLSL)
`blockIdx * blockDim + threadIdx`	`gl_GlobalInvocationID`	The global index of the current thread (CUDA) or invocation (GLSL)
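
The last row of the table can be checked with a small standalone calculation. This is a minimal sketch that does not use the library: Dim3, dim3, and globalInvocationId are hypothetical stand-ins for UVec3 and the built-in variables, introduced only to show the arithmetic.

```nim
type Dim3 = tuple[x, y, z: uint32]  # hypothetical stand-in for UVec3

proc dim3(x, y, z: uint32): Dim3 = (x: x, y: y, z: z)

# Component-wise equivalent of blockIdx * blockDim + threadIdx.
proc globalInvocationId(workGroupId, workGroupSize, localId: Dim3): Dim3 =
  (x: workGroupId.x * workGroupSize.x + localId.x,
   y: workGroupId.y * workGroupSize.y + localId.y,
   z: workGroupId.z * workGroupSize.z + localId.z)

# Workgroup 2 of size 256x1x1, local invocation 5.
let gid = globalInvocationId(dim3(2, 0, 0), dim3(256, 1, 1), dim3(5, 0, 0))
echo gid.x  # prints 517
```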

Imports

core, vectors, transform, lockstep, api

Types

ThreadGenerator[A; B; C] = proc (buffers: A; shared: ptr B; args: C): ThreadClosure {.nimcall.}

Templates

template runComputeOnCpu(numWorkGroups, workGroupSize: UVec3; compute, ssbo, smem, args: typed)
template runComputeOnCpu(numWorkGroups, workGroupSize: UVec3; compute, ssbo, args: typed)

Exports

DVec4, dvec4, IVec3, bvec4, w=, BVec2, bvec2, uvec3, dvec4, ivec3, bvec4, bvec4, z=, vec3, dvec4, w, bvec2, vec4, $, uvec4, bvec2, dvec2, dvec3, vec4, TVec4, x=, TVec2, uvec4, UVec3, uvec2, dvec4, dvec3, uvec2, DVec3, ivec3, BVec3, y=, ivec2, vec2, uvec4, UVec2, uvec2, bvec4, dvec2, ivec3, vec2, ivec4, bvec4, dvec3, DVec2, bvec3, Vec3, IVec4, ivec3, dvec4, z, $, ivec4, UVec4, uvec3, bvec3, bvec2, vec4, TVec, vec4, TVec3, $, ivec2, ivec2, vec3, []=, uvec4, Vec4, y, uvec3, BVec4, ivec4, dvec3, vec3, bvec3, ivec2, vec2, uvec2, x, dvec2, Vec2, dvec2, uvec4, bvec3, ivec4, IVec2, vec3, vec2, ivec4, uvec3, vec4, [], optimizeReconvergePoints, computeShader, subgroupAny, gl_SubgroupGtMask, subgroupBallot, atomicAdd, atomicCompSwap, subgroupShuffleXor, subgroupAllEqual, subgroupBallotFindLSB, subgroupElect, gl_SubgroupLeMask, subgroupMemoryBarrier, subgroupAdd, gl_SubgroupLtMask, gl_SubgroupSize, atomicOr, subgroupMin, subgroupInverseBallot, atomicXor, atomicExchange, AtomicInt, subgroupExclusiveAdd, subgroupShuffleDown, subgroupBallotInclusiveBitCount, subgroupBallotBitCount, subgroupBroadcastFirst, atomicAnd, subgroupAll, subgroupBallotExclusiveBitCount, subgroupShuffle, subgroupMax, barrier, memoryBarrier, gl_SubgroupEqMask, subgroupInclusiveAdd, subgroupBallotFindMSB, subgroupBroadcast, subgroupBallotBitExtract, subgroupBarrier, subgroupShuffleUp, groupMemoryBarrier, gl_SubgroupGeMask, SubgroupSize

src/computesim