Monday, October 24, 2016

Compiling shaders: dynamically uniform variables and "convergent" intrinsics

There are some program transformations that are obviously correct when compiling regular single-threaded or even multi-threaded code, but that cannot be used for shader code. For example:

 v = texture(u_sampler, texcoord);  
 if (cond) {  
   gl_FragColor = v;  
 } else {  
   gl_FragColor = vec4(0.);  
 }  
   
 ... cannot be transformed to ...
   
 if (cond) {
   // The implicitly computed derivative of texcoord
   // may be wrong here if neighbouring pixels don't
   // take the same code path.
   gl_FragColor = texture(u_sampler, texcoord);
 } else {
   gl_FragColor = vec4(0.);  
 }

 ... but the reverse transformation is allowed.
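
To make the derivative problem concrete, here is a small Python model of a single 2x2 pixel quad. It is only a sketch under stated assumptions: the lane numbering, the `ddx` helper, and the choice to treat an inactive lane's texcoord as unavailable (`None`) are all hypothetical and describe the conceptual rule, not any particular hardware.

```python
# Toy model of one 2x2 pixel quad (an assumption for illustration only).
# Lane order: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.

def ddx(values, active):
    """Horizontal derivative per lane: right neighbour minus left neighbour.

    An inactive lane has not evaluated `values` at this program point, so
    any derivative that would read it is modelled as None (undefined).
    """
    result = []
    for lane in range(4):
        left, right = lane & ~1, lane | 1
        if not active[lane]:
            result.append(None)          # this lane does not execute here
        elif active[left] and active[right]:
            result.append(values[right] - values[left])
        else:
            result.append(None)          # the neighbour's value is unavailable
    return result

texcoord_x = [0.0, 0.25, 0.0, 0.25]      # hypothetical per-lane coordinate
cond = [True, False, True, True]         # lane 1 takes the else-branch

all_lanes = [True] * 4
hoisted = ddx(texcoord_x, all_lanes)     # texture() before the branch
sunk = ddx(texcoord_x, cond)             # texture() inside `if (cond)`

print(hoisted)  # [0.25, 0.25, 0.25, 0.25]
print(sunk)     # [None, None, 0.25, 0.25]
```

In this model, sinking the texture call into the branch costs lane 0 its derivative because its horizontal neighbour took the other path, while hoisting never loses information.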

Another example is:

 if (cond) {
   v = texelFetch(u_sampler[1], texcoord, 0);
 } else {
   v = texelFetch(u_sampler[2], texcoord, 0);
 }

 ... cannot be transformed to ...

 v = texelFetch(u_sampler[cond ? 1 : 2], texcoord, 0);
 // Incorrect, unless cond happens to be dynamically uniform.

 ... but the reverse transformation is allowed.

Using GL_ARB_shader_ballot, yet another example is:

 bool cond = ...;
 uint64_t v = ballotARB(cond);
 if (other_cond) {
   use(v);
 }

 ... cannot be transformed to ...

 bool cond = ...;
 if (other_cond) {
   use(ballotARB(cond));
   // Here, ballotARB returns 1-bits only for threads/work items
   // that take the if-branch.
 }

 ... and the reverse transformation is also forbidden.
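
The ballot case can be sketched the same way. The following Python model assumes a hypothetical 4-lane wave and a `ballot` helper that sets a bit only for lanes that are active at the call site; the names and lane count are illustrative and not taken from the extension specification.

```python
def ballot(cond, active):
    """Bitmask with bit i set iff lane i is active and cond[i] is true."""
    mask = 0
    for lane, (c, a) in enumerate(zip(cond, active)):
        if a and c:
            mask |= 1 << lane
    return mask

cond       = [True, True, False, True]     # hypothetical per-lane values
other_cond = [True, False, True, False]
all_lanes  = [True] * 4

before_branch = ballot(cond, all_lanes)    # ballotARB(cond) above the if
inside_branch = ballot(cond, other_cond)   # ballotARB(cond) inside the if

print(bin(before_branch))  # 0b1011
print(bin(inside_branch))  # 0b1
```

The two placements produce different masks, so the call cannot be moved in either direction without changing the value that `use` sees.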

These restrictions are all related to the GPU-specific SPMD/SIMT execution model, and they need to be taught to the compiler. Unfortunately, we partially fail at that today.

Here are some types of restrictions to think about (each of these restrictions should apply on top of any other restrictions that are expressible in the usual, non-SIMT-specific ways, of course):

  1. An instruction can be moved from location A to location B only if B dominates or post-dominates A.

    This restriction applies e.g. to instructions that take derivatives (like in the first example) or that explicitly take values from neighbouring threads (like in the third example). It also applies to barrier instructions.

    This is LLVM's convergent function attribute as I understand it.

  2. An instruction can be moved from location A to location B only if A dominates or post-dominates B.

    This restriction applies to the ballot instruction above, but it is not required for derivative computations or barrier instructions.

    This is in a sense dual to LLVM's convergent attribute, so it's co-convergence? Divergence? Not sure what to call this.

  3. Something vague about not introducing additional non-uniformity in the arguments of instructions / intrinsic calls.

    This last one applies to the sampler parameter of texture intrinsics (for the second example), to the ballot instruction, and also to the texture coordinates on sampling instructions that implicitly compute derivatives.
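
The first two rules can be checked mechanically on a CFG. The sketch below is a minimal Python illustration, assuming a hypothetical diamond-shaped CFG in which every block is reachable; the helper names (`dominators`, `rule1_allows`, `rule2_allows`) are invented for this example and are not LLVM APIs.

```python
def dominators(succ, entry):
    """Iterative dominator sets for a CFG given as {block: [successors]}.

    Assumes every block is reachable from `entry`.
    """
    blocks = set(succ)
    preds = {b: [p for p in succ if b in succ[p]] for b in blocks}
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks - {entry}:
            new = set.intersection(*(dom[p] for p in preds[b])) | {b}
            if new != dom[b]:
                dom[b], changed = new, True
    return dom

def reverse(succ):
    """Flip every edge of the CFG."""
    rev = {b: [] for b in succ}
    for b, targets in succ.items():
        for t in targets:
            rev[t].append(b)
    return rev

# Hypothetical diamond: entry -> (then | else) -> merge
cfg = {"entry": ["then", "else"], "then": ["merge"],
       "else": ["merge"], "merge": []}
dom = dominators(cfg, "entry")
pdom = dominators(reverse(cfg), "merge")   # post-dominators via reversed CFG

def covers(x, y):
    """True if x dominates or post-dominates y."""
    return x in dom[y] or x in pdom[y]

def rule1_allows(a, b):
    """Rule 1: moving A -> B needs B to dominate or post-dominate A."""
    return covers(b, a)

def rule2_allows(a, b):
    """Rule 2: moving A -> B needs A to dominate or post-dominate B."""
    return covers(a, b)

print(rule1_allows("then", "entry"))   # True: hoisting out of the branch
print(rule1_allows("entry", "then"))   # False: sinking into the branch
print(rule2_allows("entry", "then"))   # True
print(rule2_allows("then", "entry"))   # False
```

An instruction subject to both rules, like the ballot above, may only move between blocks where both checks pass, for example from `entry` to `merge` in this diamond.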

For the last type of restriction, consider the following example:

 uint idx = ...;
 if (idx == 1u) {
   v = texture(u_sampler[idx], texcoord);
 } else if (idx == 2u) {
   v = texture(u_sampler[idx], texcoord);
 }

 ... cannot be transformed to ...

 uint idx = ...;
 if (idx == 1u || idx == 2u) {
   v = texture(u_sampler[idx], texcoord);
 }

In general, whenever an operation has this mysterious restriction on its arguments, then the second restriction above must apply: we can move it from A to B only if A dominates or post-dominates B, because only then can we be certain that the move introduces no non-uniformity. (At least, this rule applies to transformations that are not SIMT-aware. A SIMT-aware transformation might be able to prove that idx is dynamically uniform even without the predication on idx == 1u or idx == 2u.)
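
A small Python model shows why merging the two branches matters. It assumes a hypothetical 4-lane wave and defines "dynamically uniform" as "all active lanes hold the same value"; the helper name and the sample `idx` values are made up for illustration.

```python
def is_dynamically_uniform(values, active):
    """True if all active lanes hold the same value."""
    seen = {v for v, a in zip(values, active) if a}
    return len(seen) <= 1

idx = [1, 2, 1, 3]   # hypothetical per-lane sampler index

in_if_1   = [i == 1 for i in idx]          # lanes inside `if (idx == 1u)`
in_if_2   = [i == 2 for i in idx]          # lanes inside `else if (idx == 2u)`
in_merged = [i in (1, 2) for i in idx]     # lanes inside the merged branch

print(is_dynamically_uniform(idx, in_if_1))    # True
print(is_dynamically_uniform(idx, in_if_2))    # True
print(is_dynamically_uniform(idx, in_merged))  # False
```

Inside each original branch the predicate itself guarantees uniformity of `idx`; the merged branch loses that guarantee.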

However, the control flow rule is not enough:

 v1 = texture(u_sampler[0], texcoord);
 v2 = texture(u_sampler[1], texcoord);
 v = cond ? v1 : v2;

 ... cannot be transformed to ...

 v = texture(u_sampler[cond ? 0 : 1], texcoord);

The transformation does not break any of the CFG-related rules, and it would clearly be correct for a single-threaded program (given the knowledge that texture(...) is an operation without side effects). So the CFG-based restrictions really aren't sufficient to model the real set of restrictions that apply to the texture instruction. And it gets worse:

 v1 = texelFetch(u_sampler, texcoord[0], 0);
 v2 = texelFetch(u_sampler, texcoord[1], 0);
 v = cond ? v1 : v2;

 ... is equivalent to ...

 v = texelFetch(u_sampler, texcoord[cond ? 0 : 1], 0);

After all, texelFetch computes no implicit derivatives.
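
The contrast between the last two examples can be sketched in Python as well. This is a toy model only: the 1-D `texels` list, the per-lane data, and the `texel_fetch` helper are hypothetical, and the model treats a fetch as a pure per-lane lookup.

```python
texels = [10, 20, 30, 40]                      # hypothetical 1-D texture
texcoord = [[0, 3], [1, 2], [2, 1], [3, 0]]    # per lane: two candidate coords
cond = [True, False, False, True]              # per-lane condition

def texel_fetch(coord):
    """Pure per-lane lookup; no neighbouring lane is consulted."""
    return texels[coord]

# Fetch both, then select ...
fetch_then_select = [texel_fetch(tc[0]) if c else texel_fetch(tc[1])
                     for tc, c in zip(texcoord, cond)]
# ... versus select the coordinate, then fetch once.
select_then_fetch = [texel_fetch(tc[0 if c else 1])
                     for tc, c in zip(texcoord, cond)]

print(fetch_then_select == select_then_fetch)  # True

# Doing the same with the sampler index makes that argument non-uniform.
sampler_index = [0 if c else 1 for c in cond]
sampler_is_uniform = len(set(sampler_index)) == 1
print(sampler_is_uniform)                      # False
```

Folding the select into the coordinate is harmless because the lookup is per-lane, whereas folding it into the sampler index produces exactly the non-uniform argument that the 'uniform' restriction forbids.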

Calling the three kinds of restrictions 'convergent', 'co-convergent', and 'uniform', we get:

 texture(uniform sampler, uniform texcoord) convergent (co-convergent)
 texelFetch(uniform sampler, texcoord, lod) (co-convergent)
 ballotARB(uniform cond) convergent co-convergent
 barrier() convergent

For the texturing instructions, I put 'co-convergent' in parentheses because these instructions aren't inherently 'co-convergent'. The attribute is only there because of the 'uniform' function argument.

Actually, looking at the examples, it seems that co-convergent only appears when a function has a uniform argument. Then again, the texelFetch function can be moved freely in the CFG by a SIMT-aware pass that can prove that the move doesn't introduce non-uniformity to the sampler argument, so being able to distinguish functions that are inherently co-convergent (like ballotARB) from those that are only implicitly co-convergent (like texture and texelFetch) is still useful.

For added fun, things get muddier when you notice that in practice, AMDGPU doesn't even flag texturing intrinsics as 'convergent' today. Conceptually, the derivative-computing intrinsics need to be convergent to ensure that the texture coordinates for neighbouring pixels are preserved (as in the very first example). However, the AMDGPU backend does register allocation after the CFG has been transformed into the wave-level control-flow graph. So register allocation automatically keeps the texture coordinates of neighbouring pixels live, even when a texture instruction is sunk into a location with additional control-flow dependencies.

When we reach a point where vector register allocation happens with respect to the thread-level control-flow graph, then texture instructions really need to be marked as convergent for correctness. (This change would be beneficial overall, but is tricky because scalar register allocation must happen with respect to the wave-level control-flow graph. LLVM currently wants to allocate all registers in one pass.)