Efficient Rendering Study Notes (2022.07.19)
Forward Rendering
- Do everything we need to shade a pixelLauritzen10
- For each light
- Shadow attenuation (sampling shadow maps)
- Distance attenuation
- Evaluate lighting and accumulate
- Object rendering pass does everythingKimBarrero11
- Single pass over geometry generates “final” imagePesce14
- Lights are bound to draw calls (via uniforms)Pesce14
- Accurate culling of light influence on geometry requires CSG splitsPesce14
- Multiple lights require either loops / branches in the shaders or shader permutationsPesce14
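A minimal sketch (not from any of the cited sources) of this baseline forward path, assuming a hypothetical per-draw constant buffer of lights; LightData, g_lights, g_numLights and the Lambert-only shading are illustrative:
// Hypothetical per-draw light data, bound via uniforms/constants (illustrative).
struct LightData
{
    float3 positionWS;
    float  range;
    float3 color;
    float  pad;
};

cbuffer PerDrawLights : register(b1)
{
    uint      g_numLights;
    LightData g_lights[8]; // fixed upper bound per draw call
};

float3 ShadeForward(float3 positionWS, float3 normalWS, float3 albedo)
{
    float3 result = 0;
    // Loop (or shader permutation) over the lights bound to this draw call.
    for (uint i = 0; i < g_numLights; ++i)
    {
        float3 toLight = g_lights[i].positionWS - positionWS;
        float  dist    = length(toLight);
        float3 L       = toLight / max(dist, 1e-5);
        // Distance attenuation; shadow attenuation would be sampled here as well.
        float  atten   = saturate(1.0 - dist / g_lights[i].range);
        result += albedo * g_lights[i].color * saturate(dot(normalWS, L)) * atten;
    }
    return result;
}
The loop/branch vs. permutation trade-off listed above is exactly the choice of whether this loop stays dynamic or is compiled out per light count.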
Characteristics:
- Advantages:
- Transparency via alpha blendingOlssonBilleterAssarsson13Olsson15
- MSAA and related techniques through hardware features (much less memory storage is required)OlssonBilleterAssarsson13Olsson15
- Fastest in its baseline case (single light per pixel, “simple” shaders or even baked lighting)Pesce14
- Doesn’t have a “constant” up-front investment, you pay as you go (more lights, more textures…)
- Least memory necessary (least bandwidth, at least in theory). Makes MSAA possiblePesce14Olsson15
- Single frame bufferOlsson15
- Easy to integrate with shadowmaps (can render them one at a time, or almost)Pesce14
- No extra pass over geometryPesce14
- Any material, except ones that require screen-space passes like Jimenez’s SS-SSSPesce14
- Single passOlsson15
- Simple if only few lightsOlsson15
- e.g., the sun
- Varying shading models is easy
- FlexibleOlsson15
- Forward or Deferred
- Issues:
- Computing which lights affect each body consumes CPU time, and in the worst case it becomes an O(n × m) operationKoonce07, Ineffective light cullingLauritzen10, Light culling not efficientAndersson11Pesce14
- Object space at best
- Shaders often require more than one render pass to perform lighting, with complicated shaders requiring worst-case O(n) render passes for n lightsKoonce07Lauritzen10
- Adding new lighting models or light types requires changing all effect source filesKoonce07
- Lighting / texturing variations have to be dealt with via dynamic branches, which are often problematic for the shader compiler (must allocate registers for the worst case…), conditional moves (wasted work and registers), or shader permutations (combinatorial explosion)Pesce14
- Shaders quickly encounter the instruction count limit of Shader Model 2.0Koonce07
- Memory footprint of all inputsLauritzen10
- Everything must be resident at the same time
- Shading small triangles is inefficientLauritzen10
- Shader permutations not efficientAndersson11Pesce14Olsson15
- Expensive & more difficult decaling / destruction maskingAndersson11, Decals needs to be multiplass, lit twice.Pesce14
- Complex shaders might not run optimallyPesce14Olsson15
- Texturing and lighting (and shadowing) are done in the same pass, thus shaders can require a lot of registers and yield limited occupancy
- Accessing many textures in sequence might create more cache thrashing than accessing them in separate passes
- Many “modern” rendering effects require a depth/normal pre-pass anyway (i.e. SSAO, screen-space shadows, reflections, and so on)Pesce14
- All shading is done on geometry, which means we pay all the eventual inefficiencies (e.g. partial quads, overdraw) on all shadersPesce14Olsson15
- No shadow map reuseOlsson15
Classic forward rendering:StewartThomas13
- Depth pre-pass
- Prevents overdraw when shading
- Forward shading
- Pixel Shader
- Iterates through light list set for each object
- Evaluates material
- Diffuse texture, spec mask, bump map, etc.
- Pixel Shader
Modern Forward Shading:Olsson15
- Optional Pre-Z / Geometry Pass
- Light Assignment
- Build Light Acceleration Structure (Grid)
- Geometry Pass
- Just your normal shading pass
- For each fragment
- Look up light list in acceleration structure
- Loop over lights and accumulate shading
- Write shading to frame buffer
Z Pre-Pass rendering
Construct a depth-only pass (Z pre-pass) first to fill the z-buffer with depth data and, at the same time, prime z-culling. Then render the scene using this occlusion data to prevent pixel overdraw.EngelShaderX709
A naïve multi-light solution that accompanies a Z pre-pass renderer design pattern would just render a limited number of lights in the pixel shader.EngelShaderX709
A more advanced approach stores light source properties such as position, light color, and other light properties in textures following a 2D grid laid out in the game world.EngelShaderX709
In order to render many lights:EngelSiggraph09
- Re-render geometry for each lightEngelSiggraph09
- Lots of geometry throughput
- Write pixel shader with four or eight lightsEngelSiggraph09
- Draw lights per-object
- Need to split up geometry following light distribution
- Store light properties in textures and index into this textureEngelSiggraph09
- Dependent texture look-up and lights are not fully dynamic
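A hedged sketch of the last option (light properties stored in a texture/buffer indexed by a world-space grid). A Buffer stands in for the texture here; CELL_SIZE, GRID_DIM, LIGHTS_PER_CELL and the data layout are illustrative assumptions, and the dependent lookups in the loop are exactly the caveat noted above:
// Hypothetical layout: the world is covered by a GRID_DIM x GRID_DIM grid of cells;
// each cell owns LIGHTS_PER_CELL slots in a light table, two float4s per light
// (position.xyz + range, color.rgb).
Buffer<float4> g_lightTable : register(t0);

static const float CELL_SIZE       = 10.0; // world units per cell (illustrative)
static const uint  GRID_DIM        = 64;
static const uint  LIGHTS_PER_CELL = 4;

float3 ShadeFromLightGrid(float3 positionWS, float3 normalWS, float3 albedo)
{
    // Map the world-space XZ position to a grid cell.
    uint2 cell    = (uint2)clamp(floor(positionWS.xz / CELL_SIZE), 0.0, GRID_DIM - 1.0);
    uint  cellIdx = cell.y * GRID_DIM + cell.x;

    float3 result = 0;
    for (uint i = 0; i < LIGHTS_PER_CELL; ++i)
    {
        // Dependent lookups: the addresses are only known once the pixel runs.
        uint   base     = (cellIdx * LIGHTS_PER_CELL + i) * 2;
        float4 posRange = g_lightTable[base + 0];
        float4 color    = g_lightTable[base + 1];

        float3 toLight = posRange.xyz - positionWS;
        float  dist    = length(toLight);
        float  atten   = saturate(1.0 - dist / posRange.w);
        result += albedo * color.rgb * saturate(dot(normalWS, toLight / max(dist, 1e-5))) * atten;
    }
    return result;
}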
Space Marine:KimBarrero11
- Reject occluded objects early in G-Buffer
- Hi-Z to reject before ROP (Raster Operation)
- Front-to-back
- Only draw:
- maximum 75 objects
- Big enough objects in projection space
- Other objects will be drawn to Z-buffer in Gbuffer pass
Unreal:Anagnostou17
- Uses reverse-Z
Lighting Pass
Single Pass Lighting
For each object:
Render mesh, applying all lights in one shader
For each object:
Find all lights affecting object
Render all lighting and material in a single shader
- Good for scenes with small number of lights (e.g. outdoor sunlight)Hargreaves04
- Difficult to organize if there are many lightsHargreaves04HargreavesHarris04Valient07
- Code generation can result in thousands of combinations for a single template shaderHargreavesHarris04Valient07
- Shader for each material vs. light setup combinationValient07
- Easy to overflow shader length limitationsHargreaves04
- Hidden surfaces can cause wasted shadingHargreavesHarris04Valient07
- Hard to integrate with shadowsHargreavesHarris04
- Stencil = No Go
- Shadow Maps = Easy to overflow VRAM
- All shadow maps have to be in memoryValient07
Multipass Lighting
For each light:
For each object affected by the light:
framebuffer += object * light
For each light:
For each object:
Add lighting from single light to frame buffer
- Worst case complexity is num_objects * num_lightsHargreaves04Lee09
- Sorting by light or by object are mutually exclusiveHargreaves04
- Hard to maintain good batching
- Ideally the scene should be split exactly along light boundaries, but getting this right for dynamic lights can be a lot of CPU workHargreaves04
- Hidden surfaces can cause wasted shadingHargreavesHarris04Valient07
- High batch count (1 per object per light)HargreavesHarris04
- Even higher if shadow-casting
- Lots of repeated work each pass:HargreavesHarris04Valient07Lee09Lauritzen10
- Vertex transform & setup
- Anisotropic filtering
- Not a scalable solutionLauritzen10
- Can only be justified when targeting graphics that generally consist of low- and medium-poly-count scenes with no complex materials, a very small number of light types, and where illumination comes from a few lights spread all over the scenePlaceres06
- Shader for each material and light typeValient07
- Hard to optimize, we were often vertex boundLee09
- High vertex processing costTrebilco09
Tiled Forward Shading
- Advantages:
- Light management is decoupled from geometryOlssonAssarsson11
- Light data can be uploaded to the GPU once per sceneOlssonAssarsson11
- FSAA works as expectedOlssonAssarsson11
- Common terms in the rendering equation can be factored outOlssonAssarsson11
- Light accumulation is done in register, at full floating point precisionOlssonAssarsson11
- Same shading function as Tiled DeferredOlssonAssarsson11
- Disadvantages:
- Each fragment may be shaded more than onceOlssonAssarsson11
- Can be addressed by using a pre-z pass
Basic AlgorithmOlssonBilleterAssarsson13
- Subdivide screen into tiles
- (Optional): pre-Z pass
- (Optional): find min / max z-bounds for each tile
- Assign lights to each tile
- Render geometry and compute shading for each generated fragment
// 1D texture holding per-tile light lists
uniform isamplerBuffer tex_tileLightLists;
// uniform buffer holding each tile's light count and
// start offset of the tile's light list (in
// tex_tileLightLists)
uniform TileLightListRanges
{
    ivec2 u_lightListRange[MAX_NUM_TILES];
};

void shading_function(inout FragmentData aFragData)
{
    // ...
    // find fragment's tile using gl_FragCoord
    ivec2 tileCoord = ivec2(gl_FragCoord.xy) / ivec2(TILE_SIZE_X, TILE_SIZE_Y);
    int tileIdx = tileCoord.x + tileCoord.y * LIGHT_GRID_SIZE_X;
    // fetch tile's light data start offset (.y) and
    // number of lights (.x)
    ivec2 lightListRange = u_lightListRange[tileIdx].xy;
    // iterate over lights affecting this tile
    for (int i = 0; i < lightListRange.x; ++i)
    {
        int lightIndex = lightListRange.y + i;
        // fetch global light ID
        int globalLightId = texelFetch(tex_tileLightLists, lightIndex).x;
        // get the light's data (position, colors, ...)
        LightData lightData;
        light_get_data(lightData, globalLightId);
        // compute shading from the light
        shade(aFragData, lightData);
    }
    // ...
}
Subdivision of Screen
- Regular N × N pixel tiles
Optional pre-Z Pass
- Required if we wish to find the Z-bounds for each tile
- In the final rendering pass, it can reduce the number of samples that need to be shaded through early-Z tests and similar hardware features
- Should only include opaque geometry
Optional Min / Max Z-Bounds
- Yields a further significant improvement
- Yields smaller per-tile bounding volumes
- Reduces the number of lights that affect a tile
Light Assignment
- CPU variant:
- Find the screen-space axis-aligned bounding boxes (AABBs) for each light source and loop over all the tiles that are contained in the 2D region of the AABB
- If min / max depth is available, perform additional test to discard lights that are outside of the tile in the Z-direction
- GPU variant:
- Each tile gets its own thread group
Rendering and Shading
- For each generated sample,
- Look up which lights affect that sample by checking what lights are assigned to the sample’s tile
Transparency Support
// assign lights to 2D tiles
tiles2D = build_2d_tiles();
lightLists2D = assign_lights_to_2d_tiles(tiles2D);
// draw opaque geometry in pre-Z pass and find tiles'
// extents in the Z-direction
depthBuffer = render_preZ_pass();
tileZBounds = reduce_z_bounds(tiles2D, depthBuffer);
// for transparent geometry, prune lights against maximum Z-direction
lightListsTrans = prune_lights_max(lightLists2D, tileZBounds);
// for opaque geometry additionally prune lights against
// minimum Z-direction
lightListsOpaque = prune_lights_min(lightListsTrans, tileZBounds);
// ...
// later: rendering
draw(opaque geometry, lightListsOpaque);
draw(transparent geometry, lightListsTrans);
Forward+ Rendering
- Goal:HaradaMcKeeYang13
- Materials may need to be both physically and nonphysically based
- Artists want complete freedom regarding the number of lights that can be placed in a scene at once
- Rendering data should be decoupled from the underlying rendering engine
Forward+:StewartThomas13
- Depth pre-pass
- Prevents overdraw when shading
- Provides tile depth bounds
- Separate depth prepass + depth buffer for transparentsNeubeltPettineo14
- May include vertex normal and velocityPettineo15
- Tiled light culling
- Compute shader
- Generates per-tile light list
- Transparent light list generated per-tileNeubeltPettineo14Pettineo15
- TileMinDepth = TileMin(transparentDepth)
- TileMaxDepth = TileMax(opaqueDepth)
- Culled using depth bufferPettineo15
- Async compute -> mostly freePettineo15
- Forward shading
- Pixel Shader
- Iterates through light list calculated by tiled light culling
- Evaluates material
- Diffuse texture, spec mask, bump map, etc.
- Pixel Shader
- Forward+ Light-culling stage before final shadingHaradaMcKeeYang12
- Stages:HaradaMcKeeYang12
- Depth Pre-Pass (Z prepassHaradaMcKeeYang13)
- Light CullingHaradaMcKeeYang13
- Final ShadingHaradaMcKeeYang13
- Advantages:
- Requires less memory traffic than compute-based deferred lightingHaradaMcKeeYang12
- Same memory as forward, more bandwidth, enables MSAAPesce14
- Any material (same as forward)Pesce14
- Compared to forward, no mesh splitting is necessary, much less shader permutations, less draw callsPesce14
- Compared to forward it handles dynamic lights with good cullingPesce14
- Disadvantages:
- Geometry submitted twiceStewartThomas13
- Small trianglesStewartThomas13
- Light occlusion culling requires a full depth pre-pass for a total of two geometrical passesPesce14
- Can be sidestepped with a clustered light grid
- All shadowmaps need to be generated upfront (more memory) or splatted in screen-space in a pre-pass
- All lighting permutations need to be addressed as dynamic branches in the shaderPesce14
- Not good if we need to support many kinds of light/shadow types
- Compared to forward, seems a steep price to pay to just get rid of geometry cuttingPesce14
- Even if this “solved” shader permutations, its solution is the same as doing forward with shaders that dynamically branch over light types/numbers of lights and setting these parameters per draw call
Light Culling
- Similar to the light-accumulation step of deferred lightingHaradaMcKeeYang13
- Calculates a list of light indices overlapping a pixelHaradaMcKeeYang12 instead of lighting componentsHaradaMcKeeYang13
- However, per-pixel calculation has some issues:
- Memory footprint
- Efficiency of computation at light-culling stage
- Split the screen into tiles and light indices are calculated on a per-tile basisHaradaMcKeeYang12
- Implemented using a single compute shaderStewartThomas13
- How to reduce false positives?
- Lights are too far away!
- 3D implementation uses too much memory
- 2.5 Culling!
Implementation
Gather Approach
- Thread group per tileHaradaMcKeeYang12StewartThomas13
- e.g. [numthreads(16, 16, 1)] for a 16 × 16 tile size
- Frustum of the tile is calculated using the range of the screen space of the tile and max/min depth values of the pixelsHaradaMcKeeYang12StewartThomas13
- Kernel first uses all the threads in a thread group to read a light to the local registerHaradaMcKeeYang12
- Overlap of the lights to the frustum of the tile is checked in parallel
- If a light overlaps, the thread appends it to LDS using local atomic operations
groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];StewartThomas13
- Flushes lights to the global memory using all threadsStewartThomas13
RWBuffer<uint> g_PerTileLightIndexBufferOut : register(u0);
StewartThomas13
- 256 lights are culled in parallel (for 16 × 16 tile size)StewartThomas13
- Simple and effective if the number of lights is not too largeHaradaMcKeeYang12
// GET_GROUP_IDX: thread group index in X direction (SV_GroupID)
// GET_GROUP_IDY: thread group index in Y direction (SV_GroupID)
// GET_GLOBAL_IDX: global thread index in X direction (SV_DispatchThreadID)
// GET_GLOBAL_IDY: global thread index in Y direction (SV_DispatchThreadID)
// GET_LOCAL_IDX: local thread index in X direction (SV_GroupThreadID)
// GET_LOCAL_IDY: local thread index in Y direction (SV_GroupThreadID)
// No global memory write is necessary until all lights are tested
groupshared u32 ldsLightIdx[LIGHT_CAPACITY]; // Light index storage
groupshared u32 ldsLightIdxCounter;          // Light index counter for the storage

void appendLightToList(int i)
{
    u32 dstIdx = 0;
    InterlockedAdd(ldsLightIdxCounter, 1, dstIdx);
    if (dstIdx < LIGHT_CAPACITY)
    {
        ldsLightIdx[dstIdx] = i;
    }
}
...
// 1: computation of the frustum of a tile in view space
float4 frustum[4];
{ // construct frustum
    float4 v[4];
    // projToView:
    //   takes screen-space pixel indices and a depth value,
    //   returns coordinates in view space
    v[0] = projToView(8 * GET_GROUP_IDX,       8 * GET_GROUP_IDY,       1.f);
    v[1] = projToView(8 * (GET_GROUP_IDX + 1), 8 * GET_GROUP_IDY,       1.f);
    v[2] = projToView(8 * (GET_GROUP_IDX + 1), 8 * (GET_GROUP_IDY + 1), 1.f);
    v[3] = projToView(8 * GET_GROUP_IDX,       8 * (GET_GROUP_IDY + 1), 1.f);
    float4 o = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int i = 0; i < 4; ++i)
    {
        // createEquation:
        //   creates a plane equation from three vertex positions
        frustum[i] = createEquation(o, v[i], v[(i + 1) & 3]);
    }
}
...
// 2: clip the frustum by using the max / min depth values of the pixels in the tile
float depth = depthIn.Load(uint3(GET_GLOBAL_IDX, GET_GLOBAL_IDY, 0));
float4 viewPos = projToView(GET_GLOBAL_IDX, GET_GLOBAL_IDY, depth);
int lIdx = GET_LOCAL_IDX + GET_LOCAL_IDY * 8;
{ // calculate bound
    if (lIdx == 0) // initialize
    {
        ldsZMax = 0;          // max z coordinate
        ldsZMin = 0xffffffff; // min z coordinate
    }
    GroupMemoryBarrierWithGroupSync();
    u32 z = asuint(viewPos.z);
    if (depth != 1.f)
    {
        AtomMax(ldsZMax, z);
        AtomMin(ldsZMin, z);
    }
    GroupMemoryBarrierWithGroupSync();
    maxZ = asfloat(ldsZMax);
    minZ = asfloat(ldsZMin);
}
...
// 3: cull lights
// 8 x 8 thread group is used, thus 64 lights are processed in parallel
for (int i = 0; i < nBodies; i += 64)
{
int il = lIdx + i;
if (il < nBodies)
{
// overlaps:
// light-geometry overlap check using separating axis theorem
if (overlaps(frustum, gLightGeometry[i]))
{
// appendLightToList
// Store light index to the lsit of the overlapping lights
appendLightToList(il);
}
}
}
...
// 4: fill the light indices to the assigned contiguous memory of gLightIdx using all the threads in a thread group
{ // write back
    u32 startOffset = 0;
    if (lIdx == 0)
    { // reserve memory
        if (ldsLightIdxCounter != 0)
        {
            InterlockedAdd(gLightIdxCounter, ldsLightIdxCounter, startOffset);
            ptLowerBound[tileIdx] = startOffset;
            ldsLightIdxStart = startOffset;
        }
    }
    GroupMemoryBarrierWithGroupSync();
    startOffset = ldsLightIdxStart;
    for (int i = lIdx; i < ldsLightIdxCounter; i += 64)
    {
        gLightIdx[startOffset + i] = ldsLightIdx[i];
    }
}
Scatter Approach
- Computes which tiles a light overlaps and writes the light- and tile-index data to a bufferHaradaMcKeeYang12
- Thread per lightHaradaMcKeeYang12
- The data of the buffer (ordered by light index at this point) needs to be sorted by tile indexHaradaMcKeeYang12
- We want a list of light indices per tile
- Radix sort
- Run kernels to find the start and end offsets of each tile in the buffer
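A compute-shader sketch of this scatter variant, under illustrative assumptions: per-light screen-space tile bounds are precomputed, and all resource and struct names are hypothetical. Each thread handles one light and appends (tile index, light index) pairs; the radix sort by tile index and the offset-finding kernels described above would follow:
// One thread per light: append (tile index, light index) pairs for every tile
// touched by the light's screen-space bounds.
struct LightTileBounds
{
    uint2 tileMin;
    uint2 tileMax;
};

StructuredBuffer<LightTileBounds> g_lightBounds : register(t0);
RWStructuredBuffer<uint2>         g_pairs       : register(u0); // (tileIdx, lightIdx)
RWStructuredBuffer<uint>          g_pairCount   : register(u1);

cbuffer CullConstants : register(b0)
{
    uint g_numLights;
    uint g_numTilesX;
};

[numthreads(64, 1, 1)]
void ScatterLightsCS(uint lightIdx : SV_DispatchThreadID)
{
    if (lightIdx >= g_numLights)
        return;

    LightTileBounds b = g_lightBounds[lightIdx];
    for (uint ty = b.tileMin.y; ty <= b.tileMax.y; ++ty)
    {
        for (uint tx = b.tileMin.x; tx <= b.tileMax.x; ++tx)
        {
            uint writeIdx;
            InterlockedAdd(g_pairCount[0], 1, writeIdx);
            // Pairs come out ordered by light index; sorting by tile index and
            // finding per-tile start/end offsets produce the per-tile light lists.
            g_pairs[writeIdx] = uint2(ty * g_numTilesX + tx, lightIdx);
        }
    }
}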
2.5 CullingHarada12
- Additional memory usage
- 0B global memory
- 4B local memory per work group
- Additional computation complexity
- A few bit and arithmetic instructions
- A few lines of code for light culling
- No changes for other stages
- Additional runtime overhead
- < 10% compared to the original light culling
IDEA:
- Split frustum in z direction
- Uniform split for a frustum
- Varying split among frustums
FRUSTUM CONSTRUCTION:
- Calculate depth bound
- max/min values of depth
- Split depth direction into 32 cells
- min value and cell size
- Flag occupied cell
- 32 bit depth mask per work group
LIGHT CULLING:
- If a light overlaps the frustum
- Calculate the depth mask for the light
- Check overlap using the depth mask of the frustum
- Light depth mask & frustum depth mask
1: frustum[0-4] ← Compute 4 planes at the boundary of a tile
2: z ← Fetch depth value of the pixel
3: ldsMinZ ← atomMin(z)
4: ldsMaxZ ← atomMax(z)
5: frustum[5, 6] ← Compute 2 planes using ldsMinZ, ldsMaxZ
6: depthMaskT ← atomOr(1 << getCellIndex(z))
7: for all the lights do
8: iLight ← lights[i]
9: if overlaps(iLight, frustum) then
10: depthMaskL ← Compute mask using light extent
11: overlapping ← depthMaskT ∧ depthMaskL
12: if overlapping then
13: appendLight(i)
14: end if
15: end if
16: end for
17: flushLightIndices()
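Step 10 above (“compute mask using light extent”) is not spelled out; a possible sketch for a sphere light, assuming the tile's minimum view-space Z and the 32-cell size were derived from ldsMinZ / ldsMaxZ as in steps 3-5 (all names are illustrative):
// Build the 32-bit occupancy mask covered by a sphere light along the tile's
// depth range (tileMinZ and cellSize are assumed precomputed per work group).
uint ComputeLightDepthMask(float lightViewZ, float lightRadius,
                           float tileMinZ, float cellSize)
{
    float lightMinZ = lightViewZ - lightRadius;
    float lightMaxZ = lightViewZ + lightRadius;

    uint firstCell = (uint)clamp(floor((lightMinZ - tileMinZ) / cellSize), 0.0, 31.0);
    uint lastCell  = (uint)clamp(floor((lightMaxZ - tileMinZ) / cellSize), 0.0, 31.0);

    // Set every bit from firstCell to lastCell inclusive.
    uint numCells = lastCell - firstCell + 1;
    uint mask = (numCells >= 32) ? 0xFFFFFFFF : (((1u << numCells) - 1u) << firstCell);
    return mask;
}
// Overlap test (step 11): overlapping = (depthMaskT & ComputeLightDepthMask(...)) != 0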
Shading
- Goes through the list of lights and evaluates materials using information stored for each lightHaradaMcKeeYang12
- High pixel overdraw can kill performanceHaradaMcKeeYang12
- Depth Pre-Pass is critical
#define LIGHT_LOOP_BEGIN \
    int tileIndex = GetTileIndex(screenPos); \
    uint startIndex; \
    uint endIndex; \
    GetTileOffsets(tileIndex, startIndex, endIndex); \
    for (uint lightListIdx = startIndex; lightListIdx < endIndex; ++lightListIdx) \
    { \
        int lightIdx = LightIndexBuffer[lightListIdx]; \
        LightParams directLight; \
        LightParams indirectLight; \
        if (isIndirectLight(lightIdx)) \
        { \
            FetchIndirectLight(lightIdx, indirectLight); \
        } \
        else \
        { \
            FetchDirectLight(lightIdx, directLight); \
        }

#define LIGHT_LOOP_END }
...
float4 PS(PSInput i) : SV_TARGET
{
    float3 colorOut = 0;
    LIGHT_LOOP_BEGIN
    colorOut += EvaluateMicrofacet(directLight, indirectLight);
    LIGHT_LOOP_END
    return float4(colorOut, 1.f);
}
Render PassesHaradaMcKeeYang13
One-Bounce Indirect IlluminationHaradaMcKeeYang13
- Generate a reflective shadow map (RSM) of the scene from the point of view of the light
- Normal buffer, color buffer, world-space buffers are generated
- CS is executed to create spotlights at the location captured in the RSM
- Generated spotlights are appended to the main light list
Forward++ RenderingStewartThomas13
- Alpha Blended Geometry
- Shadow Casting Lights
- Global Illumination using VPLs
- Optimizations for depth discontinuities
Alpha Blended Geometry
- Can’t use the opaque scene’s light lists
- Frustum’s z extent was based on the opaque z-buffer
- Render blended geometry to new z-buffer
- Compute new set of tiled light lists
- minZ calculated from blended z-buffer
- maxZ calculated from opaque z-buffer
- Render blended scene using new light lists
- Geometry rendered using two-sided lighting
- Suitable for window glass & game objects
- Particle systems are better off using a custom method
- Calculating list of lights per emitter
- Lighting in vertex shader
- Two-sided lighting:
- Lighting func that accumulates lighting for front and back faces
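A minimal sketch of such a two-sided lighting function (plain Lambert terms; names are illustrative, not from the talk):
// Accumulate diffuse lighting for both the front and back face of alpha-blended
// geometry (e.g. window glass), as described above.
float3 TwoSidedDiffuse(float3 normalWS, float3 lightDirWS, float3 lightColor,
                       float3 frontAlbedo, float3 backAlbedo)
{
    float nDotL = dot(normalWS, lightDirWS);
    float3 front = frontAlbedo * saturate( nDotL); // light hitting the front face
    float3 back  = backAlbedo  * saturate(-nDotL); // light hitting the back face
    return (front + back) * lightColor;
}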
Shadow Casting Lights
- Render shadow maps
- 2D Texture Atlas or Texture Array
- For each shadow-casting light, store the shadow map index in the alpha channel of the light color
- Use dynamic branch in forward lighting pass to compute shadow term
// global list of lights (shadow casting + non-shadow casting)
uint shadowIndex = uint(g_PointLightColor[lightIndex].a * 255.0);
if (shadowIndex < 255) // is it shadow casting?
{
    // Point light
    int face = DirectionToCubeMapFace(lightDirection);
    // pixel position to light space of the cube map face
    float4 texCoord = mul(float4(position, 1), g_ShadowViewProj[shadowIndex][face]);
    texCoord.xyz /= texCoord.w;
    texCoord.xy = 0.5 * texCoord.xy + 0.5;
    // undersample per face (scale, then bias)
    texCoord.xy *= g_ShadowScaleAndBias.xx;
    texCoord.xy += g_ShadowScaleAndBias.yy;
    // set texture coordinates in the atlas
    texCoord.xy += float2(face, shadowIndex);
    texCoord.xy *= rcp(float2(6, MAX_POINT_LIGHT_SHADOWS));
    texCoord.z -= g_ShadowZBias;
    // hardware PCF
    shadowTerm = FilterShadow(g_PointLightShadowAtlas, texCoord.xyz);
}
- Only update shadow maps if necessary
- If geometry has changed
- If the light's position has changed
- Spread cost of update over multiple frames
- Be mindful of multi GPU scenarios
- Index into projected texture in similar manner to shadow maps
- Bit pack the index along with the shadow map index
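One possible way to bit-pack the projected-texture index together with the shadow-map index in the light color's 8-bit alpha (a 4/4-bit split; this layout is an assumption for illustration, not the exact one used by the talk):
// Pack two 4-bit indices into the 8-bit alpha of the light color; the CPU side
// would write (bits / 255.0). 0xF could be reserved to mean "none".
float PackLightIndices(uint shadowMapIndex, uint projectorIndex)
{
    uint bits = (shadowMapIndex & 0xF) | ((projectorIndex & 0xF) << 4);
    return bits / 255.0;
}

void UnpackLightIndices(float alpha, out uint shadowMapIndex, out uint projectorIndex)
{
    uint bits = (uint)(alpha * 255.0 + 0.5);
    shadowMapIndex = bits & 0xF;
    projectorIndex = (bits >> 4) & 0xF;
}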
Depth Discontinuities
- Reduce false positives at depth discontinuities
- Two methods presented
- Split depth range in two at halfway point
- Keep two light lists per tile (one for each depth region)
- In the forward shading pass, each pixel determines which list to use
- 2.5D, partition depth range into 32 cells
- Determine the cell for each pixel in the tile
- Make a bit mask representing which cells are occupied in that tile
- Each light gets a similar bit mask (easy for spheres)
- Logical AND the light bit mask with the tile bit mask
Clustered Forward+Leadbetter14
- Avoids the need for a depth pre-pass by calculating light lists at multiple depths for each sub-rectangle and using the most appropriate cluster during surface shading.
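A sketch of a common clustered indexing scheme (tiles in XY plus exponential depth slices), not necessarily Leadbetter's exact layout; CLUSTERS_X/Y/Z and TILE_SIZE are illustrative choices:
// Compute a cluster index from screen position and view-space depth during shading.
static const uint  CLUSTERS_X = 16;
static const uint  CLUSTERS_Y = 9;
static const uint  CLUSTERS_Z = 24;
static const float TILE_SIZE  = 120.0; // pixels per cluster in X and Y

uint ComputeClusterIndex(float2 screenPos, float viewZ, float nearZ, float farZ)
{
    uint2 tile = (uint2)(screenPos / TILE_SIZE);
    // Exponential (logarithmic) slicing keeps cluster depth ratios roughly uniform.
    float slicef = log(viewZ / nearZ) / log(farZ / nearZ) * CLUSTERS_Z;
    uint  slice  = (uint)clamp(slicef, 0.0, CLUSTERS_Z - 1.0);
    return (slice * CLUSTERS_Y + tile.y) * CLUSTERS_X + tile.x;
}
Because the light lists exist at every depth slice, the surface shader can pick the right cluster without a depth pre-pass, which is the point made above.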
Deferred Rendering
Goal:
Q: Why deferred rendering?
A: Combine conventional rendering techniques with the advantages of image space techniquesCalver03
- Advantages:
- Lights major cost is based on the screen area coveredCalver03, Predictable performance proportional to the lights’ screen-space areasShishkovtsov05Valient07
- All lighting is per-pixel and all surfaces are lit equallyCalver03Thibieroz04
- Lights can be occluded like other objects, this allows fast hardware Z-RejectCalver03
- Shadow mapping is fairly cheapCalver03
- Easily integrates with popular shadow techniquesHargreavesHarris04Placeres06
- Excellent batchingHargreaves04, Greatly simplifies batchingHargreavesHarris04, Cuts down on large numbers of batchesShishkovtsov05
- Render each triangle exactly onceHargreaves04, Only a single geometry pass is requiredThibieroz04Lee09Thibieroz11, Executes only texturing on geometry so it suffers less from partial quads, overdrawPesce14Olsson15
- Shade each visible pixel exactly onceHargreaves04, “Perfect” O(1) depth complexity for lightingHargreavesHarris04Thibieroz04, Perfect depth complexity for lightingShishkovtsov05Placeres06KnightRitchieParrish11Olsson15
- Easy to add new types of lighting shaderHargreaves04Koonce07
- Other kinds of postprocessing (blur, heat haze) are just special lights, and fit neatly into the existing frameworkHargreaves04, Simplifies rendering of multiple special effectsPlaceres06, G-Buffer already produces data required for post-processingThibieroz11
- Simple engine managementHargreavesHarris04Shishkovtsov05
- Lots of small lights ~ one big lightHargreavesHarris04
- Forward can do it too!OlssonBilleterAssarsson13
- Reduces CPU usageShishkovtsov05
- Lighting costs are independent of scene complexityKoonce07Stewart15, Adding more layers of effects generally results in a linear, fixed cost per frame for additional full-screen post-processing passes regardless of the number of models on screenFilionMcNaughton08EngelShaderX709EngelSiggraph09Kaplanyan10KnightRitchieParrish11Thibieroz11
- No additional render passes on geometry for lighting, resulting in fewer draw calls and fewer state changes required to render the sceneKoonce07EngelSiggraph09Thibieroz11, Less draw calls, less shader permutations, one or few lighting shaders that can be hand-optimized wellPesce14Schulz14
- Material shaders do not perform lighting, freeing up instructions for additional geometry processingKoonce07
- Simpler shadersValient07
- More complex materials can be implementedLee09
- Not all buffers need to be updated with matching data, e.g., decal tricks
- Faster lightingKnightRitchieParrish11
- Decouples texturing from lightingPesce14Stewart15
- Potentially can be faster on complex shadersPesce14
- Allows volumetric or multipass decals (and special effects) on the G-Buffer (without computing the lighting twice)Pesce14
- Allows full-screen material passes like analytic geometric specular antialiasing (pre-filtering), which really only works when done on the G-BufferPesce14
- Fails in forward on all hard edges (split normals), and screen-space subsurface scatteringPesce14
- Trivial light managementOlsson15
- Enables many lights
- Simple (light) shader managementOlsson15
- Shadow map reuseOlsson15
- Disadvantages:
- Large frame-buffer sizeCalver03, Framebuffer bandwidth can easily get out of handHargreaves04Placeres06EngelSiggraph09Kaplanyan10Thibieroz11OlssonBilleterAssarsson13Pesce14Olsson15
- Potentially high fill-rateCalver03Placeres06Kaplanyan10Lauritzen10OlssonBilleterAssarsson13StewartThomas13Pesce14Olsson15Arntzen20
- Reading lighting inputs from G-Buffer is an overheadLauritzen10
- Accumulating lighting with additive blending is an overheadLauritzen10
- Requires high precisionOlsson15
- Multiple light equations difficultCalver03, Forces a single lighting model across the entire scene (everything has to be 100% per-pixel)Hargreaves04
- High hardware specificationsCalver03
- Transparency is very hardCalver03, Alpha blending is a nightmare!Hargreaves04Placeres06Valient07Kaplanyan10OlssonAssarsson11, Forward rendering required for translucent objectsThibieroz11OlssonBilleterAssarsson13Pesce14Olsson15Arntzen20
- If a tiled or clustered deferred is used, the light information can be passed to a forward+ pass for transparencies
- Can’t take advantage of hardware multisamplingHargreaves04, AA is problematicHargreavesHarris04Placeres06, MSAA difficult compared to Forward RendererEngelSiggraph09Kaplanyan10OlssonAssarsson11, Costly and complex MSAAThibieroz11StewartThomas13
- MYTH!! MSAA did not prove to be an issue!!Valient07
- Existing multi-sampling techniques are too heavy for deferred pipelineKaplanyan10
- Post-process antialiasing doesn’t remove aliasing completelyKaplanyan10
- Need to super-sample in most cases
- Not a good approach for older hardwareHargreaves04
- Not good when you have many directional lightsHargreavesHarris04
- Shading complexity will be O(R × L) (R = screen resolution, L = lights)
- MYTH!!Shishkovtsov05
- Recalculate full lighting equation for every lightEngelSiggraph09
- Limited material representation in G-BufferEngelSiggraph09, Limited materials variationsKaplanyan10Pesce14
- MYTH?Lee09
- Only Phong BRDF (normal + glossiness)Kaplanyan10
- No aniso materialsKaplanyan10
- Can’t do lighting computations per object/vertex (i.e. GI), needs to pass everything per pixel in the G-BufferPesce14
- Alternative: store baked data in a voxel structure
- Accessing lighting related textures (gobos, cubemaps) might be less cache-coherentPesce14
- All lights (that cast shadows) must have their shadow maps built before the shading passOlssonAssarsson11
- Significant engine reworkThibieroz11
- In general it has lots of enticing benefits over forward, and it -might- be faster in complex lighting / material / decal scenarios, but the baseline simple lighting/shading case is much more expensivePesce14
- Difficult to do multiple shading modelsOlsson15
- Custom shaders
- No forward shading supportArntzen20
- No volumetric lightingArntzen20
For each object:
Render to multiple targets
For each light:
Apply light as a 2D postprocess
For each object:
Render surface properties into the G-Buffer
For each light and lit pixel
Use G-Buffer to compute lighting
Add result to frame buffer
Traditional deferred shading:Andersson09
- Graphics pipeline rasterizes gbuffer for opaque surfaces
- Normal, albedos, roughness, etc.
- Render scene geometry into G-Buffer MRTStewartThomas13
- Store material properties (albedo, specular, normal, etc.)
- Write to depth buffer as normal
- Light sources are rendered & accumulate lighting to a texture (accumulation buffer)StewartThomas13
- Light volume or screen-space tile rendering
- Use G-Buffer RTs as inputsStewartThomas13
- Render geometries enclosing light areaStewartThomas13
- Combine shading & lighting for final output
Modern Deferred Shading:Olsson15
- Render Scene to G-Buffers
- Light Assignment
- Build Light Acceleration Structure (Grid)
- Full Screen Pass
- Quad (or CUDA, or Compute Shaders, or SPUs)
- For each pixel
- Fetch G-Buffer Data
- Look up light list in acceleration structure
- Loop over lights and accumulate shading
- Write shading
- Worst case complexity is num_objects + num_lightsHargreaves04
- Perfect batchingHargreaves04
- Many small lights are just as cheap as a few big onesHargreaves04
- On an MMO, given the lack of control of the game environment and the poor scalability of lighting costs within a forward renderer, a deferred-shading renderer is preferableKoonce07
- Object rendering pass saves all surface parametersKimBarrero11
- Lighting pass saves lighting resultKimBarrero11
- Combiner pass combines lighting result + surface material in screen spaceKimBarrero11
G-Buffers
G-Buffers are 2D images that store geometric details in a texture, storing positions, normals and other details at every pixel. The key ingredient to hardware acceleration of G-Buffers is having the precision to store and process data such as position on a per-pixel basis. The higher precision we have to store the G-Buffer at, the slower the hardware renders.Calver03
Thin G-Buffer
The smaller the better!Kaplanyan10
- Crysis 3:SousaWenzelRaine13
- Minimize redundant drawcalls
- Alpha-blended (AB) details on G-Buffer with proper glossiness
- Tons of vegetation → Deferred translucency
- Multiplatform friendly
G-Buffer encoding requirements:Pesce15
- Fast when implemented in a shader
- As compact as possible
- Makes sense under linear interpolation (hardware “blendable”, for pixel-shader based decals)
- As stable as possible, and secondarily as precise as possible
Advantages:
- Unified solution across all platformsSousaWenzelRaine13
- Deferred Rendering for less BW/Memory than vanillaSousaWenzelRaine13
- Good for MSAA + avoiding tiled rendering on Xbox360
- Tackle glossiness for transparent geometry on G-Buffer
- Alpha blended cases, e.g. Decals, Deferred Decals, Terrain Layers
- Can composite all such cases directly into G-Buffer
- Avoid need for multipass
- Deferred sub-surface scattering
- Visual + performance win, in particular for vegetation rendering
What to Store?
Depth
Calver03Hargreaves04HargreavesHarris04Thibieroz04Placeres06FilionMcNaughton08EngelShaderX709EngelSiggraph09Lee09LobanchikovGruen09Kaplanyan10KnightRitchieParrish11Thibieroz11Moradin19Huseyin20Pesce20
Use depth data to reconstruct position data. Provided by the depth buffer.
Format Suggestion:
- 24bppKaplanyan10
- D32Huseyin20
- Reversed-Z
- In GBuffer,
G_Buffer.z = length(Input.PosInViewSpace);
- In VS,
out.vEyeToScreen = float3(Input.ScreenPos.x * ViewAspect, Input.ScreenPos.y, invTanHalfFOV);
- In PS,
float3 PixelPos = normalize(Input.vEyeToScreen) * G_Buffer.z;
Placeres06
float4 vViewPos;
vViewPos.xy = ((INTERPOLANT_VPOS * half2(2.0f, -2.0f) + half2(-1.0f, 1.0f)) * 0.5f) * p_vCameraNearSize * p_vRecipRenderTargetSize;
vViewPos.zw = half2(1.0f, 1.0f);
vViewPos.xyz = vViewPos.xyz * fSampledDepth;
float3 vWorldPos = mul(p_mInvViewTransform, vViewPos).xyz;
// input SV_POSITION as pos2d
New_pos2d = ((pos2d.xy) * (2 / screenres.xy)) - float2(1, 1);
viewSpacePos.x = gbuffer_depth * tan(90 - HORZFOV/2) * New_pos2d.x;
viewSpacePos.y = gbuffer_depth * tan(90 - VERTFOV/2) * New_pos2d.y;
viewSpacePos.z = gbuffer_depth;
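For reference, a more generic reconstruction (commonly used alongside the tricks above) transforms the NDC position by the inverse projection matrix; a sketch assuming D3D conventions, a hypothetical constant buffer and column-vector mul order:
Texture2D<float> g_depthTex : register(t0);

cbuffer ReconstructConstants : register(b0)
{
    float4x4 g_invProj;       // inverse of the projection matrix
    float2   g_invScreenSize; // 1.0 / render target resolution
};

float3 ReconstructViewPos(uint2 pixelCoord)
{
    float depth = g_depthTex.Load(int3(pixelCoord, 0));
    // Pixel center -> NDC (Y flipped for D3D conventions).
    float2 uv  = (pixelCoord + 0.5) * g_invScreenSize;
    float2 ndc = float2(uv.x * 2.0 - 1.0, 1.0 - uv.y * 2.0);
    float4 clip = float4(ndc, depth, 1.0);
    // Assumes mul(matrix, vector) column-vector convention.
    float4 view = mul(g_invProj, clip);
    return view.xyz / view.w;
}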
Stencil
Format Suggestion:
- 8bppHuseyin20
Stencil to mark objects in lighting groupsKaplanyan10
- Portals / indoors
- Custom environment reflections
- Different ambient and indirect lighting
Normal
Calver03Hargreaves04HargreavesHarris04Thibieroz04Placeres06Andersson09EngelShaderX709EngelSiggraph09Lee09LobanchikovGruen09Kaplanyan10KnightRitchieParrish11Thibieroz11Huseyin20Pesce20
Format Suggestions:
- R10G10B10A2_FLOATHargreaves04Pesce20
- 2-bit alpha reserved to mark hairPesce20
- U10V10W10A2Thibieroz04
- U8V8W8Q8Thibieroz04
- 24bppKaplanyan10
- Too quantized
- Lighting is banded / of low quality
- RGBA8_UNORMHuseyin20
Considerations:
- Model space vs Tangent spaceThibieroz04
Optimizations:
- Reconstruct z from xy (z = sqrt(1 - x² - y²))Hargreaves04HargreavesHarris04Placeres06
- If all the lighting is performed in view space, then the front-facing polygons are always going to have negative or positive Z componentsPlaceres06
Packing:
float2 pack_normal(float3 norm)
{
    float2 res;
    res = 0.5 * (norm.xy + float2(1, 1));
    res.x *= (norm.z < 0 ? -1.0 : 1.0);
    return res;
}
Unpacking:
float3 unpack_normal(float2 norm)
{
    float3 res;
    res.xy = (2.0 * abs(norm)) - float2(1, 1);
    res.z = (norm.x < 0 ? -1.0 : 1.0) * sqrt(abs(1 - res.x * res.x - res.y * res.y));
    return res;
}
Crytek:
- Because we are storing normalized normals, we are wasting 24bpp.Kaplanyan10
- Create a cube of 256³ values, and find the quantized value with the minimal error for a ray. Bake this into a cubemap of results.Kaplanyan10
- Extract the most meaningful and unique part of this symmetric cubemap
- Save into 2D texture
- Look it up during G-Buffer generation
- Scale the normal
- Output the adjusted normal into G-Buffer
- However, not “blendable”Pesce15
Baseline: XYZ
- Store all three components of the normalPranckevicius09
// Encoding
half4 encode(half3 n, float3 view)
{
    return half4(n.xyz * 0.5 + 0.5, 0);
}
// Decoding
half3 decode(half4 enc, float3 view)
{
    return enc.xyz * 2.0 - 1.0;
}
Octahedral Normal VectorsCigolleDonowEvangelakosMaraMcGuireMeyer14
Map the sphere to an octahedron, project down into the z = 0 plane, and then reflect the -z hemisphere over the appropriate diagonal.
// float3 to oct
// returns ±1
float2 signNotZero(float2 v)
{
    return float2((v.x >= 0.0) ? +1.0 : -1.0, (v.y >= 0.0) ? +1.0 : -1.0);
}
// assume normalized input. output is on [-1, 1] for each component
float2 float3ToOct(float3 v)
{
    // project the sphere onto the octahedron, and then onto the xy plane
    float2 p = v.xy * (1.0 / (abs(v.x) + abs(v.y) + abs(v.z)));
    // reflect the folds of the lower hemisphere over the diagonals
    return (v.z <= 0.0) ? ((1.0 - abs(p.yx)) * signNotZero(p)) : p;
}
// oct to float3
float3 octToFloat3(float2 e)
{
    float3 v = float3(e.xy, 1.0 - abs(e.x) - abs(e.y));
    if (v.z < 0)
    {
        v.xy = (1.0 - abs(v.yx)) * signNotZero(v.xy);
    }
    return normalize(v);
}
Diffuse Albedo
Calver03Hargreaves04HargreavesHarris04Thibieroz04Andersson09EngelShaderX709EngelSiggraph09Lee09LobanchikovGruen09KnightRitchieParrish11Thibieroz11Moradin19Huseyin20Pesce20
Format Suggestions:
- R8G8B8A8Hargreaves04Thibieroz04
- RGBA8_SRGBHuseyin20
- R10G10B10A2Pesce20
Etc.
- Specular / Exponent MapCalver03HargreavesHarris04
- EmissiveCalver03HargreavesHarris04Pesce20
- R8Pesce20
- Light MapHargreavesHarris04Lee09
- Material IDCalver03HargreavesHarris04LobanchikovGruen09
- RoughnessAndersson09Moradin19Pesce20
- R8Pesce20
- AOLobanchikovGruen09Moradin19
- GlossinessLee09LobanchikovGruen09Kaplanyan10
- 8bppKaplanyan10
- Non deferrableKaplanyan10
- Required at lighting accumulation pass
- Specular is non-accumulative otherwise
- Specular PowerEngelShaderX709EngelSiggraph09Lee09
- Motion VectorEngelShaderX709EngelSiggraph09
- Velocity MaskMoradin19
- ShadowEngelShaderX709EngelSiggraph09
- Specular TermsThibieroz11
- Sky MaskMoradin19
- Vertex NormalMoradin19
- MetalnessMoradin19Huseyin20Pesce20
- Reflectance (f0)Huseyin20
- R8 UNORMHuseyin20
- SmoothnessHuseyin20
- R8 UNORM
- TranslucencyPesce20
- R8
Examples
Example 1: Beyond3DCalver03
MRTs | R | G | B | A |
RT 0 | Pos.X | Pos.Y | Pos.Z | ID |
RT 1 | Norm.X | Norm.Y | Norm.Z | Material ID |
RT 2 | Diffuse Albedo.R | Diffuse Albedo.G | Diffuse Albedo.B | Diffuse Term |
RT 3 | Specular Emissive.R | Specular Emissive.G | Specular Emissive.B | Specular Term |
Material Lookup texture |
---|
Kspecblend |
KAmb |
KEmm |
… |
Example 2: Climax Studios GDC 2004 Hargreaves04
MRTs | R | G | B | A |
DS | Depth R32F | |||
RT 0 | Norm.X R10F | Norm.Y G10F | Norm.Z B10F | Scattering A2F |
RT 1 | Diffuse Albedo.R R8 | Diffuse Albedo.G G8 | Diffuse Albedo.B B8 | Emissive Term A8 |
RT 2 (could be palettized) | Material Parameters R8 | Material Parameters G8 | Material Parameters B8 | Material Parameters A8 |
Example 3: ShaderX2Thibieroz04
MRTs | R8 | G8 | B8 | A8 |
RT 0 | Pos.X R16F | Pos.Y G16F | ||
RT 1 | Pos.Z R16F | |||
RT 2 | Diffuse Albedo.R R8 | Diffuse Albedo.G G8 | Diffuse Albedo.B B8 | Normal.Z A8 |
RT 3 | Normal.X A8 | Normal.Y L8 |
Example 4: Killzone 2Valient07
MRTs | R8 | G8 | B8 | A8 |
DS | Depth 24bpp | Stencil | ||
RT 0 | Lighting Accumulation.R | Lighting Accumulation.G | Lighting Accumulation.B | Intensity |
RT 1 | Normal.X FP16 | Normal.Y FP16 | ||
RT 2 | Motion Vectors XY | Spec-Power | Spec-Intensity | |
RT 3 | Diffuse Albedo.R R8 | Diffuse Albedo.G G8 | Diffuse Albedo.B B8 | Sun-Occlusion A8 |
- Position computed from depth buffer and pixel coordinates
- Lighting accumulation - output buffer
- Intensity - luminance of Lighting accumulation
- Scaled to range [0…2]
- Normal.z = sqrt(1.0f - Normal.x² - Normal.y²)
- Motion vectors - screen space
- Specular power - stored as log2(original) / 10.5 (see the sketch after this list)
- High range and still high precision for low shininess
- Sun Occlusion - pre-rendered static sun shadows
- Mixed with real-time sun shadow for higher quality
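The spec-power packing written out as shader code (a direct transcription of the formula above; function names are illustrative):
// Store: high range with good precision for low shininess.
float EncodeSpecPower(float specPower) { return log2(specPower) / 10.5; }
// Fetch:
float DecodeSpecPower(float encoded)   { return exp2(encoded * 10.5); }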
Analysis:
- Pros:
- Highly packed data structure
- Many extra attributes
- Allows MSAA with hardware support
- Cons:
- Limited output precision and dynamic range
- Lighting accumulation in gamma space
- Can use different color space (LogLuv)
- Attribute packing and unpacking overhead
Example 5: StarCraft IIFilionMcNaughton08
MRTs | R | G | B | A |
RT 0 | Unlit & Emissive R16G16B16F | Unused | ||
RT 1 | Normal R16G16B16F | Depth | ||
RT 2 | Diffuse Albedo.R | Diffuse Albedo.G | Diffuse Albedo.B | AO |
RT 3 | Specular Albedo.R | Specular Albedo.G | Specular Albedo.B | Unused |
- Depth values for lighting, fog volumes, dynamic AO, smart displacement, DoF, projections, edge detection, thickness measurement
- Normals for dynamic AO
- Diffuse and specular for lighting
Example 6: S.T.A.L.K.E.R.: Clear SkyLobanchikovGruen09
S.T.A.L.K.E.R. originally used a 3-RT G-Buffer:
- 3D Pos + material ID (RGBA16F RT0)
- Normal + AO (RGBA16F RT1)
- Color + Gloss (RGBA8 RT2)
S.T.A.L.K.E.R.: Clear Sky:
- Normal + Depth + Material ID + AO (RGBA16F RT0)
- Pack AO and material ID into the usable bits of the last 16-bit fp channel of RT0
- Pack data into a 32-bit uint as a bit pattern that is a valid 16-bit fp number
- Cast the uint to float using asfloat()
- Cast back for unpacking using asuint()
- Extract bits
- Color + Gloss (RGBA8 RT1)
- Trade packing math vs. fewer G-Buffer texture ops
Example 7: Split/SecondKnightRitchieParrish11
MRTs | R | G | B | A |
RT 0 | Diffuse Albedo.R | Diffuse Albedo.G | Diffuse Albedo.B | Specular amount |
RT 1 | Normal.X | Normal.Y | Normal.Z | Motion ID + MSAA edge |
RT 3 | Prelit.R | Prelit.G | Prelit.B | Specular power |
Example 8: Crysis 3SousaWenzelRaine13
MRTs | R | G | B | A |
DS | Depth D24 | AmbID, Decals S8 | ||
RT 0 | Normal.X R8 | Normal.Y G8 | Gloss, Z Sign B8 | Translucency A8 |
RT 1 | Diffuse Albedo.Y R8 | Diffuse Albedo.Cb, .Cr G8 | Specular Y B8 | Per-Project A8 |
- WS Normal packed into 2 components
- Stereographic projection worked ok in practice (also cheap)
- (X, Y) = (x / (1 - z), y / (1 - z))
- (x, y, z) = (2X / (1 + X² + Y²), 2Y / (1 + X² + Y²), (-1 + X² + Y²) / (1 + X² + Y²))
- Glossiness + Normal Z Sign packed together
- GlossZsign = (Gloss * Zsign) * 0.5 + 0.5
- Albedo in Y’CbCr color space
- Y’ = 0.299 × R + 0.587 × G + 0.114 × B
- CB = 0.5 + (-0.168 × R - 0.331 × G + 0.5 × B)
- CR = 0.5 + (0.5 × R - 0.418 × G - 0.081 × B)
- R = Y’ + 1.402 × (CR - 0.5)
- G = Y’ - 0.344 × (CB - 0.5) - 0.714 × (CR - 0.5)
- B = Y’ - 1.772 × (CB - 0.5)
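The Y'CbCr conversion above, transcribed directly into shader code (illustrative function names):
// Direct transcription of the Y'CbCr formulas above.
float3 RGBToYCbCr(float3 rgb)
{
    float y  = 0.299 * rgb.r + 0.587 * rgb.g + 0.114 * rgb.b;
    float cb = 0.5 + (-0.168 * rgb.r - 0.331 * rgb.g + 0.5 * rgb.b);
    float cr = 0.5 + (0.5 * rgb.r - 0.418 * rgb.g - 0.081 * rgb.b);
    return float3(y, cb, cr);
}

float3 YCbCrToRGB(float3 ycbcr)
{
    float y  = ycbcr.x;
    float cb = ycbcr.y;
    float cr = ycbcr.z;
    return float3(y + 1.402 * (cr - 0.5),
                  y - 0.344 * (cb - 0.5) - 0.714 * (cr - 0.5),
                  y - 1.772 * (cb - 0.5));
}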
Example 9: DestinyTatarchukTchouVenzon13
MRTs | R | G | B | A |
RT 0 | Diffuse Albedo.R R8 | Diffuse Albedo.G G8 | Diffuse Albedo.B B8 | AO A8 |
RT 1 | Normal.X * (Biased Specular Smoothness) R8 | Normal.Y * (Biased Specular Smoothness) G8 | Normal.Z * (Biased Specular Smoothness) B8 | Material ID A8 |
DS | Depth D24 | Stencil S8 |
Example 10: inFAMOUS: Second SonBentley14
MRTs | R | G | B | A |
RT 0 | Diffuse Albedo.R R8 | Diffuse Albedo.G G8 | Diffuse Albedo.B B8 | Shadow Refr A8 |
RT 1 | Normal.α R16 | Normal.β G16 | Vertex Normal.α B16 | Vertex Normal.β A16 |
RT 2 | Sun Shadow R8 | AO G8 | Spec Occl B8 | Gloss A8 |
RT 3 | Wetness Params RGBA8 | | | |
RT 4 | Ambient Diffuse.R R16F | Ambient Diffuse.G G16F | Ambient Diffuse.B B16F | Amb Atten A16F |
RT 5 | Emissive.R R16F | Emissive.G G16F | Emissive.B B16F | Alpha A16F |
D32f | Depth D24 | |||
S8 | Stencil S8 |
Example 11: RyzeSchulz14
MRTs | R | G | B | A |
RT 0 | Normal.X R8 | Normal.Y G8 | Normal.Z B8 | Translucency Luminance / Prebaked AO Term A8 |
RT 1 | Diffuse Albedo.R R8 | Diffuse Albedo.G G8 | Diffuse Albedo.B B8 | Subsurface Scattering Profile A8 |
RT 2 | Roughness R8 | <td colspan="3"style="background-color:rgba(127, 255, 255, 0.5); color:black">Specular YCbCr / Transmittance CbCr GBA8</td>
- Normals encoded using BFN approach to avoid 8 bit precision issues
- Specular color stored as YCbCr to better support blending to GBuffer (e.g. decals)
- Allow blending of non-metal decals despite not being able to write alpha during blend ops
- Can still break when blending colored specular (rare case that was avoided on art side)
- Specular chrominance aliased with transmittance luminance
- Exploiting mutual exclusivity: colored specular just for metal, translucency just for dielectrics
- Support for prebaked AO value but was just used rarely in the end
Example 12: Uncharted 4ElGarawany16
- 16 bits-per-pixel unsigned buffers
- Constantly moving bits around between features during production
- Lots of visual tests to determine exactly how many bits were needed for the various features
- Heavy use of GCN parameter packing intrinsics
Channels | G-Buffer 0 | Channels | G-Buffer 1 |
---|---|---|---|
R | r g | R | ambientTranslucency sunShadowHigh specOcclusion |
G | b spec | G | heightmapShadowing sunShadowLow metallic |
B | normalx normaly | B | dominantDirectionX dominantDirectionY |
A | iblUseParent normalExtra roughness | A | ao extraMaterialMask sheen thinWallTranslucency |
- A third optional G-Buffer is used by more complicated materials
- Interpreted differently based on the type of the material
- Fabric, hair, skin, silk, etc.
Example 13: Jurassic World: EvolutionTheCodeCorsairJWE21
- Tiled Forward Lighting
- 8 × 8 pixel tiles extruded towards the far plane to create subfrustums
- CS is dispatched per tile
- Depth Prepass
- Thin GBuffer
MRTs | R | G | B | A |
RT 0 | Normal.X R | Normal.Y G | Normal.Z B | Roughness A |
RT 1 | Motion Vectors |
Example 14: Mafia: Definitive EditionTheCodeCorsairMDE21
MRTs | R | G | B | A |
RT 0 | Normal.X R16F | Normal.Y G16F | Normal.Z B16F | Roughness A16F |
RT 1 | Diffuse Albedo.R R8 | Diffuse Albedo.G G8 | Diffuse Albedo.B B8 | Metalness A8 |
RT 2 | Motion Vectors RGB16U | Encoded Vertex Normal A16U | ||
RT 3 | Specular Intensity R8 | 0.5 G8 | Curvature or Thickness (for SSS) B8 | SSS Profile A8 |
RT 4 | Emissive.R R11F | Emissive.G G11F | Emissive.B B10F |
Example 15: Digital Combat SimulatorPoulet21
- Five R8G8_UNORM layers with MSAA activated
- Normal using a basic encoding scheme
- Store X and Y components and reconstruct the Z
- Albedo is stored across three channels encoded using YUV
- First channel of the second layer contains the Y
- First and second channels store the U and V
- Roughness in the first channel, metalness in the second channel
- Precomputed AO provided by texture in the first channel
- Normal encoding example from SSAO:
ld_ms(texture2dmsarray)(float,float,float,float) r1.zw, r5.xyww, GBufferMap.zwxy, l(0)
ld_ms(texture2dmsarray)(float,float,float,float) r0.w, r5.xyzw, GBufferMap.yzwx, l(0)
mad r1.zw, r1.zzzw, l(0.0000, 0.0000, 2.0000, 2.0000), l(0.0000, 0.0000, -1.0000, -1.0000)
add r5.x, r1.w, r1.z
add r5.z, -r1.w, r1.z
mul r5.xz, r5.xxzx, l(0.5000, 0.0000, 0.5000, 0.0000)
add r1.z, abs(r5.z), abs(r5.x)
add r5.y, r1.z, l(-1.0000)
dp3 r1.z, r5.xyzx, r5.xyzx
rsq r1.z, r1.z
mul r5.xyz, r1.zzzz, r5.xyzx
ge r0.w, l(0.5000), r0.w
movc r5.w, r0.w, r5.y, -r5.y
Example 16: UnityLagardeGolubev18
MRTs | R | G | B | A |
RT 0 (sRGB) | BaseColor.R R8 | BaseColor.G G8 | BaseColor.B B8 | Specular Occlusion A8 |
RT 1 | Normal.xy (Octahedral 12/12) RGB8 | Perceptual Smoothness A8 | ||
RT 2 | Material Data RGB8 | FeaturesMask(3) / Material Data A8 | ||
RT 3 | Static diffuse lighting R11G11B10F | |||
RT 4 (Optional) | Extra specular occlusion data RG8 | Ambient Occlusion B8 | Light Layering Mask | |
RT 5 (Optional) | 4 Shadow Masks RGBA8 |
Overview
- Don’t bother with any lighting while drawing scene geometryHargreaves04
- Render to a “fat” framebuffer format, using MRT to store dataHargreaves04
- Drawback of fat-format encoding is the reading speedShishkovtsov05
- Apply lighting as a 2D postprocess, using these buffers as inputHargreaves04
Example Passes
Example 1: UnityLagardeGolubev18
Opaque Material Render Pass
- Depth Prepass
- GBuffer
- Tag stencil for regular lighting or split lighting
- Render Shadow
- Async Light list generation + Light/Material classification
- Async SSAO (Use Normal buffer)
- Async SSR (Use Normal buffer)
- Deferred directional cascade shadow
- (Use Normal buffer for normal shadow bias)
- Tile deferred lighting
- Indirect dispatch for each shader variants
- Read stencil
- No lighting: skip forward material and sky
- Regular lighting: output lighting
- Split lighting: separate diffuse and specular
- Read stencil
- Indirect dispatch for each shader variants
- Forward Opaque
- (Optional) Output BaseColor + Diffusion Profile
- (Optional) Output + Tag stencil for split lighting
- SS Subsurface Scattering
- Test stencil for split lighting
- Combine lighting
Geometry Phase
Each geometry shader is responsible for filling the G-Buffers with correct parameters.Calver03
The major advantage over the conventional real-time approach to Renderman style procedural textures is that the entire shader is devoted to generating output parameters and that it is run only once regardless of the number or types of lights affecting this surface (generating depth maps also requires the geometry shaders to be run but usually with much simpler functions).Calver03
Another advantage is that after this phase how the G-Buffer was filled is irrelevant, this allows for impostors and particles to be mixed in with normal surfaces and be treated in the same manner (lighting, fog, etc.).Calver03
Some portions of the light equation that stay constant can be computed here and stored in the G-Buffer if necessary; this can be used if your light model uses Fresnel (which is usually based only on the surface normal and view direction).Calver03
Killzone 2Valient07
Fill the G-Buffer with all geometry (static, skinned, etc.)
Write depth, motion, specular, etc. properties
Initialize light accumulation buffer with pre-baked light
Ambient, Incandescence, Constant specular
Lightmaps on static geometry
YUV color space, S3TC5 with Y in Alpha
Sun occlusion in B channel
Dynamic range [0...2]
Image based lighting on dynamic geometry
Optimizations
Export Cost
- Render objects in front-to-back orderThibieroz11
- Use fewer render targets in your MRT configThibieroz11
- Less fetches during shading passes
- Less memory usage
- Avoid slow formatsThibieroz11
- Data PackingThibieroz11
- Trade render target storage for a few extra ALU instructions
Light Accumulation PassValient07
- Light is rendered as convex geometry
- Point light - sphere
- Spot light - cone
- Sun - full-screen quad
For each light:
Find and mark visible lit pixels
If light contributes to screen
Render shadow map
Shade lit pixels and add to framebuffer
Lighting Phase
The real power of deferred lighting is that lights are first-class citizens; this complete separation of lighting and geometry allows lights to be treated in a totally different way from standard rendering. This makes the artist's job easier as there are fewer restrictions on how lights affect surfaces, allowing for easily customizable lighting rigs.Calver03
Light shaders have access to the parameters stored in the G-Buffer at each pixel they light.Calver03
Add lighting contributions into accumulation bufferThibieroz11
- Use G-Buffer RTs as inputs
- Render geometries enclosing light area
Render convex bounding geometry
Read G-Buffer
Compute radiance
Blend into frame buffer
- Keep diffuse and specular separate
For each light:
diffuse += diffuse(GBuffer.N, L)
specular += GBuffer.spec * specular(GBuffer.N, GBuffer.P, L)
- Final full-screen pass modulates diffuse color:
framebuffer = diffuse * GBuffer.diffuse + specular
Per-Sample Pixel Shader Execution:Thibieroz09
struct PS_INPUT_EDGE_SAMPLE
{
    float4 Pos : SV_POSITION;
    uint uSample : SV_SAMPLEINDEX;
};
// Multisampled G-Buffer textures declaration
Texture2DMS<float4, NUM_SAMPLES> txMRT0;
Texture2DMS<float4, NUM_SAMPLES> txMRT1;
Texture2DMS<float4, NUM_SAMPLES> txMRT2;
// Pixel shader for shading pass of edge samples in DX10.1
// This shader is run at sample frequency
// Used with the following depth-stencil state values so that only
// samples belonging to edge pixels are rendered, as detected in
// the previous stencil pass.
// StencilEnable = TRUE
// StencilReadMask = 0x80
// Front/BackFaceStencilFail = Keep
// Front/BackfaceStencilDepthFail = Keep
// Front/BackfaceStencilPass = Keep
// Front/BackfaceStencilFunc = Equal
// The stencil reference value is set to 0x80
float4 PSLightPass_EdgeSampleOnly(PS_INPUT_EDGE_SAMPLE input) : SV_TARGET
{
    // Convert screen coordinates to integer
    int2 nScreenCoordinates = int2(input.Pos.xy);
    // Sample G-Buffer textures for current sample
    float4 MRT0 = txMRT0.Load(nScreenCoordinates, input.uSample);
    float4 MRT1 = txMRT1.Load(nScreenCoordinates, input.uSample);
    float4 MRT2 = txMRT2.Load(nScreenCoordinates, input.uSample);
    // Apply light equation to this sample
    float4 vColor = LightEquation(MRT0, MRT1, MRT2);
    // Return calculated sample color
    return vColor;
}
Conventional Deferred ShadingLauritzen10:
- For each light
- Use rasterizer to scatter light volume and cull
- Read lighting inputs from G-Buffer
- Compute lighting
- Accumulate lighting with additive blending
- Reorders computation to extract coherence
Modern ImplementationLauritzen10:
- Cull with screen-aligned quads
- Cover light extents with axis-aligned bounding box
- Full light meshes (spheres, cones) are generally overkill
- Can use oriented bounding box for narrow spot lights
- Use conservative single-direction depth test
- Two-pass stencil is more expensive than it is worth
- Depth bounds test on some hardware, but not batch-friendly
- Cover light extents with axis-aligned bounding box
for each G-Buffer sample
{
sampleAttr = load attributes from G-Buffer
for each light
{
color += shade(sampleAttr, light)
}
output pixel color;
}
uniform vec3 lightPosition;
uniform vec3 lightColor;
uniform float lightRange;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    vec3 color = texelFetch(colorTex, coord, 0).rgb;
    vec3 specular = texelFetch(specularTex, coord, 0).rgb;
    vec3 normal = texelFetch(normalTex, coord, 0).rgb;
    vec3 position = fetchPosition(coord);
    vec3 shading = doLight(position, normal, color,
                           specular, lightPosition,
                           lightColor, lightRange);
    resultColor = vec4(shading, 1.0);
}
Red Dead Redemption 2:Huseyin20
- Global light pass
- Fullscreen quad
- Local light pass
- Low-poly sphere shape for point light volumes
- Octahedron-like shape for spotlight volumes
- Rendered back-to-front with additive blending
Plus (+) Methods: Algorithm Steps:Drobot17
- List of rendering entities
- Spatial acceleration structure with culled entity lists
- Execution algorithm per sampling point
- Traverse acceleration structure
- Iterate over existing entities
- Aka Tiled / Clustered Forward+ / Deferred+
Lighting Optimizations:LagardeGolubev18
- Focus on removing false positives
- Ex: narrow shadow casting spot lights
- False positives are more expensive in lighting pass
- Light culling execute async during shadow rendering
- List building work is absorbed by leveraging asynchronous compute
- Deferred lighting pass is not running async
- Final lighting shader has higher loop complexity and greater register pressure
- Move cost where it can be hidden
- High register pressure in lighting pass
- Light culling execute async during shadow rendering
- Hierarchical approach:
- Find screen-space AABB for each visible light
- Big tile 64 × 64 prepass
- Coarse intersection test
- Build Tile or Cluster Light list
- Narrow intersection test
- Tile:
- Based on Fine Prune Tile Lighting (FPTL)
- Build FPTL light list for 16 × 16 tiles
- Fine pruning: test if any depth pixel is in volume
- Aggressive removal of false positives
- One light list per tile. Allows attribute to be read into scalar registers
- Cluster:
- 32 × 32 with 64 clusters
- Use geometric series for cluster position and size
- Half of the clusters (32) cover the range between the near plane and the max per-tile depth
- Good resolution in visible range
- Permit queries behind max per tile depth
- Particles, volume, FX
Bandwidth ProblemOlsson15
- New type of overdraw
- Light overdraw
- N lights cover a certain pixel
- N reads from the same G-Buffer location
for each light
for each covered pixel
read G-Buffer // repeated reads
compute shading
read + write frame buffer // repeated reads and writes
- Re-write loop!
for each pixel
read G-Buffer
for each affecting light
compute shading
write frame buffer
- Modern shading solution:
for each pixel
read G-Buffer
for each possibly affecting light
if affecting
compute shading
write frame buffer
- Share between groups of similar pixels
- Lots of coherency between samples
- Coherent access
- Little storage
- Conservative lists
Pre-Tiled Shading
Advantages:
- Precise per-pixel light cullingZhdan16
- A lot of work is done outside of the shader
Weaknesses:
- Massive overdraw & ROP cost when having lots of big light sourcesAndersson11Zhdan16
- Expensive to have multiple per-pixel materials in light shadersAndersson11
- MSAA lighting can be slow (non-coherent, extra bandwidth)Andersson11
- Lighting is likely to become bandwidth limitedZhdan16
Full screen lights
For lights that are truly global and have no position and size (ambient and directional are the traditional types), we create a full screen quad that executes the pixel shader at every pixel.Calver03Hargreaves04
Global directional lights gain little benefit from deferred rendering methods, and it would actually be slower to resample the deferred buffers again for the entire screen.FilionMcNaughton08
Shaped lights
Shaped lights can be implemented via a full screen quad in exactly the same way as directional lights, just with a different algorithm computing the light’s direction and attenuation; the attenuation also lets us pre-calculate where the light no longer makes any contribution.Calver03
OptimizationCalver03
The attenuation model I use is a simple texture lookup based on distance. The distance is divided by the maximum distance that the light can possibly affect, and this value is then used to look up a 1D texture. The last texel should be 0 (no constant term) if the following optimisations are to be used.
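A minimal HLSL-style sketch of this lookup, assuming a 1D attenuation texture g_AttenuationTex and a clamped linear sampler (both names are assumptions):
Texture1D<float> g_AttenuationTex;
SamplerState g_LinearClampSampler;
float ComputeAttenuation(float3 worldPos, float3 lightPos, float lightMaxDistance)
{
    // Normalized distance: 1.0 exactly at the light's maximum range
    float d = distance(worldPos, lightPos) / lightMaxDistance;
    return g_AttenuationTex.Sample(g_LinearClampSampler, saturate(d));
}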
OptimizationPlaceres06
Shading only the pixels influenced by the bounding object involves rendering a full screen quad, but enabling clipping and rejection features to discard many non-influenced pixels. This requires dynamic branching.
Light Volumes
We create a mesh that encloses the light affecting volume with any pixels found to be in the interior of the volume executing the light shader.Calver03Hargreaves04
- Each pixel must be hit once and once only. If the light volume causes the light shader to be executed more than once, it will be equivalent to having n lights affecting this pixel.Calver03
- The near and far clip planes must not affect the projected shape. We need the projected geometry not to be clipped at the near and far plane, as this would cause holes in our lights.Calver03
- Spot LightHargreavesHarris04
- Cone
- Point LightHargreavesHarris04
- Sphere
- Direction LightHargreavesHarris04
- Quad or Box
For convex volumes the first problem is completely removed by just using back or front face culling.Calver03Hargreaves04
We can’t remove the near plane, but we can effectively remove the far plane by placing it at infinity.Calver03
Convex volumes cover the vast majority of lights shaders (e.g. spheres for point lights, cones for spotlights, etc.) and we can adapt them to use the fast z-reject hardware that is usually available.Calver03
Dealing with the light volume rendering:Hargreaves04
- Camera is outside the light bounding mesh
- Simple back face culling (each pixel must be hit once and once only)
- Camera is inside the light bounding mesh
- Draw backfaces
- Light volume intersects the far clip plane
- Draw frontfaces
- Light volume intersects both near and far clip planes
- Light is too big
Optimizations
S.T.A.L.K.E.R case:Shishkovtsov05
- Hierarchical occlusion culling system
- Coarsest test: Sector-portal culling followed by CPU-based occlusion culling
- DX9’s occlusion query to eliminate the completely occluded lights
- Stencil mask to tag affected pixels for each light
- Sun optimization
Pass 0: Render full-screen quad only where 0x03==stencil count
(where attributes are stored)
If ((N dot L) * ambient_occlusion_term > 0)
discard fragment
Else
color = 0, stencil = 0x01
Pass 1: Render full-screen quad only where 0x03==stencil count
Perform light accumulation / shading
- Social Stage:Placeres06
- Filter the lights and effects on the scene to produce a smaller list of sources to be processed
- Execute visibility and occlusion algorithms to discard lights whose influence is not appreciable
- Project visible sources bounding objects into screen space
- Combine similar sources that are too close in screen space or influence almost the same screen area
- Discard sources with a tiny contribution because of their projected bounding object being too small or too far
- Check that no more than a predefined number of sources affect each screen region. Choose the biggest, strongest, and closest sources.
- Individual Stage:Placeres06
- Global Sources
- Most fill-rate expensive
- Enable the appropriate shaders
- Render a quad covering the screen
- Local Sources
- Select the appropriate level of detail.
- Enable and configure the source shaders
- Compute the minimum and maximum screen coordinate values of the projected bounding object
- Enable the scissor test
- Enable the clipping planes
- Render a screen quad or the bounding object
- Only shade the area where the light volume intersects scene geometryHargreaves04; only shade surfaces inside the light volume!!HargreavesHarris04
- Light volume Z Tests
- Drawing light volume backfaces
- Use D3DCMP_GREATER to reject “floating in the air” portions of the light
- Drawing frontfaces
- Use D3DCMP_LESS to reject “buried underground” light regions
- Tabula Rasa uses “greater” depth test and “clockwise” winding (inverted winding)Koonce07
- Light volumes in Tabula Rasa never get clipped by the far clip plane
- StarCraft II case:FilionMcNaughton08
- Early-Z to reduce cost of hidden lights
- Early stencil to reduce cost of objects behind light that are not lit by it
Other optimizations:
- ClippingHargreavesHarris04
- Occlusion queryHargreavesHarris04
- Z-CullHargreavesHarris04
Stencil Cull
- Render light volume with color write disabledHargreavesHarris04
- Depth Func = LESS, Stencil Func = ALWAYS
- Stencil Z-FAIL = REPLACE (with value X)
- Rest of stencil ops set to KEEP
- Render with lighting shaderHargreavesHarris04
- Depth Func = ALWAYS, Stencil Func = EQUAL, all ops = KEEP, Stencil Ref = X
- Unlit pixels will be culled because stencil will not match the reference value
- Only regions that fail depth test represent objects within the light volumeHargreavesHarris04
Killzone 2 case:Valient07
- Marks pixels in front of the far light boundary
- Render back-faces of light volume
- Depth test GREATER-EQUAL
- Write to stencil on depth pass
- Skipped for very small distant lights
- Find amount of lit pixels inside the volume
- Start pixel query
- Render front faces of light volume
- Depth test LESS-EQUAL
- Don’t write anything - only EQUAL stencil test
- Rendering:
- Render front-faces of light volume
- Depth test - LESS-EQUAL
- Stencil test - EQUAL
- Runs only on marked pixels inside light
- If light is “very small”
- Don’t do any stencil marking
Light Shader Occlusion Optimisations
The basis of using occlusion culling with light shaders is that the depth buffer used for the creation of the G-Buffer is available at no cost (this is only true if the resolution of the G-Buffer is the same as destination colour buffer and that we are using the same projection matrix for the geometry shaders and light shaders).Calver03
I simply turn off the occlusion culling if the light shader hits the near plane and just render the back faces without depth testing. It means some pixels run the pixel shader unnecessarily, but it’s very cheap on the CPU and the actual difference is usually only a few pixels.Calver03
Accessing Light Properties
- Avoid using dynamic constant buffer indexing in pixel shaderThibieroz11
- Generates redundant memory operations repeated for every pixel
ex)
struct LIGHT_STRUCT
{
float4 vColor;
float4 vPos;
};
cbuffer cbPointLightArray
{
LIGHT_STRUCT g_Light[NUM_LIGHTS];
};
float4 PS_PointLight(PS_INPUT i) : SV_TARGET
{
// ...
uint uIndex = i.uPrimIndex / 2;
float4 vColor = g_Light[uIndex].vColor; // NO!
float4 vLightPos = g_Light[uIndex].vPos; // NO!
}
- Instead fetch light properties from CB in VS (or GS)
- And pass them to PS as interpolants
- No actual interpolation needed
- Use nointerpolation to reduce the number of shader instructions
PS_QUAD_INPUT VS_PointLight(VS_INPUT i)
{
PS_QUAD_INPUT Out = (PS_QUAD_INPUT)0;
// Pass position
Out.vPosition = float4(i.vNDCPosition, 1.0);
// Pass light properties to PS
uint uIndex = i.uVertexIndex / 4;
Out.vLightColor = g_Light[uIndex].vColor;
Out.vLightPos = g_Light[uIndex].vPos;
return Out;
}
struct PS_QUAD_INPUT
{
nointerpolation float4 vLightColor : LCOLOR;
nointerpolation float4 vLightPos : LPOS;
float4 vPosition : SV_POSITION;
};
Tiled Shading
Amortizes overheadLauritzen10.
- Advantages:
- Fastest and most flexibleLauritzen10Olsson15
- Enable efficient MSAALauritzen10Olsson15
- G-Buffers are read only once for each lit sampleOlssonBilleterAssarsson13
- Framebuffer is written to onceOlssonAssarsson11OlssonBilleterAssarsson13
- Common terms of the rendering equation can be factored out and computed once instead of recomputing them for each lightOlssonAssarsson11OlssonBilleterAssarsson13
- Work becomes coherent within each tileOlssonAssarsson11OlssonBilleterAssarsson13
- Each sample in a tile requires the same amount of work
- Allows for efficient implementation on SIMD-like architectures
- Low bandwidthOlsson15
- Simple light assignmentOlsson15
- Trivial light list lookupOlsson15
- High performanceOlsson15
- TransparencyOlsson15
- Constant & absolute minimal bandwithAndersson09OlssonAssarsson11
- Read gbuffers & depth once!
- Doens’t need intermediate light buffersAndersson09
- Can take a lot of memory with HDR, MSAA & color specular
- Scales up to huge amount of big overlapping light sourcesAndersson09
- Fine-grained culling (16 × 16)
- Only ALU cost, good future scaling
- Could be useful for accumulating VPLs
- Light accumulation is done in registers, at full floating point precisionOlssonAssarsson11
- Lighting phase takes all visible lights in one goZhdan16
- Disadvantages:
- Still tricky to afford many shadowed lights per pixelPesce14Olsson15, Makes dynamic shadows harderPesce14
- No shadow map reuseOlsson15
- Complex light shaderOlsson15
- View dependenceOlsson15
- Requires DX 11 HWAndersson09
- CS 4.0 / 4.1 difficult due to atomics & scattered groupshared writes
- Culling overhead for small light sourcesAndersson09
- Can accumulate them using standard light volume rendering
- Or separate CS for tile classification
- Potentially performanceAndersson09
- MSAA texture loads / UAV writing might be slower than standard PS
- Can’t output to MSAA textureAndersson09
- DX11 CS UAV limitation
- Less accurate culling with tile granularityZhdan16
- Frustum-primitive tests are either too coarse or too slowZhdan16
- Challenges:
- Frustum primitive culling not accurate, creates false positivesSchulz14Zhdan16
- Often considerably more pixels shaded than with stencil tested light volumes
- Handling light resources (all resources need to be accessible from CS)Schulz14
- Shadow maps stored in large atlas
- Diffuse and specular probe cubemaps stored in texture arrays
- Projector textures stored in texture array (have to use standardized dimensions and format)
- Keeping GPRs under controlSchulz14
- Dynamic branching for different light types
- Deep branching requires additional GPRs and lower occupancy
- Had to manually rearrange code to stay within desired GPR limit
- Divide the screen into a gridBalestraEngstad08Andersson11WhiteBarreBrisebois11OlssonBilleterAssarsson13
- (Optional) Find min / max Z-bounds for each tileOlssonBilleterAssarsson13
- Find which lights intersect each cellBalestraEngstad08Andersson11OlssonBilleterAssarsson13
- +How many lightsAndersson09
- Render quads over each cell calculating up to 8 lights per passBalestraEngstad08
- Results in a light buffer
- Only apply the visible light sources on pixels in each tileAndersson09Andersson11OlssonBilleterAssarsson13
Algorithm:OlssonAssarsson11
- Render the (opaque) geometry into the G-BuffersStewartThomas13
- Ordinary deferred geometry pass
- Store material propertiesStewartThomas13
- Provides tile depth boundsStewartThomas13
- Construct a screen space grid, covering the frame buffer, with some fixed tile size, t = (x, y), e.g. 32 × 32 pixelsWhiteBarreBrisebois11StewartThomas13
- (Optional) Find min / max Z-bounds for each tileOlssonBilleterAssarsson13
- For each light: find the screen space extents of the light volume and append the light ID to each affected grid cellOlssonBilleterAssarsson13StewartThomas13
- Find the screen space extents of the light bounding sphere and then insert the light into the covered grid cellsOlssonAssarsson11
- Store the culling results in a texture:WhiteBarreBrisebois11
- Column == Light ID
- Row == Tile ID
- Cull analytical lights (point, cone, line), per tileWhiteBarreBrisebois11StewartThomas13
- Compute shaderStewartThomas13
- Generates per-tile light listStewartThomas13
- Same compute shader then can use per-tile list to do shadingStewartThomas13
- Can sort lights by kind to reduce branchingBentley14
- For each fragment in the frame buffer, with location f = (x, y)
- Sample the G-Buffers at f
- Accumulate light contributions from all lights in tile at ⌊f /t⌋
- Output total light contributions to frame buffer at f
- Compute lighting for all contributing lights, per tileWhiteBarreBrisebois11 OlssonBilleterAssarsson13
Pseudocode:OlssonAssarsson11
vec3 computeLight(vec3 position, vec3 normal, vec3 albedo,
vec3 specular, vec3 viewDir, float shininess,
ivec2 fragPos)
{
ivec2 l = ivec2(fragPos.x / LIGHT_GRID_CELL_DIM_X,
fragPos.y / LIGHT_GRID_CELL_DIM_Y);
int count = lightGrid[l.x + l.y * gridDim.x].x;
int offset = lightGrid[l.x + l.y * gridDim.x].y;
vec3 shading = vec3(0.0);
for (int i = 0; i < count; ++i)
{
ivec2 dataInd = ivec2((offset + i) % TILE_DATA_TEX_WIDTH,
(offset + i) / TILE_DATA_TEX_WIDTH);
int lightId = texelFetch(tileDataTex, dataInd, 0).x;
shading += applyLight(position, normal, albedo, specular,
shininess, viewDir, lightId);
}
return shading;
}
void main()
{
ivec2 fragPos = ivec2(gl_FragCoord.xy);
vec3 albedo = texelFetch(albedoTex, fragPos, 0).xyz;
vec4 specShine = texelFetch(specularShininessTex, fragPos, 0);
vec3 position = unProject(gl_FragCoord.xy, texelFetch(depthTex, fragPos, 0));
vec3 normal = texelFetch(normalTex, fragPos, 0).xyz;
vec3 viewDir = -normalize(position);
gl_FragColor = vec4(computeLight(position, normal, albedo,
                    specShine.xyz, viewDir, specShine.w,
                    fragPos), 1.0);
}
PhyreEngine Implementation:Swoboda09
- Calculate affecting lights per tile
- Build a frustum around the tile using the min and max depth values in that tile
- Perform frustum check with each light’s bounding volume
- Compare light direction with tile average normal value
- Choose fast paths based on tile contents
- No lights affect the tile? Use fast path
- Check material values to see if any pixels are marked as lit
Screen tile classification is a powerful technique with many applications:Swoboda09
- Full screen effect optimization - DoF, SSAO
- Soft particles
- Affecting lights
- Occluder information
- We can also choose whether to process MSAA per tile
To facilitate look up from shaders, we must store the data structure in a suitable format:OlssonAssarsson11
- Light Grid contains an offset to and size of the light list for each tile
- Tile Light Index Lists contains light indices, referring to the lights in the Global Light Lists
Global Light List:
| L0 | L1 | L2 | L3 | L4 | L5 | L6 | L7 | … |

Tile Light Index Lists:
| 0 | 0 | 6 | 3 | 0 | 6 | 4 | 4 | … |

Light Grid (offset / size per tile):
| 0 | 1 | 4 | 7 | … |
| 1 | 3 | 3 | 1 | … |
| 66 | 67 | 69 | … | … |
| 1 | 2 | 2 | … | … |
Red Dead Redemption 2:Huseyin20
- Uses tile-based deferred rendering path for calculating the lighting of environment maps
Basic tiled culling:Stewart15
Input: light list, scene depth
Output: per-tile list of intersecting lights
calculate depth bounds for the tile;
calculate frustum planes for the tile;
for i ← thread_index to num_lights do
current_light ← light_list[i];
test intersection against tile bounding volume;
if intersection then
thread-safe increment of list counter;
write light index to per-tile list;
end
i ← i + num_threads_per_tile;
end
Z Prepass
groupshared uint ldsZMin;
groupshared uint ldsZMax;
[numthreads(16, 16, 1)]
void CalculateDepthBoundsCS(uint3 globalIdx : SV_DispatchThreadID, uint3 localIdx : SV_GroupThreadID)
{
uint localIdxFlattened = localIdx.x + localIdx.y * 16;
if (localIdxFlattened == 0)
{
ldsZMin = 0x7f7fffff; // FLT_MAX as a uint
ldsZMax = 0;
}
GroupMemoryBarrierWithGroupSync();
float depth = g_DepthTexture.Load(uint3(globalIdx.x, globalIdx.y, 0)).x;
uint z = asuint( ConvertProjDepthToView( depth ) ); // reinterpret as uint
if (depth != 0.0)
{
InterlockedMax( ldsZMax, z );
InterlockedMin( ldsZMin, z );
}
GroupMemoryBarrierWithGroupSync();
float maxZ = asfloat( ldsZMax );
float minZ = asfloat( ldsZMin );
}
Parallel Reduction:Thomas15
- Atomics are useful but not efficient
- Compute-friendly algorithm
- Great material already available:
- Optimizing Parallel Reduction in CUDA. Harris07
Algorithm:Thomas15
- for each reduction step
- take a thread id and compare its value with another thread’s value (offset by some stride)
- keep the min value of the two
depth[tid] = min(depth[tid], depth[tid + 8])
depth[tid] = min(depth[tid], depth[tid + 4])
depth[tid] = min(depth[tid], depth[tid + 2])
depth[tid] = min(depth[tid], depth[tid + 1])
Implementation:Thomas15
- First pass reads 4 depth samples
- Needs to be separate pass (thread group size would be half the size of the original per-pixel shader)
- Write bounds to UAV
- Maybe useful for other things too
groupshared uint ldsZMin[64];
groupshared uint ldsZMax[64];
[numthreads(8, 8, 1)]
void CalculateDepthBoundsCS(uint3 globalIdx : SV_DispatchThreadID, uint3 localIdx : SV_GroupThreadID, uint3 groupIdx : SV_GroupID)
{
uint2 sampleIdx = globalIdx.xy * 2;
float depth00 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x, sampleIdx.y, 0)).x;
float viewPosZ00 = ConvertProjDepthToView(depth00);
float depth01 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x, sampleIdx.y+1, 0)).x;
float viewPosZ01 = ConvertProjDepthToView(depth01);
float depth10 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x+1, sampleIdx.y, 0)).x;
float viewPosZ10 = ConvertProjDepthToView(depth10);
float depth11 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x+1, sampleIdx.y+1, 0)).x;
float viewPosZ11 = ConvertProjDepthToView(depth11);
float minZ00 = (depth00 != 0.f) ? viewPosZ00 : FLT_MAX; float maxZ00 = (depth00 != 0.f) ? viewPosZ00 : 0.0f;
float minZ01 = (depth01 != 0.f) ? viewPosZ01 : FLT_MAX; float maxZ01 = (depth01 != 0.f) ? viewPosZ01 : 0.0f;
float minZ10 = (depth10 != 0.f) ? viewPosZ10 : FLT_MAX; float maxZ10 = (depth10 != 0.f) ? viewPosZ10 : 0.0f;
float minZ11 = (depth11 != 0.f) ? viewPosZ11 : FLT_MAX; float maxZ11 = (depth11 != 0.f) ? viewPosZ11 : 0.0f;
uint threadNum = localIdx.x + localIdx.y * 8;
ldsZMin[threadNum] = min(minZ00, min(minZ01, min(minZ10, minZ11)));
ldsZMax[threadNum] = max(maxZ00, max(maxZ01, max(maxZ10, maxZ11)));
GroupMemoryBarrierWithGroupSync();
if (threadNum < 32)
{
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 32]); ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 32]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 16]); ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 16]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 8]); ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 8]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 4]); ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 4]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 2]); ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 2]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 1]); ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 1]);
}
GroupMemoryBarrierWithGroupSync();
if (threadNum == 0)
{
g_DepthBounds[groupIdx.xy] = float2(ldsZMin[0], ldsZMax[0]);
}
}
Depth bounds calculation:Stewart15
Texture2D<float> g_SceneDepthBuffer;
// Thread Group Shared Memory (aka local data share, or LDS)
groupshared uint ldsZMin;
groupshared uint ldsZMax;
// Convert a depth value from postprojection space
// into view space
float ConvertProjDepthToView(float z)
{
return (1.f / (z * g_mProjectionInv._34 + g_mProjectionInv._44));
}
#define TILE_RES (16)
[numthreads(TILE_RES, TILE_RES, 1)]
void CullLightsCS(uint3 globalIdx : SV_DispatchThreadID,
uint3 localIdx : SV_GroupThreadID,
uint3 groupIdx : SV_GroupID)
{
float depth = g_SceneDepthBuffer.Load(uint3(globalIdx.x, globalIdx.y, 0)).x;
float viewPosZ = ConvertProjDepthToView(depth);
uint z = asuint(viewPosZ);
uint threadNum = localIdx.x + localIdx.y * TILE_RES;
// There is no way to initialize shared memory at compile time, so thread zero does it at runtime
if (threadNum == 0)
{
ldsZMin = 0x7f7fffff; // FLT_MAX as a uint
ldsZMax = 0;
}
GroupMemoryBarrierWithGroupSync();
// Parts of the depth buffer that were never written
// (e.g., the sky) will be zero (the companion code uses
// inverted 32-bit float depth for better precision).
if (depth != 0.f)
{
// Calculate the minimum and maximum depth for this tile
// to form the front and back of the frustum
InterlockedMin(ldsZMin, z);
InterlockedMax(ldsZMax, z);
}
GroupMemoryBarrierWithGroupSync();
float minZ = asfloat(ldsZMin);
float maxZ = asfloat(ldsZMax);
// Frustum plane and intersection code goes here
...
}
Light Culling
- Frustum-based culling:Zhdan16
- Sphere vs frustum planes are the most commonly used test
- In fact, it is a frustum-box test
- Extremely inaccurate with large spheres
- False positives!
Frustum planes calculation:Stewart15
// Plane equation from three points, simplified
// for the case where the first position is the origin.
// N is normalized so that the plane equation can
// be used to compute signed distance
float4 CreatePlaneEquation(float3 Q, float3 R)
{
// N = normalize(cross(Q-P, R-P))
// except we know P is the origin
float3 N = normalize(cross(Q, R));
// D = -(N dot P), except we know P is the origin
return float4(N, 0);
}
// Convert a point from postprojection space into view space
float3 ConvertProjToView(float4 p)
{
p = mul(p, g_mProjectionInv);
return (p/p.w).xyz;
}
void CullLightsCS(uint3 globalIdx : SV_DispatchThreadID,
uint3 localIdx : SV_GroupThreadID,
uint3 groupIdx : SV_GroupID)
{
// Depth bounds code goes here
...
float4 frustumEqn[4];
{ // Construct frustum planes for this tile
uint pxm = TILE_RES * groupIdx.x;
uint pym = TILE_RES * groupIdx.y;
uint pxp = TILE_RES * (groupIdx.x + 1);
uint pyp = TILE_RES * (groupIdx.y + 1);
uint width = TILE_RES * GetNumTilesX();
uint height = TILE_RES * GetNumTilesY();
// Four corners of the tile, clockwise from top-left
float3 p[4];
p[0] = ConvertProjToView(float4(pxm / (float) width * 2.f - 1.f, (height - pym) / (float) height * 2.f - 1.f, 1.f, 1.f));
p[1] = ConvertProjToView(float4(pxp / (float) width * 2.f - 1.f, (height - pym) / (float) height * 2.f - 1.f, 1.f, 1.f));
p[2] = ConvertProjToView(float4(pxp / (float) width * 2.f - 1.f, (height - pyp) / (float) height * 2.f - 1.f, 1.f, 1.f));
p[3] = ConvertProjToView(float4(pxm / (float) width * 2.f - 1.f, (height - pyp) / (float) height * 2.f - 1.f, 1.f, 1.f));
// Create plane equations for the four sides, with
// the positive half-space outside the frustum
for (uint i = 0; i < 4; ++i)
{
frustumEqn[i] = CreatePlaneEquation(p[i], p[(i + 1) & 3]);
}
}
// Intersection code goes here
...
}
Intersection testing:Stewart15
Buffer<float4> g_LightBufferCenterAndRadius;
#define MAX_NUM_LIGHTS_PER_TILE (256)
groupshared uint ldsLightIdxCounter;
groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];
// Point-plane distance, simplified for the case where
// the plane passes through the origin
float GetSignedDistanceFromPlane(float3 p, float4 eqn)
{
// dot(eqn.xyz, p) + eqn.w, except we know eqn.w is zero
return dot(eqn.xyz, p);
}
#define NUM_THREADS (TILE_RES * TILE_RES)
void CullLightsCS(...)
{
// Depth bounds and frustum planes code goes here
...
if (threadNum == 0)
{
ldsLightIdxCounter = 0;
}
// Loop over the lights and do a
// sphere versus frustum intersection test
for (uint i = threadNum; i < g_uNumLights; i += NUM_THREADS)
{
float4 p = g_LightBufferCenterAndRadius[i];
float r = p.w;
float3 c = mul(float4(p.xyz, 1), g_mView).xyz;
// Test if sphere is intersecting or inside frustum
if ((GetSignedDistanceFromPlane(c, frustumEqn[0]) < r) &&
(GetSignedDistanceFromPlane(c, frustumEqn[1]) < r) &&
(GetSignedDistanceFromPlane(c, frustumEqn[2]) < r) &&
(GetSignedDistanceFromPlane(c, frustumEqn[3]) < r) &&
(-c.z + minZ < r) && (c.z - maxZ < r))
{
// Do a thread-safe increment of the list counter
// and put the index of this light into the list
uint dstIdx = 0;
InterlockedAdd(ldsLightIdxCounter, 1, dstIdx);
ldsLightIdx[dstIdx] = i;
}
}
GroupMemoryBarrierWithGroupSync();
}
AABB
- Using Arvo Intersection TestThomas15
bool TestSphereVsAABB(float3 sphereCenter, float sphereRadius, float3 AABBCenter, float3 AABBHalfSize)
{
float3 delta = max(0, abs(AABBCenter - sphereCenter) - AABBHalfSize);
float distSq = dot(delta, delta);
return distSq <= sphereRadius * sphereRadius;
}
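A hedged usage sketch: build the tile’s view-space AABB from its four corner directions (assumed stored at z = 1, with positive z into the screen) and the tile’s min/max depth, then reuse TestSphereVsAABB; cornerDir, minZ, maxZ, lightViewPos and lightRadius are assumed names:
// Component-wise bounds of the four tile corners at the near and far tile depths
float3 dirMin = min(min(cornerDir[0], cornerDir[1]), min(cornerDir[2], cornerDir[3]));
float3 dirMax = max(max(cornerDir[0], cornerDir[1]), max(cornerDir[2], cornerDir[3]));
float3 aabbMin = min(dirMin * minZ, dirMin * maxZ);
float3 aabbMax = max(dirMax * minZ, dirMax * maxZ);
bool visible = TestSphereVsAABB(lightViewPos, lightRadius,
                                0.5f * (aabbMin + aabbMax),   // AABB center
                                0.5f * (aabbMax - aabbMin));  // AABB half size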
Rasterization
- Rounded AABB?:Zhdan16
- Doesn’t suit spot lights
- Works badly for very long frustums
- Problematic for wide FOV
- Average tile frustum angle is small:Zhdan16
- FOV = 100°, Tile size = 16 × 16 pixels
- Angle = FOV × (tile_size / screen_height) = 0.8° (at 1080p)
- Frustum can be represented as a single ray at tile centerZhdan16
- Or 4 rays at tile corners
- Works better
- Replace frustum test with ray intersection test (see the ray-sphere sketch after this list):
- Ray-sphere, ray-cone, …
- Compare tile min-max z with min-max among all intersections
- Culling on compute sucksZhdan16
- Total operations = X × Y × N
- X = tile grid width
- Y = tile grid height
- N = number of lights
- Reduce the order of enumeration
- Subdivide screen into 4 - 8 sub-screens
- Coarsely cull lights against sub-screen frustums
- Select corresponding sub-screen during culling phase
- However, compute shader is still too slowZhdan16
- Light culling using the graphics pipelineZhdan16
- Use rasterizer to generate light fragments
- Empty tiles will be natively skipped
- Use depth test to account for occlusion
- Useless work for occluded tiles will be skipped
- Use primitive-ray intersection on PS for fine culling and light list updating
- Idea OverviewZhdan16
- Culling phase tile → 1 pixel
- Light volume → proxy geometry
- Coarse XY-culling → rasterization
- Coarse Z-culling → depth test
- Precise culling → pixel shader
- Break tiled shading into 3 phases:Zhdan16
- Reduction
- Culling
- Lighting
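A minimal sketch of the ray-vs-sphere test mentioned in the rasterization list above (view space, eye at the origin, positive z into the screen); the function name and parameters are assumptions, not Zhdan’s exact code:
bool TileRayHitsSphere(float3 rayDir,          // normalized ray through the tile center
                       float3 sphereCenter,    // light position in view space
                       float sphereRadius,
                       float tileMinZ, float tileMaxZ)
{
    float tm = dot(sphereCenter, rayDir);                    // closest approach along the ray
    float d2 = dot(sphereCenter, sphereCenter) - tm * tm;    // squared ray-to-center distance
    if (d2 > sphereRadius * sphereRadius)
        return false;                                        // the ray misses the sphere
    float dt = sqrt(sphereRadius * sphereRadius - d2);
    float zEnter = (tm - dt) * rayDir.z;                     // view-space z of the entry point
    float zExit  = (tm + dt) * rayDir.z;                     // view-space z of the exit point
    // Accept only if [zEnter, zExit] overlaps the tile's depth bounds
    return !(max(zEnter, zExit) < tileMinZ || min(zEnter, zExit) > tileMaxZ);
}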
New Culling Method:Zhdan16
- Camera frustum culling
- Cull lights against camera frustum
- Split visible lights into “outer” and “inner”
- Can be done in CPU
- Depth buffers creation
- For each tile:
- Find and copy max depth for “outer” lights
- Find and copy min depth for “inner” lights
- Depth test is key to high performance!
- Use [earlydepthstencil] in shader
- Rasterization & classification
- Render light geometry with depth test
- “outer” - max depth buffer
- Front faces with direct depth test
- “inner” - min depth buffer
- Back faces with inverted depth test
- Use PS for precise culling and per-tile light list creation (see the pixel-shader sketch at the end of this section)
- Common light types
- Light geometry can be replaced with proxy geometry
- Point light (omni)
- Geosphere (2 subdivisions, octa-based)
- Close enough to sphere
- Low poly works well at low resolution
- Equilateral triangles can ease rasterizer’s life
- Directional light (spot)
- Old CRT-TV
- Easy for parameterization
- From a searchlight
- To a hemisphere
- Plane part can be used to handle area lights
- Advantages:Zhdan16
- No work for tiles without lights and for occluded lights
- Coarse culling is almost free
- Incredible speed up with small lights
- Complex proxy models can be used!
- Mathematically it is a branch-and-bound procedure
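A hedged sketch of the rasterization & classification step above: proxy geometry is drawn to a low-resolution target (one pixel per tile), [earlydepthstencil] lets the min/max depth buffers reject occluded tiles, and the pixel shader appends the surviving light to that tile’s list. All resource names, the LIGHT_ID interpolant and MAX_LIGHTS_PER_TILE are assumptions:
#define MAX_LIGHTS_PER_TILE 64
cbuffer TileConstants { uint g_NumTilesX; };
RWStructuredBuffer<uint> g_TileLightCount;    // one counter per tile
RWStructuredBuffer<uint> g_TileLightIndices;  // MAX_LIGHTS_PER_TILE slots per tile
[earlydepthstencil]
void CullLightPS(float4 svPos : SV_Position,
                 nointerpolation uint lightIndex : LIGHT_ID)
{
    // One pixel of the low-resolution target corresponds to one screen tile
    uint2 tile = (uint2)svPos.xy;
    uint tileId = tile.y * g_NumTilesX + tile.x;
    // (Optional) precise ray / primitive test against the tile's depth range goes here
    uint offset;
    InterlockedAdd(g_TileLightCount[tileId], 1, offset);
    if (offset < MAX_LIGHTS_PER_TILE)
        g_TileLightIndices[tileId * MAX_LIGHTS_PER_TILE + offset] = lightIndex;
}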
Compute Shader Implementation
- Primarily for analytical light sourcesAndersson11
- Point lights, cone lights, line lights
- No shadows
- Requires compute shader 5.0
- Hybrid Graphics / Compute shading pipeline:Andersson11
- Graphics pipeline rasterizes gbuffers for opaque surfaces
- Compute pipeline uses gbuffers, culls lights, computes lighting & combines with shading
- Graphics pipeline renders transparent surfaces on top
- MSAA is simplerSousa13
- Requirements & setup
- Input data:
- gbuffers, depth buffer
- light constants
- list of lightsAndersson11
- Output data:Andersson11
- Fully composited & lit HDR texture
- 1 thread per pixel, 16 x 16 thread groups (aka tile)Andersson11
Texture2D<float4> gbufferTexture1 : register(t0); Texture2D<float4> gbufferTexture2 : register(t1); Texture2D<float4> gbufferTexture3 : register(t2); Texture2D<float4> depthTexture : register(t3);
RWTexture2D<float4> outputTexture : register(u0);
#define BLOCK_SIZE (16)
[numthreads(BLOCK_SIZE, BLOCK_SIZE, 1)]
void csMain(
    uint3 groupId : SV_GroupID,
    uint3 groupThreadId : SV_GroupThreadID,
    uint groupIndex : SV_GroupIndex,
    uint3 dispatchThreadId : SV_DispatchThreadID)
{
    …
}
- Input data:
- Load gbuffers & depth
- Calculate min & max z in threadgroup / tile
- Using InterlockedMin / InterlockedMax on a groupshared variable
- Atomics only work on ints
- But casting works (z is always +)
- Can skip if we could resolve out min & max z to a texture directly using HiZ / Z Culling
groupshared uint minDepthInt;
groupshared uint maxDepthInt;
// --- globals above, function below -------
float depth = depthTexture.Load(uint3(texCoord, 0)).r;
uint depthInt = asuint(depth);
minDepthInt = 0xFFFFFFFF;
maxDepthInt = 0;
GroupMemoryBarrierWithGroupSync();
InterlockedMin(minDepthInt, depthInt);
InterlockedMax(maxDepthInt, depthInt);
GroupMemoryBarrierWithGroupSync();
float minGroupDepth = asfloat(minDepthInt);
float maxGroupDepth = asfloat(maxDepthInt);
- Determine visible light sources for each tile
- Cull all light sources against tile "frustum"
- Light sources can either naively be all light sources in the scene, or CPU frustum culled potentially visible light sources
- Input (global):Andersson11
- Light list, frustum & SW occlusion culled
- Output for each tile is:
- # of visible light sources
- Index list of visible light sources
- Light indices:
- Global list (1000+): 0 1 2 3 4 5 6 7 8 …
- Tile visible list (~0-40+): 0 2 5 6 8 …
- Key part of the algorithm and compute shader
- Each thread switches to process light sources instead of a pixel
- Wow, parallelism switcheroo!
- 256 light sources in parallel per tile
- Multiple iterations for >256 lights
- Intersect light source & tile
- Many variants dep. on accuracy requirements & performance
- Tile min & max z is used as a shader "depth bounds" test
- For visible lights, append light index to index list
- Atomic add to threadgroup shared memory. "inlined stream compaction"
- Prefix sum + stream compaction should be faster than atomics, but more limiting
- Switch back to processing pixels
- Synchronize the thread group
- We now know which light sources affect the tile
struct Light
{
    float3 pos;
    float sqrRadius;
    float3 color;
    float invSqrRadius;
};
int lightCount;
StructuredBuffer<Light> lights;
groupshared uint visibleLightCount = 0;
groupshared uint visibleLightIndices[1024];
// ----- globals above, cont. function below ---------
uint threadCount = BLOCK_SIZE * BLOCK_SIZE;
uint passCount = (lightCount + threadCount - 1) / threadCount;
for (uint passIt = 0; passIt < passCount; ++passIt) {
uint lightIndex = passIt * threadCount + groupIndex;
// prevent overrun by clamping to a last "null" light
lightIndex = min(lightIndex, lightCount);
if (intersects(lights[lightIndex], tile))
{
uint offset;
InterlockedAdd(visibleLightCount, 1, offset);
visibleLightIndices[offset] = lightIndex;
}
}
GroupMemoryBarrierWithGroupSync();
- For each pixel, accumulate lighting from visible lights
- Read from tile visible light index list in threadgroup shared memory
float3 diffuseLight = 0;
float3 specularLight = 0;
for (uint lightIt = 0; lightIt < visibleLightCount; ++lightIt) {
uint lightIndex = visibleLightIndices[lightIt];
Light light = lights[lightIndex];
evaluateAndAccumulateLight(
light,
gbufferParameters,
diffuseLight,
specularLight
); }
- Combine lighting & shading albedos / parametersAndersson11
- Output is non-MSAA HDR texture
- Render transparent surfaces on top
float3 color = 0;
for (uint lightIt = 0; lightIt < visibleLightCount; ++lightIt) {
uint lightIndex = visibleLightIndices[lightIt];
Light light = lights[lightIndex];
color += diffuseAlbedo * evaluateLightDiffuse(light, gbuffer);
color += specularAlbedo * evaluateLightSpecular(light, gbuffer);
}
Optimizations
Depth range optimizationOlssonAssarsson11
Compute min and max Z value for each tile. This requires access to the z buffer.
Half Z MethodStewart15
// Test if sphere is intersecting or inside frustum
if ((GetSignedDistanceFromPlane(c, frustumEqn[0]) < r) &&
(GetSignedDistanceFromPlane(c, frustumEqn[1]) < r) &&
(GetSignedDistanceFromPlane(c, frustumEqn[2]) < r) &&
(GetSignedDistanceFromPlane(c, frustumEqn[3]) < r) &&
(-c.z + minZ < r) && (c.z - maxZ < r))
{
if (-c.z + minZ < r && c.z - halfZ < r)
{
// Do a thread-safe increment of the list counter
// and put the index of this light into the list
uint dstIdx = 0;
InterlockedAdd(ldsLightIdxCounterA, 1, dstIdx);
ldsLightIdxA[dstIdx] = i;
}
if (-c.z + halfZ < r && c.z - maxZ < r)
{
// Do a thread-safe increment of the list counter
// and put the index of this light into the list
uint dstIdx = 0;
InterlockedAdd(ldsLightIdxCounterB, 1, dstIdx);
ldsLightIdxB[dstIdx] = i;
}
}
Parallel ReductionStewart15
Texture2D<float> g_SceneDepthBuffer;
RWTexture2D<float4> g_DepthBounds;
#define TILE_RES (16)
#define NUM_THREADS_1D (TILE_RES / 2)
#define NUM_THREADS (NUM_THREADS_1D * NUM_THREADS_1D)
// Thread Group Shared Memory (aka local data share, or LDS)
groupshared float ldsZMin[NUM_THREADS];
groupshared float ldsZMax[NUM_THREADS];
// Convert a depth value from postprojection space
// into view space
float ConvertProjDepthToView(float z)
{
return (1.f / (z * g_mProjectionInv._34 + g_mProjectionInv._44));
}
[numthreads(NUM_THREADS_1D, NUM_THREADS_1D, 1)]
void DepthBoundsCS( uint3 globalIdx : SV_DispatchThreadID,
uint3 localIdx : SV_GroupThreadID,
uint3 groupIdx : SV_GroupID)
{
uint2 sampleIdx = globalIdx.xy * 2;
// Load four depth samples
float depth00 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x, sampleIdx.y, 0)).x;
float depth01 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x, sampleIdx.y + 1, 0)).x;
float depth10 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x + 1, sampleIdx.y, 0)).x;
float depth11 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x + 1, sampleIdx.y + 1, 0)).x;
float viewPosZ00 = ConvertProjDepthToView(depth00);
float viewPosZ01 = ConvertProjDepthToView(depth01);
float viewPosZ10 = ConvertProjDepthToView(depth10);
float viewPosZ11 = ConvertProjDepthToView(depth11);
uint threadNum = localIdx.x + localIdx.y * NUM_THREADS_1D;
// Use parallel reduction to calculate the depth bounds
{
// Parts of the depth buffer that were never written
// (e.g., the sky) will be zero (the companion code uses
// inverted 32-bit float depth for better precision)
float minZ00 = (depth00 != 0.f) ? viewPosZ00 : FLT_MAX;
float minZ01 = (depth01 != 0.f) ? viewPosZ01 : FLT_MAX;
float minZ10 = (depth10 != 0.f) ? viewPosZ10 : FLT_MAX;
float minZ11 = (depth11 != 0.f) ? viewPosZ11 : FLT_MAX;
float maxZ00 = (depth00 != 0.f) ? viewPosZ00 : 0.0f;
float maxZ01 = (depth01 != 0.f) ? viewPosZ01 : 0.0f;
float maxZ10 = (depth10 != 0.f) ? viewPosZ10 : 0.0f;
float maxZ11 = (depth11 != 0.f) ? viewPosZ11 : 0.0f;
// Initialize shared memory
ldsZMin[threadNum] = min(minZ00, min(minZ01, min(minZ10, minZ11)));
ldsZMax[threadNum] = max(maxZ00, max(maxZ01, max(maxZ10, maxZ11)));
GroupMemoryBarrierWithGroupSync();
// Minimum and maximum using parallel reduction, with the
// loop manually unrolled for 8x8 thread groups (64 threads
// per thread group)
if (threadNum < 32)
{
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 32]);
ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 32]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 16]);
ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 16]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 8]);
ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 8]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 4]);
ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 4]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 2]);
ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 2]);
ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 1]);
ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 1]);
}
}
GroupMemoryBarrierWithGroupSync();
float minZ = ldsZMin[0];
float maxZ = ldsZMax[0];
float halfZ = 0.5f * (minZ + maxZ);
// Calculate a second set of depth values: the maximum
// on the near side of Half Z and the minimum on the far
// side of Half Z
{
// See the companion code for details
...
}
// The first thread writes to the depth bounds texture
if (threadNum == 0)
{
float maxZ2 = ldsZMax[0];
float minZ2 = ldsZMin[0];
g_DepthBounds[groupIdx.xy] = float4(minZ, maxZ2, minZ2, maxZ);
}
}
Light Pre-Pass Renderer
This is the second rendering pass where we store light properties of all lights in a light buffer (aka L-Buffer).EngelShaderX709
Compared to a deferred renderer, the light pre-pass renderer offers more flexibility regarding material implementations. Compared to a Z pre-pass renderer, it offers less flexibility but a flexible and fast multi-light solution.EngelShaderX709
Because the light buffer only has to hold light properties, the cost of rendering one light source is lower than for a similar setup in a deferred renderer.EngelShaderX709
- Advantages:
- Easier to retrofit into “traditional” rendering pipelinesLee09
- Lower memory and bandwidth usageLee09
- Can reuse your primary shaders for forward rendering of alphaLee09
- One material fetch per pixel regardless of number of lightsThibieroz11
- Disadvantages:
- Alpha blending is problematicLee09
- MSAA and alpha to coverage can help
- Encoding different material types is not elegantLee09
- Coherent fragment program dynamic branching can help
- 2x geometry pass too expensive on both CPU & GPUAndersson11
- Tile-based deferred shading has major potentialAndersson11
- Two geometry passes requiredThibieroz11
- Unique lighting modelThibieroz11
Version AEngelSiggraph09
- Geometry Pass
- Fill up normal and depth buffer
- Lighting Pass
- Store light properties in light buffer
- Geometry Pass
- Fetch light buffer and apply different material terms per surface by reconstructing the lighting equation
Version BEngelSiggraph09
Similar to S.T.A.L.K.E.R: Clear Skies
- Geometry Pass
- Fill up normal + spec. power and depth buffer and a color buffer for the ambient pass
- Lighting Pass
- Store light properties in light buffer
- Ambient + Resolve (MSAA) Pass
- Fetch light buffer use its content as diffuse and specular content and add the ambient term while resolving into the main buffer
Dragon Age IIPapathanasis11
- Extra pass to render scene normals
- Render all dynamic light spheres to a light buffer
- Allows for hundreds of lights
Clustered Shading
Clustered shading explores higher dimensional tiles, which we collectively call clusters. Each cluster has a fixed maximum 3D extent.OlssonBilleterAssarssonHpg12
Deferred Algorithm:OlssonBilleterAssarssonHpg12
- Render scene to G-Buffers
- Cluster assignment
- Find unique clusters
- Assign lights to clusters
- Shade samples
Advantages:
- FlexibilityPersson15
- Forward rendering compatible
- Custom materials or light models
- Transparency
- Deferred rendering compatiblePersson15
- Screen-space decals
- Performance
- SimplicityPersson15
- Unified lighting solution
- Easier to implement than full blown Tiled Deferred / Tiled Forward
- PerformancePersson15
- Typically same or better than Tiled Deferred
- Better worst-case performance
- Depth discontinuities? “It just works”
Avalanche solution:Persson15
- Only spatial clustering
- 64 × 64 pixels, 16 depth slices
- CPU light assignment
- Compact memory structure easy
- Implicit cluster bounds only
- Scene-independent
- Deferred pass could potentially use explicit
- Exponential depth slicing (see the slice-index sketch after this list)
- Huge depth range! [0.1m ~ 50,000m]
- Default list
- [0.1, 0.23, 0.52, 1.2, 2.7, 6.0, 14, 31, 71, 161, 365, 828, 1880, 4270, 9696, 22018, 50000]
- Limit far to 500
- We have “distant lights” systems for light visualization beyond that
- [0.1, 0.17, 0.29, 0.49, 0.84, 1.43, 2.44, 4.15, 7.07, 12.0, 20.5, 35, 59, 101, 172, 293, 500]
- Special near 0.1 - 5.0 cluster
- Tweaked visually from player standing on flat ground
- [0.1, 5.0, 6.8, 9.2, 12.6, 17.1, 23.2, 31.5, 42.9, 58.3, 79.2, 108, 146, 199, 271, 368, 500]
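A minimal sketch of a purely exponential slice index for the depth slicing above (ignoring the special near cluster); near, far and numSlices are assumptions, and this is not Avalanche’s exact code, but it yields the same log2(z) * scale + bias form used in the lookup code below:
uint DepthSlice(float viewZ, float near, float far, uint numSlices)
{
    // Slice boundaries follow near * (far/near)^(k / numSlices),
    // so the inverse mapping is an affine function of log2(viewZ)
    float scale = (float)numSlices / log2(far / near);
    float bias  = -scale * log2(near);
    return (uint)max(log2(viewZ) * scale + bias, 0.0f);
}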
Data Structure
- Cluster “pointers” in 3D texturePersson15
- R32G32_UINT
- R = Offset
- G = [PointLightCount, SpotLightCount]
- Light index list in texture bufferPersson15
- R16_UINT
- Tightly packed
- Light & shadow data in CBPersson15
- PointLight: 3 × float4
- SpotLight: 4 × float4
int3 tex_coord = int3(In.Position.xy, 0); // Screen-space position ...
float depth = Depth.Load(tex_coord); // ... and depth
int slice = int(max(log2(depth * ZParam.x + ZParam.y) * scale + bias, 0)); // Look up cluster
int4 cluster_coord = int4(tex_coord.xy >> 6, slice, 0); // TILE_SIZE = 64
uint2 light_data = LightLookup.Load(cluster_coord); // Fetch light list
uint light_index = light_data.x; // Extract parameters
const uint point_light_count = light_data.y & 0xFFFF;
const uint spot_light_count = light_data.y >> 16;
for (uint pl = 0; pl < point_light_count; ++pl) // Point lights
{
uint index = LightIndices[light_index++].x;
float3 LightPos = PointLights[index].xyz;
float3 Color = PointLights[index + 1].rgb;
// Compute pointlight here ...
}
for (uint sl = 0; sl < spot_light_count; ++sl) // Spot lights
{
uint index = LightIndices[light_index++].x;
float3 LightPos = SpotLights[index].xyz;
float3 Color = SpotLights[index + 1].rgb;
// Compute spotlight here ...
}
- Memory optimizationPersson15
- Naive approach: Allocate theoretical max
- All clusters address all lights
- Not likely
- Might be several megabytes
- Most never used
- Semi-conservative approach
- Construct massive worst-case scenario
- Multiply by 2, or what makes you comfortable
- Still likely only a small fraction of theoretical max
- Assert at runtime that you never go over allocation
- Warn if you ever get close
Cluster Assignment
- Goal: compute an integer cluster key for a given view sample from the information available in the G-BufferOlssonBilleterAssarssonHpg12
- Use position, normal (optional)
- Regular subdivision / quantization of the sample positionsOlssonBilleterAssarssonHpg12
- Fast, predictable cluster sizes
- Uniform screen space tiling used in tiled deferred shading + extend this by also subdividing along the z-axis in view space (or NDC)OlssonBilleterAssarssonHpg12
- Due to non-linear nature of NDC, subdivide the z-axis in view space by spacing the divisions exponentiallyOlssonBilleterAssarssonHpg12
- Subdivision:OlssonBilleterAssarssonHpg12
- near_k: near plane for division k
- h_k: depth of division k
- near_k = near_{k-1} + h_{k-1}
- near_0 = near
- h_0 = 2 near tan(θ) / S_y
- 2θ: field of view
- S_y: number of subdivisions in the Y direction
- near_k = near (1 + 2 tan(θ) / S_y)^k
- Cluster key tuple (i, j, k):OlssonBilleterAssarssonHpg12 (i, j) = the sample’s screen-space tile coordinates, k = ⌊log(-z_vs / near) / log(1 + 2 tan(θ) / S_y)⌋
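A minimal sketch of this key computation, assuming a positive view-space depth and assumed parameter names (tileSizePx, tanHalfFovY = tan θ, Sy):
uint3 ClusterKey(float2 screenPos,     // pixel position
                 float viewZ,          // positive view-space depth
                 float near,
                 float tanHalfFovY,    // tan(θ), half the vertical field of view
                 float2 tileSizePx,
                 uint Sy)              // number of subdivisions in Y
{
    uint i = (uint)(screenPos.x / tileSizePx.x);
    uint j = (uint)(screenPos.y / tileSizePx.y);
    float ratio = 1.0f + 2.0f * tanHalfFovY / (float)Sy;   // near_k = near * ratio^k
    uint k = (uint)(log(viewZ / near) / log(ratio));
    return uint3(i, j, k);
}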
Granite:
- Instead of frustums, more grid-like structure provided much simpler culling mathArntzen20
Shadow of the Tomb Raider:Moradin19
- Light shapes are approximated with an icosahedron
- All frustums used for cone lights are just boxes scaled differently on both ends
Avalanche Studios:Persson15
- Want to minimize false positives
- Must be conservative
- But still tight
- Preferably exact
- But not too expensive
- Surprisingly hard!
- 99% frustum culling code useless
- Made for view-frustum culling
- Large frustum vs. small sphere
- We need small frustum vs. large sphere
- Sphere vs. six planes won’t do
- Pointlight Culling:
- Iterative sphere refinement
- Loop over z, reduce sphere
- Loop over y, reduce sphere
- Loop over x, test against sphere
- Culls better than AABB
- Similar cost
- Typically culling 20-30%
for (int z = z0; z <= z1; ++z)
{
float4 z_light = light;
if (z != center_z)
{
const ZPlane& plane = (z < center_z) ? z_planes[z + 1] : -z_planes[z];
z_light = project_to_plane(z_light, plane);
}
for (int y = y0; y < y1; ++y)
{
float3 y_light = z_light;
if (y != center_y)
{
const YPlane& plane = (y < center_y) ? y_planes[y + 1] : -y_planes[y];
y_light = project_to_plane(y_light, plane);
}
int x = x0;
do
{
++x;
} while (x < x1 && GetDistance(x_planes[x], y_light_pos) >= y_light_radius);
int xs = x1;
do
{
--xs;
} while (xs >= x && -GetDistance(x_planes[xs], y_light_pos) >= y_light_radius);
for (--x; x <= xs; ++x)
{
light_lists.AddPointLight(base_cluster + x, light_index);
}
}
}
- Spotlight Culling:
- Iterative plane narrowing
- Find sphere cluster bounds
- In each six directions, do plane-cone test and shrink
- Cone vs. bounding-sphere cull remaining “cube”
Sparse vs Dense Cluster GridOlsson15
- Sparse Cluster Grid
- Only store cells that contain samples
- Requires pre-z pass / deferred
- No redundant light assignment
- Geometry info useful for other things
- Dense Cluster Grid
- Must assign lights to all clusters
- Can be done on CPU / Asynchronously
- Can access any point in view volume
- SAME shading cost as for sparse
Explicit vs Implicit ClusterOlsson15
- Explicit cluster bounds
- Actual bounding box of samples
- Some storage
- Some cost to build
- Tight bounds
- Extra geometry pass for forward shading
- Implicit cluster bounds
- Implied by grid coordinate
- No storage
- Can have large empty space
Finding Unique Clusters
- Local SortingOlssonBilleterAssarssonHpg12
- Sort samples in each screen space tile locally
- Allows us to:
- Perform the sorting operation in on-chip shared memory
- Use local indices to link back to the source pixel
- Extract unique clusters from each tile using a parallel compaction
- Compute and store a link from each sample to its associated cluster
- Globally unique list of clusters
- Cluster key defines implicit 3D bounds, and optionally an implicit normal coneOlssonBilleterAssarssonHpg12
- Compute the explicit bounds by performing a reduction over the samples in each clusterOlssonBilleterAssarssonHpg12
- e.g., Perform a min/max reduction to find the AABB enclosing each cluster
- Results of the reduction are stored separately in memoryOlssonBilleterAssarssonHpg12
Light AssignmentOlssonBilleterAssarssonHpg12
- Goal: calculate the list of lights influencing each cluster
- Fully hierarchical approach based on a spatial tree over the lights is better than tiled shading’s brute force approach
- Bounding Volume Hierarchy (BVH)
- Sort the lights according to the Z-order (Morton Code) based on the discretized centre position of each light (see the Morton-code sketch after this list)
- Leaves of the search tree are obtained directly from the sorted data
- 32 consecutive leaves are grouped into a bounding volume (AABB) to form the first level above the leaves
- Next level is constructed by again combining 32 consecutive elements
- Continue until a single root element remains
- To construct upper levels of the tree, we launch a CUDA warp (32 threads) for each node to be constructed
- For each cluster, we traverse this BVH using depth-first traversal (warp is allocated)
- At each level, the bounding box of the cluster is tested against the bounding volumes of the child nodes
- For the leaf nodes, the sphere bounding the light source is used
- The branching factor of 32 allows efficient SIMD-traversal on the GPU and keeps the search tree relatively shallow (up to 5 levels)
- Avoid expensive recursion
- Support up to 32 million lights
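A minimal sketch of a 30-bit Morton (Z-order) key for the light sort mentioned in the list above, assuming light centres quantized to a 1024³ grid inside the clustering volume; this is the standard bit-interleaving trick, not necessarily the paper’s exact code:
// Insert two zero bits between each of the low 10 bits of v
uint ExpandBits(uint v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}
// 30-bit Morton code for a position in [0,1]^3
uint Morton3D(float3 p)
{
    uint x = (uint)clamp(p.x * 1024.0f, 0.0f, 1023.0f);
    uint y = (uint)clamp(p.y * 1024.0f, 0.0f, 1023.0f);
    uint z = (uint)clamp(p.z * 1024.0f, 0.0f, 1023.0f);
    return (ExpandBits(x) << 2) | (ExpandBits(y) << 1) | ExpandBits(z);
}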
ShadingOlssonBilleterAssarssonHpg12
To match the pixel and the clusters, we need a direct mapping between the cluster key and the index into the list of unique clusters.
In the sorting approach, we explicitly store this index for each pixel. When the unique cluster is established, store the index to the correct pixel in a full screen buffer.
Cluster Key PackingOlssonBilleterAssarssonHpg12
Allocate 8 bits to each of the i and j components, which allows render targets up to 8192 × 8192 in size. The depth index k is determined from the settings for the near and far planes.
The paper uses 10 bits: 4 bits for the actual depth data, and 6 bits for the optional normal clustering.
Tile SortingOlssonBilleterAssarssonHpg12
To the cluster key we attach an additional 10 bits of meta-data, which identifies the sample’s original position relative to its tile. We perform a tile-local sort of the cluster keys and the associated meta-data. The sort only considers the up-to 16 bits of the cluster key; the meta-data is used as a link back to the original sample after sorting. In each tile, we count the number of unique cluster keys. Using a prefix operation over the counts from each tile, we find the total number of unique cluster keys and assign each cluster a unique ID in the range [0…numClusters). We write the unique ID back to each pixel that is a member of the cluster. The unique ID also serves as an offset in memory to where the cluster’s data is stored.
Shadows
- Needs all shadow buffers upfrontPersson15
- One large atlas
- Variable size buffers
- Dynamically adjustable resolution
- One large atlas
- Lights are cheap, shadow maps are notPersson15
- Still need to be conservative about shadow casters
- Decouple light and shadow buffersPersson15
- Similar lights can share shadow buffers
- Useful for car lights etc.
Conservative Rasterization
- If any part of a primitive overlaps a pixel, that pixel is considered covered and is then rasterizedOrtegrenPersson16
Algorithm:OrtegrenPersson16
- For each light type:
- Shell pass
- Find min / max depths in every tile for every light
- Conservative RasterizationPersson15
- Fill pass
- Use the min / max depths and fill indices into the light linked list
- Compute shaderPersson15
- When all light types have been processed, light assignment is complete, and the light linked list can be used when shading geometry
- Lights as meshesPersson15
- Typically very low-res
- Can be LODed
Light Shape Representation:
- Meshes are created as unit shapes, where vertices are constrained to -1 to 1 in the x-, y-, and z-directions
- To allow arbitrary scaling of the shape depending on the actual light size
Shell Pass
- Responsible for finding the clusters for a light shape that encompasses it in cluster space
- Finds the near / far clusters for each tile for each light and stores them in an R8G8 RT
- Number of RTs == Maximum number of visible lights for each light type
- All RTs have the same size and format and are set up in a Texture2DArray for each light type
- Sizes of the RTs are the same as the x- and y-dimensions of the cluster structure (tile dimension)
- Set the D3D12_CONSERVATIVE_RASTERIZATION_MODE_ON flag when creating a PSO
- Vertex Shader
- Each light type has its own custom vertex shader for translating, rotating, and scaling the light mesh to fit the actual light
- SV_InstanceID is used to extract the position, scale, and other properties to transform each vertex to the correct location in world space
- Sent to GS containing the view-space position and its light ID
- Unit meshPersson15
- One draw-call per light typePersson15
- Geometry Shader
- Assigns array IDPersson15
- Can be done in VS now though
- Pixel Shader
- Compute exact depth rangePersson15
- Texture Array, e.g. 24 × 16 × N, R8G8 formatPersson15
- Conservative RasterizationPersson15
- Touch all relevant tiles
- Compute exact depth range within pixelPersson15
- Triangle fully covers pixel
- Compute min & max from depth gradient
- Pixel fully covers triangle
- Use min & max from vertices
- Partial coverage
- Compute min & max at intersections
- MIN blending resolves overlap
- Output 1-G to G to accomplish MINMAX
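A hedged sketch of the “output 1-G to G” trick above: with MIN blending on the R8G8 target, R keeps the minimum depth and G keeps 1 - maximum (since min(1 - z) = 1 - max(z)); NormalizedClusterDepth is an assumed helper mapping depth into [0, 1]:
float2 ShellPS(float4 svPos : SV_Position) : SV_Target
{
    float z = NormalizedClusterDepth(svPos);   // assumed helper, result in [0, 1]
    // MIN blending: R accumulates min(z), G accumulates min(1 - z) = 1 - max(z)
    return float2(z, 1.0f - z);
}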
Fill Pass
- Compute shaderPersson15
- 1 thread per tile per light
- Light linked-list
Source Code Analysis
Root Signature:
- Default
- 1 CBV
- Camera information
- 1 SRV
- Point / spot lights
Draw:
- PSO Point
- Input Layout: POSITION, R8G8B8A8_SNORM
- Root Signature: Default
- VS: LAPointLight.vertex
- GS: LA.geometry
- PS: LAfront.pixel
Future Work
- Clustering strategiesPersson15
- Screen-space tiles, depth vs. distance
- View-space cascades
- World space
- Allows light evaluation outside of view-frustum (reflections etc.)
- Dynamic adjustments?
- Shadows
- Culling clusters based on max-z in shadow buffer?
- Light assignment
- Asynchronous compute
Alternative Implementations
- Alternative clustering scheme
- World-space, fixed grid clustering schemePersson15
- Alternative light listPersson15
- Bitfield of lights (see the bit-scan sketch after this list)
- Single fetch
- Constant and low memory requirements
- Suitable with low to moderate light counts
- Clustered lightmapping?Persson15
- LightMap stores light bitfield per texel
- Shadow fetched for enabled lights
- Dead space optimization?
- Limited dynamic lighting supportPersson15
- Turn lights on/off
- Vary light color, intensity, falloff
- Reduce radius
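A minimal sketch of consuming such a light bitfield (assuming a single 32-bit word per cluster or lightmap texel); firstbitlow returns the index of the lowest set bit:
uint bits = clusterLightBits;             // assumed: fetched per cluster / lightmap texel
while (bits != 0)
{
    uint lightIndex = firstbitlow(bits);  // lowest enabled light
    bits &= bits - 1u;                    // clear that bit
    // ... evaluate light 'lightIndex' here ...
}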
Per-Pixel Linked ListBezrati14
struct LightFragmentLink
{
float m_LightDepthMax;
float m_LightDepthMin;
uint m_LightIndex;
uint m_Next;
};
- Compressed version:
struct LightFragmentLink
{
uint m_DepthInfo;
uint m_IndexNext;
};
- G-Buffer
- Fill Linked List
- Light G-Buffer
- Custom Materials
- Alpha
Light Linked List (LLL)
- Generate down-sized depth buffer
- Use conservative depth selection
- Use GatherRed
- Shader steps
- Software depth test
- Software test front faces
- Acquire min / max depth
- Allocate a LLL fragment
Depth Test:
// If Z test fails for the front face, skip all fragments
if ((pFace == true) & (light_depth > depth_buffer))
{
return;
}
- Depth Bounds
- RWByteAddressBuffer
- Encode Depth + ID
- 16 bits ID
- 16 bits Depth
uint new_bounds_info = (light_index << 16) | f32tof16(light_depth);
- Use InterlockedExchange
- Fragment Links
- Use a RWStructuredBuffer for storage
struct LightFragmentLink
{
uint m_DepthInfo; // High bits min depth, low bits max depth
uint m_IndexNext; // Light index and link to the next fragment
};
RWStructuredBuffer<LightFragmentLink> g_LightFragmentLinkedBuffer;
- Allocate LLL Fragment
- Increment current count
// Allocate
uint new_lll_idx = g_LightFragmentLinkedBuffer.IncrementCounter();
// Don't overflow
if (new_lll_idx >= g_VP_LLLMaxCount)
{
return;
}
- Track last entry
- StartOffset: a RWByteAddressBuffer updated with InterlockedExchange
- Light fragment encoding
- Fill the linked light fragment and store it
// Final output
LightFragmentLink element;
// Pack the light depth
element.m_DepthInfo = (light_depth_min << 16) | light_depth_max;
// Index / Link
element.m_IndexNext = (light_index << 24) | (prev_lll_idx & 0xFFFFFF);
// Store the element
g_LightFragmentLinkedBuffer[new_lll_idx] = element;
Lighting the G-Buffer
- Draw full-screen quad
- Access the LLL
- Apply the light
Accessing the SRVs:
- Fetch the first linked element offset
- The first linked element is encoded in the lower 24 bits
uint src_index = LLLIndexFromScreenUVs(screen_uvs);
uint first_offset = g_LightStartOffsetView[src_index];
// Decode the first element index
uint element_index = (first_offset & 0xFFFFFF);
Light Loop:
- Start the lighting loop
- An element index equal to 0xFFFFFF is invalid
// Iterate over the light linked list
while (element_index != 0xFFFFFF)
{
// Fetch
LightFragmentLink element = g_LightFragmentLinkedView[element_index];
// Update the next element index
element_index = (element.m_IndexNext & 0xFFFFFF);
}
Decoding light depth:
- Decode the light min / max depth
- Compare the light depth
// Decode the light bounds
float light_depth_max = f16tof32(element.m_DepthInfo >> 0);
float light_depth_min = f16tof32(element.m_DepthInfo >> 16);
// Do depth bounds check
if ((l_depth > light_depth_max) || (l_depth < light_depth_min))
{
continue;
}
Access light info:
- Fetch the full light information
// Decode the light index
uint light_index = (element.m_IndexNext >> 24);
// Access
GPULightEnv light_env = g_LinkedLightsEnvs[light_index];
// Detect the light type
switch (light_env.m_LightType)
{
...
3D Light GridAnagnostou17
- View space light grid of dimensions 29 x 16 x 32
- Screen space tile of 64 x 64 pixels and 32 z-partitions
- Partitioning is exponential (see the sketch after this list)
- Assign 9 lights and 2 reflection probes
- Axis-aligned box of each cell to perform light bounding volume intersections
- To store the light indices, a linked list is used which is then converted to a contiguous array during the “compact” pass
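A sketch of how such exponential z-partitioning is commonly computed; the 64-pixel tiles and 32 slices follow the numbers above, while the near/far planes and function name are placeholders.
static const uint  NUM_Z_SLICES = 32;
static const float NEAR_Z       = 0.1;
static const float FAR_Z        = 1000.0;

uint3 ComputeClusterCoord(float2 pixelPos, float viewZ)
{
    uint2 tile = uint2(pixelPos) / 64; // 64 x 64 pixel screen-space tiles

    // Slice boundaries at NEAR_Z * (FAR_Z / NEAR_Z)^(i / N): equal depth ratios
    // rather than equal distances, so nearby depth ranges get more slices.
    float slice  = log(viewZ / NEAR_Z) / log(FAR_Z / NEAR_Z) * NUM_Z_SLICES;
    uint  zSlice = min((uint)max(slice, 0.0), NUM_Z_SLICES - 1);
    return uint3(tile, zSlice);
}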
Optimizations
The most important optimization for the lighting pass is to render only those lights that actually affect the final image, and for those lights, render only the affected pixels.Shishkovtsov05Thibieroz11
- Social Stage:
- Filter the lights and effects on the scene to produce a smaller list of sources to be processed
- Execute visibility and occlusion algorithms to discard lights whose influence is not appreciable
- Project visible sources' bounding objects into screen space
- Combine similar sources that are too close in screen space or influence almost the same screen area
- Discard sources with a tiny contribution because their projected bounding object is too small or too far away
- Check that no more than a predefined number of sources affect each screen region; keep the biggest, strongest, and closest sources
- Individual Stage:
- Select the appropriate level of detail.
- Enable and configure the source shaders
- Compute the minimum and maximum screen coordinates of the projected bounding object (sketched below)
- Enable the scissor test
- Enable the clipping planes
- Render a screen quad or the bounding object
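A sketch of that min/max screen-rect computation, written as HLSL for consistency with these notes even though engines typically do it on the CPU before setting the scissor state; g_ViewProj, g_ScreenSize, and the function name are assumptions.
cbuffer CameraCB // assumed bindings
{
    float4x4 g_ViewProj;
    float2   g_ScreenSize;
};

// Returns (minX, minY, maxX, maxY) in pixels for a light's world-space AABB.
float4 ComputeLightScissorRect(float3 aabbMin, float3 aabbMax)
{
    float2 rectMin = float2( 1e9,  1e9);
    float2 rectMax = float2(-1e9, -1e9);

    [unroll]
    for (uint i = 0; i < 8; ++i)
    {
        // Walk the 8 corners of the bounding box.
        float3 corner = float3((i & 1) ? aabbMax.x : aabbMin.x,
                               (i & 2) ? aabbMax.y : aabbMin.y,
                               (i & 4) ? aabbMax.z : aabbMin.z);

        float4 clip = mul(float4(corner, 1.0), g_ViewProj);
        clip.w = max(clip.w, 1e-4); // crude guard; corners behind the camera
                                    // really need clipping against the near plane

        float2 ndc = clip.xy / clip.w;
        float2 pix = (ndc * float2(0.5, -0.5) + 0.5) * g_ScreenSize;

        rectMin = min(rectMin, pix);
        rectMax = max(rectMax, pix);
    }

    return float4(clamp(rectMin, float2(0, 0), g_ScreenSize),
                  clamp(rectMax, float2(0, 0), g_ScreenSize));
}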
- Constant Waterfall sucks!WhiteBarreBrisebois11
- Kills performance
- Use the aL register when iterating over lights
- If set properly, ALU / lighting will run at 100% efficiency
int lightCounter[4] = { count, start, step, 0 };
pDevice->SetPixelShaderConstantI(0, lightCounter, 1);
int tileLightCount : register(i0);
float4 lightParams[NUM_LIGHT_PARAMS] : register(c0);
[loop]
for (int iLight = 0; // start
iLight < tileLightCount; // count * step
++iLight) // step
{
float4 params1 = lightParams[iLight + 0]; // mov r0 c0[0 + aL]
float4 params2 = lightParams[iLight + 1]; // mov r1 c0[1 + aL]
float4 params3 = lightParams[iLight + 2]; // mov r2 c0[2 + aL]
}
- These shaders are ALU boundWhiteBarreBrisebois11
- Simplify math especially in the loops
- Get rid of complicated non 1:1 instructions (e.g. smoothstep)
- Play with microcode: -normalize(v) is faster than normalize(-v)
- Move code around to help with dual-issuing
- Use shader predicates to help the compiler
[flatten], [branch], [isolate], [ifAny], [ifAll], etc.
- Tweak GPRs
- Use GPU freebiesWhiteBarreBrisebois11
- Texture sampler scale / bias (* 2 - 1)
- Simplify / remove unneeded code via permutations
- Upload constants via the constant buffer pointers
Sun Rendering
S.T.A.L.K.E.R case:Shishkovtsov05
- Skybox doesn’t need to be shaded
- Pixels facing away from the sun don’t need to have complex shading applied
- Pixels with AO term of 0 can also be excluded from processing
Killzone 2 case:Valient07
- Stencil mark potentially lit pixels
- Use only sun occlusion from G-Buffer
Level of Detail Lighting
Blending Cost
- Use Discard() to get rid of pixels not contributing any light (sketched below)
- Use this regardless of the light processing method used
- Can result in a significant increase in performance
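A sketch of that early-out (the binding names and attenuation math are illustrative): fragments whose contribution is below one LSB are discarded so no blending work happens for them.
cbuffer LightParams : register(b0) // assumed bindings
{
    float3 g_LightPos;
    float  g_LightRadius;
    float3 g_LightColor;
};

float4 PointLightPS(float4 svPos    : SV_Position,
                    float3 worldPos : TEXCOORD0,
                    float3 normal   : TEXCOORD1) : SV_Target
{
    float3 toLight = g_LightPos - worldPos;
    float  atten   = saturate(1.0 - length(toLight) / g_LightRadius);
    float3 color   = g_LightColor * atten * saturate(dot(normalize(normal), normalize(toLight)));

    // Discard pixels that would add (almost) no light, whatever light processing method is used.
    if (dot(color, float3(1.0, 1.0, 1.0)) < (1.0 / 255.0))
        discard;

    return float4(color, 0.0);
}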
Shadows
Shadow Maps
The key is using the little-used variant known as forward shadow mapping: the object's position is projected into shadow-map space and the depths are compared there.Calver03Thibieroz04
The first step is to calculate the shadow map; this is exactly the same as a conventional renderer.Calver03
When the light that generated the shadow map is rendered, the shadow map is attached to the light shader in the standard fashion (a cube map for the point light case).Calver03
- Point Light Shadow Maps
- Cube mapsKoonce07
- Else
- 2D texturesKoonce07
- All textures are floating-point textures, with multisample jitter sampling to smooth out the shadows.Koonce07
- Static shadow lights are built only once and reused each frameKoonce07
- Dynamic shadow maps are rebuilt each frameKoonce07
- By flagging geometry as static or dynamic, static shadow maps can cull out dynamic geometriesKoonce07
Efficient Omni Lights
Three major options:Shishkovtsov05
- Using a cube map for storing distance from the light center, R32F or R8G8B8A8 packed
- Using a 2D surface with “unrolled” cube-map faces, with reindexing done through a small cube map (called virtual shadow depth cube texture)
- Treating the point light as six spotlights and rendering each separately
| |Cube Map|Virtual Shadow Depth Cube Texture|Six Spotlights|
|---|---|---|---|
|Scalability and Continuity|Low: few fixed sizes, all faces are the same|Moderate: faces can be of different sizes, but only from a few fixed sets|Excellent: any variation of sizes is possible|
|Hardware Filtering Support|No|Yes|Yes|
|Cost of Filtering|Moderate|Excellent for bilinear, moderate for arbitrary percentage-closer filtering|Excellent|
|Render Target Switches|Six|One|One|
|Packing Support|No|Yes|Yes|
|Cost of Screen Space Stencil Masking|Low|Low|Moderate: some stencil overdraw|
|Memory Cost and Bandwidth Usage|High: surface is almost unusable for everything else|Moderate: few fixed sizes limit packing ability|Excellent|
Shishkovtsov05
Post Processing Phase
- Glow
- Auto-Exposure
- Distortion
- Edge-smoothing
- Fog
HDR
Render your scene to multiple 32 bit buffers, then use a 64 bit accumulation buffer during the light phase.Hargreaves04
Minor Architectures
The X-Ray Rendering ArchitectureLobanchikovGruen09
- G-Stage
- Light Stage
- Light Combine
- Transparent Objects
- Bloom/Exposition
- Final Combine-2
- Post-Effects
G-Stage
- Output geometry attributes (albedo, specular, position, normal, AO, material)
- MSAA output (subsample geometry data)
Light Stage
- Calculate lighting (diffuse light-RGB, specular light - intensity only)
- Interleaved rendering with shadowmap
- Draw emissive objects
- MSAA output (subsample lighting)
- Read from MSAA source (use G-Stage data)
Light Combine
- Deferred lighting is applied here
- Hemisphere lighting is calculated here (using both the AO light-map and SSAO)
- Perform tone-mapping here
- Output Hi and Lo part of tone-mapped image into 2 RTs
- MSAA output (subsample data combination)
- Read from MSAA source (use G-Stage data and Light Stage data)
Transparent Objects
- Basic forward rendering
- MSAA output
Bloom / exposition
- Use Hi RT as a source for bloom / luminance estimation
Final combine-2
- Apply DoF, distortion, bloom
Post-Effects
- Apply black-outs, film grain, etc.
Light Indexed Deferred Rendering
Three basic render passes:Trebilco09
- Render depth only pre-pass
- Disable depth writes (depth testing only) and render light volumes into a light index texture
- Standard deferred lighting / shadow volume techniques can be used to find what fragments are hit by each light volume
- Render geometry using standard forward rendering
- Lighting is done using the light index texture to access lighting properties in each shader
In order to support multiple light indexes per fragment, it would be ideal to store the first light index in the texture's red channel, the second light index in the blue channel, etc.Trebilco09
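A sketch of the forward-pass lookup under that packing, assuming an RGBA8 light-index texture where each channel stores index / 255 and index 0 is reserved for "no light"; the resource names and light struct are illustrative, not Trebilco's code.
struct GPULight
{
    float3 position;
    float  radius;
    float3 color;
    float  pad;
};

Texture2D<float4>          g_LightIndexTex : register(t0);
SamplerState               g_PointSampler  : register(s0);
StructuredBuffer<GPULight> g_Lights        : register(t1);

float3 ShadeLIDR(float2 screenUV, float3 position, float3 normal)
{
    float4 packed     = g_LightIndexTex.SampleLevel(g_PointSampler, screenUV, 0);
    float  indices[4] = { packed.r, packed.g, packed.b, packed.a };
    float3 result     = 0;

    for (uint i = 0; i < 4; ++i)
    {
        uint lightIndex = (uint)round(indices[i] * 255.0); // recover the 8-bit index
        if (lightIndex == 0)
            continue; // channel holds no light

        GPULight light   = g_Lights[lightIndex];
        float3   toLight = light.position - position;
        float    atten   = saturate(1.0 - length(toLight) / light.radius);
        result += light.color * atten * saturate(dot(normal, normalize(toLight)));
    }
    return result;
}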
- Advantages:Trebilco09
- vs Standard Deferred Rendering
- Uses forward rendering so no need for “fat buffers” to store normal/position type data
- Can layer on existing light schemes
- Small buffers size (varies depending on how many lights per fragment are supported)
- Light calculations like the reflection vector only needs to be calculated once
- MSAA can be supported with fewer resources
- Transparency can be supported
- vs Multi-pass Forward Rendering
- Can render lots of lights with only a fragment size cost per light
- Only two passes of the scene geometry - depth only pass then a forward render color pass
- Do not have to break geometry up into pieces for individual lighting - can render huge vertex buffers
- No Object → Light interactions need to be calculated on the CPU (for non-shadowing lights)
- Light calculations like reflection vectors only needs to be calculated once and texture lookups and filtering only need to be done once
- Can render “mesh-shaped” lights - not limited to sphere / cone shaped lights
- vs Multi-light Forward Rendering
- Can render lots of lights with only a fragment size cost per light
- Do not have to break geometry up into pieces for individual lighting - can render huge vertex buffers
- Can render “mesh-shaped” lights - not limited to sphere / cone shaped lights
- No Object → Light interactions need to be calculated on the CPU (for non-shadowing lights)
- Disadvantages:Trebilco09
- vs Standard Deferred Rendering
- Exotic lighting types are harder to support (e.g., projected texture light)
- Need to set a limit on how many lights can hit each fragment (current implementation has a max of 16)
- Need to pass the vertex geometry twice - once for depth pre-pass and once for the forward pass
- Depth pre-pass is not vital for light indexed deferred rendering(LIDR) but it allows a lot of optimization
- Shadows are harder to support
- vs Multi-pass Forward Rendering
- Exotic lighting types are harder to support (e.g., projected texture light)
- Need to set a limit on how many lights can hit each fragment (current implementation has a max of 16)
- Requires a full screen buffer to store light index data
- All scene shaders need to be updated to support LIDR
- Slower on scenes that have few objects and lights
- Shadows are harder to support
- vs Multi-light Forward Rendering
- Exotic lighting types are harder to support (e.g., projected texture light)
- Need to set a limit on how many lights can hit each fragment (current implementation has a max of 16)
- Requires a full screen buffer to store light index data
- Can require two passes of scene geometry - depth only pass then a forward render color pass
- Slower on scenes that have few objects and lights
- Shadows are harder to support
Matt Pettineo’s approachPettineo12
- Depth-only Prepass
- Depth buffer used by a compute shader to compute the list of intersecting lights per-tile
- List is stored in either an R8_UINT or R16_UINT typed buffer (8-bit for < 255 lights)
Space MarineKimBarrero11
|Pass|Budget (ms)|
|---|---|
|Depth-Pre|0.50|
|G-Buffer + Linear Depth|5.05|
|AO|2.25|
|Lighting|8.00|
|Combiner Pass|5.00|
|Blend|0.15|
|Gamma Conversion|1.30|
|FX|2.75|
|Post Processing|3.70|
|UI|0.50|
|Total|29.20|
Screen-Space ClassificationKnightRitchieParrish11
Divided the screen into 4 × 4 pixel tiles. Each tile is classified according to the minimum global light properties it requires:
- Sky
- Fastest pixels because no lighting calculations required
- Sky color is simply copied directly from the G-Buffer
- Sun light
- Pixels facing the sun requires sun and specular lighting calculations (unless they’re fully in shadow)
- Solid shadow
- Pixels fully in shadow don’t require any shadow or sun light calculations
- Soft shadow
- Pixels at the edge of shadows require expensive eight-tap percentage closer filtering (PCF) unless they face away from the sun
- Shadow fade
- Pixels near the end of the dynamic shadow draw distance fade from full shadow to no shadow to avoid pops as geometry moves out of the shadow range
- Light scattering
- All but the nearest pixels
- Antialiasing
- Pixels at the edges of polygons require lighting calculations for both 2X MSAA fragments
Four of these properties are classified during screen-space shadow mask generation; the other three in a per-pixel pass.
Inferred Lighting
- Developed by Volition, Inc.
- Low-res MRT Geometry Pass
- Normals, DSF ID, Depth
- Low-res Lighting Pass
- Normals, Depth used
- Full-res Material Pass
- DSF ID and Low-res Lighting Pass used
Features:
- Lots of fully dynamic lights
- Integrated alpha lighting (no forward rendering)
- Hardware MSAA support
- Lit rain (IL required)
- Better foliage support (applies only to IL)
- Screen-space decals (enhanced by IL)
- Radial AO (RAO) (optimized by IL)
Hybrid Deferred RenderingSousaWenzelRaine13
- Deferred lighting still processed as usual
- L-Buffers now using BW friendlier R11G11B10F formats
- Precision was sufficient, since material properties not applied yet
- Deferred shading composited via fullscreen pass
- For more complex shading such as Hair or Skin, process forward passes
- Allowed us to drop almost all opaque forward passes
- Less drawcalls, but G-Buffer passes now with higher cost
- Fast Double-Z prepass for some of the closest geometry helps slightly
- Overall was a nice win, on all platforms*
Destiny Engine Deferred RenderingTatarchukTchouVenzon13
- G-Buffers (96 bits)
- Depth, normal, material ids
- Opaque geometries + Decals
- Highly-compressed
- L-Buffers
- Lighting accumulation
- Light Geometry
- Lights
- Lit Result
- Full-screen shading
- Advantages:
- Memory footprint fits in EDRAM (96 bpp)
- Single pass over geometry (especially important for decorators / foliage)
- Unified lighting + materials (no matrix of lights vs. geometry types like in Halo)
- Allows cheap deferred decals
- Complex material appearance
- Separate lighting / shading / geometry shaders simplifies shaders
- Disadvantages:
Rainbow Six SiegeElMansouri16
Opaque Rendering
- First person rendering
- 400 best occluders to depth buffer
- Generate Hi-Z
- Opaque culling & rendering
Shadow Rendering
- All shadows are cache based
- Used cached Hi-Z for culling
- Sunlight shadow done in full resolution
- Separate pass to relieve lighting resolve VGPR pressure
- Uses Hi-Z representation of the cached shadow map to reduce the work per pixel
- Local lights are resolved in a quarter resolution
- Resolved results stored in a texture array
- Lower VGPR usage on light accumulation
- Bilateral upscale
Lighting
- Clustered structure on the frustum
- 32 x 32 pixels based tile
- Z exponential distribution
- Hierarchical culling of light volume to fill the structure
- Local cubemaps regarded as lights
- Shadows, cubemaps, and gobos reside in texture arrays
- Deferred uses pre-resolved shadow texture array
- Forward uses shadows depth buffer array
Checkerboard Rendering
- Rendering to a 1/4 size (1/2 width by 1/2 height) resolution with MSAA 2X:
- We end up with half the samples of the full resolution image
- D3D MSAA 2X standard pattern
- 2 Color and Z samples
- Sample modifier or SV_SampleIndex input to enforce rendering all samples (see the sketch below)
- Each sample falls on the exact pixel center of the full-resolution render target
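A sketch of forcing per-sample execution: either the sample interpolation modifier on an input or reading SV_SampleIndex makes the pixel shader run once per MSAA sample. ShadePixel is a placeholder for the real shading.
float4 ShadePixel(float2 uv, uint sampleIndex)
{
    return float4(uv, (float)sampleIndex, 1.0); // placeholder shading
}

float4 CheckerboardPS(float4 svPos       : SV_Position,
                      sample float2 uv   : TEXCOORD0,      // `sample` modifier
                      uint   sampleIndex : SV_SampleIndex) : SV_Target
{
    // With D3D's standard 2x pattern on the quarter-size target, the two samples
    // land on pixel centers of the full-resolution image (checkerboard layout).
    return ShadePixel(uv, sampleIndex);
}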
Issues
Transparency
The best (in speed terms) we can do currently is to fall-back to a non-deferred lighting system for transparent surfaces and blend them in post-processing.Calver03Hargreaves04
Depth peeling is the ultimate solution, but is prohibitively expensive at least for the time being.Hargreaves04
StarCraft II uses multipass forward approach:FilionMcNaughton08
- Proved to be more scalable
- No need for more than a single shadow map buffer for the local lighting
StarCraft II’s simple layered system:
- Opaque Pass
- Create depth map from opaque objects
- Render opaque objects
- Apply depth-dependent post-processing effects
- Transparency Pass
- Render transparent objects back to front
- Key transparencies are allowed to perform pre-pass where they overwrite the g-buffer
- Since all post-processing on previous g-buffer data has been applied, that information is no longer needed
- Update AO deferred buffer
- Render the transparency
- Perform DoF pass on the areas covered by the transparency
Memory
No solutions but a warning that deferred lighting has a number of large render-targets.Calver03
Anti-Aliasing
- Super sampling lighting is a costly optionHargreavesHarris04
- Filter object edgesHargreavesHarris04
Antialiasing becomes solely the responsibility of the application and the shader; we cannot rely on the GPU alone.Shishkovtsov05
Edge Detection
Edge-smoothing filter by Fabio05.Placeres06:
- An edge-detection scan is applied to the screen. The filter uses discontinuities in the positions and normals stored in the G-Buffer. The results can be stored in the stencil buffer as a mask for the next step.
- The screen is blurred using only the pixels that are edges
- These pixels are masked in the stencil buffer
- However, color bleeding can occur (e.g., background color bleeding into the character)
- Thus, a kernel is applied to the edge pixels, but only the closest to the camera are combined
- Color bleeding reduction
Pixel Edge Detection (Pixel Shader):Thibieroz09
// Pixel shader to detect pixel edges
// Used with the following depth-stencil state values:
// DepthEnable = TRUE
// DepthFunc = Always
// DepthWriteMask = ZERO
// StencilEnable = TRUE
// Front/BackFaceStencilFail = Keep
// Front/BackfaceStencilDepthFail = Keep
// Front/BackfaceStencilPass = Replace;
// Front/BackfaceStencilFunc = Always;
// The stencil reference value is set to 0x80
float4 PSMarkStencilWithEdgePixels( PS_INPUT input ) : SV_TARGET
{
// Fetch and compare samples from GBuffer to determine if pixel
// is an edge pixel or not
bool bIsEdge = DetectEdgePixel(input);
// Discard pixel if non-edge (only mark stencil for edge pixels)
if (!bIsEdge) discard;
// Return color (will have no effect since no color buffer is bound)
return float4(1,1,1,1);
}
Centroid-Based Edge Detection
An optimized way to detect edges is to leverage the GPU’s fixed function resolve feature. Centroid sampling is used to adjust the sample position of an interpolated pixel shader input so that it is contained within the area defined by the multisamples covered by the triangle.Thibieroz09
Centroid sampling can be used to determine whether a sample belongs to an edge pixel or not.Thibieroz09
This MSAA edge detection technique is quite fast, especially compared to a custom method of comparing every G-Buffer normal and depth samples. It only requires a few bits of storage in a G-Buffer render target.Thibieroz09
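One way to realize this (a sketch, not Thibieroz's exact code): the VS writes the same value to two interpolants, one declared with centroid interpolation. If they differ in the PS, the pixel center lies outside the triangle, so the pixel is an edge pixel and the flag can be written to a spare G-Buffer bit.
struct PS_INPUT
{
    float4 vPos             : SV_POSITION;
    float2 vValue           : TEXCOORD0; // default (pixel-center) interpolation
    centroid float2 vValueC : TEXCOORD1; // centroid interpolation of the same value
};

float IsEdgePixelViaCentroid(PS_INPUT input)
{
    float2 delta = abs(input.vValue - input.vValueC);
    return (delta.x + delta.y) > 0.0 ? 1.0 : 0.0;
}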
- This is a neat trick, but is not that usefulThibieroz11
- Produces too many edges that don’t need to be shaded per sample
- Especially when tessellation is used!!
- Doesn’t detect edges from transparent textures
S.T.A.L.K.E.R.Shishkovtsov05
Our solution was to trade some signal frequency at the discontinuities for smoothness, and to leave other parts of the image intact. We detect discontinuities in both depth and normal direction by taking 8+1 samples of depth and finding how depth at the current pixel differs from the ideal line passed through opposite corner points. The normals were used to fix issues such as a wall perpendicular to the floor, where the depth forms a perfect line (or will be similar at all samples) but an aliased edge exists. The normals were processed in a similar cross-filter manner, and the dot product between normals was used to determine the presence of an edge.
struct v2p
{
float4 tc0: TEXCOORD0; // Center
float4 tc1: TEXCOORD1; // Left Top
float4 tc2: TEXCOORD2; // Right Bottom
float4 tc3: TEXCOORD3; // Right Top
float4 tc4: TEXCOORD4; // Left Bottom
float4 tc5: TEXCOORD5; // Left / Right
float4 tc6: TEXCOORD6; // Top /Bottom
};
/////////////////////////////////////////////////////////////////////
uniform sampler2D s_distort;
uniform sampler2D s_normal;   // G-Buffer normals
uniform sampler2D s_position; // G-Buffer positions
uniform sampler2D s_image;    // Scene color to be smoothed
uniform half4 e_barrier; // x=norm(~.8f), y=depth(~.5f)
uniform half4 e_weights; // x=norm, y=depth
uniform half4 e_kernel; // x=norm, y=depth
/////////////////////////////////////////////////////////////////////
half4 main(v2p I) : COLOR
{
// Normal discontinuity filter
half3 nc = tex2D(s_normal, I.tc0);
half4 nd;
nd.x = dot(nc, (half3)tex2D(s_normal, I.tc1));
nd.y = dot(nc, (half3)tex2D(s_normal, I.tc2));
nd.z = dot(nc, (half3)tex2D(s_normal, I.tc3));
nd.w = dot(nc, (half3)tex2D(s_normal, I.tc4));
nd -= e_barrier.x;
nd = step(0, nd);
half ne = saturate(dot(nd, e_weights.x));
// Opposite coords
float4 tc5r = I.tc5.wzyx;
float4 tc6r = I.tc6.wzyx;
// Depth filter : compute gradiental difference:
// (c-sample1)+(c-sample1_opposite)
half4 dc = tex2D(s_position, I.tc0);
half4 dd;
dd.x = (half)tex2D(s_position, I.tc1).z +
(half)tex2D(s_position, I.tc2).z;
dd.y = (half)tex2D(s_position, I.tc3).z +
(half)tex2D(s_position, I.tc4).z;
dd.z = (half)tex2D(s_position, I.tc5).z +
(half)tex2D(s_position, tc5r).z;
dd.w = (half)tex2D(s_position, I.tc6).z +
(half)tex2D(s_position, tc6r).z;
dd = abs(2 * dc.z - dd)- e_barrier.y;
dd = step(dd, 0);
half de = saturate(dot(dd, e_weights.y));
// Weight
half w = (1 - de * ne) * e_kernel.x;
// 0 - no aa, 1=full aa
// Smoothed color
// (a-c)*w + c = a*w + c(1-w)
float2 offset = I.tc0 * (1-w);
half4 s0 = tex2D(s_image, offset + I.tc1 * w);
half4 s1 = tex2D(s_image, offset + I.tc2 * w);
half4 s2 = tex2D(s_image, offset + I.tc3 * w);
half4 s3 = tex2D(s_image, offset + I.tc4 * w);
return (s0 + s1 + s2 + s3)/4.h;
}
Tabula RasaKoonce07
Modified S.T.A.L.K.E.R.’s algorithm to be resolution independent.
We looked at changes in depth gradients and changes in normal angles by sampling all eight neighbors surrounding a pixel. We compare the maximum change in depth to the minimum change in depth to determine how much of an edge is present. By comparing relative changes in this gradient instead of comparing the gradient to fixed values, we are able to make the logic resolution independent.
We compare the changes in the cosine of the angle between the center pixel and its neighboring pixels along the same edges at which we test depth gradients.
The output of the edge detection is a per-pixel weight between zero and one. The weight reflects how much of an edge the pixel is on. We use this weight to do four bilinear samples when computing the final pixel color. The four samples we take are at the pixel center for a weight of zero and at the four corners of the pixel for a weight of one. This results in a weighted average of the target pixel with all eight of its neighbors.
////////////////////////////
// Neighbor offset table
////////////////////////////
const static float2 offsets[9] =
{
float2( 0.0, 0.0), //Center 0
float2(-1.0, -1.0), //Top Left 1
float2( 0.0, -1.0), //Top 2
float2( 1.0, -1.0), //Top Right 3
float2( 1.0, 0.0), //Right 4
float2( 1.0, 1.0), //Bottom Right 5
float2( 0.0, 1.0), //Bottom 6
float2(-1.0, 1.0), //Bottom Left 7
float2(-1.0, 0.0) //Left 8
};
float DL_GetEdgeWeight(in float2 screenPos)
{
float Depth[9];
float3 Normal[9];
//Retrieve normal and depth data for all neighbors.
for (int i=0; i<9; ++i)
{
float2 uv = screenPos + offsets[i] * PixelSize;
Depth[i] = DL_GetDepth(uv); //Retrieves depth from MRTs
Normal[i]= DL_GetNormal(uv); //Retrieves normal from MRTs
}
//Compute Deltas in Depth.
float4 Deltas1;
float4 Deltas2;
Deltas1.x = Depth[1];
Deltas1.y = Depth[2];
Deltas1.z = Depth[3];
Deltas1.w = Depth[4];
Deltas2.x = Depth[5];
Deltas2.y = Depth[6];
Deltas2.z = Depth[7];
Deltas2.w = Depth[8];
//Compute absolute gradients from center.
Deltas1 = abs(Deltas1 - Depth[0]);
Deltas2 = abs(Depth[0] - Deltas2);
//Find min and max gradient, ensuring min != 0
float4 maxDeltas = max(Deltas1, Deltas2);
float4 minDeltas = max(min(Deltas1, Deltas2), 0.00001);
// Compare change in gradients, flagging ones that change
// significantly.
// How severe the change must be to get flagged is a function of the
// minimum gradient. It is not resolution dependent. The constant
// number here would change based on how the depth values are stored
// and how sensitive the edge detection should be.
float4 depthResults = step(minDeltas * 25.0, maxDeltas);
//Compute change in the cosine of the angle between normals.
Deltas1.x = dot(Normal[1], Normal[0]);
Deltas1.y = dot(Normal[2], Normal[0]);
Deltas1.z = dot(Normal[3], Normal[0]);
Deltas1.w = dot(Normal[4], Normal[0]);
Deltas2.x = dot(Normal[5], Normal[0]);
Deltas2.y = dot(Normal[6], Normal[0]);
Deltas2.z = dot(Normal[7], Normal[0]);
Deltas2.w = dot(Normal[8], Normal[0]);
Deltas1 = abs(Deltas1 - Deltas2);
// Compare change in the cosine of the angles, flagging changes
// above some constant threshold. The cosine of the angle is not a
// linear function of the angle, so to have the flagging be
// independent of the angles involved, an arccos function would be
// required.
float4 normalResults = step(0.4, Deltas1);
normalResults = max(normalResults, depthResults);
return (normalResults.x + normalResults.y +
normalResults.z + normalResults.w) * 0.25;
}
MSAA
MSAA allows a scene to be rendered at a higher resolution without having to pay the cost of shading more pixels.Thibieroz09
- To support MSAA, the MRTs must be rendered with MSAA.
- Forward shading
- Each object is shaded and rendered directly into a multisampled surface
- Once all objects have been rendered, a resolve operation is required to convert the multi-sampled render target into a final, anti-aliased image
- Deferred shading
- G-Buffer’s multisampled RTs are simply intermediate storage buffers leading to the construction of the final image
- Once all shading contributions to the scene have been rendered onto the multisampled accumulation buffer, then the resolve operation can take place on this buffer to produce the final, anti-aliased image
- Multisampled resources:
- G-Buffer RTs
- Accumulation buffer receiving the contribution of shading passes and further rendering
- Depth-stencil buffer
- In order to produce accurate results for MSAA, it is essential that the pixel shaders used during the shading passes are executed at per-sample frequency
- However, this has a significant impact on performance
- A sensible optimization is to detect pixels whose samples have different values and only perform per-sample pixel shader execution on those “edge” pixels
Run light shader at pixel resolutionValient07
- Read G-Buffer for both pixel samples
- Compute lighting for both samples
- Average results and add to frame buffer
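A sketch of that pixel-resolution light pass, assuming a 2x MSAA G-Buffer; the resource names and directional-light math are placeholders, not Valient's code.
Texture2DMS<float4> g_GBufferAlbedo : register(t0);
Texture2DMS<float4> g_GBufferNormal : register(t1);

cbuffer SunLight : register(b0) // assumed bindings
{
    float3 g_LightDir;
    float3 g_LightColor;
};

float3 LightSample(int2 pixel, int s)
{
    float3 albedo = g_GBufferAlbedo.Load(pixel, s).rgb;
    float3 normal = normalize(g_GBufferNormal.Load(pixel, s).xyz * 2.0 - 1.0);
    return albedo * g_LightColor * saturate(dot(normal, -g_LightDir));
}

float4 SunLightPS(float4 svPos : SV_Position) : SV_Target
{
    int2 pixel = int2(svPos.xy);

    // Read and light both samples, then average; the result is additively
    // blended into the (non-MSAA) frame buffer.
    float3 lit = 0.5 * (LightSample(pixel, 0) + LightSample(pixel, 1));
    return float4(lit, 1.0);
}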
S.T.A.L.K.E.R.: Clear Sky:LobanchikovGruen09
- Render to MSAA G-Buffer
- Mask edge pixels
- Process only subsample #0 for plain pixels
- Output to all subsamples
- Process each subsample for edge pixels independently
- Early stencil hardware minimizes PS overhead
- For each shader
- Plain pixel: run shader at pixel frequency
- Edge pixel: run at subpixel frequency
MSAA Compute Shader Lighting
- Only edge pixels need full per-sample lightingAndersson11
- But edges have bad screen-space coherency! InefficientStewartThomas13
- Bad for branchingStewartThomas13
- Compute Shader can build efficient coherent pixel listAndersson11
- Evaluate lighting for each pixel (sample 0)
- Determine if pixel requires per-sample lighting
- If so, add to atomic list in shared memory
- When all pixels are done, synchronize
- Go through and light sample 1-3 for pixels in list
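A sketch of the idea (not Andersson's actual code), assuming 4x MSAA, 16x16 thread groups, and placeholder ShadeSample / NeedsPerSampleShading helpers; for simplicity it re-shades sample 0 for edge pixels instead of caching it.
#define TILE_SIZE 16

Texture2DMS<float4> g_GBufferNormals : register(t0); // placeholder G-Buffer plane
RWTexture2D<float4> g_Output         : register(u0);

groupshared uint g_EdgeCount;
groupshared uint g_EdgePixels[TILE_SIZE * TILE_SIZE];

float4 ShadeSample(uint2 pixel, uint s)
{
    // Placeholder: real code reads every G-Buffer plane at sample s and runs
    // the tile's light loop.
    return g_GBufferNormals.Load(pixel, s);
}

bool NeedsPerSampleShading(uint2 pixel)
{
    // Placeholder edge test: per-sample normals differ noticeably.
    float3 n0 = g_GBufferNormals.Load(pixel, 0).xyz;
    float3 n1 = g_GBufferNormals.Load(pixel, 1).xyz;
    return dot(n0, n1) < 0.99;
}

[numthreads(TILE_SIZE, TILE_SIZE, 1)]
void TiledLightingCS(uint3 pixel : SV_DispatchThreadID, uint gidx : SV_GroupIndex)
{
    if (gidx == 0)
        g_EdgeCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Pass 1: shade sample 0 for plain pixels; append edge pixels to a coherent
    // list in shared memory instead of shading them here.
    if (NeedsPerSampleShading(pixel.xy))
    {
        uint slot;
        InterlockedAdd(g_EdgeCount, 1, slot);
        g_EdgePixels[slot] = (pixel.y << 16) | (pixel.x & 0xFFFF);
    }
    else
    {
        g_Output[pixel.xy] = ShadeSample(pixel.xy, 0);
    }
    GroupMemoryBarrierWithGroupSync();

    // Pass 2: the whole group works through the edge list, one pixel per thread,
    // shading and averaging all four samples.
    for (uint i = gidx; i < g_EdgeCount; i += TILE_SIZE * TILE_SIZE)
    {
        uint2  p   = uint2(g_EdgePixels[i] & 0xFFFF, g_EdgePixels[i] >> 16);
        float4 acc = 0;
        for (uint s = 0; s < 4; ++s)
            acc += ShadeSample(p, s);
        g_Output[p] = acc * 0.25;
    }
}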
Comparisons
| |Deferred|Tiled Deferred|Tiled Forward|
|---|---|---|---|
|Innermost loop|Pixels|Lights|Lights|
|Light data access pattern|Sequential|Random|Random|
|Pixel data access pattern|Random|Sequential|Sequential|
|Re-use Shadow Maps|Yes|No|No|
|Shading Pass|Deferred|Deferred|Geometry|
|G-Buffers|Yes|Yes|No|
|Overdraw of shading|No|No|Yes|
|Transparency|Difficult|Simple|Simple|
|Supporting FSAA|Difficult|Difficult|Trivial|
|Bandwidth Usage|High|Low|Low|
|Light volume intersection|Per Pixel|Per Tile|Per Tile|
OlssonAssarsson11
EA. SIGGRAPH. 2011.
|Light Type (8 lights/tile, every tile)|Performance|
|---|---|
|Point|4.0 ms|
|Point (with Spec)|7.8 ms|
|Cone|5.1 ms|
|Cone (with Spec)|5.3 ms|
|Line|5.8 ms|
- Classification: 1.35ms (with resolves)
Deferred vs Forward+
- DeferredPesce20
- Frostbite
- Guerrilla’s Decima
- Call of Duty BO3/4/CW
- Red Dead Redemption 2
- Naughty Dog’s Uncharted/TLOU
- Forward+Pesce20
- Doom
- Call of Duty: Modern Warfare / Warzone
References
2003
Photo-realistic Deferred Lighting. Dean Calver, Climax / Snapshot Games. Beyond3D.
2004
Deferred Shading. Shawn Hargreaves, Climax / Microsoft. GDC 2004
Deferred Shading. Shawn Hargreaves, Climax / Microsoft. Mark Harris, NVIDIA. NVIDIA Developers Conference 2004.
Deferred Shading with Multiple Render Targets. Nicolas Thibieroz, PowerVR Technologies / AMD. ShaderX2.
2005
Deferred Shading in S.T.A.L.K.E.R.. Oleksandr Shyshkovtsov, GSC Game World / 4A Games. GPU Gems 2.
2006
Overcoming Deferred Shading Drawbacks. Frank Puig Placeres, University of Informatic Sciences / Amazon. ShaderX5.
2007
Deferred Shading in Tabula Rasa. Rusty Koonce, NCSoft Corporation / Facebook. GPU Gems 3.
Deferred Rendering in Killzone 2. Michal Valient, Guerrilla Games / Epic Games. Develop Conference 2007.
Optimizing Parallel Reduction in CUDA. Mark Harris, NVIDIA.
2008
The Technology of Uncharted: Drake’s Fortune. Christophe Balestra, Naughty Dog / Retired. Pål-Kristian Engstad, Naughty Dog / Apple. GDC 2008.
StarCraft II: Effects & Techniques. Dominic Filion, Blizzard Entertainment / Snap Inc.. Rob McNaughton, Blizzard Entertainment. SIGGRAPH 2008: Advances in Real-Time Rendering in 3D Graphics and Games Course.
2009
Parallel Graphics in Frostbite - Current & Future. Johan Andersson, DICE / Embark Studios. SIGGRAPH 2009: Beyond Programmable Shading Course.
Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer. Wolfgang Engel, Rockstar Games / The Forge. ShaderX7.
Light Pre-Pass; Deferred Lighting: Latest Development. Wolfgang Engel, Rockstar Games / The Forge. SIGGRAPH 2009: Advances in Real-Time Rendering in Games Course.
Pre-lighting in Resistance 2. Mark Lee, Insomniac Games / Walt Disney Animation Studios. GDC 2009.
GSC Game World’s S.T.A.L.K.E.R: Clear Sky—A Showcase for Direct3D 10.0/1. Igor A. Lobanchikov, GSC Game World / Retired. Holger Gruen, AMD. GDC 2009.
Deferred Lighting and Post Processing on PLAYSTATION 3. Matt Swoboda, Sony Computer Entertainment / Notch. GDC 2009.
Deferred Shading with Multisampling Anti-Aliasing in DirectX 10. Nicolas Thibieroz, AMD. GDC 2009. ShaderX7.
Light-Indexed Deferred Rendering. Damian Trebilco, THQ / Situ Systems. ShaderX7.
Compact Normal Storage for small G-Buffers. Aras Pranckevičius, Unity Technologies / Freelancer. Blog.
2010
CryENGINE 3: Reaching the Speed of Light. Anton Kaplanyan, Crytek / Intel Corporation. SIGGRAPH 2010: Advances in Real-Time Rendering in Games Course.
Deferred Rendering for Current and Future Rendering Pipelines. Andrew Lauritzen, Intel Corporation. SIGGRAPH 2010: Beyond Programmable Shading Course.
2011
DirectX 11 Rendering in Battlefield 3. Johan Andersson, DICE / Embark Studios. GDC 2011
Rendering Tech of Space Marine. Pope Kim, Relic Entertainment / POCU. Daniel Barrero, Relic Entertainment. KGC 2011.
Screen-Space Classification for Efficient Deferred Shading. Balor Knight, Black Rock Studio. Matthew Ritchie, Black Rock Studio. George Parrish, Black Rock Studio. Game Engine Gems 2.
Tiled Shading. Ola Olsson, Chalmers University of Technology / Epic Games. Ulf Assarsson, Chalmers University of Technology. Journal of Graphics, GPU, and Game Tools.
Dragon Age II DX11 Technology. Andreas Papathanasis, BioWare / Parallel Space Inc.. GDC 2011.
Deferred Shading Optimizations. Nicolas Thibieroz, AMD. GDC 2011.
More Performance! Five Rendering Ideas from Battlefield 3 and Need For Speed: The Run. John White, EA Black Box / Roblox. Colin Barré-Brisebois, DICE / SEED. SIGGRAPH 2011: Advances in Real-Time Rendering in Games Course.
2012
Forward+: Bringing Deferred Lighting to the Next Level. Takahiro Harada, AMD. Jay McKee, AMD. Jason C. Yang, AMD / DGene. Eurographics 2012.
A 2.5D Culling for Forward+. Takahiro Harada, AMD. SIGGRAPH ASIA 2012.
Lighting & Simplifying Saints Row: The Third. Scott Kircher, Volition. GDC 2012.
Clustered Deferred and Forward Shading. Ola Olsson, Chalmers University of Technology / Epic Games. Markus Billeter, Chalmers University of Technology. Ulf Assarsson, Chalmers University of Technology. HPG 2012.
Tiled and Clustered Forward Shading: Supporting Transparency and MSAA. Ola Olsson, Chalmers University of Technology / Epic Games. Markus Billeter, Chalmers University of Technology. Ulf Assarsson, Chalmers University of Technology. SIGGRAPH 2012: Talks.
Light Indexed Deferred Rendering. Matt Pettineo, Ready at Dawn. The Danger Zone Blog.
2013
Tiled Forward Shading. Ola Olsson, Chalmers University of Technology / Epic Games. Markus Billeter, Chalmers University of Technology / University of Leeds. Ulf Assarsson, Chalmers University of Technology. GPU Pro 4.
Forward+: A Step Toward Film-Style Shading in Real Time. Takahiro Harada, AMD. Jay McKee, AMD. Jason C. Yang, AMD / DGene. GPU Pro 4.
The Rendering Technologies of Crysis 3. Tiago Sousa, Crytek / id Software. Carsten Wenzel, Crytek / Cloud Imperium Games. Chris Raine, Crytek. GDC 2013.
CryENGINE 3: Graphics Gems. Tiago Sousa, Crytek / id Software. Nickolay Kasyan, Crytek / AMD. Nicolas Schulz, Crytek. SIGGRAPH 2013: Advances in Real-Time Rendering in 3D Graphics and Games Course.
Tiled Rendering Showdown: Forward++ vs. Deferred Rendering. Jason Stewart, AMD. Gareth Thomas, AMD. GDC 2013.
Destiny: From Mythic Science Fiction to Rendering in Real-Time. Natalya Tatarchuk, Bungie / Unity Technologies. Chris Tchou, Bungie. Joe Venzon, Bungie. SIGGRAPH 2013: Advances in Real-Time Rendering in 3D Graphics and Games Course.
2014
inFAMOUS Second Son Engine Postmortem. Adrian Bentley, Sucker Punch Productions. GDC 2014.
Real-Time Lighting via Light Linked List. Abdul Bezrati, Insomniac Games. SIGGRAPH 2014: Advances in Real-Time Rendering in 3D Graphics and Games Course.
Forward Clustered Shading. Marc Fauconneau Dufresne, Intel Corporation. Intel Software Developer Zone.
The Making of Forza Horizon 2. Richard Leadbetter, Digital Foundry. Eurogamer.net.
Crafting a Next-Gen Material Pipeline for The Order: 1886. David Neubelt, Ready at Dawn. Matt Pettineo, Ready at Dawn. GDC 2014.
Notes on Real-Time Renderers. Angelo Pesce, Activision / Roblox. C0DE517E Blog.
Moving to the Next Generation—The Rendering Technology of Ryse. Nicolas Schulz, Crytek. GDC 2014.
Compute Shader Optimizations for AMD GPUs: Parallel Reduction. Wolfgang Engel, Rockstar Games / The Forge. Diary of a Graphics Programmer.
Survey of Efficient Representations for Independent Unit Vectors. Zina H. Cigolle, Williams College / Stripe. Sam Donow, Williams College / Hudson River Trading. Daniel Evangelakos, Williams College / Olive. Michael Mara, Williams College / Luminary Cloud. Morgan McGuire, Williams College / Roblox. Quirin Meyer, Elektrobit / Hochschule Coburg. JCGT.
2015
Real-Time Lighting via Light Linked List. Abdul Bezrati, Insomniac Games. GPU Pro 6.
More Efficient Virtual Shadow Maps for Many Lights. Ola Olsson, Chalmers University of Technology / Epic Games. Markus Billeter, Chalmers University of Technology. Erik Sintorn, Chalmers University of Technology. IEEE Transactions on Visualization and Computer Graphics.
Practical Clustered Shading. Emil Persson, Avalanche Studios / Elemental Games. SIGGRAPH 2015: Real-Time Many-Light Management and Shadows with Clustered Shading Course.
Notes on G-Buffer normal encodings. Angelo Pesce, Activision / Roblox. C0DE517E Blog.
Introduction to Real-Time Shading with Many Lights. Ola Olsson, Chalmers University of Technology / Epic Games. SIGGRAPH 2015: Real-Time Many-Light Management and Shadows with Clustered Shading Course.
Rendering the Alternate History of The Order: 1886. Matt Pettineo, Ready at Dawn. SIGGRAPH 2015: Advances in Real-Time Rendering in Games Course.
Compute-Based Tiled Culling. Jason Stewart, AMD. GPU Pro 6.
Advancements in Tiled-Based Compute Rendering. Gareth Thomas, AMD. GDC 2015.
2016
Deferred Lighting in Uncharted 4. Ramy El Garawany, Naughty Dog / Google. SIGGRAPH 2016: Advances in Real-Time Rendering in Games Course.
Rendering Tom Clancy’s Rainbow Six Siege. Jalal El Mansouri, Ubisoft Montréal / Haven Studios Inc.. GDC 2016
Clustered Shading: Assigning Lights Using Conservative Rasterization in DirectX 12. Kevin Örtegren, Avalanche Studios / Epic Games. Emil Persson, Avalanche Studios / Elemental Games. GPU Pro 7.
Tiled Shading: Light Culling—Reaching the Speed of Light. Dmitry Zhdan, NVIDIA. GDC 2016.
2017
How Unreal Renders a Frame. Kostas Anagnostou, Radiant Worlds / Playground Games. Interplay of Light Blog.
Improved Culling for Tiled and Clustered Rendering. Michal Drobot, Infinity Ward. SIGGRAPH 2017: Advances in Real-Time Rendering in Games Course.
Cull That Cone! Improved Cone/Spotlight Visibility Tests for Tiled and Clustered Lighting. Bartłomiej Wroński, Santa Monica Studio / NVIDIA. Bart Wronski Blog.
2018
The Road Toward Unified Rendering with Unity’s High Definition Render Pipeline. Sébastien Lagarde, Unity Technologies. Evgenii Golubev, Unity Technologies. SIGGRAPH 2018: Advances in Real-Time Rendering in Games Course.
2019
Under the Hood of Shadow of the Tomb Raider. m0radin. m0rad.in Blog.
2020
Real-Time Samurai Cinema: Lighting, Atmosphere, and Tonemapping in Ghost of Tsushima. Jasmin Patry, Sucker Punch Productions. SIGGRAPH 2021: Advances in Real-Time Rendering in Games Course.
Clustered Shading Evolution in Granite. Hans-Kristian Arntzen, Arntzen Software AS. Maister’s Graphics Adventures Blog.
Graphics Study: Red Dead Redemption 2. Hüseyin, Our Machinery. imgeself Blog.
Hallucinations re: the rendering of Cyberpunk 2077. Angelo Pesce, Roblox. C0DE517E Blog.
2021
The Rendering of Jurassic World: Evolution. The Code Corsair. The Code Corsair Blog.
The Rendering of Mafia: Definitive Edition. The Code Corsair. The Code Corsair Blog.
Digital combat simulator: frame analysis. Thomas Poulet, Ubisoft Berlin / Huawei. Blog.
People by Company
- Z Pre-Pass
@startuml
start
split
group Render Opaque Objects
:Depth Buffer;
floating note left: Z Pre-Pass
floating note right: Sort Front-To-Back
:Switch Off Depth Write;
:Forward Rendering;
floating note left: Sort Front-To-Back
end group
split again
group Transparent Objects
:Switch Off Depth Write;
:Forward Rendering;
floating note right: Sort Back-To-Front
end group
end split
stop
@enduml