
Efficient Rendering Study Notes (2022.07.19)


Forward Rendering

Characteristics:

Classic forward rendering:StewartThomas13

Modern Forward Shading:Olsson15

  1. Optional Pre-Z / Geometry Pass
  2. Light Assignment
    • Build Light Acceleration Structure (Grid)
  3. Geometry Pass
    • Just your normal shading pass
    • For each fragment
      • Look up light list in acceleration structure
      • Loop over lights and accumulate shading
      • Write shading to frame buffer

Z Pre-Pass rendering

Construct a depth-only pass (Z pre-pass) first to fill the z-buffer with depth data and, at the same time, prime z culling. Then render the scene using this occlusion data to prevent pixel overdraw.EngelShaderX709

ZPrePassRendererEngelShaderX709

A naïve multi-light solution that accompanies a Z pre-pass renderer design pattern would just render a limited number of lights in the pixel shader.EngelShaderX709

A more advanced approach stores light source properties such as position, light color, and other light properties in a texture, following a 2D grid laid out in the game world.EngelShaderX709
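A CPU-side sketch of such a world-space light grid (all names, sizes, and the cell layout here are hypothetical, chosen only to illustrate binning light properties into a 2D grid that could then be uploaded as a texture):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout: the world is covered by a GRID_W x GRID_H grid of
// cells; each cell stores up to MAX_LIGHTS_PER_CELL light slots
// (position + color), which would be uploaded as rows of a 2D texture.
// Assumes non-negative world coordinates for simplicity.
struct LightProps { float pos[3]; float color[3]; };

constexpr int GRID_W = 16, GRID_H = 16, MAX_LIGHTS_PER_CELL = 4;
constexpr float CELL_SIZE = 10.0f; // world units per cell (assumed)

struct WorldLightGrid {
    std::vector<LightProps> slots;
    std::vector<int> counts;
    WorldLightGrid() : slots(GRID_W * GRID_H * MAX_LIGHTS_PER_CELL),
                       counts(GRID_W * GRID_H, 0) {}
    int cellIndex(float x, float z) const {
        int cx = static_cast<int>(x / CELL_SIZE);
        int cz = static_cast<int>(z / CELL_SIZE);
        return cz * GRID_W + cx;
    }
    bool insert(const LightProps& l) {
        int cell = cellIndex(l.pos[0], l.pos[2]);
        int& n = counts[cell];
        if (n >= MAX_LIGHTS_PER_CELL) return false; // cell full
        slots[cell * MAX_LIGHTS_PER_CELL + n++] = l;
        return true;
    }
};
```

The shader would then read a cell's slots by computing the same cell index from the shaded position.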

In order to render many lights:EngelSiggraph09

Space Marine:KimBarrero11

Unreal:Anagnostou17

Lighting Pass

Single Pass Lighting

For each object:
  Render mesh, applying all lights in one shader

Hargreaves04

For each object:
  Find all lights affecting object
  Render all lighting and material in a single shader

Valient07

Multipass Lighting

For each light:
  For each object affected by the light:
    framebuffer += object * light

Hargreaves04

For each light:
  For each object:
    Add lighting from single light to frame buffer

Valient07

Tiled Forward Shading

Basic AlgorithmOlssonBilleterAssarsson13

  1. Subdivide screen into tiles
  2. (Optional): pre-Z pass
  3. (Optional): find min / max z-bounds for each tile
  4. Assign lights to each tile
  5. Render geometry and compute shading for each generated fragment
// 1D texture holding per-tile light lists
uniform isamplerBuffer tex_tileLightLists;

// uniform buffer holding each tile's light count and
// start offset of the tile's light list (in
// tex_tileLightLists)
uniform TileLightListRanges
{
  ivec2 u_lightListRange[MAX_NUM_TILES];
};

void shading_function(inout FragmentData aFragData)
{
  // ...

  // find fragment's tile using gl_FragCoord
  ivec2 tileCoord = ivec2(gl_FragCoord.xy) / ivec2(TILE_SIZE_X, TILE_SIZE_Y);
  int tileIdx = tileCoord.x + tileCoord.y * LIGHT_GRID_SIZE_X;

  // fetch tile's light data start offset (.y) and 
  // number of lights (.x)
  ivec2 lightListRange = u_lightListRange[tileIdx].xy;

  // iterate over lights affecting this tile
  for (int i = 0; i < lightListRange.x; ++i)
  {
    int lightIndex = lightListRange.y + i;

    // fetch global light ID
    int globalLightId = texelFetch(tex_tileLightLists, lightIndex).x;

    // get the light's data (position, colors, ...)
    LightData lightData;
    light_get_data(lightData, globalLightId);

    // compute shading from the light
    shade(aFragData, lightData);
  }
  // ...
}
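The shader above expects the per-tile light lists and their (count, offset) ranges to be built beforehand. A CPU-side sketch of that flattening step (names and types are illustrative, not from any of the cited sources):

```cpp
#include <utility>
#include <vector>

// Flatten per-tile light index lists into one array, and record
// (count, offset) per tile -- the same data the u_lightListRange /
// tex_tileLightLists pair in the shader above would be filled with.
struct TileLightLists {
    std::vector<int> indices;               // concatenated light IDs
    std::vector<std::pair<int, int>> ranges; // per tile: (count, offset)
};

TileLightLists flattenTileLists(const std::vector<std::vector<int>>& perTile)
{
    TileLightLists out;
    out.ranges.reserve(perTile.size());
    for (const auto& tile : perTile) {
        out.ranges.emplace_back(static_cast<int>(tile.size()),
                                static_cast<int>(out.indices.size()));
        out.indices.insert(out.indices.end(), tile.begin(), tile.end());
    }
    return out;
}
```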

Subdivision of Screen

Optional pre-Z Pass

  1. Required if we wish to find the Z-bounds for each tile
  2. In the final rendering pass, it can reduce the number of samples that need to be shaded through early-Z tests and similar hardware features
    • Should only include opaque geometry

Optional Min / Max Z-Bounds

Light Assignment

Rendering and Shading

Transparency Support

// assign lights to 2D tiles
tiles2D = build_2d_tiles();
lightLists2D = assign_lights_to_2d_tiles(tiles2D);

// draw opaque geometry in pre-Z pass and find tiles'
// extents in the Z-direction
depthBuffer = render_preZ_pass();
tileZBounds = reduce_z_bounds(tiles2D, depthBuffer);

// for transparent geometry, prune lights against maximum Z-direction
lightListsTrans = prune_lights_max(lightLists2D, tileZBounds);

// for opaque geometry additionally prune lights against 
// minimum Z-direction
lightListsOpaque = prune_lights_min(lightListsTrans, tileZBounds);

// ...

// later: rendering
draw(opaque geometry, lightListsOpaque);
draw(transparent geometry, lightListsTrans);
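The two pruning steps in the pseudocode reduce to interval tests in view-space Z. A minimal sketch, assuming point lights with a center depth and radius (names are illustrative):

```cpp
// A light is kept for a tile if its view-space Z interval
// [lightZ - radius, lightZ + radius] overlaps the tile's interval.
struct ZBounds { float minZ, maxZ; };

// Transparent geometry: only the far bound (from opaque depth) is valid,
// so prune only against maximum Z.
bool lightOverlapsMaxZ(float lightZ, float radius, const ZBounds& tile)
{
    return lightZ - radius <= tile.maxZ;
}

// Opaque geometry: additionally prune against minimum Z.
bool lightOverlapsMinMaxZ(float lightZ, float radius, const ZBounds& tile)
{
    return lightZ - radius <= tile.maxZ && lightZ + radius >= tile.minZ;
}
```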

Forward+ Rendering

Forward+:StewartThomas13

Light Culling

Implementation

Gather Approach
// GET_GROUP_IDX: thread group index in X direction (SV_GroupID)
// GET_GROUP_IDY: thread group index in Y direction (SV_GroupID)
// GET_GLOBAL_IDX: global thread index in X direction (SV_DispatchThreadID)
// GET_GLOBAL_IDY: global thread index in Y direction (SV_DispatchThreadID)
// GET_LOCAL_IDX: local thread index in X direction (SV_GroupThreadID)
// GET_LOCAL_IDY: local thread index in Y direction (SV_GroupThreadID)

// No global memory write is necessary until all lights are tested
groupshared u32 ldsLightIdx[LIGHT_CAPACITY];  // Light index storage
groupshared u32 ldsLightIdxCounter; // Light index counter for the storage

void appendLightToList(int i)
{
  u32 dstIdx = 0;
  InterlockedAdd(ldsLightIdxCounter, 1, dstIdx);
  if (dstIdx < LIGHT_CAPACITY)
  {
    ldsLightIdx[dstIdx] = i;
  }
}

...

  // 1: computation of the frustum of a tile in view space
  float4 frustum[4];
  { // construct frustum
    float4 v[4];
    // projToView: 
    //   takes screen-space pixel indices and depth value
    //   returns coordinates in view space
    v[0] = projToView(8 * GET_GROUP_IDX,        8 * GET_GROUP_IDY,        1.f);
    v[1] = projToView(8 * (GET_GROUP_IDX + 1),  8 * GET_GROUP_IDY,        1.f);
    v[2] = projToView(8 * (GET_GROUP_IDX + 1),  8 * (GET_GROUP_IDY + 1),  1.f);
    v[3] = projToView(8 * GET_GROUP_IDX,        8 * (GET_GROUP_IDY + 1),  1.f);
    float4 o = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int i = 0; i < 4; ++i)
    {
      // createEquation:
      //   Creates a plane equation from three vertex positions
      frustum[i] = createEquation(o, v[i], v[(i + 1) & 3]);
    }
  }

  ...

  // 2: clip the frustum by using the max / min depth values of the pixels in the tile
  float depth = depthIn.Load(uint3(GET_GLOBAL_IDX, GET_GLOBAL_IDY, 0));
  float4 viewPos = projToView(GET_GLOBAL_IDX, GET_GLOBAL_IDY, depth);

  int lIdx = GET_LOCAL_IDX + GET_LOCAL_IDY * 8;
  { // calculate bound
    if (lIdx == 0)  // initialize
    {
      ldsZMax = 0;  // max z coordinates
      ldsZMin = 0xffffffff; // min z coordinates
    }
    GroupMemoryBarrierWithGroupSync();
    u32 z = asuint(viewPos.z);
    if (depth != 1.f)
    {
      AtomMax(ldsZMax, z);
      AtomMin(ldsZMin, z);
    }
    GroupMemoryBarrierWithGroupSync();
    maxZ = asfloat(ldsZMax);
    minZ = asfloat(ldsZMin);
  }

  ...

  // 3: cull lights
  // 8 x 8 thread group is used, thus 64 lights are processed in parallel
  for (int i = 0; i < nBodies; i += 64)
  {
    int il = lIdx + i;
    if (il < nBodies)
    {
      // overlaps:
      //   light-geometry overlap check using separating axis theorem
      if (overlaps(frustum, gLightGeometry[il]))
      {
        // appendLightToList
        //   Store light index to the list of the overlapping lights
        appendLightToList(il);
      }
    }
  }

  ...

  // 4: fill the light indices to the assigned contiguous memory of gLightIdx using all the threads in a thread group
  { // write back
    u32 startOffset = 0;

    if (lIdx == 0)
    { // reserve memory
      if (ldsLightIdxCounter != 0)
      {
        InterlockedAdd(gLightIdxCounter, ldsLightIdxCounter, startOffset);

        ptLowerBound[tileIdx] = startOffset;
        ldsLightIdxStart = startOffset;
      }
    }
    GroupMemoryBarrierWithGroupSync();
    startOffset = ldsLightIdxStart;

    // all 64 threads of the group participate in the write-back
    for (int i = lIdx; i < ldsLightIdxCounter; i += 64)
    {
      gLightIdx[startOffset + i] = ldsLightIdx[i];
    }
  }

HaradaMcKeeYang13
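The listing assumes a createEquation helper that builds a plane equation from three points; for tile frusta the first point is the eye at the origin, so the d term comes out zero. A sketch of what such a helper might look like (struct names are illustrative):

```cpp
#include <cmath>

struct Float3 { float x, y, z; };
struct Float4 { float x, y, z, w; };

// Plane through points a, b, c: dot(n, p) + d = 0, with n the normalized
// cross product of the two edge vectors and d = -dot(n, a).
Float4 createEquation(Float3 a, Float3 b, Float3 c)
{
    Float3 u{b.x - a.x, b.y - a.y, b.z - a.z};
    Float3 v{c.x - a.x, c.y - a.y, c.z - a.z};
    Float3 n{u.y * v.z - u.z * v.y,
             u.z * v.x - u.x * v.z,
             u.x * v.y - u.y * v.x};
    float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    n = {n.x / len, n.y / len, n.z / len};
    return {n.x, n.y, n.z, -(n.x * a.x + n.y * a.y + n.z * a.z)};
}
```

With the eye at the origin as point a, the resulting side planes pass through the origin, which is what the frustum culling above relies on.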

Scatter Approach

2.5 CullingHarada12

IDEA:

FRUSTUM CONSTRUCTION:

LIGHT CULLING:

 1: frustum[0-3] ← Compute 4 planes at the boundary of a tile
 2: z ← Fetch depth value of the pixel
 3: ldsMinZ ← atomMin(z)
 4: ldsMaxZ ← atomMax(z)
 5: frustum[4, 5] ← Compute 2 planes using ldsMinZ, ldsMaxZ
 6: depthMaskT ← atomOr(1 << getCellIndex(z))
 7: for all the lights do
 8:   iLight ← lights[i]
 9:   if overlaps(iLight, frustum) then
10:     depthMaskL ← Compute mask using light extent
11:     overlapping ← depthMaskT ∧ depthMaskL
12:     if overlapping then
13:       appendLight(i)
14:     end if
15:   end if
16: end for
17: flushLightIndices()
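Steps 6, 10, and 11 rely on 32-bit depth masks over the tile's depth range. A sketch of how such masks might be computed (the 32-cell split follows the 2.5D culling idea; function names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>

// The tile's [tileMinZ, tileMaxZ] range is split into 32 cells. Geometry
// samples OR their cell's bit into depthMaskT; a light sets the bits its
// own Z extent covers (depthMaskL). The light is kept only if the two
// masks share a bit.
uint32_t depthMaskForRange(float zMin, float zMax,
                           float tileMinZ, float tileMaxZ)
{
    float cellSize = (tileMaxZ - tileMinZ) / 32.0f;
    int first = std::clamp(static_cast<int>((zMin - tileMinZ) / cellSize), 0, 31);
    int last  = std::clamp(static_cast<int>((zMax - tileMinZ) / cellSize), 0, 31);
    uint32_t mask = 0;
    for (int i = first; i <= last; ++i) mask |= 1u << i;
    return mask;
}

bool overlaps25D(uint32_t depthMaskT, uint32_t depthMaskL)
{
    return (depthMaskT & depthMaskL) != 0;
}
```

This is why a light whose frustum test passes can still be rejected: its depth cells may fall entirely in empty space between foreground and background geometry.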

Shading

#define LIGHT_LOOP_BEGIN
  int tileIndex = GetTileIndex(screenPos);
  uint startIndex;
  uint endIndex;
  GetTileOffsets(tileIndex, startIndex, endIndex);

  for (uint lightListIdx = startIndex; lightListIdx < endIndex; ++lightListIdx)
  {
    int lightIdx = LightIndexBuffer[lightListIdx];
    LightParams directLight;
    LightParams indirectLight;

    if (isIndirectLight(lightIdx))
    {
      FetchIndirectLight(lightIdx, indirectLight);
    }
    else
    {
      FetchDirectLight(lightIdx, directLight);
    }
#define LIGHT_LOOP_END
  }

...

float4 PS( PSInput i ) : SV_TARGET
{
  float3 colorOut = 0;
  LIGHT_LOOP_BEGIN
  colorOut += EvaluateMicrofacet(directLight, indirectLight);
  LIGHT_LOOP_END
  return float4(colorOut, 1.f);
}

HaradaMcKeeYang13

Render PassesHaradaMcKeeYang13

Forward+RenderPasses

One-Bounce Indirect IlluminationHaradaMcKeeYang13

Forward++ RenderingStewartThomas13

Alpha Blended Geometry

Shadow Casting Lights

// global list of lights (shadow casting + non-shadow casting)
uint shadowIndex = uint(g_PointLightColor[lightIndex].a * 255.0);
if (shadowIndex < 255)  // is it shadow casting?
{
  // Point light
  int face = DirectionToCubeMapFace(lightDirection);

  // pixel position to light space where the cube map faces
  float4 texCoord = mul(float4(position, 1), g_ShadowViewProj[shadowIndex][face]);
  texCoord.xyz /= texCoord.w;
  texCoord.xy = 0.5 * texCoord.xy + 0.5;

  // scale (undersample per face) and bias
  texCoord.xy *= g_ShadowScaleAndBias.xx;
  texCoord.xy += g_ShadowScaleAndBias.yy;

  // set texture coordinates in the atlas
  texCoord.xy += float2(face, shadowIndex);
  texCoord.xy *= rcp(float2(6, MAX_POINT_LIGHT_SHADOWS));

  texCoord.z -= g_ShadowZBias;

  // hardware PCF
  shadowTerm = FilterShadow(g_PointLightShadowAtlas, texCoord.xyz);
}

Depth Discontinuities

Clustered Forward+Leadbetter14

Deferred Rendering

Goal:

Q: Why deferred rendering?
A: Combine conventional rendering techniques with the advantages of image space techniquesCalver03

For each object:
  Render to multiple targets

For each light:
  Apply light as a 2D postprocess

Hargreaves04

For each object:
  Render surface properties into the G-Buffer
For each light and lit pixel
  Use G-Buffer to compute lighting
  Add result to frame buffer

Valient07

Traditional deferred shading:Andersson09

  1. Graphics pipeline rasterizes gbuffer for opaque surfaces
    • Normal, albedos, roughness, etc.
    • Render scene geometry into G-Buffer MRTStewartThomas13
      • Store material properties (albedo, specular, normal, etc.)
      • Write to depth buffer as normal
  2. Light sources are rendered & accumulate lighting to a texture (accumulation buffer)StewartThomas13
  3. Combine shading & lighting for final output

Modern Deferred Shading:Olsson15

  1. Render Scene to G-Buffers
  2. Light Assignment
    • Build Light Acceleration Structure (Grid)
  3. Full Screen Pass
    • Quad (or CUDA, or Compute Shaders, or SPUs)
    • For each pixel
      • Fetch G-Buffer Data
      • Look up light list in acceleration structure
      • Loop over lights and accumulate shading
      • Write shading

G-Buffers

G-Buffers are 2D images that store geometric details in a texture, storing positions, normals and other details at every pixel. The key ingredient to hardware acceleration of G-Buffers is having the precision to store and process data such as position on a per-pixel basis. The higher precision we have to store the G-Buffer at, the slower the hardware renders.Calver03

Thin G-Buffer

The smaller the better!Kaplanyan10

G-Buffer encoding requirements:Pesce15

Advantages:

What to Store?

Depth

Calver03Hargreaves04HargreavesHarris04Thibieroz04Placeres06FilionMcNaughton08EngelShaderX709EngelSiggraph09Lee09LobanchikovGruen09Kaplanyan10KnightRitchieParrish11Thibieroz11Moradin19Huseyin20Pesce20

Use depth data, provided by the depth buffer, to reconstruct position data.

Format Suggestion:

float4 vViewPos;
vViewPos.xy = ((INTERPOLANT_VPOS * half2(2.0f, -2.0f) + half2(-1.0f, 1.0f)) * 0.5)
            * p_vCameraNearSize * p_vRecipRenderTargetSize;
vViewPos.zw = half2(1.0f, 1.0f);
vViewPos.xyz = vViewPos.xyz * fSampledDepth;
float3 vWorldPos = mul(p_mInvViewTransform, vViewPos).xyz;

FilionMcNaughton08

// input SV_POSITION as pos2d
New_pos2d = ((pos2d.xy) * (2 / screenres.xy)) - float2(1, 1);
viewSpacePos.x = gbuffer_depth * tan(90 - HORZFOV/2) * New_pos2d.x;
viewSpacePos.y = gbuffer_depth * tan(90 - VERTFOV/2) * New_pos2d.y;
viewSpacePos.z = gbuffer_depth;

LobanchikovGruen09
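Both snippets reconstruct a view-space position by scaling NDC coordinates with depth and the projection's FOV terms. A simplified CPU-side version, assuming a symmetric perspective projection, linear view-space depth, and NDC in [-1, 1] (names are illustrative):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// At view-space depth viewZ, the visible half-extent of the frustum is
// viewZ * tan(fov/2) vertically (times aspect horizontally), so scaling
// the NDC coordinates by those extents recovers the view-space position.
Vec3 viewPosFromDepth(float ndcX, float ndcY, float viewZ,
                      float vertFovRadians, float aspect)
{
    float tanHalfFovY = std::tan(vertFovRadians * 0.5f);
    return { ndcX * viewZ * tanHalfFovY * aspect,
             ndcY * viewZ * tanHalfFovY,
             viewZ };
}
```

Engines differ on handedness and whether stored depth is linear or post-projection; those details change the scale factors but not the idea.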

Stencil

Kaplanyan10Huseyin20Pesce20

Format Suggestion:

Stencil to mark objects in lighting groupsKaplanyan10

Normal

Calver03Hargreaves04HargreavesHarris04Thibieroz04Placeres06Andersson09EngelShaderX709EngelSiggraph09Lee09LobanchikovGruen09Kaplanyan10KnightRitchieParrish11Thibieroz11Huseyin20Pesce20

Format Suggestions:

Considerations:

Optimizations:

Packing:

float2 pack_normal(float3 norm)
{
  float2 res;
  res = 0.5 * (norm.xy + float2(1, 1));
  res.x *= (norm.z < 0 ? -1.0 : 1.0);
  return res;
}

Unpacking:

float3 unpack_normal(float2 norm)
{
  float3 res;
  res.xy = (2.0 * abs(norm)) - float2(1, 1);
  res.z = (norm.x < 0 ? -1.0 : 1.0) * sqrt(abs(1 - res.x * res.x - res.y * res.y));
  return res;
}

Crytek:

Baseline: XYZ

// Encoding
half4 encode(half3 n, float3 view)
{
  return half4(n.xyz * 0.5 + 0.5, 0);
}

// Decoding
half3 decode(half4 enc, float3 view)
{
  return enc.xyz * 2.0 - 1.0;
}

Octahedral Normal VectorsCigolleDonowEvangelakosMaraMcGuireMeyer14

Map the sphere to an octahedron, project down into the z = 0 plane, and then reflect the -z-hemisphere over the appropriate diagonal.

// float3 to oct

// returns ±1
float2 signNotZero(float2 v)
{
  return float2((v.x >= 0.0) ? +1.0 : -1.0, (v.y >= 0.0) ? +1.0 : -1.0);
}

// assume normalized input. output is on [-1, 1] for each component
float2 float3ToOct(float3 v)
{
  // project the sphere onto the octahedron, and then onto the xy plane
  float2 p = v.xy * (1.0 / (abs(v.x) + abs(v.y) + abs(v.z)));

  // reflect the folds of the lower hemisphere over the diagonals
  return (v.z <= 0.0) ? ((1.0 - abs(p.yx)) * signNotZero(p)) : p;
}
// oct to float3
float3 octToFloat3(float2 e)
{
  float3 v = float3(e.xy, 1.0 - abs(e.x) - abs(e.y));
  if (v.z < 0)
  {
    v.xy = (1.0 - abs(v.yx)) * signNotZero(v.xy);
  }

  return normalize(v);
}
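A CPU transliteration of the same mapping is handy for verifying the round trip; decode(encode(n)) should reproduce any unit vector n up to floating-point error (struct and function names are illustrative):

```cpp
#include <cmath>

struct V2 { float x, y; };
struct V3 { float x, y, z; };

static float signNotZero(float v) { return v >= 0.0f ? 1.0f : -1.0f; }

// Project onto the octahedron (L1-normalize), then onto the xy plane;
// fold the lower hemisphere over the diagonals.
V2 octEncode(V3 v)
{
    float invL1 = 1.0f / (std::fabs(v.x) + std::fabs(v.y) + std::fabs(v.z));
    V2 p{v.x * invL1, v.y * invL1};
    if (v.z <= 0.0f)
        p = {(1.0f - std::fabs(p.y)) * signNotZero(p.x),
             (1.0f - std::fabs(p.x)) * signNotZero(p.y)};
    return p;
}

// Invert the fold, then renormalize back onto the sphere.
V3 octDecode(V2 e)
{
    V3 v{e.x, e.y, 1.0f - std::fabs(e.x) - std::fabs(e.y)};
    if (v.z < 0.0f)
        v = {(1.0f - std::fabs(v.y)) * signNotZero(v.x),
             (1.0f - std::fabs(v.x)) * signNotZero(v.y),
             v.z};
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / len, v.y / len, v.z / len};
}
```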

Diffuse Albedo

Calver03Hargreaves04HargreavesHarris04Thibieroz04Andersson09EngelShaderX709EngelSiggraph09Lee09LobanchikovGruen09KnightRitchieParrish11Thibieroz11Moradin19Huseyin20Pesce20

Format Suggestions:

Etc.

Examples

Example 1: Beyond3DCalver03

MRTs R G B A
RT 0 Pos.X Pos.Y Pos.Z ID
RT 1 Norm.X Norm.Y Norm.Z Material ID
RT 2 Diffuse Albedo.R Diffuse Albedo.G Diffuse Albedo.B Diffuse Term
RT 3 Specular Emissive.R Specular Emissive.G Specular Emissive.B Specular Term
Material Lookup texture
Kspecblend
KAmb
KEmm

Example 2: Climax Studios GDC 2004 Hargreaves04

MRTs R G B A
DS Depth R32F
RT 0 Norm.X R10F Norm.Y G10F Norm.Z B10F Scattering A2F
RT 1 Diffuse Albedo.R R8 Diffuse Albedo.G G8 Diffuse Albedo.B B8 Emissive Term A8
RT 2 (could be palettized) Material Parameters R8 Material Parameters G8 Material Parameters B8 Material Parameters A8

Example 3: ShaderX2Thibieroz04

MRTs R8 G8 B8 A8
RT 0 Pos.X R16F Pos.Y G16F
RT 1 Pos.Z R16F
RT 2 Diffuse Albedo.R R8 Diffuse Albedo.G G8 Diffuse Albedo.B B8 Normal.Z A8
RT 3 Normal.X A8 Normal.Y L8

Example 4: Killzone 2Valient07

MRTs R8 G8 B8 A8
DS Depth 24bpp Stencil
RT 0 Lighting Accumulation.R Lighting Accumulation.G Lighting Accumulation.B Intensity
RT 1 Normal.X FP16 Normal.Y FP16
RT 2 Motion Vectors XY Spec-Power Spec-Intensity
RT 3 Diffuse Albedo.R R8 Diffuse Albedo.G G8 Diffuse Albedo.B B8 Sun-Occlusion A8

Analysis:

Example 5: StarCraft IIFilionMcNaughton08

MRTs R G B A
RT 0 Unlit & Emissive R16G16B16F Unused
RT 1 Normal R16G16B16F Depth
RT 2 Diffuse Albedo.R Diffuse Albedo.G Diffuse Albedo.B AO
RT 3 Specular Albedo.R Specular Albedo.G Specular Albedo.B Unused

Example 6: S.T.A.L.K.E.R.: Clear SkiesLobanchikovGruen09

S.T.A.L.K.E.R. originally used a 3-RT G-Buffer:

S.T.A.L.K.E.R.: Clear Skies:

Example 7: Split/SecondKnightRitchieParrish11

MRTs R G B A
RT 0 Diffuse Albedo.R Diffuse Albedo.G Diffuse Albedo.B Specular amount
RT 1 Normal.X Normal.Y Normal.Z Motion ID + MSAA edge
RT 3 Prelit.R Prelit.G Prelit.B Specular power

Example 8: Crysis 3SousaWenzelRaine13

MRTs R G B A
DS Depth D24 AmbID, Decals S8
RT 0 Normal.X R8 Normal.Y G8 Gloss, Z Sign B8 Translucency A8
RT 1 Diffuse Albedo.Y R8 Diffuse Albedo.Cb, .Cr G8 Specular Y B8 Per-Project A8

Example 9: DestinyTatarchukTchouVenzon13

MRTs R G B A
RT 0 Diffuse Albedo.R R8 Diffuse Albedo.G G8 Diffuse Albedo.B B8 AO A8
RT 1 Normal.X * (Biased Specular Smoothness) R8 Normal.Y * (Biased Specular Smoothness) G8 Normal.Z * (Biased Specular Smoothness) B8 Material ID A8
DS Depth D24 Stencil S8

Example 10: inFAMOUS: Second SonBentley14

MRTs R G B A
RT 0 Diffuse Albedo.R R8 Diffuse Albedo.G G8 Diffuse Albedo.B B8 Shadow Refr A8
RT 1 Normal.α R16 Normal.β G16 Vertex Normal.α B16 Vertex Normal.β A16
RT 2 Sun Shadow R8 AO G8 Spec Occl B8 Gloss A8
RT 3 Wetness Params RGBA8
RT 4 Ambient Diffuse.R R16F Ambient Diffuse.G G16F Ambient Diffuse.B B16F Amb Atten A16F
RT 5 Emissive.R R16F Emissive.G G16F Emissive.B B16F Alpha A16F
D32f Depth D24
S8 Stencil S8

Example 11: RyzeSchulz14

MRTs R G B A
RT 0 Normal.X R8 Normal.Y G8 Normal.Z B8 Translucency Luminance / Prebaked AO Term A8
RT 1 Diffuse Albedo.R R8 Diffuse Albedo.G G8 Diffuse Albedo.B B8 Subsurface Scattering Profile A8
RT 2 Roughness R8 Specular YCbCr / Transmittance CbCr GBA8

Example 12: Uncharted 4ElGarawany16

Channels G-Buffer 0 Channels G-Buffer 1
R r g R ambientTranslucency sunShadowHigh specOcclusion
G b spec G heightmapShadowing sunShadowLow metallic
B normalx normaly B dominantDirectionX dominantDirectionY
A iblUseParent normalExtra roughness A ao extraMaterialMask sheen thinWallTranslucency

Example 13: Jurassic World: EvolutionTheCodeCorsairJWE21

MRTs R G B A
RT 0 Normal.X R Normal.Y G Normal.Z B Roughness A
RT 1 Motion Vectors

Example 14: Mafia: Definitive EditionTheCodeCorsairMDE21

MRTs R G B A
RT 0 Normal.X R16F Normal.Y G16F Normal.Z B16F Roughness A16F
RT 1 Diffuse Albedo.R R8 Diffuse Albedo.G G8 Diffuse Albedo.B B8 Metalness A8
RT 2 Motion Vectors RGB16U Encoded Vertex Normal A16U
RT 3 Specular Intensity R8 0.5 G8 Curvature or Thickness (for SSS) B8 SSS Profile A8
RT 4 Emissive.R R11F Emissive.G G11F Emissive.B B10F

Example 15: Digital Combat SimulatorPoulet21

ld_ms(texture2dmsarray)(float,float,float,float) r1.zw, r5.xyww, GBufferMap.zwxy, l(0)
ld_ms(texture2dmsarray)(float,float,float,float) r0.w, r5.xyzw, GBufferMap.yzwx, l(0)
mad r1.zw, r1.zzzw, l(0.0000, 0.0000, 2.0000, 2.0000), l(0.0000, 0.0000, -1.0000, -1.0000)
add r5.x, r1.w, r1.z
add r5.z, -r1.w, r1.z
mul r5.xz, r5.xxzx, l(0.5000, 0.0000, 0.5000, 0.0000)
add r1.z, abs(r5.z), abs(r5.x)
add r5.y, r1.z, l(-1.0000)
dp3 r1.z, r5.xyzx, r5.xyzx
rsq r1.z, r1.z
mul r5.xyz, r1.zzzz, r5.xyzx
ge r0.w, l(0.5000), r0.w
movc r5.w, r0.w, r5.y, -r5.y

Example 16: UnityLagardeGolubev18

MRTs R G B A
RT 0 (sRGB) BaseColor.R R8 BaseColor.G G8 BaseColor.B B8 Specular Occlusion A8
RT 1 Normal.xy (Octahedral 12/12) RGB8 Perceptual Smoothness A8
RT 2 Material Data RGB8 FeaturesMask(3) / Material Data A8
RT 3 Static diffuse lighting R11G11B10F
RT 4 (Optional) Extra specular occlusion data RG8 Ambient Occlusion B8 Light Layering Mask
RT 5 (Optional) 4 Shadow Masks RGBA8

Overview

Example Passes

Example 1: UnityLagardeGolubev18

Opaque Material Render Pass

  1. Depth Prepass
  2. GBuffer
    • Tag stencil for regular lighting or split lighting
  3. Render Shadow
    • Async Light list generation + Light/Material classification
    • Async SSAO (Use Normal buffer)
    • Async SSR (Use Normal buffer)
  4. Deferred directional cascade shadow
    • (Use Normal buffer for normal shadow bias)
  5. Tile deferred lighting
    • Indirect dispatch for each shader variants
      • Read stencil
        • No lighting: skip forward material and sky
        • Regular lighting: output lighting
        • Split lighting: separate diffuse and specular
  6. Forward Opaque
    • (Optional) Output BaseColor + Diffusion Profile
    • (Optional) Output + Tag stencil for split lighting
  7. SS Subsurface Scattering
    • Test stencil for split lighting
    • Combine lighting

Geometry Phase

Each geometry shader is responsible for filling the G-Buffers with correct parameters.Calver03

The major advantage over the conventional real-time approach to Renderman style procedural textures is that the entire shader is devoted to generating output parameters and that it is run only once regardless of the number or types of lights affecting this surface (generating depth maps also requires the geometry shaders to be run but usually with much simpler functions).Calver03

Another advantage is that after this phase how the G-Buffer was filled is irrelevant, this allows for impostors and particles to be mixed in with normal surfaces and be treated in the same manner (lighting, fog, etc.).Calver03

Some portions of the light equation that stay constant can be computed here and stored in the G-Buffer if necessary; this can be used if your light model uses Fresnel (which is usually based only on surface normal and view direction).Calver03

Killzone 2Valient07

Fill the G-Buffer with all geometry (static, skinned, etc.)
  Write depth, motion, specular, etc. properties
Initialize light accumulation buffer with pre-baked light
  Ambient, Incandescence, Constant specular
  Lightmaps on static geometry
    YUV color space, S3TC5 with Y in Alpha
    Sun occlusion in B channel
    Dynamic range [0...2]
  Image based lighting on dynamic geometry

Optimizations

Export Cost

Light Accumulation PassValient07

For each light:
  Find and mark visible lit pixels
  If light contributes to screen
    Render shadow map
    Shade lit pixels and add to framebuffer

Lighting Phase

The real power of deferred lighting is that lights are first-class citizens; this complete separation of lighting and geometry allows lights to be treated in a totally different way from standard rendering. This makes the artist's job easier, as there are fewer restrictions on how lights affect surfaces, allowing for easily customizable lighting rigs.Calver03

Light shaders have access to the parameters stored in the G-Buffer at each pixel they light.Calver03

Add lighting contributions into accumulation bufferThibieroz11

Render convex bounding geometry
Read G-Buffer
Compute radiance
Blend into frame buffer

HargreavesHarris04

For each light:
  diffuse += diffuse(GBuffer.N, L)
  specular += GBuffer.spec * specular(GBuffer.N, GBuffer.P, L)

HargreavesHarris04

framebuffer = diffuse * GBuffer.diffuse + specular

HargreavesHarris04

Per-Sample Pixel Shader Execution:Thibieroz09

struct PS_INPUT_EDGE_SAMPLE
{
  float4 Pos : SV_POSITION;
  uint uSample : SV_SAMPLEINDEX;
};

// Multisampled G-Buffer textures declaration
Texture2DMS <float4, NUM_SAMPLES> txMRT0;
Texture2DMS <float4, NUM_SAMPLES> txMRT1;
Texture2DMS <float4, NUM_SAMPLES> txMRT2;
// Pixel shader for shading pass of edge samples in DX10.1
// This shader is run at sample frequency
// Used with the following depth-stencil state values so that only
// samples belonging to edge pixels are rendered, as detected in
// the previous stencil pass.
// StencilEnable = TRUE
// StencilReadMask = 0x80
// Front/BackFaceStencilFail = Keep
// Front/BackfaceStencilDepthFail = Keep
// Front/BackfaceStencilPass = Keep;
// Front/BackfaceStencilFunc = Equal;
// The stencil reference value is set to 0x80

float4 PSLightPass_EdgeSampleOnly( PS_INPUT_EDGE_SAMPLE input ) : SV_TARGET
{
  // Convert screen coordinates to integer
  int3 nScreenCoordinates = int3(input.Pos.xy, 0);
  
  // Sample G-Buffer textures for current sample
  float4 MRT0 = txMRT0.Load( nScreenCoordinates, input.uSample);
  float4 MRT1 = txMRT1.Load( nScreenCoordinates, input.uSample);
  float4 MRT2 = txMRT2.Load( nScreenCoordinates, input.uSample);
  
  // Apply light equation to this sample
  float4 vColor = LightEquation(MRT0, MRT1, MRT2);
  
  // Return calculated sample color
  return vColor;
}

Conventional Deferred ShadingLauritzen10:

Modern ImplementationLauritzen10:

for each G-Buffer sample
{
  sampleAttr = load attributes from G-Buffer

  for each light
  {
    color += shade(sampleAttr, light)
  }

  output pixel color;
}

OlssonBilleterAssarsson13

uniform vec3 lightPosition;
uniform vec3 lightColor;
uniform float lightRange;

void main()
{
  vec3 color = texelFetch(colorTex, ivec2(gl_FragCoord.xy), 0).xyz;
  vec3 specular = texelFetch(specularTex, ivec2(gl_FragCoord.xy), 0).xyz;
  vec3 normal = texelFetch(normalTex, ivec2(gl_FragCoord.xy), 0).xyz;
  vec3 position = fetchPosition(gl_FragCoord.xy);

  vec3 shading = doLight(position, normal, color,
                         specular, lightPosition,
                         lightColor, lightRange);

  resultColor = vec4(shading, 1.0);
}

Olsson15

Red Dead Redemption 2:Huseyin20

Plus(+) Methods: Algorithm Steps:Drobot17

Lighting Optimizations:LagardeGolubev18

Bandwidth ProblemOlsson15

for each light
  for each covered pixel
    read G-Buffer // repeated reads
    compute shading
    read + write frame buffer // repeated reads and writes
for each pixel
  read G-Buffer
  for each affecting light
    compute shading
  write frame buffer
for each pixel
  read G-Buffer
  for each possibly affecting light
    if affecting
      compute shading
  write frame buffer

Pre-Tiled Shading

Advantages:

Full screen lights

For lights that are truly global and have no position and size (ambient and directional are the traditional types), we create a full screen quad that executes the pixel shader at every pixel.Calver03Hargreaves04

Global directional lights have little benefit from deferred rendering methods; it would actually be slower to resample the deferred buffers again for the entire screen.FilionMcNaughton08

Shaped lights

Shaped lights can be implemented via a full screen quad in exactly the same way as directional lights, just with a different algorithm computing the light's direction and attenuation; the attenuation also lets us pre-calculate where the light no longer makes any contribution.Calver03

OptimizationCalver03

The attenuation model I use is a simple texture lookup based on distance. The distance is divided by the maximum distance that the light can possibly affect, and this is used to look up a 1D texture. The last texel should be 0 (no constant term) if the following optimisations are to be used.
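A sketch of that lookup, with an assumed quadratic falloff curve (the actual curve is whatever gets baked into the texture; names and table size here are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t LUT_SIZE = 64;

// Distance divided by the light's maximum range indexes a 1D table whose
// last texel is 0, so a light contributes nothing at (and beyond) its
// maximum range -- the property the culling optimisations depend on.
std::array<float, LUT_SIZE> buildAttenuationLUT()
{
    std::array<float, LUT_SIZE> lut{};
    for (std::size_t i = 0; i < LUT_SIZE; ++i) {
        float t = static_cast<float>(i) / (LUT_SIZE - 1); // 0 .. 1
        lut[i] = (1.0f - t) * (1.0f - t); // assumed falloff curve
    }
    lut[LUT_SIZE - 1] = 0.0f; // no constant term at maximum range
    return lut;
}

float attenuate(const std::array<float, LUT_SIZE>& lut,
                float distance, float maxRange)
{
    float t = std::clamp(distance / maxRange, 0.0f, 1.0f);
    return lut[static_cast<std::size_t>(t * (LUT_SIZE - 1))];
}
```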

OptimizationPlaceres06

Shading only the pixels influenced by the bounding object involves rendering a full-screen quad but enabling clipping and rejection features to discard many uninfluenced pixels. This requires dynamic branching.

Light Volumes

We create a mesh that encloses the light affecting volume with any pixels found to be in the interior of the volume executing the light shader.Calver03Hargreaves04

  1. Each pixel must be hit once and once only. If the light volume causes the light shader to be executed more than once it will be equivalent to having n lights affecting this pixel.Calver03
  2. The near and far clip planes must not affect the projected shape. We need the projected geometry not to be clipped at the near and far plane as this will cause holes in our lights.Calver03

For convex volumes the first problem is completely removed by just using back or front face culling.Calver03Hargreaves04

We can’t remove the near plane, but we can effectively remove the far plane by placing it at infinity.Calver03

Convex volumes cover the vast majority of lights shaders (e.g. spheres for point lights, cones for spotlights, etc.) and we can adapt them to use the fast z-reject hardware that is usually available.Calver03

Dealing with the light volume rendering:Hargreaves04

  1. Camera is outside the light bounding mesh
    • Simple back face culling (each pixel must be hit once and once only)
  2. Camera is inside the light bounding mesh
    • Draw backfaces
  3. Light volume intersects the far clip plane
    • Draw frontfaces
  4. Light volume intersects both near and far clip planes
    • Light is too big
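The four cases above can be condensed into a small selection helper; a sketch, assuming the caller has already classified the camera position and near/far-plane crossings (names are illustrative):

```cpp
// Which faces of the light bounding mesh to rasterize. FrontFaces is the
// ordinary backface-culled draw; BackFaces handles the camera being near
// or inside the volume; TooBig means the volume spans both clip planes.
enum class LightVolumePass { FrontFaces, BackFaces, TooBig };

LightVolumePass chooseLightVolumePass(bool cameraInside,
                                      bool crossesNear, bool crossesFar)
{
    if (crossesNear && crossesFar) return LightVolumePass::TooBig;
    if (cameraInside || crossesNear) return LightVolumePass::BackFaces;
    // camera outside (possibly crossing the far plane): draw front faces,
    // i.e. plain backface culling
    return LightVolumePass::FrontFaces;
}
```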

Optimizations

S.T.A.L.K.E.R case:Shishkovtsov05

Pass 0: Render full-screen quad only where 0x03==stencil count
        (where attributes are stored)
  If ((N dot L) * ambient_occlusion_term > 0)
    discard fragment
  Else
    color = 0, stencil = 0x01
Pass 1: Render full-screen quad only where 0x03==stencil count
  Perform light accumulation / shading 

Shishkovtsov05

  1. Social Stage:Placeres06
    • Filter the lights and effects on the scene to produce a smaller list of sources to be processed
      1. Execute visibility and occlusion algorithms to discard lights whose influence is not appreciable
      2. Project visible sources bounding objects into screen space
      3. Combine similar sources that are too close in screen space or influence almost the same screen area
      4. Discard sources with a tiny contribution because of their projected bounding object being too small or too far
      5. Check that more than a predefined number of sources do not affect each screen region. Choose the biggest, strongest, and closest sources.
  2. Individual Stage:Placeres06
    • Global Sources
      • Most fill-rate expensive
        1. Enable the appropriate shaders
        2. Render a quad covering the screen
    • Local Sources
      1. Select the appropriate level of detail.
      2. Enable and configure the source shaders
      3. Compute the minimum and maximum screen cord values of the projected bounding object
      4. Enable the scissor test
      5. Enable the clipping planes
      6. Render a screen quad or the bounding object

Other optimizations:

Stencil Cull
  1. Render light volume with color write disabledHargreavesHarris04
    • Depth Func = LESS, Stencil Func = ALWAYS
    • Stencil Z-FAIL = REPLACE (with value X)
    • Rest of stencil ops set to KEEP
  2. Render with lighting shaderHargreavesHarris04
    • Depth Func = ALWAYS, Stencil Func = EQUAL, all ops = KEEP, Stencil Ref = X
    • Unlit pixels will be culled because stencil will not match the reference value
    • Only regions that fail depth test represent objects within the light volumeHargreavesHarris04

Killzone 2 case:Valient07

Light Shader Occlusion Optimisations

The basis of using occlusion culling with light shaders is that the depth buffer used for the creation of the G-Buffer is available at no cost (this is only true if the resolution of the G-Buffer matches the destination colour buffer and the same projection matrix is used for the geometry and light shaders).Calver03

I simply turn off the occlusion culling if the light shader hits the near plane and just render the back faces without depth testing. It means some pixels run the pixel shader unnecessarily, but it's very cheap on the CPU and the actual difference is usually only a few pixels.Calver03

Accessing Light Properties

ex)

struct LIGHT_STRUCT
{
  float4 vColor;
  float4 vPos;
};
cbuffer cbPointLightArray
{
  LIGHT_STRUCT g_Light[NUM_LIGHTS];
};

float4 PS_PointLight(PS_INPUT i) : SV_TARGET
{
  // ...
  uint uIndex = i.uPrimIndex / 2;
  float4 vColor = g_Light[uIndex].vColor;   // NO!
  float4 vLightPos = g_Light[uIndex].vPos;  // NO!
}
PS_QUAD_INPUT VS_PointLight(VS_INPUT i)
{
  PS_QUAD_INPUT Out = (PS_QUAD_INPUT)0;

  // Pass position
  Out.vPosition = float4(i.vNDCPosition, 1.0);

  // Pass light properties to PS
  uint uIndex = i.uVertexIndex / 4;
  Out.vLightColor = g_Light[uIndex].vColor;
  Out.vLightPos = g_Light[uIndex].vPos;

  return Out;
}

struct PS_QUAD_INPUT
{
  nointerpolation float4 vLightColor : LCOLOR;
  nointerpolation float4 vLightPos : LPOS;
  float4 vPosition : SV_POSITION;
};

Thibieroz11

Tiled Shading

Amortizes overheadLauritzen10.

  1. Divide the screen into a gridBalestraEngstad08Andersson11WhiteBarreBrisebois11OlssonBilleterAssarsson13
  2. Find which lights intersect each cellBalestraEngstad08Andersson11OlssonBilleterAssarsson13
  3. Render quads over each cell calculating up to 8 lights per passBalestraEngstad08

Algorithm:OlssonAssarsson11

  1. Render the (opaque) geometry into the G-BuffersStewartThomas13
  2. Construct a screen space grid, covering the frame buffer, with some fixed tile size, t = (x, y), e.g. 32 × 32 pixelsWhiteBarreBrisebois11StewartThomas13
  3. For each light: find the screen space extents of the light volume and append the light ID to each affected grid cellOlssonBilleterAssarsson13StewartThomas13
  4. For each fragment in the frame buffer, with location f = (x, y)
    1. Sample the G-Buffers at f
    2. Accumulate light contributions from all lights in tile at ⌊f /t⌋
    3. Output total light contributions to frame buffer at f

Pseudocode:OlssonAssarsson11

vec3 computeLight(vec3 position, vec3 normal, vec3 albedo,
                  vec3 specular, vec3 viewDir, float shininess,
                  ivec2 fragPos)
{
  ivec2 l = ivec2(fragPos.x / LIGHT_GRID_CELL_DIM_X,
                  fragPos.y / LIGHT_GRID_CELL_DIM_Y);
  int count = lightGrid[l.x + l.y * gridDim.x].x;
  int offset = lightGrid[l.x + l.y * gridDim.x].y;

  vec3 shading = vec3(0.0);

  for (int i = 0; i < count; ++i)
  {
    ivec2 dataInd = ivec2((offset + i) % TILE_DATA_TEX_WIDTH,
                          (offset + i) / TILE_DATA_TEX_WIDTH);
    int lightId = texelFetch(tileDataTex, dataInd, 0).x;
    shading += applyLight(position, normal, albedo, specular,
                          shininess, viewDir, lightId);
  }  

  return shading;
}

void main()
{
  ivec2 fragPos = ivec2(gl_FragCoord.xy);
  vec3 albedo = texelFetch(albedoTex, fragPos).xyz;
  vec4 specShine = texelFetch(specularShininessTex, fragPos);
  vec3 position = unProject(gl_FragCoord.xy, texelFetch(depthTex, fragPos));
  vec3 normal = texelFetch(normalTex, fragPos).xyz;
  vec3 viewDir = -normalize(position);

  gl_fragColor = computeLight(position, normal, albedo, 
                              specShine.xyz, viewDir, specShine.w, 
                              fragPos);
}

PhyreEngine Implementation:Swoboda09

  1. Calculate affecting lights per tile
    • Build a frustum around the tile using the min and max depth values in that tile
    • Perform frustum check with each light’s bounding volume
    • Compare light direction with tile average normal value
  2. Choose fast paths based on tile contents
    • No lights affect the tile? Use fast path
    • Check material values to see if any pixels are marked as lit

Screen tile classification is a powerful technique with many applications:Swoboda09

To facilitate look up from shaders, we must store the data structure in a suitable format:OlssonAssarsson11

  1. Light Grid contains an offset to and size of the light list for each tile
  2. Tile Light Index Lists contains light indices, referring to the lights in the Global Light Lists
Global Light List
| L0 | L1 | L2 | L3 | L4 | L5 | L6 | L7 |

Tile Light Index Lists
| 0 | 0 | 6 | 3 | 0 | 6 | 4 | 4 |

Light Grid (offset row above, size row below, per tile)
| 0 | 1 | 4 | 7 |
| 1 | 3 | 3 | 1 |
| 66 | 67 | 69 |
| 1 | 2 | 2 |
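The layout above can be built on the CPU with a simple flattening pass. A minimal Python sketch (illustrative only; `build_light_grid` and its list-of-lists input are hypothetical, not from the paper):

```python
def build_light_grid(per_tile_lights):
    """Flatten per-tile light lists into one index list plus an
    (offset, count) entry per tile, as in tiled shading."""
    index_list = []
    grid = []
    for lights in per_tile_lights:
        grid.append((len(index_list), len(lights)))  # offset, size
        index_list.extend(lights)                     # indices into global list
    return grid, index_list

# Example: three tiles touched by different lights
grid, indices = build_light_grid([[0, 2], [2], [1, 2, 3]])
```

`grid[1]` is `(2, 1)`: tile 1's list starts at offset 2 in the index list and holds one index.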

Red Dead Redemption 2:Huseyin20

Basic tiled culling:Stewart15

Input: light list, scene depth
Output: per-tile list of intersecting lights

calculate depth bounds for the tile;
calculate frustum planes for the tile;

for i ← thread_index to num_lights do
  current_light ← light_list[i];
  test intersection against tile bounding volume;
  if intersection then
    thread-safe increment of list counter;
    write light index to per-tile list;
  end
  i ← i + num_threads_per_tile;
end
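The strided loop in the pseudocode (each thread starts at its own index and advances by the thread count) can be simulated sequentially. A Python sketch under assumed inputs; `intersects_tile` stands in for the real sphere-vs-frustum test:

```python
def cull_lights(num_threads, lights, intersects_tile):
    """Sequentially simulate the per-tile culling loop: 'thread' t
    tests lights t, t + num_threads, t + 2*num_threads, ..."""
    per_tile_list = []
    for t in range(num_threads):
        i = t
        while i < len(lights):
            if intersects_tile(lights[i]):
                per_tile_list.append(i)  # InterlockedAdd + write in the shader
            i += num_threads
    return sorted(per_tile_list)

# Lights as (center_z, radius); the tile spans z in [5, 10]
lights = [(1.0, 2.0), (7.0, 1.0), (20.0, 3.0), (11.0, 2.0)]
hit = cull_lights(4, lights,
                  lambda l: l[0] + l[1] >= 5.0 and l[0] - l[1] <= 10.0)
```

Only lights 1 and 3 overlap the tile's depth range, so `hit` is `[1, 3]`.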

Z Prepass

groupshared uint ldsZMin;
groupshared uint ldsZMax;

[numthreads(16, 16, 1)]
void CalculateDepthBoundsCS(uint3 globalIdx : SV_DispatchThreadID, uint3 localIdx : SV_GroupThreadID)
{
  uint localIdxFlattened = localIdx.x + localIdx.y * 16;

  if (localIdxFlattened == 0)
  {
    ldsZMin = 0x7f7fffff; // FLT_MAX as a uint
    ldsZMax = 0;
  }

  GroupMemoryBarrierWithGroupSync();

  float depth = g_DepthTexture.Load(uint3(globalIdx.x, globalIdx.y, 0)).x;

  uint z = asuint( ConvertProjDepthToView( depth ) ); // reinterpret as uint

  if (depth != 0.0)
  {
    InterlockedMax( ldsZMax, z );
    InterlockedMin( ldsZMin, z );
  }

  GroupMemoryBarrierWithGroupSync();

  float maxZ = asfloat( ldsZMax );
  float minZ = asfloat( ldsZMin );
}

Thomas15

Parallel Reduction:Thomas15

Algorthm:Thomas15

depth[tid] = min(depth[tid], depth[tid + 8])

depth[tid] = min(depth[tid], depth[tid + 4])

depth[tid] = min(depth[tid], depth[tid + 2])

depth[tid] = min(depth[tid], depth[tid + 1])

Thomas15
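The four halving steps above are the tail of a standard parallel min-reduction. A Python sketch of the pattern (sequential here; on the GPU each inner-loop iteration runs on its own thread):

```python
def reduce_min(values):
    """Parallel-reduction pattern: repeatedly fold the upper half onto
    the lower half; values[0] ends up as the minimum.
    Assumes a power-of-two element count, like a full thread group."""
    vals = list(values)
    stride = len(vals) // 2
    while stride > 0:
        for tid in range(stride):  # these iterations run in parallel on GPU
            vals[tid] = min(vals[tid], vals[tid + stride])
        stride //= 2
    return vals[0]
```

`reduce_min([9, 3, 7, 1, 8, 2, 6, 5])` returns `1` after three halving passes; the max reduction is identical with `max`.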

Implementation:Thomas15

groupshared uint ldsZMin[64];
groupshared uint ldsZMax[64];

[numthreads(8, 8, 1)]
void CalculateDepthBoundsCS(uint3 globalIdx : SV_DispatchThreadID, uint3 localIdx : SV_GroupThreadID, uint3 groupIdx : SV_GroupID)
{
  uint2 sampleIdx = globalIdx.xy * 2;

  float depth00 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x,     sampleIdx.y,     0)).x;
  float depth01 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x,     sampleIdx.y + 1, 0)).x;
  float depth10 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x + 1, sampleIdx.y,     0)).x;
  float depth11 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x + 1, sampleIdx.y + 1, 0)).x;

  float viewPosZ00 = ConvertProjDepthToView(depth00);
  float viewPosZ01 = ConvertProjDepthToView(depth01);
  float viewPosZ10 = ConvertProjDepthToView(depth10);
  float viewPosZ11 = ConvertProjDepthToView(depth11);

  float minZ00 = (depth00 != 0.f) ? viewPosZ00 : FLT_MAX;  float maxZ00 = (depth00 != 0.f) ? viewPosZ00 : 0.0f;
  float minZ01 = (depth01 != 0.f) ? viewPosZ01 : FLT_MAX;  float maxZ01 = (depth01 != 0.f) ? viewPosZ01 : 0.0f;
  float minZ10 = (depth10 != 0.f) ? viewPosZ10 : FLT_MAX;  float maxZ10 = (depth10 != 0.f) ? viewPosZ10 : 0.0f;
  float minZ11 = (depth11 != 0.f) ? viewPosZ11 : FLT_MAX;  float maxZ11 = (depth11 != 0.f) ? viewPosZ11 : 0.0f;

  uint threadNum = localIdx.x + localIdx.y * 8;

  ldsZMin[threadNum] = min(minZ00, min(minZ01, min(minZ10, minZ11)));
  ldsZMax[threadNum] = max(maxZ00, max(maxZ01, max(maxZ10, maxZ11)));

  GroupMemoryBarrierWithGroupSync();

  if (threadNum < 32)
  {
    ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 32]);  ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 32]);
    ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 16]);  ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 16]);
    ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 8]);   ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 8]);
    ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 4]);   ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 4]);
    ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 2]);   ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 2]);
    ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 1]);   ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 1]);
  }

  GroupMemoryBarrierWithGroupSync();

  if (threadNum == 0)
  {
    g_DepthBounds[groupIdx.xy] = float2(ldsZMin[0], ldsZMax[0]);
  }
}

Thomas15

Depth bounds calculation:Stewart15

Texture2D<float> g_SceneDepthBuffer;

// Thread Group Shared Memory (aka local data share, or LDS)
groupshared uint ldsZMin;
groupshared uint ldsZMax;

// Convert a depth value from postprojection space
// into view space
float ConvertProjDepthToView(float z)
{
  return (1.f / (z * g_mProjectionInv._34 + g_mProjectionInv._44));
}

#define TILE_RES (16)
[numthreads(TILE_RES, TILE_RES, 1)]
void CullLightsCS(uint3 globalIdx : SV_DispatchThreadID,
                  uint3 localIdx  : SV_GroupThreadID,
                  uint3 groupIdx  : SV_GroupID)
{
  float depth = g_SceneDepthBuffer.Load(uint3(globalIdx.x, globalIdx.y, 0)).x;
  float viewPosZ = ConvertProjDepthToView(depth);
  uint z = asuint(viewPosZ);

  uint threadNum = localIdx.x + localIdx.y * TILE_RES;

  // There is no way to initialize shared memory at compile time, so thread zero does it at runtime
  if (threadNum == 0)
  {
    ldsZMin = 0x7f7fffff; // FLT_MAX as a uint
    ldsZMax = 0;
  }
  GroupMemoryBarrierWithGroupSync();

  // Parts of the depth buffer that were never written
  // (e.g., the sky) will be zero (the companion code uses
  // inverted 32-bit float depth for better precision).
  if (depth != 0.f)
  {
    // Calculate the minimum and maximum depth for this tile
    // to form the front and back of the frustum
    InterlockedMin(ldsZMin, z);
    InterlockedMax(ldsZMax, z);
  }
  GroupMemoryBarrierWithGroupSync();

  float minZ = asfloat(ldsZMin);
  float maxZ = asfloat(ldsZMax);

  // Frustum plane  and intersection code goes here
  ...
}

Light Culling

Frustum planes calculation:Stewart15

// Plane equation from three points, simplified
// for the case where the first position is the origin.
// N is normalized so that the plane equation can
// be used to compute signed distance
float4 CreatePlaneEquation(float3 Q, float3 R)
{
  // N = normalize(cross(Q-P, R-P))
  // except we know P is the origin
  float3 N = normalize(cross(Q, R));
  // D = -(N dot P), except we know P is the origin
  return float4(N, 0);
}

// Convert a point from postprojection space into view space
float3 ConvertProjToView(float4 p)
{
  p = mul(p, g_mProjectionInv);
  return (p/p.w).xyz;
}

void CullLightsCS(uint3 globalIdx : SV_DispatchThreadID,
                  uint3 localIdx  : SV_GroupThreadID,
                  uint3 groupIdx  : SV_GroupID)
{
  // Depth bounds code goes here
  ...
  float4 frustumEqn[4];
  { // Construct frustum planes for this tile
    uint pxm = TILE_RES * groupIdx.x;
    uint pym = TILE_RES * groupIdx.y;
    uint pxp = TILE_RES * (groupIdx.x + 1);
    uint pyp = TILE_RES * (groupIdx.y + 1);
    uint width = TILE_RES * GetNumTilesX();
    uint height = TILE_RES * GetNumTilesY();

    // Four corners of the tile, clockwise from top-left
    float3 p[4];
    p[0] = ConvertProjToView(float4(pxm / (float) width * 2.f - 1.f, (height - pym) / (float) height * 2.f - 1.f, 1.f, 1.f));
    p[1] = ConvertProjToView(float4(pxp / (float) width * 2.f - 1.f, (height - pym) / (float) height * 2.f - 1.f, 1.f, 1.f));
    p[2] = ConvertProjToView(float4(pxp / (float) width * 2.f - 1.f, (height - pyp) / (float) height * 2.f - 1.f, 1.f, 1.f));
    p[3] = ConvertProjToView(float4(pxm / (float) width * 2.f - 1.f, (height - pyp) / (float) height * 2.f - 1.f, 1.f, 1.f));

    // Create plane equations for the four sides, with
    // the positive half-space outside the frustum
    for (uint i = 0; i < 4; ++i)
    {
      frustumEqn[i] = CreatePlaneEquation(p[i], p[(i + 1) & 3]);
    }
  }

  // Intersection code goes here
  ...
}

Intersection testing:Stewart15

Buffer<float4> g_LightBufferCenterAndRadius;

#define MAX_NUM_LIGHTS_PER_TILE (256)
groupshared uint ldsLightIdxCounter;
groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];

// Point-plane distance, simplified for the case where
// the plane passes through the origin
float GetSignedDistanceFromPlane(float3 p, float4 eqn)
{
  // dot(eqn.xyz, p) + eqn.w, except we know eqn.w is zero
  return dot(eqn.xyz, p);
}

#define NUM_THREADS (TILE_RES * TILE_RES)
void CullLightsCS(...)
{
  // Depth bounds and frustum planes code goes here
  ...
  if (threadNum == 0)
  {
    ldsLightIdxCounter = 0;
  }
  GroupMemoryBarrierWithGroupSync();

  // Loop over the lights and do a
  // sphere versus frustum intersection test
  for (uint i = threadNum; i < g_uNumLights; i += NUM_THREADS)
  {
    float4 p = g_LightBufferCenterAndRadius[i];
    float r = p.w;
    float3 c = mul(float4(p.xyz, 1), g_mView).xyz;

    // Test if sphere is intersecting or inside frustum
    if ((GetSignedDistanceFromPlane(c, frustumEqn[0]) < r) && 
        (GetSignedDistanceFromPlane(c, frustumEqn[1]) < r) && 
        (GetSignedDistanceFromPlane(c, frustumEqn[2]) < r) && 
        (GetSignedDistanceFromPlane(c, frustumEqn[3]) < r) && 
        (-c.z + minZ < r) && (c.z - maxZ < r))
    {
      // Do a thread-safe increment of the list counter
      // and put the index of this light into the list
      uint dstIdx = 0;
      InterlockedAdd(ldsLightIdxCounter, 1, dstIdx);
      ldsLightIdx[dstIdx] = i;
    }
  }
  GroupMemoryBarrierWithGroupSync();
}

AABB

bool TestSphereVsAABB(float3 sphereCenter, float sphereRadius, float3 AABBCenter, float3 AABBHalfSize)
{
  float3 delta = max(0, abs(AABBCenter - sphereCenter) - AABBHalfSize);
  float distSq = dot(delta, delta);
  return distSq <= sphereRadius * sphereRadius;
}
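The same sphere-vs-AABB test, transcribed to plain Python, can be handy for validating the shader logic offline:

```python
def sphere_vs_aabb(sphere_center, radius, box_center, box_half):
    """Squared distance from the sphere center to the box, compared
    against the squared radius (mirrors the HLSL version above)."""
    dist_sq = 0.0
    for c, bc, h in zip(sphere_center, box_center, box_half):
        d = max(0.0, abs(bc - c) - h)  # per-axis distance outside the box
        dist_sq += d * d
    return dist_sq <= radius * radius
```

A unit sphere at `(1.5, 0, 0)` overlaps a unit-half-size box at the origin; moved to `(3, 0, 0)` it does not.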

Rasterization

New Culling Method:Zhdan16

  1. Camera frustum culling
    • Cull lights against camera frustum
    • Split visible lights into “outer” and “inner”
    • Can be done in CPU
  2. Depth buffers creation
    • For each tile:
      • Find and copy max depth for “outer” lights
      • Find and copy min depth for “inner” lights
    • Depth test is a key to high performance!
      • Use [earlydepthstencil] in shader
  3. Rasterization & classification
    • Render light geometry with depth test
      • “outer” - max depth buffer
        • Front faces with direct depth test
      • “inner” - min depth buffer
        • Back faces with inverted depth test
    • Use PS for precise culling and per-tile light list creation

Common light types (light geometry can be replaced with proxy geometry):

  • Point light (omni)
    • Geosphere (2 subdivisions, octa-based)
    • Close enough to sphere
    • Low poly works well at low resolution
    • Equilateral triangles can ease rasterizer's life
  • Directional light (spot)
    • Old CRT-TV
    • Easy for parameterization
      • From a searchlight
      • To a hemisphere
    • Plane part can be used to handle area lights

Advantages:Zhdan16

  • No work for tiles without lights and for occluded lights
  • Coarse culling is almost free
  • Incredible speed up with small lights
  • Complex proxy models can be used!
  • Mathematically it is a branch-and-bound procedure

Compute Shader Implementation

Andersson09

  1. Load gbuffers & depth
  2. Calculate min & max z in threadgroup / tile
    • Using InterlockedMin/Max on groupshared variable
    • Atomics only work on ints
    • But casting works (z is always +)
    • Can skip if we could resolve out min & max z to a texture directly using HiZ / Z Culling
    • groupshared uint minDepthInt;
      groupshared uint maxDepthInt;

      // --- globals above, function below -------

      float depth = depthTexture.Load(uint3(texCoord, 0)).r;
      uint depthInt = asuint(depth);

      minDepthInt = 0xFFFFFFFF;
      maxDepthInt = 0;
      GroupMemoryBarrierWithGroupSync();

      InterlockedMin(minDepthInt, depthInt);
      InterlockedMax(maxDepthInt, depthInt);
      GroupMemoryBarrierWithGroupSync();

      float minGroupDepth = asfloat(minDepthInt);
      float maxGroupDepth = asfloat(maxDepthInt);
  3. Determine visible light sources for each tile
    • Cull all light sources against tile "frustum"
      • Light sources can either naively be all light sources in the scene, or CPU frustum culled potentially visible light sources
    • Input (global):Andersson11
      • Light list, frustum & SW occlusion culled
    • Output for each tile is:
      • # of visible light sources
      • Index list of visible light sources
      • Lights Indices
        Global list 1000+ 0 1 2 3 4 5 6 7 8 …
        Tile visible list ~0-40+ 0 2 5 6 8 …
      • Key part of the algorithm and compute shader
      1. Each thread switches to process light sources instead of a pixel
        • Wow, parallelism switcheroo!
        • 256 light sources in parallel per tile
        • Multiple iterations for >256 lights
      2. Intersect light source & tile
        • Many variants dep. on accuracy requirements & performance
        • Tile min & max z is used as a shader "depth bounds" test
      3. For visible lights, append light index to index list
        • Atomic add to threadgroup shared memory. "inlined stream compaction"
        • Prefix sum + stream compaction should be faster than atomics, but more limiting
      4. Switch back to processing pixels
        • Synchronize the thread group
        • We now know which light sources affect the tile
    • struct Light
      {
          float3 pos;
          float sqrRadius;
          float3 color;
          float invSqrRadius;
      };
      int lightCount;
      StructuredBuffer<Light> lights;

      groupshared uint visibleLightCount = 0;
      groupshared uint visibleLightIndices[1024];

      // ----- globals above, cont. function below ---------
      uint threadCount = BLOCK_SIZE * BLOCK_SIZE;
      uint passCount = (lightCount + threadCount - 1) / threadCount;

      for (uint passIt = 0; passIt < passCount; ++passIt)
      {
          uint lightIndex = passIt * threadCount + groupIndex;

          // prevent overrun by clamping to a last "null" light
          lightIndex = min(lightIndex, lightCount);

          if (intersects(lights[lightIndex], tile))
          {
              uint offset;
              InterlockedAdd(visibleLightCount, 1, offset);
              visibleLightIndices[offset] = lightIndex;
          }
      }

      GroupMemoryBarrierWithGroupSync();
  4. For each pixel, accumulate lighting from visible lights
    • Read from tile visible light index list in threadgroup shared memory
    • float3 diffuseLight = 0;
      float3 specularLight = 0;

      for (uint lightIt = 0; lightIt < visibleLightCount; ++lightIt)
      {
          uint lightIndex = visibleLightIndices[lightIt];
          Light light = lights[lightIndex];

          evaluateAndAccumulateLight(
              light,
              gbufferParameters,
              diffuseLight,
              specularLight
          );
      }
  5. Combine lighting & shading albedos / parameters
    • Output is non-MSAA HDR texture
    • Render transparent surfaces on top
    • float3 color = 0;

      for (uint lightIt = 0; lightIt < visibleLightCount; ++lightIt)
      {
          uint lightIndex = visibleLightIndices[lightIt];
          Light light = lights[lightIndex];

          color += diffuseAlbedo * evaluateLightDiffuse(light, gbuffer);
          color += specularAlbedo * evaluateLightSpecular(light, gbuffer);
      }
      Andersson11

Optimizations

Depth range optimizationOlssonAssarsson11

Compute min and max Z value for each tile. This requires access to the z buffer.

Half Z MethodStewart15
// Test if sphere is intersecting or inside frustum
if ((GetSignedDistanceFromPlane(c, frustumEqn[0]) < r) && 
    (GetSignedDistanceFromPlane(c, frustumEqn[1]) < r) && 
    (GetSignedDistanceFromPlane(c, frustumEqn[2]) < r) && 
    (GetSignedDistanceFromPlane(c, frustumEqn[3]) < r) && 
    (-c.z + minZ < r) && (c.z - maxZ < r))
{
  if (-c.z + minZ < r && c.z - halfZ < r)
  {
    // Do a thread-safe increment of the list counter
    // and put the index of this light into the list
    uint dstIdx = 0;
    InterlockedAdd(ldsLightIdxCounterA, 1, dstIdx);
    ldsLightIdxA[dstIdx] = i;
  }
  if (-c.z + halfZ < r && c.z - maxZ < r)
  {
    // Do a thread-safe increment of the list counter
    // and put the index of this light into the list
    uint dstIdx = 0;
    InterlockedAdd(ldsLightIdxCounterB, 1, dstIdx);
    ldsLightIdxB[dstIdx] = i;
  }
}
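The effect of the Half Z split can be checked on the CPU. A Python sketch mirroring the two shader-side tests; the `(center_z, radius)` light format is an assumption for illustration:

```python
def half_z_split(lights, min_z, max_z):
    """Partition lights (center_z, radius) into near and far lists
    around halfZ, like the two shader-side LDS lists A and B."""
    half_z = 0.5 * (min_z + max_z)
    list_a, list_b = [], []
    for i, (cz, r) in enumerate(lights):
        if -cz + min_z < r and cz - half_z < r:
            list_a.append(i)  # overlaps the [minZ, halfZ] half
        if -cz + half_z < r and cz - max_z < r:
            list_b.append(i)  # overlaps the [halfZ, maxZ] half
    return list_a, list_b

# Tile depth range [0, 10], so halfZ = 5
list_a, list_b = half_z_split([(2.0, 1.0), (8.0, 1.0), (5.0, 1.0)], 0.0, 10.0)
```

A light straddling halfZ (index 2 here) lands in both lists, which is exactly why the split tightens culling for the other lights without losing any.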
Parallel ReductionStewart15
Texture2D<float> g_SceneDepthBuffer;
RWTexture2D<float4> g_DepthBounds;

#define TILE_RES (16)
#define NUM_THREADS_1D (TILE_RES / 2)
#define NUM_THREADS (NUM_THREADS_1D * NUM_THREADS_1D)

// Thread Group Shared Memory (aka local data share, or LDS)
groupshared float ldsZMin[NUM_THREADS];
groupshared float ldsZMax[NUM_THREADS];

// Convert a depth value from postprojection space
// into view space
float ConvertProjDepthToView(float z)
{
  return (1.f / (z * g_mProjectionInv._34 + g_mProjectionInv._44));
}

[numthreads(NUM_THREADS_1D, NUM_THREADS_1D, 1)]
void DepthBoundsCS( uint3 globalIdx : SV_DispatchThreadID,
                    uint3 localIdx  : SV_GroupThreadID,
                    uint3 groupIdx  : SV_GroupID)
{
  uint2 sampleIdx = globalIdx.xy * 2;

  // Load four depth samples
  float depth00 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x,     sampleIdx.y,     0)).x;
  float depth01 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x,     sampleIdx.y + 1, 0)).x;
  float depth10 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x + 1, sampleIdx.y,     0)).x;
  float depth11 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x + 1, sampleIdx.y + 1, 0)).x;

  float viewPosZ00 = ConvertProjDepthToView(depth00);
  float viewPosZ01 = ConvertProjDepthToView(depth01);
  float viewPosZ10 = ConvertProjDepthToView(depth10);
  float viewPosZ11 = ConvertProjDepthToView(depth11);

  uint threadNum = localIdx.x + localIdx.y * NUM_THREADS_1D;

  // Use parallel reduction to calculate the depth bounds
  {
    // Parts of the depth buffer that were never written
    // (e.g., the sky) will be zero (the companion code uses
    // inverted 32-bit float depth for better precision)
    float minZ00 = (depth00 != 0.f) ? viewPosZ00 : FLT_MAX;
    float minZ01 = (depth01 != 0.f) ? viewPosZ01 : FLT_MAX;
    float minZ10 = (depth10 != 0.f) ? viewPosZ10 : FLT_MAX;
    float minZ11 = (depth11 != 0.f) ? viewPosZ11 : FLT_MAX;

    float maxZ00 = (depth00 != 0.f) ? viewPosZ00 : 0.0f;
    float maxZ01 = (depth01 != 0.f) ? viewPosZ01 : 0.0f;
    float maxZ10 = (depth10 != 0.f) ? viewPosZ10 : 0.0f;
    float maxZ11 = (depth11 != 0.f) ? viewPosZ11 : 0.0f;

    // Initialize shared memory
    ldsZMin[threadNum] = min(minZ00, min(minZ01, min(minZ10, minZ11)));
    ldsZMax[threadNum] = max(maxZ00, max(maxZ01, max(maxZ10, maxZ11)));
    GroupMemoryBarrierWithGroupSync();

    // Minimum and maximum using parallel reduction, with the 
    // loop manually unrolled for 8x8 thread groups (64 threads
    // per thread group)
    if (threadNum < 32)
    {
      ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 32]);
      ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 32]);
      ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 16]);
      ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 16]);
      ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 8]);
      ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 8]);
      ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 4]);
      ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 4]);
      ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 2]);
      ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 2]);
      ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum + 1]);
      ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum + 1]);
    }
  }
  GroupMemoryBarrierWithGroupSync();

  float minZ = ldsZMin[0];
  float maxZ = ldsZMax[0];
  float halfZ = 0.5f * (minZ + maxZ);

  // Calculate a second set of depth values: the maximum
  // on the near side of Half Z and the minimum on the far
  // side of Half Z
  {
    // See the companion code for details
    ...
  }

  // The first thread writes to the depth bounds texture
  if (threadNum == 0)
  {
    float maxZ2 = ldsZMax[0];
    float minZ2 = ldsZMin[0];

    g_DepthBounds[groupIdx.xy] = float4(minZ, maxZ2, minZ2, maxZ);
  }
}

Light Pre-Pass Renderer

This is the second rendering pass where we store light properties of all lights in a light buffer (aka L-Buffer).EngelShaderX709

LightPrePassRenderer EngelShaderX709

Compared to a deferred renderer, the light pre-pass renderer offers more flexibility regarding material implementations. Compared to a Z pre-pass renderer, it offers less flexibility but a flexible and fast multi-light solution.EngelShaderX709

Because the light buffer only has to hold light properties, the cost of rendering one light source is lower than for a similar setup in a deferred renderer.EngelShaderX709

Version AEngelSiggraph09

Version BEngelSiggraph09

Similar to S.T.A.L.K.E.R: Clear Skies

Dragon Age IIPapathanasis11

Clustered Shading

Clustered shading explores higher dimensional tiles, which we collectively call clusters. Each cluster has a fixed maximum 3D extent.OlssonBilleterAssarssonHpg12

Deferred Algorithm:OlssonBilleterAssarssonHpg12

  1. Render scene to G-Buffers
  2. Cluster assignment
  3. Find unique clusters
  4. Assign lights to clusters
  5. Shade samples

Advantages:

Avanlanche solution:Persson15

Data Structure
int3 tex_coord = int3(In.Position.xy, 0); // Screen-space position ...
float depth = Depth.Load(tex_coord);      // ... and depth

int slice = int(max(log2(depth * ZParam.x + ZParam.y) * scale + bias, 0));  // Look up cluster
int4 cluster_coord = int4(tex_coord >> 6, slice, 0);  // TILE_SIZE = 64

uint2 light_data = LightLookup.Load(cluster_coord); // Fetch light list
uint light_index = light_data.x;                    // Extract parameters
const uint point_light_count = light_data.y & 0xFFFF;
const uint spot_light_count  = light_data.y >> 16;

for (uint pl = 0; pl < point_light_count; ++pl)  // Point lights
{
  uint index = LightIndices[light_index++].x;

  float3 LightPos = PointLights[index].xyz;
  float3 Color = PointLights[index + 1].rgb;
  // Compute pointlight here ...
}

for (uint sl = 0; sl < spot_light_count; ++sl)    // Spot lights
{
  uint index = LightIndices[light_index++].x;

  float3 LightPos = SpotLights[index].xyz;
  float3 Color = SpotLights[index + 1].rgb;
  // Compute spotlight here ...
}

Persson15
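The `slice` lookup above folds exponential depth slicing into a `log2` plus a multiply-add. A plain-Python equivalent of the underlying slicing scheme (not the exact `ZParam` encoding Avalanche uses):

```python
import math

def z_slice(view_z, near, far, num_slices):
    """Exponential depth slicing commonly used for cluster lookup:
    slice boundaries are spaced logarithmically between near and far."""
    scale = num_slices / math.log(far / near)
    s = int(math.log(view_z / near) * scale)
    return max(0, min(num_slices - 1, s))  # clamp to valid slice range
```

With `near = 1`, `far = 1000`, and 16 slices, depth 1 maps to slice 0 and depth 999 to slice 15; logarithmic spacing keeps near clusters thin where depth precision matters most.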

Cluster Assignment

Granite:

Shadow of the Tomb Raider:Moradin19

Avalanche Studios:Persson15

for (int z = z0; z <= z1; ++z)
{
  float4 z_light = light;
  if (z != center_z)
  {
    const ZPlane& plane = (z < center_z) ? z_planes[z + 1] : -z_planes[z];
    z_light = project_to_plane(z_light, plane);
  }

  for (int y = y0; y < y1; ++y)
  {
    float3 y_light = z_light;

    if (y != center_y)
    {
      const YPlane& plane = (y < center_y) ? y_planes[y + 1] : -y_planes[y];
      y_light = project_to_plane(y_light, plane);
    }

    int x = x0;
    do
    {
      ++x;
    } while (x < x1 && GetDistance(x_planes[x], y_light_pos) >= y_light_radius);

    int xs = x1;
    do
    {
      --xs;
    } while (xs >= x && -GetDistance(x_planes[xs], y_light_pos) >= y_light_radius);

    for (--x; x <= xs; ++x)
    {
      light_lists.AddPointLight(base_cluster + x, light_index);
    }
  }
}
Sparse vs Dense Cluster GridOlsson15
Explicit vs Implicit ClusterOlsson15

Finding Unique Clusters

Light AssignmentOlssonBilleterAssarssonHpg12

ShadingOlssonBilleterAssarssonHpg12

To match the pixel and the clusters, we need a direct mapping between the cluster key and the index into the list of unique clusters.

In the sorting approach, we explicitly store this index for each pixel. When the unique cluster is established, store the index to the correct pixel in a full screen buffer.

Cluster Key PackingOlssonBilleterAssarssonHpg12

Allocate 8 bits to each i and j components, which allows up to 8192 × 8192 size RTs. Depth index k is determined from settings for the near and far planes and ClusterK.

The paper uses 10 bits, 4 bits for the actual depth data, and 6 bits for the optional normal clustering.
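One possible packing of the cluster key, assuming 8 bits for each of the i and j components with the depth index k in the bits above them (the exact bit order is an assumption, not taken from the paper):

```python
def pack_cluster_key(i, j, k):
    """Pack tile coords (i, j) into 8 bits each, depth index k above."""
    assert 0 <= i < 256 and 0 <= j < 256
    return (k << 16) | (j << 8) | i

def unpack_cluster_key(key):
    """Recover (i, j, k) from a packed cluster key."""
    return key & 0xFF, (key >> 8) & 0xFF, key >> 16
```

The round trip is lossless, which is the property the tile sort below relies on.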

Tile SortingOlssonBilleterAssarssonHpg12

To the cluster key we attach an additional 10 bits of meta-data, which identifies the sample's original position relative to its tile. We perform a tile-local sort of the cluster keys and the associated meta-data. The sort only considers the up to 16 bits of the cluster key; the meta-data is used as a link back to the original sample after sorting. In each tile, we count the number of unique cluster keys. Using a prefix operation over the counts from each tile, we find the total number of unique cluster keys and assign each cluster a unique ID in the range [0…numClusters). We write the unique ID back to each pixel that is a member of the cluster. The unique ID also serves as an offset in memory to where the cluster's data is stored.

Shadows

Conservative Rasterization

Algorithm:OrtegrenPersson16

Light Shape Representation:

Shell Pass
Fill Pass
Source Code Analysis

Root Signature:

Draw:

Future Work

Alternative Implementations

Per-Pixel Linked ListBezrati14

struct LightFragmentLink
{
  float m_LightDepthMax;
  float m_LightDepthMin;
  
  uint m_LightIndex;
  uint m_Next;
};
struct LightFragmentLink
{
  uint m_DepthInfo;
  uint m_IndexNext;
};

Light Linked List (LLL)

Depth Test:

// If Z test fails for the front face, skip all fragments
if ((pFace == true) & (light_depth > depth_buffer))
{
  return;
}
struct LightFragmentLink
{
  uint m_DepthInfo; // High bits min depth, low bits max depth
  uint m_IndexNext; // Light index and link to the next fragment
};
RWStructuredBuffer<LightFragmentLink> g_LightFragmentLinkedBuffer;
// Allocate
uint new_lll_idx = g_LightFragmentLinkedBuffer.IncrementCounter();

// Don't overflow
if (new_lll_idx >= g_VP_LLLMaxCount)
{
  return;
}
// Final output
LightFragmentLink element;

// Pack the light depth
element.m_DepthInfo = (light_depth_min << 16) | light_depth_max;

// Index / Link
element.m_IndexNext = (light_index << 24) | (prev_lll_idx & 0xFFFFFF);

// Store the element
g_LightFragmentLinkedBuffer[new_lll_idx] = element;
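The two packed words can be sanity-checked with a small round-trip sketch in Python (half-float conversion omitted; the depth values here are already 16-bit bit patterns):

```python
def pack_fragment_link(depth_min, depth_max, light_index, prev_link):
    """Mirror the shader's packing: two 16-bit depth patterns in one
    uint, 8-bit light index plus 24-bit next pointer in the other."""
    depth_info = ((depth_min & 0xFFFF) << 16) | (depth_max & 0xFFFF)
    index_next = ((light_index & 0xFF) << 24) | (prev_link & 0xFFFFFF)
    return depth_info, index_next

def unpack_fragment_link(depth_info, index_next):
    """Recover (depth_min, depth_max, light_index, next_link)."""
    return (depth_info >> 16, depth_info & 0xFFFF,
            index_next >> 24, index_next & 0xFFFFFF)
```

The 24-bit link field caps the linked-list buffer at 16M fragments, and the 8-bit index field caps the scene at 256 lights per frame under this layout.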

Lighting the G-Buffer

Accessing the SRVs:

uint src_index = LLLIndexFromScreenUVs(screen_uvs);
uint first_offset = g_LightStartOffsetView[src_index];

// Decode the first element index
uint element_index = (first_offset & 0xFFFFFF);

Light Loop:

// Iterate over the light linked list
while (element_index != 0xFFFFFF)
{
  // Fetch
  LightFragmentLink element = g_LightFragmentLinkedView[element_index];

  // Update the next element index
  element_index = (element.m_IndexNext & 0xFFFFFF);
}

Decoding light depth:

// Decode the light bounds
float light_depth_max = f16tof32(element.m_DepthInfo >> 0);
float light_depth_min = f16tof32(element.m_DepthInfo >> 16);

// Do depth bounds check
if ((l_depth > light_depth_max) || (l_depth < light_depth_min))
{
  continue;
}

Access light info:

// Decode the light index
uint light_index = (element.m_IndexNext >> 24);

// Access
GPULightEnv light_env = g_LinkedLightsEnvs[light_index];

// Detect the light type
switch (light_env.m_LightType)
{
  ...

3D Light GridAnagnostou17

Optimizations

The most important optimization for the lighting pass is to render only those lights that actually affect the final image, and for those lights, render only the affected pixels.Shishkovtsov05Thibieroz11

  1. Social Stage:
    • Filter the lights and effects on the scene to produce a smaller list of sources to be processed
      1. Execute visibility and occlusion algorithms to discard lights whose influence is not appreciable
      2. Project visible sources bounding objects into screen space
      3. Combine similar sources that are too close in screen space or influence almost the same screen area
      4. Discard sources with a tiny contribution because of their projected bounding object being too small or too far
      5. Check that more than a predefined number of sources do not affect each screen region. Choose the biggest, strongest, and closest sources.
  2. Individual Stage:
    1. Select the appropriate level of detail.
    2. Enable and configure the source shaders
    3. Compute the minimum and maximum screen cord values of the projected bounding object
    4. Enable the scissor test
    5. Enable the clipping planes
    6. Render a screen quad or the bounding object
int lightCounter[4] = { count, start, step, 0 };
pDevice->SetPixelShaderConstantI(0, lightCounter, 1);
// NO

int tileLightCount : register(i0);
float4 lightParams[NUM_LIGHT_PARAMS] : register(c0);

[loop]
for (int iLight = 0;  // start
     iLight < tileLightCount; // count * step
     ++iLight)  // step
{
  float4 params1 = lightParams[iLight + 0]; // mov r0 c0[0 + aL]
  float4 params2 = lightParams[iLight + 1]; // mov r1 c0[1 + aL]
  float4 params3 = lightParams[iLight + 2]; // mov r2 c0[2 + aL]
}

WhiteBarreBrisebois11

Sun Rendering

S.T.A.L.K.E.R case:Shishkovtsov05

Killzone 2 case:Valient07

Level of Detail Lighting

  • LoD to decide how many instructions per pixel:Placeres06
    • Closest: Perform both diffuse and specular
    • Normal: Diffuse + Specular * t
    • Far: Diffuse
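The three distance bands above can be sketched as a simple LoD pick; the thresholds and names here are made up for illustration:

```cpp
#include <cassert>

// Illustrative distance-based lighting LoD following the three bands
// above: full diffuse + specular up close, faded specular mid-range,
// diffuse only far away. Thresholds are assumptions for the example.
enum class LightingLod { DiffuseSpecular, DiffuseSpecularFaded, DiffuseOnly };

LightingLod PickLightingLod(float distance) {
    if (distance < 10.0f) return LightingLod::DiffuseSpecular;
    if (distance < 30.0f) return LightingLod::DiffuseSpecularFaded; // Specular * t
    return LightingLod::DiffuseOnly;
}
```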

Blending Cost

Shadows

Shadow Maps

The key is using the little-used variant known as forward shadow mapping. With forward shadow mapping, the object's position is projected into shadow-map space and the depths are compared there.Calver03Thibieroz04

The first step is to calculate the shadow map; this is exactly the same as a conventional renderer.Calver03

When the light that generated the shadow map is rendered, the shadow map is attached to the light shader in the standard fashion (a cube map for the point light case).Calver03
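The depth comparison at the core of forward shadow mapping can be sketched minimally. This assumes the position has already been transformed into light-space UV + depth; the tiny 4x4 map and the bias value are illustrative, not from the article:

```cpp
#include <array>
#include <cassert>

// Minimal sketch of the forward shadow-mapping test described above:
// the stored shadow-map depth (nearest occluder from the light) is
// compared against the projected depth of the point being shaded.
constexpr int kMapSize = 4;
using ShadowMap = std::array<std::array<float, kMapSize>, kMapSize>;

// u, v in [0,1) light-space UV; depth in light space.
bool InShadow(const ShadowMap& map, float u, float v, float depth,
              float bias = 0.01f) {
    int x = (int)(u * kMapSize);
    int y = (int)(v * kMapSize);
    // Shadowed if a stored occluder is closer to the light than we are.
    return map[y][x] + bias < depth;
}
```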

Efficient Omni Lights

Three major options:Shishkovtsov05

| |Cube Map|Virtual Shadow Depth Cube Texture|Six Spotlights|
|---|---|---|---|
|Scalability and Continuity|Low: few fixed sizes; all faces are the same|Moderate: faces can be of different sizes, but only from a few fixed sets|Excellent: any variation of sizes is possible|
|Hardware Filtering Support|No|Yes|Yes|
|Cost of Filtering|Moderate|Excellent for bilinear; moderate for arbitrary percentage-closer filtering|Excellent|
|Render Target Switches|Six|One|One|
|Packing Support|No|Yes|Yes|
|Cost of Screen Space Stencil Masking|Low|Low|Moderate: some stencil overdraw|
|Memory Cost and Bandwidth Usage|High: surface is almost unusable for everything else|Moderate: few fixed sizes limits packing ability|Excellent|

Shishkovtsov05

Post Processing Phase

HDR

Render your scene to multiple 32 bit buffers, then use a 64 bit accumulation buffer during the light phase.Hargreaves04

Minor Architectures

The X-Ray Rendering ArchitectureLobanchikovGruen09

  1. G-Stage
  2. Light Stage
  3. Light Combine
  4. Transparent Objects
  5. Bloom/Exposition
  6. Final Combine-2
  7. Post-Effects

G-Stage

Light Stage

Light Combine

Transparent Objects

Bloom / exposition

Final combine-2

Post-Effects

Light Indexed Deferred Rendering

Three basic render passes:Trebilco09

  1. Render depth only pre-pass
  2. Disable depth writes (depth testing only) and render light volumes into a light index texture
    • Standard deferred lighting / shadow volume techniques can be used to find what fragments are hit by each light volume
  3. Render geometry using standard forward rendering
    • Lighting is done using the light index texture to access lighting properties in each shader

In order to support multiple light indexes per fragment, it would be ideal to store the first light index in the texture’s red channel, the second light index in the green channel, etc.Trebilco09
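The per-channel storage described above amounts to packing four 8-bit light indices into one 32-bit RGBA8 texel. A sketch of the packing, with illustrative function names:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of light-index-texture packing: up to four 8-bit light indices
// per fragment, one per channel of an RGBA8 target (slot 0..3 = R,G,B,A).
uint32_t PackLightIndices(uint8_t i0, uint8_t i1, uint8_t i2, uint8_t i3) {
    return (uint32_t)i0 | ((uint32_t)i1 << 8) |
           ((uint32_t)i2 << 16) | ((uint32_t)i3 << 24);
}

uint8_t LightIndexAt(uint32_t packed, int slot) {
    return (uint8_t)(packed >> (slot * 8));
}
```

Eight-bit channels cap the scene at 256 distinct lights per index table; the forward geometry pass reads the channels back to fetch each light's properties.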

Matt Pettineo’s approachPettineo12

Space MarineKimBarrero11

|Pass|Budget (ms)|
|---|---|
|Depth-Pre|0.50|
|G-Buffer + Linear Depth|5.05|
|AO|2.25|
|Lighting|8.00|
|Combiner Pass|5.00|
|Blend|0.15|
|Gamma Conversion|1.30|
|FX|2.75|
|Post Processing|3.70|
|UI|0.50|
|Total|29.20|

Screen-Space ClassificationKnightRitchieParrish11

The screen is divided into 4 × 4 pixel tiles. Each tile is classified according to the minimum global light properties it requires:

  1. Sky
    • Fastest pixels because no lighting calculations required
    • Sky color is simply copied directly from the G-Buffer
  2. Sun light
    • Pixels facing the sun requires sun and specular lighting calculations (unless they’re fully in shadow)
  3. Solid shadow
    • Pixels fully in shadow don’t require any shadow or sun light calculations
  4. Soft shadow
    • Pixels at the edge of shadows require expensive eight-tap percentage closer filtering (PCF) unless they face away from the sun
  5. Shadow fade
    • Pixels near the end of the dynamic shadow draw distance fade from full shadow to no shadow to avoid pops as geometry moves out of the shadow range
  6. Light scattering
    • All but the nearest pixels
  7. Antialiasing
    • Pixels at the edges of polygons require lighting calculations for both 2X MSAA fragments

Four of these classifications are done during screen-space shadow mask generation; the other three in a per-pixel pass.
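The classification above is a union: each pixel contributes the features it needs, and the tile takes the combined set so its shader only pays for what some pixel in it requires. A sketch with illustrative flag values:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative per-pixel feature flags for the seven classes above.
enum : uint32_t {
    kSky         = 1u << 0,
    kSunLight    = 1u << 1,
    kSolidShadow = 1u << 2,
    kSoftShadow  = 1u << 3,
    kShadowFade  = 1u << 4,
    kScattering  = 1u << 5,
    kAntialias   = 1u << 6,
};

// A 4x4 tile's classification is the union of its 16 pixels' flags;
// the tile is then shaded by the cheapest shader handling that set.
uint32_t ClassifyTile(const uint32_t pixelFlags[16]) {
    uint32_t tile = 0;
    for (int i = 0; i < 16; ++i) tile |= pixelFlags[i];
    return tile;
}
```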

Inferred Lighting

Kircher12

Features:

Hybrid Deferred RenderingSousaWenzelRaine13

Destiny Engine Deferred RenderingTatarchukTchouVenzon13

  1. G-Buffers (96 bits)
    • Depth, normal, material ids
    • Opaque geometries + Decals
    • Highly-compressed
  2. L-Buffers
    • Lighting accumulation
    • Light Geometry
    • Lights
  3. Lit Result
    • Full-screen shading

Rainbow Six SiegeElMansouri16

Opaque Rendering

Shadow Rendering

Lighting

Checkerboard Rendering

Issues

Transparency

The best (in speed terms) we can do currently is to fall-back to a non-deferred lighting system for transparent surfaces and blend them in post-processing.Calver03Hargreaves04

Depth peeling is the ultimate solution, but is prohibitively expensive at least for the time being.Hargreaves04

StarCraft II uses multipass forward approach:FilionMcNaughton08

StarCraft II’s simple layered system:

  1. Opaque Pass
    1. Create depth map from opaque objects
    2. Render opaque objects
    3. Apply depth-dependent post-processing effects
  2. Transparency Pass
    1. Render transparent objects back to front
    2. Key transparencies are allowed to perform a pre-pass where they overwrite the g-buffer
      • Since all post-processing on previous g-buffer data has been applied, that information is no longer needed
    3. Update AO deferred buffer
    4. Render the transparency
    5. Perform DoF pass on the areas covered by the transparency

Memory

No solutions are offered, only a warning that deferred lighting requires a number of large render targets.Calver03

Anti-Aliasing

Antialiasing becomes solely the responsibility of the application and the shader; we cannot rely on the GPU alone.Shishkovtsov05

Edge Detection

Edge-smoothing filter by Fabio05, as described in Placeres06:

  1. Edge-detection scan is applied to the screen. The filter uses discontinuities in the positions and normals stored in the G-Buffer. The results can be stored in the stencil buffer as a mask for the next step.
  2. The screen is blurred using only the pixels that are edges
    • These pixels are masked in the stencil buffer
    • However, color bleeding can occur (e.g., background color bleeding into the character)
    • Thus, a kernel is applied to the edge pixels, but only the closest to the camera are combined
    • Color bleeding reduction

Pixel Edge Detection (Pixel Shader):Thibieroz09

// Pixel shader to detect pixel edges
// Used with the following depth-stencil state values:
// DepthEnable = TRUE
// DepthFunc = Always
// DepthWriteMask = ZERO
// StencilEnable = TRUE
// Front/BackFaceStencilFail = Keep
// Front/BackfaceStencilDepthFail = Keep
// Front/BackfaceStencilPass = Replace;
// Front/BackfaceStencilFunc = Always;
// The stencil reference value is set to 0x80

float4 PSMarkStencilWithEdgePixels( PS_INPUT input ) : SV_TARGET
{
  // Fetch and compare samples from GBuffer to determine if pixel
  // is an edge pixel or not
  bool bIsEdge = DetectEdgePixel(input);

  // Discard pixel if non-edge (only mark stencil for edge pixels)
  if (!bIsEdge) discard;
  
  // Return color (will have no effect since no color buffer bound)
  return float4(1,1,1,1);
}

Centroid-Based Edge Detection

An optimized way to detect edges is to leverage the GPU’s fixed function resolve feature. Centroid sampling is used to adjust the sample position of an interpolated pixel shader input so that it is contained within the area defined by the multisamples covered by the triangle.Thibieroz09

Centroid sampling can be used to determine whether a sample belongs to an edge pixel or not.Thibieroz09

This MSAA edge detection technique is quite fast, especially compared to a custom method of comparing every G-Buffer normal and depth sample. It only requires a few bits of storage in a G-Buffer render target.Thibieroz09

S.T.A.L.K.E.R.Shishkovtsov05

Our solution was to trade some signal frequency at the discontinuities for smoothness, and to leave other parts of the image intact. We detect discontinuities in both depth and normal direction by taking 8+1 samples of depth and finding how depth at the current pixel differs from the ideal line passed through opposite corner points. The normals were used to fix issues such as a wall perpendicular to the floor, where the depth forms a perfect line (or will be similar at all samples) but an aliased edge exists. The normals were processed in a similar cross-filter manner, and the dot product between normals was used to determine the presence of an edge.

struct v2p  
{    
  float4 tc0: TEXCOORD0; // Center    
  float4 tc1: TEXCOORD1; // Left Top      
  float4 tc2: TEXCOORD2; // Right Bottom    
  float4 tc3: TEXCOORD3; // Right Top    
  float4 tc4: TEXCOORD4; // Left Bottom      
  float4 tc5: TEXCOORD5; // Left / Right    
  float4 tc6: TEXCOORD6; // Top /Bottom  
};      

/////////////////////////////////////////////////////////////////////  
uniform sampler2D s_distort;  
uniform half4 e_barrier;  // x=norm(~.8f), y=depth(~.5f)  
uniform half4 e_weights;  // x=norm, y=depth  
uniform half4 e_kernel;   // x=norm, y=depth    
/////////////////////////////////////////////////////////////////////  

half4 main(v2p I) : COLOR  
{   
  // Normal discontinuity filter   
  half3 nc = tex2D(s_normal, I.tc0);   
  half4 nd;   
  nd.x = dot(nc, (half3)tex2D(s_normal, I.tc1));   
  nd.y = dot(nc, (half3)tex2D(s_normal, I.tc2));   
  nd.z = dot(nc, (half3)tex2D(s_normal, I.tc3));   
  nd.w = dot(nc, (half3)tex2D(s_normal, I.tc4));   
  nd -= e_barrier.x;   
  nd = step(0, nd);   
  half ne = saturate(dot(nd, e_weights.x));     

  // Opposite coords     
  float4 tc5r = I.tc5.wzyx;   
  float4 tc6r = I.tc6.wzyx;     
  
  // Depth filter : compute gradiental difference:   
  // (c-sample1)+(c-sample1_opposite)   
  half4 dc = tex2D(s_position, I.tc0);   
  half4 dd;   
  dd.x = (half)tex2D(s_position, I.tc1).z +          
    (half)tex2D(s_position, I.tc2).z;   
  dd.y = (half)tex2D(s_position, I.tc3).z +          
    (half)tex2D(s_position, I.tc4).z;   
  dd.z = (half)tex2D(s_position, I.tc5).z +          
    (half)tex2D(s_position, tc5r).z;   
  dd.w = (half)tex2D(s_position, I.tc6).z +          
    (half)tex2D(s_position, tc6r).z;   
  dd = abs(2 * dc.z - dd)- e_barrier.y;   
  dd = step(dd, 0);   
  half de = saturate(dot(dd, e_weights.y));     
  
  // Weight     
  half w = (1 - de * ne) * e_kernel.x; 
  // 0 - no aa, 1=full aa     
  // Smoothed color   
  // (a-c)*w + c = a*w + c(1-w)   
  float2 offset = I.tc0 * (1-w);   
  half4 s0 = tex2D(s_image, offset + I.tc1 * w);   
  half4 s1 = tex2D(s_image, offset + I.tc2 * w);   
  half4 s2 = tex2D(s_image, offset + I.tc3 * w);   
  half4 s3 = tex2D(s_image, offset + I.tc4 * w);   
  return (s0 + s1 + s2 + s3)/4.h;  
} 

Tabula RasaKoonce07

Modified S.T.A.L.K.E.R.’s algorithm to be resolution independent.

We looked at changes in depth gradients and changes in normal angles by sampling all eight neighbors surrounding a pixel. We compare the maximum change in depth to the minimum change in depth to determine how much of an edge is present. By comparing relative changes in this gradient instead of comparing the gradient to fixed values, we are able to make the logic resolution independent.

We compare the changes in the cosine of the angle between the center pixel and its neighboring pixels along the same edges at which we test depth gradients.

The output of the edge detection is a per-pixel weight between zero and one. The weight reflects how much of an edge the pixel is on. We use this weight to do four bilinear samples when computing the final pixel color. The four samples we take are at the pixel center for a weight of zero and at the four corners of the pixel for a weight of one. This results in a weighted average of the target pixel with all eight of its neighbors.

////////////////////////////
// Neighbor offset table
////////////////////////////
const static float2 offsets[9] = 
{   
  float2( 0.0,  0.0), //Center       0    
  float2(-1.0, -1.0), //Top Left     1    
  float2( 0.0, -1.0), //Top          2    
  float2( 1.0, -1.0), //Top Right    3    
  float2( 1.0,  0.0), //Right        4    
  float2( 1.0,  1.0), //Bottom Right 5    
  float2( 0.0,  1.0), //Bottom       6    
  float2(-1.0,  1.0), //Bottom Left  7    
  float2(-1.0,  0.0)  //Left         8 
}; 

float DL_GetEdgeWeight(in float2 screenPos) 
{   
  float Depth[9];   
  float3 Normal[9];   
  
  //Retrieve normal and depth data for all neighbors.    
  for (int i=0; i<9; ++i)   
  {     
    float2 uv = screenPos + offsets[i] * PixelSize;     
    Depth[i] = DL_GetDepth(uv);   //Retrieves depth from MRTs

    Normal[i]= DL_GetNormal(uv);  //Retrieves normal from MRTs 
  }   
  
  //Compute Deltas in Depth.    
  float4 Deltas1;   
  float4 Deltas2;   
  Deltas1.x = Depth[1];   
  Deltas1.y = Depth[2];   
  Deltas1.z = Depth[3];   
  Deltas1.w = Depth[4];   
  Deltas2.x = Depth[5];   
  Deltas2.y = Depth[6];   
  Deltas2.z = Depth[7];   
  Deltas2.w = Depth[8];   
  
  //Compute absolute gradients from center.   
  Deltas1 = abs(Deltas1 - Depth[0]);   
  Deltas2 = abs(Depth[0] - Deltas2);   
  
  //Find min and max gradient, ensuring min != 0    
  float4 maxDeltas = max(Deltas1, Deltas2);   
  float4 minDeltas = max(min(Deltas1, Deltas2), 0.00001);   
  
  // Compare change in gradients, flagging ones that change    
  // significantly.    
  // How severe the change must be to get flagged is a function of the    
  // minimum gradient. It is not resolution dependent. The constant    
  // number here would change based on how the depth values are stored    
  // and how sensitive the edge detection should be.    
  float4 depthResults = step(minDeltas * 25.0, maxDeltas);   
  
  //Compute change in the cosine of the angle between normals.   
  Deltas1.x = dot(Normal[1], Normal[0]);   
  Deltas1.y = dot(Normal[2], Normal[0]);   
  Deltas1.z = dot(Normal[3], Normal[0]);   
  Deltas1.w = dot(Normal[4], Normal[0]);   
  Deltas2.x = dot(Normal[5], Normal[0]);   
  Deltas2.y = dot(Normal[6], Normal[0]);   
  Deltas2.z = dot(Normal[7], Normal[0]);   
  Deltas2.w = dot(Normal[8], Normal[0]);   
  Deltas1 = abs(Deltas1 - Deltas2);   
  
  // Compare change in the cosine of the angles, flagging changes   
  // above some constant threshold. The cosine of the angle is not a    
  // linear function of the angle, so to have the flagging be    
  // independent of the angles involved, an arccos function would be    
  // required.    
  float4 normalResults = step(0.4, Deltas1);   
  normalResults = max(normalResults, depthResults);   
  
  return (normalResults.x + normalResults.y +           
    normalResults.z + normalResults.w) * 0.25; 
} 

MSAA

MSAA allows a scene to be rendered at a higher resolution without having to pay the cost of shading more pixels.Thibieroz09

Run light shader at pixel resolutionValient07

S.T.A.L.K.E.R: Clear Sky:LobanchikovGruen09

For each shader
  Plain pixel: run shader at pixel frequency
  Edge pixel: run at subpixel frequency

LobanchikovGruen09

MSAA Compute Shader Lighting

Comparisons

| |Deferred|Tiled Deferred|Tiled Forward|
|---|---|---|---|
|Innermost loop|Pixels|Lights|Lights|
|Light data access pattern|Sequential|Random|Random|
|Pixel data access pattern|Random|Sequential|Sequential|
|Re-use Shadow Maps|Yes|No|No|
|Shading Pass|Deferred|Deferred^a|Geometry|
|G-Buffers|Yes|Yes|No|
|Overdraw of shading|No|No|Yes|
|Transparency|Difficult|Simple|Simple|
|Supporting FSAA|Difficult|Difficult|Trivial|
|Bandwidth Usage|High|Low|Low|
|Light volume intersection|Per Pixel|Per Tile|Per Tile|

OlssonAssarsson11

^a Apply Tiled Forward for transparent objects

EA. SIGGRAPH. 2011.

|Light Type (8 lights/tile, every tile)|Performance|
|---|---|
|Point|4.0 ms|
|Point (with Spec)|7.8 ms|
|Cone|5.1 ms|
|Cone (with Spec)|5.3 ms|
|Line|5.8 ms|

WhiteBarreBrisebois11

Deferred vs Forward+


References

2003

Photo-realistic Deferred Lighting. Dean Calver, Climax / Snapshot Games. Beyond3D.

2004

Deferred Shading. Shawn Hargreaves, Climax / Microsoft. GDC 2004
Deferred Shading. Shawn Hargreaves, Climax / Microsoft. Mark Harris, NVIDIA. NVIDIA Developers Conference 2004.
Deferred Shading with Multiple Render Targets. Nicolas Thibieroz, PowerVR Technologies / AMD. ShaderX2.

2005

Deferred Shading in S.T.A.L.K.E.R.. Oleksandr Shyshkovtsov, GSC Game World / 4A Games. GPU Gems 2.

2006

Overcoming Deferred Shading Drawbacks. Frank Puig Placeres, University of Informatic Sciences / Amazon. ShaderX5.

2007

Deferred Shading in Tabula Rasa. Rusty Koonce, NCSoft Corporation / Facebook. GPU Gems 3.
Deferred Rendering in Killzone 2. Michal Valient, Guerilla Games / Epic Games. Developer Conference 2007.
Optimizing Parallel Reduction in CUDA. Mark Harris, NVIDIA.

2008

The Technology of Uncharted: Drake’s Fortune. Christophe Balestra, Naughty Dog / Retired. Pål-Kristian Engstad, Naughty Dog / Apple. GDC 2008.
StarCraft II: Effects & Techniques. Dominic Filion, Blizzard Entertainment / Snap Inc.. Rob McNaughton, Blizzard Entertainment. SIGGRAPH 2008: Advances in Real-Time Rendering in 3D Graphics and Games Course.

2009

Parallel Graphics in Frostbite - Current Future. Johan Andersson, DICE / Embark Studios. SIGGRAPH 2009: Beyond Programmable Shading Course.
Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer. Wolfgang Engel, Rockstar Games / The Forge. ShaderX7.
Light Pre-Pass; Deferred Lighting: Latest Development. Wolfgang Engel, Rockstar Games / The Forge. SIGGRAPH 2009: Advances in Real-Time Rendering in Games Course.
Pre-lighting in Resistance 2. Mark Lee, Insomniac Games / Walt Disney Animation Studios. GDC 2009.
GSC Game World’s S.T.A.L.K.E.R: Clear Sky—A Showcase for Direct3D 10.0/1. Igor A. Lobanchikov, GSC Game World / Retired. Holger Gruen, AMD. GDC 2009.
Deferred Lighting and Post Processing on PLAYSTATION 3. Matt Swoboda, Sony Computer Entertainment / Notch. GDC 2009.
Deferred Shading with Multisampling Anti-Aliasing in DirectX 10. Nicolas Thibieroz, AMD. GDC 2009. ShaderX7.
Light-Indexed Deferred Rendering. Damian Trebilco, THQ / Situ Systems. ShaderX7.
Compact Normal Storage for small G-Buffers. Aras Pranckevičius, Unity Technologies / Freelancer. Blog.

2010

CryENGINE 3: Reaching the Speed of Light. Anton Kaplanyan, Crytek / Intel Corporation. SIGGRAPH 2010: Advances in Real-Time Rendering in Games Course.
Deferred Rendering for Current and Future Rendering Pipelines. Andrew Lauritzen, Intel Corporation. SIGGRAPH 2010: Beyond Programmable Shader Course.

2011

DirectX 11 Rendering in Battlefield 3. Johan Andersson, DICE / Embark Studios. GDC 2011
Rendering Tech of Space Marine. Pope Kim, Relic Entertainment / POCU. Daniel Barrero, Relic Entertainment. KGC 2011.
Screen-Space Classification for Efficient Deferred Shading. Balor Knight, Black Rock Studio. Matthew Ritchie, Black Rock Studio. George Parrish, Black Rock Studio. Game Engine Gems 2.
Tiled Shading. Ola Olsson, Chalmers University of Technology / Epic Games. Ulf Assarsson, Chalmers University of Technology. Journal of Graphics, GPU, and Game Tools.
Dragon Age II DX11 Technology. Andreas Papathanasis, BioWare / Parallel Space Inc.. GDC 2011.
Deferred Shading Optimizations. Nicolas Thibieroz, AMD. GDC 2011.
More Performance! Five Rendering Ideas from Battlefield 3 and Need For Speed: The Run. John White, EA Black Box / Roblox. Colin Barré-Brisebois, DICE / SEED. SIGGRAPH 2011: Advances in Real-Time Rendering in Games Course.

2012

Forward+: Bringing Deferred Lighting to the Next Level. Takahiro Harada, AMD. Jay McKee, AMD. Jason C. Yang, AMD / DGene. Eurographics 2012.
A 2.5D Culling for Forward+. Takahiro Harada, AMD. SIGGRAPH ASIA 2012.
Lighting & Simplifying Saints Row: The Third. Scott Kircher, Volition. GDC 2012.
Clustered Deferred and Forward Shading. Ola Olsson, Chalmers University of Technology / Epic Games. Markus Billeter, Chalmers University of Technology. Ulf Assarsson, Chalmers University of Technology. HPG 2012.
Tiled and Clustered Forward Shading: Supporting Transparency and MSAA. Ola Olsson, Chalmers University of Technology / Epic Games. Markus Billeter, Chalmers University of Technology. Ulf Assarsson, Chalmers University of Technology. SIGGRAPH 2012: Talks.
Light Indexed Deferred Rendering. Matt Pettineo, Ready at Dawn. The Danger Zone Blog.

2013

Tiled Forward Shading. Ola Olsson, Chalmers University of Technology / Epic Games. Markus Billeter, Chalmers University of Technology / University of Leeds. Ulf Assarsson, Chalmers University of Technology. GPU Pro 4.
Forward+: A Step Toward Film-Style Shading in Real Time. Takahiro Harada, AMD. Jay McKee, AMD. Jason C. Yang, AMD / DGene. GPU Pro 4.
The Rendering Technologies of Crysis 3. Tiago Sousa, Crytek / id Software. Carsten Wenzel, Crytek / Cloud Imperium Games. Chris Raine, Crytek. GDC 2013.
CryENGINE 3: Graphics Gems. Tiago Sousa, Crytek / id Software. Nickolay Kasyan, Crytek / AMD. Nicolas Schulz, Crytek. SIGGRAPH 2013: Advances in Real-Time Rendering in 3D Graphics and Games Course.
Tiled Rendering Showdown: Forward++ vs. Deferred Rendering. Jason Stewart, AMD. Gareth Thomas, AMD. GDC 2013.
Destiny: From Mythic Science Fiction to Rendering in Real-Time. Natalya Tatarchuk, Bungie / Unity Technologies. Chris Tchou, Bungie. Joe Venzon, Bungie. SIGGRAPH 2013: Advances in Real-Time Rendering in 3D Graphics and Games Course.

2014

inFAMOUS Second Son Engine Postmortem. Adrian Bentley, Sucker Punch Productions. GDC 2014.
Real-Time Lighting via Light Linked List. Abdul Bezrati, Insomniac Games. SIGGRAPH 2014: Advances in Real-Time Rendering in 3D Graphics and Games Course.
Forward Clustered Shading. Marc Fauconneau Dufresne, Intel Corporation. Intel Software Developer Zone.
The Making of Forza Horizon 2. Richard Leadbetter, Digital Foundry. Eurogamer.net.
Crafting a Next-Gen Material Pipeline for The Order: 1886. David Neubelt, Ready at Dawn. Matt Pettineo, Ready at Dawn. GDC 2014.
Notes on Real-Time Renderers. Angelo Pesce, Activision / Roblox. C0DE517E Blog.
Moving to the Next Generation—The Rendering Technology of Ryse. Nicolas Schulz, Crytek. GDC 2014.
Compute Shader Optimizations for AMD GPUs: Parallel Reduction. Wolfgang Engel, Rockstar Games / The Forge. Diary of a Graphics Programmer.
Survey of Efficient Representations for Independent Unit Vectors. Zina H. Cigolle, Williams College / Stripe. Sam Donow, Williams College / Hudson River Trading. Daniel Evangelakos, Williams College / Olive. Michael Mara, Williams College / Luminary Cloud. Morgan McGuire, Williams College / Roblox. Quirin Meyer, Elektrobit / Hochschule Coburg. JCGT.

2015

Real-Time Lighting via Light Linked List. Abdul Bezrati, Insomniac Games. GPU Pro 6.
More Efficient Virtual Shadow Maps for Many Lights. Ola Olsson, Chalmers University of Technology / Epic Games. Markus Billeter, Chalmers University of Technology. Erik Sintorn, Chalmers University of Technology. IEEE Transactions on Visualization and Computer Graphics.
Practical Clustered Shading. Emil Persson, Avalanche Studios / Elemental Games. SIGGRAPH 2015: Real-Time Many-Light Management and Shadows with Clustered Shading Course.
Notes on G-Buffer normal encodings. Angelo Pesce, Activision / Roblox. C0DE517E Blog.
Introduction to Real-Time Shading with Many Lights. Ola Olsson, Chalmers University of Technology / Epic Games. SIGGRAPH 2015: Real-Time Many-Light Management and Shadows with Clustered Shading Course.
Rendering the Alternate History of The Order: 1886. Matt Pettineo, Ready at Dawn. SIGGRAPH 2015: Advances in Real-Time Rendering in Games Course.
Compute-Based Tiled Culling. Jason Stewart, AMD. GPU Pro 6.
Advancements in Tiled-Based Compute Rendering. Gareth Thomas, AMD. GDC 2015.

2016

Deferred Lighting in Uncharted 4. Ramy El Garawany, Naughty Dog / Google. SIGGRAPH 2016: Advances in Real-Time Rendering in Games Course.
Rendering Tom Clancy’s Rainbow Six Siege. Jalal El Mansouri, Ubisoft Montréal / Haven Studios Inc.. GDC 2016.
Clustered Shading: Assigning Lights Using Conservative Rasterization in DirectX 12. Kevin Örtegren, Avalanche Studios / Epic Games. Emil Persson, Avalanche Studios / Elemental Games. GPU Pro 7.
Tiled Shading: Light Culling—Reaching the Speed of Light. Dmitry Zhdan, NVIDIA. GDC 2016.

2017

How Unreal Renders a Frame. Kostas Anagnostou, Radiant Worlds / Playground Games. Interplay of Light Blog.
Improved Culling for Tiled and Clustered Rendering. Michal Drobot, Infinity Ward. SIGGRAPH 2017: Advances in Real-Time Rendering in Games Course.
Cull That Cone! Improved Cone/Spotlight Visibility Tests for Tiled and Clustered Lighting. Bartłomiej Wroński, Santa Monica Studio / NVIDIA. Bart Wronski Blog.

2018

The Road Toward Unified Rendering with Unity’s High Definition Render Pipeline. Sébastien Lagarde, Unity Technologies. Evgenii Golubev, Unity Technologies. SIGGRAPH 2018: Advances in Real-Time Rendering in Games Course.

2019

Under the Hood of Shadow of the Tomb Raider. m0radin. m0rad.in Blog.

2020

Real-Time Samurai Cinema: Lighting, Atmosphere, and Tonemapping in Ghost of Tsushima. Jasmin Patry, Sucker Punch Productions. SIGGRAPH 2021: Advances in Real-Time Rendering in Games Course.
Clustered Shading Evolution in Granite. Hans-Kristian Arntzen, Arntzen Software AS. Maister’s Graphics Adventures Blog.
Graphics Study: Red Dead Redemption 2. Hüseyin, Our Machinery. imgeself Blog.
Hallucinations re: the rendering of Cyberpunk 2077. Angelo Pesce, Roblox. C0DE517E Blog.

2021

The Rendering of Jurassic World: Evolution. The Code Corsair. The Code Corsair Blog.
The Rendering of Mafia: Definitive Edition. The Code Corsair. The Code Corsair Blog.
Digital combat simulator: frame analysis. Thomas Poulet, Ubisoft Berlin / Huawei. Blog.

People by Company

Company People Reference
Snapshot Games Dean Calver Photo-realistic Deferred Lighting
Microsoft Shawn Hargreaves Deferred Shading
Deferred Shading
NVIDIA Mark Harris Deferred Shading
AMD Nicolas Thibieroz Deferred Shading with Multiple Render Targets
Deferred Shading with Multisampling Anti-Aliasing in DirectX 10
Deferred Shading Optimizations
Holger Gruen GSC Game World’s S.T.A.L.K.E.R: Clear Sky—A Showcase for Direct3D 10.0/1
Takahiro Harada Forward+: Bringing Deferred Lighting to the Next Level
A 2.5D Culling for Forward+
Forward+: A Step Toward Film-Style Shading in Real Time
Jay McKee Forward+: Bringing Deferred Lighting to the Next Level
Forward+: A Step Toward Film-Style Shading in Real Time
Jason Stewart Tiled Rendering Showdown: Forward++ vs. Deferred Rendering
Gareth Thomas Tiled Rendering Showdown: Forward++ vs. Deferred Rendering
4A Games Oleksandr Shyshkovtsov Deferred Shading in S.T.A.L.K.E.R.
Amazon Frank Puig Placeres Overcoming Deferred Shading Drawbacks
Facebook Rusty Koonce Deferred Shading in Tabula Rasa
Epic Games Michal Valient Deferred Rendering in Killzone 2
Ola Olsson Tiled Shading
Clustered Deferred and Forward Shading
Tiled and Clustered Forward Shading: Supporting Transparency and MSAA
Tiled Forward Shading
More Efficient Virtual Shadow Maps for Many Lights
Introduction to Real-Time Shading with Many Lights
Kevin Örtegren Clustered Shading: Assigning Lights Using Conservative Rasterization in DirectX 12
Apple Pål-Kristian Engstad The Technology of Uncharted: Drake's Fortune
Snap Inc. Dominic Filion StarCraft II: Effects & Techniques
Blizzard Entertainment Rob McNaughton StarCraft II: Effects & Techniques
Embark Studios Johan Andersson Parallel Graphics in Frostbite - Current Future
DirectX 11 Rendering in Battlefield 3
The Forge Wolfgang Engel Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer
Light Pre-Pass; Deferred Lighting: Latest Development
Walt Disney Animation Studios Mark Lee Pre-lighting in Resistance 2
Retired Christophe Balestra The Technology of Uncharted: Drake's Fortune
Igor A. Lobanchikov GSC Game World’s S.T.A.L.K.E.R: Clear Sky—A Showcase for Direct3D 10.0/1
Notch Matt Swoboda Deferred Lighting and Post Processing on PLAYSTATION 3
Situ Systems Damian Trebilco Light-Indexed Deferred Rendering
Intel Corporation Anton Kaplanyan CryENGINE 3: Reaching the Speed of Light
Andrew Lauritzen Deferred Rendering for Current and Future Rendering Pipelines
Marc Fauconneau Dufresne Forward Clustered Shading
POCU Pope Kim Rendering Tech of Space Marine
Relic Entertainment Daniel Barrero Rendering Tech of Space Marine
Black Rock Studio Balor Knight Screen-Space Classification for Efficient Deferred Shading
Matthew Ritchie Screen-Space Classification for Efficient Deferred Shading
George Parrish Screen-Space Classification for Efficient Deferred Shading
Chalmers University of Technology Ulf Assarsson Tiled Shading
Clustered Deferred and Forward Shading
Tiled and Clustered Forward Shading: Supporting Transparency and MSAA
Tiled Forward Shading
Markus Billeter Clustered Deferred and Forward Shading
Tiled and Clustered Forward Shading: Supporting Transparency and MSAA
Tiled Forward Shading
More Efficient Virtual Shadow Maps for Many Lights
Erik Sintorn More Efficient Virtual Shadow Maps for Many Lights
Parallel Space Inc. Andreas Papathanasis Dragon Age II DX11 Technology
Roblox John White More Performance! Five Rendering Ideas from Battlefield 3 and Need For Speed: The Run
Angelo Pesce Notes on Real-Time Renderers
SEED Colin Barré-Brisebois More Performance! Five Rendering Ideas from Battlefield 3 and Need For Speed: The Run
DGene Jason C. Yang Forward+: Bringing Deferred Lighting to the Next Level
Forward+: A Step Toward Film-Style Shading in Real Time
Volition Scott Kircher Lighting & Simplifying Saints Row: The Third
Ready at Dawn Matt Pettineo Light Indexed Deferred Rendering
Crafting a Next-Gen Material Pipeline for The Order: 1886
Rendering the Alternate History of The Order: 1886
David Neubelt Crafting a Next-Gen Material Pipeline for The Order: 1886
id Software Tiago Sousa The Rendering Technologies of Crysis 3
CryENGINE 3: Graphics Gems
Cloud Imperium Games Carsten Wenzel The Rendering Technologies of Crysis 3
Crytek Chris Raine The Rendering Technologies of Crysis 3
Nicolas Schulz Moving to the Next Generation—The Rendering Technology of Ryse
Unity Technologies Natalya Tatarchuk Destiny: From Mythic Science Fiction to Rendering in Real-Time
Bungie Chris Tchou Destiny: From Mythic Science Fiction to Rendering in Real-Time
Joe Venzon Destiny: From Mythic Science Fiction to Rendering in Real-Time
Sucker Punch Productions Adrian Bentley inFAMOUS Second Son Engine Postmortem
Insomniac Games Abdul Bezrati Real-Time Lighting via Light Linked List
Digital Foundry Richard Leadbetter The Making of Forza Horizon 2
Google Ramy El Garawany Deferred Lighting in Uncharted 4
Haven Studios Inc. Jalal El Mansouri Rendering Tom Clancy’s Rainbow Six Siege
Elemental Games Emil Persson Clustered Shading: Assigning Lights Using Conservative Rasterization in DirectX 12
Playground Games Kostas Anagnostou How Unreal Renders a Frame

@startuml
start
split
group Render Opaque Objects
    :Depth Buffer;
floating note left: Z Pre-Pass
floating note right: Sort Front-To-Back
    :Switch Off Depth Write;
    :Forward Rendering;
floating note left: Sort Front-To-Back
end group
split again
group Transparent Objects
    :Switch Off Depth Write;
    :Forward Rendering;
    floating note right: Sort Back-To-Front
end group
end split
stop
@enduml