
Forward Rendering

Simple pseudo-code:Ortiz18

// Shaders
Shader simpleShader

// Buffers:
Buffer display

for mesh in scene
    for light in scene
        display += simpleShader(mesh, light)
  • AdvantagesOlsson14
    • Single pass
    • Low storage overhead
      • Single frame buffer
    • Simple if there are only a few lights (e.g., the sun)
    • MSAA works
    • Varying shading models is easy
    • Transparency works
  • Issues:
    • Computing which lights affect each body consumes CPU time, worst case O(n × m) operationsKoonce07
    • Shaders often require more than one render pass to perform lighting, worst case O(n) render passes for n lightsKoonce07Ortiz18
    • Adding new lighting models or light types requires changing all effect source filesKoonce07
    • Shaders quickly encounter the instruction count limit of Shader Model 2.0Koonce07
    • OverdrawingOlsson14Ortiz18
    • Multi PassOlsson14
    • Light managementOlsson14
      • Expensive for many lights
    • Batching coupled with lightingOlsson14
    • Shader managementOlsson14
      • Permutations of #lights/type

Single PassValient07

  • For each object
    1. Find all lights affecting object
    2. Render all lighting and material in a single shader
  • Shader combinations explosion
    • Shader for each material vs light setup combination
  • All shadow maps have to be in memory
  • Wasted shader cycles
    • Invisible surfaces / overdraw
    • Triangles outside light influence

Multi-PassValient07

  • For each light
    1. For each object
    2. Add lighting from single light to frame buffer
  • Shader for each material and light type
  • Wasted shader cycles
    • Invisible surfaces / overdraw
    • Triangles outside light influence
    • Lots of repeated work
      • Full vertex shaders, texture filtering

Examples

Example 1: Jurassic World EvolutionCodeCorsair0321

  • Tiled Forward Lighting
    • 8 x 8 pixel tiles
  • Depth Prepass
  • Thin GBuffer
    • Normal + roughness
    • Motion Vectors

Deferred Shading

  1. For each objectValient07
    • Render surface properties into the G-Buffer
  2. For each light and lit pixelValient07
    1. Use G-Buffer to compute lighting
    2. Add result to frame buffer

Simple Z pre-pass pseudo-code:Ortiz18

// Buffers:
Buffer display
Buffer depthBuffer

// Shaders:
Shader simpleShader
Shader writeDepth

// Visibility
for mesh in scene
    if mesh.depth < depthBuffer.depth
        depthBuffer = writeDepth(mesh)

// Shading and lighting
for mesh in scene
    if mesh.depth == depthBuffer.depth
        for light in scene
            display += simpleShader(mesh, light)

Simple multi-pass deferred rendering pseudo-code:Ortiz18

// Buffers:
Buffer display
Buffer GBuffer

// Shaders:
Shader simpleShader
Shader writeShadingAttributes

// Visibility & materials
for mesh in scene
    if mesh.depth < GBuffer.depth
        GBuffer = writeShadingAttributes(mesh)

// Shading & lighting - multi-pass
for light in scene
    display += simpleShader(GBuffer, light)

Simple single-pass deferred rendering pseudo-code:Ortiz18

// Buffers:
Buffer display
Buffer GBuffer

// Shaders:
Shader manyLightShader
Shader writeShadingAttributes

// Visibility & materials
for mesh in scene
    if mesh.depth < GBuffer.depth
        GBuffer = writeShadingAttributes(mesh)

// Shading & lighting - single-pass
display = manyLightShader(GBuffer, scene.lights)

Builds an attribute buffer, also known as the G-Buffer.Thibieroz03Calver03

For each object:
    Render to MRT

For each light:
    Apply light as a 2D postprocess

Hargreaves04

Geometry Phase

  • The only phase that actually uses an object's mesh dataCalver03
  • Output is G-BufferCalver03
  • Fill the G-Buffer with all geometry (static, skinned, etc.)Valient07
  • Initialize light accumulation buffer with pre-baked lightValient07
    • Ambient, incandescence, constant specular
    • Lightmaps on static geometry
      • YUV color space, S3TC5 with Y in Alpha
      • Sun occlusion in B channel
      • Dynamic range [0…2]
    • Image based lighting on dynamic geometry
  • Some portions of the light equation that stay constant can be computed here and stored in the G-Buffer if necessary (ex. Fresnel)Calver03
  • Precision is important when creating G-BuffersFilionMcNaughton08
  • Don’t bother with any lighting while drawing scene geometryHargreaves04
  • Render to a “fat” framebuffer format, using MRT to store data (position, normal, etc.)Hargreaves04

G-Buffer Building PassThibieroz11

How to reduce export cost (PS to MRT)?

  • Render objects in front-to-back order
  • Use fewer RTs
    • Each RT adds to export cost
  • Avoid slow formats
    • NVIDIA
      • RT export cost is proportional to bit depth, except 16bpp, which is the same speed as 32bpp; sRGB formats are slower
      • Total export cost = Cost(RT0) + Cost(RT1) + …
    • AMD
      • Slow formats: R32G32B32A32, R32G32, R32, R32G32B32A32f, R32G32f, R16G16B16A16 (+ R32F, R16G16, R16 on older GPUs)
      • Total export cost = (Num RTs) × (Slowest RT)
  • Depth Buffer as Texture Input
    • No need to store depth into a color RT
  • Data Packing
    • Trade render target storage for a few extra ALU instructions
    • ALU cost is typically negligible compared to the performance saving of writing and fetching to/from fewer textures
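
For instance, a minimal packing sketch (illustrative, not from the talk) that trades a few ALU instructions for one fewer RT channel:

// Pack two [0, 1] values into a single uint channel (e.g. a R32_UINT target),
// spending a few ALU instructions to write and fetch one texture less.
uint PackUnorm16x2(float a, float b)
{
    return (uint(saturate(a) * 65535.0 + 0.5) << 16) | uint(saturate(b) * 65535.0 + 0.5);
}

float2 UnpackUnorm16x2(uint packed)
{
    return float2(packed >> 16, packed & 0xFFFF) / 65535.0;
}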

What to Store?

  • PositionCalver03
    • World Space?
    • Screen Space?
      • Costs a few extra instructions to undo, but the advantages in bandwidth and visual fidelity make it worthwhile
  • Distance
    • Instead of storing positions, store the distance of the position in view spacePlaceres06
      • G_Buffer.z = length(Input.PosInViewSpace)
    • Each 2D screen pixel corresponds to a ray from the eyepoint into the 3D worldHargreaves04
    • Retrieval:Placeres06
      • VS
        • out.vEyeToScreen = float3(Input.ScreenPos.x * ViewAspect, Input.ScreenPos.y, invTanHalfFOV);
      • PS
        • float3 PixelPos = normalize(Input.vEyeToScreen) * G_Buffer.z;
  • Normal
    • Because normal vectors are unit length, z = ±√(1 - x² - y²) (see the sketch after this list)Placeres06
      • If all the lighting is performed in view space, then the front-faced polygons are always going to have negative / positive Z components depending on the frame of reference used
    • Single byte is commonly used for a normal component when dealing with normal maps or other precomputed texturesPlaceres06
    • Store normals in A2R10G10B10Hargreaves04
    • Normals at 24bpp are too quantized, lighting is of a low qualityKaplanyan10
      • 24bpp = 256 x 256 x 256 cells = 16,777,216 values
      • However, we are storing the normalized normals:
        • 289,880 cells out of 16,777,216 cells ~ 1.73%
      • Use a normal cube map
  • Color
  • Material Attributes
    • Specular power, glow factor, occlusion term, etc.Placeres06
    • Such values normally differ by material, and can often be constrained to specific valuesPlaceres06
      • If the specular power term can be constrained to [1, 4, 10, 20], then only two bits are needed to store it
    • If a scene contains few materials, then a G-Buffer can hold a material id value to be used as an index to the material look-up tablePlaceres06
    • Could be palettized, then looked up in shader constants or a textureHargreaves04
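
A small sketch of the normal z-reconstruction mentioned above (the sign convention is an assumption that depends on the lighting space):

// Rebuild a unit normal from its two stored components: z = ±sqrt(1 - x² - y²).
float3 UnpackNormalXY(float2 nxy)
{
    float z = sqrt(saturate(1.0 - dot(nxy, nxy)));
    return float3(nxy, -z);    // e.g. view space, front-facing normals point toward the camera
}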

Examples

Example 1Calver03
G-Buffer:

| R | G | B | A |
| Pos.X | Pos.Y | Pos.Z | ID |
| Norm.X | Norm.Y | Norm.Z | MaterialIndex |
| Albedo.R | Albedo.G | Albedo.B | DiffuseTerm |
| SpecularEmissive.R | SpecularEmissive.G | SpecularEmissive.B | SpecularTerm |

Material Lookup Texture: Kspecblend, KAmb, KEmm
Example 2 : StarCraft IIFilionMcNaughton08
  • Only opaque objects with some notable exceptions
  • Only local lights
    • Point and spot lights with a defined extent in space
    • Global directional lights are forward rendered normally
  • Color components not affected by local lighting (emissive, environment maps, forward-lit color components)
  • Depth
    • Lighting, fog volumes, dynamic AO, smart displacement, DoF, projections, edge detection, thickness measurement
  • Per-pixel normal
    • Dynamic AO
  • Ambient occlusion term, if using static AO. Baked AO textures are ignored if screen-space AO is enabled
  • Unlit diffuse material color
    • Lighting
  • Unlit specular material color
    • Lighting
  • Due to heavy use of HDR, all these buffers will normally be four-channel 16-bit floating point formats
| MRTs | R | G | B | A |
| MRT 0 | UnlitAndEmissive.R | UnlitAndEmissive.G | UnlitAndEmissive.B | Unused |
| MRT 1 | Normal.X | Normal.Y | Normal.Z | Depth |
| MRT 2 | Albedo.R | Albedo.G | Albedo.B | AO |
| MRT 3 (optional) | Specular.R | Specular.G | Specular.B | Unused |
Example 3 : GDC 2004Hargreaves04
  • Position : A32B32G32R32F
  • Normal : A16B16G16R16F
  • Diffuse Color : A8R8G8B8
  • Material Parameters : A8R8G8B8
  • 256 bits per pixel

Optimized Version:

  • 128 bits per pixel
  • Depth : R32F
    • Can be optimized by using Depth buffer directly
  • Normal + scattering : A2R10G10B10
    • 2 bit subsurface scattering control
  • Diffuse color + emissive : A8R8G8B8
  • Other material parameters : A8R8G8B8
    • Specular intensity, specular power, occlusion factor, shadow etc.
Example 4 : Killzone 2Valient07
  • Depth
  • View-space normal
  • Specular intensity
  • Specular roughness / Power
  • Screen-space 2D motion vector
  • Albedo (texture color)

After deferred composition, post-processing (DoF, bloom, motion blur, colorize, ILR) is applied

| MRTs | R8 | G8 | B8 | A8 |
| DS | Depth 24bpp | Depth 24bpp | Depth 24bpp | Stencil |
| RT 0 | Lighting Accumulation R | Lighting Accumulation G | Lighting Accumulation B | Intensity |
| RT 1 | Normal.X (FP16) | Normal.X (FP16) | Normal.Y (FP16) | Normal.Y (FP16) |
| RT 2 | Motion Vectors X | Motion Vectors Y | Spec-Power | Spec-Intensity |
| RT 3 | Diffuse Albedo R | Diffuse Albedo G | Diffuse Albedo B | Sun-Occlusion |
  • MRT - 4 × RGBA8 + D24S8 (approx. 36 MB)
  • 720p with Quincunx MSAA
  • Position computed from depth buffer and pixel coordinates
  • Light accumulation - output buffer
    • Light is rendered as convex geometry
    • Point light - sphere
    • Spot light - cone
    • Sun - full-screen quad
    • For each light
      • Find and mark visible lit pixels
        • Mark pixels in front of the far light boundary
          • Render back-faces of light volume
          • Depth test GREATER-EQUAL
          • Write to stencil on depth pass
          • Skipped for very small distant lights
        • Find amount of lit pixels inside the volume
          • Start pixel query
          • Render front faces of light volume
          • Depth test LESS-EQUAL
          • Don’t write anything - only EQUAL stencil test
      • If light contributes to screen
        • Render shadow map
        • Shade lit pixels and add to framebuffer
          • Render front-faces of light volume
            • Depth test - LESS-EQUAL
            • Stencil test - EQUAL
            • Runs only on marked pixels inside light
          • Compute light equation
            • Read and unpack G-Buffer attributes
            • Calculate light vector, color, distance attenuation
            • Perform shadow map filtering
          • Add Phong lighting to frame buffer
  • Intensity - luminance of lighting accumulation
    • Scaled to range [0…2]
  • Normal.z = sqrt(1.0f - Normal.x² - Normal.y²)
  • Motion vectors - screen space
  • Specular power - stored as log2(original) / 10.5 (see the sketch below)
    • High range and still high precision for low shininess
  • Sun occlusion - pre-rendered static sun shadows
    • Mixed with real-time sun shadow for higher quality
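
As a sketch, the spec-power encoding above amounts to:

// Killzone 2-style specular power packing: stored = log2(power) / 10.5.
float PackSpecPower(float power)    { return log2(power) / 10.5; }
float UnpackSpecPower(float stored) { return exp2(stored * 10.5); }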

Analysis:

  • Pros:
    • Highly packed data structure
    • Many extra attributes
    • Allows MSAA with hardware support
  • Cons:
    • Limited output precision and dynamic range
      • Lighting accumulation in gamma space
      • Can use different color space (LogLuv)
    • Attribute packing and unpacking overhead

Light Optimization:

  • Determine light size on the screen
    • Approximation - angular size of light volume
  • If light is “very small”
    • Don’t do any stencil marking
    • Switch to non-shadow casting type
  • Shadows fade-out range
    • Artist defined light sizes at which:
      • Shadows start to fade out
      • Switch to non-shadow casting light
Example 5 : Mafia: Definitive EditionCodeCorsair0821
  • Depth Prepass
  • GBuffer Pass
| MRTs | R | G | B | A |
| RT 0 | Normal.X (R16F) | Normal.Y (G16F) | Normal.Z (B16F) | Roughness (A16F) |
| RT 1 | Albedo.R (R8) | Albedo.G (G8) | Albedo.B (B8) | Metalness (A8) |
| RT 2 | MotionVectors.X (R16U) | MotionVectors.Y (G16U) | MotionVectors.Z (B16U) | Encoded Vertex Normal (A16U) |
| RT 3 | Specular Intensity (R8) | 0.5 (G8) | Curvature or Thickness for SSS (B8) | SSS Profile (A8) |
| RT 4 | Emissive.R (R11F) | Emissive.G (G11F) | Emissive.B (B11F) | |
Example 6: Resident Evil 2Schreiner19
  • Depth prepass
  • G-Buffer pass: Geometry + Decals

| MRTs | R | G | B | A |
| RT 0 | Emissive.R (R11) | Emissive.G (G11) | Emissive.B (B10) | |
| RT 1 | Albedo.R (R8) | Albedo.G (G8) | Albedo.B (B8) | Metalness (A8) |
| RT 2 | Normal.R (R10) | Normal.G (G10) | Roughness (B10) | Miscellaneous (A2) |
| RT 3 | Baked AO (R16) | Velocity.Y (G16) | Velocity.Z (B16) | SSS (A16) |
Example 7: Batman: Arkham KnightMoradin20

| MRTs | R | G | B | A |
| RT 0 | Emissive.R (R11) | Emissive.G (G11) | Emissive.B (B10) | |
| RT 1 | Normal.X (R10) | Normal.Y (G10) | Normal.Z (B10) | (A2) |
| RT 2 | Albedo.R (R8) | Albedo.G (G8) | Albedo.B (B8) | Material Masking (A8) |
| RT 3 | Material Properties (R8) | Material Properties (G8) | Material Properties (B8) | Material Properties (A8) |

Lighting Phase

  • Apply lighting as a 2D postprocess using MRT buffers as inputHargreaves04

  • Full screen lightsCalver03
    • Use full screen quad
    • Ambient lighting, hemispherical lighting, sun directional lighting etc.
  • Shaped LightsCalver03
    • Using attenuation, we can cull the lights that no longer make any contribution per-pixel
    • Based on the attenuation model, calculate which pixels are beyond the ranges and try not to run the pixel shaders there
    • We can project the light volumes where the lighting contribution isn't zero
      • Each pixel must be hit once and once only
      • Projected geometry must not be clipped by the near/far planes
        • Remove the far plane by setting it to infinity
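
A minimal sketch of deriving such a light volume, assuming the classic constant/linear/quadratic attenuation model and a hypothetical cutoff threshold:

// Solve attenuation(d) = 1 / (Kc + Kl*d + Kq*d^2) = cutoff for the light's range d.
float LightRange(float Kc, float Kl, float Kq, float cutoff)
{
    float a = Kq * cutoff;
    float b = Kl * cutoff;
    float c = Kc * cutoff - 1.0;
    return (-b + sqrt(b * b - 4.0 * a * c)) / (2.0 * a);    // largest positive root
}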

Note that full-screen lighting alone is actually slower than forward rendering, which is why shaped light volumes matter.

  • Stenciling allows us to discard pixels that are covered by the light shapes but whose depth is too far behind the light to be affected by itThibieroz11
  • Early-Z will reduce cost of hidden lightsThibieroz11
  • Early stencil reduces cost of objects behind light that are not lit by itThibieroz11
    • Light shapes must be rendered again with color writes off before its normal lighting pass
    • Stencil operation is slightly different depending on whether the viewing camera sits inside the light shape (this has to be tested on the CPU)
    • If the camera is outside the light shape, only pass front facing parts
    • Else, color any of the light shape’s back facing pixels that failed the z-test
  • High-Level ManagerPlaceres06
    • Only sends the sources that influence the final image to the pipeline and executes the shaders only on the pixels that are influenced by the effect or light
    • Receives the list of all sources that should modulate the resulting render and all the information about them
      • One such piece of information: a bounding object that describes how strongly the source effect should be applied to the rendering equation
    1. Social Stage
      1. Execute visibility and occlusion algorithms to discard lights whose influence is not appreciable
        • Using the source-bounding object, the manager finds which lights and effects are not occluded by the scene geometry and are inside the view frustum
      2. Project the visible sources' bounding objects into screen space
      3. Combine similar sources that are too close in screen space or influence almost the same screen area
      4. Discard sources with a tiny contribution because their projected bounding object is too small or too far away
      5. Check that no more than a predefined number of sources affect each screen region; choose the biggest, strongest, and closest sources
    2. Individual Phase
      • Lights and special effects in general can be classified in two main groups: global sources and local sources
      • Global Sources
        • Sun, DoF, fog, etc.
        • For each global source:
          • Enable the appropriate shaders
          • Render a quad covering the screen
      • Local Sources
        • Only affect regions of the scene
        • Volumetric fog, heat haze, etc.
        • Only pixels inside the source bounding object have to be processed
        1. Select the appropriate level of detail
          • Farther and smaller sources don't have to be computed with the same quality as closer sources
        2. If dynamic branching is not supported, render a mask in the stencil buffer
          • If dynamic branching is supported, discard the pixel if its position is not in the influence area of the bounding object
        3. Enable and configure the source shaders
        4. Compute the minimum and maximum screen coordinates of the projected bounding object
        5. Enable the scissor test
          • Using the rectangle that surrounds the screen projection of the source bounding object
          • Quickly rejects pixels at the fragment level
        6. Enable the clipping planes
          • Further restrain the pixels that get into the fragment stage from the transform level
        7. Render a screen quad or the bounding object

Shading PassesThibieroz11

  • Add light contributions to accumulation buffer
  • Light Volumes vs Screen-aligned quads
  • Cull lights as needed before sending them to the GPU
  • Don’t render lights on skybox area

  • S.T.A.L.K.E.R. uses a hierarchical occlusion-culling system that utilizes both the CPU and the GPUShishkovtsov05
    • The coarsest test is sector-portal culling, followed by CPU-based occlusion culling (similar to the hierarchical z-buffer)

  • Most lighting in Ghost of Tsushima is deferred:Patry20
    • Lambertian diffuse
      • Optional translucency
      • Optional simplified asperity scattering (fuzziness)
    • Isotropic GGX specular
      • Optional colored specular via “metallic” channel
  • Other materials are forward lit, e.g. when using:Patry20
    • Anisotropic GGX specular
    • Full anisotropic asperity scattering BRDF
    • Subsurface scattering

Tiled Shading


Olsson14

The Bandwidth problem:

  • New type of overdraw
    • Light overdraw
  • N lights cover a certain pixel
    • N reads from the same G-Buffer location
  • Deferred shading:

for each light
    for each covered pixel
        read G-Buffer               // repeated G-Buffer reads
        compute shading
        read + write frame buffer   // repeated frame buffer reads and writes

  • Tiled deferred shading reorders the loops:

for each pixel
    read G-Buffer                   // single read
    for each affecting light
        compute shading
    write frame buffer              // single write
  • This scheme requires sequential access to lights for each pixel
    • Global list
      • inefficient
    • List per pixel
      • Lots of data
      • Slow construction
    • Share list in screen space tiles
      • e.g. 32 x 32 pixels
      • Simple construction
      • little storage
        • 1 list per 1024 pixels
      • coherent access
  • For each light,
    • Find screen space AABB to the light
    • Add index of the light to affected tiles
    • Perform min/max depth to cull away lights occupying empty space
  • Global Light List
  • Tile Light Index Lists
  • Light Grid
    • 2D grid
    • Each cell stores an offset and count that represent a range in a global buffer

ex)

Global Light List
L0 L1 L2 L3 L4 L5 L6 L7

Tile Light Index Lists
0 0 6 3 0 6 4 4

Light Grid (each cell stores an offset and a count into the index list)
offsets: 0 1 4 7 … 66 67 69
counts:  1 3 3 1 … 1 2 2
  • Light grid provides:
    • Access to light list for each pixel
    • List shared within tile
      • Low memory usage
      • Good access coherency
      • Not pixel exact
  • Light grid building
    • Is pretty quick
    • CPU for hundreds of lights
  1. Render Scene to G-Buffers
    • Store geometry attributes per pixel
      • G-Buffers
  2. Build Light Grid
  3. Full Screen Quad (or CUDA, or Compute Shaders, or SPUs)
    • For each pixel
      • Fetch G-Buffer Data
      • Find Tile
      • Loop over lights and accumulate shading
      • Write shading
  • Depth Range Optimization
    • Min/Max depth per tile
  • For each light
    • Test depth bounds
    • If the light volume does not intersect the depth bounds, then reject the light (see the sketch below)
  • However…
    • Tiles: 2D
    • Geometry Samples: 3D
      • Fragments / Pixels
    • Light density: 3D
    • View dependent
      • Unpredictable shading times
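
A sketch of the depth-bounds rejection test mentioned above, for a view-space sphere light against a tile's min/max depth (names are hypothetical):

// Reject a light whose bounding sphere lies entirely outside the tile's [minZ, maxZ].
bool IntersectsDepthBounds(float lightViewZ, float lightRadius, float minZ, float maxZ)
{
    return (lightViewZ + lightRadius >= minZ) && (lightViewZ - lightRadius <= maxZ);
}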

Simple single-pass deferred rendering using tiled shading pseudo-code:Ortiz18

// Buffers:
Buffer display
Buffer GBuffer
Buffer tileArray

// Shaders:
Shader manyLightShader
Shader writeShadingAttributes
CompShader lightInTile

// Visibility & materials
for mesh in scene
    if mesh.depth < GBuffer.depth
        GBuffer = writeShadingAttributes(mesh)

// Light culling
for tile in tileArray
    for light in scene
        if lightInTile(tile, light)
            tile += light

// Shading & lighting - single-pass
display = manyLightShader(GBuffer, tileArray)
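
As a concrete counterpart to the pseudo-code above, a minimal HLSL sketch of the per-pixel light-grid loop (all names are hypothetical; the grid stores an (offset, count) pair per tile as described earlier):

// Hypothetical resources following the offset/count light grid scheme.
struct PointLight { float3 posVS; float radius; float3 color; float pad; };

StructuredBuffer<PointLight> g_Lights       : register(t0); // global light list
Buffer<uint>                 g_LightIndices : register(t1); // tile light index lists
Buffer<uint2>                g_LightGrid    : register(t2); // per-tile (offset, count)

cbuffer TileInfo : register(b0) { uint g_NumTilesX; uint g_TileSize; }

float3 ShadeTiled(uint2 pixel, float3 posVS, float3 N, float3 albedo)
{
    uint2 tile  = pixel / g_TileSize;
    uint2 range = g_LightGrid[tile.y * g_NumTilesX + tile.x];   // x = offset, y = count

    float3 color = 0;
    for (uint i = 0; i < range.y; ++i)
    {
        PointLight light = g_Lights[g_LightIndices[range.x + i]];
        float3 L = light.posVS - posVS;
        float  d = length(L);
        float  att = saturate(1.0 - d / light.radius);          // simple linear falloff
        color += albedo * light.color * saturate(dot(N, L / d)) * att;
    }
    return color;
}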

Based on the light attenuation, we can calculate a light's influence to create a concrete volume of influence, represented with a different shape for each type of light source.Ortiz18

Based on these shapes, perform a collision detection between the light volumes and the tiles. Tiled forward shading is sometimes also referred to as Forward+.Ortiz18

However, tiles have no information regarding depth discontinuities, so lights can be assigned to tiles whose visible geometry they never actually affect.Ortiz18

Clustered Shading


Ortiz18

If tiled shading is a 2D approach, clustered shading is a 3D approach.

Simple single-pass deferred rendering using clustered shading pseudo-code:

// Buffers:
Buffer display
Buffer GBuffer
Buffer clusterArray

// Shaders:
Shader manyLightShader
Shader writeShadingAttributes
CompShader lightInCluster

// Visibility & materials
for mesh in scene
    if mesh.depth < GBuffer.depth
        GBuffer = writeShadingAttributes(mesh)

// Light culling
for cluster in clusterArray
    for light in scene
        if lightInCluster(cluster, light)
            cluster += light

// Shading & lighting - single-pass
display = manyLightShader(GBuffer, clusterArray)

Olsson14

  • Adding the 3rd dimension
    • Tile also in the depth direction = cluster
  • Bounded volume around samples
    • Shading cost ~ light density
  • New challenges
    • Many more (potential) clusters
      • Must find those actually used
    • Light assignment is no longer a pure 2D screen-space operation

Algorithm:

  • Rasterize G-Buffers
    • Forward: pre-z pass
  • Cluster assignment
    • Cluster key:
      • ck = (i, j, k) integer tuple
    • i, j = 2D tile id
      • Divide the fragment coordinate by the tile size
    • k ≈ log(view-space z) (see the sketch after this list)
    • (quantized normal)
  • Find unique clusters
    • Full screen pass
    • Flag used clusters
      • Read depth
      • Compute ck
      • Set cell in grid to one
    • Forward:
      • Geometry pass with side effect
    • Compact Non-Zero
      • Yields list of non-empty clusters
    • Parallel prefix sum
      • chag::pp (CUDA)
      • thrust (CUDA)
      • Compute Shader API?
  • Assign lights to clusters
    • Many clusters
    • Many lights
    • Hierarchical approach
      • Hierarchy over lights
        • 32-way tree (matches SIMD of GPU)
        • Dynamically rebuilt on GPU
      • Each cluster traverses light tree
        • BV tests
  • Shade view samples
    • Deferred: Full screen pass
    • Forward: Geometry pass
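
A sketch of the cluster-key computation from the list above (the exponential slicing constants are assumptions of this sketch, following the log-depth scheme):

// Compute the (i, j, k) cluster key for a fragment.
// sliceScale would be 1.0 / log(1.0 + 2.0 * tan(fovY * 0.5) / numSlicesY);
// nearZ is the camera near plane. Both are assumptions.
uint3 ClusterKey(uint2 pixel, float viewZ, uint tileSize, float nearZ, float sliceScale)
{
    uint i = pixel.x / tileSize;                        // 2D tile id
    uint j = pixel.y / tileSize;
    uint k = (uint)(log(viewZ / nearZ) * sliceScale);   // k ~ log(view-space z)
    return uint3(i, j, k);
}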

Light Volume Rendering


Thibieroz11

  • Render light volumes corresponding to light’s range
    • Fullscreen tri/quad (ambient or directional light)
    • Sphere (point light)
    • Cone / pyramid (spot light)
    • Custom shapes (level editor)
  • Tight fit between light coverage and processed area
  • 2D projection of volume defines the shaded area
  • Additively blend each light contribution to the accumulation buffer
  • Use early depth / stencil culling optimizations
  • Must be geometry-optimized
    • Index re-use (post-VS cache), sequential vertex reads (pre-VS cache)
    • Common oversight for algorithmically generated meshes (spheres, cones, etc.)
    • Especially important when depth/stencil-only rendering is used

Kaplanyan10

  • Use artist-defined clipping geometry: clip volumes
    • Mask the stencil in addition to light volume masking

Screen-Aligned QuadsThibieroz11

  • Rendering a camera-facing quad for each light
  • Simpler geometry but coarser rendering
  • Not as simple as it seems
    • Spheres (point lights) project to ellipses in post-perspective space!
    • Can cause problems when close to camera
  • Additively render each quad onto accumulation buffer
    • Process light equation as normal
  • Set quad Z coordinates to Min Z of light
    • Early Z will reject lights behind geometry with Z Mode = LESSEQUAL
  • Watch out for clipping issues
    • Need to clamp quad Z to near clip plane Z if: Light Min Z < Near Clip Plane Z < Light Max Z
  • Saves on geometry cost but not as accurate as volumes

Global Lights

  • Sunlight, fog, etc.Hargreaves04
  • Draw them as a fullscreen quadHargreaves04
  • Based on the G-Buffer, calculate the lighting in the pixel shader, then output the results to an intermediate lighting bufferHargreaves04
Sun Rendering
  • Full screen quadValient07
  • Stencil mark potentially lit pixelsValient07
    • Use only sun occlusion from G-Buffer
  • Run final shader on marked pixelsValient07
    • Approx. 50% of pixels skipped thanks to 1st pass
      • Also skybox / background
    • Simple directional light model
    • Shadow = min(RealTimeShadow, Occlusion)

Optimization:Shishkovtsov05

  • Single 2048 x 2048-pixel shadow map for most of the sun lighting, using a perspective shadow map-style projection transform to maximize area near the viewer
  • Skybox doesn’t need to be shaded (30 ~ 40% pixels)
  • Approximately 50% of pixels face away from the sun
  • Pixels that have an ambient occlusion term of zero cannot be reached by sun rays

Shishkovtsov05

Pass 0: 
Render full-screen quad only where 0x03==stencil count (where attributes are stored)    
If ((N dot L) * ambient_occlusion_term > 0)      
    discard fragment
Else
    color = 0, stencil = 0x01

Pass 1: 
Render full-screen quad only where 0x03==stencil count
Perform light accumulation / shading

Local Lights

  • Only render the affected pixelsHargreaves04
  • Project the light volume into screen-spaceHargreaves04
  • At author time, build a simple mesh that bounds the area affected by the lightHargreaves04
  • At run time, draw this in 3D space, running your lighting shader over each pixel that it coversHargreaves04

Convex Light Hulls

  • Depending on the position of the camera relative to the light's bounding mesh, the light has to be processed differently
  • It is important that each pixel only gets lit onceHargreaves04
  • If the camera is inside the volume, draw backfacesHargreaves04
  • If the light volume intersects the far clip plane, draw frontfacesHargreaves04
  • If the light volume intersects both near and far clip planes, your light is too big!Hargreaves04
  • Else, as long as the mesh is convex, backface culling can take care of thisHargreaves04

Stencil Light VolumesHargreaves04

  • Finding the intersection between scene geometry and an extruded shadow volume
  • Using stencil required changing renderstate for each light, which prevents batching up multiple lights into a single draw call

Light Volume Z TestsHargreaves04

  • Drawing light volume backfaces:
    • D3DCMP_GREATER to reject “floating in the air” portions of the light
  • Drawing front faces:
    • D3DCMP_LESS to reject “buried underground” light regions

Alpha Blending

  • BIG PROBLEM!Hargreaves04
  • Simply draw them after the deferred shadingHargreaves04
  • Use forward rendering passValient07
    • Single pass solution
      • Shader has four uberlights
      • No shadows
      • Per-vertex lighting version for particles
    • Lower resolution rendering available
      • Fill-rate intensive effects
      • Half and quarter screen size rendering
      • Half resolution rendering using MSAA HW

HDR

  • Render your scene to multiple 32 bit buffers, then use a 64 bit accumulation buffer during the light phaseHargreaves04

Antialiasing

Shishkovtsov05

struct v2p
{
    float4 tc0: TEXCOORD0; // Center    
    float4 tc1: TEXCOORD1; // Left Top      
    float4 tc2: TEXCOORD2; // Right Bottom    
    float4 tc3: TEXCOORD3; // Right Top    
    float4 tc4: TEXCOORD4; // Left Bottom      
    float4 tc5: TEXCOORD5; // Left / Right    
    float4 tc6: TEXCOORD6; // Top / Bottom  
};      
/////////////////////////////////////////////////////////////////////
uniform sampler2D s_normal;   // G-Buffer normals
uniform sampler2D s_position; // G-Buffer positions (z used as depth)
uniform sampler2D s_image;    // Composited image to smooth
uniform half4 e_barrier;  // x=norm(~.8f), y=depth(~.5f)
uniform half4 e_weights;  // x=norm, y=depth
uniform half4 e_kernel;   // x=norm, y=depth
/////////////////////////////////////////////////////////////////////

half4 main(v2p I) : COLOR  
{   
    // Normal discontinuity filter   
    half3 nc = tex2D(s_normal, I.tc0);   
    half4 nd;   
    nd.x = dot(nc, (half3)tex2D(s_normal, I.tc1));   
    nd.y = dot(nc, (half3)tex2D(s_normal, I.tc2));   
    nd.z = dot(nc, (half3)tex2D(s_normal, I.tc3));   
    nd.w = dot(nc, (half3)tex2D(s_normal, I.tc4));   
    nd -= e_barrier.x;   
    nd = step(0, nd);   
    half ne = saturate(dot(nd, e_weights.x));     
    
    // Opposite coords     
    float4 tc5r = I.tc5.wzyx;   
    float4 tc6r = I.tc6.wzyx;     
    
    // Depth filter : compute gradiental difference:   
    // (c-sample1)+(c-sample1_opposite)   
    half4 dc = tex2D(s_position, I.tc0);   
    half4 dd;   
    dd.x = (half)tex2D(s_position, I.tc1).z + (half)tex2D(s_position, I.tc2).z;   
    dd.y = (half)tex2D(s_position, I.tc3).z + (half)tex2D(s_position, I.tc4).z;   
    dd.z = (half)tex2D(s_position, I.tc5).z + (half)tex2D(s_position, tc5r).z;   
    dd.w = (half)tex2D(s_position, I.tc6).z + (half)tex2D(s_position, tc6r).z;   
    dd = abs(2 * dc.z - dd)- e_barrier.y;   
    dd = step(dd, 0);   
    half de = saturate(dot(dd, e_weights.y));     
    
    // Weight     
    half w = (1 - de * ne) * e_kernel.x; 
    // 0 - no aa, 1=full aa     
    // Smoothed color   
    // (a-c)*w + c = a*w + c(1-w)   
    float2 offset = I.tc0 * (1-w);   
    half4 s0 = tex2D(s_image, offset + I.tc1 * w);   
    half4 s1 = tex2D(s_image, offset + I.tc2 * w);   
    half4 s2 = tex2D(s_image, offset + I.tc3 * w);   
    half4 s3 = tex2D(s_image, offset + I.tc4 * w);   

    return (s0 + s1 + s2 + s3)/4.h;  
} 

Examples

Example 1: Phong modelThibieroz03

G-Buffer structure:

  1. Pixel position
    • World space position of the pixel
    • R16G16B16A16 FLOAT
  2. Pixel Normal vector
    • World space normalized normal vector
    • Choice:
      • Model space: Simplest.
      • Tangent space
    • R10G10B10A2 / R8G8B8A8 INT / FLOAT
  3. Pixel diffuse color
    • R8G8B8A8 FLOAT
  • Instruction reference:

  • m4x4 dst, src0, src1: multiplies a 4-component vector by a 4x4 matrix. src0 is a 4-component vector, src1 is a 4x4 matrix (dst = mul(src1, src0) (if src0 is column vector))
  • mov dst, src: move floating-point data between registers

PreLightPass.vs

;-------------------------------------------------------------------
; Constants specified by the app
; c0-c3 = Global transformation matrix (World*View*Projection)
; c4-c7 = World transformation matrix
;
; Vertex components
; v0 = Vertex Position
; v1, v2, v3 = Inverse of tangent space vectors
; v4 = 2D texture coordinates (model coordinates)
;-------------------------------------------------------------------
vs_2_0

dcl_position v0 ; Vertex position
dcl_binormal v1 ; Transposed binormal
dcl_tangent v2 ; Transposed tangent
dcl_normal v3 ; Transposed normal
dcl_texcoord v4 ; Texture coordinates for diffuse and normal map

; Vertex transformation
m4x4 oPos, v0, c0 ; Transform vertices by WVP matrix

; Model texture coordinates
mov oT0.xy, v4.xy ; Simply copy texture coordinates

; World space coordinates
m4x3 oT1.xyz, v0, c4 ; Transform vertices by world matrix (no w
; needed)

; Inverse (transpose) of tangent space vectors
mov oT2.xyz, v1
mov oT3.xyz, v2
mov oT4.xyz, v3 ; Pass in transposed tangent space vectors
HLSL version:

cbuffer ConstantBuffer : register(b0)
{
    float4x4 WVP;   // Global transformation matrix
    float4x4 World; // World transformation matrix
}

struct VSInput
{
    float4 Position : POSITION;     // Vertex position
    float3 Binormal : BINORMAL;     // Inverse of tangent space vectors >> Transposed binormal
    float3 Tangent : TANGENT;       // Inverse of tangent space vectors >> Transposed tangent
    float4 Normal : NORMAL;         // Inverse of tangent space vectors >> Transposed normal
    float2 TexCoord : TEXCOORD0;    // 2D texture coordinates for diffuse and normal map (model coordinates)
};

struct VSOutput
{
    float4 Position : SV_Position;
    float3 WorldPos : WorldPos;
    float3 Binormal : Binormal;
    float3 Tangent : Tangent;
    float3 Normal : Normal;
    float2 TexCoord : TexCoord0;
};

VSOutput main(VSInput vsInput)
{
    VSOutput vsOutput;

    // Vertex transformation
    vsOutput.Position = mul(WVP, vsInput.Position); // Transform vertices by WVP matrix

    // Model texture coordinates
    vsOutput.TexCoord = vsInput.TexCoord;   // Simply copy texture coordinates

    // World space coordinates
    vsOutput.WorldPos = mul(World, vsInput.Position).xyz;   // Transform vertices by world matrix (no w needed in the output)

    // Inverse(transpose) of tangent space vectors
    vsOutput.Binormal = vsInput.Binormal;
    vsOutput.Tangent = vsInput.Tangent;
    vsOutput.Normal = vsInput.Normal;   // Pass in transposed tangent space vectors

    return vsOutput;
}
  • def dst, fValue1, fValue2, fValue3, fValue4: Defines pixel shader floating-point constants. fValue1 to fValue4 are floating-point values
  • texld dst, src0, src1: Samples a texture at a particular sampler, using the provided texture coordinates. src0 is a source register providing the texture coordinates; src1 identifies the sampler (s#), where # specifies which texture sampler number to sample.
  • mad dst, src0, src1, src2: Multiply and add instruction. Sets the destination register to (src0 * src1) + src2.
  • dp3 dst, src0, src1: Computes the three-component dot product of the source registers.

PreLightPass.ps

;-------------------------------------------------------------------
; Constants specified by the app
; c0-c3 = World transformation matrix for model
;-------------------------------------------------------------------
ps_2_0

; Samplers
dcl_2d s0 ; Diffuse map
dcl_2d s1 ; Normal map

; Texture coordinates
dcl t0.xy ; Texture coordinates for diffuse and normal map
dcl t1.xyz ; World-space position
dcl t2.xyz ; Binormal
dcl t3.xyz ; Tangent
dcl t4.xyz ; Normal (Transposed tangent space vectors)

; Constants
def c30, 1.0, 2.0, 0.0, 0.0
def c31, 0.2, 0.5, 1.0, 1.0

; Texture sampling
texld r2, t0, s1 ; r2 = Normal vector from normal map
texld r3, t0, s0 ; r3 = Color from diffuse map

; Store world-space coordinates into MRT#0
mov oC0, t1 ; Store pixel position in MRT#0

; Convert normal to signed vector
mad r2, r2, c30.g, -c30.r ; r2 = 2*(r2 - 0.5)

; Transform normal vector from tangent space to model space
dp3 r4.x, r2, t2
dp3 r4.y, r2, t3
dp3 r4.z, r2, t4 ; r4.xyz = model space normal

; Transform model space normal vector to world space. Note that only
; the rotation part of the world matrix is needed.
; This step is not required for static models if their
; original model space orientation matches their orientation
; in world space. This would save 3 instructions.
m4x3 r1.xyz, r4, c0

; Convert normal vector to fixed point
; This is not required if the destination MRT is float or signed
mad r1, r1, c31.g, c31.g ; r1 = 0.5*r1 + 0.5

; Store world-space normal into MRT#1
mov oC1, r1

; Store diffuse color into MRT#2
mov oC2, r3
HLSL version:

cbuffer ConstantBuffer : register(b0)
{
    float4x4 World;
}

// Samplers
Texture2D<float3> g_DiffuseMap  : register(t0); // Diffuse map
Texture2D<float3> g_NormalMap   : register(t1);  // Normal map
SamplerState g_Sampler          : register(s0);

struct PSInput
{
    float4 Position : SV_Position;
    float3 WorldPos : WorldPos;     // World-space position
    float3 Binormal : Binormal;     // Binormal
    float3 Tangent : Tangent;       // Tangent
    float3 Normal : Normal;         // Normal (Transposed tangent space vectors)
    float2 TexCoord : TexCoord0;    // Texture Coordinates for diffuse and normal map
};

struct Mrt
{
    float3 WorldPos;
    float3 WorldNormal;
    float3 Albedo;
}

Mrt main(PSInput psInput)
{
    Mrt mrt;

    // Texture sampling
    float3 normal = g_NormalMap.Sample(g_Sampler, psInput.TexCoord);    // Normal vector from normal map
    float3 albedo = g_DiffuseMap.Sample(g_Sampler, psInput.TexCoord);   // Color from diffuse map

    // Store world-space coordinates into MRT#0
    mrt.WorldPos = psInput.WorldPos;

    // Convert normal to signed vector if needed
    // normal = 2.0 * (normal - 0.5);

    // Transform normal vector from tangent space to model space
    float3 modelNormal = float3(dot(normal, psInput.Binormal), dot(normal, psInput.Tangent), dot(normal, psInput.Normal));

    // Transform model space normal vector to world space. Note that only the rotation part of the world matrix is needed.
    // This step is not required for static models if their original model space orientation matches their orientation in world space.
    // This would save 3 instructions.
    float3 worldNormal = mul((float3x3)World, modelNormal);

    // Convert normal vector to fixed point
    // This is not required if the destination MRT is float or signed
    // worldNormal = 0.5 * worldNormal + 0.5;

    // Store world-space normal into MRT#1
    mrt.WorldNormal = worldNormal;

    // Store diffuse color into MRT#2
    mrt.Albedo = albedo;

    return mrt;
}
  • rcp dst, src: Computes the reciprocal of the source scalar

LightPass.ps

;-------------------------------------------------------------------
; Constants specified by the app
; c0 : light position in world space
; c8 : camera position in world space
; c22: c22.a = 1/(light max range), c22.rgb = 1.0f
;-------------------------------------------------------------------
ps_2_0

; Samplers
dcl_2d s0 ; MRT#0 = Pixel position in world space
dcl_2d s1 ; MRT#1 = Pixel normal vector
dcl_2d s2 ; MRT#2 = Pixel diffuse color
dcl_2d s3 ; Falloff texture
dcl_cube s4 ; Cube normalization texture map

; Texture coordinates
dcl t0.xy ; Quad screen-space texture coordinates

; Constants
def c20, 0.5, 2.0, -2.0, 1.0
def c21, 8.0, -0.75, 4.0, 0.0

; Retrieve property buffer data from MRT textures
texld r0, t0, s0 ; r0.xyz = Pixel world space position
texld r2, t0, s1 ; r2.xyz = Pixel normal vector
texld r3, t0, s2 ; r3.rgb = Pixel color

; Convert normal to signed vector
; This is not required if the normal vector was stored in a signed
; or float format
mad r2, r2, c20.y, -c20.w ; r2 = 2*r2 - 1

; Calculate pixel-to-light vector
sub r1.xyz, c0, r0 ; r1 = Lpos - Vpos
mov r1.w, c20.w ; Set r1.w to 1.0
nrm r4, r1 ; Normalize vector (r4.w = 1.0/distance)

; Compute diffuse intensity
dp3 r5.w, r4, r2 ; r5.w = (N.L)

; FallOff
rcp r6, r4.w ; r6 = 1/(1/distance) = distance
mul r6, r6, c22.a ; Divide by light max range
texld r6, r6, s3 ; Sample falloff texture

; Compute halfway vector
sub r1.xyz, c8, r0 ; Compute view vector V (pixel to camera)
texld r1, r1, s4 ; Normalized vector with cube map
mad r1, r1, c20.y, -c20.w ; Convert vector to signed format
add r1, r1, r4 ; Add view and light vector
texld r1, r1, s4 ; Normalize half angle vector with cube map
mad r1, r1, c20.y, -c20.w ; Convert to signed format

; Compute specular intensity
dp3_sat r1.w, r1, r2 ; r1.w = sat(H.N)
pow r1.w, r1.w, c21.r ; r1.w = (H.N)^8
; Set specular to 0 if pixel normal is not facing the light
cmp r1.w, r5.w, r1.w, c21.w ; r1.w = ( (N.L)>=0 ) ? (H.N)^8 : 0

; Output final color
mad r0, r3, r5.w, r1.w ; Modulate diffuse color and diffuse

; intensity and add specular
mul r0, r0, r6 ; Modulate with falloff
mov oC0, r0 ; Output final color
HLSL version:

cbuffer ConstantBuffer : register(b0)
{
    float4 LightPos;        // Light position in world space
    float4 CameraPos;       // Camera position in world space
    float LightAttenuation; // 1.0 / (light max range)
}

// Samplers
Texture2D<float3> g_WorldPos            : register(t0);
Texture2D<float3> g_WorldNormal         : register(t1);
Texture2D<float3> g_Albedo              : register(t2);
Texture2D<float3> g_Falloff             : register(t3);
TextureCube<float4> g_Normalization     : register(t4);
SamplerState g_DefaultSampler           : register(s0);
SamplerState g_CubeSampler              : register(s1);

// Texture coordinates
struct PSInput
{
    float4 Position : SV_Position;
    float2 TexCoord : TexCoord0;    // Quad screen-space texture coordinates
};

float4 main(PSInput psInput) : SV_Target
{
    // Retrieve property buffer data from MRT textures
    float3 worldPos = g_WorldPos.Sample(g_DefaultSampler, psInput.TexCoord);        // pixel world space position
    float3 worldNormal = g_WorldNormal.Sample(g_DefaultSampler, psInput.TexCoord);  // pixel normal vector
    float3 albedo = g_Albedo.Sample(g_DefaultSampler, psInput.TexCoord);            // pixel color

    // Convert normal to signed vector
    // This is not required if the normal was stored in a signed or float format
    // worldNormal = 2.0 * worldNormal - 1.0;

    // Calculate pixel-to-light vector
    float3 lightVec = LightPos.xyz - worldPos;
    float lightDist = length(lightVec);
    float3 lightDir = lightVec / lightDist;     // The asm version keeps 1.0/distance in w instead

    // Compute diffuse intensity
    float diffuseIntensity = dot(lightDir, worldNormal);    // (N dot L)

    // Falloff
    float fallOffCoord = lightDist * LightAttenuation;      // Distance divided by light max range
    float fallOff = g_Falloff.Sample(g_DefaultSampler, float2(fallOffCoord, fallOffCoord)).x;   // Sample falloff texture

    // Compute halfway vector
    float3 viewDir = normalize(CameraPos.xyz - worldPos);   // View vector V (pixel to camera); the asm version normalizes via cube map
    float3 halfVec = normalize(viewDir + lightDir);         // H = normalize(V + L)

    // Compute specular intensity
    float specularIntensity = pow(saturate(dot(halfVec, worldNormal)), 8.0);    // (H dot N)^8

    // Set specular to 0 if pixel normal is not facing the light
    specularIntensity = (diffuseIntensity >= 0.0) ? specularIntensity : 0.0;

    // Output final color: modulate diffuse color by diffuse intensity, add specular,
    // then modulate everything with the falloff
    float3 color = (albedo * diffuseIntensity + specularIntensity) * fallOff;
    return float4(color, 1.0);
}

Example 2: AMD GDC 2011Thibieroz11

  1. Light Pre-pass
    1. Render Normals
      • 1st geometry pass results in a normal (and depth) buffer
        • Uses a single color RT
        • No MRT required
    2. Lighting Accumulation
      • Perform all lighting calculation into light buffer
        • Use normal and depth buffer as input textures
        • Render geometries enclosing light area
        • Write (LightColor * N.L * Attenuation) in RGB, specular in A
    3. Combine lighting with materials
      • 2nd geometry pass using light buffer as input
        • Fetch geometry material
        • Combine with light data
          • Advantages
            • One material fetch per pixel regardless of number of lights
          • Disadvantages
            • Two scene geometry passes required
            • Unique lighting model
  2. Light Volume Rendering
    1. Early Z culling Optimizations
      • When camera is inside the light volume
        • Set Z Mode = GREATER
        • Render volume’s back faces
      • Only samples fully inside the volume get shaded
        • Optimal use of early Z culling
        • No need for stencil
        • High efficiency
      • Previous optimization does not work if camera is outside volume
        • Back faces also pass the Z=GREATER test for objects in front of volume
      • Alternatively, when camera is outside the light volume:
        • Set Z Mode = LESSEQUAL
        • Render volume’s front faces
        • But generates wasted processing for objects behind the volume
      • Stencil can be used to mark samples inside the light volume
      • Render volume with stencil-only pass:
        • Clear stencil to 0
        • Set Z Mode = LESSEQUAL
        • If depth test fails:
          • Increment stencil for back faces
          • Decrement stencil for front faces
      • Render some geometry where stencil != 0

Example 3: AnKi 3D EngineCharitos19

  • Clustered Deferred Shading
    • Bin various primitives into a number of clusters
      • Happens in the CPU
      • Point lights
      • Spot lights
      • Decals
      • Fog density probes
      • Reflection probes
      • Diffuse GI probes

Passes:

  • Depth pre-pass
  • GBuffer pass
| MRTs | R | G | B | A |
| RT 0 | Albedo.R (R8U) | Albedo.G (G8U) | Albedo.B (B8U) | Subsurface term (A8U) |
| RT 1 | Roughness (R8U) | Metallic (G8U) | Fresnel term (B8U) | Emission scaling (A8U) |
| RT 2 | WS Normal.X (R10U_PACK32) | WS Normal.Y (G10U_PACK32) | Emission (B10U_PACK32) | Sign of the Normal's Z (A2U_PACK32) |
| RT 3 | Velocity (R16S) | Velocity (G16S) | | |
  • Volumetric lighting accumulation
    • Compute job that calculates the lighting for each cell and the fog density
    • Every compute invocation operates on a single cluster
      • In each invocation, a random point inside the cluster is chosen and the lighting of each light is accumulated to compute the final result
    • Final result is stored into a 3D texture
  • Light shading, forward shading, and fog
    • Light shading
      • Fullscreen (triangle) draw that unpacks the G-Buffer and iterates over the lights of the affected cluster

Example 4: The SurgeHammer18

  • Clustered Deferred Rendering
  • Volumetric Lighting
| MRTs | R | G | B | A |
| RT 0 | Albedo.R (R8) | Albedo.G (G8) | Albedo.B (B8) | Material ID |
| RT 1 | VS Normal.X (R10) | VS Normal.Y (G10) | VS Normal.Z (B10) | (A2) |
| RT 2 | Roughness (R8) | Metalness (G8) | Occlusion (B8) | (shared) (A8) |
| RT 3 | Motion Vectors X (R16) | Motion Vectors Y (G16) | | |

PBR:

  • Material ID indexes directly into a StructuredBuffer to query per-material data (see the sketch below)
    • Save G-Buffer space
  • (shared) - per-pixel, context dependent
    • mutually exclusive material data
    • based on per-material data
    • Emissive mask
      • Defines whether or not to interpret albedo as emissive
      • Emissive combined in final “combine” pass
      • Effectively saves dedicated emissive channel
    • Translucency
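
A minimal sketch of the material-ID indirection described above (struct fields and names are assumptions):

// Per-material constants live in a StructuredBuffer; the G-Buffer only stores an 8-bit ID.
struct MaterialData
{
    float emissiveMask;     // interpret albedo as emissive?
    float translucency;     // example of (shared) context-dependent data
    float2 pad;
};

StructuredBuffer<MaterialData> g_Materials : register(t4);

MaterialData FetchMaterial(float idUnorm)   // 8-bit material ID fetched from the G-Buffer
{
    return g_Materials[(uint)(idUnorm * 255.0 + 0.5)];
}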

Clustered Deferred Rendering:

  • Switch from rasterization-based light volume rendering to full (async) compute-based approach
    • Low CPU overhead
      • Light culling runs entirely on GPU
      • Filling a buffer with light infos instead of dispatching thousands of drawcalls
    • Advantages on GPU
      • No need to fetch G-Buffer for every light
    • Async Compute: Lighting runs in parallel to shadow rendering (at least on consoles)
    • But: many more optimizations necessary to get better performance
  • Could render environment probes in the same pass
    • Environment probes are still clustered but rendered in a separate (pixel shader) pass together with SSR
  • Divide view frustum into a 3D grid
    • 16 x 8 x 24
  • Culling: Assign lights to grid cells
    • Upload light culling info to GPU (StructuredBuffer with Position, AABB, etc.)
    • Create list of light indices for each cell (single large uint buffer)
  • Dispatching lighting compute shader
    • In fact we dispatch twice: unshadowed and shadowed lights
    • Unshadowed can run in parallel with shadowmap generation
  • Can use cluster information also for forward rendering
    • We do this for our lit transparent objects
    • Simply compute grid cell index for a position and query light list

Example 5: Rainbow Six SiegeElMansouri16

  • Single Frame:
    • GPU:
      • Graphics Pipeline + Async Pipeline
      • Average 5ms spent on geometry rendering
        • Heavy use of culling
        • Shadow caching
      • Average 5ms spent on lighting (SSR included)
        • Checkerboard rendering helps
        • SSAO & SSR ray trace done in async
    • CPU:
      • 10ms avg on the critical path
        • All passes and tasks able to fork and join to minimize critical path
        • Shadow caching
      • Max 4ms linear spent on opaque pass
        • Material based draw call system
  • Opaque rendering:
    1. First person rendering
      • GBuffer, Depth buffer
    2. 400 best occluders to depth buffer
    3. Generate Hi-Z Buffer
    4. Opaque culling & rendering
      • Hierarchical z-buffer used for culling
  • Lighting
    • Uses a clustered structure on the frustum:
      • 32 x 32 pixels based tile
      • Z exponential distribution
    • Hierarchical culling of light volume to fill the structure
    • Local cubemaps regarded as lights
    • Shadows, cubemaps and gobos reside in textures arrays
      • Deferred uses pre-resolved shadow texture array
      • Forward uses shadows depth buffer array
| MRTs | R | G | B | A |
| RT 0 | Emissive.R (R11) | Emissive.G (G11) | Emissive.B (B10) | |
| RT 1 | Normal.X (R10) | Normal.Y (G10) | Normal.Z (B10) | (A2) |
| RT 2 | Albedo.R (R8) | Albedo.G (G8) | Albedo.B (B8) | Material Masking (A8) |
| RT 3 | Material Properties (R8) | Material Properties (G8) | Material Properties (B8) | Material Properties (A8) |

Example 6: Uncharted 4ElGarawany16

  • 16 bits-per-pixel unsigned buffers
GBuffer 0 (each 16-bit channel packs two values):
| R | G | B | Spec |
| Normal.X | Normal.Y | iblUseParent, Normal Extra | Roughness |

GBuffer 1:
| Ambient Translucency | Sun Shadow High | Spec Occlusion | Heightmap Shadowing |
| Sun Shadow Low | Metallic | Dominant Direction X | Dominant Direction Y |
| AO | Extra Material Mask | Sheen | Thin Wall Translucency |

GBuffer 2 (optional): Used by more complicated materials
  • Using a material ID (bitmask) texture
    • 12 bits compressed into 8 bits

Example 7: CryENGINE 3Kaplanyan10

  • Indirect Lighting
    • Ambient term
    • Tagged ambient areas
    • Local cubemaps
    • Local deferred lights
    • Diffuse Indirect Lighting from LPVs
    • SSAO
  • Direct lighting
    • All direct light sources, with and without shadows
  • G-Buffer: the smaller the better
    • Minimal G-Buffer layout: 64 bits / pixel
      • RT0: Depth 24bpp + Stencil 8bpp
      • RT1: Normals 24bpp + Glossiness 8bpp
        • Glossiness is non-deferrable
          • Required at lighting accumulation pass
          • Specular is non-accumulative otherwise
    • Stencil to mark objects in lighting groups
      • Portals / indoors
      • Custom environment reflections
      • Different ambient / indirect lighting
    • Problems:
      • Only Phong BRDF (normal + glossiness)
        • No aniso materials
      • Normals at 24bpp are too quantized
        • Lighting is banded / of low quality
        • Use baked normal map

Example 8: UnityGolubev18

  • Optimize with both Tile and Cluster approaches
  • Goal
    • Focus on removing false positives
      • Ex: Narrow shadow casting spot lights
    • False positives are more expensive in lighting pass
      • Light culling executes async during shadow rendering
      • Deferred lighting pass is not running async
      • Move cost where it can be hidden
      • High register pressure in lighting pass
  • Hierarchical approach
    1. Find screen-space AABB for each visible light
    2. Big tile 64 x 64 prepass
      • Coarse intersection test
      • Use AABBs for initial early out (2D no depth)
      • Follow up with exact intersection test between tile and convex hull
      • Use bounding spheres as an extra testing criteria
    3. Build Tile or Cluster Light list
      • Narrow intersection test

Forward+HaradaMcKeeYang13

  • Z prepass
    • Reduces the pixel overdraws of the final shading step
  • Light culling
    • Calculates the list of lights affecting a pixel
  • Final shading
    • Shades the entire surface
    • All lights in a scene have to be accessible from shaders, rather than binding some subset of lights for each object

Light Culling

  • Split the screen into tiles
  • Light indices are calculated on a per-tile basis

Implementation and Optimization

  • Computation is done on a by-tile basis
    • Execute a thread group for a tile
  • CS for light culling is executed as a two-dimensional (2D) work group

Pseudo-code:

  • GET_GROUP_IDX: thread group index in X direction (SV_GroupID)
  • GET_GROUP_IDY: thread group index in Y direction (SV_GroupID)
  • GET_GLOBAL_IDX: global thread index in X direction (SV_DispatchThreadID)
  • GET_GLOBAL_IDY: global thread index in Y direction (SV_DispatchThreadID)
  • GET_LOCAL_IDX: local thread index in X direction (SV_GroupThreadID)
  • GET_LOCAL_IDY: local thread index in Y direction (SV_GroupThreadID)

  • The first step is computation of the frustum of a tile in view space
  • Calculate the view-space coordinates of the four corner points of the tile
float4 frustum[4];
{   // construct frustum
    float4 v[4];
    v[0] = projToView(8 * GET_GROUP_IDX, 8 * GET_GROUP_IDY, 1.f);
    v[1] = projToView(8 * (GET_GROUP_IDX + 1), 8 * GET_GROUP_IDY, 1.f);
    v[2] = projToView(8 * (GET_GROUP_IDX + 1), 8 * (GET_GROUP_IDY + 1), 1.f);
    v[3] = projToView(8 * GET_GROUP_IDX, 8 * (GET_GROUP_IDY + 1), 1.f);
    float4 o = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int i = 0; i < 4; ++i)
    {
        frustum[i] = createEquation(o, v[i], v[(i + 1) & 3]);
    }
}
  • projToView() is a function that takes screen-space pixel indices and depth value and returns coordinates in view space
  • createEquation() creates a plane equation from three vertex positions
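
Minimal sketches of these two helpers (assumed implementations; the inverse projection matrix and viewport size are hypothetical constants):

cbuffer CullingConstants
{
    float4x4 g_InvProjection;   // inverse of the camera projection matrix
    float2   g_ViewportSize;    // render target dimensions in pixels
};

// Screen-space pixel coordinates + depth -> view-space position.
float4 projToView(float x, float y, float depth)
{
    float4 ndc = float4(2.0 * x / g_ViewportSize.x - 1.0,
                        1.0 - 2.0 * y / g_ViewportSize.y,
                        depth, 1.0);
    float4 view = mul(g_InvProjection, ndc);
    return view / view.w;
}

// Plane equation (n, d) through three points, with dot(n, p) + d = 0.
float4 createEquation(float4 a, float4 b, float4 c)
{
    float3 n = normalize(cross(b.xyz - a.xyz, c.xyz - a.xyz));
    return float4(n, -dot(n, a.xyz));
}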
  • We can clip the frustum by using the max/min depth values from the depth buffer created in the depth prepass.
float depth = depthIn.Load(uint3(GET_GLOBAL_IDX, GET_GLOBAL_IDY, 0));

float4 viewPos = projToView(GET_GLOBAL_IDX, GET_GLOBAL_IDY, depth);

int lIdx = GET_LOCAL_IDX + GET_LOCAL_IDY * 8;
{   // Calculate bound
    if (lIdx == 0)  // Initialize
    {
        ldsZMax = 0;
        ldsZMin = 0xffffffff;
    }
    GroupMemoryBarrierWithGroupSync();
    u32 z = asuint(viewPos.z);
    if (depth != 1.f)
    {
        AtomMax(ldsZMax, z);
        AtomMin(ldsZMin, z);
    }
    GroupMemoryBarrierWithGroupSync();
    maxZ = asfloat(ldsZMax);
    minZ = asfloat(ldsZMin);
}
  • ldsZMax, ldsZMin store max/min z coordinates, which are bounds of a frustum in the z direction, in shared memory

MiniEngine version:

// Read all depth values for this tile and compute the tile min and max values
for (uint dx = GTid.x; dx < WORK_GROUP_SIZE_X; dx += 8)
{
    for (uint dy = GTid.y; dy < WORK_GROUP_SIZE_Y; dy += 8)
    {
        uint2 DTid = Gid * uint2(WORK_GROUP_SIZE_X, WORK_GROUP_SIZE_Y) + uint2(dx, dy);

        // If pixel coordinates are in bounds...
        if (DTid.x < ViewportWidth && DTid.y < ViewportHeight)
        {
            // Load and compare depth
            uint depthUInt = asuint(depthTex[DTid.xy]);
            InterlockedMin(minDepthUInt, depthUInt);
            InterlockedMax(maxDepthUInt, depthUInt);
        }
    }
}

GroupMemoryBarrierWithGroupSync();
  • We are using 8 x 8 for the size of a thread group
    • 64 lights are processed in parallel
for (int i = 0; i < nBodies; i += 64)
{
    int il = lIdx + i;
    if (il < nBodies)
    {
        if (overlaps(frustum, gLightGeometry[il]))
        {
            appendLightToList(il);
        }
    }
}
  • overlaps(): checks light-geometry overlap against a frustum using the separating axis theorem
    • If a light is overlapping the frustum, the light index is stored to the list of the overlapping lights in appendLightToList()
  • A tile is computed by a thread group, and so we can use shared memory for the first level storage
  • Light index storage and counter for the storage is allocated as follows:
groupshared u32 ldsLightIdx[LIGHT_CAPACITY];
groupshared u32 ldsLightIdxCounter;
void appendLightToList(int i)
{
    u32 dstIdx = 0;
    InterlockedAdd(ldsLightIdxCounter, 1, dstIdx);
    if (dstIdx < LIGHT_CAPACITY)
    {
        ldsLightIdx[dstIdx] = i;
    }
}

MiniEngine version:

groupshared uint tileLightCountSphere;
groupshared uint tileLightCountCone;
groupshared uint tileLightCountConeShadowed;

groupshared uint tileLightIndicesSphere[MAX_LIGHTS];
groupshared uint tileLightIndicesCone[MAX_LIGHTS];
groupshared uint tileLightIndicesConeShadowed[MAX_LIGHTS];

// find set of lights that overlap this tile
for (uint lightIndex = GI; lightIndex < MAX_LIGHTS; lightIndex += 64)
{
    LightData lightData = lightBuffer[lightIndex];
    float3 lightWorldPos = lightData.pos;
    float lightCullRadius = sqrt(lightData.radiusSq);

    bool overlapping = true;
    for (int p = 0; p < 6; p++)
    {
        float d = dot(lightWorldPos, frustumPlanes[p].xyz) + frustumPlanes[p].w;
        if (d < -lightCullRadius)
        {
            overlapping = false;
        }
    }
    
    if (!overlapping)
        continue;

    uint slot;

    switch (lightData.type)
    {
    case 0: // sphere
        InterlockedAdd(tileLightCountSphere, 1, slot);
        tileLightIndicesSphere[slot] = lightIndex;
        break;

    case 1: // cone
        InterlockedAdd(tileLightCountCone, 1, slot);
        tileLightIndicesCone[slot] = lightIndex;
        break;

    case 2: // cone w/ shadow map
        InterlockedAdd(tileLightCountConeShadowed, 1, slot);
        tileLightIndicesConeShadowed[slot] = lightIndex;
        break;
    }

    // update bitmask
    InterlockedOr(tileLightBitMask[lightIndex / 32], 1 << (lightIndex % 32));
}

GroupMemoryBarrierWithGroupSync();
  • At this point the indices of all lights overlapping the tile's frustum have been collected in shared memory
    • These now have to be written out to global memory
  • For the storage of light indices in global memory, we allocate two buffers:
    • gLightIdx: memory pool for the indices
      • Memory is first reserved here
        • Done via an atomic add to gLightIdxCounter by a single thread in the thread group
    • gLightIdxCounter: memory counter for the memory pool
  • Once a memory offset is obtained, the light indices are written to the assigned contiguous range of gLightIdx using all the threads in the thread group
{   // Write back
    // ldsLightIdxStart is assumed to be a groupshared u32 declared elsewhere
    u32 startOffset = 0;
    if (lIdx == 0)
    {   // Reserve memory in the global pool with a single atomic per tile
        if (ldsLightIdxCounter != 0)
        {
            InterlockedAdd(gLightIdxCounter, ldsLightIdxCounter, startOffset);
        }

        ptLowerBound[tileIdx] = startOffset;
        ldsLightIdxStart = startOffset;
    }
    GroupMemoryBarrierWithGroupSync();
    startOffset = ldsLightIdxStart;

    // All 64 threads copy the tile's light list to global memory; clamp to
    // LIGHT_CAPACITY because the counter can exceed the shared-memory storage
    for (int i = lIdx; i < min(ldsLightIdxCounter, LIGHT_CAPACITY); i += 64)
    {
        gLightIdx[startOffset + i] = ldsLightIdx[i];
    }
}

MiniEngine version:

if (GI == 0)
{
    uint lightCount = 
        ((tileLightCountSphere & 0xff) << 0) |
        ((tileLightCountCone & 0xff) << 8) |
        ((tileLightCountConeShadowed & 0xff) << 16);
    lightGrid.Store(tileOffset + 0, lightCount);

    uint storeOffset = tileOffset + 4;
    uint n;
    for (n = 0; n < tileLightCountSphere; n++)
    {
        lightGrid.Store(storeOffset, tileLightIndicesSphere[n]);
        storeOffset += 4;
    }
    for (n = 0; n < tileLightCountCone; n++)
    {
        lightGrid.Store(storeOffset, tileLightIndicesCone[n]);
        storeOffset += 4;
    }
    for (n = 0; n < tileLightCountConeShadowed; n++)
    {
        lightGrid.Store(storeOffset, tileLightIndicesConeShadowed[n]);
        storeOffset += 4;
    }

    lightGridBitMask.Store4(tileIndex * 16, tileLightBitMask);
}

Shading

  • With UAV support, per-material-instance information can be stored in linear structured buffers and accessed from material shaders (a minimal sketch follows)
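A minimal sketch of the idea, with a hypothetical MaterialParams layout and shown with a read-only StructuredBuffer for simplicity; this is illustrative, not the paper's actual structure:

struct MaterialParams
{
    float4 baseColor;
    float  roughness;
    float  metalness;
    float2 pad;
};

// One entry per material instance, indexed by a per-draw material ID
StructuredBuffer<MaterialParams> gMaterialParams : register(t4);

float4 shadeSample(uint materialId, float3 lighting)
{
    MaterialParams m = gMaterialParams[materialId];
    // Modulate the accumulated lighting by the fetched material parameters
    return float4(m.baseColor.rgb * lighting, m.baseColor.a);
}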

Tiled Shading

Kaplanyan10

Light Pre-Pass Renderer

LightPrePassRenderer pipeline:EngelShaderX709

@startuml
start
split
group Render Opaque Objects
    split
    :Normals;
    split again
    :Depth Buffer;
    end split
    :Switch Off Depth Write;
    :Light Buffer;
floating note left: Sort Back-To-Front
    :Forward Rendering;
floating note left: Sort Front-To-Back
end group
split again
group Transparent Objects
    :Switch Off Depth Write;
    :Forward Rendering;
    floating note right: Sort Back-To-Front
end group
end split
stop
@enduml

  • Store depth / normalEngelShaderX709
    • Normal: R8G8B8A8
      • World-space normals are of better quality when stored in spherical coordinates
  • Store the light properties of all lights in a light bufferEngelShaderX709
    • Only light properties (no shadows, reflections, etc.)
    • Light-dependent terms are bound to their locality in space by applying the diffuse term N · L and the attenuation factor to all of them
    • Typical L-Buffer:
      1. $\sum_i N \cdot L_i \cdot \text{DiffuseRed} \cdot \text{Attenuation}_i$
      2. $\sum_i N \cdot L_i \cdot \text{DiffuseGreen} \cdot \text{Attenuation}_i$
      3. $\sum_i N \cdot L_i \cdot \text{DiffuseBlue} \cdot \text{Attenuation}_i$
      4. $\sum_i N \cdot L_i \cdot (N \cdot H_i)^n \cdot \text{Attenuation}_i$
        • If the specular term can be separated in a later rendering pass, a dedicated material shininess value $m$ can be applied:
          • $\left(\sum_i N \cdot L_i \cdot (N \cdot H_i)^n \cdot \text{Attenuation}_i\right)^m$, which behaves like raising $N \cdot H$ to the power $n \cdot m$
    • Applying a material shininess value to the specular reflection can be done in several ways:
      1. Similar to a deferred renderer, a material specular shininess can be stored in the normal / depth buffer (stencil area)
      2. Moving into a different color space that reuses some of the ideas here to achieve a tighter-packed render target
      3. A separate term can be stored to reconstruct the specular term in a later render pass:
        1. $\sum_i N \cdot L_i \cdot \text{DiffuseRed} \cdot \text{Attenuation}_i$
        2. $\sum_i N \cdot L_i \cdot \text{DiffuseGreen} \cdot \text{Attenuation}_i$
        3. $\sum_i N \cdot L_i \cdot \text{DiffuseBlue} \cdot \text{Attenuation}_i$
        4. $\sum_i N \cdot L_i \cdot (N \cdot H_i)^n \cdot \text{Attenuation}_i$
        5. $\sum_i N \cdot L_i \cdot \text{Attenuation}_i$
          • With the additional term, we can reconstruct the specular term:
            • $\left(\sum_i N \cdot L_i \cdot (N \cdot H_i)^n \cdot \text{Attenuation}_i\right) / \left(\sum_i N \cdot L_i \cdot \text{Attenuation}_i\right)$
          • The result of this equation can then be used to apply a material shininess value
      4. The diffuse term stored in the first three channels of the light buffer can be converted to luminance and then used to reconstruct the specular reflection term
      5. The common rules for constructing a specular term can be bent by creating a new term that fits this renderer design better
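A minimal sketch of a light-buffer pixel shader accumulating these four channels, assuming the view-space position and normal have been reconstructed from the depth / normal buffer; the LightInfo struct and all names are illustrative, not Engel's original code:

struct LightInfo
{
    float3 posVS;        // view-space light position
    float  radius;
    float3 diffuseColor;
    float  specPower;    // n in the formulas above
};

cbuffer LightCB : register(b0) { LightInfo gLight; };

float4 PSLightBuffer(float3 posVS : TEXCOORD0, float3 N : TEXCOORD1) : SV_Target
{
    float3 toLight = gLight.posVS - posVS;
    float  att = saturate(1.f - length(toLight) / gLight.radius);
    float3 L = normalize(toLight);
    float3 V = normalize(-posVS);    // camera sits at the view-space origin
    float3 H = normalize(L + V);
    float  NdotL = saturate(dot(N, L));
    // Channels 1-3: N.L * diffuse color * attenuation
    float3 diffuse = NdotL * gLight.diffuseColor * att;
    // Channel 4: N.L * (N.H)^n * attenuation
    float  spec = NdotL * pow(saturate(dot(N, H)), gLight.specPower) * att;
    return float4(diffuse, spec);    // additively blended into the L-Buffer
}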

Comparison and Conclusion

|Forward / Z Pre-Pass Renderer|Light Pre-Pass Renderer|
|---|---|
|Lots of geometry throughput| |
|Need to split up geometry following light distribution| |
|Dependent texture look-up and lights are not fully dynamic| |

EngelSiggraph09

|Deferred Renderer|Light Pre-Pass Renderer|
|---|---|
|Lights are independent from geometry| |
|Geometry pass stores all material and light properties|Geometry pass fills up the normal and depth buffer; lighting pass stores light properties in the light buffer|
|Only one geometry pass for the main view|(Version A) A second geometry pass fetches the light buffer and applies different material terms per surface by reconstructing the lighting equation<br>(Version B) An Ambient + Resolve (MSAA) pass fetches the light buffer, uses its content as diffuse / specular terms, and adds the ambient term while resolving into the main buffer|
|Lights are blit and therefore only limited by memory bandwidth| |
|Memory bandwidth| |
|Recalculate full lighting equation for every light| |
|Limited material representation in G-Buffer| |
|MSAA difficult compared to Forward renderer| |

EngelSiggraph09

  • Compared to a deferred rendererEngelShaderX709
    • Offers more flexibility regarding material implementations
    • Read memory bandwidth is lower
    • Using MSAA can be more efficient
    • But renders all geometry for the main view twice
  • Compared to a Z pre-pass rendererEngelShaderX709
    • Offers less material flexibility, but adds a flexible and fast multi-light solution
  • Implementation (D3D11)EngelSiggraph09 (a sketch of the geometry-shader bounding quad follows this list)
    • GS bounding box
      • Construct the light's bounding box in a geometry shader
    • Implement lighting with a compute shader
    • Balance quality / performance
      • Stop rendering dynamic lights after a certain range (for example 40 meters) and render glow cards instead
      • Use a smaller light buffer for distant lights and scale it up
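A minimal sketch of emitting a screen-aligned quad that bounds a light in a geometry shader, assuming a view-space light position / radius per point and a projection matrix gProj; all names are illustrative, not Engel's original code:

cbuffer CameraCB : register(b0) { float4x4 gProj; };

struct LightVSOut { float3 posVS : POSITION; float radius : RADIUS; };
struct LightGSOut { float4 posCS : SV_Position; };

[maxvertexcount(4)]
void GSLightQuad(point LightVSOut input[1],
                 inout TriangleStream<LightGSOut> stream)
{
    // Triangle strip covering the light's view-space bounding square
    const float2 corners[4] =
        { float2(-1, -1), float2(-1, 1), float2(1, -1), float2(1, 1) };
    for (int c = 0; c < 4; ++c)
    {
        LightGSOut o;
        float3 p = input[0].posVS
                 + float3(corners[c] * input[0].radius, 0.f);
        o.posCS = mul(gProj, float4(p, 1.f));
        stream.Append(o);
    }
}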

Clustered Shading

Clustered Deferred Shading AlgorithmOlssonBilleterAssarsson12

  1. Render scene to G-Buffers
  2. Cluster assignment
  3. Find unique clusters
  4. Assign lights to clusters
  5. Shade samples

Cluster Assignment

  • Goal of the cluster assignment:
    • Compute an integer cluster key for a given view sample from the information available in the G-Buffer
      • Position, and optionally the normal
  • Subdivide the view frustum along the x, y, z axes in view space (or NDC)
    • Due to the non-linear nature of NDC, clusters close to the near plane become very thin, whereas those towards the far plane become very long
    • Uniform subdivision in view space produces the opposite artifact
    • Space the divisions exponentially:
  • $near_k = near_{k-1} + h_{k-1}$
  • $near_0 = near$ (the first subdivision)
  • $h_0 = \frac{2 \, near \tan{\theta}}{S_y}$ (where $S_y$ is the number of subdivisions in the Y direction and $FoV = 2\theta$)
  • $near_k = near \left(1 + \frac{2 \tan{\theta}}{S_y}\right)^k$
  • Cluster index along z: $k = \left\lfloor \frac{\log{\left(-z_{vs}/near\right)}}{\log{\left(1 + \frac{2 \tan{\theta}}{S_y}\right)}} \right\rfloor$
  • Cluster key tuple: $\left(i, j, k\right) = \left(\left\lfloor x_{ss}/t_x \right\rfloor, \left\lfloor y_{ss}/t_y \right\rfloor, \left\lfloor \frac{\log{\left(-z_{vs}/near\right)}}{\log{\left(1 + \frac{2 \tan{\theta}}{S_y}\right)}} \right\rfloor\right)$, where $\left(x_{ss}, y_{ss}\right)$ is the screen-space sample position and $\left(t_x, t_y\right)$ the tile size in pixels
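A minimal sketch of the cluster-key computation from the formulas above, assuming a right-handed view space (so $z_{vs} < 0$ in front of the camera) and illustrative constants for the tile size and Y subdivision count:

#define TILE_X 64       // tile size in pixels (assumed)
#define TILE_Y 64
#define S_Y    16.0f    // number of subdivisions in Y (assumed)

uint3 clusterKey(float2 posSS, float zVS, float nearPlane, float tanHalfFovY)
{
    uint i = (uint)(posSS.x / TILE_X);
    uint j = (uint)(posSS.y / TILE_Y);
    // k = floor(log(-z_vs / near) / log(1 + 2 tan(theta) / S_y));
    // the log base cancels in the ratio, so HLSL's natural log is fine
    uint k = (uint)(log(-zVS / nearPlane) /
                    log(1.f + 2.f * tanHalfFovY / S_Y));
    return uint3(i, j, k);
}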

Finding Unique Clusters

