Rendering Pipeline Comparison (2022.07.27)

Environment

| Type | Name |
|---|---|
| OS | Windows 10 Pro 64 |
| Processor | Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (16 CPUs), ~2.9GHz |
| Memory | 32768MB RAM |
| Device | NVIDIA GeForce RTX 3080 |
| VRAM | 10077MB |
| Display | 1920 × 1200 (32 bit) (59 Hz) |
| Build Configuration | Debug / Windows |

Comparison Methods

Bandwidth Comparison [Thibieroz04]

$$
\textrm{Bandwidth}_{60\textrm{fps}} = \left( \textrm{W} \times \textrm{H} \times \left[ \textrm{MRT}_{\textrm{BPP}} \times n_{\textrm{MRT}} \times n + \textrm{Z}_{\textrm{BPP}} \times \textrm{Overdraw} + \textrm{T}_{\textrm{BPP}} \times \textrm{T}_{\textrm{B}} + n \times \left( 2 \times \textrm{BB}_{\textrm{BPP}} + \textrm{T}_{\textrm{S}} \times \textrm{T}_{\textrm{BPP}} \right) \right] + \textrm{C}_{\textrm{Geometry}} \right) \times 60 \ \textrm{Bytes} / \textrm{s}
$$
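To make the estimate easier to experiment with, the expression can be evaluated numerically. The sketch below is a term-by-term transcription of the formula. The function name and parameter names are hypothetical and simply mirror the symbols (for example `mrt_bpp` for MRT_BPP and `c_geometry` for C_Geometry). The input values in the usage lines are placeholders, not measurements from this comparison.

```python
def bandwidth_60fps(w, h, mrt_bpp, n_mrt, n, z_bpp, overdraw,
                    t_bpp, t_b, bb_bpp, t_s, c_geometry, fps=60):
    """Evaluate the bandwidth estimate above and return bytes per second.

    Each argument corresponds to one symbol of the formula; the names are
    illustrative and not taken from any library.
    """
    # Per-pixel cost: the bracketed term of the formula.
    per_pixel = (
        mrt_bpp * n_mrt * n               # MRT_BPP x n_MRT x n
        + z_bpp * overdraw                # Z_BPP x Overdraw
        + t_bpp * t_b                     # T_BPP x T_B
        + n * (2 * bb_bpp + t_s * t_bpp)  # n x (2 x BB_BPP + T_S x T_BPP)
    )
    # Scale by resolution, add the geometry cost, then multiply by frame rate.
    return (w * h * per_pixel + c_geometry) * fps


# Placeholder inputs at the display resolution listed under Environment.
estimate = bandwidth_60fps(
    w=1920, h=1200, mrt_bpp=4, n_mrt=4, n=1, z_bpp=4, overdraw=3,
    t_bpp=4, t_b=2, bb_bpp=4, t_s=1, c_geometry=0,
)
print(f"{estimate / 1e9:.2f} GB/s")
```

Keeping every symbol as a separate argument makes it easy to vary a single factor, such as the number of render targets, and see how the estimate responds.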

Storage Comparison

Duration

Frame Duration

[Figures: frame duration bar charts (All, Deferred, DeferredNoDefault, Forward, ForwardNoDefault)]

Light Phase Duration

[Figures: light phase duration bar charts (All, AllNoDefault, Tiled, Clustered)]

Render Color Duration

[Figures: render color duration bar charts (All, Deferred, DeferredNoDefault, Forward, ForwardNoDefault, Tiled, TiledNoShading, Clustered)]

Tile / Cluster Assignment Duration

[Figures: tile / cluster assignment duration bar charts (All, Deferred, DeferredNoShading, Forward, Tiled, Clustered)]

Bandwidth

Frame

| Pipeline | DRAM reads (%) | DRAM writes (%) | Total DRAM read/write (%) | DRAM activity (% of memory cycles with an active read or write request) | L1 cache read/write utilization | L2 cache read/write utilization |
|---|---|---|---|---|---|---|
| Forward | 16.333333 | 1 | 17.333333 | 17.666667 | 53.666667 | 13.666667 |
| Forward+ | 35.333333 | 2 | 37.333333 | 37.333333 | 24.333333 | 26.666667 |
| Forward+ 2.5D Culling | 35.666667 | 2 | 37.666667 | 37.666667 | 24.666667 | 26.666667 |
| Forward+ 2.5D, AABB-based Culling | 38.666667 | 2 | 40.666667 | 41 | 18.666667 | 29 |
| Forward Clustered | 34 | 2 | 36 | 36.333333 | 23.333333 | 26 |
| Deferred | 20 | 1 | 21 | 21 | 53 | 17 |
| Deferred Tiled | 37 | 2 | 39 | 39 | 21 | 28 |
| Deferred Tiled 2.5D Culling | 36.333333 | 2 | 38.333333 | 39.333333 | 20.333333 | 27.333333 |
| Deferred Tiled 2.5D, AABB-based Culling | 39 | 2 | 41 | 41.666667 | 15.666667 | 29.666667 |
| Deferred Tiled (DICE) | 36.666667 | 2 | 38.666667 | 39 | 21 | 27.666667 |
| Deferred Tiled (DICE) 2.5D Culling | 36.333333 | 2 | 38.333333 | 39.333333 | 21.333333 | 27.333333 |
| Deferred Tiled (DICE) 2.5D, AABB-based Culling | 39.666667 | 2 | 41.666667 | 42 | 16 | 29.666667 |
| Deferred Tiled (Intel) | 36.666667 | 2 | 38.666667 | 39.333333 | 21.333333 | 27.666667 |
| Deferred Clustered | 36 | 2 | 38 | 38.666667 | 20.666667 | 28 |

Geometry Phase

| Pipeline | DRAM reads (%) | DRAM writes (%) | Total DRAM read/write (%) | DRAM activity (% of memory cycles with an active read or write request) | L1 cache read/write utilization | L2 cache read/write utilization |
|---|---|---|---|---|---|---|
| Deferred | 19.666667 | 21.333333 | 41 | 41.333333 | 59.666667 | 42.666667 |
| Deferred Tiled | 19.666667 | 21.666667 | 41.333333 | 42 | 60.333333 | 43 |
| Deferred Tiled 2.5D Culling | 19 | 21.666667 | 40.666667 | 41 | 60.333333 | 42.666667 |
| Deferred Tiled 2.5D, AABB-based Culling | 18.333333 | 21 | 39.333333 | 39.333333 | 58.666667 | 41.333333 |
| Deferred Tiled (DICE) | 18.666667 | 21.666667 | 40.333333 | 41 | 60.333333 | 42.666667 |
| Deferred Tiled (DICE) 2.5D Culling | 19 | 21.666667 | 40.666667 | 41.333333 | 60.333333 | 42.666667 |
| Deferred Tiled (DICE) 2.5D, AABB-based Culling | 18.666667 | 21.666667 | 40.333333 | 41 | 60.666667 | 43 |
| Deferred Tiled (Intel) | 18 | 21 | 39 | 39.666667 | 58.666667 | 41.666667 |
| Deferred Clustered | 18 | 21 | 39 | 39.666667 | 58.666667 | 41.666667 |

Light Phase

| Pipeline | DRAM reads (%) | DRAM writes (%) | Total DRAM read/write (%) | DRAM activity (% of memory cycles with an active read or write request) | L1 cache read/write utilization | L2 cache read/write utilization |
|---|---|---|---|---|---|---|
| Deferred | 0 | 0 | 0 | 0 | 85 | 3.333333 |
| Deferred Tiled | 3 | 0 | 3 | 4 | 94.666667 | 4.666667 |
| Deferred Tiled 2.5D Culling | 3.333333 | 0 | 3.333333 | 4 | 94.333333 | 4.666667 |
| Deferred Tiled 2.5D, AABB-based Culling | 4 | 1 | 5 | 5 | 94.666667 | 5.333333 |
| Deferred Tiled (DICE) | 2 | 0 | 2 | 3 | 95.666667 | 4 |
| Deferred Tiled (DICE) 2.5D Culling | 2.333333 | 0 | 2.333333 | 3.333333 | 94.333333 | 4 |
| Deferred Tiled (DICE) 2.5D, AABB-based Culling | 4 | 1 | 5 | 5 | 93.333333 | 5 |
| Deferred Tiled (Intel) | 2 | 0 | 2 | 3 | 97 | 4 |
| Deferred Clustered | 3 | 0 | 3 | 4 | 97 | 5 |

Render Color

| Pipeline | DRAM reads (%) | DRAM writes (%) | Total DRAM read/write (%) | DRAM activity (% of memory cycles with an active read or write request) | L1 cache read/write utilization | L2 cache read/write utilization |
|---|---|---|---|---|---|---|
| Forward | 0 | 0 | 0 | 0.666667 | 82.666667 | 3.333333 |
| Forward+ | 2 | 1 | 3 | 3.666667 | 89.333333 | 5 |
| Forward+ 2.5D Culling | 2.333333 | 1 | 3.333333 | 4 | 92.666667 | 5.333333 |
| Forward+ 2.5D, AABB-based Culling | 3 | 2 | 5 | 5 | 91 | 7 |
| Forward Clustered | 2.666667 | 1 | 3.666667 | 4.666667 | 92.666667 | 6.333333 |

Tile / Cluster Assignment

| Pipeline | DRAM reads (%) | DRAM writes (%) | Total DRAM read/write (%) | DRAM activity (% of memory cycles with an active read or write request) | L1 cache read/write utilization | L2 cache read/write utilization |
|---|---|---|---|---|---|---|
| Forward+ | 1.333333 | 1 | 2.333333 | 3.333333 | 23.333333 | 12 |
| Forward+ 2.5D Culling | 1.333333 | 1 | 2.333333 | 2.666667 | 26 | 11 |
| Forward+ 2.5D, AABB-based Culling | 3.666667 | 2 | 5.666667 | 5.666667 | 41.666667 | 14 |
| Forward Clustered | 0 | 4 | 4 | 5 | 43 | 18 |
| Deferred Tiled | 0 | 1 | 1 | 2 | 24 | 12 |
| Deferred Tiled 2.5D Culling | 0 | 1 | 1 | 2 | 26 | 11 |
| Deferred Tiled 2.5D, AABB-based Culling | 1 | 2 | 3 | 3 | 43 | 14 |
| Deferred Tiled (DICE) | 2 | 0 | 2 | 3 | 95.666667 | 4 |
| Deferred Tiled (DICE) 2.5D Culling | 2.333333 | 0 | 2.333333 | 3.333333 | 94.333333 | 4 |
| Deferred Tiled (DICE) 2.5D, AABB-based Culling | 4 | 1 | 5 | 5 | 93.333333 | 5 |
| Deferred Tiled (Intel) | 2 | 0 | 2 | 3 | 97 | 4 |
| Deferred Clustered | 0.333333 | 4 | 4.333333 | 5 | 42.333333 | 18 |

Shadow Maps

Lights Scalability

Lauritzen10

Frame Duration

[Figures: frame duration plots (All, Deferred, DeferredNoDefault, Forward, ForwardNoDefault, Tiled, Tiled2_5D, Tiled2_5DAABB, Clustered)]

Light Phase Duration

[Figures: light phase duration plots (Deferred, DeferredNoDefault, Tiled, Tiled2_5D, Tiled2_5DAABB, Clustered)]

Render Color Duration

[Figures: render color duration plots (All, Deferred, DeferredNoDefault, Forward, ForwardNoDefault, Tiled, Tiled2_5D, Tiled2_5DAABB, Clustered)]

Tile / Cluster Assignment Duration

[Figures: tile / cluster assignment duration plots (All, Deferred, Forward, Tiled, Tiled2_5D, Tiled2_5DAABB, Clustered)]

GBuffer: Fat Buffer vs Thin Buffer [Kaplanyan10]

Color Space Encoding [Kaplanyan10]

Normal Encoding [Kaplanyan10]

Frame Duration

[Figure: frame duration with normal compression (NoDefault)]

Render Color Duration

[Figure: render color duration with normal compression (NoDefault)]

Geometry Phase Duration

[Figure: geometry phase duration with normal compression (NoDefault)]

Lighting Phase Duration

[Figure: lighting phase duration with normal compression (NoDefault)]

False Positive Rate

Forward+

Passes of the Scene Render stage. Cells with three values list the individual measurements followed by their average.

| Name | Duration (ms) | DRAM reads (%) | DRAM writes (%) | Total DRAM RW utilization (%) | DRAM activity | L1 cache RW utilization | L2 cache RW utilization |
|---|---|---|---|---|---|---|---|
| Particle Update | 0.091776, 0.095168, 0.097408 (avg 0.094784) | 0, 0, 1 (avg 0.333333) | 0 | 0, 0, 1 (avg 0.333333) | 0, 0, 2 (avg 0.666667) | 0 | 0, 0, 1 (avg 0.333333) |
| Z PrePass | 0.083200, 0.081600, 0.082240 (avg 0.082347) | 25, 24, 24 (avg 24.333333) | 9, 6, 6 (avg 7) | 34, 30, 30 (avg 31.333333) | 34, 31, 31 (avg 32) | 3 | 34, 33, 32 (avg 33) |
| Generate SSAO | 0.156384, 0.155328, 0.156512 (avg 0.156075) | 8, 9, 10 (avg 9) | 7, 7, 8 (avg 7.333333) | 15, 16, 18 (avg 16.333333) | 15, 17, 19 (avg 17) | 47, 47, 48 (avg 47.333333) | 32, 34, 33 (avg 33) |
| Fill Light Grid | 0.060064, 0.059296, 0.061600 (avg 0.060320) | 1, 1, 5 (avg 2.333333) | 3, 3, 2 (avg 2.666667) | 4, 4, 7 (avg 5) | 4, 4, 8 (avg 5.333333) | 43, 43, 41 (avg 42.333333) | 22, 22, 21 (avg 21.666667) |
| Main Render | 0.031232, 0.030720, 0.030816 (avg 0.030923) | 0 | 0 | 0 | 0 | 0 | 13 |
| Render Shadow Map | 0.107104, 0.090272, 0.089888 (avg 0.095755) | 19, 21, 21 (avg 20.333333) | 8, 9, 9 (avg 8.666667) | 27, 30, 30 (avg 29) | 28, 30, 30 (avg 29.333333) | 3 | 26, 28, 28 (avg 27.333333) |
| Render Color | 1.317600, 1.318592, 1.318752 (avg 1.318315) | 3 | 2 | 5 | 5 | 91, 92, 83 (avg 88.666667) | 7, 7, 6 (avg 6.666667) |
| Generate Camera Velocity | 0.059232, 0.058464, 0.059392 (avg 0.059029) | 8, 7, 10 (avg 8.333333) | 5 | 13, 12, 15 (avg 13.333333) | 13, 13, 15 (avg 13.666667) | 32, 33, 32 (avg 32.333333) | 21, 20, 20 (avg 20.333333) |
| Temporal Resolve | 0.135456, 0.134752, 0.137024 (avg 0.135744) | 42, 39, 45 (avg 42) | 17, 19, 17 (avg 17.666667) | 59, 58, 62 (avg 59.666667) | 60, 58, 62 (avg 60) | 68, 68, 67 (avg 67.666667) | 69, 67, 74 (avg 70) |
| Particle Render | 0.085088, 0.083296, 0.085344 (avg 0.084576) | 3, 3, 5 (avg 3.666667) | 1 | 4, 4, 6 (avg 4.666667) | 4, 5, 7 (avg 5.333333) | 2 | 3, 3, 4 (avg 3.333333) |
| Motion Blur | 0.052416, 0.055840, 0.055616 (avg 0.054624) | 44, 50, 49 (avg 47.666667) | 15, 14, 14 (avg 14.333333) | 59, 64, 63 (avg 62) | 59, 64, 64 (avg 62.333333) | 29, 28, 28 (avg 28.333333) | 46, 47, 47 (avg 46.666667) |
| Total | 2.186400, 2.170144, 2.181408 | 9, 8, 9 (avg 8.666667) | 4 | 13, 12, 13 (avg 12.666667) | 13, 13, 14 (avg 13.333333) | 66, 65, 65 (avg 65.333333) | 16, 15, 16 (avg 15.666667) |

Forward+ 2.5D Culling with AABB-based Culling

Passes of the Scene Render stage. Cells with three values list the individual measurements followed by their average.

| Name | Duration (ms) | DRAM reads (%) | DRAM writes (%) | Total DRAM RW utilization (%) | DRAM activity | L1 cache RW utilization | L2 cache RW utilization |
|---|---|---|---|---|---|---|---|
| Particle Update | 0.091584, 0.092256, 0.091296 (avg 0.091712) | 0, 3, 0 (avg 1) | 0 | 0, 3, 0 (avg 1) | 0, 4, 0 (avg 1.333333) | 0 | 0, 2, 1 (avg 1) |
| Z PrePass | 0.083328, 0.081248, 0.081120 (avg 0.081898) | 25, 24, 24 (avg 24.333333) | 9, 10, 10 (avg 9.666667) | 34 | 34, 35, 34 (avg 34.333333) | 3 | 34, 33, 32 (avg 33) |
| Generate SSAO | 0.161536, 0.159520, 0.154848 (avg 0.158634) | 8 | 7 | 15 | 15, 15, 16 (avg 15.333333) | 48 | 33 |
| Fill Light Grid | 0.059744, 0.059776, 0.059584 (avg 0.059701) | 1 | 3 | 4 | 4 | 43 | 22 |
| Main Render | 0.030752, 0.030720, 0.030720 (avg 0.030731) | 0 | 0 | 0 | 0 | 0 | 13 |
| Render Shadow Map | 0.106656, 0.089920, 0.089760 (avg 0.095445) | 19, 21, 21 (avg 20.333333) | 7, 9, 9 (avg 8.333333) | 26, 30, 30 (avg 28.666667) | 27, 30, 30 (avg 29) | 3 | 26, 28, 28 (avg 28.666667) |
| Render Color | 1.308896, 1.311968, 1.316896 (avg 1.312587) | 3, 3, 4 (avg 3.333333) | 2 | 5, 5, 6 (avg 5.333333) | 5, 5, 6 (avg 5.333333) | 92, 92, 84 (avg 89.333333) | 7 |
| Generate Camera Velocity | 0.058944, 0.058464, 0.058624 (avg 0.058677) | 6, 8, 8 (avg 7.333333) | 5, 5, 4 (avg 4.666667) | 11, 13, 12 (avg 12) | 12, 13, 13 (avg 12.666667) | 32, 32, 33 (avg 32.666667) | 19, 21, 22 (avg 20.666667) |
| Temporal Resolve | 0.135200, 0.134208, 0.135776 (avg 0.135061) | 44, 38, 43 (avg 41.666667) | 17 | 61, 55, 60 (avg 58.666667) | 61, 56, 60 (avg 59) | 67, 68, 67 (avg 67.333333) | 69, 66, 72 (avg 69) |
| Particle Render | 0.083584, 0.083776, 0.083936 (avg 0.083765) | 3, 5, 3 (avg 3.666667) | 1 | 4, 6, 4 (avg 4.666667) | 5, 7, 4 (avg 5.333333) | 2 | 3 |
| Motion Blur | 0.052544, 0.052224, 0.052768 (avg 0.052512) | 45, 46, 45 (avg 45.333333) | 15, 14, 15 (avg 14.666667) | 60 | 61, 60, 60 (avg 60.333333) | 29, 30, 29 (avg 29.333333) | 46, 47, 46 (avg 46.333333) |
| Total | 2.186400, 2.161280, 2.162144 (avg 2.169941) | 8 | 4 | 12 | 12 | 62, 65, 62 (avg 63) | 15 |

Forward Clustered

Passes of the Scene Render stage. Cells with three values list the individual measurements followed by their average.

| Name | Duration (ms) | DRAM reads (%) | DRAM writes (%) | Total DRAM RW utilization (%) | DRAM activity | L1 cache RW utilization | L2 cache RW utilization |
|---|---|---|---|---|---|---|---|
| Particle Update | 0.094080, 0.098720, 0.090016 (avg 0.094272) | 0 | 0 | 0 | 0 | 0 | 0 |
| Z PrePass | 0.083136, 0.080256, 0.081248 (avg 0.081546) | 25, 24, 24 (avg 24.333333) | 6, 6, 10 (avg 7.333333) | 31, 30, 34 (avg 31.666667) | 32, 31, 35 (avg 32.666667) | 3 | 34, 33, 33 (avg 33.333333) |
| Generate SSAO | 0.158592, 0.155296, 0.167648 (avg 0.160512) | 7, 8, 8 (avg 7.666667) | 7, 8, 7 (avg 7.333333) | 14, 16, 15 (avg 15) | 14, 17, 15 (avg 15.333333) | 47 | 31, 31, 32 (avg 31.333333) |
| Fill Light Grid | 0.454208, 0.450720, 0.454272 (avg 0.453067) | 0 | 1 | 1 | 1 | 38, 39, 38 (avg 38.333333) | 21 |
| Main Render | 0.031392, 0.030752, 0.031168 (avg 0.031104) | 3, 0, 0 (avg 1) | 3, 0, 0 (avg 1) | 6, 0, 0 (avg 2) | 7, 0, 0 (avg 2.333333) | 0 | 17, 15, 15 (avg 15.666667) |
| Render Shadow Map | 0.106048, 0.089184, 0.089056 (avg 0.094763) | 19, 21, 21 (avg 20.333333) | 8, 9, 9 (avg 8.666667) | 27, 30, 30 (avg 29) | 28, 30, 31 (avg 29.666667) | 3 | 26, 28, 28 (avg 27.333333) |
| Render Color | 2.018272, 2.024256, 2.027040 (avg 2.023189) | 2 | 1 | 3 | 3 | 89, 88, 91 (avg 89.333333) | 5 |
| Generate Camera Velocity | 0.059264, 0.058496, 0.059264 (avg 0.059008) | 6, 10, 12 (avg 9.333333) | 5, 4, 4 (avg 4.333333) | 11, 14, 16 (avg 13.666667) | 12, 15, 16 (avg 14.333333) | 32 | 18, 21, 23 (avg 20.666667) |
| Temporal Resolve | 0.136320, 0.134176, 0.136032 (avg 0.135509) | 43, 37, 43 (avg 41) | 16, 17, 16 (avg 16.333333) | 59, 54, 59 (avg 57.333333) | 60, 55, 60 (avg 58.333333) | 67, 68, 67 (avg 67.333333) | 69, 65, 74 (avg 69.333333) |
| Particle Render | 0.083680, 0.084256, 0.082656 (avg 0.083531) | 3, 5, 3 (avg 3.666667) | 1, 2, 1 (avg 1.333333) | 4, 7, 4 (avg 5) | 5, 7, 5 (avg 6.333333) | 2 | 3 |
| Motion Blur | 0.063776, 0.054368, 0.054528 (avg 0.057557) | 45, 55, 45 (avg 48.333333) | 20, 17, 20 (avg 19) | 65, 72, 65 (avg 67.333333) | 66, 72, 66 (avg 68) | 29, 26, 29 (avg 28) | 40, 42, 41 (avg 41) |
| Total | 3.295520, 3.267296, 3.279776 (avg 3.280864) | 5, 5, 6 (avg 5.333333) | 2, 3, 3 (avg 2.666667) | 7, 8, 9 (avg 8) | 8, 9, 9 (avg 8.666667) | 70, 67, 68 (avg 68.333333) | 13 |

Deferred Tiled

Passes of the Scene Render stage. Cells with three values list the individual measurements followed by their average.

| Name | Duration (ms) | DRAM reads (%) | DRAM writes (%) | Total DRAM RW utilization (%) | DRAM activity | L1 cache RW utilization | L2 cache RW utilization |
|---|---|---|---|---|---|---|---|
| Particle Update | 0.094080, 0.098720, 0.090016 (avg 0.094272) | 0 | 0 | 0 | 0 | 0 | 0 |
| Z PrePass | 0.083136, 0.080256, 0.081248 (avg 0.081546) | 25, 24, 24 (avg 24.333333) | 6, 6, 10 (avg 7.333333) | 31, 30, 34 (avg 31.666667) | 32, 31, 35 (avg 32.666667) | 3 | 34, 33, 33 (avg 33.333333) |
| Generate SSAO | 0.158592, 0.155296, 0.167648 (avg 0.160512) | 7, 8, 8 (avg 7.666667) | 7, 8, 7 (avg 7.333333) | 14, 16, 15 (avg 15) | 14, 17, 15 (avg 15.333333) | 47 | 31, 31, 32 (avg 31.333333) |
| Fill Light Grid | 0.454208, 0.450720, 0.454272 (avg 0.453067) | 0 | 1 | 1 | 1 | 38, 39, 38 (avg 38.333333) | 21 |
| Main Render | 0.031392, 0.030752, 0.031168 (avg 0.031104) | 3, 0, 0 (avg 1) | 3, 0, 0 (avg 1) | 6, 0, 0 (avg 2) | 7, 0, 0 (avg 2.333333) | 0 | 17, 15, 15 (avg 15.666667) |
| Render Shadow Map | 0.106048, 0.089184, 0.089056 (avg 0.094763) | 19, 21, 21 (avg 20.333333) | 8, 9, 9 (avg 8.666667) | 27, 30, 30 (avg 29) | 28, 30, 31 (avg 29.666667) | 3 | 26, 28, 28 (avg 27.333333) |
| Render Color | 2.018272, 2.024256, 2.027040 (avg 2.023189) | 2 | 1 | 3 | 3 | 89, 88, 91 (avg 89.333333) | 5 |
| Generate Camera Velocity | 0.059264, 0.058496, 0.059264 (avg 0.059008) | 6, 10, 12 (avg 9.333333) | 5, 4, 4 (avg 4.333333) | 11, 14, 16 (avg 13.666667) | 12, 15, 16 (avg 14.333333) | 32 | 18, 21, 23 (avg 20.666667) |
| Temporal Resolve | 0.136320, 0.134176, 0.136032 (avg 0.135509) | 43, 37, 43 (avg 41) | 16, 17, 16 (avg 16.333333) | 59, 54, 59 (avg 57.333333) | 60, 55, 60 (avg 58.333333) | 67, 68, 67 (avg 67.333333) | 69, 65, 74 (avg 69.333333) |
| Particle Render | 0.083680, 0.084256, 0.082656 (avg 0.083531) | 3, 5, 3 (avg 3.666667) | 1, 2, 1 (avg 1.333333) | 4, 7, 4 (avg 5) | 5, 7, 5 (avg 6.333333) | 2 | 3 |
| Motion Blur | 0.063776, 0.054368, 0.054528 (avg 0.057557) | 45, 55, 45 (avg 48.333333) | 20, 17, 20 (avg 19) | 65, 72, 65 (avg 67.333333) | 66, 72, 66 (avg 68) | 29, 26, 29 (avg 28) | 40, 42, 41 (avg 41) |
| Total | 2.051136 | 11 | 5 | 16 | 16 | 56 | 18 |

Optimization

  1. Locate the bottleneck of the pipeline
  2. Optimize that stage
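As a minimal sketch of the first step, the passes of one pipeline can be ranked by their share of the frame. The function name `rank_passes` is hypothetical. The input values are the averaged per-pass durations from the Forward+ table above, and the shares are computed against the sum of those passes rather than the separately measured total.

```python
def rank_passes(durations_ms):
    """Return (name, duration_ms, share_of_sum) tuples, most expensive first."""
    total = sum(durations_ms.values())
    ranked = sorted(durations_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / total) for name, ms in ranked]


# Averaged pass durations (ms) taken from the Forward+ table.
forward_plus_ms = {
    "Particle Update": 0.094784,
    "Z PrePass": 0.082347,
    "Generate SSAO": 0.156075,
    "Fill Light Grid": 0.060320,
    "Main Render": 0.030923,
    "Render Shadow Map": 0.095755,
    "Render Color": 1.318315,
    "Generate Camera Velocity": 0.059029,
    "Temporal Resolve": 0.135744,
    "Particle Render": 0.084576,
    "Motion Blur": 0.054624,
}

ranking = rank_passes(forward_plus_ms)
for name, ms, share in ranking[:3]:
    print(f"{name}: {ms:.6f} ms ({share:.1%} of the summed passes)")
```

With these inputs, Render Color ranks first by a wide margin, which suggests it is the stage to optimize first for Forward+.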

Pipelines to optimize:

  1. Forward+
  2. Forward Clustered
  3. Deferred Tiled
  4. Deferred Clustered
  5. Deferred Thin G-Buffer Tiled
  6. Deferred Thin G-Buffer Clustered

Locating the Bottleneck

Multiplatform GPU High-Level Optimization Guidelines [SousaKasyanSchulz12]

  1. Generalize and always optimize for the worst case scenario
    1. Discover the biggest bottlenecks and address them by tackling the biggest time consumer. This means avoiding partial optimizations.
      • Ex) Crysis 1: If the camera was static, then motion blur was disabled. If the camera was moving fast, then motion blur was enabled. This kind of bad optimization strategy resulted in big performance peaks and an inconsistent frame rate.
    2. Once done, repeat ad nauseam!
  2. Don’t repeat work or do unnecessary work. For example:
    1. Don’t down-sample full-screen color targets or depth targets multiple times for different postprocessing functions
    2. Minimize the number of memory transfers, render target clears, and any redundant full-screen passes
    3. Such repeated or redundant work adds up very quickly
      • Ex) A full-screen pass at 720p costs ca. 0.25 ms on the Xbox 360 and ca. 0.4 ms on the PS3. It is very easy to spend many milliseconds in a wasteful manner.
    4. Batch as much as possible in a single pass
  3. Take advantage of interframe coherency. Amortize costs across frames:
    1. This can provide a significant gain if done carefully, taking performance peaks and multi-GPU systems into account
    2. Distribute costs evenly
      • Ex) If the HUD updates every nth frame, then update some similarly costly render technique on every (n + 1)-th frame
    3. For screen-space ambient occlusion (SSAO) and the like, the cost can be distributed across frames
  4. In the end, the key words in most cases are: “share, share, share.” Share as many computations and as much bandwidth as is reasonably possible in a single pass.
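The interframe-coherency guideline can be illustrated with a small scheduling sketch. It assumes each periodic task has a period in frames and an offset within that period, so that two similarly costly tasks never land on the same frame. The function `tasks_for_frame`, the task names, and the period of 4 are all hypothetical and chosen only for illustration.

```python
def tasks_for_frame(frame_index, periodic_tasks):
    """Return the names of the tasks due on this frame.

    periodic_tasks maps a task name to (period, offset): the task runs on
    every frame where frame_index % period == offset % period.
    """
    return [
        name
        for name, (period, offset) in periodic_tasks.items()
        if frame_index % period == offset % period
    ]


# The HUD updates every 4th frame; a similarly costly technique is shifted
# by one frame so that the two costs never coincide.
schedule = {
    "hud_update": (4, 0),
    "similar_cost_technique": (4, 1),
}

for frame in range(6):
    print(frame, tasks_for_frame(frame, schedule))
```

Offsetting tasks that share a period spreads their cost across frames instead of stacking it on one frame, which is the kind of peak the guidelines warn against.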

Multiplatform Optimization: Best Practice [SousaKasyanSchulz12]


References

Deferred Shading with Multiple Render Targets. Nicolas Thibieroz, PowerVR Technologies / AMD. ShaderX2.
Deferred Shading in Tabula Rasa. Rusty Koonce, NCSoft Corporation / Facebook. GPU Gems 3.
CryENGINE 3: Reaching the Speed of Light. Anton Kaplanyan, Crytek / Intel Corporation. SIGGRAPH 2010: Advances in Real-Time Rendering in Games Course.
Deferred Rendering for Current and Future Rendering Pipelines. Andrew Lauritzen, Intel Corporation. SIGGRAPH 2010: Beyond Programmable Shader Course.
Rendering Tech of Space Marine. Pope Kim, Relic Entertainment / POCU. Daniel Barrero, Relic Entertainment. KGC 2011.
Tiled Shading. Ola Olsson, Chalmers University of Technology / Epic Games. Ulf Assarsson, Chalmers University of Technology. Journal of Graphics, GPU, and Game Tools.
CryENGINE 3: Three Years of Work in Review. Tiago Sousa, Crytek / id Software. Nickolay Kasyan, Crytek / AMD. Nicolas Schulz, Crytek. GPU Pro 3.


import matplotlib.pyplot as plt

# Light counts per test configuration, given as Point/Cone/Cone Shadowed.
light_counts = ["16/8/8", "32/24/8", "64/56/8", "128/56/8"]

# Scene render duration (ms) per pipeline, one value per light configuration.
durations_ms = {
    "Forward+": [1.293643, 1.402208, 1.691659, 1.991093],
    "Forward+ 2.5D, AABB Culling": [1.253589, 1.318080, 1.516608, 1.718923],
    "Forward Clustered": [2.174741, 2.318197, 2.950581, 3.268107],
    "Deferred Tiled": [1.370539, 1.453547, 1.709749, 1.972149],
    "Deferred Tiled 2.5D, AABB Culling": [1.321376, 1.372097, 1.554155, 1.718635],
    "Deferred Tiled DICE": [1.579851, 1.630304, 1.849120, 2.059509],
    "Deferred Tiled DICE 2.5D, AABB Culling": [1.367307, 1.450763, 1.607851, 1.739381],
    "Deferred Clustered": [1.369045, 2.354485, 2.935424, 3.519659],
}

# One line per pipeline, in the order listed above.
for label, values in durations_ms.items():
    plt.plot(light_counts, values, label=label)

plt.xlabel("Number of Lights (Point/Cone/Cone Shadowed)")
plt.ylabel("Scene Render Duration (ms)")
plt.legend()
plt.show()