Musings about Deferred Rendering in OpenGL 4.3+
Await details...
Re: Clustered Deferred Rendering in OpenGL 4.3
This might make it into the irr::ext:: namespace of IrrlichtBAW... but right now this depends on the outcome of several benchmarks.
There are four ways of doing deferred; which one you use roughly correlates with what you are deferring: lighting, shading, or rendering.
1) The old S.T.A.L.K.E.R. / GPU Gems 3 (Tabula Rasa) method, improved upon by Crytek in Crysis, is deferred lighting... You have a G-Buffer and an L-Buffer.
Basically you draw the scene geometry outputting depth, normal, shininess/roughness and some other crap needed for lighting (outdoor masks); the total G-Buffer footprint is between 64 and 104 bits per pixel (the bigger value includes optional albedo).
Optionally you can have emissives write into the L-Buffer, which holds 3 values of diffuse and 3 values of specular, or 1 value of specular if you're poor (and Crysis 2 on consoles was).
Aaaand it needs to be a blendable HDR format, so either you waste 2 alpha channels or you use R11G11B10F (and you can't even be clever and pre-divide your light value by the exposure, because of blending).
After filling the G-Buffer you fill the L-Buffer (the L-Buffer shares its Z-buffer with the G-Buffer): you draw the light bounding geometry (icospheres, cubes, cones) into the scene additively.
The special shader with which the light gets drawn fetches the depth, normal, shininess etc. and evaluates your lighting function at the reconstructed position, writing diffuse and specular into the appropriate channels of the L-Buffer. Nothing prohibits you from making "light aggregate shaders" where a single light volume simulates more complex lighting (from more than one light), and not all lights have to share the same shader code (no penalty).
The optimization that makes this workable is drawing the light volumes with a stencil masking op to get exactly the same effect as stencil shadows.
Earlier methods relied on two passes over the light geometry to create the stencil mask, but they suffered from these problems:
- It's two passes PER LIGHT VOLUME, breaking batching and causing lots of render state switching (depth func, stencil, color write enable). [Technically you could use different stencil bits, so up to 8 lights could be batched.]
- To mitigate the above you'd have to partition the lights into non-overlapping groups, which breaks batching beyond repair.
- The reversal of the Z-test direction for the 2nd pass would kill early-Z and early-stencil on some hardware (though you could explicitly request them).
- You won't benefit from early-stencil or early-Z if you've drawn alpha-tested geometry into the G-Buffer, which is the only reason we're doing these stencil-buffer shenanigans in the first place.
With OpenGL 4.3 we can do a bit better thanks to early_fragment_tests.
The trick is to carefully manage your light-bounding geometry and keep it simple, so that the polygons are "quasi-sorted", i.e. overlapping polygons appear in the mesh in the order they should be drawn.
The rasterizer state would be:
Depth write disabled
Depth function GREATER
Frontface: stencil func GL_EQUAL, reference value 0, mask 0x1u; stencil pass op KEEP, stencil fail op KEEP, depth fail op REPLACE with 1 [alternative: ALWAYS, KEEP, KEEP, INCR] + "discard" in the shader
Backface: stencil func GL_EQUAL, reference value 1, mask 0x1u; stencil pass op REPLACE with 0, stencil fail op KEEP, depth fail op REPLACE with 0 [alternative: GL_NOTEQUAL 0, DECR, KEEP, DECR]
The order in which the individual lights get drawn makes no difference.
The discard kills our early-Z, but we use early_fragment_tests to make sure that the shader won't even start executing for pixels where the light volume covers empty space in mid-air, let alone try to write out its results via a blend in the ROP.
This saves pixel shader invocations and framebuffer bandwidth.
Because we can now use "discard;", your shader can cancel execution as soon as you've read the depth buffer (and hence know the position) and found it outside the exact light volume or inside a shadow; you can also discard after finding that the normal is back-facing or that the diffuse+specular contribution is too dim to matter.
The reason why discard normally kills early-Z and early-stencil is that discard prevents the fragment from writing to the color, Z, or stencil buffer (the Z/stencil test and write appear to be an atomic operation).
Using discard in an early_fragment_tests shader forces the potential depth and stencil writes to occur anyway, only inhibiting the color write (which, for once, in this particular case, is useful).
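To make this concrete, here is a minimal sketch of such a light-volume fragment shader in GLSL 4.30. The host state in the comment uses the ALWAYS/INCR + NOTEQUAL/DECR alternative from the state list above; the binding points, uniform names and the lighting math itself are illustrative assumptions, not lifted from any engine.
[code]
#version 430 core
// Assumed host state (the "alternative" variant from the list above):
//   glDepthMask(GL_FALSE); glDepthFunc(GL_GREATER);
//   glStencilFuncSeparate(GL_FRONT, GL_ALWAYS,   0, 0x1u);
//   glStencilOpSeparate  (GL_FRONT, GL_KEEP, GL_INCR, GL_KEEP); // sfail, dpfail, dppass
//   glStencilFuncSeparate(GL_BACK,  GL_NOTEQUAL, 0, 0x1u);
//   glStencilOpSeparate  (GL_BACK,  GL_KEEP, GL_DECR, GL_DECR);
//   glBlendFunc(GL_ONE, GL_ONE); // additive into the L-Buffer

// Force depth AND stencil tests (plus their writes) to run before the shader,
// so discard can no longer suppress the stencil update, only the color write.
layout(early_fragment_tests) in;

layout(binding = 0) uniform sampler2D depthBuf;  // G-Buffer depth
layout(binding = 1) uniform sampler2D normalBuf; // G-Buffer normal

uniform vec3 lightPosVS;   // view-space light position (illustrative)
uniform vec3 lightColor;
uniform float lightRadius;
uniform mat4 invProj;

layout(location = 0) out vec4 diffuseOut; // L-Buffer diffuse channels

vec3 reconstructViewPos(ivec2 coord, float depth)
{
    vec2 uv = (vec2(coord) + 0.5) / vec2(textureSize(depthBuf, 0));
    vec4 view = invProj * vec4(vec3(uv, depth) * 2.0 - 1.0, 1.0);
    return view.xyz / view.w;
}

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    vec3 posVS = reconstructViewPos(coord, texelFetch(depthBuf, coord, 0).r);

    vec3 toLight = lightPosVS - posVS;
    float dist = length(toLight);
    if (dist > lightRadius)
        discard; // outside the exact light volume: only the color write dies

    vec3 n = normalize(texelFetch(normalBuf, coord, 0).xyz * 2.0 - 1.0);
    float nDotL = dot(n, toLight / dist);
    if (nDotL <= 0.0)
        discard; // back-facing normal: too dim to contribute

    diffuseOut = vec4(lightColor * (nDotL / (1.0 + dist * dist)), 0.0);
}
[/code]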
The main performance bottleneck here is filling the G-Buffer (unless you have a lot of big lights, in which case it's filling the L-Buffer). This is why, if you have some scene proxy geometry (like for occlusion queries) or you're not vertex bound, you do a Z-prepass to fill the Z-Buffer with approximate values to reduce overdraw. A Z-prepass can be 4x faster than regular shading to a 32-bit color buffer, so with a G-Buffer we're looking at a <10% cost for the prepass.
Finally you'd have your composing shader: a fullscreen quad that reads the L-Buffer, adds in the directional lights (sun) to get the total lighting, and multiplies that with the albedo and specular color channels from the G-Buffer (plus a whole bunch of other fullscreen work, like modulating in SSAO and tonemapping).
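A minimal sketch of that composing pass, assuming an L-Buffer with diffuse light in RGB and a single monochrome specular sum in A (the "poor" layout from above); the bindings and names are made up, and the sun term is shown as a constant for brevity where a real shader would evaluate it from the G-Buffer normal.
[code]
#version 430 core
layout(binding = 0) uniform sampler2D lBuffer;      // accumulated lighting
layout(binding = 1) uniform sampler2D albedoBuf;    // G-Buffer albedo
layout(binding = 2) uniform sampler2D specColorBuf; // G-Buffer specular color
layout(binding = 3) uniform sampler2D ssaoBuf;      // screen-space AO

uniform vec3 sunDiffuse; // directional term, constant only for brevity

layout(location = 0) out vec4 color;

void main()
{
    ivec2 coord  = ivec2(gl_FragCoord.xy);
    vec4 light   = texelFetch(lBuffer, coord, 0);
    vec3 albedo  = texelFetch(albedoBuf, coord, 0).rgb;
    vec3 specCol = texelFetch(specColorBuf, coord, 0).rgb;
    float ao     = texelFetch(ssaoBuf, coord, 0).r;

    // (diffuse light + sun) * albedo, plus specular light * specular color
    vec3 shaded = (light.rgb + sunDiffuse) * albedo * ao + light.a * specCol;
    color = vec4(shaded, 1.0); // tonemapping would also go here
}
[/code]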
On the PS3 and Xbox 360 (especially the 360, which had only 10 MB of EDRAM for the FBO) VRAM was at a premium, the consoles having 256 MB and 512 MB respectively; so to save on G-Buffer memory you wouldn't store the albedo and specular color channels, you'd just draw the scene again (with a full Z-buffer, so no overdraw, and possibly early-Z) with multiplicative blending into the L-Buffer, or sample the L-Buffer, multiply the colours, and draw straight to the output buffer.
Obviously this results in identical or even worse fillrate usage than the fat G-Buffer.
Things I need to benchmark:
1) ECP_NONE vs. different FBO for Z-prepass (cost of z-prepass)
2) Z-prepass benefit on preventing overdraw on various FBOs
Re: Clustered Deferred Rendering in OpenGL 4.3
So now that we've covered classical deferred, we have two variants of essentially the same thing.
2) Tiled Deferred Lighting/Shading (Battlefield, DICE, etc.)
You cut the screen into 16x16 tiles and use a compute shader to cull and build a list of the lights affecting each tile (get the min/max depth within a tile and cull against the resulting mini-frustum).
In the next compute shader you launch a workgroup per tile and fetch the affecting lights into shared memory (if using few lights you can fetch straight from a UBO, but without dynamic indexing!!!),
then for all pixels in the tile you calculate the lighting from all the affecting lights, as in the sketch below.
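A minimal sketch of such a tile pass as a single compute shader (culling and shading fused), assuming view-space point lights in an SSBO; only the depth bounds of the mini-frustum are tested, the side planes and the actual BRDF are elided, and all names plus the 256-light per-tile cap are assumptions.
[code]
#version 430 core
layout(local_size_x = 16, local_size_y = 16) in; // one 16x16 tile per workgroup

struct PointLight { vec4 posRadius; vec4 color; }; // view-space pos + radius

layout(std430, binding = 0) readonly buffer Lights { PointLight lights[]; };
layout(binding = 0) uniform sampler2D depthBuf;
layout(rgba16f, binding = 1) uniform writeonly image2D lightOut;

uniform mat4 invProj;
uniform uint lightCount;

shared uint tileMinZBits, tileMaxZBits;
shared uint tileLightCount;
shared uint tileLightList[256]; // assumed per-tile cap

vec3 reconstructViewPos(ivec2 coord, float depth)
{
    vec2 uv = (vec2(coord) + 0.5) / vec2(textureSize(depthBuf, 0));
    vec4 view = invProj * vec4(vec3(uv, depth) * 2.0 - 1.0, 1.0);
    return view.xyz / view.w;
}

void main()
{
    if (gl_LocalInvocationIndex == 0u) {
        tileMinZBits = 0xFFFFFFFFu;
        tileMaxZBits = 0u;
        tileLightCount = 0u;
    }
    barrier();

    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    float depth = texelFetch(depthBuf, coord, 0).r;
    vec3 posVS = reconstructViewPos(coord, depth);
    float viewDist = -posVS.z; // positive distance in GL view space

    // 1) per-tile min/max depth (positive floats keep their ordering as uints)
    atomicMin(tileMinZBits, floatBitsToUint(viewDist));
    atomicMax(tileMaxZBits, floatBitsToUint(viewDist));
    barrier();

    // 2) cull: each of the 256 threads tests a strided subset of the lights
    //    against the tile's Z bounds (mini-frustum side planes elided)
    float zMin = uintBitsToFloat(tileMinZBits);
    float zMax = uintBitsToFloat(tileMaxZBits);
    for (uint i = gl_LocalInvocationIndex; i < lightCount; i += 256u) {
        vec4 pr = lights[i].posRadius;
        float d = -pr.z;
        if (d + pr.w >= zMin && d - pr.w <= zMax) {
            uint slot = atomicAdd(tileLightCount, 1u);
            if (slot < 256u) tileLightList[slot] = i;
        }
    }
    barrier();

    // 3) shade: every pixel in the tile walks the same short light list
    vec3 sum = vec3(0.0);
    for (uint j = 0u; j < min(tileLightCount, 256u); ++j) {
        PointLight l = lights[tileLightList[j]];
        float dist = distance(l.posRadius.xyz, posVS);
        if (dist < l.posRadius.w) // N.L and the BRDF elided for brevity
            sum += l.color.rgb / (1.0 + dist * dist);
    }
    imageStore(lightOut, coord, vec4(sum, 0.0));
}
[/code]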
3) Clustered (Just Cause)
Cut the screen into 16x16 tiles, then also partition along the Z direction, forming a 3D grid of clusters; a sketch of the cluster mapping follows below. The Z partitioning scheme can be arbitrarily complex (adaptive, fixed, or even clustering by normal!).
Launch one workgroup per occupied cluster.
More work gets launched than in tiled (tiles spanning several Z slices run more than once), but you're less likely to have "hot pixels" holding everyone up: with long per-tile frusta in scenes with depth discontinuities, pixels appear to be affected by far more lights than they really are.
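For reference, a minimal sketch of mapping a pixel to a cluster, assuming 16x16 tiles and exponential Z slices between made-up near/far bounds (as noted above, real schemes can be adaptive or even split by normal).
[code]
// helper for the clustered light-assignment and shading shaders
const uint  CLUSTERS_Z = 24u;           // assumed slice count
const float NEAR_Z = 0.1, FAR_Z = 1000.0;

uvec3 clusterCoord(vec2 fragCoord, float viewDist) // viewDist > 0
{
    uvec2 tile = uvec2(fragCoord) / 16u;
    // exponential partitioning: slices have equal depth ratios, not widths
    float slice = log(viewDist / NEAR_Z) / log(FAR_Z / NEAR_Z) * float(CLUSTERS_Z);
    return uvec3(tile, clamp(uint(max(slice, 0.0)), 0u, CLUSTERS_Z - 1u));
}
[/code]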
Both methods have one MASSIVE advantage over classical deferred: in classical deferred, unless a light has been grouped (lights large in the current camera view get merged with the directionals into one screenquad), every light costs one read of depth+normal plus the decode and position reconstruction, as each light volume gets its own lighting shader invocation.
Multiple lights kill the texel bandwidth very quickly (scaling linearly in the number of lights times their screen area).
The clustered or tiled compute shaders read the G-Buffer only once per frame, and that's it!
Both methods can output to the L-Buffer or modulate the albedo and specular straight away (removing the memory needed for an L-Buffer)!
If for some strange reason an L-Buffer is required, it can be compressed and quantized into fewer bits, as blending is not required (quantization error doesn't accumulate).
Whether we call it deferred lighting or deferred shading depends on whether we output just the L-Buffer, or go crazy and take all the shading parameters into the G-Buffer and produce the final shaded colours in the full-screen pass (no second geometry pass).
Benchmarks TODO:
1) 16x16 vs 32x32 tiles
2) Clustered vs Tiled showdown!
Re: Clustered Deferred Rendering in OpenGL 4.3
4) Full Deferred Rendering of death -- Wolfgang Engel's and Intel's latest research
This method is the most insane of them all.
You simply rasterize triangle + object (meshbuffer) IDs into a 32-bit or 64-bit buffer with only one shader (or two, for alpha-ref geometry).
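A minimal sketch of that fill shader; the 12/20-bit split between object and triangle ID is an assumption, not something the technique mandates.
[code]
#version 430 core
// Visibility-buffer fill: one 32-bit uint per pixel (GL_R32UI attachment)
flat in uint objectID; // fed from the vertex shader, constant per draw

layout(location = 0) out uint visibility;

void main()
{
    // gl_PrimitiveID = triangle index within the current draw call
    visibility = (objectID << 20u) | (uint(gl_PrimitiveID) & 0xFFFFFu);
}
[/code]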
You build off a classical tiled or clustered light manager (with extras).
On AMD you can get hold of barycentric coordinates for vertex attribute interpolation.
Split screen into tiles again (16x16 or similar).
There are a few distinct materials (different shader logic, textures etc.), let's say 8, and each object has to use one.
Then, with a compute shader, you use the object IDs to build a list of tiles that need at least one pixel shaded by each material.
You indirectly dispatch a compute shader per material, shading the tiles in screen space from that material's tile list; you don't really care about doing extra work, as you only write out the samples whose material ID matches.
The material compute shader needs to look up the triangle's vertices from the object and triangle IDs, reconstruct the triangle and barycentrically interpolate the attributes to get the UV, normal etc. (as in the sketch below), work out the pixel derivatives (potentially much more accurate than pixel-shader ones) for mip-mapping or anisotropic filtering, sample the textures, and do the custom shading+lighting work.
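A minimal sketch of the reconstruction/interpolation step of that compute shader, with perspective-correct barycentrics recomputed from screen-space edge functions; the buffer layouts and the 12/20-bit ID split mirror the assumptions of the fill-shader sketch above, and only UV interpolation is shown.
[code]
// helpers for the material compute shader (assumed layouts)
struct Vertex     { vec4 pos; vec2 uv; vec2 pad; };
struct ObjectData { mat4 mvp; uint firstIndex; uint baseVertex; uint pad0, pad1; };

layout(std430, binding = 1) readonly buffer Vertices { Vertex verts[]; };
layout(std430, binding = 2) readonly buffer Indices  { uint inds[]; };
layout(std430, binding = 3) readonly buffer Objects  { ObjectData objects[]; };

float cross2(vec2 a, vec2 b) { return a.x * b.y - a.y * b.x; }

vec2 interpolateUV(uint vis, vec2 ndc) // ndc = pixel center in [-1,1]
{
    uint objID = vis >> 20u, triID = vis & 0xFFFFFu;
    ObjectData obj = objects[objID];

    // fetch the triangle's three vertices and transform to clip space
    vec4 clip[3]; vec2 uv[3];
    for (int i = 0; i < 3; ++i) {
        uint vi = inds[obj.firstIndex + 3u * triID + uint(i)] + obj.baseVertex;
        clip[i] = obj.mvp * verts[vi].pos;
        uv[i] = verts[vi].uv;
    }

    // screen-space signed areas (edge functions) against the pixel position
    vec2 p0 = clip[0].xy / clip[0].w;
    vec2 p1 = clip[1].xy / clip[1].w;
    vec2 p2 = clip[2].xy / clip[2].w;
    float a0 = cross2(p1 - ndc, p2 - ndc); // area opposite vertex 0
    float a1 = cross2(p2 - ndc, p0 - ndc);
    float a2 = cross2(p0 - ndc, p1 - ndc);

    // perspective correction: weight each area by 1/w and renormalize
    vec3 b = vec3(a0 / clip[0].w, a1 / clip[1].w, a2 / clip[2].w);
    b /= (b.x + b.y + b.z);
    return b.x * uv[0] + b.y * uv[1] + b.z * uv[2];
}
[/code]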
Logically it seems like more work per pixel, but triangles often span multiple pixels, so the GPU memory cache makes this much more efficient; plus you only fetch the vertices of visible triangles.
+ The higher your resolution, the more this holds, as the cache coherency goes up (you really start seeing the perf difference at 4K).
+ Intel and mobile GPUs love this.
+ Tiled rasterizers love this (it rasterizes faster on mobile because the pixels take less memory, ergo the hardware tiles are bigger).
+ If you want to support objects of mixed complexity in your scene (no normal map, no shininess map, no texture color or data), you don't have to pay the price of the most complex one for all of your objects (untextured objects are just as fillrate hungry as textured ones in all the previous deferred scenarios).
+This has the advantage of using very little extra memory compared to forward rendering.
+Easily extends to MSAA
+You only fetch texels from textures for the pixels which are visible, reducing texel fillrate bandwidth and the effects of overdraw
+There's almost no point to a Z-prepass
+Your actual materials can be crazy complicated
+ Can use the configurable MSAA extension (if not outputting barycentrics) to cut pixel-shader work down by factors of 4, 8, 16 and even 32 during the buffer-filling pass
+You could vary the lighting rate, global illumination rate, shadowing rate, or the texturing rate!
Downsides:
- Need to keep the heterogeneity of materials in the scene low, or introduce UBERSHADERS/UBERMATERIALS
- Need to pack textures together into arrays and eternally-filled texture slots so that materials can be batched
- Need to pack the index and vertex data of several meshbuffers together into shared GPU buffers (or use transform feedback and keep a copy of the entire scene geometry)
- Skinned meshbuffers present a problem (either include skinning in the material compute shader, or use transform feedback to cache the skinning results)
- Hard to do SSAO and other effects which need the normal information, etc.
- No L-Buffer to write baked lighting, emissive, light-probe or GI to
Re: Clustered Deferred Rendering in OpenGL 4.3
The biggest problem of all: MSAA... why is it a problem?
Because you output a very fat G-Buffer which might be MSAA-compressed, but only as a bandwidth optimization (if you have a Kx MSAA texture, it will eat Kx the memory), and it won't stay compressed as soon as you draw some triangle edges into it.
Another shocking thing is that the pixel shader outputs the same color value for all the samples it covers within a triangle, unless sample shading is enabled (causing up to Kx invocations per pixel, depending on the sample shading rate); so either way you're writing Kx more data, which most of the time (interiors of triangles) is K exact copies of the same data.
A modest 128-bit G-Buffer for deferred shading with 8x MSAA at 1080p (or 2x MSAA at 4K) will cost you on the order of half a gigabyte of VRAM (1920x1080 pixels x 8 samples x 16 bytes is already ~265 MB before counting depth/stencil and the L-Buffer).
Now the problem is that in most pixels all samples belong to the same object and have exactly the same values in the G-Buffer, but some don't, and those need per-sample lighting or a variant of it.
Traditionally in classical deferred you then run an edge-detection filter which writes out either a stencil or a depth mask, and you draw the light volumes with per-sample shading on the edge pixels and without it on the rest.
The edge filter is a double-edged sword: under-detect some edges and you risk halos and more aliasing at primitive edges; over-detect and your performance drops off a cliff due to excessive per-sample shading.
Also, your L-Buffer needs MSAA.
With tiled or clustered it's the same story but without the stencil mask; the really fat MSAA L-Buffer can be skipped, and shared memory in the per-sample shading case can save us from needing an MSAA output to run a resolve on.
Obviously this whole situation changes if we have the object+triangle ID handy: it makes computational anti-aliasing (FXAA, SMAA) easier and allows us to perform "aware upsampling" from an L-Buffer without MSAA.
Inferred lighting relied on this for working with transparents and MSAA geometry, but REQUIRED drawing the scene twice.
Re: Clustered Deferred Rendering in OpenGL 4.3
What are the advantages of Deferred Lighting (geometry drawn again in final pass):
+ The G-Buffer contains only parameters relevant to the per-light lighting function (no albedo, no specular, no motionvector, no DoF CoC)
+ The shading and texturing can be very complex and different per-object and decoupled from the unified global deferred shader
+ Emissives don't have to be written to the L-Buffer or drawn in additively in a second pass
+ Static lightmaps or precomputed lighting are more easily handled (it's basically a special case of emissive which can write to the L-Buffer)
+ Environment Map/Probe reflections are also handled more naturally and don't add complexity to the deferred pass
So what are the advantages if any of classical deferred?
+Performs well with few lights or small lights (<3 or 4x overdraw for the whole screen on average)
+Performs well in high depth complexity scenes or scenes where per-pixel hardware stencil culling is better than per-tile
+Allows for using complex custom light volumes
+Cheap stencil-like shadows are possible; in CryEngine they called it light-volume clipping (a duplicate of the point above)
By far the two best advantages over tiled and clustered are:
+Stencil shadows, as otherwise one has to use static shadowmaps and the memory used + resolution becomes an issue (also the atlasing)
+No overhead and algorithmic complexity of light culling and tile-light-list building (for few lights)
Emissives are actually hard to do, as:
1) They cannot write to albedo or specular (no lighting means no emission)
2) They cannot just write to the L-Buffer diffuse or specular, as they will be modulated with the surface albedo, which may need to be 0 if we don't want our emissive material to be further affected by lights.
So emissive pretty much needs to be additively blended in after the shading or put into its own L-Buffer channel.
Re: Clustered Deferred Rendering in OpenGL 4.3
More in-depth on the MSAA.
L-Buffer techniques are quite heavy, as one is forced to either:
1) Do all lighting in one MSAA L-Buffer and pay the fillrate output/blend costs even on pixels that aren't lit per-sample (even with multisample rasterization off you write to all samples) [correction: you could use sample masks to write only one sample, but that's slower than rendering to a non-MSAA buffer]
2) Allocate both an MSAA L-Buffer and a non-MSAA L-Buffer, paying more memory and more buffer clears (this requires a nasty MSAA depth buffer conversion routine to generate the non-MSAA FBO depth buffer, killing compression and optimization)
Deferred Lighting (with the second geometry pass) is almost useless here because it would require drawing the scene geometry 3 times:
1) To fill the MSAA G-Buffer
2) To take the first sample from the L-Buffer (optionally non-MSAA) per pixel and shade the pixels without multisampling into an MSAA result buffer
3) To sample the MSAA L-Buffer and shade per-sample into the MSAA result buffer
You can't merge passes 2 and 3 into one, because you'd need to draw the geometry with two different materials (per-sample shading + multisample off and on), which is impossible in the same drawcall.
You could of course run everything at per-sample frequency and only do step 3 (but on all pixels); however, that would obliterate the texture fillrate (per-sample normal map fetch, texture fetch, L-Buffer fetch).
This is why for MSAA scenarios I'd recommend Deferred Rendering, where the process is much simpler:
1) Draw the MSAA G-Buffer
which is followed by stencil-masking the pixels that need per-sample shading, and then either:
L-Buffer) Two passes over all lights, at per-pixel and per-sample rates with the appropriate stencil masking, possibly into separate buffers for each pass; then compose into the output buffer with a fullscreen pass that includes the directional lights
No L-Buffer) Two passes of the fullscreen shading pass, at per-pixel and per-sample rates, writing straight to the output buffer
The output buffer here can be non-MSAA straight away, unless you use HDR; then you need to either combine this shader with tonemapping or output to an MSAA buffer.
If you are not doing the Deferred Everything technique or Inferred Lighting, then you don't have unique triangle IDs per pixel.
This requires an "Edge Detection" shader to build a stencil mask of where more than one triangle occupies a pixel and more-than-1-sample-per-pixel lighting/shading is required.
"Edge Detection" is a misnomer: you don't run any difference, Sobel or Canny edge filters; you just read all of a pixel's samples and their data, such as normals and albedo, and see whether they all match (a sketch follows below).
You can't really use depth, because, as opposed to colour and normals, it is always correctly interpolated across samples (it's not a fragment shader output) so as to give good anti-aliasing on intersecting triangles.
Obviously, using per-sample shading for the G-Buffer outputs (higher-frequency sampled normal or specular maps) will lead to more pixels being stenciled for per-sample shading.
Some (future) GPUs might screw you over if they interpolate pixel shader color outputs between samples (i.e. don't write the same value to all of a pixel's samples).
Without triangle IDs, detecting non-homogeneous pixels (edges) involves guesswork, and better estimates require more computation; it's also very bandwidth intensive (you need to fetch most of the G-Buffer attributes for comparison), as every pixel, and every sample within it, needs to be checked to see if one is different.
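A minimal sketch of such a sample-match pass as a fullscreen fragment shader, assuming the host marks the stencil on surviving fragments (e.g. stencil func ALWAYS, pass op REPLACE); only normal and albedo are compared here, and the thresholds are made-up tuning values.
[code]
#version 430 core
layout(binding = 0) uniform sampler2DMS normalBuf;
layout(binding = 1) uniform sampler2DMS albedoBuf;

uniform int sampleCount; // K

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    vec3 n0 = texelFetch(normalBuf, coord, 0).xyz;
    vec3 a0 = texelFetch(albedoBuf, coord, 0).xyz;
    for (int s = 1; s < sampleCount; ++s) {
        // loose thresholds under-detect (halos), tight ones over-detect (slow)
        if (distance(texelFetch(normalBuf, coord, s).xyz, n0) > 0.02 ||
            distance(texelFetch(albedoBuf, coord, s).xyz, a0) > 0.02)
            return; // mismatch: keep the fragment so the stencil gets marked
    }
    discard; // all samples match: per-pixel shading is enough here
}
[/code]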
This method has one more downside for classic (non-forced-early-Z) deferred: it takes away one stencil buffer bit for the MSAA mask.
So to sum it up: if you don't have triangle IDs, you face fetching all the G-Buffer data (all pixels, all samples, all attributes) at least once just on account of the "edge detection"/stencil masking, which is much more expensive than forward rendering.
You can, however, be VERY clever and notice that most of the time, when all samples are not identical, it's because two triangles share the pixel footprint (edges).
It's very, very rare that you get 3 or more triangles (corners).
Ergo, even if you run 16x or 32x MSAA, you should only have to shade at most 2x or 3x per pixel!
And for this you need sample group IDs, which you can either get reliably from triangle IDs or generate implicitly from the "edge detection".
Re: Clustered Deferred Rendering in OpenGL 4.3
From now on I will no longer consider classical stencil-based deferred, as it overcomplicates the explanations and is next to obsolete (I don't plan on ever using it).
In the very, very clever method of variable lighting rate, we can assign shading group IDs to samples.
If triangle IDs are not available, determining "how many unique samples are there" becomes VERY difficult, and the answer can differ radically depending on how many attributes besides the surface normal are available for classification.
When there are fewer unique sample groups than the maximum number of shading group IDs, the problem is trivial.
When there are more unique sample groups (triangles in a pixel) than shading group IDs, we can assign the available shading group IDs to the two most populous sample groups (see the note about albedo below), and then for the rest of the samples determine which group they are closer to (similar depth + similar normal).
Essentially this becomes a problem of clustering the N (sample count) MSAA samples into K (max lighting rate) clusters with known centroids; a sketch follows below.
You'd only look at and discriminate based on depth and normal at this stage; far-plane pixels could also be omitted from consideration using the depth buffer.
Albedo+specular could be used to avoid assigning shading group IDs to samples which "can't be lit", and to modulate the priority so that a group's population (samples in the group) is weighted by the amount of light it can reflect.
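A minimal sketch of the nearest-centroid assignment for K = 2, on depth + normal only; seeding from the two most populous groups is elided (samples 0 and 1 seed the centroids here, a simplification), and the depth weight is a made-up tuning constant.
[code]
// returns one bit per sample: which of the 2 shading groups it landed in
uint classifySamples(ivec2 coord, int sampleCount,
                     sampler2DMS depthBuf, sampler2DMS normalBuf)
{
    vec4 c0 = vec4(texelFetch(normalBuf, coord, 0).xyz,
                   texelFetch(depthBuf,  coord, 0).r);
    vec4 c1 = vec4(texelFetch(normalBuf, coord, 1).xyz,
                   texelFetch(depthBuf,  coord, 1).r);

    uint groupBits = 0u;
    for (int s = 0; s < sampleCount; ++s) {
        vec4 v = vec4(texelFetch(normalBuf, coord, s).xyz,
                      texelFetch(depthBuf,  coord, s).r);
        // similarity metric = normal distance + weighted depth distance
        float d0 = distance(v.xyz, c0.xyz) + 16.0 * abs(v.w - c0.w);
        float d1 = distance(v.xyz, c1.xyz) + 16.0 * abs(v.w - c1.w);
        if (d1 < d0)
            groupBits |= 1u << uint(s);
    }
    return groupBits;
}
[/code]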
If you have an L-Buffer, it only needs to be Kx MSAA.
K>3 really doesn't make sense unless you're rendering uber-high-quality hair, grass or other alpha-tested geometry.
The rest is really easy: we use the stencil buffer or similar to output how many shading group IDs we found in a pixel (between 1 and K), and process the different lighting-rate pixels in K separate passes.
You can either re-pack the K>=2 pixels more tightly, with no gaps, into a different buffer so that your compute threads aren't idling (Unreal Engine 4 style), or accept some extra idle work.
Clustered Deferred extends to this naturally, as we can simply say that clusters are also defined by shading group ID as well as depth, and allow a pixel to belong to more than one cluster.
Your L-Buffer resolve pass (with no L-Buffer it's merged into the lighting shader pass) will grab the K lighting values and the N albedo+specular samples (full-res MSAA sample positions are assigned sample shading group IDs) and apply the lighting values to the N samples.
This idea can be taken a whole level up in non-MSAA scenarios for the diffuse lighting calculation (or non-specular surfaces overall), global illumination, rough/fuzzy raytracing etc., where we could process 2x2 or 4x4 groups of pixels in exactly the same manner.
Maybe in the future it will actually pay off to compute the diffuse and specular dynamic lighting separately and at different per-pixel rates, as:
D) Diffuse lighting varies much less across a triangle (low frequency) and has a smaller range (the surface disperses incoming light in all directions), but lots of pixels can be affected within that range, as only back-facing normals are unlit.
S) Specular is very intense and bright, and is high-frequency, requiring a high sample rate. It has a really, really long range (think glossy reflections of far-away street lights on completely diffusely-unlit cars), because the reflected incoming light is concentrated along a very narrow beam. However, we know how wide that beam can be from the roughness material parameter, so we can form a light-culling cone/frustum around the reflected view vector.
M) Materials (especially PBR) are either smooth or rough (modulated per pixel too, by Fresnel for example), indicating the relative weighting and importance of the diffuse and specular components.
Even if not doing diffuse and specular at different rates, we could investigate separate diffuse and specular light lists per tile/cluster (more aggressive specular culling + different distance functions for specular/diffuse).
Deferred Everything BONUS:
The material shading (texturing etc.) can also happen at a variable rate, either by reusing the lighting group IDs as material groups or by oversampling them.
Logically, if you have 7 triangles in one pixel at 16x MSAA, you should run 7 sample invocations to get the correct texture color for each, but faking it with 3 might not produce visible artifacts.
Extra:
What distance light-falloff function to use?
With parameters (1, c, tau) and distance d:
intensity = exp2(-tau * d) / (1 + c * d * d)
This models scattering absorption (fog) and the inverse-square law with no singularity near 0 (the 1+ is correct: intensity is correlated with solid angle, and the term can be thought of as a factor in computing the fraction of the hemisphere around the shaded pixel that the light covers).
Use the same tau for modelling the participating-media scattering of the light volumes.
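As a GLSL helper (c effectively encodes the light's size, which is what removes the singularity at d = 0; tau is the extinction coefficient shared with the fog):
[code]
float lightFalloff(float d, float c, float tau)
{
    // exp2(-tau*d): scattering absorption (fog)
    // 1/(1 + c*d*d): inverse-square law, finite at d = 0
    return exp2(-tau * d) / (1.0 + c * d * d);
}
[/code]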
Re: Clustered Deferred Rendering in OpenGL 4.3
I'm just spam posting a dead thread, but I need a place to put this which isn't a googledoc.
After I clean this up and add diagrams, this might make it into an IrrlichtBAW wiki page entry.
Or a blog, when I finally get around to making one.
Re: Clustered Deferred Rendering in OpenGL 4.3
How to get Barycentric Coordinates for triangles.
AMD: Use AMD_shader_explicit_vertex_parameter
NVidia: Use NV_geometry_shader_passthrough, or revert to Intel method
Intel: Use tessellation shader or geometry shader with fixed output length
Mobile: Use compute shader for GPU triangle filtering + do vertex indexing in software
It might be that older NVidia benefits more from the tessellation shader, and Intel from the geometry shader (their GS is fast).
Really, 4 roundabout ways of computing barycentrics should be benchmarked:
Passthrough GS
Passthrough TS
Vertex software indexing (treat the index buffer as a vertex buffer of 1-component uint16 or uint32 vertices; see the sketch after this list)
Filtered index buffer software indexing
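A minimal sketch of the vertex-software-indexing variant, pulling both the index and the vertex out of SSBOs by gl_VertexID (the bindings and the non-indexed-draw assumption are illustrative):
[code]
#version 430 core
layout(std430, binding = 0) readonly buffer Indices   { uint indices[]; };
layout(std430, binding = 1) readonly buffer Positions { vec4 positions[]; };

uniform mat4 mvp;

flat out uint triangleID; // available downstream without a GS

void main()
{
    uint idx = indices[gl_VertexID];      // the software indexing step
    triangleID = uint(gl_VertexID) / 3u;  // assumes a non-indexed draw of 3N verts
    gl_Position = mvp * positions[idx];
}
[/code]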
Additionally the cost of computing barycentrics in the final deferred compute shader should be benchmarked against each method.
As a side note, do we really need to store dFdx and dFdy of the barycentrics per pixel in the triangle visibility buffer?
Re: Clustered Deferred Rendering in OpenGL 4.3
Could use a tessellation shader with no control shader for constant tessellation (maybe expose glPatchParameterfv) to speed it up a bit.
On actual tessellation for actual geometry amplification (sidenote):
Storing barycentrics would enable using the tessellation shader with the Triangle Visibility Buffer; otherwise we would have to output the tessellated intermediate triangles to an intermediate buffer during the eval shader.
We can definitely calculate dFdx implicitly in a compute shader without neighbours.
Barycentrics might be harder and might actually cost more.
Re: Clustered Deferred Rendering in OpenGL 4.3
Just a little sidenote.
I've recently implemented and tested my idea of single-pass stencil-light-volumes, and it works.
It requires a slight modification of the stencil op: do INCR on front-face stencil pass (everything else KEEP, plus "discard" in the shader); this is because REPLACE can only write the reference value, and because the reversed direction of the Z-test makes it so that the pixels passing all tests are behind objects. As a reminder, the state was:
Depth write disabled
Depth function GREATER [reverse the direction of Z-compare relative to the other objects in the scene]
Frontface: stencil func GL_EQUAL, reference value 0, mask 0x1u; stencil pass op KEEP, stencil fail op KEEP, depth fail op REPLACE with 1 [alternative: ALWAYS, KEEP, KEEP, INCR] + "discard" in the shader
Backface: stencil func GL_EQUAL, reference value 1, mask 0x1u; stencil pass op REPLACE with 0, stencil fail op KEEP, depth fail op REPLACE with 0 [alternative: GL_NOTEQUAL 0, DECR, KEEP, DECR]
The order in which the individual lights get drawn makes no difference.
Then also modify the back-face reference value to be 0, and the stencil op to do DECR on stencil fail (everything else is KEEP, since you need the whole stencil buffer to be 0 everywhere after both faces finish rendering).
IMPORTANT: You need to specify or rotate your bounding volumes so that ALL FRONT-FACING TRIANGLES ARE DRAWN BEFORE ALL BACK-FACING ONES (more accurately, so that the triangles draw in the order in which they face the camera, just like in transparency sorting).
If you don't do this, or don't have watertight volumes, you will get either double-drawn or missing light.
However, the reorientation (rotation) is pretty easy for symmetrical and parametric light volumes such as pyramids, boxes and regular polyhedra.
NOTE ABOUT MSAA: There are no gap-less rasterization guarantees if your geometry is not indexed; for there to be truly no gaps or overlaps (for the tie-breaking rules to apply), your light-bounding geometry needs to be an indexed topology, especially at high resolutions/high MSAA sample counts.
Re: Musings about Deferred Rendering in OpenGL 4.3+
Just realized that algorithms for Order Independent Transparency can help with Light List Construction in Tiled and Clustered Lighting (Deferred or Forward+).
https://github.com/buildaworldnet/Irrli ... issues/216
Re: Musings about Deferred Rendering in OpenGL 4.3+
Got a mention in ConfettiFX's The Forge 1.23 release; it incorporates my single-pass stencil-light-volumes, but for SDF shadows. It's implemented on Vulkan, DX12 and Metal.
https://github.com/ConfettiFX/The-Forge
P.S. The light-shadow playground has a special (probably never-seen-before) single-pass SDF shadow accumulation algorithm based on some early-fragment-tests stencil abuse.