Rendering on zhangdoa

Rendering analysis - Cyberpunk 2077

zhangandhang@gmail.com (Hang Zhang) — Sat, 12 Dec 2020 10:30:00 +0000

(Spoilers-free post!)

Quite a lot of people I know, inside and outside the industry, are playing the long-awaited “gargantuan maelstrom” now. After a few hours immersed in the stunning views (and bugs, of course) of Night City, I was wondering how one frame is rendered in Cyberpunk 2077 (every graphics programmer’s instinct!). So I opened RenderDoc and PIX, and luckily REDengine didn’t behave as unfriendly as some other triple-A titles (yes, Watch Dogs 2, I mean you): I got a frame capture without any problems. (Disclaimer: I’m pretty sure others like Alain Galvan, Anton Schreiner, Adrian Courrèges or even folks from CDPR will later bring us a more detailed and accurate demonstration of how things work. I’m just chilling out here and welcome any discussions and corrections!)

Overview

I’m running the game with Ultra settings at 1080p, with ray tracing and DLSS disabled, for the sake of my years-“old” GTX 1070
Two final results were captured: one around the main character’s apartment in the daytime (let’s name it main street) and another around a river bank at night (nobody dislikes eye candy!):
The colors of some pictures are manually clamped to make them easier to visualize
It’s a DirectX 12-only game on Windows (if development started around 2013 or slightly earlier, I’m curious how much effort it cost to upgrade the engine)
Only 2 descriptor heaps are allocated: one for half a million CBV/SRV/UAVs and another for ~2k samplers

There is one graphics queue, one compute queue, and one copy queue, and the GPU work is balanced quite evenly
ID3D12GraphicsCommandList::CopyBufferRegion is frequently called between passes to copy RT results
PIX threw out quite a lot of recommendations to remove D3D12_RESOURCE_FLAGS::ALLOW_UNORDERED_ACCESS; I’m not sure whether my capture just happened not to use some resources that way, or whether they do need some optimization (in my experience unordered access can hurt quite a bit sometimes)
Frame Timeline: intensive at some early geometry stages, then the deferred work kicks in

Pre-geometry passes

Billboard and GUI

The entire frame starts by preparing the resources for the in-game billboards and screens
The first few copy operations are executed on 3 textures: the destination textures’ format is DXGI_FORMAT_R8_UNORM, the first is 480x272 and the other two are half that size. The source texture should be a mega texture, which is itself the destination of lots of later copy operations. (Question: what’s the purpose of these copies? The night scene doesn’t have these operations; is it related to streaming?)
One of the three very first copied textures:
Next pass they are used as the shader input:
Billboards are drawn to one texture where all mips are tightly packed together
Billboards:
Then the elements for the cyberish-GUI are drawn and also copied to the mega texture
Part of the GUI elements:

Sky visibility/top-view shadow/mini map

The RT of the next pass is a single 2k D16_UNORM texture; around 400 instanced draw calls are executed in the captured scene (not sure how this RT is used later)

Shadowmap from top:

Unknown compute pass 01

There are 2 dispatches over a 64x64x64 R16G16B16A16_FLOAT 3D texture in this pass (looks like voxelized normal/position?). The thread group count is 2x2x64 the first time and 16x16x16 the second time (the RT is used multiple times later, in the terrain pass, ocean wave pass, depth-pre pass, and base material pass, so it is definitely something crucial for geometry information)
Some slices of the 3D Normal:

video: title: "04_3DNormal_Slice": /attachments/04_3DNormal_Slice.mp4

Unknown compute pass 02

3 dispatches made up of 63x1x1, 3x1x1 and then again 63x1x1 thread groups (in the night scene capture it’s 38x1x1, 3x1x1 and 38x1x1); a few ByteAddressBuffers and StructuredBuffers are bound as UAVs (all of them are used later in the base material pass)

Terrain

One 2-slice R8G8B8A8_UNORM 2D texture array and one R16_UINT 2D texture are drawn at 1k resolution (they look like chunks of terrain)
2 draw calls and RTs:

video: title: "05_Terrain": /attachments/05_Terrain.mp4

Unknown color pass 01

2 draw calls issued (only at the main street capture) but nothing is drawn due to the clip (and the meshes are barely comprehensible, looks like they are 2 levels of a LOD-ed mesh), RT is 1 256x256 R16_UINT 2D texture

Ocean wave

The top-view ocean map mask and a generated wave noise texture are bound as input, RT format also changes to R16_UNORM, no tessellation shader stage, some regular grid plate meshes are used to directly draw different chunks
Ocean wave noise in R16G16B16A16_FLOAT and mask in BC7_SRGB:

video: title: "06_OceanMask": /attachments/06_OceanMask.mp4

Geometry passes

Depth-pre pass

The RT is in D32_S8_TYPELESS format, surely it would help to optimize later pixel-heavy works
Depth-pre pass result:

Base material passes

All vertex buffers uploaded to GPU memory are packed as SOA:

// VB0
nshort4 POSITION0;
// VB1
half2 TEXCOORD0;
// VB2
xint NORMAL0;
xint TANGENT0;
// VB3
unbyte4 COLOR0;
half2 TEXCOORD1;
// VB7
float4 INSTANCE_TRANSFORM0;
float4 INSTANCE_TRANSFORM1;
float4 INSTANCE_TRANSFORM2;

In pixel shader sometimes an R16G16B16A16_UNORM 64x64 noise texture is bound as input, only RG channels contain data, similar to what we’d use in SSAO for random normal rotation or offset
All material textures are accessed through the descriptor table
RT0 in R10G10B10A2_UNORM is the albedo, alpha channel is used as object mask, animated meshes are marked as 0x03
RT1 in R10G10B10A2_UNORM is the world-space normal, packed with the scale and offset $0.5x + 0.5$
RT2 in R8G8B8A8_UNORM is metallic-roughness and other attributes, animated meshes are not rendered here, the alpha channel should be the transparent or emissive property for emissive material objects
RT3 in R16G16_FLOAT is Screen-space motion vector:
Depth-stencil is in D32_FLOAT_S8X24_UINT format; stencil value 0x15 is used to mark character body meshes, 0x35 faces, 0x95 hair, and 0xA0 trees, bushes, and other foliage
Drawing order:
1. Static Mesh
2. Ground and emissive objects
3. Animated objects and destructible (?)
4. Foliage
5. Decals
6. Mask for fur
7. Fur

video: title: "How the base material passes are drawn on the main street": /attachments/08_BaseMaterialPass_MainStreet.mp4
video: title: "How the base material passes are drawn near a river at night": /attachments/08_BaseMaterialPass_RiverBank.mp4
video: title: "The MRT": /attachments/08_BaseMaterialPass_RiverBank_MRT.mp4
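The $0.5x + 0.5$ packing noted for RT1 can be sketched offline like this (in Python rather than shader code, so it can be run without a GPU). The 10-bit quantization follows from the R10G10B10A2_UNORM format, but the round-to-nearest step and the renormalization on decode are my assumptions, and both function names are made up for illustration:

```python
MAX_CODE_10BIT = (1 << 10) - 1  # 1023, the largest 10-bit UNORM code


def pack_normal_unorm10(normal):
    """Remap each component from [-1, 1] to [0, 1] and quantize to 10 bits."""
    codes = []
    for component in normal:
        remapped = 0.5 * max(-1.0, min(1.0, component)) + 0.5
        codes.append(int(remapped * MAX_CODE_10BIT + 0.5))  # round to nearest (assumed)
    return tuple(codes)


def unpack_normal_unorm10(codes):
    """Invert the remap, then renormalize to hide the quantization error."""
    vector = [2.0 * (code / MAX_CODE_10BIT) - 1.0 for code in codes]
    length = sum(c * c for c in vector) ** 0.5
    return tuple(c / length for c in vector)


packed = pack_normal_unorm10((0.0, 0.0, 1.0))
restored = unpack_normal_unorm10(packed)
```

A zero component lands on code 512 rather than exactly on the midpoint, which is why the decode step renormalizes.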

DS convert passes

The D32S8_TYPELESS Depth-stencil buffer is converted to R32_TYPELESS in full-screen size and then downsampled to half size as R16_FLOAT and R8_UINT in 3 passes, for easier pixel sampling later.

Motion-stencil pass

All moving objects, foliage and fur are marked with different bits in this pass according to their last-frame screen-space positions, and the marks are then extended by a few pixels (exactly the same solution as in “Temporal Antialiasing in Uncharted 4”: better anti-ghosting!)
Motion-stencil pass:

Mask and LUT passes

Reflection mask

Parts of some objects are drawn onto an R8_UNORM RT and used later in SSR and TAA passes

Compute pass to noise normal

The RT is used for sun shadow, AO(?), direct sunlight, SSR and indirect skylight passes

Noise sample:

Some passes to mask the sky out

All of them are in quite low resolution and mipmapped; the mipmap calculation executes multiple times

video: title: "13_SkyMaskMipmap": /attachments/13_SkyMaskMipmap.mp4

And finally, there is a graphic pass to draw a quite accurate sky mask on R32_FLOAT:

(HB)AO (?)

All in 960x540, looks like some sort of AO-required normal data:

video: title: "14_AO": /attachments/14_AO.mp4

Color-grading LUT

Color-grading LUT:

video: title: "15_ColorGradingLUT": /attachments/15_ColorGradingLUT.mp4

Ocean wave noise

Around 22 compute dispatches of 32x32x1 generate the noise for water, which is used in the earlier ocean wave pass

Light and shadow passes

CSM passes for direct light

A 2k R16_TYPELESS texture is used first to render a few simple meshes (some kinds of mesh proxy?), but the RT is never used
Then 4 classic CSM cascades are rendered and stored as 4 slices in a 4k R16_TYPELESS 2D texture array, no geometry shader RT index is involved, it just executes draw calls 4 times on each cascade

video: title: "16_CSM": /attachments/16_CSM.mp4

Omni shadow maps for point/area lights

Every omni shadow map is first rendered as R32_TYPELESS in 1k, then converted to a 10-slice R16G16_UNORM texture (should be VSM?)

Cloud distribution

Sky cubemaps

7 mipmap levels in R11G11B10_FLOAT format:

video: title: "18_SkyCubemap": /attachments/18_SkyCubemap.mp4

And a stereographic projected sky radiance is also generated:

Coat mask (?)

Clustered light index (?)

Some compute-only passes generate a few 3D textures that contain only indices:

video: title: "20_ClusteredIndices": /attachments/20_ClusteredIndices.mp4

Direct light shadow

At full-screen resolution; the light comes from the sun during the day and from the moon at night

Local lights mask (?)

At 120x68 and 60x34 resolutions

Unconfirmed compute dispatches to update some StructuredBuffers

Environment Radiance capture

First, the cubemap version is generated:

video: title: "23_EnvironmentCaptureCubemap": /attachments/23_EnvironmentCaptureCubemap.mp4

Then all of them are converted to 2D dual-paraboloid textures in 512x256, each one containing 6 mips. 32 captures are stored as slices in one 2D texture array, and there are around 6 texture arrays in the current main street scene:

video: title: "23_EnvironmentDualParaboloid": /attachments/23_EnvironmentDualParaboloid.mp4
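For readers who haven’t met dual-paraboloid maps before, here is a minimal sketch of how a direction could be mapped into such a 512x256 texture. The capture only tells me the texture size; the side-by-side layout of the two halves, the choice of the z axis and the function name are pure assumptions for illustration, not something read out of the game’s shaders:

```python
def dual_paraboloid_texel(direction, width=512, height=256):
    """Map a direction to (u, v) texel coordinates of a dual-paraboloid map.

    Assumed layout: front paraboloid (+z) in the left half,
    back paraboloid (-z) in the right half.
    """
    x, y, z = direction
    length = (x * x + y * y + z * z) ** 0.5
    x, y, z = x / length, y / length, z / length

    if z >= 0.0:
        s, t, half_index = x / (1.0 + z), y / (1.0 + z), 0
    else:
        s, t, half_index = x / (1.0 - z), y / (1.0 - z), 1

    half_width = width / 2
    u = (0.5 * s + 0.5) * half_width + half_index * half_width
    v = (0.5 * t + 0.5) * height
    return u, v


front_center = dual_paraboloid_texel((0.0, 0.0, 1.0))
back_center = dual_paraboloid_texel((0.0, 0.0, -1.0))
```

Each hemisphere ends up in one square half, which matches the 2:1 aspect ratio of the textures seen in the capture.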

64x64x64 texture that is R32G32_UINT format, 2 channels contain index-like data
Slice 19 for example

Landscape maps

A few layers drawn here, from normal to a kind of heatmap in 1k resolution

Local volumetric fog

2 mips from 240x136 to 120x68, each one has 128 slices and mip 0 is from main camera view, it should be some kind of ray-marching:

video: title: "26_LocalVolumetricFog": /attachments/26_LocalVolumetricFog.mp4

Head mask

The most hilarious one, should be used later to render detailed facial expressions

Direct and emissive lights

Tiled lighting: each tile is 16x16 in size, and everything is executed as indirect compute commands:

video: title: "28_DirectAndEmissiveLight": /attachments/28_DirectAndEmissiveLight.mp4
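As a quick bit of arithmetic on the tile size (a sketch only, with a made-up helper name; how the engine really rounds partial tiles is an assumption), the tile grid for a 1080p frame can be computed like this:

```python
def tile_grid(width, height, tile_size=16):
    """Number of tiles needed to cover the frame, rounding partial tiles up."""
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    return tiles_x, tiles_y


tiles_16 = tile_grid(1920, 1080, 16)  # (120, 68)
tiles_32 = tile_grid(1920, 1080, 32)  # (60, 34)
```

Interestingly, 120x68 and 60x34 are the same numbers that show up as RT sizes in some of the low-resolution passes above; whether those passes are really tied to the lighting tiles is only my guess.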

Sky

Only masked sky region is drawn

SSR

A low resolution depth mask is drawn first:
Then the full-screen SSR is drawn in R11G11B10_FLOAT, with another R8_UNORM RT for reflectivity. The previously noised world-space normal and the motion stencil RT are used as inputs, and the last frame’s downscaled TAA output is also used for sampling:

video: title: "31_SSR": /attachments/31_SSR.mp4

Indirect light

Could see some artifacts (Light leaking because of VCT?):

All lights composition pass

2 RTs, specular reflections are stored on RT1:

Transparent LUT

Compute shader is involved a few times to generate some LUTs for later transparent and holographic objects:

Skin and eyes

Skin irradiance is rendered in screen space:
Then eyes are rendered:

video: title: "36_Eyes": /attachments/36_Eyes.mp4

AO(?)/GI shadow

Additional shadows are added to pixels that should not be lit too much:

video: title: "37_AO": /attachments/37_AO.mp4

Volumetric fog

Just a simple image composition pass:

video: title: "38_VolumetricFog": /attachments/38_VolumetricFog.mp4

Water reflection

A few passes generate the water reflection, including all related masks and noises:

video: title: "39_Water": /attachments/39_Water.mp4

Post-processing passes

TAA

Next is a full-screen, no-surprise TAA pass:

video: title: "40_TAA": /attachments/40_TAA.mp4

Blur

First, the TAA-ed result is downscaled to 1/2, 1/4, and 1/8 size in 1 compute pass and stored in 3 mips, then each of the mips is blurred horizontally and vertically

video: title: "41_Blur": /attachments/41_Blur.mp4

Unknown transparent

Cloud

The distribution noise of the cloud is pre-generated:

video: title: "42_Cloud": /attachments/42_Cloud.mp4

Holo and transparent objects

All the futuristic holographic objects are drawn next (a significant number of meshes with small triangle counts), including the water surface and some local light scattering:

video: title: "43_Transparent": /attachments/43_Transparent.mp4

Post-TAA

2 downscaled images are generated again in 1/2 and 1/4 sizes with all transparent objects on them, and finally, it’s converted back to full-screen size and gets sharpened

video: title: "44_PostTAA": /attachments/44_PostTAA.mp4

HDR LUT(?)

A 256x256 R16_FLOAT, a 256x64 R16G16B16A16_FLOAT and a 256x1 R16G16B16A16_FLOAT images are generated by 3 compute dispatches, and finally are used as part of the inputs for the next compute pass to generate a structured buffer, which is used widely when rendering sky cubemaps and reflection probes

Bloom

The full-screen images are half-sized 6 times and get bloomed

video: title: "46_Bloom": /attachments/46_Bloom.mp4

Camera lens effects

Rendered on a half-screen size RT

Color grading

The color-graded image is then converted to LDR

Gamma correction

The Gamma correction is executed in a weird resolution 456x256

GUI elements

All GUI elements are drawn on a full HD RT and the corresponding mipmaps are generated, and then a few compute passes add bloom to them

Film Grain

The noise LUT is generated on-the-fly

Swap chain image

Finally, the result of all of the above is presented to the screen

Undrawnable conclusions

Well, what else could I say? Two decades of the 21st century have already passed, and Cyberpunk 2077 is basically a summary of the modern game rendering techniques that have progressed over those years. Even without ray tracing enabled, the overall appearance of the graphics is magnificent, and it just bursts my euphoria when I’m pacing around Night City. Despite all the controversies over the release and the gameplay glitches, which are mainly caused by a not-robust-enough physics system and an overestimated streaming implementation, the game’s rendering is quite well-crafted. Still, I expected to see more state-of-the-art architectural practices, such as a heavier GPU-driven rendering pipeline (which has become more and more popular since 2016-2017). What’s your opinion about it? And what would change if we came back 5 years later for a review? Maybe we’d laugh at the “outdated” techniques in Cyberpunk 2077, but I’m sure we’ll still admit that real-time rendering is always full of exciting directions to explore!

Physically Based Rendering - Lighting

zhangandhang@gmail.com (Hang Zhang) — Sat, 07 Dec 2019 18:25:00 +0000

Once we have modeled the surface’s physical properties, covering a certain range of real-life materials, we need to emit light onto them in order to finally get the outgoing radiance from the surface. If you take a look back at the rendering equation, the outgoing radiance is just an integral of the incoming radiance over the hemisphere around the normal. This gives us a fundamental assumption: with a brute-force algorithm such as path tracing we could get a numerical solution to the problem. We could just model the geometric shape of the light source, then emit light rays in all possible directions and evaluate whether they hit a surface, until the overall energy diminishes below a certain threshold. But in real-time rendering, the computational resources we currently have at hand are not sufficient for an algorithm of such cost. We need to find an analytical replacement for the integral, or at least some cheaper numerical integration.

Remapping of reality

Before we dive into the details of the analytical representation of different light shapes, I’d like to introduce another unit system for measuring energy and related quantities. If you remember what I introduced in a previous post about radiometry, the two are closely related. This one is photometry: basically a weighted version of radiometry that respects the sensitivity of the human eye. When we deal with light in real life, the typical light sources are bulbs, candles, flashlights, and our old friend the sun, and we perceive the electromagnetic waves from them through the cells on our retina. And because of the non-linearity of human nature (of course!), our eyes have different sensitivities to different light wavelengths, so people eventually invented the photometric quantities to better measure the light we actually “see”.

Here are the common photometric quantities we would encounter in real-time rendering:

Quantity	Symbol	Unit	Unit symbol	Notes
Luminous energy	${Q_v}$	lumen second	${lm·s}$	Energy of electromagnetic radiation with respect to the human eye’s sensitivity
Luminous flux	${\Phi_v}$	lumen	${lm}$	Also called luminous power
Luminous intensity	${I_v}$	candela	${\frac{lm}{sr}}$ or ${cd}$	The steradian is the 3D counterpart of the angle in 2D space
Illuminance (flux density)	${M_v}$ or ${E_v}$	lux	${\frac{lm}{m^2}}$ or ${lx}$	Il-luminance, means the luminance received by a surface
Luminance	${L_v}$	nit	${\frac{lm}{m^2·sr}}$ or ${\frac{cd}{m^2}}$ or ${nt}$	

As you can see, it’s basically another version of quantity measurement for electromagnetic radiation, but with more emphasis on the visible light range. Because the two types of retinal cells, rods and cones, respond quite differently to different intensities and wavelengths of light, the actual mapping from radiometry to photometry has to be done in two different versions. The one related to the cone cells typically covers a luminance range from ${10 nt}$ to ${10^8 nt}$ and is called photopic vision in biology; the other one, related to the rod cells, is called scotopic vision and covers a luminance range from ${10^{-6} nt}$ to ${10^{-3} nt}$. The cone cells perceive the colors of the world, while the rod cells perceive the silhouettes of objects. But you may wonder, how much is one $nt$? The SI defines ${cd}$ as the fundamental unit of light measurement, and the other units are derived from it. By the standard definition, one candela means “a $540×10^{12}Hz$ monochromatic light source emitting $1/683$ watt per steradian”, and we get one nit by adding “per square meter” to it. The reason for choosing such a reference frequency is that the photopic sensitivity of the eye peaks close to it, and if you take a look at the visible light spectrum you’ll see it’s a greenish color. The magic number 1/683 comes from a historical compatibility requirement, to transition from the old unit system to the modern one.

“…Radiometry deals purely with physical quantities, without taking account of human perception. A related field, photometry, is like radiometry, except that it weights everything by the sensitivity of the human eye. The results of radiometric computations are converted to photometric units by multiplying by the CIE photometric curve, bell-shaped curve centered around 555 nm that represents the eye’s response to various wavelengths of light…” - pg. 271, “Real-Time Rendering”, 4th Edition.

So for a certain wavelength $\lambda$, the mapping from radiometry to photometry is expressed as ${I_v(\lambda)}=683.002\overline{y}(\lambda)I_e(\lambda)$, or we could integrate over the entire visible wavelength range to get ${I_v}=683.002\int_{380}^{780}\overline{y}(\lambda)I_e(\lambda)d\lambda$. The $\overline{y}(\lambda)$ here is the luminosity function; it’s a bell-shaped function that differs between photopic and scotopic vision. The photopic luminosity function reaches 1 at 555 nm, or 540 THz (the defining frequency of the candela), and the scotopic luminosity function reaches 1 at around 510 nm. They describe the eye’s relative response to different wavelengths. We can furthermore derive two quantities, called luminous efficacy and luminous efficiency, from the mapping. Luminous efficacy $\eta$ is defined as ${\eta}=683.002\frac{\int_{380}^{780}\overline{y}(\lambda)I_e(\lambda)d\lambda}{\int_{380}^{780}I_e(\lambda)d\lambda}$ with the unit ${\frac{lm}{W}}$, and it represents the light source’s ability to emit visible light. If we normalize the luminous efficacy by multiplying it with ${\frac{1W}{683lm}}$, we get the luminous efficiency, which is simply $\frac{\int_{380}^{780}\overline{y}(\lambda)I_e(\lambda)d\lambda}{\int_{380}^{780}I_e(\lambda)d\lambda}$. With their help we can easily evaluate whether a light source is wasting or saving energy, or directly convert radiometric quantities to their photometric versions. For example, an ideal monochromatic 540 THz light source would have a luminous efficacy of 683 ${\frac{lm}{W}}$, or 100% luminous efficiency.
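To make the two ratios above a bit more concrete, here is a small numerical sketch in Python. It integrates a tabulated spectrum with the trapezoid rule. Please note that `toy_y_bar` is a hypothetical Gaussian stand-in that I made up so the example is self-contained; it is NOT the CIE luminosity table, and the real tabulated $\overline{y}(\lambda)$ should be plugged in for any serious use:

```python
import math

K_M = 683.002  # lm/W, the constant used in the formulas above


def trapezoid(xs, ys):
    """Trapezoid-rule integral of the samples ys over the positions xs."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) * 0.5
               for i in range(len(xs) - 1))


def luminous_efficacy(wavelengths_nm, spectral_power, y_bar):
    """683.002 * (integral of y_bar * I_e) / (integral of I_e), in lm/W."""
    weighted = [y_bar(w) * p for w, p in zip(wavelengths_nm, spectral_power)]
    return K_M * trapezoid(wavelengths_nm, weighted) / trapezoid(wavelengths_nm, spectral_power)


def luminous_efficiency(wavelengths_nm, spectral_power, y_bar):
    """Efficacy normalized by 683.002 lm/W, a value between 0 and 1."""
    return luminous_efficacy(wavelengths_nm, spectral_power, y_bar) / K_M


def toy_y_bar(wavelength_nm):
    # Hypothetical stand-in: a Gaussian peaking at 555 nm, NOT the CIE data.
    return math.exp(-0.5 * ((wavelength_nm - 555.0) / 45.0) ** 2)


wavelengths = list(range(380, 781))
spike_at_555 = [1.0 if w == 555 else 0.0 for w in wavelengths]  # near-monochromatic source
flat_spectrum = [1.0] * len(wavelengths)                          # equal power everywhere

efficacy_spike = luminous_efficacy(wavelengths, spike_at_555, toy_y_bar)
efficiency_flat = luminous_efficiency(wavelengths, flat_spectrum, toy_y_bar)
```

The spike at 555 nm reaches the full 683.002 lm/W because the weight there is 1, while the flat spectrum wastes most of its power on wavelengths where the weight is small.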

Since all we’ve been talking about so far is real-time rendering, we use the discrete RGB space instead of a continuous spectrum, so we can simplify the continuous luminosity function to a discrete version. We could then give the user a few parameters, like the radiometric quantities and the luminous efficiency, to adjust the light source. But when we build a physically-based rendering pipeline we often look for references in real life, and almost all light sources, like the bulbs sold in supermarkets, have their photometric quantities printed on the package, so it is more convenient for the end users if the light system uses photometric quantities.

Integration matters

The more crucial part of a light system is how to model the geometric appearance of the light source. In a non-physically-based rendering pipeline there are usually 3 types of analytical light: directional light, point light, and spot light. I won’t discuss any of them here, because none of them is modeled around the actual light source geometry but rather around the appearance of the lighting result; another reason is that there is tons of information online about how to implement them, and I’d better not repeat it. I’ll refer to them as punctual light sources from now on, since their actual shape is infinitesimal. Physically correct light sources all have volume, or, if we ignore one of the 3 dimensions with respect to the nature of projection, they at least have area. I’ll refer to them as area light sources from now on. The typical analytical area light sources have shapes like sphere, disk, tube and rectangle, or a non-trivial shape that requires some more advanced approximations.

If we take one more look at the reflectance equation, let’s write a slightly different version, $L_o=\int\limits_{\Omega+} f(v,l) V(l) L_i \langle n·l \rangle dl$, where $V(l)$ is a Heaviside-like binary function that indicates whether the light source is visible along a certain direction; it describes the shape and the shadow of the light. As long as the shape of the light source is non-trivial, it’s impossible to find a general analytical solution for such a hemispherical integral. The punctual light sources all assume the shape is infinitesimal, so the integral is simplified to an integral of the cosine over the hemisphere and we finally get $L_o=\pi f(v,l) \langle n·l \rangle$. That’s the reason why we can write Lo = rho * NDotL when using the simple Lambert diffuse BRDF $f(v,l) = \frac{\rho}{\pi}$: the $\pi$ is just perfectly canceled. For area lights we cannot have such one-line-to-rule-them-all enjoyment, but there are some practical solutions so far, each of which solves the problem to an acceptable level. I’ll introduce them one by one below.
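The $\pi$ that gets canceled here is just the integral of the cosine over the hemisphere, $\int_0^{2\pi}\int_0^{\pi/2}\cos\theta\sin\theta\,d\theta\,d\phi = \pi$. A minimal Python sketch to check it numerically (midpoint rule; the function names are made up for this example, and the light is assumed to have unit intensity as in the formula above):

```python
import math


def integrate_cosine_over_hemisphere(steps=1024):
    """Midpoint-rule estimate of the integral of cos(theta) over the hemisphere."""
    d_theta = (math.pi / 2.0) / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        # sin(theta) comes from the solid angle element d_omega = sin(theta) d_theta d_phi
        total += math.cos(theta) * math.sin(theta) * d_theta
    return 2.0 * math.pi * total  # the integrand does not depend on phi


def lambert_punctual(rho, n_dot_l):
    """L_o = pi * f * <n.l> with the Lambert BRDF f = rho / pi."""
    brdf = rho / math.pi
    return math.pi * brdf * max(n_dot_l, 0.0)


hemisphere_integral = integrate_cosine_over_hemisphere()
radiance = lambert_punctual(0.8, 0.5)
```

`hemisphere_integral` comes out very close to `math.pi`, and `lambert_punctual` collapses to `rho * NDotL` as stated in the text.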

Form Factor

The first approach is to still try to solve the integral directly. We can’t solve it numerically at runtime, but if we convert it to an integral over the light source’s area instead, it brings some certainty, because we know the shape and the actual area of the light source when shading the object. The rewritten version of the reflectance equation is $L_o=\int\limits_{A} f(v,l) L_i \langle n·l \rangle \frac{\langle n_a·-l \rangle}{r^2} dA$, where $n_a$ is the normal of $dA$ and $r$ is the distance from the shading point to $dA$. The factor $\frac{\langle n_a·-l \rangle}{r^2}$ is introduced because the solid angle that $dA$ covers along $dl$ is proportional to the cosine between the normal of $dA$ and the direction back to the shading point, and inversely proportional to the squared distance from the shading point to $dA$.

Then we could move the BSDF out of the integral by assuming it doesn’t vary much with the light direction, as in the Lambert diffuse BRDF case. What we need to solve then is $L_o = f(v,l) \int\limits_{\Omega+} L_i \langle n·l \rangle dl$, or $L_o = f(v,l) E(n)$, where we treat all the incoming luminance as illuminance. Analytic methods for integrating illuminance have already been developed in problem domains of energy transfer, such as radiative transfer or heat transfer, since fundamentally these are all the same type of problem: how to calculate the energy transferred from one object to another, or, specifically in our case, from one surface to another.

“…The view factor from a general surface A1 to another general surface A2 is given by: $\displaystyle F_{1\rightarrow 2}={\frac {1}{A_{1}}}\int_{A_{1}}\int_{A_{2}}{\frac {\cos \theta_{1}\cos \theta_{2}}{\pi s^{2}}}\,dA_{2}\,dA_{1}$…” [Wiki1]

The actual analytic formulae are more complicated but well documented in [Mar14] and [HSM10]; I’ll only give pointers for the commonly used geometric shapes below:

Sphere: “Patch to a sphere - Tilted” on pg. 10 of [Mar14], and B-43 in [HSM10]
Disk: “Parallel configurations - Patch to disc” on pg. 22 of [Mar14], and B-13 in [HSM10]
Rectangle: “Parallel configurations - Patch to rectangular plate” on pg. 22 and “Perpendicular configurations - Patch to rectangular plate” on pg. 26 of [Mar14], and B-5 in [HSM10]

The solutions in [HSM10] cover more general cases than [Mar14]. The runtime cost of a naive implementation is not really acceptable for real products, but they give us exact analytic forms: we could use them wisely in products with some optimizations and simplifications, or just keep them as a ground truth that is faster than classic path tracing.

A code example for a sphere light, modified from [LR14]:

	for (int i = 0; i < NR_SPHERE_LIGHTS; ++i)
	{
		// The light radius is read from the w component of luminousFlux
		float lightRadius = sphereLightCBuffer.data[i].luminousFlux.w;
		if (lightRadius > 0.0)
		{
			vec3 unnormalizedL = sphereLightCBuffer.data[i].position.xyz - posWS;
			vec3 L = normalize(unnormalizedL);
			vec3 H = normalize(V + L);

			float LdotH = max(dot(L, H), 0.0);
			float NdotH = max(dot(N, H), 0.0);
			float NdotL = max(dot(N, L), 0.0);

			float sqrDist = dot(unnormalizedL, unnormalizedL);

			// Tilted patch-to-sphere form factor
			float Beta = acos(NdotL);       // angle between N and the direction to the light center
			float H2 = sqrt(sqrDist);       // distance to the light center
			float h = H2 / lightRadius;     // the same distance in units of the light radius
			float x = sqrt(max(h * h - 1.0, eps));
			float y = -x * (1.0 / tan(Beta));
			float illuminance = 0.0;

			if (h * cos(Beta) > 1.0)
			{
				// The whole sphere is above the horizon of the shading point
				illuminance = cos(Beta) / (h * h);
			}
			else
			{
				// The sphere is partially cut by the horizon
				illuminance = (1.0 / max(PI * h * h, eps))
					* (cos(Beta) * acos(y) - x * sin(Beta) * sqrt(max(1.0 - y * y, eps)))
					+ (1.0 / PI) * atan((sin(Beta) * sqrt(max(1.0 - y * y, eps)) / x));
			}
			// The shading code that consumes illuminance is not shown in this excerpt
		}
	}
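If you want to sanity-check the tilted sphere formula offline, a Python port like the following can help. It is only a sketch under a few assumptions of mine: unlike the shader snippet, the cosine is not clamped to zero so that the below-horizon case can be explored, `y` is clamped to [-1, 1] instead, and the function name and parameters are made up for this example:

```python
import math


def sphere_form_factor(cos_beta, h):
    """Tilted patch-to-sphere form factor.

    cos_beta: cosine between the surface normal and the direction to the sphere center.
    h: distance to the sphere center divided by the sphere radius (must be > 1).
    """
    beta = math.acos(max(-1.0, min(1.0, cos_beta)))
    if h * math.cos(beta) > 1.0:
        # The whole sphere is above the horizon
        return math.cos(beta) / (h * h)

    sin_beta = math.sin(beta)
    if sin_beta < 1e-12:
        return 0.0  # the sphere is straight below the surface

    x = math.sqrt(h * h - 1.0)
    y = max(-1.0, min(1.0, -x * math.cos(beta) / sin_beta))
    root = math.sqrt(1.0 - y * y)
    return ((math.cos(beta) * math.acos(y) - x * sin_beta * root) / (math.pi * h * h)
            + math.atan(sin_beta * root / x) / math.pi)


facing = sphere_form_factor(1.0, 2.0)        # sphere straight above the surface
on_horizon = sphere_form_factor(0.0, 2.0)    # sphere center exactly on the horizon
fully_hidden = sphere_form_factor(-0.9, 2.0) # sphere entirely below the horizon
```

The two branches meet at `h * cos(beta) = 1`, the horizon case agrees with the half-visible cap value $(\sigma - \sin\sigma\cos\sigma)/\pi$ with $\sin\sigma = 1/h$, and the result drops to zero once the sphere is completely below the horizon.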

As you can see, the form factor approach is suitable for our low-frequency signal, the diffuse part of the BSDF, since it is basically modeled around the phenomenon of energy bouncing between diffuse surfaces. But we still need to find a solution for the specular part, and if you want to use a more advanced diffuse BRDF than the Lambert model, one that involves more of the complex subsurface scattering phenomenon, then the form factor solution is not accurate enough.

Representative Point

Another approach is to not solve the integral at all; instead, we substitute the problem with our old method and use a punctual light to approximate the area light. If we still focus on the lighting result, an area light is similar to a bunch of point lights. We could then find the point on the surface of the light source that contributes the most to the lighting result, and use it as the location of a point light to continue our shading process. This method is often called Representative Point or Most Representative Point (abbr. MRP), and it has been successfully used in lots of PBR pipelines like [Kar13] and [LR14].

“…These approaches resemble the idea of importance sampling in Monte Carlo integral, where we numerically compute the value of a definite integral by averaging samples over the integral domain. In order to do so more efficiently, we can try to prioritize samples that have a large contribution to the overall average….” - pg. 385, “Real-Time Rendering”, 4th Edition.

The MRP method can be applied to both the diffuse and the specular part of the BSDF, depending on the actual requirements of the product. One general algorithm was developed in [Dro14]: it uses the halfway vector method to find the MRP for diffuse and a cone projection to find the MRP for specular. For the diffuse part, the halfway vector is expressed as $\vec{h} = \frac{\vec{pp_0} + \vec{pp_1}}{||\vec{pp_0} + \vec{pp_1}||}$, where $p$ is the shading point, $p_0$ is the intersection point of a ray from $p$ along the reflection vector $\vec{r}$ of the view direction, and $p_1$ is the intersection point of a ray from $p$ along the negative direction of the light plane normal $\vec{n’}$. While this description may not sound complex, in a real scenario with an actual light shape we often need to move the MRP found by the halfway vector method onto the surface of the light source, since it won’t always fall inside the light area. In other words, if the diffuse MRP is outside the light area, we need to find the closest point to it inside the light area.

For the specular part, the calculation becomes more complicated with respect to the probability distribution function (abbr. PDF) of the BRDF: we need to generate a cone along the reflection direction based on the PDF and use the geometric center of the intersection area between the cone and the light surface as the MRP. Furthermore, we could pre-compute lookup tables for the specific combination of BSDF and light texture we chose; by eliminating some variables we could limit the LUTs to 3D tables and free ourselves from runtime calculation. After we use the MRP to calculate the illuminance, we should weight it by the intersection area to preserve energy conservation. All the details are well demonstrated in [Dro14], which I’d recommend reading if it’s available to you.

Another way to calculate the specular MRP is to find the point on the light surface with the smallest distance to the reflection ray of the view direction, as demonstrated in [Kar13]. For example, in the sphere light case we could calculate $p_{cr} = (l · r)r − l$ and then $p_{cs} = l + p_{cr} · min(1,\frac{radius}{||p_{cr}||})$, where $l$ is the vector from the shading point to the sphere center, $p_{cr}$ is the vector from the sphere center to the point on the reflection ray closest to it, and $p_{cs}$ is the point of the sphere closest to the reflection ray.
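The two formulas for the sphere light can be tried out with a few lines of Python. This is a minimal sketch with made-up helper names; positions are expressed relative to the shading point, and the guard for a reflection ray that passes exactly through the sphere center is my own addition, since the formula would otherwise divide by zero outside of a shader:

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def scale(v, s):
    return tuple(c * s for c in v)


def add(a, b):
    return tuple(p + q for p, q in zip(a, b))


def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))


def sphere_specular_mrp(l, r, radius):
    """l: vector from the shading point to the sphere center, r: unit reflection vector."""
    center_to_ray = sub(scale(r, dot(l, r)), l)       # p_cr = (l . r) r - l
    distance = math.sqrt(dot(center_to_ray, center_to_ray))
    if distance == 0.0:
        return l                                       # the ray goes through the center
    # p_cs = l + p_cr * min(1, radius / ||p_cr||)
    return add(l, scale(center_to_ray, min(1.0, radius / distance)))


s = 1.0 / math.sqrt(2.0)
light_center = (0.0, 0.0, 5.0)
miss = sphere_specular_mrp(light_center, (s, 0.0, s), 1.0)   # ray passes beside the sphere
hit = sphere_specular_mrp(light_center, (s, 0.0, s), 10.0)   # ray passes through the sphere
```

When the ray misses the sphere, the result sits on the sphere surface; when the ray goes through the sphere, the result is simply the point on the ray closest to the center.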

Linearly Transformed Cosines

The LTC approach tries to solve the integral from another point of view, or more precisely, in another space. The reflectance equation involves several variables, and we have to evaluate the BSDF for each direction of the incoming light, so the complexity is $O(n^2)$ due to the variation of both $dl$ and the BSDF. But if we could make one of them some sort of constant, the complexity would drop to $O(n)$. Thanks to the popularity of parametric BSDFs nowadays, we can pre-compute lots of them into a multi-dimensional lookup table and just sample from it at runtime to save our precious frame time. Unfortunately, when we implement such a solution and try to fit it into the rendering pipeline, we still need to integrate the illuminance over a hemispherical region, with an uncertain shape and orientation of the light surface. But if we consider the solving procedure of the reflectance equation from a linear algebra point of view, the integral of the incoming light in a “world” spherical space could be solved as the integral of the transformed incoming light in a “local” spherical space. If we can choose a suitable transformation that ensures linearity, the above hypothesis becomes a provable truth. And if this transformation can eliminate the shape factor or the orientation factor of the light, we are one step closer to our expected $O(n)$. Luckily the BSDF is naturally a spherical distribution function, so the only problem left is how to find the transformation for a BSDF and a closed form of the integral in that BSDF space; after all, if the reflectance equation doesn’t become easier to solve, all of this work would be pointless.

The most commonly used BSDF distribution today is GGX within the Cook-Torrance microfacet model. The isotropic version depends only on the view direction and the surface roughness $\alpha$, and even if we add anisotropy later on, the number of variables is still only 3. But even with 3 variables the integral is still too expensive to calculate at runtime, so we have to continue eliminating variables and transforming the light source until we finally find a space cheap enough to evaluate the integral in real time. That’s where LTC comes from: we use a clamped cosine spherical distribution function as the base space since the integral over it is both cheap and analytic, then find a transformation matrix that approximately transforms it into a specific BSDF, then use the inverse of that matrix to transform the vertices of the light source into the base space, and finally calculate the integral by utilising the analytic form of the line integral along the polygon edges.

The theoretical details are well demonstrated in [Hei16] and the reference C++ source code is available on GitHub, but without much explanation of what it is doing. Actually, if you have ever implemented any BSDF LUT, the approach is similar: we generate the values for all possible parameter combinations ahead of runtime and store them in an easy-to-fetch data structure. What we want here is a table of transformation matrices with which we could transform between our specific BSDF distribution and the clamped cosine one in both directions. Loosely speaking, we need to find the $M$ in $D_{BSDF}(\omega) = M * D_o(\omega)$ (in [Hei16] the matrix is applied to the direction vectors, $\omega = \frac{M\omega_o}{||M\omega_o||}$, and the distribution is reshaped accordingly), where $D_o$ is the original clamped cosine distribution function we chose, with the form $D_o(\omega_o = (x, y, z)) = \frac{1}{\pi}max(0, z)$, where $\omega_o$ is the normalized direction vector. Since we work with 3D direction vectors as usual, $M$ is a 3x3 matrix.

The next step is to calculate the $M$ for our specific BSDF, for example, what [Hei16] did for the GGX distribution. They use the Nelder–Mead method to numerically search for $M$, with a cost/error function implemented by simply comparing the actual BSDF value and the LTC fitted value. And because the paper only fits the isotropic version of GGX, the final $M$ has only 5 non-zero elements, all lying on its two diagonals:

$$M = \begin{bmatrix} a & 0 & b\\ 0 & c & 0\\ d & 0 & e \end{bmatrix}$$

And after normalizing by one of the elements there are only 4 elements left, which could be stored in a 4-channel texture and linearly sampled at runtime. In the paper, they use $e$ as the normalizer, but later in [Hill16] they admit that it brings some serious numerical errors, and I found in the repo’s history that they later switched to $c$, which was presumably a deliberate decision. [Hill16] also covers multiple problems that arise when implementing LTC solutions, from how to clip polygons efficiently to how to keep the inverse trigonometric functions robust. It is highly recommended reading.
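To make the normalization step concrete, here is a small sketch of how the 5 non-zero elements could be reduced to 4 texture channels and restored again. It assumes normalization by the middle element $c$; the names `PackLtcMatrix` and `UnpackLtcMatrix` and the channel order are hypothetical and not taken from the reference repo. It works because scaling $M$ by a constant doesn’t change the normalized direction $\frac{M\omega_o}{||M\omega_o||}$.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<float, 3>, 3>;

// Divide the sparse LTC matrix by its middle element c and keep the
// remaining 4 values (hypothetical channel order: a, b, d, e).
std::array<float, 4> PackLtcMatrix(const Mat3& m)
{
    const float c = m[1][1];
    std::array<float, 4> texel = {m[0][0] / c, m[0][2] / c, m[2][0] / c, m[2][2] / c};
    return texel;
}

// Rebuild the normalized matrix from the 4 stored values; c becomes 1.
Mat3 UnpackLtcMatrix(const std::array<float, 4>& texel)
{
    Mat3 m{}; // all elements start at zero
    m[0][0] = texel[0];
    m[0][2] = texel[1];
    m[1][1] = 1.0f;
    m[2][0] = texel[2];
    m[2][2] = texel[3];
    return m;
}
```

The unpacked matrix equals the original one up to the constant factor $c$, which is all the shading code needs.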

After you get the LUT for $M$, you could just use it at runtime for shading. You only need to write some shader code to transform the light vertices into LTC space and evaluate a spherical line integral $\frac{1}{2\pi}\sum_{i=1}^{n}acos(v_i \cdot v_j)\frac{v_i \times v_j}{||v_i \times v_j||} \cdot n$, where $v_i$ are the normalized vertices of the transformed polygon, $v_j$ is the vertex following $v_i$ (wrapping around at the end), and $n$ is the surface normal. This is the natural solution for polygons like rectangular lights, but it is not practical for light shapes such as a single line, capsule, sphere or disk, since you can’t find a vertex count that is small but still smooth enough for curved shapes. Luckily they later provided the corresponding integral forms for other shapes in [Hei17], where the final appearance and performance are quite acceptable for real products in contrast with other solutions. Again, I’d recommend looking up the details in their publications.
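The edge integral above could be sketched on the CPU like this. It is a simplified illustration rather than the reference implementation: the `Vec3` type and the function name are made up, the vertices are assumed to be already transformed into the clamped cosine space with the normal pointing along +Z, and the polygon is assumed to lie entirely above the horizon, so the clipping discussed in [Hill16] is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal vector type, only for this sketch.
struct Vec3 { double x, y, z; };

static double Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static double Length(Vec3 a) { return std::sqrt(Dot(a, a)); }
static Vec3 Normalize(Vec3 a) { double l = Length(a); return {a.x / l, a.y / l, a.z / l}; }
static Vec3 Cross(Vec3 a, Vec3 b)
{
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// vertices: polygon corners relative to the shading point, in a space
// where the surface normal is (0, 0, 1). No horizon clipping is done.
double IntegrateClampedCosinePolygon(const std::vector<Vec3>& vertices)
{
    const double kPi = 3.14159265358979323846;
    const std::size_t n = vertices.size();
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        Vec3 vi = Normalize(vertices[i]);
        Vec3 vj = Normalize(vertices[(i + 1) % n]); // next vertex, wrapping around
        Vec3 edgeNormal = Cross(vi, vj);
        double len = Length(edgeNormal);
        if (len < 1e-12)
            continue; // degenerate edge contributes nothing
        double cosTheta = std::max(-1.0, std::min(1.0, Dot(vi, vj)));
        // acos(vi . vj) * ((vi x vj) / |vi x vj|) . n with n = (0, 0, 1)
        sum += std::acos(cosTheta) * (edgeNormal.z / len);
    }
    // The sign of the sum depends on the winding order, so take the magnitude.
    return std::fabs(sum) / (2.0 * kPi);
}
```

For a square with corners at $(\pm 1, \pm 1, 1)$ this gives roughly 0.554, which agrees with the analytic view factor from a point to a parallel square directly above it.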

To be continued.

Bibliography:

[Wiki1] https://en.wikipedia.org/wiki/View_factor

[Mar14] http://webserver.dmt.upm.es/~isidoro/tc3/Radiation%20View%20factors.pdf

[HSM10] http://www.thermalradiation.net/tablecon.html

[Kar13] B. Karis. “Real Shading in Unreal Engine 4”. In: Physically Based Shading in Theory and Practice, ACM SIGGRAPH 2013 Courses. SIGGRAPH ’13. Anaheim, California: ACM, 2013, 22:1–22:8. isbn: 978-1-4503-2339-0. doi: 10.1145/2504435.2504457. url: http://selfshadow.com/publications/s2013-shading-course/.

[LR14] S. Lagarde and C. de Rousiers. “Moving Frostbite to PBR”. In: Physically Based Shading in Theory and Practice, ACM SIGGRAPH 2014 Courses. SIGGRAPH ’14. Vancouver, Canada: ACM, 2014, 23:1–23:8. isbn: 978-1-4503-2962-0. doi: 10.1145/2614028.2615431. url: http://www.frostbite.com/2014/11/moving-frostbite-to-pbr/.

[Dro14] M. Drobot. “Physically Based Area Lights”. In: GPU Pro 5 Advanced Rendering Techniques. ISBN: 978-1-4822-0864-1. url: https://www.crcpress.com/GPU-Pro-5-Advanced-Rendering-Techniques/Engel/p/book/9781482208634

[Hei16] E. Heitz, J. Dupuy, S. Hill and D. Neubelt. “Real-time polygonal-light shading with linearly transformed cosines”. In: ACM Transactions on Graphics (TOG) Volume 35 Issue 4, July 2016 Article No. 41. url: https://eheitzresearch.wordpress.com/415-2/

[Hill16] S. Hill and E. Heitz. “Real-Time Area Lighting: a Journey from Research to Production”. In: Advances in Real-Time Rendering in Games, ACM SIGGRAPH 2016 Courses. SIGGRAPH ’16. Anaheim, California: ACM, 2016. url: https://blog.selfshadow.com/publications/s2016-advances/

[Hei17] E. Heitz and S. Hill. “Real-Time Line- and Disk-Light Shading”. In: Physically Based Shading in Theory and Practice, ACM SIGGRAPH 2017 Courses. SIGGRAPH ’17. Los Angeles, California: ACM, 2017. url: https://blog.selfshadow.com/publications/s2017-shading-course/heitz/s2017_pbs_ltc_lines_disks.pdf

Walking through the heap properties in DirectX 12

zhangandhang@gmail.com (Hang Zhang) — Wed, 18 Sep 2019 14:00:00 +0000

Indifferent to the difference?

Back in the old times, you never had to worry about how physical memory was allocated when dealing with OpenGL and DirectX 11; the GPU and the video memory were hidden behind the driver so well that you might not even realize they were there. Nowadays we have Vulkan and DirectX 12 (and of course Metal, but…never mind), and the “zero driver-overhead” slogan (not “Make XXX Great Again”, sadly) has become reality on desktop platforms. Then come the “ohh I didn’t know that we need to handle the synchronization manually” and the “ohh why can’t I bind those resources directly” and so on. Of course, the new generation (not so new anymore, actually) of graphics APIs is not for casual usage: your hands get dirtier and your head gets dizzier, while a bunch of debug layer warnings and error messages keep popping up. Long story short, if you want something pretty and simple, turn around, rush to modern OpenGL (4.3+) or DirectX 11 and happy coding; if you want something pretty and fast, then stay with me for a while and let’s see what’s going on with the new D3D12 memory model.

The fundamental CPU-GPU communication architecture is quite similar across different machines: you have a CPU chip, you have a GPU chip, you have some memory chips, gotcha! A typical PC with a dedicated graphics card has 2 sets of memory chips, one we often refer to as the main memory and another as the dedicated graphics card memory, or by the more common (but less strict) names RAM and VRAM. In other architectures like game consoles, the main memory and the video memory are physically the same, and we call this kind of memory access model UMA - Uniform Memory Access. Also, the functional microchips of the CPU and GPU are put together or closer in certain designs (for example PS4) to get better communication performance. You remember that you paid just once for your fancy 16GB DDR4 “memory” when you were crafting your state-of-the-art PC, right? That is the “main” general-purpose RAM, used for things like loading your OS after power-up or storing the elements of your std::vector<T>. But if you also purchased an AMD or NVIDIA dedicated graphics card, you might have noticed the specification printed on the box saying that there are another couple of gibibytes of some sort of memory on it. That’s the VRAM, where the raw texture and mesh data stay while you are playing CS:GO and swearing in random Ruglish in front of your screen.

So, if you want to render the nice SWAT soldier mesh in CS:GO, you need to load it from your disk into the VRAM and then ask the GPU to schedule some parallel rasterization work to draw it. But unfortunately, you can’t access the VRAM directly in your C++ code due to the physical design of the hardware. You could reference a main memory virtual address with a statement like Foo* bar = nullptr in C++, because it is finally compiled into a machine instruction like movq $0x0, -0x8(%rbp) (it would be binary instruction data actually; for the sake of human readability I use assembly language here) that your CPU could execute. But generally speaking, you can’t expect the same programming experience on the GPU: since it is designed for highly parallel computational tasks, you can’t refer to fine-grained global memory addresses directly (there are always some exceptions, but let’s stay foolish for now). Instead, the start offset of a bunch of raw VRAM data should be available to you in order to create a context for the GPU to prepare and execute work. If you are familiar with OpenGL or D3D11 then you have already used interfaces such as glBindTextures or ID3D11DeviceContext::PSSetShaderResources. These 2 APIs don’t expose the VRAM explicitly to developers; instead, you get some indirect objects at runtime, like an integer handle in OpenGL or a COM object pointer in D3D11.

A step closer

The GPU is a heavily history-influenced peripheral, and as time goes by its abilities become more and more general and flexible. As you might know, the advent of the Unified Scalar Shader Architecture and Highly Data Parallel Stream Processing made the GPU suitable for almost every kind of parallel computation work, and the only thing lying between the developer and the hardware is the API. The older generation of graphics APIs like OpenGL or DirectX 11 were designed to lead developers to spend more time on the specific computer graphics tasks they want to get their hands on, rather than on low-level hardware details. But experience tells us: more abstraction, more overhead. So when the clock ticked around 2015 and the latest generation of graphics APIs was released to mass developers like me, a brand new, or I’d rather say “retro”, design philosophy appeared among them: no more pre-defined “texture” or “mesh” or “constant buffer” object models; instead we get some new but lower-level objects such as “resource” or “buffer” or “command”.

Honestly speaking, it’s a little bit painful to shift the programming mindset from the OpenGL/D3D11 era to Vulkan/D3D12. It’s quite like a 3-year-old kid who used to ride his cute tiny bike with training wheels and now needs to drive a 6-speed manual 4WD car. Previously you called a glGen* or ID3D11Device::Create* interface and got the resource handle within a few milliseconds at most. Now you can’t even “invoke” functions to let the GPU do this work! But wait, were we actually the ones who asked the GPU to allocate a VRAM range and put some nice AK-47 textures inside before? No, the graphics card vendor’s implementation handled the underlying dirty business for us: all the synchronization of CPU-GPU communication, all the VRAM allocation and management, all the resource binding details, we never even took a glimpse at them before! But it’s not as bad as I make it sound. You just have to take care of the additional steps that you weren’t obliged to do previously, and if you succeed you’d get not only more code in your repo but also a tremendous performance boost in your applications.

Let’s forget about the API problems for a few minutes and take a look back at the hardware architecture to better understand the physical structure that triggered the API “revolution”. The actual memory management relies on the hardware memory bus (it’s part of the I/O bridge) and the MMU - Memory Management Unit. They work together to transfer data between the processor, the different external adapters and the RAM, and to map virtual memory addresses to physical ones. So when you want to load some data from your HDD to RAM, the data travels through the I/O bridge to the CPU, and after some processing it is stored into RAM. If you have a performance-focused attitude when writing code, you may wonder whether there is any optimization for use cases like simply loading an executable binary file into RAM, which doesn’t require any additional processing of the data itself. And yes, we have DMA - Direct Memory Access! With DMA the data doesn’t need to travel through the CPU anymore; instead, it is loaded directly from the HDD to RAM.

As we could imagine, the CPU and GPU could have individual RAMs, MMUs and memory buses, thus they could execute and load-store data in their own RAMs individually. That’s perfect, two kingdoms living peacefully next to each other. But problems emerge as soon as they start to communicate: the data needs to be transferred from the CPU side to the GPU side or vice versa, and we need to build a “highway” for it. One of the “highway” hardware communication protocols widely used today is PCI-E. I’ll omit the details and focus on what we care about here. It’s basically another bus-like design that lets us transfer data between different adapters, such as a dedicated graphics card and the main memory. With its help, we could almost freely (sadly the highway still charges a toll, it’s not a freeway yet) write something that utilizes the CPU and GPU together.

That’s a few too many bridges, isn’t it? Remember the memory architecture called UMA that I briefly introduced before: it basically looks like RAM and VRAM merged together. Since this design requires the chip and memory manufacturers to produce such products, and so far I’ve never seen one in the consumer hardware market, we can’t craft one by ourselves. But still, if you own an Xbox One or PS4 you’ve already enjoyed the benefits of UMA.

Heap creation

So now it’s time to open your favorite IDE and #include some headers. In D3D12, all the resources reside inside explicitly specified memory pools, and the responsibility of managing these memory pools now belongs to the developer. This is where the interface

HRESULT ID3D12Device::CreateHeap(
 const D3D12_HEAP_DESC *pDesc,
 REFIID riid,
 void **ppvHeap
);

comes in. If you’re familiar with D3D11 or other COM-style Windows APIs you could easily understand the function signature. It combines a pointer to a description structure instance, the GUID of the requested COM interface and a pointer that receives the created object instance’s address. The return value of the function is the execution result.

Now let’s take a look at the description structure:

typedef struct D3D12_HEAP_DESC {
 UINT64 SizeInBytes;
 D3D12_HEAP_PROPERTIES Properties;
 UINT64 Alignment;
 D3D12_HEAP_FLAGS Flags;
} D3D12_HEAP_DESC;

It apparently follows the consistent code style of the D3D12 API, and here we get another property structure to fill in:

typedef struct D3D12_HEAP_PROPERTIES {
 D3D12_HEAP_TYPE Type;
 D3D12_CPU_PAGE_PROPERTY CPUPageProperty;
 D3D12_MEMORY_POOL MemoryPoolPreference;
 UINT CreationNodeMask;
 UINT VisibleNodeMask;
} D3D12_HEAP_PROPERTIES;

This structure informs the device which kind of physical memory the heap should refer to. Since the D3D12 documentation is comprehensible enough, I’d rather not repeat too many things that are already listed there. When D3D12_HEAP_TYPE Type is not D3D12_HEAP_TYPE_CUSTOM, D3D12_CPU_PAGE_PROPERTY CPUPageProperty should always be D3D12_CPU_PAGE_PROPERTY_UNKNOWN, because the CPU accessibility of the heap is already indicated by the D3D12_HEAP_TYPE and you shouldn’t repeat the information. For a similar reason, D3D12_MEMORY_POOL MemoryPoolPreference should always be D3D12_MEMORY_POOL_UNKNOWN when D3D12_HEAP_TYPE Type is not D3D12_HEAP_TYPE_CUSTOM.

In a UMA architecture, there is only one physical memory pool, shared by both the CPU and GPU; the most common case is that you have an Xbox One and start to write some D3D12 games on it. In such a case only D3D12_MEMORY_POOL_L0 is available and thus we don’t need to take care of it at all.

Most desktop PCs with a dedicated graphics card have a NUMA memory architecture (although in recent years things like AMD’s hUMA have appeared and gone). In such a case D3D12_MEMORY_POOL_L0 is the RAM and D3D12_MEMORY_POOL_L1 is the VRAM.

So now if we set the heap type to D3D12_HEAP_TYPE_CUSTOM, we get more flexible control over the heap configuration. The chart below lists what the different combinations of D3D12_CPU_PAGE_PROPERTY and D3D12_MEMORY_POOL finally look like on NUMA architectures.

	NOT_AVAILABLE	WRITE_COMBINE	WRITE_BACK
L0	Similar to `D3D12_HEAP_TYPE_DEFAULT`, a GPU access-only RAM (but a rather pointless configuration for common use cases)	Similar to `D3D12_HEAP_TYPE_UPLOAD`: CPU reads are uncached and therefore slow, but CPU writes are fast because they can be combined without caring about memory ordering, perfect for data that the GPU reads	Similar to `D3D12_HEAP_TYPE_READBACK`: GPU writes are cached and CPU reads get a coherent and consistent result
L1	Similar to `D3D12_HEAP_TYPE_DEFAULT`, a GPU access-only VRAM	Invalid, the CPU can’t access VRAM directly	Invalid, the CPU can’t access VRAM directly

It looks like we don’t need a custom heap property structure on NUMA architectures (or in the single engine/single adapter case): all the possible heap types are already provided by the pre-defined types, and there is not much room for us to maneuver to get some advanced optimization. But if your application wants better customization for all the possible hardware it would run on, then custom heap properties are still worth investigating.
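The rules and the chart above could be condensed into a small validity check. The sketch below deliberately uses local stand-in enums (`HeapType`, `CpuPageProperty`, `MemoryPool`) instead of the real D3D12 types so that it builds without the Windows SDK, and the function names are made up for this example. It only models the NUMA case described in this post; in a real application ID3D12Device::GetCustomHeapProperties reports the custom equivalent of a pre-defined heap type for the actual adapter.

```cpp
#include <cassert>

// Stand-ins for D3D12_HEAP_TYPE, D3D12_CPU_PAGE_PROPERTY and D3D12_MEMORY_POOL.
enum class HeapType { Default, Upload, Readback, Custom };
enum class CpuPageProperty { Unknown, NotAvailable, WriteCombine, WriteBack };
enum class MemoryPool { Unknown, L0, L1 };

struct HeapProperties
{
    HeapType type;
    CpuPageProperty cpuPage;
    MemoryPool pool;
};

// Checks the combinations discussed above for a NUMA adapter.
bool IsValidOnNuma(const HeapProperties& p)
{
    // Pre-defined types already imply CPU access and pool: both must stay UNKNOWN.
    if (p.type != HeapType::Custom)
        return p.cpuPage == CpuPageProperty::Unknown && p.pool == MemoryPool::Unknown;

    // Custom heaps have to spell out both properties.
    if (p.cpuPage == CpuPageProperty::Unknown || p.pool == MemoryPool::Unknown)
        return false;

    // L1 is the VRAM, which the CPU can't access directly.
    if (p.pool == MemoryPool::L1 && p.cpuPage != CpuPageProperty::NotAvailable)
        return false;

    return true;
}

// The custom configuration that behaves like a pre-defined type on NUMA.
HeapProperties ToCustomOnNuma(HeapType type)
{
    switch (type)
    {
    case HeapType::Default:  return {HeapType::Custom, CpuPageProperty::NotAvailable, MemoryPool::L1};
    case HeapType::Upload:   return {HeapType::Custom, CpuPageProperty::WriteCombine, MemoryPool::L0};
    case HeapType::Readback: return {HeapType::Custom, CpuPageProperty::WriteBack, MemoryPool::L0};
    case HeapType::Custom:   break;
    }
    // A custom type has no single equivalent; return an intentionally invalid value.
    return {HeapType::Custom, CpuPageProperty::Unknown, MemoryPool::Unknown};
}
```

A check like this is cheap to run in debug builds before calling CreateHeap, which gives a friendlier failure than an E_INVALIDARG from the runtime.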

And finally, we have a misc flag mask to indicate the detailed usage of the heap:

typedef enum D3D12_HEAP_FLAGS {
 D3D12_HEAP_FLAG_NONE = 0,
 D3D12_HEAP_FLAG_SHARED = 0x1,
 D3D12_HEAP_FLAG_DENY_BUFFERS = 0x4,
 D3D12_HEAP_FLAG_ALLOW_DISPLAY = 0x8,
 D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER = 0x20,
 D3D12_HEAP_FLAG_DENY_RT_DS_TEXTURES = 0x40,
 D3D12_HEAP_FLAG_DENY_NON_RT_DS_TEXTURES = 0x80,
 D3D12_HEAP_FLAG_HARDWARE_PROTECTED = 0x100,
 D3D12_HEAP_FLAG_ALLOW_WRITE_WATCH = 0x200,
 D3D12_HEAP_FLAG_ALLOW_SHADER_ATOMICS = 0x400,
 D3D12_HEAP_FLAG_ALLOW_ALL_BUFFERS_AND_TEXTURES = 0,
 D3D12_HEAP_FLAG_ALLOW_ONLY_BUFFERS = 0xc0,
 D3D12_HEAP_FLAG_ALLOW_ONLY_NON_RT_DS_TEXTURES = 0x44,
 D3D12_HEAP_FLAG_ALLOW_ONLY_RT_DS_TEXTURES = 0x84
} D3D12_HEAP_FLAGS;

Depending on the specific D3D12_RESOURCE_HEAP_TIER that the hardware supports, certain D3D12_HEAP_FLAGS are not allowed to be used alone or combined together. The details are well documented on the official website so I won’t discuss them here. Because some of the enum values are just aliases of others, there are fewer actual heap flags than defined names, and the chart below lists the different use cases and the corresponding flags.

	Tier1	Tier2
All resource types	Not supported	`D3D12_HEAP_FLAG_ALLOW_ALL_BUFFERS_AND_TEXTURES` or `D3D12_HEAP_FLAG_NONE`
Buffer only	`D3D12_HEAP_FLAG_DENY_RT_DS_TEXTURES | D3D12_HEAP_FLAG_DENY_NON_RT_DS_TEXTURES`	Same as Tier1
Non-RT/DS texture only	`D3D12_HEAP_FLAG_DENY_BUFFERS | D3D12_HEAP_FLAG_DENY_RT_DS_TEXTURES`	Same as Tier1
RT/DS texture only	`D3D12_HEAP_FLAG_DENY_BUFFERS | D3D12_HEAP_FLAG_DENY_NON_RT_DS_TEXTURES`	Same as Tier1
Swap-chain surface only	`D3D12_HEAP_FLAG_ALLOW_DISPLAY`	Same as Tier1
Shared heap (multi-process)	`D3D12_HEAP_FLAG_SHARED`	Same as Tier1
Shared heap (multi-adapter)	`D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER`	Same as Tier1
Memory write tracking	`D3D12_HEAP_FLAG_ALLOW_WRITE_WATCH`	Same as Tier1
Atomic primitive	`D3D12_HEAP_FLAG_ALLOW_SHADER_ATOMICS`	Same as Tier1

As you can see above, the only meaningful difference between Tier1 and Tier2 here is that Tier2 supports the D3D12_HEAP_FLAG_ALLOW_ALL_BUFFERS_AND_TEXTURES flag, so we could put all the common resources into one heap. It again depends on the specific task you would like to finish: sometimes you want an all-in-one heap, sometimes it’s better to separate resources into different heaps by use case.
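The first four rows of the chart could be turned into a small helper like the one below. The constants are local stand-ins that carry the values of the corresponding D3D12_HEAP_FLAG_* names from d3d12.h, so the sketch builds without the Windows SDK; `HeapTier`, `HeapContent` and `SelectHeapFlags` are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the D3D12_HEAP_FLAG_* values used in the chart.
constexpr std::uint32_t kHeapFlagNone = 0x0;
constexpr std::uint32_t kHeapFlagDenyBuffers = 0x4;
constexpr std::uint32_t kHeapFlagDenyRtDsTextures = 0x40;
constexpr std::uint32_t kHeapFlagDenyNonRtDsTextures = 0x80;

enum class HeapTier { Tier1, Tier2 };
enum class HeapContent { AllResources, BuffersOnly, NonRtDsTexturesOnly, RtDsTexturesOnly };

// Writes the flags for the requested content into outFlags.
// Returns false when the tier can't host that content in a single heap.
bool SelectHeapFlags(HeapTier tier, HeapContent content, std::uint32_t& outFlags)
{
    switch (content)
    {
    case HeapContent::AllResources:
        if (tier == HeapTier::Tier1)
            return false; // Tier1 can't mix all resource categories in one heap
        outFlags = kHeapFlagNone; // same value as ALLOW_ALL_BUFFERS_AND_TEXTURES
        return true;
    case HeapContent::BuffersOnly:
        outFlags = kHeapFlagDenyRtDsTextures | kHeapFlagDenyNonRtDsTextures;
        return true;
    case HeapContent::NonRtDsTexturesOnly:
        outFlags = kHeapFlagDenyBuffers | kHeapFlagDenyRtDsTextures;
        return true;
    case HeapContent::RtDsTexturesOnly:
        outFlags = kHeapFlagDenyBuffers | kHeapFlagDenyNonRtDsTextures;
        return true;
    }
    return false;
}
```

The three combined results are exactly the ALLOW_ONLY_* aliases (0xc0, 0x44 and 0x84), which is why the list of distinct flags is shorter than the list of names.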

Resource creation

After you have created a heap successfully, you could start to create resources inside it. There are 3 different ways to create a resource:

Create a resource which has only a virtual address range; it requires us to map it to physical memory inside an already created heap manually later. ID3D12Device::CreateReservedResource is the interface for such a task;
Create a resource which has both a virtual address and mapped physical memory inside an already created heap; the most commonly used resources are of this type. ID3D12Device::CreatePlacedResource is the interface for such a task;
Create a placed resource and an implicit heap at the same time. ID3D12Device::CreateCommittedResource is the interface for such a task.

If you don’t want to manage the heap memory manually at all, you could choose committed resources at some cost in performance, but naturally it’s not a good idea to rely heavily on committed resources in production code (unless you’re lazy like me and don’t want to write more code in showcase projects). The more mature choice is placed resources: since we could already create heaps, the only thing left to do is to design a heap memory management module with some efficient strategies. You could reuse as many design patterns and architectures as you like from your experience of implementing a heap memory management system for the main RAM (still calling malloc() inside 16ms? No way!). A ring buffer or a double buffer for the Upload heap, some linked list for the Default heap, or whatever; there are no limits to the imagination, just analyze your application’s requirements and figure out a suitable solution (but don’t write a messy GC system for it:). There shouldn’t be too many choices, since in most D3D12 applications like games, most of the resources are written once by the CPU, and the others are dynamic buffers which don’t occupy much space but are updated frequently.
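As one possible starting point for the Upload heap case, here is a sketch of a ring allocator that only hands out byte offsets, which could then be passed as the HeapOffset of ID3D12Device::CreatePlacedResource. Everything in it is an assumption for illustration: the class name, the fence-value bookkeeping and the simplification that the heap base itself is sufficiently aligned. It doesn’t touch any D3D12 object, so it compiles without the Windows SDK.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

constexpr std::uint64_t kInvalidOffset = ~std::uint64_t(0);

// Hands out offsets inside one heap in ring order and frees them per frame.
class UploadRingAllocator
{
public:
    explicit UploadRingAllocator(std::uint64_t capacity) : m_capacity(capacity) {}

    // alignment must be a power of two. Returns kInvalidOffset when full.
    std::uint64_t Allocate(std::uint64_t size, std::uint64_t alignment)
    {
        if (size == 0 || size > m_capacity)
            return kInvalidOffset;

        std::uint64_t start = (m_tail + alignment - 1) & ~(alignment - 1);
        std::uint64_t padding = start - m_tail;
        if (start + size > m_capacity)
        {
            // Not enough room before the end: waste the rest and wrap to 0.
            padding = m_capacity - m_tail;
            start = 0;
        }
        if (m_used + padding + size > m_capacity)
            return kInvalidOffset; // would run into data the GPU may still read

        m_used += padding + size;
        m_frameBytes += padding + size;
        m_tail = (start + size) % m_capacity;
        return start;
    }

    // Call once per frame with the fence value signaled after its submission.
    void EndFrame(std::uint64_t fenceValue)
    {
        m_frames.push_back({fenceValue, m_frameBytes});
        m_frameBytes = 0;
    }

    // Call with the last fence value the GPU has completed.
    void Release(std::uint64_t completedFenceValue)
    {
        while (!m_frames.empty() && m_frames.front().fenceValue <= completedFenceValue)
        {
            m_used -= m_frames.front().bytes;
            m_frames.pop_front();
        }
    }

private:
    struct Frame { std::uint64_t fenceValue; std::uint64_t bytes; };

    std::uint64_t m_capacity;
    std::uint64_t m_tail = 0;       // next free byte
    std::uint64_t m_used = 0;       // bytes not yet released
    std::uint64_t m_frameBytes = 0; // bytes allocated in the current frame
    std::deque<Frame> m_frames;
};
```

Tracking whole frames instead of individual allocations keeps the bookkeeping tiny, at the cost of holding a frame’s memory until the GPU has finished all of it.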

The more advanced situations rely on a tremendous amount of memory, such as mega-textures (maybe you need a 64x64 $km^2$ terrain albedo texture?) or sparse-tree volume textures (maybe you need a voxel-cone-traced irradiance volume?), which could easily exceed the physical VRAM size, or where the actual texture size is beyond the maximum the hardware supports. In such cases a dynamic virtual memory address mapping technique is necessary. Developers tended to implement a software cache solution for this problem in the past because the APIs didn’t provide any reliable functionality at that time (before D3D11.2 and OpenGL 4.4, which started to support tiled/sparse textures). The reserved resources in D3D12 are today’s fresh one-for-all solution: they inherit the design of the tiled resources architecture in D3D11 but also provide more flexibility. But still, whether you could fit your elegant and complex SVOGI volume texture into the VRAM depends on the hardware support, so it’s better to query D3D12_TILED_RESOURCES_TIER first and see whether the target hardware supports tiled resources.

So many descriptors in Vulkan

zhangandhang@gmail.com (Hang Zhang) — Sun, 21 Apr 2019 10:53:00 +0000

Brainwash is always somewhere

It was really a mess when I started to port my engine to Vulkan: there are too many new data types mapped to concepts that seem to come from nowhere, which I didn’t need to care about previously. But luckily those concepts are well designed, and once you understand what they are, all the pain you went through just disappears.

The new generation graphics APIs all transfer the responsibility of CPU-GPU communication to the user more or less explicitly. Now if you want to ask the GPU to do something for you, there is no longer any ready-made API where you just feed in some data from your CPU and memory and let “somebody” handle everything else.

Let’s say the GPU knows absolutely nothing about what you want to do, as always. That “(ex-)somebody” was previously your graphics card vendor’s implementation of those graphics APIs, which did the “trivial” underlying pipeline work for you. It is gone now, and you have to take care of everything it did for you before. Sounds like a fairly bad break-up!

But what would you gain from a break-up? (Almost) freedom. Now the GPU is more like a general computing server, which exposes itself through the new generation of lower-level APIs. As the client, we need to submit the computing work with a detailed enough work description, and then keep feeding in data and commands following the description we agreed on with the GPU before. But in practice, most of the work I do, like lots of people who found this article, is rendering, and for this purpose these new APIs are still designed with lots of rendering-specific concepts (because the GPU is still a “Graphics” Processing Unit today:)).

But setting the details aside, the whole thing is easy to understand. What we need to do is just fit ourselves into the new CPU-GPU communication model. Create a work description, submit work, repeat, that’s all. Now it’s time to write the code, and you may want to say “whaaaaat” when you type “vk” inside your IDE if it has some kind of code autocomplete feature. Yep, too many data types!

What is it?

“…A descriptor is an opaque data structure representing a shader resource such as a buffer, buffer view, image view, sampler, or combined image sampler. Descriptors are organised into descriptor sets, which are bound during command recording for use in subsequent draw commands…” -13. Resource Descriptors, Vulkan® 1.1.106 - A Specification (with all published extensions)

If you asked me what the most beneficial thing I got from the journey of writing a game engine is, I would answer, “Don’t Panic”. One of the headaches when I started playing with Vulkan was the descriptors: I couldn’t grasp their meaning at the very beginning, because with an OpenGL mindset there is no corresponding concept.

But if you think about it from a fresh point of view, it really isn’t a pointless concept: we have to give the GPU the work description, like where the resources are, how the shaders will access them and through which kind of view, since data are just some bytes inside the GPU memory. So it’s really better to build a new mindset closer to the GPU pipeline.

“…Descriptors are grouped together into descriptor set objects. A descriptor set object is an opaque object that contains storage for a set of descriptors, where the types and number of descriptors is defined by a descriptor set layout. The layout object may be used to define the association of each descriptor binding with memory or other hardware resources. The layout is used both for determining the resources that need to be associated with the descriptor set, and determining the interface between shader stages and shader resources…” -13.2. Descriptor Sets, Vulkan® 1.1.106 - A Specification (with all published extensions)

Actually, there isn’t a “descriptor” data type that we could interact with directly on the CPU side; as the specification says, it’s opaque. The workflow of creating descriptors is designed as a combination of a set handle (VkDescriptorSet), a buffer or image binding info (VkDescriptorBufferInfo/VkDescriptorImageInfo) and a write or copy operation (VkWriteDescriptorSet/VkCopyDescriptorSet). You acquire a set instance from a pool (vkAllocateDescriptorSets), and both the set and the pool have their own characteristics or usage hints which you specify before creating them. These characteristics are typically configured with the layout (VkDescriptorSetLayout) and the create info (VkDescriptorSetAllocateInfo and VkDescriptorPoolCreateInfo). After you create the set you have to provide the info about which buffer or image it binds to, and finally apply this information with a write or copy operation (vkUpdateDescriptorSets). The downside of a plain code example is that the relation between the data structures and the functions is not so intuitive, so I made a little flow graph to demonstrate it.

graph TD
	subgraph "create descriptor pool"
 VkDescriptorPoolSize-.->VkDescriptorPoolCreateInfo
 VkDescriptorPoolCreateInfo==vkCreateDescriptorPool==>pool((VkDescriptorPool))
	end
	subgraph "create descriptor set layout"
 VkDescriptorSetLayoutBinding-.->VkDescriptorSetLayoutCreateInfo
	VkDescriptorSetLayoutCreateInfo==vkCreateDescriptorSetLayout==>layout(VkDescriptorSetLayout)
 end
 subgraph "create descriptor set"
	layout-.->VkDescriptorSetAllocateInfo
	pool-.->VkDescriptorSetAllocateInfo
	VkDescriptorSetAllocateInfo==vkAllocateDescriptorSets==>set(VkDescriptorSet)
 end
	subgraph "UBO"
	ubo(VkBuffer)-.->VkDescriptorBufferInfo
	end
	subgraph "Sampler"
	sampler(VkSampler)-.->VkDescriptorImageInfo
	end
	subgraph "update descriptor set"
	set-.->writeCopySets(VkWriteDescriptorSet/VkCopyDescriptorSet)
	VkDescriptorBufferInfo-.->writeCopySets
	VkDescriptorImageInfo-.->writeCopySets
	writeCopySets--vkUpdateDescriptorSets-->device(VkDevice)
 end
	subgraph "create pipeline layout"
	layout-.->VkPipelineLayoutCreateInfo
	VkPipelineLayoutCreateInfo==vkCreatePipelineLayout==>pipelineLayout(VkPipelineLayout)
	end
	subgraph "create pipeline"
	pipelineLayout-.->VkGraphicsPipelineCreateInfo
	VkGraphicsPipelineCreateInfo==vkCreateGraphicsPipelines==>pipeline(VkPipeline)
	end

All the thick lines indicate that a real object instance is created by a function invocation, while the dotted lines show the dependencies between the info structures. I omitted the other dependencies of creating a VkPipeline since they are not related to the topic I’m talking about.

By now you may start to feel that Vulkan has a really clean architecture model, and indeed it does. We now have far more flexibility: we could have as many descriptors as we like in many different descriptor sets, in many descriptor pools, with many different combinations of configurations. One thing that breaks the naming convention a little bit is VkWriteDescriptorSet and VkCopyDescriptorSet: they should and could only be created directly by the user and submitted later, rather than acquired from the VkDevice (I wondered why there isn’t something like a VkWriteDescriptorSetCreateInfo, but that would be too redundant for creating a VkWriteDescriptorSet, because after all it describes an operation on the descriptor; maybe it would be better to call it VkDescriptorSetWriteOp?).

Some usage cases

Let’s have a look at some code examples, which all come from real scenarios I have encountered before.

Single UBO data accessed per shader stage

I have a UBO for main camera related data like the projection matrix, which is only updated once per frame. The GLSL code looks like this:

layout(std140, row_major, set = 0, binding = 0) uniform cameraUBO
{
	mat4 uni_p_camera_original;
};

Then it’s better to answer some questions before creating the descriptor-related data:

How many descriptors will we have inside the pool? Only one.
Which kind of resource will it be used for? A uniform buffer.
How many different types and numbers of descriptors will be allocated from this pool? It will hold only one type and only one descriptor.
How many sets could it hold in total? Only one.

VkDescriptorPoolSize l_poolSize = {};
l_poolSize.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
l_poolSize.descriptorCount = 1;

VkDescriptorPoolCreateInfo l_poolInfo = {};
l_poolInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
l_poolInfo.poolSizeCount = 1;
l_poolInfo.pPoolSizes = &l_poolSize; // pPoolSizes expects a pointer
l_poolInfo.maxSets = 1;

Create VkDescriptorPool.

VkDescriptorPool l_pool;
vkCreateDescriptorPool(m_device, &l_poolInfo, nullptr, &l_pool);

Where will it be bound in the shader? Binding point 0.
How many descriptors will be bound? Only one.
Which shader stage can access it? The vertex shader.

VkDescriptorSetLayoutBinding l_setLayoutBinding = {};
l_setLayoutBinding.binding = 0;
l_setLayoutBinding.descriptorCount = 1;
l_setLayoutBinding.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
l_setLayoutBinding.pImmutableSamplers = nullptr;
l_setLayoutBinding.stageFlags = VK_SHADER_STAGE_VERTEX_BIT;

VkDescriptorSetLayoutCreateInfo l_layoutCreateInfo = {};
l_layoutCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
l_layoutCreateInfo.bindingCount = 1;
l_layoutCreateInfo.pBindings = &l_setLayoutBinding; // pBindings expects a pointer

Create VkDescriptorSetLayout.

VkDescriptorSetLayout l_setLayout;
vkCreateDescriptorSetLayout(m_device, &l_layoutCreateInfo, nullptr, &l_setLayout);

Create VkDescriptorSet.

VkDescriptorSetAllocateInfo l_allocInfo = {};
l_allocInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
l_allocInfo.descriptorPool = l_pool;
l_allocInfo.descriptorSetCount = 1;
l_allocInfo.pSetLayouts = &l_setLayout;

VkDescriptorSet l_set;
vkAllocateDescriptorSets(m_device, &l_allocInfo, &l_set);

Which resource will it be bound to? A UBO.
Where to bind? Binding point 0.
Which VkDescriptorSet does the write operation target? The one we just created.

VkDescriptorBufferInfo l_bufferInfo = {};
l_bufferInfo.buffer = m_cameraUBO; // created from somewhere else
l_bufferInfo.offset = 0;
l_bufferInfo.range = sizeof(CameraGPUData);

VkWriteDescriptorSet l_writeDescriptorSet = {};
l_writeDescriptorSet.sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
l_writeDescriptorSet.dstBinding = 0;
l_writeDescriptorSet.dstSet = l_set;
l_writeDescriptorSet.dstArrayElement = 0;
l_writeDescriptorSet.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
l_writeDescriptorSet.descriptorCount = 1;
l_writeDescriptorSet.pBufferInfo = &l_bufferInfo;

vkUpdateDescriptorSets(
		m_device, // VkDevice handle
		1, // only 1 VkWriteDescriptorSet
		&l_writeDescriptorSet,
		0,
		nullptr);

Then when I need to record commands, the only thing left is a call to vkCmdBindDescriptorSets:

vkCmdBindDescriptorSets(
	m_commandBuffer, // the VkCommandBuffer handle, passed by value
	VK_PIPELINE_BIND_POINT_GRAPHICS,
	m_pipelineLayout,
	0, // the first set is set 0
	1, // and one set only
	&l_set, // the descriptor set allocated above
	0, nullptr); // no dynamic offsets

Array UBO data accessed per shader stage

My punctual light data lives in an array that holds all the lights and is uploaded to the GPU once per frame; I iterate through the array inside the fragment shader for a deferred-style light pass. The GLSL code looks like this:

#define MAX_POINT_LIGHT 64
// w component of luminance is attenuationRadius
struct pointLight {
	vec4 position;
	vec4 luminance;
	//float attenuationRadius;
};

layout(set = 0, binding = 2) uniform pointLightUBO
{
	pointLight uni_pointLights[MAX_POINT_LIGHT];
};

The only things that change in the C++ code are the binding point and the buffer range, since it’s an array now.

#define MAX_POINT_LIGHT 64
VkDescriptorSetLayoutBinding l_setLayoutBinding = {};
l_setLayoutBinding.binding = 2;

VkDescriptorBufferInfo l_bufferInfo = {};
l_bufferInfo.buffer = m_pointLightUBO; // created from somewhere else
l_bufferInfo.offset = 0;
l_bufferInfo.range = sizeof(PointLightGPUData) * MAX_POINT_LIGHT; // the total size of the UBO array

VkWriteDescriptorSet l_writeDescriptorSet = {};
l_writeDescriptorSet.sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
l_writeDescriptorSet.dstBinding = 2;
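
The buffer range above only matches the shader if the CPU-side struct has the same layout as the GLSL one. Here is a minimal sketch of such a check, assuming PointLightGPUData simply mirrors the two vec4 members of pointLight (the member layout and the helper pointLightUBORange are my assumptions for illustration, and I assume the block uses std140 rules, where a vec4 takes 16 bytes):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t MAX_POINT_LIGHT = 64;

// Assumed CPU-side mirror of the GLSL `pointLight` struct.
// Each vec4 is four 32-bit floats: 16 bytes, 16-byte aligned under std140.
struct alignas(16) PointLightGPUData
{
	float position[4];
	float luminance[4]; // w component holds the attenuation radius
};

// Two vec4 members and no padding, so one array element takes 32 bytes.
static_assert(sizeof(PointLightGPUData) == 32, "layout must match the GLSL struct");

// Hypothetical helper: the byte range that goes into VkDescriptorBufferInfo::range.
constexpr std::size_t pointLightUBORange()
{
	return sizeof(PointLightGPUData) * MAX_POINT_LIGHT;
}
```

With 64 lights this gives 2048 bytes, well below the 16384 bytes that the Vulkan specification guarantees as the minimum for maxUniformBufferRange.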

The command recording is exactly the same as in the previous example:

vkCmdBindDescriptorSets(
	m_commandBuffer, // the VkCommandBuffer handle, passed by value
	VK_PIPELINE_BIND_POINT_GRAPHICS,
	m_pipelineLayout,
	0, // the first set is set 0
	1, // and one set only
	&l_set, // the descriptor set, allocated as in the first example
	0,
	nullptr);

Array UBO data accessed per draw call object

The local-to-world transformation matrix needs to be updated per object, and for this we can use a dynamic uniform buffer (forget your glUniform* things!). I update a UBO array once per frame that contains the data of all drawable meshes, and later use an offset to access the corresponding part per draw call.

layout(std140, row_major, set = 0, binding = 1) uniform meshUBO
{
	mat4 uni_m;
};

Now in C++ code, I need to specify a different descriptor type, also a different binding point because I bind other resources at binding point 0, but it’s trivial.

VkDescriptorPoolSize l_poolSize = {};
l_poolSize.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;

VkDescriptorSetLayoutBinding l_layoutBinding = {};
l_layoutBinding.binding = 1;
l_layoutBinding.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;

VkDescriptorBufferInfo l_bufferInfo = {};
l_bufferInfo.buffer = m_meshUBO;
l_bufferInfo.offset = 0;
l_bufferInfo.range = sizeof(MeshGPUData);

Since it is accessed in a dynamic fashion, the buffer range covers one block rather than the whole UBO data array.

And when binding the descriptor set we need to specify the dynamic offset. I use an implementation like this:

// note: each dynamic offset must be a multiple of
// VkPhysicalDeviceLimits::minUniformBufferOffsetAlignment
uint32_t l_blockSize = sizeof(MeshGPUData);

for (int i = 0; i < total_meshes_this_frame; i++)
{
	uint32_t l_offset = l_blockSize * static_cast<uint32_t>(i);

	vkCmdBindDescriptorSets(
	m_commandBuffer, // the VkCommandBuffer handle, passed by value
	VK_PIPELINE_BIND_POINT_GRAPHICS,
	m_pipelineLayout,
	0, // the first set is set 0
	1, // and one set only
	&l_set, // the descriptor set, allocated as in the first example
	1, // now we have one dynamic offset
	&l_offset // the offset value
	);

	// draw call
	//...
	//
}
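
One detail the loop above glosses over: Vulkan requires every dynamic offset to be a multiple of VkPhysicalDeviceLimits::minUniformBufferOffsetAlignment, which is a power of two and often larger than a single mat4. A minimal sketch of the stride calculation, with alignUp and dynamicOffset as hypothetical helper names and the alignment passed in as a plain integer (the real limit is a VkDeviceSize queried from the physical device):

```cpp
#include <cassert>
#include <cstdint>

// Round `size` up to the next multiple of `alignment`.
// `alignment` must be a power of two, which the Vulkan limit guarantees.
constexpr uint32_t alignUp(uint32_t size, uint32_t alignment)
{
	return (size + alignment - 1) & ~(alignment - 1);
}

// Dynamic offset of the block with the given index, when every block
// is padded to the aligned stride inside the buffer.
constexpr uint32_t dynamicOffset(uint32_t index, uint32_t blockSize, uint32_t minAlignment)
{
	return index * alignUp(blockSize, minAlignment);
}
```

For example, a 64-byte mat4 block on a device that reports an alignment of 256 has to be stored with a stride of 256 bytes, so the third mesh lives at offset 768 rather than 192. The data uploaded to the buffer needs the same padding.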

Multiple array UBO data accessed per draw call object

This is the combination of the previous situations: the mesh UBO and the material UBO change with each draw call, while the camera UBO is updated once per frame. Still, we can handle this with a single set.

The Vertex shader looks like this:

layout(std140, row_major, set = 0, binding = 0) uniform cameraUBO
{
	mat4 uni_p_camera_original;
};

layout(std140, row_major, set = 0, binding = 1) uniform meshUBO
{
	mat4 uni_m;
};

While the fragment shader looks like this:

layout(std140, set = 0, binding = 2) uniform materialUBO
{
	vec4 uni_albedo;
	vec4 uni_MRAT;
};

Now the C++ code:

How many descriptors will we have inside the pool? Now 3.
Which kinds of resource will it be used for? Normal uniform buffers and dynamic uniform buffers.
How many different types and how many descriptors will be allocated from this pool? 2 types, 3 descriptors.
How many sets can it hold in total? Still 1.

VkDescriptorPoolSize l_cameraUBOPoolSize = {};
VkDescriptorPoolSize l_meshUBOPoolSize = {};
VkDescriptorPoolSize l_materialUBOPoolSize = {};

l_cameraUBOPoolSize.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
l_cameraUBOPoolSize.descriptorCount = 1;

l_meshUBOPoolSize.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
l_meshUBOPoolSize.descriptorCount = 1;

l_materialUBOPoolSize.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
l_materialUBOPoolSize.descriptorCount = 1;

VkDescriptorPoolSize l_UBOPoolSizes[] = { l_cameraUBOPoolSize , l_meshUBOPoolSize, l_materialUBOPoolSize };

VkDescriptorPoolCreateInfo l_poolInfo = {};
l_poolInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
l_poolInfo.poolSizeCount = 3;
l_poolInfo.pPoolSizes = l_UBOPoolSizes;
l_poolInfo.maxSets = 1;
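
Listing the same descriptor type twice in pPoolSizes, as above, is valid (the pool sums the counts per type), but the entries can also be merged so that there is one entry per type. A small sketch of that bookkeeping follows; DescriptorType and PoolSize are stand-ins I made up so the example compiles without the Vulkan headers, and in real code they would be VkDescriptorType and VkDescriptorPoolSize:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Stand-in for VkDescriptorType (assumption, only two values needed here).
enum class DescriptorType { UniformBuffer, UniformBufferDynamic };

// Stand-in for VkDescriptorPoolSize.
struct PoolSize
{
	DescriptorType type;
	uint32_t descriptorCount;
};

// Count how many descriptors of each type the bindings need,
// producing one pool size entry per distinct type.
std::vector<PoolSize> mergePoolSizes(const std::vector<DescriptorType>& descriptorTypes)
{
	std::map<DescriptorType, uint32_t> counts;
	for (DescriptorType type : descriptorTypes)
		++counts[type];

	std::vector<PoolSize> poolSizes;
	for (const auto& entry : counts)
		poolSizes.push_back({ entry.first, entry.second });
	return poolSizes;
}
```

For the camera, mesh and material bindings this yields two entries: one plain uniform buffer and two dynamic uniform buffers, so poolSizeCount would be 2 instead of 3.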

The other parts are all the same, just change the binding points and the buffer info. And when binding the descriptor set, the dynamic offset is an array now:

uint32_t l_meshDataBlockSize = sizeof(MeshGPUData);
// MaterialGPUData is an assumed name for the CPU-side material struct
uint32_t l_materialDataBlockSize = sizeof(MaterialGPUData);

for (int i = 0; i < total_meshes_this_frame; i++)
{
	uint32_t l_meshOffset = l_meshDataBlockSize * static_cast<uint32_t>(i);
	uint32_t l_materialOffset = l_materialDataBlockSize * static_cast<uint32_t>(i);
	// ordered by binding number: mesh (binding 1), then material (binding 2)
	uint32_t l_offsets[] = { l_meshOffset, l_materialOffset };

	vkCmdBindDescriptorSets(
	m_commandBuffer, // the VkCommandBuffer handle, passed by value
	VK_PIPELINE_BIND_POINT_GRAPHICS,
	m_pipelineLayout,
	0, // the first set is set 0
	1, // and one set only
	&l_set, // the descriptor set, allocated as in the first example
	2, // now we have two dynamic offsets
	l_offsets // the offset value array
	);

	// draw call
	//...
	//
}

That’s it! Sampler descriptors work similarly to uniform buffers, and all the rules I applied above hold for them as well. Freedom means more responsibility and caution, but we now have more room to optimize the whole rendering pipeline to a new level of efficiency. I will keep investigating how to use the power of Vulkan in a more concise way; after all, it’s a much more complex API than OpenGL!

Normal and normal mapping

zhangandhang@gmail.com (Hang Zhang) — Sun, 18 Nov 2018 01:34:00 +0000

A (not) tedious work

Recently I started to port my project to DirectX 11, and it has a lot of interesting differences from OpenGL, such as the coordinate and matrix conventions, stronger type safety requirements, better shader resource management (I haven’t tried OpenGL’s SSBO yet, but the constant buffer is really easier to wrap into an elegant layer) and so on. One thing I got a little stuck on is normal mapping: I used on-the-fly tangent generation in GLSL, and it confused me quite a bit at first when I rewrote it in HLSL.

Always Review

Basically, the normal vector is a unit direction vector perpendicular to a surface (or, mathematically speaking, the normal vector $\vec{n}$ at a point $\left( {{x_0},{y_0},{z_0}} \right)$ of the level surface $f\left( {x,y,z} \right) = k$ of a scalar field $f$ points along the gradient $\nabla f = \left\langle {{f_x},{f_y},{f_z}} \right\rangle$ at that point, which is always orthogonal to the surface). With only this surface normal in hand we can do old-style flat shading, which gives one discrete color per face. Later Gouraud (maybe not him) introduced the vertex normal, the average of the normals of the surfaces that share the vertex; thanks to interpolation in the pixel processing stage, it gives a smoother shading result.

Typically we get the precomputed normal vectors of a model from DCC tools such as Blender, 3ds Max or Maya (or generate them ourselves with some cross products on the CPU side), and they are stored in the local space of the model. If we want to render anything with their help in world space (strictly speaking, finally in screen space), we need to transform these normal vectors.

The first idea is to use the model’s local-to-world transformation matrix. Since a normal represents a direction, its w component is 0 in homogeneous 4D space, so the translation part (the last column of the transformation matrix) has no effect on it. The rotation part is what we want to apply, but if the model has a non-uniform scale, the normal vector gets sheared too, and its direction changes! So we need to figure out how to cancel the unwanted scaling on the normal vectors only. There are multiple solutions:

Multiply with the inverse of the scale matrix, $N_{ws} = S^{-1} * M * N_{ls}$;
Don’t multiply with the full transformation matrix in the first place; instead, multiply the local space normals with the rotation matrix only, $N_{ws} = R * N_{ls}$;
A classic trick called the “inverse transpose normal matrix”: multiply the local space normals with the transpose of the inverse of the model’s local-to-world transformation matrix, $N_{ws} = M^{-1T} * N_{ls}$. Since it looks less intuitive, you may ask why this works. The inverse operation inverts the scaling, and because the scaling part sits on the diagonal of the matrix (the coordinate system’s axes, actually), the transpose doesn’t affect it. The rotation matrix is orthogonal, so its inverse is its transpose; inverting and then transposing it is the same as inverting it twice, which means nothing happens to the rotation part! Let’s write it as a formula (taking $M = S * R * T$): $M^{-1T} = S^{-1T} * R^{-1T} * T^{-1T}$, and since $S^{-1T} = S^{-1}, R^{-1} = R^{T}$, we get $M^{-1T} = S^{-1} * R * T^{-1T}$. In practice we often shrink the 4x4 matrix to 3x3 and only use the upper-left part, which removes the useless translation components, so $M_{3 \times 3}^{-1T} = S^{-1} * R$, and with this matrix we achieve our goal. Note that the inverse scale stays in this matrix: it is exactly what keeps the normal perpendicular to a non-uniformly scaled surface, so methods 1 and 2 give the same direction as method 3 only when the scale is uniform, and the result should be re-normalized either way.
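
To see the difference between the methods in numbers, here is a minimal sketch in C++ with a tiny hand-rolled 3x3 matrix type (Vec3, Mat3 and the helper functions are made up for this illustration and use the column-vector convention $M * v$; they are not from my engine). It builds the inverse transpose from cross products of the rows, which is the cofactor matrix divided by the determinant:

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
struct Mat3 { Vec3 r0, r1, r2; }; // three rows

double dot(const Vec3& a, const Vec3& b)
{
	return a.x * b.x + a.y * b.y + a.z * b.z;
}

Vec3 cross(const Vec3& a, const Vec3& b)
{
	return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

Vec3 scaled(const Vec3& v, double s)
{
	return { v.x * s, v.y * s, v.z * s };
}

// Matrix times column vector.
Vec3 mul(const Mat3& m, const Vec3& v)
{
	return { dot(m.r0, v), dot(m.r1, v), dot(m.r2, v) };
}

// Inverse transpose of a 3x3 matrix: the cofactor matrix (whose rows are
// cross products of the other two rows) divided by the determinant.
Mat3 inverseTranspose(const Mat3& m)
{
	const Vec3 c0 = cross(m.r1, m.r2);
	const Vec3 c1 = cross(m.r2, m.r0);
	const Vec3 c2 = cross(m.r0, m.r1);
	const double invDet = 1.0 / dot(m.r0, c0);
	return { scaled(c0, invDet), scaled(c1, invDet), scaled(c2, invDet) };
}
```

Take a pure scale of (2, 1, 1), a surface tangent (1, 1, 0) and its normal (1, -1, 0). The tangent becomes (2, 1, 0) in world space. The untouched normal (what a rotation-only transform gives here) has a dot product of 1 with it, the normal transformed by the model matrix itself gives 3, and only the inverse transpose result (0.5, -1, 0) stays perpendicular.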

Since most rendering pipelines (as far as I know) handle scene transforms on the CPU side and always pass the already multiplied transformation matrix to the GPU, method 3 is the common one. But HLSL doesn’t provide a native inverse function like GLSL does, which means you either have to build a special hand-written version for this purpose or invert the matrix beforehand on the CPU side. With respect to this little drawback, method 2 looks like another good choice, because you don’t need any inverse operation on the GPU; the cost becomes an additional cbuffer entry passed to the GPU. Currently I use method 2, and unless it hits me with some painful issues I don’t think I will change to method 3 in DirectX.

Messy micro

And then normal mapping became a little bit annoying. The technique of storing geometric detail as perturbations of the surface normal goes back to Blinn’s bump mapping in the ’70s; what we commonly call normal mapping now is one kind of bump mapping, and it gives a huge improvement in surface detail without adding more vertices that would hurt performance. Since the common practice is to store the normal texture in tangent space, we can’t apply it to our vertex normals unless we have the transformation from tangent space to the model’s local/world space. This means we need to construct a space that treats the vertex normal as the positive Z axis, and then either transform the normal texture data to world space, or transform the other light/position data to tangent space with the inverse matrix. There are again some different approaches:

Use a precomputed tangent vector together with the normal vector to calculate the bitangent vector, and use these 3 as the axes to construct a tangent/TBN space;
Compute the tangent vector on-the-fly from the texture UV coordinates and the vertex data. The edge of a triangle can be calculated from the vertex positions, and at the same time it can be expressed with the TBN space’s T and B axes and the UV coordinate differences, so we can combine the two to get T and B. In formulas, $\vec{E} = \vec{p_1} - \vec{p_2}$ and $\vec{E} = \Delta U \vec{T} + \Delta V \vec{B}$; with two edges of the triangle this is a solvable linear system.
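
The linear system of the second approach can be sketched on the CPU side like this. It is only an illustration of the math under my own naming (Vec2, Vec3 and computeTangentBitangent are hypothetical, and I take both edges from the first vertex); it solves $\vec{E_1} = \Delta U_1 \vec{T} + \Delta V_1 \vec{B}$ and $\vec{E_2} = \Delta U_2 \vec{T} + \Delta V_2 \vec{B}$ by inverting the 2x2 UV matrix:

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { double u, v; };
struct Vec3 { double x, y, z; };

// Solve the two edge equations of one triangle for the unnormalized T and B.
// Returns false when the UV mapping is degenerate (zero determinant).
bool computeTangentBitangent(const Vec3 p[3], const Vec2 uv[3], Vec3& T, Vec3& B)
{
	// Two triangle edges in position space and in UV space.
	const Vec3 e1 = { p[1].x - p[0].x, p[1].y - p[0].y, p[1].z - p[0].z };
	const Vec3 e2 = { p[2].x - p[0].x, p[2].y - p[0].y, p[2].z - p[0].z };
	const double du1 = uv[1].u - uv[0].u, dv1 = uv[1].v - uv[0].v;
	const double du2 = uv[2].u - uv[0].u, dv2 = uv[2].v - uv[0].v;

	const double det = du1 * dv2 - du2 * dv1;
	if (std::fabs(det) < 1e-12)
		return false;
	const double inv = 1.0 / det;

	// [T; B] = inverse([du1 dv1; du2 dv2]) * [e1; e2]
	T = Vec3{ (dv2 * e1.x - dv1 * e2.x) * inv, (dv2 * e1.y - dv1 * e2.y) * inv, (dv2 * e1.z - dv1 * e2.z) * inv };
	B = Vec3{ (du1 * e2.x - du2 * e1.x) * inv, (du1 * e2.y - du2 * e1.y) * inv, (du1 * e2.z - du2 * e1.z) * inv };
	return true;
}
```

For a triangle lying in the XY plane whose UVs equal its XY coordinates, T comes out as the X axis and B as the Y axis, as expected. In a real pipeline the results would still be normalized and orthogonalized against the vertex normal.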

In practice, if we choose method 1, we have to precompute the tangent vectors and store them offline somewhere, then pass them to the GPU at runtime. And if we choose method 2 in its original form, we calculate them on the CPU side and then send them to the GPU, so basically it has no implementation difference from method 1. I chose method 2 but implemented it in the shader directly rather than on the CPU, which spends less bandwidth and makes the vertex data structure tighter. The idea comes from a nice blog post: it uses the built-in shader partial derivative functions to calculate the screen space gradients, then does some cross products to get the final result. But since OpenGL users commonly use a right-handed system (RHS) and DirectX users commonly use a left-handed system (LHS), and because their window/texture coordinate systems differ too, it confused me a lot at the beginning.

OpenGL texture coordinates start at the bottom-left corner, and when 2D texture data (a 1D array in memory) is submitted to OpenGL, it fills the texture from the bottom-left corner upwards. In DirectX the texture coordinates start at the top-left corner, and the texture is filled from the top-left corner downwards. That means if you sample the same texture data with the same coordinates, OpenGL and DirectX return the same result, so I don’t need to change anything related to UV coordinates for any loaded texture. Misunderstanding this led me to write a wrong UV conversion step at first.

Then the partial derivative functions. These are easier to map from OpenGL to DirectX: ddx_fine() and ddy_fine() play the same role as dFdx() and dFdy() (strictly, their exact GLSL counterparts are dFdxFine() and dFdyFine(), since the precision of the plain dFdx()/dFdy() is left to the implementation). The T and B axes construction is a little bit tricky, though: I chose to use LHS in DirectX, so I have to flip T and B while the other implementation details stay the same as in OpenGL. Also, the final 3x3 TBN matrices in OpenGL and DirectX are the transpose of each other, but each stays consistent with its own math convention, so I just need to follow the matrix-vector multiplication rules of each side as usual.

Here are the corresponding code pieces:

	// GLSL
	// get edge vectors of the pixel triangle
	vec3 dp1 = dFdx(thefrag_WorldSpacePos.xyz);
	vec3 dp2 = dFdy(thefrag_WorldSpacePos.xyz);
	vec2 duv1 = dFdx(thefrag_TexCoord);
	vec2 duv2 = dFdy(thefrag_TexCoord);

	// solve the linear system
	vec3 N = normalize(thefrag_Normal);
	vec3 dp2perp = cross(dp2, N);
	vec3 dp1perp = cross(N, dp1);
	vec3 T = normalize(dp2perp * duv1.x + dp1perp * duv2.x);
	vec3 B = normalize(dp2perp * duv1.y + dp1perp * duv2.y);

	mat3 TBN = mat3(T, B, N);

	vec3 WorldSpaceNormal = normalize(TBN * (texture(uni_normalTexture, thefrag_TexCoord).rgb * 2.0 - 1.0));


	// HLSL
	// get edge vectors of the pixel triangle
	float3 dp1 = ddx_fine(input.thefrag_WorldSpacePos);
	float3 dp2 = ddy_fine(input.thefrag_WorldSpacePos);
	float2 duv1 = ddx_fine(input.thefrag_TexCoord);
	float2 duv2 = ddy_fine(input.thefrag_TexCoord);

	// solve the linear system
	float3 N = normalize(input.thefrag_Normal);
	float3 dp2perp = cross(dp2, N);
	float3 dp1perp = cross(N, dp1);
	// T and B are flipped for the left-handed system
	float3 T = -normalize(dp2perp * duv1.x + dp1perp * duv2.x);
	float3 B = -normalize(dp2perp * duv1.y + dp1perp * duv2.y);

	float3x3 TBN = float3x3(T, B, N);

	float3 normalInWorldSpace = normalize(mul(t2d_normal.Sample(SampleTypeWrap, input.thefrag_TexCoord).rgb * 2.0f - 1.0f, TBN));

Physically Based Rendering - Material

zhangandhang@gmail.com (Hang Zhang) — Sun, 12 Aug 2018 17:36:00 +0000

I learned the shading magic the way most others did, from the classic Blinn-Phong model to more complex ones like the Cook-Torrance model, but my understanding was always confined to the common practices shaped by our slow computers and our eager desire for more realistic results. With these compromises, tricks, and hacks we may achieve lots of cool and amazing stuff at first, but without a clear bird’s-eye view (for example people like me, who always suffer among the formulae) we often lose ourselves inside shader code and strange artifacts, and can’t find solutions without painful testing and doubting.

Then I was thinking, why not go back to the basics of the physics, and learn and implement things again? Let’s forget for a moment what I’ve already written and construct a shading pipeline from scratch, following the physical interpretations, and then look for solutions that fit into each 16 ms frame. Let’s go back to the start.

So, what is light?

“…In physical optics, light is modeled as an electromagnetic transverse wave, a wave that oscillates the electric and magnetic fields perpendicularly to the direction of its propagation…” - pg. 293, “Real-Time Rendering”, 4th Edition.

(The book has already become my bible: it’s a nice working reference, a balanced textbook and a well-formed dictionary. I’ll quote a lot from it in this ctrl-c + ctrl-v style post (Why not?))

Light, a kind of electromagnetic wave whose properties are mainly determined by its wavelength, is one of the fundamental things in our universe, and all of our beautiful rendering work comes down to calculating how it travels. But think of it this way: it only got the special name “light” because it can be received by our eyes (a psychophysical phenomenon). If we treat it as nothing special in our theoretical discussion, would things become easier and more general to handle?

Let’s say the basic electromagnetic mechanisms apply to light without exception (of course they do). Then we could model a general solution in the computer, and feed it with the wavelengths $\lambda$ of the light emitted from some sources and the relevant properties of the objects that stop the light’s propagation, to calculate the result (later this may also include the influence of the space they are in; for now let’s just work inside an ideal static vacuum). The classical physical model of electromagnetism is described successfully by Maxwell’s equations, and with them we could figure out how the electric field propagates through space, and thus get the correct behaviour of the electromagnetic wave in our simulation box.

But unfortunately we don’t have the huge computational resources needed to run a field propagation simulation in real time on a personal computer (representing a non-analytical electric field alone requires at least a 3D data structure), so we must accept some compromises. Luckily, physicists have already narrowed the scope of the problem for us: instead of working with the electromagnetic field itself, we can use radiation methods that focus on the energy only, which simplifies things to just 2 parties, the emitter and the receiver of the electromagnetic wave.

A new measurement

“…Light waves are emitted when the electric charges in an object oscillate. Part of the energy that caused the oscillations—heat, electrical energy, chemical energy—is converted to light energy, which is radiated away from the object. …” - pg. 296, “Real-Time Rendering”, 4th Edition.

This brings us to radiometry, which deals with the measurement of electromagnetic radiation. In common usage there are several radiometric quantities which measure electromagnetic radiation energy with respect to other basic physical quantities such as time, distance, area and angle. Below I list the ones that matter most for our rendering business, with reference to the book and Wikipedia.

Quantity	Symbol	Unit	Unit symbol	Notes
Radiant energy	$Q_e$	joule	$J$	Energy of electromagnetic radiation
Radiant flux	$\Phi_e$	joule per second, or watt	$\frac{J}{s}$ or $W$	Also called radiant power
Radiant intensity	$I_e$	watt per steradian	$\frac{W}{sr}$	The steradian is the 3D (solid angle) counterpart of the 2D radian
Irradiance / flux density	$E_e$ or $M_e$	watt per square metre	$\frac{W}{m^2}$	Flux per unit area: $E_e$ (irradiance) when received by a surface, $M_e$ (radiant exitance) when leaving it
Radiance	$L_e$	watt per steradian per square metre	$\frac{W}{m^2 \cdot sr}$	Flux per unit solid angle per unit projected area

We won’t calculate radiant energy directly, because we usually deal with objects that have a shape and events that happen over a period of time; it’s more convenient to factor these quantities out first, which means we had better use radiant intensity, irradiance (flux density) and radiance.

To evaluate how a light source emits light or energy, we first need to define what a light source is. In this blog post I only discuss the ideal punctual light source, which is infinitely small, the same as a point in space. I also idealize it as radiating omnidirectionally, which means the radiation doesn’t vary with direction.

We can deduce the definition of radiant intensity first: imagine a unit sphere surrounding the point light source; the flux emitted per unit solid angle is $I_e = \frac{d\Phi}{d\omega}$.

Similarly, we get the irradiance or flux density (or call it radiant exitance if we want to emphasize emitting rather than receiving, though that assumes an area light source), $M_e = \frac{d\Phi}{dA}$.

And then radiance is the flux per unit solid angle per unit projected area, $L_e = \frac{d^2\Phi}{ d\omega dA \cos\theta}$. We now need to consider the angle $\theta$ between the surface normal and the direction of the solid angle, because the effective area is not the original unit area but its projection ($A\cos\theta$ is called the projected area), so we add a cosine to it. This kind of cosine weighting is known as Lambert’s cosine law.
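
For the ideal omnidirectional point light these definitions can be evaluated directly. The sketch below is my own illustration with hypothetical function names: it spreads the flux evenly over the $4\pi$ steradians of the full sphere, and uses $d\omega = \frac{dA \cos\theta}{r^2}$ to get the irradiance on a small surface patch at distance $r$ (the familiar inverse-square falloff with the cosine factor):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr double kPi = 3.14159265358979323846;

// Radiant intensity (W/sr) of an isotropic point light with the given
// radiant flux (W): the flux is spread evenly over 4*pi steradians.
double radiantIntensity(double fluxWatts)
{
	return fluxWatts / (4.0 * kPi);
}

// Irradiance (W/m^2) on a patch at `distance` metres whose normal makes
// an angle with the light direction given by `cosTheta`.
// Patches facing away from the light receive nothing.
double irradiance(double intensity, double distance, double cosTheta)
{
	return intensity * std::max(cosTheta, 0.0) / (distance * distance);
}
```

A light with an intensity of 1 W/sr gives 1 W/m^2 on a patch facing it at 1 m, and a quarter of that at 2 m.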

Through time and space, once and forever?

“…The oscillating electrical field pushes and pulls at the electrical charges in the matter, causing them to oscillate in turn. The oscillating charges emit new light waves, which redirect some of the energy of the incoming light wave in new directions. This reaction, called scattering, is the basis of a wide variety of optical phenomena. …” - pg. 297, “Real-Time Rendering”, 4th Edition.

If we emit light/energy in a vacuum, it never changes its direction or energy unless it meets some obstruction (for example some molecules or small dust floating in space, some giant gas planet like Saturn, or our body/eyes); then it is absorbed or scattered depending on the characteristics of the obstruction. Actually, this is all we need to care about: the obstruction IS exactly the receiver! If we can model the receiver, we solve the problem. But again, our limited computational resources don’t allow us to simulate every molecule; instead, we have to follow some macro-scale rules to abstract them. We call the single objects or groups of tiny objects that influence the wave propagation particles, and the volume filled with particles a medium.

We live on Earth, where air and water are the most common noticeable media, and they have inspired aesthetic creation for thousands of years. To measure how they influence light propagation, we need to define a kind of ratio between the original light and the affected light.

“…The ratio of the phase velocities of the original and new waves defines an optical property of the medium called the index of refraction (IOR) or refractive index, denoted by the letter $n$. Some media are absorptive. They convert part of the light energy to heat, causing the wave amplitude to decrease exponentially with distance. The rate of decrease is defined by the attenuation index, denoted by the Greek letter $\kappa$ (kappa). Both n and $\kappa$ typically vary by wavelength. Together, these two numbers fully define how the medium affects light of a given wavelength, and they are often combined into a single complex number $n + i \kappa$, called the complex index of refraction. …” - pg. 298, “Real-Time Rendering”, 4th Edition.

As the book introduces, we use the complex IOR to represent the characteristics of a medium, but in the practice of local illumination we often only care about the real part $n$, since attenuation matters mostly in conductors, and we typically assume air as the medium the light comes from (volume rendering is a different story, but I won’t cover that topic here). Then we simply get the IOR by $n = \frac{c}{v}$, where $c$ is the speed of light in vacuum and $v$ is the speed of light in the medium. For a more detailed discussion of the complex IOR I recommend this post by Sébastien Lagarde, in which he covers all the other possible medium interfaces.

Then we have a physical law called Snell’s law, which relates the incident angle to the refracted angle through the IORs of the two media, written as $\sin(\theta_t) = \frac{n_1}{n_2}\sin(\theta_i)$, where $\theta_i$ is the angle between the interface normal and the incident light direction, and $\theta_t$ is the angle between the negated interface normal and the refracted/transmitted light direction. We denote the index of refraction on the “outside” (the side where the incoming, or incident, wave originates) as $n_1$ and the index of refraction on the “inside” (where the wave will be transmitted after passing through the surface) as $n_2$.
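
As a small numeric illustration of Snell’s law, here is a sketch with a hypothetical helper refractedAngle. It also handles the case the formula implies but the text doesn’t spell out: when $\frac{n_1}{n_2}\sin(\theta_i)$ exceeds 1 there is no refracted direction at all (total internal reflection), which can only happen when the light goes from the denser medium to the thinner one:

```cpp
#include <cassert>
#include <cmath>

// Snell's law: sin(thetaT) = n1 / n2 * sin(thetaI), angles in radians.
// Writes the refracted angle to `thetaT` and returns true, or returns
// false when no refracted ray exists (total internal reflection).
bool refractedAngle(double n1, double n2, double thetaI, double& thetaT)
{
	const double sinThetaT = n1 / n2 * std::sin(thetaI);
	if (sinThetaT > 1.0)
		return false;
	thetaT = std::asin(sinThetaT);
	return true;
}
```

For light entering glass with an IOR of about 1.5 from air (IOR about 1.0) at 30 degrees, the sine of the refracted angle is one third, so the ray bends towards the normal; going the other way at 60 degrees there is no transmitted ray.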

Now if we know the angles (here we abstract the light to a single monochromatic beam, which has an “angle” and a single wavelength, although we’ve already discussed what it is like in reality) and the IORs of the media, can we calculate anything useful to display on the screen? Well, with Snell’s law we can figure out how the direction of the light changes, but we don’t know how much the energy changes, and if the light is not monochromatic, we don’t know which range of wavelengths is reflected or refracted. And the most important problem is that we receive light through our eyes, but not always directly from the light source: what we want to receive is the light “reflected” from different surfaces, rather than the light emitted directly from the sun or the bulbs!

So actually we need to figure out this scenario: light is emitted from a source, meets some surfaces and changes, then somehow travels into our eyes. When the light “hits” a surface we can try to apply Snell’s law. But Snell’s law only tells us how the light’s direction changes at the medium interface in a very ideal situation; we don’t know what happens next. Since conservation of energy always holds in this universe (at the moment I write this line it is still valid), the reflected light must be weaker than the incident light, but by how much? And where does the refracted part go?

That requires further information about how the light keeps traveling. As I mentioned before, a medium is made up of particles, so the size of the particles and the distance between them should influence the light transmission. But since it’s impossible to calculate everything that happens inside the medium, we choose to model only the region around its surface, which contributes much more to the light reaching our eyes.

“…In rendering, we typically use geometrical optics, which ignores wave effects such as interference and diffraction. This is equivalent to assuming that all surface irregularities are either smaller than a light wavelength or much larger. …” - pg. 303, “Real-Time Rendering”, 4th Edition.

“…surface irregularities much larger than a wavelength change the local orientation of the surface. When these irregularities are too small to be individually rendered—in other words, smaller than a pixel—we refer to them as microgeometry. …” - pg. 304, “Real-Time Rendering”, 4th Edition.

“…For rendering, rather than modeling the microgeometry explicitly, we treat it statistically and view the surface as having a random distribution of microstructure normals. As a result, we model the surface as reflecting (and refracting) light in a continuous spread of directions. The width of this spread, and thus the blurriness of reflected and refracted detail, depends on the statistical variance of the microgeometry normal vectors—in other words, the surface microscale roughness. …” - pg. 304, “Real-Time Rendering”, 4th Edition.

The book introduces the microgeometry theory, which struck a balance between computational burden and credibility of the result when it was adopted by the real-time rendering community. Since our screen has a limited number of discrete pixels, it’s meaningless and wasteful to calculate anything too accurately; instead we use statistical models as a cheaper route. You may have heard of statistical methods used in rendering before, the most famous being the Monte Carlo method; here we follow a similar path to figure out how to get the final light.

The Rendering Equation

Now we can introduce an almost accurate representation of the interaction between light and surfaces. For a single light of wavelength $\lambda$ at time $t$ which hits a unit area $A$ of the surface from a direction $\omega_i$, is “changed” by the surface and finally “re-emitted” towards our eyes in a direction $\omega_o$, we can write the formula $L_o(A,\omega_o,\lambda,t) = f_r L_i(A,\omega_i,\lambda,t)$: the incident radiance is weighted by a factor $f_r$, which describes the optical characteristics of the surface, and contributes to the outgoing radiance. Then we integrate over every possible incident direction on the hemisphere $\Omega$ centered at the surface normal $n$, and add the light emitted by the surface itself, to get a more general formula. One such summarizing formula for rendering is The Rendering Equation, $L_o(p,\omega_o,\lambda,t) = L_e(p,\omega_o,\lambda,t) + \int\limits_{\Omega} f_r(p,\omega_i,\omega_o,\lambda,t) L_i(p,\omega_i,\lambda,t) (n \cdot \omega_i) d\omega_i$. Basically it is a transcript of what I described before, with the simplification that we ignore the surface area and shrink it to a point $p$.

Furthermore, we would like to treat the light-surface interaction as an instantaneous event, so we can omit $t$. And because we finally send RGB color data to the screen, a continuous spectrum is fairly unnecessary, so we replace it with 3 individual formulae of the same form, one per color channel, and can just write one of them. Dropping the emission term as well, we arrive at The Reflectance Equation: $L_o(p,\omega_o) = \int\limits_{\Omega} f_r(p,\omega_i,\omega_o) L_i(p,\omega_i) (n \cdot \omega_i) d\omega_i$.

“…Local reflectance is quantified by the bidirectional reflectance distribution function (BRDF), denoted as $f(l, v)$. …” - pg. 310, “Real-Time Rendering”, 4th Edition.

Now if we know what the incident light is, we only need a weight $f_r$ that describes the entire optical characteristic of the medium to get the final result. This $f_r(p,\omega_i,\omega_o)$, as the book introduced, is called the BRDF; combined with the microgeometry method I quoted above, we finally have a chance to write some code to calculate it. But first, let’s look back at the surface model we use here. We say “surface”, not “interface”: “surface” indicates that we are inspecting a region around the “interface”, so it should have more properties than what Snell’s law can describe. Let’s take a look:

The medium won’t always absorb or scatter away all the refracted light inside it (most non-conductors won’t, for example, since they have too few free electrons to do the business); some of that light leaves the medium and re-enters the medium it came from, a phenomenon we call Subsurface Scattering. A typical practice is to separate the BRDF into 2 parts: a reflection part as specular and a local subsurface scattering part as diffuse. I will use the notation $f_s$ and $f_d$ for them later.

Sometimes we also need to calculate more general subsurface scattering, and then we use a bidirectional scattering-surface reflectance distribution function (abbr. BSSRDF) instead; for light that travels through the entire medium and leaves from another surface we use a bidirectional transmittance distribution function (abbr. BTDF). Together they are called BxDF.

To make a BxDF physically plausible, it has to satisfy 3 conditions:

$f_r(\omega_i,\omega_o) \ge 0$, a BxDF never yields a negative weight;
$f_r(\omega_i,\omega_o) = f_r(\omega_o,\omega_i)$, called Helmholtz reciprocity: simply speaking, if we swap the incident and the observing directions we should get the same result;
$\forall \omega_i, \int\limits_{\Omega}f_r(\omega_i,\omega_o)(n \cdot \omega_o) d\omega_o \le 1$, the energy conservation law: for any incident direction, the total weight integrated over all outgoing directions never exceeds 1, so the outgoing light energy never exceeds the incoming light energy.

In practice there is a convenient way to check whether a BxDF conserves energy, called the White furnace test; I’ll talk about it later.
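As a small preview of that test, here is a minimal Python sketch (not taken from any engine) that numerically integrates a BRDF times the cosine term over the outgoing hemisphere. The helper names `directional_albedo`, `lambert` and `unnormalized` are hypothetical, and the midpoint rule is just one simple choice of quadrature. A white Lambert surface should return roughly 1, while a constant BRDF without the $\frac{1}{\pi}$ factor returns roughly $\pi$ and so fails the energy conservation condition.

```python
import math

def directional_albedo(brdf, theta_i, n_theta=256, n_phi=64):
    """Midpoint-rule estimate of the integral of brdf * cos(theta_o) over the hemisphere."""
    d_theta = (math.pi / 2) / n_theta
    d_phi = (2 * math.pi) / n_phi
    total = 0.0
    for a in range(n_theta):
        theta_o = (a + 0.5) * d_theta
        for b in range(n_phi):
            phi_o = (b + 0.5) * d_phi
            # Solid angle element: d_omega = sin(theta) * d_theta * d_phi
            d_omega = math.sin(theta_o) * d_theta * d_phi
            total += brdf(theta_i, theta_o, phi_o) * math.cos(theta_o) * d_omega
    return total

def lambert(rho):
    """Lambert BRDF: constant rho / pi for every pair of directions."""
    return lambda theta_i, theta_o, phi_o: rho / math.pi

def unnormalized(rho):
    """Deliberately broken BRDF: the 1 / pi normalization is missing."""
    return lambda theta_i, theta_o, phi_o: rho

white = directional_albedo(lambert(1.0), theta_i=0.5)
broken = directional_albedo(unnormalized(1.0), theta_i=0.5)
print("white Lambert albedo:", white)   # close to 1: passes
print("unnormalized albedo:", broken)   # close to pi: creates energy
```

Any other BRDF can be plugged in as a function of `(theta_i, theta_o, phi_o)`; peaky specular lobes would need a finer grid or importance sampling.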

BRDF

A BRDF can be defined as $f_r(\omega_i,\omega_o) = \frac{dL_o(\omega_o)}{L_i(\omega_i)\cos\theta_i d\omega_i}$, which makes it possible to measure in the real world; MERL, for example, is a database of measured BRDFs. We can also write the BRDF as $f_r(\omega_i,\omega_o) = f_d(\omega_i,\omega_o) + f_s(\omega_i,\omega_o)$, to indicate that we separate it into diffuse and specular parts and solve them independently.

Let’s use $n$ as the macro surface normal vector, $m$ as the micro surface normal vector, and $h = \frac{\omega_o + \omega_i}{||\omega_o + \omega_i||}$ as the normalized halfway vector between the view direction and the light direction.

Also, to discuss BRDFs more practically, I’d like to introduce the Directional albedo, which measures how much of the light coming from a given direction is reflected at all, into any outgoing direction in the hemisphere around the surface normal. As a formula, $R_s = \int\limits_{\Omega}f(l, v)(n·s)ds$; if the BRDF obeys Helmholtz reciprocity, $s$ can be either $l$ or $v$.

Diffuse part

Simple Lambert model

$f_{Lambert} = \frac{\rho} {\pi}$, where $\rho$ is the “color” of the surface; strictly speaking it is the fraction of the incident light that the surface returns through subsurface scattering (the diffuse albedo). The $\pi$ comes from treating the surface as Lambertian: the BRDF is constant, independent of the view and light directions, and integrating that constant with the cosine term over the hemisphere yields the factor $\pi$ that has to be divided out.

Let’s visualize its directional albedo: (BRDF Explorer/Octave WIP)

This simple Lambert diffuse model ignores the microscopic variation of the surface. Combined with Lambert’s cosine law we already get $L_o(\omega_o) = \int\limits_{\Omega}\frac{\rho} {\pi}L_i(\omega_i)\cos\theta_i d\omega_i$. Since the integral of $\cos\theta_i$ over the hemisphere is $\pi$, it cancels the $\pi$ in the BRDF under uniform illumination; for a single punctual light of unit intensity (with the usual convention that folds this $\pi$ into the light color) we finally get $L_o(\omega_o) = \rho \cos\theta_i$, exactly what we first learned in real-time rendering class 101!
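The factor $\pi$ used in that cancellation can be checked by writing the cosine-weighted hemisphere integral in spherical coordinates, with $d\omega = \sin\theta\, d\theta\, d\phi$:

$$\int\limits_{\Omega}\cos\theta\, d\omega = \int_0^{2\pi}\int_0^{\pi/2}\cos\theta\,\sin\theta\, d\theta\, d\phi = 2\pi \cdot \frac{1}{2} = \pi$$

So for a constant incident radiance $L_i$ the Lambert BRDF returns $L_o = \frac{\rho}{\pi} \cdot L_i \cdot \pi = \rho L_i$.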

Cook-Torrance model

The lack of microscopic detail in the simple Lambert model keeps us from getting more realistic results. Luckily, researchers and scientists have worked on this for decades and we already have some advanced replacements for the simple Lambert model; the one most commonly used today in the real-time rendering community is built on the microgeometry theory, which we’ll refer to as microfacet theory here.

R. L. Cook and K. E. Torrance [CT82] wrote a paper in the 80’s which is the root of the most widely adopted microfacet model today. With the other nice references [Hei14][LR14] we can state the general Cook-Torrance model as $f_{Cook-Torrance}=\frac{1}{|n·\omega_o||n·\omega_i|}\int\limits_{\Omega}f_m(\omega_o,\omega_i,m)G(\omega_o,\omega_i,m) D(m,\alpha)\langle \omega_o·m\rangle \langle \omega_i·m\rangle dm$. It emphasizes that the macro BRDF is tied to the micro BRDF $f_m$ and can be calculated as an integral over the microfacet normals $m$, with the additional weight functions $G$ and $D$ keeping the micro-to-macro mapping correct.

The $D$ function is called the Distribution function or Normal Distribution Function (abbr. NDF). It gives the spatial/statistical distribution of the micro normal $m$ around the macro normal $n$. The $\alpha$ here is a user-controlled variable which describes how “rough” the surface is, so we call it roughness (or, inverted, smoothness) in practice; typically we build a non-linear mapping between the user-controlled roughness and the real $\alpha$. We use statistical functions in practice since that is the only feasible way to calculate in real-time; for how to map from a spatial function to a statistical one I recommend reading [Hei14] for a detailed understanding.

The $G$ function is called the Geometry function or Masking-shadowing function. Strictly speaking we had better call the term used in shaders $V$, the Visibility function, since the Geometry function is usually just one ingredient of the Visibility function, but in the literature the two names are often used interchangeably. It gives a weight for how the microfacets influence each other by masking along the view direction and shadowing along the incident light direction, and it should be derived to match the $D$ function we chose.

I’ll list some commonly used $D$ and $G$ functions below:

$D$ function

Gaussian $D$ function [CT82]

$D_{Gaussian}=ce^{-(\alpha/m)^2}$

Beckmann $D$ function [CT82]

$D_{Beckmann}=\frac{1}{m^2\cos^4\alpha}e^{-[(\tan\alpha)/m]^2}$

For these two $D$ functions, $c$ is an optional scaling factor, $\alpha$ is the angle between $n$ and $h$, and $m$ is the RMS slope of the microfacets. Since they are computationally expensive it’s rare to see them in real products, but we sometimes treat them as an offline reference.

Berry $D$ function [Bur12]

$D_{Berry}=\frac{c}{((n·h)^2(\alpha^2-1)+1)}$

GGX/Trowbridge-Reitz $D$ function [Bur12]

$D_{TR}=\frac{c}{((n·h)^2(\alpha^2-1)+1)^2}$

Generalized GGX/Trowbridge-Reitz $D$ function [Bur12]

$D_{GTR}=\frac{c}{((n·h)^2(\alpha^2-1)+1)^\gamma}$

For these three $D$ functions, $c$ is an optional scaling factor, $\alpha$ is the roughness and $\gamma$ is an optional exponent. If we choose $\gamma=10$ the result is fairly close to the Beckmann $D$ function. As far as I know they are the most commonly used $D$ functions, and they give a “long-tailed” visual appearance.
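To make the role of the scaling factor $c$ concrete, here is a minimal Python sketch that assumes the common normalization $c=\frac{\alpha^2}{\pi}$ for $D_{TR}$ (the text above leaves $c$ open, so this choice is an assumption). With it, the projected area of the microfacets, $\int\limits_{\Omega}D(h)(n·h)dh$, should come out as 1 for any roughness. The function names are hypothetical.

```python
import math

def d_ggx(n_dot_h, alpha):
    """GGX/Trowbridge-Reitz NDF with the assumed scaling c = alpha^2 / pi."""
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)

def projected_area(alpha, steps=20000):
    """Midpoint-rule estimate of the integral of D(h) * (n.h) over the hemisphere."""
    d_theta = (math.pi / 2) / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        cos_t = math.cos(theta)
        total += d_ggx(cos_t, alpha) * cos_t * math.sin(theta) * d_theta
    # The integrand does not depend on phi, so the phi integral contributes 2 * pi.
    return 2.0 * math.pi * total

for alpha in (0.1, 0.5, 1.0):
    print(alpha, projected_area(alpha))  # each value should be close to 1
```

A $D$ function that fails this check changes the overall brightness of the highlight as roughness changes, which is why the normalization matters in practice.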

$G$ function

Cook-Torrance $G$ function [CT82]

$G_{cook-torrance} = \min\left\{1,\frac{2(n·h)(n·\omega_o)}{(\omega_o·h)},\frac{2(n·h)(n·\omega_i)}{(\omega_o·h)}\right\}$

This one comes from the original paper itself, but I haven’t seen it in any production-level shader yet, since it can be reworked into other, more optimal versions.

Smith $G$ function [Smith67] [Hei14]

$G_{Smith}=\frac{\chi^+(\omega_o·m)}{1 + \Lambda(\omega_o)}$

The original paper is not publicly available, so I list the derived version from a later reference. Here the numerator $\chi^+(u)$ is a Heaviside function: $\chi^+(u)=1$ when $u>0$ and $\chi^+(u)=0$ otherwise, which enforces sidedness, while the $\Lambda()$ function is an integral over the slopes of the microsurface and gives the masking probability. So unless we provide a concrete $\Lambda()$ function, this formula can’t be translated into a shader.

Schlick $G$ function [Sch94]

$G_{Schlick}=\frac{n·\omega_o}{(n·\omega_o)(1-k)+k}·\frac{n·\omega_i}{(n·\omega_i)(1-k)+k}$

$k$ is derived from the user-controlled roughness; in practice we can remap roughness to $\alpha=\frac{Roughness+1}{2}$ [Bur12] and set $k=\alpha^2/2$ [Kar13] to get a better non-linear response. The Schlick $G$ function is an approximation of the Smith $G$ function in $[0,1]$, which is quite friendly for our use case because its parameter requirements are modest.
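The remapping is easy to get wrong by a factor, so here is a minimal Python sketch of the Schlick $G$ function with $k=\frac{(Roughness+1)^2}{8}$, which is what the two steps above combine to. The function names are hypothetical, and the dot products are assumed to be already clamped to $[0,1]$.

```python
def remap_k(roughness):
    """k = alpha^2 / 2 with alpha = (roughness + 1) / 2, i.e. (roughness + 1)^2 / 8."""
    alpha = (roughness + 1.0) / 2.0
    return alpha * alpha / 2.0

def schlick_g1(n_dot_x, k):
    """One-directional Schlick term for either the view or the light direction."""
    return n_dot_x / (n_dot_x * (1.0 - k) + k)

def schlick_g(n_dot_o, n_dot_i, roughness):
    """Separable Schlick G: masking term times shadowing term."""
    k = remap_k(roughness)
    return schlick_g1(n_dot_o, k) * schlick_g1(n_dot_i, k)

print(remap_k(0.0), remap_k(1.0))    # 0.125 0.5
print(schlick_g(1.0, 1.0, 0.5))      # 1.0: no masking or shadowing head-on
print(schlick_g(0.1, 0.1, 1.0))      # grazing angles on a rough surface lose energy
```

Note that $k$ never reaches 0 with this remapping, so even a perfectly smooth surface keeps a little attenuation at grazing angles.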

Height correlated Smith $G$ function [Hei14] [LR14]

$G_{CorrelatedSchlick}=\frac{\chi^+(\omega_o·h)\chi^+(\omega_i·h)}{1+\Lambda(\omega_o)+\Lambda(\omega_i)}$

For the GGX distribution, $\Lambda(\omega)=\frac{-1+\sqrt{1+\alpha^2\tan^2(\theta)}}{2}=\frac{-1+\sqrt{1+\frac{\alpha^2(1-\cos^2(\theta))}{\cos^2(\theta)}}}{2}$, where $\theta$ is the angle between $\omega$ and $n$. Because a microfacet is masked and shadowed at the same time, correlating the masking and shadowing terms with respect to the microfacet height brings back some of the energy that the uncorrelated form loses. The detailed derivation is math-heavy; I’d recommend reading the corresponding papers for further understanding.
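Shaders usually do not evaluate $\Lambda$ directly. A minimal Python sketch, assuming both directions are in the upper hemisphere (so both $\chi^+$ terms are 1), shows that the height correlated $G$ divided by $4(n·\omega_o)(n·\omega_i)$ equals the square-root form used by `Frostbite_V_SmithGGXCorrelated` in sample c below. The Python function names are hypothetical.

```python
import math

def smith_lambda_ggx(cos_theta, alpha):
    """Lambda for GGX, written with tan^2 = (1 - cos^2) / cos^2."""
    cos2 = cos_theta * cos_theta
    tan2 = (1.0 - cos2) / cos2
    return (-1.0 + math.sqrt(1.0 + alpha * alpha * tan2)) / 2.0

def g_height_correlated(n_dot_o, n_dot_i, alpha):
    """Height correlated Smith G, with both chi+ terms assumed to be 1."""
    lambda_o = smith_lambda_ggx(n_dot_o, alpha)
    lambda_i = smith_lambda_ggx(n_dot_i, alpha)
    return 1.0 / (1.0 + lambda_o + lambda_i)

def v_height_correlated(n_dot_o, n_dot_i, alpha):
    """Visibility form: G / (4 * n_dot_o * n_dot_i), without evaluating Lambda."""
    a2 = alpha * alpha
    term_o = n_dot_i * math.sqrt(n_dot_o * n_dot_o * (1.0 - a2) + a2)
    term_i = n_dot_o * math.sqrt(n_dot_i * n_dot_i * (1.0 - a2) + a2)
    return 0.5 / (term_o + term_i)

n_dot_o, n_dot_i, alpha = 0.8, 0.3, 0.25
lhs = g_height_correlated(n_dot_o, n_dot_i, alpha) / (4.0 * n_dot_o * n_dot_i)
rhs = v_height_correlated(n_dot_o, n_dot_i, alpha)
print(lhs, rhs)  # the two values should agree up to rounding
```

Folding the $\frac{1}{4(n·\omega_o)(n·\omega_i)}$ denominator into the visibility term also avoids the division by a near-zero product at grazing angles.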

Multi-scattering Smith $G$ function [HHED16]

All of the $G$ functions I listed above only account for single scattering, while in reality a rougher surface is more likely to bounce light between different microfacets, and it’s important to count that part of the energy. The paper [HHED16] gives a ground truth based on a stochastic method, and later [IW17] gives an implementable way to compensate for it; an alternative version of the general formula is given by the book as:

“… $f_{ms}(l, v) = \frac{\overline F \overline {Rs_{F_1}}}{\pi(1-\overline {Rs_{F_1}})(1-\overline F(1-\overline {Rs_{F_1}}))}(1-Rs_{F_1}(l))(1-Rs_{F_1}(v))$ …” - pg. 346, “Real-Time Rendering”, 4th Edition.

Let’s start to decompose this formula from $F$, the Fresnel function. As we discussed before, Snell’s law gives us the direction of the transmitted light and ideal mirror reflection gives us the direction of the reflected light, but we still don’t know how much energy is reflected and how much is transmitted, which is quite an annoying problem. Luckily it was already solved (under some restricted conditions) in the 19th century by the French scientist Augustin-Jean Fresnel, as the Fresnel equations. Long story short, since we only care about an idealized sandbox in our real-time computations, we look for a good enough Fresnel function to simulate this phenomenon. The Schlick Fresnel approximation is widely adopted nowadays; it has the form $F_{Schlick}=F_0 + (F_{90} - F_0)(1-\cos\theta)^5$. $F_0$ is the specular reflectance at normal incidence, that is, when the view direction is perpendicular to the surface, $\omega_o \parallel n$; it is determined by the IOR of the two media. Similarly, $F_{90}$ is the reflectance at grazing incidence, when $\omega_o \perp n$; some implementations assume $F_{90}$ is always 1, which covers most common conductor/dielectric material types. Finally $\cos\theta=n·\omega_o$ or $\cos\theta=n·\omega_i$; another kind of cosine-weighted contribution appears here.

The $\overline F$ here is the cosine-weighted average of $F$ over the hemisphere; if we use the Schlick Fresnel approximation mentioned above with $F_{90}=1$, it is simply $\overline F = \frac{20}{21}F_0 + \frac{1}{21}$.

Then we move on to the other quantity, $Rs_{F_1}$: it is the directional albedo of $f_{s_{F_1}}$, the specular BRDF term with $F_0$ set to 1, which can be interpreted as the response of a pure white surface illuminated by a unit directional light source ($\overline {Rs_{F_1}}$ is its cosine-weighted average over the hemisphere); Stephen Hill wrote a nice blog series about it. Basically, once we have implemented the single-scattering BRDF, we can use a discrete numerical integration method (like Importance sampling) to calculate it and save it to a look-up table, very similar to what the split-sum technique for IBL does in [Kar13].
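The closed form for $\overline F$ can be checked numerically. The minimal Python sketch below assumes the average is cosine-weighted over the hemisphere, $\overline F = 2\int_0^1 F(\mu)\mu\, d\mu$ with $\mu=\cos\theta$, and that $F_{90}=1$; the function names are hypothetical.

```python
def fresnel_schlick(f0, cos_theta, f90=1.0):
    """Schlick Fresnel approximation for a single channel."""
    return f0 + (f90 - f0) * (1.0 - cos_theta) ** 5

def average_fresnel(f0, steps=10000):
    """Cosine-weighted hemispherical average: 2 * integral_0^1 F(mu) * mu d(mu)."""
    d_mu = 1.0 / steps
    total = 0.0
    for i in range(steps):
        mu = (i + 0.5) * d_mu  # midpoint of the i-th interval
        total += fresnel_schlick(f0, mu) * mu * d_mu
    return 2.0 * total

for f0 in (0.04, 0.5, 1.0):
    closed_form = 20.0 / 21.0 * f0 + 1.0 / 21.0
    print(f0, average_fresnel(f0), closed_form)  # the two columns should agree
```

The $\frac{1}{21}$ comes from $2\int_0^1 (1-\mu)^5\mu\, d\mu = \frac{1}{21}$, so the result only holds for the exponent 5 and $F_{90}=1$.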

Use Simple Lambert model as the microfacet BRDF in Cook-Torrance model [LR14]

$f_m(\omega_o,\omega_i,m) = f_{Lambert} = \frac{\rho} {\pi}$ and then $f_{cook-torrance}=\frac{\rho}{\pi}\frac{1}{|n·\omega_o||n·\omega_i|}\int\limits_{\Omega}G(\omega_o,\omega_i,m) D(m,\alpha)\max(0,\omega_o·m)\max(0,\omega_i·m)dm$

Still no analytic solution, but it gives a theoretical foundation for the problem we need to solve and unifies everything inside one microsurface theory.

Oren-Nayar model [ON94]

$f_{oren-nayar}=\frac{\rho} {\pi}(A+(B·\max(0, \cos (\phi_i - \phi_o))·\sin(\max(\theta_i, \theta_o))·\tan(\min(\theta_i,\theta_o))))$, with $A=1-\frac{\alpha}{2\alpha+0.66}$ and $B=0.45\frac{\alpha}{\alpha+0.09}$. Here $\theta$ and $\phi$ are the polar and azimuthal angles of $\omega_i$ and $\omega_o$ around $n$, and $\alpha$ is the roughness (the variance $\sigma^2$ of the original paper). It’s an approximation of the general Cook-Torrance model with Lambertian microfacets; when $\alpha = 0$ we get the Simple Lambert model, so the Oren-Nayar model can be treated as a generalization of the Simple Lambert model.

Disney model [Bur12]

Another advanced Lambert-based diffuse model, which takes the Fresnel effect into account: $f_{disney}=\frac{\rho} {\pi}(1+(F_{d90}-1)(1-(n · \omega_i))^5)(1+(F_{d90}-1)(1-(n · \omega_o))^5)$, or written as $f_{Disney}=\frac{\rho} {\pi}F_{Schlick}(1, F_{d90}, n, \omega_o)·F_{Schlick}(1, F_{d90}, n, \omega_i)$. Here $F_{d90}=0.5+2(h·\omega_i)^2\alpha$, where $\alpha$ is the roughness.

Normalized Disney model [LR14]

For the sake of energy conservation, we can renormalize the original Disney model so that its directional albedo stays within $[0,1]$: $f_{normalizedDisney}=c·\frac{\rho} {\pi}F_{Schlick}(1, F_{d90}, n, \omega_o)·F_{Schlick}(1, F_{d90}, n, \omega_i)$, where $c=1-\frac{0.51}{1.51}\alpha$ is a scaling factor (I wrote it in this form to compare better with the original version), and now $F_{d90}=(0.5+2(h·\omega_i)^2)\alpha$, as in the Frostbite sample code below. $\alpha=Roughness^\gamma$; in practice the original paper chooses $\gamma=2$, and it finds that $\gamma=4$ gives almost the same result as [Sch14], where $\alpha=(0.3+0.7Roughness)^6$.

Specular part

Phong model

$f_{Phong} = (r·\omega_o)^\alpha$, where $r$ is the reflection direction of $\omega_i$. It has been the most famous and commonly used specular model of the last few decades, and was even implemented inside graphics hardware; it takes the exponent $\alpha$ as the user-controlled parameter.

Normalized Phong model

$f_{normalizedPhong} = c·(r·\omega_o)^\alpha$, $c=\frac{\alpha+1}{2\pi}$. The Phong model actually provides a $D$ function from the microsurface point of view, so $\alpha$ here can be thought of as a roughness control (a higher exponent means a smoother surface).

Blinn-Phong model

$f_{Blinn-Phong} = (n·h)^\alpha$, an optimization of the Phong model; in practice, if we choose the mapping $\alpha_{blinn-phong}=4\alpha_{phong}$, the Blinn-Phong model looks like the Phong model [Wiki1].

Normalized Blinn-Phong model

$f_{normalizedBlinn-Phong} = c·(n·h)^\alpha$, $c=\frac{\alpha+2}{4\pi(2-2^{\frac{-\alpha}{2}})}$.

Cook-Torrance model [CT82] [Hei14]

$f_{cook-torrance} = \frac{F(\omega_o, h , f_0, f_{90})D(h, \alpha)G(\omega_o, \omega_i, h)}{4|n·\omega_o||n·\omega_i|}$, the new kid in town (popular from ~2012)! It puts together everything we’ve talked about before. The denominator comes from the Jacobian of the change of variables from microfacet space to macro space, which makes it quite elegant: we use the microfacet theory but calculate at the macro level!

Some sample codes

a. Simple Lambert model + Blinn-Phong model

vec3 CalcDirectionalLight(dirLight light, vec3 normal, vec3 diffuse, vec3 specular, vec3 viewPos, vec3 fragPos)
{ 
 vec3 N = normalize(normal);
 vec3 L = normalize(-light.direction);
 vec3 V = normalize(viewPos - fragPos);
 vec3 H = normalize(V + L);

 float NdotH = max(dot(N , H), 0.0);
 float NdotL = max(dot(N , L), 0.0);
 
 // ambient color
 vec3 ambientColor = diffuse * light.color * 0.04;

 // diffuse color
 vec3 diffuseColor = diffuse * NdotL * light.color;
 
 // specular color
 float alpha = 32;
 vec3 specularColor = specular * pow(NdotH, alpha) * light.color;
 
 return (ambientColor + diffuseColor + specularColor);
}

b. Oren-Nayar model + Normalized Blinn-Phong model

// Oren-Nayar diffuse BRDF
// ----------------------------------------------------------------------------
float orenNayarDiffuse(float LdotV, float NdotL, float NdotV, float roughness) 
{
 float s = LdotV - NdotL * NdotV;
 float t = mix(1.0, max(NdotL, NdotV), step(0.0, s));

 float sigma2 = roughness * roughness;
 float A = 1.0 - (0.5 * sigma2 / (sigma2 + 0.33));
 float B = 0.45 * sigma2 / (sigma2 + 0.09);

 return max(0.0, NdotL) * (A + B * s / t);
}

vec3 CalcDirectionalLight(dirLight light, vec3 normal, vec3 diffuse, vec3 specular, float roughness, vec3 viewPos, vec3 fragPos)
{ 
 vec3 N = normalize(normal);
 vec3 L = normalize(-light.direction);
 vec3 V = normalize(viewPos - fragPos);
 vec3 H = normalize(V + L);
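 // keep LdotV unclamped: the Oren-Nayar term s can legitimately be negative
 float LdotV = dot(L , V);
 float NdotH = max(dot(N , H), 0.0);
 float NdotL = max(dot(N , L), 0.0);
 float NdotV = max(dot(N , V), 0.0);

 // ambient color
 vec3 ambientColor = diffuse * light.color * 0.04;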

 // diffuse color
 float Fd = orenNayarDiffuse(LdotV, NdotL, NdotV, roughness);
 vec3 diffuseColor = diffuse * Fd * light.color;
 
 // specular color
 float alpha = 32;
 float normalizedScaleFactor = (alpha + 2) / (4 * PI * (2 - pow(2, (-alpha / 2))));
 vec3 specularColor = specular * (1 - Fd) * pow(NdotH, alpha) * normalizedScaleFactor * light.color;
 
 return (ambientColor + diffuseColor + specularColor);
}

c. Normalized Disney model + Cook-Torrance (specular) model, use $D_{TR}$+$G_{CorrelatedSchlick}$+$F_{Schlick}$ combination

(reference Frostbite Engine [LR14])

// Frostbite Engine model
// ----------------------------------------------------------------------------
// Specular/Diffuse BRDF Fresnel Component
// ----------------------------------------------------------------------------
vec3 Frostbite_fresnelSchlick(vec3 f0, float f90, float u)
{
 return f0 + (f90 - f0) * pow(1.0 - u, 5.0);
}
// Diffuse BRDF
// ----------------------------------------------------------------------------
float Frostbite_DisneyDiffuse(float NdotV, float NdotL, float LdotH, float linearRoughness)
{ 
 float energyBias = mix(0.0, 0.5, linearRoughness);
 float energyFactor = mix(1.0, 1.0/1.51, linearRoughness);
 float fd90 = energyBias + 2.0 * LdotH * LdotH * linearRoughness;
 vec3 f0 = vec3 (1.0, 1.0, 1.0);
 float lightScatter = Frostbite_fresnelSchlick(f0, fd90, NdotL).r;
 float viewScatter = Frostbite_fresnelSchlick(f0, fd90, NdotV).r;
 return lightScatter * viewScatter * energyFactor;
}
// Specular BRDF Geometry Component
// ----------------------------------------------------------------------------
float Frostbite_V_SmithGGXCorrelated(float NdotL , float NdotV , float alphaG)
{
 float alphaG2 = alphaG * alphaG;
 float Lambda_GGXV = NdotL * sqrt(NdotV * NdotV * (1.0 - alphaG2) + alphaG2);
 float Lambda_GGXL = NdotV * sqrt(NdotL * NdotL * (1.0 - alphaG2) + alphaG2);
 return 0.5 / max((Lambda_GGXV + Lambda_GGXL), 0.00001);
}
// Specular BRDF Distribution Component
// ----------------------------------------------------------------------------
float Frostbite_D_GGX(float NdotH , float roughness)
{
 // remapping to Quadratic curve
 float m = roughness * roughness;
 float m2 = m * m;
 float f = (NdotH * m2 - NdotH) * NdotH + 1;
 return m2 / (f * f);
}
// ----------------------------------------------------------------------------
vec3 Frostbite_CalcDirectionalLightRadiance(dirLight light, vec3 albedo, float metallic, float roughness, vec3 normal, vec3 viewPos, vec3 fragPos, vec3 F0)
{
 vec3 N = normalize(normal);
 vec3 L = normalize(-light.direction);
 vec3 V = normalize(viewPos - fragPos);
 vec3 H = normalize(V + L);

 float NdotV = max(dot(N , V), 0.0);
 float LdotH = max(dot(L , H), 0.0);
 float NdotH = max(dot(N , H), 0.0);
 float NdotL = max(dot(N , L), 0.0);

 // Specular BRDF
 float f90 = 1.0;
 vec3 F = Frostbite_fresnelSchlick(F0, f90, LdotH);
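 // alphaG = roughness^2, the same remapping Frostbite_D_GGX applies internally;
 // the result is the visibility term V = G / (4 * NdotL * NdotV)
 float G = Frostbite_V_SmithGGXCorrelated(NdotL, NdotV, roughness * roughness);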
 float D = Frostbite_D_GGX (NdotH, roughness);
 vec3 Fr = F * G * D;

 // Diffuse BRDF
 float Fd = Frostbite_DisneyDiffuse(NdotV, NdotL, LdotH ,roughness * roughness); 
 
 return (Fd * albedo + Fr) * light.color * NdotL / PI;
}

d. Simple Lambert model + Cook-Torrance (specular) model, use $D_{TR}$+$G_{Schlick}$+$F_{Schlick}$ combination

(reference from Unreal Engine 4[Kar13])

// Unreal Engine model
// ----------------------------------------------------------------------------
// Specular BRDF Distribution Component
// ----------------------------------------------------------------------------
float Unreal_DistributionGGX(float NdotH, float roughness)
{
 float a = roughness*roughness;
 // remapping to Quadratic curve
 float a2 = a * a;
 float NdotH2 = NdotH*NdotH;

 float nom = a2;
 float denom = (NdotH2 * (a2 - 1.0) + 1.0);
 denom = denom * denom;

 return nom / denom;
}
// Specular BRDF Geometry Component
// ----------------------------------------------------------------------------
float Unreal_GeometrySchlickGGX(float NdotV, float roughness)
{
 float r = (roughness + 1.0);
 float k = (r*r) / 8.0;

 float nom = NdotV;
 float denom = NdotV * (1.0 - k) + k;

 return nom / denom;
}
// ----------------------------------------------------------------------------
float Unreal_GeometrySmith(float NdotV, float NdotL, float roughness)
{
 float ggx2 = Unreal_GeometrySchlickGGX(NdotV, roughness);
 float ggx1 = Unreal_GeometrySchlickGGX(NdotL, roughness);

 return ggx1 * ggx2;
}
// Specular BRDF Fresnel Component
// ----------------------------------------------------------------------------
vec3 Unreal_fresnelSchlick(float cosTheta, vec3 F0)
{
 return F0 + (1.0 - F0) * pow(1.0 - cosTheta, 5.0);
}
// ----------------------------------------------------------------------------
vec3 Unreal_CalcDirectionalLightRadiance(dirLight light, vec3 albedo, float metallic, float roughness, vec3 normal, vec3 viewPos, vec3 fragPos, vec3 F0)
{
 vec3 N = normalize(normal);
 vec3 L = normalize(-light.direction);
 vec3 V = normalize(viewPos - fragPos);
 vec3 H = normalize(V + L);

 float NdotV = max(dot(N , V), 0.0);
 float NdotH = max(dot(N, H), 0.0);
 float HdotV = max(dot(H , V), 0.0);
 float NdotL = max(dot(N , L), 0.0);
 
 // Specular BRDF
 vec3 F = Unreal_fresnelSchlick(HdotV, F0);
 // both helpers take precomputed dot products, not the raw vectors
 float G = Unreal_GeometrySmith(NdotV, NdotL, roughness);
 float D = Unreal_DistributionGGX(NdotH, roughness);
 
 vec3 numerator = D * G * F;
 float denominator = 4.0 * NdotV * NdotL;
 vec3 specular = numerator / max(denominator, 0.00001);
 
 // for energy conservation 
 vec3 kS = F;
 vec3 kD = vec3(1.0) - kS; 
 kD *= 1.0 - metallic; 
 
 return ((kD * albedo + specular) * light.color * NdotL) / PI;
}

To be continued.

Bibliography:

[Bur12] B. Burley. “Physically Based Shading at Disney”. In: Physically Based Shading in Film and Game Production, ACM SIGGRAPH 2012 Courses. SIGGRAPH ’12. Los Angeles, California: ACM, 2012, 10:1–7. isbn: 978-1-4503-1678-1. doi: 10.1145/2343483.2343493. url: http://selfshadow.com/publications/s2012-shading-course/.

[CT82] R. L. Cook and K. E. Torrance. “A Reflectance Model for Computer Graphics”. In: ACM Trans. Graph. 1.1 (Jan. 1982), pp. 7–24. issn: 0730-0301. doi: 10.1145/357290.357293. url: http://graphics.pixar.com/library/ReflectanceModel/.

[Hei14] E. Heitz. “Understanding the Masking-Shadowing Function in Microfacet-Based BRDFs”. In: Journal of Computer Graphics Techniques (JCGT) 3.2 (June 2014), pp. 32–91. issn: 2331-7418. url: http://jcgt.org/published/0003/02/03/.

[HHED16] E. Heitz, J. Hanika, E. d’Eon and C. Dachsbacher. “Multiple-Scattering Microfacet BSDFs with the Smith Model”. In: ACM Trans. Graph. 35.4 (July 2016). issn: 0730-0301, eissn: 1557-7368. doi: 10.1145/2897824.2925943. url: https://eheitzresearch.wordpress.com/240-2/.

[HTSG91] X. D. He, K. E. Torrance, F. X. Sillion and D. P. Greenberg. “A Comprehensive Physical Model for Light Reflection”. In: ACM SIGGRAPH Computer Graphics 25.4 (July 1991), pp. 175–186. doi: 10.1145/127719.122738. url: https://www.graphics.cornell.edu/pubs/1991/HTSG91.pdf.

[Sch94] C. Schlick. “An Inexpensive BRDF Model for Physically-based Rendering”. In: Computer Graphics Forum 13.3 (Sept. 1994), pp. 149–162. url: http://dept-info.labri.u-bordeaux.fr/~Schlick/DOC/eur2.html.

[Sch14] N. Schulz. “Moving to the Next Generation - The Rendering Technology of Ryse”. In: Game Developers Conference. 2014.

[Wiki1] https://en.wikipedia.org/wiki/Blinn%E2%80%93Phong_shading_model
