
DirectX 12: many PipelineState changes vs. Universal Shader Design


Hello,

I am migrating my DX11 indie game engine to DX12.

My DX11 engine was highly optimized; e.g. rendering the Sponza scene would use

only 2 VertexBuffers (each vertex contains a material index referencing a material structured buffer; a sketch of such a vertex layout follows the list below)
- one for simple textured materials
- one for materials with normal maps
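
A hedged sketch of what such a vertex might look like (field names are assumptions, not from the post):

#include <cstdint>
#include <DirectXMath.h>

// Sketch: each vertex carries a material index, so one big vertex buffer can
// cover many materials without per-subset constant updates.
struct Vertex
{
    DirectX::XMFLOAT3 position;
    DirectX::XMFLOAT3 normal;
    DirectX::XMFLOAT2 uv;
    uint32_t          materialIndex;   // indexes the material structured buffer
};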

so state changes in the render loop were minimized, like this:

foreach subset ( e.g. 100 or more )
{
    if ( subset visible )
    {
        set ShaderResourceViews ( textures of material )
        set pixel shader ( of material )
        draw call indexed ...
    }
}

Now in DX12:
- (better) I can get rid of setting ShaderResourceViews per draw, because I can index into a texture array (see the root-signature sketch after this list)
- (worse) I have a PipelineState change for every different pixel shader (material)
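
For illustration, a minimal sketch of how the texture-array indexing can be exposed through the root signature (names and counts are assumptions, not from the post):

#include <d3d12.h>

// Sketch: one SRV range covering many textures (t0..tN in space0), bound as a
// single descriptor table; the shader then indexes into it per material.
D3D12_DESCRIPTOR_RANGE BuildTextureRange(UINT textureCount)
{
    D3D12_DESCRIPTOR_RANGE range = {};
    range.RangeType          = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
    range.NumDescriptors     = textureCount;    // e.g. all scene textures
    range.BaseShaderRegister = 0;               // t0
    range.RegisterSpace      = 0;               // space0
    range.OffsetInDescriptorsFromTableStart = D3D12_DESCRIPTOR_RANGE_OFFSET_APPEND;
    return range;
}

D3D12_ROOT_PARAMETER BuildTextureTableParam(const D3D12_DESCRIPTOR_RANGE* range)
{
    // The caller must keep 'range' alive until the root signature is serialized.
    D3D12_ROOT_PARAMETER param = {};
    param.ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
    param.DescriptorTable.NumDescriptorRanges = 1;
    param.DescriptorTable.pDescriptorRanges   = range;
    param.ShaderVisibility = D3D12_SHADER_VISIBILITY_PIXEL;
    return param;
}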

I know that PSOs were made to be switched quickly, but the PSO structure contains so many attributes.
Does the command list compiler optimize all these attribute changes well?
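
For reference, a hedged sketch of the PSO creation in question (assumes the d3dx12.h helper header; blobs, formats and root signature are assumptions):

#include <climits>
#include <d3d12.h>
#include "d3dx12.h"   // CD3DX12_* default-state helpers

// Sketch: a PSO bakes shaders, blend, raster, depth, input layout and render
// target formats into one immutable object; switching materials in DX12 means
// switching whole PSOs instead of individual states as in DX11.
ID3D12PipelineState* CreateMaterialPso(ID3D12Device* device,
                                       ID3D12RootSignature* rootSignature,
                                       ID3DBlob* vs, ID3DBlob* ps,
                                       const D3D12_INPUT_ELEMENT_DESC* elems,
                                       UINT elemCount)
{
    D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = {};
    desc.pRootSignature        = rootSignature;
    desc.VS                    = { vs->GetBufferPointer(), vs->GetBufferSize() };
    desc.PS                    = { ps->GetBufferPointer(), ps->GetBufferSize() };
    desc.BlendState            = CD3DX12_BLEND_DESC(D3D12_DEFAULT);
    desc.RasterizerState       = CD3DX12_RASTERIZER_DESC(D3D12_DEFAULT);
    desc.DepthStencilState     = CD3DX12_DEPTH_STENCIL_DESC(D3D12_DEFAULT);
    desc.SampleMask            = UINT_MAX;
    desc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
    desc.InputLayout           = { elems, elemCount };
    desc.NumRenderTargets      = 1;
    desc.RTVFormats[0]         = DXGI_FORMAT_R8G8B8A8_UNORM;   // assumed format
    desc.DSVFormat             = DXGI_FORMAT_D32_FLOAT;        // assumed format
    desc.SampleDesc            = { 1, 0 };

    ID3D12PipelineState* pso = nullptr;
    device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso));
    return pso;
}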

In my engine I have e.g. 15 typical default shaders (a permutation-keyed PSO cache sketch follows this list)
- with/without roughness map
- with/without metallic map
- with/without normal map
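
One common way to keep such permutations manageable (a general technique, not something claimed in the thread) is a PSO cache keyed by feature bits, created lazily so only combinations that actually occur get built:

#include <cstdint>
#include <unordered_map>
#include <d3d12.h>

// Hypothetical feature bits matching the with/without maps listed above.
enum MaterialFeatures : uint32_t
{
    HAS_ROUGHNESS_MAP = 1u << 0,
    HAS_METALLIC_MAP  = 1u << 1,
    HAS_NORMAL_MAP    = 1u << 2,
};

// Hypothetical helper that compiles the shader with matching defines and
// fills the PSO desc; declared here only to make the sketch complete.
ID3D12PipelineState* CreatePsoForFeatures(uint32_t features);

// Cache: one PSO per feature combination, created on first use.
std::unordered_map<uint32_t, ID3D12PipelineState*> g_psoCache;

ID3D12PipelineState* GetOrCreatePso(uint32_t features)
{
    auto it = g_psoCache.find(features);
    if (it != g_psoCache.end())
        return it->second;
    ID3D12PipelineState* pso = CreatePsoForFeatures(features);
    g_psoCache[features] = pso;
    return pso;
}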


Solution 1:

Now I could make one universal pixel shader with many branches that sample or skip each specific texture.

=> advantage: this would reduce the whole Sponza scene to 2 draw calls
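
With a single uber-shader PSO, the render loop collapses to something like this hedged sketch (the Subset type and root-parameter indices are assumptions):

#include <d3d12.h>
#include <vector>

struct Subset                // hypothetical per-submesh record, not from the post
{
    bool visible;
    UINT materialIndex;
    UINT indexCount;
    UINT firstIndex;
    INT  baseVertex;
};

// Sketch: one PSO for everything; per-draw data shrinks to a material index
// passed as a root constant, which the uber shader uses to index the material
// structured buffer.
void RecordUberShaderPass(ID3D12GraphicsCommandList* cmdList,
                          ID3D12PipelineState* uberPso,
                          ID3D12RootSignature* rootSignature,
                          const std::vector<Subset>& subsets)
{
    cmdList->SetPipelineState(uberPso);                 // set once per pass
    cmdList->SetGraphicsRootSignature(rootSignature);

    for (const Subset& s : subsets)
    {
        if (!s.visible)
            continue;

        // Root parameter 0 is assumed to be a 32-bit root constant.
        cmdList->SetGraphicsRoot32BitConstant(0, s.materialIndex, 0);
        cmdList->DrawIndexedInstanced(s.indexCount, 1, s.firstIndex, s.baseVertex, 0);
    }
}

Since the material index already lives in the vertex data in this engine, visible subsets sharing a vertex buffer could be merged even further toward the 2-draw-call figure.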

Solution 2:

Keeping my system of many PipelineState changes.

How can this be optimized in DX12?
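
If Solution 2 is kept, a common mitigation (a sketch of the general technique, not something from this thread) is to sort visible subsets by PSO so each state change is paid once per group:

#include <algorithm>
#include <d3d12.h>
#include <vector>

// Sketch: the hypothetical Subset record from the earlier sketch, here
// extended with its PSO pointer.
struct Subset
{
    ID3D12PipelineState* pso;
    UINT indexCount;
    UINT firstIndex;
    INT  baseVertex;
};

void RecordSortedByPso(ID3D12GraphicsCommandList* cmdList,
                       std::vector<Subset>& visibleSubsets)
{
    // Group draws by PSO so SetPipelineState runs once per group,
    // not once per subset.
    std::sort(visibleSubsets.begin(), visibleSubsets.end(),
              [](const Subset& a, const Subset& b) { return a.pso < b.pso; });

    ID3D12PipelineState* currentPso = nullptr;
    for (const Subset& s : visibleSubsets)
    {
        if (s.pso != currentPso)                // pay the switch only on change
        {
            cmdList->SetPipelineState(s.pso);
            currentPso = s.pso;
        }
        cmdList->DrawIndexedInstanced(s.indexCount, 1, s.firstIndex, s.baseVertex, 0);
    }
}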


DX12 was released to let developers optimize code how they want and not have “hidden” work done for us;

while we're crying now -lol- whatever solution u decide to go with, just profile it. It's your choice really…

But whatever you decide, keep the 2D*-rule in mind:

  • Avoid or minimize descriptor heap changes; those are expensive (the GPU can stall due to flushes);
  • Descriptor table changes within a descriptor heap are fast and cheap; encourage this (see the sketch just below);

D* is for DDLox's rule for Descriptors, I made this up to help myself remember it and mum is happy -lol-
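
In code, the rule boils down to something like this sketch (heap and table handles, and the root-parameter index, are assumptions):

#include <d3d12.h>
#include <vector>

// Sketch: bind the shader-visible heap once per command list (changing heaps
// is the expensive operation), then vary only the cheap table pointers.
void RecordWithDescriptorTables(ID3D12GraphicsCommandList* cmdList,
                                ID3D12DescriptorHeap* srvHeap,
                                const std::vector<D3D12_GPU_DESCRIPTOR_HANDLE>& perDrawTables)
{
    ID3D12DescriptorHeap* heaps[] = { srvHeap };
    cmdList->SetDescriptorHeaps(1, heaps);      // do this rarely

    for (const D3D12_GPU_DESCRIPTOR_HANDLE& table : perDrawTables)
    {
        // Root parameter 1 is assumed to be a descriptor table; moving the
        // handle around inside the same heap is the cheap operation.
        cmdList->SetGraphicsRootDescriptorTable(1, table);
        // ... draw call for this subset ...
    }
}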

If u can, try both your solutions and see which works best; game dev is fun for these trial-and-error reasons 🙂

have fun 🙂

@ddlox hi ddlox,

thanks for your comment.

After about 4 years of DX12 being used by game developers, I thought there would by now be a general design that solves the question raised by game developers of all times:

“How can I draw as many meshes with different vertex layouts and different materials as possible?”

(of course, neglecting the problems of lighting model, shading etc. in the first place)

Of course I have studied NVIDIA's do's and don'ts, but those recommendations don't help much when setting up a new engine design.

hi,

to prepare my migration I benchmarked the uber (universal) shader approach vs. specialized compiled shaders (DX11).

There is absolutely NO performance difference. I was really surprised!! Maybe this is because I have a new GPU (RTX 3070).

How to explain:
IMHO the “branching” of the pixel shader code only gives a performance hit if the different branches are used by at least some pixels within one wave, because the fast threads have to wait for the slow paths to finish.

E.g. the branch sampling 4 textures (sampling is expensive) only hurts performance if at least one pixel in the wave takes that path.

But when rendering a submesh with a simple material using 1 texture, the “expensive” branch is never “activated”; the GPU knows this, and no threads have to wait until the reading of 4 texture samples is finished.
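
The reasoning can be illustrated with a toy cost model (purely illustrative, not GPU code): a wave pays the cost of the most expensive branch any of its lanes takes, so a branch nobody enters costs nothing:

#include <algorithm>
#include <cstdio>
#include <vector>

// Toy model: a lane is cheap unless it takes the 4-texture branch; the wave's
// cost is the maximum over its lanes, because lanes execute in lockstep.
int WaveCost(const std::vector<bool>& laneTakesExpensiveBranch)
{
    const int cheapCost = 1, expensiveCost = 4;   // arbitrary units
    int cost = 0;
    for (bool expensive : laneTakesExpensiveBranch)
        cost = std::max(cost, expensive ? expensiveCost : cheapCost);
    return cost;
}

int main()
{
    std::vector<bool> uniformWave(32, false);     // no lane samples 4 textures
    std::vector<bool> divergentWave(32, false);
    divergentWave[0] = true;                      // one lane stalls the whole wave

    std::printf("uniform wave cost:   %d\n", WaveCost(uniformWave));    // prints 1
    std::printf("divergent wave cost: %d\n", WaveCost(divergentWave));  // prints 4
}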

I don't know if the results can be compared to DX12 state changes.

But I will use this uber shader version, because the number of pipeline states would explode if I created one for each shader permutation.

You have to be careful because there are drivers (Nvidia in particular) that will detect branches on simple constant buffer values and generate optimized variants of your shaders with the branches removed. It's easier for them to do this in D3D11, but they can do it in D3D12 on more recent OS versions. Try creating your D3D11 device with D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS and see if that affects your benchmark.
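
For reference, a hedged sketch of what that test looks like at device-creation time (the flag is real; the surrounding function and variable names are assumptions):

#include <d3d11.h>

// Sketch: create the D3D11 device with the flag MJP mentions, which asks the
// driver not to run its own background shader-optimization threads.
HRESULT CreateDeviceWithoutDriverOptimizations(ID3D11Device** device,
                                               ID3D11DeviceContext** context)
{
    UINT flags = D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS;
    D3D_FEATURE_LEVEL level = {};
    return D3D11CreateDevice(
        nullptr,                    // default adapter
        D3D_DRIVER_TYPE_HARDWARE,
        nullptr,                    // no software rasterizer module
        flags,
        nullptr, 0,                 // default feature levels
        D3D11_SDK_VERSION,
        device, &level, context);
}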

@MJP

hello MJP,

many thanks, I always appreciate every bit of help.
I did test what you proposed and there was NO performance difference.

But I have to admit that I cannot understand in depth what this creation flag REALLY means, although I have read the Microsoft docs.

The shader code gets a structured material buffer array. One UINT value per material encodes, as bit flags, which textures to sample.
I simply decode the flags and either sample or skip the bound textures.
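
As an illustration of that encoding, a sketch with assumed names (mirroring the permutation bits from the PSO cache sketch earlier in the thread):

#include <cstdint>

// Sketch: the CPU side packs per-material texture flags into one UINT; the
// HLSL material struct decodes the same masks per pixel.
enum : uint32_t   // same masks as in the permutation sketch above
{
    HAS_ROUGHNESS_MAP = 1u << 0,
    HAS_METALLIC_MAP  = 1u << 1,
    HAS_NORMAL_MAP    = 1u << 2,
};

struct MaterialGpu                 // assumed layout, mirrored in the shader
{
    uint32_t textureFlags;         // bit i set => sample texture i
    uint32_t albedoIndex;          // indices into the big texture table
    uint32_t normalIndex;
    uint32_t roughnessIndex;
};

uint32_t PackFlags(bool hasRoughness, bool hasMetallic, bool hasNormal)
{
    uint32_t flags = 0;
    if (hasRoughness) flags |= HAS_ROUGHNESS_MAP;
    if (hasMetallic)  flags |= HAS_METALLIC_MAP;
    if (hasNormal)    flags |= HAS_NORMAL_MAP;
    return flags;
}

// The shader side (HLSL) would test the same masks, e.g.:
//   if (mat.textureFlags & HAS_NORMAL_MAP) { n = SampleNormalMap(...); }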

When I formerly worked with optimized shader permutations, I had control variables that steered the shader compile process, and the code was smaller and more optimized per material.

Sorry for my slow understanding, but would you please be so kind as to explain why I have to be careful?

Do you mean it would be better, e.g., to control the sampling of the textures with a separate UINT value per texture in the constant buffer / structured buffer, so that at runtime the GPU can ADDITIONALLY better optimize the execution of my branched code?

