Percentage-Closer Soft Shadows #1107
Replies: 41 comments 91 replies
-
Thanks Chris for posting this work. Once the implementation is further along I'll merge it with VSG and vsgExamples master. As a step in this direction I have merged the changes as a VSG separate-shadow-samplers branch and a vsgExamples soft-shadows branch: https://github.com/vsg-dev/VulkanSceneGraph/tree/separate-shadow-samplers

I've done a first pass review of the changes to the VSG, and it looks like we should be able to use the new createDescriptorImage convenience functions. I want to test the changes against the original cascaded shadow map implementation; if that works fine then I'll get these changes to the VSG merged with master, hopefully wrapped up today.

Looking at the vsgExamples changes to the shaders suggests that the phong shader is the most mature, and as there is an overlap between the phong and pbr shaders w.r.t. shadows, I think it's probably time we created a dedicated shadow.frag/shadow.glsl file that is included by the phong and pbr shaders to reduce the amount of duplication. I would also like to look at controlling whether Percentage-Closer Soft Shadows is enabled in the shaders, and the settings used.

Once I have spent more time with the new shader and C++ code I'll have a better idea of what changes to vsg::ViewDependentState and vsg::Light might be required. One change I have been considering is moving the vsg::Light::shadowMaps value out of vsg::Light and leaving this to vsg::ViewDependentState. Adding an extra light size property to vsg::Light to help guide the size of penumbras seems like something we might consider as well.
-
I have been testing the soft-shadows branch of vsgExamples and found that with the modified standard_phong.frag, which defaults the number of samples to 8, I get obvious undersampling issues, but upping this to 16 gives significantly better results, and going all the way up to 64 produces very nice results. It's harder to judge the performance impact without creating a set of test models and animation paths; from my initial tests with just the simple vsgshadow test model I'm seeing 2750fps with 8 samples and 2131fps with 32 samples.

I'm thinking that ViewDependentState should pass in the number of samples to use. The angleSubtended is something I would put into vsg::DirectionalLight and pass to the shader via the LightData; defaulting to the Sun as seen from Earth makes sense. In the shader I see that tan(angleSubtended/2) is used, so perhaps ViewDependentState can compute this. The use of the inverse of the shadow map matrix is something that could also be done by ViewDependentState.
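As a sketch of that last idea, precomputing tan(angleSubtended/2) on the CPU is trivial. This is a hypothetical helper, not existing ViewDependentState code, and the Sun's ~0.533° angular diameter is an assumed default:

```cpp
#include <cmath>

// Hypothetical helper: compute tan(angleSubtended/2) once on the CPU
// (e.g. in ViewDependentState) instead of per-fragment in the shader.
double tanHalfAngle(double angleSubtendedRadians)
{
    return std::tan(angleSubtendedRadians * 0.5);
}

// Example default: the Sun subtends roughly 0.533 degrees from Earth,
// i.e. about 0.0093 radians, giving tanHalfAngle(...) of about 0.00465.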
-
I used the convenience functions to create the sampler descriptors, but the sampled image descriptors needed different handling.

The prior shadow implementation should work just fine with the separate sampler and image provided it's adapted to combine them in the shader. That should just mean changing

```glsl
layout(set = VIEW_DESCRIPTOR_SET, binding = 2) uniform sampler2DArrayShadow shadowMaps;
...
texture(shadowMaps, vec4(sm_tc.st, shadowMapIndex, sm_tc.z)).r
```

to

```glsl
layout(set = VIEW_DESCRIPTOR_SET, binding = 2) uniform texture2DArray shadowMaps;
layout(set = VIEW_DESCRIPTOR_SET, binding = 4) uniform sampler shadowMapShadowSampler;
...
texture(sampler2DArrayShadow(shadowMaps, shadowMapShadowSampler), vec4(sm_tc.st, shadowMapIndex, sm_tc.z)).r
```

As an aside, this made me realise I'd forgotten to put some of the shadow implementation in the PBR shader, so I've just pushed another commit that copies and pastes the rest.

Some of the difference will be the aforementioned stuff I forgot to copy and paste until just now. I definitely agree that splitting it out into a separate reusable file is a good idea. This also makes switching implementations easier - even if multiple implementations end up in the same file, it prevents filling all the lighting shaders with several shadow implementations, but there's also the option of putting each shadow implementation in its own file.

Eight was essentially chosen arbitrarily, but with the sampling pattern I ended up on, the undersampling artefacts are a lot less bad than they were with much higher sample counts and all the other ones I tried first. I'm definitely in favour of it being parameterised, but I'm not sure.

The precomputable part of the trigonometry is currently left in the shader. The inverse light space matrix is in the shader because I wanted to gauge your appetite for using up more of the light data buffer on something like that before making things more complicated.
-
Something that I've alluded to on one of our calls, but have only just confirmed is a real problem, is the depth clamping. If all you care about is whether something occludes light, it makes sense to enable depth clamping for the shadow map RTT and restrict light space to only cover the frustum, as it means there's more depth precision available where it matters. However, when PCSS is used, you also care about how far away occluders are, and depth clamping squashes a bunch of potential occluders onto the shadow camera's near plane. This means that any occluders outside the view frustum potentially end up with much narrower penumbras than they're supposed to have, and this can look particularly bad when zooming in etc. as they grow and shrink before your eyes.

The solution would be to disable depth clamping and extend light space to encompass all potential occluders rather than just all potential receivers. Currently, disabling depth clamping in the vsgshadow example only makes the first change, which just means that a bunch of occluders are left out and some shadows are totally missing, or even worse, totally missing from just one shadow map, causing an abrupt cutoff. I'm under the impression we don't necessarily know the bounds of the set of potential occluders for a shadow map, so an arbitrarily large amount of depth precision would need to be wasted, but at worst, a bound could be set based on the blocker search radius and the angle subtended by the light. With the current default of 1m, we'd 'only' need 220m of shadow map depth range to be nearer the near plane than the nearest shadow receiver.

As a side note, while calculating that I realised I'd got the angle subtended by the Sun on Earth wrong by a factor of ten. When I use the correct value, the soft shadows are a lot less impressive in the example scenes than the screenshots above.

Reducing the penumbras from 36cm to 3.6cm obviously has a big impact, not least because lots of shadow map texels aren't much smaller, so end up visible. On the plus side, if a new planet gets discovered much, much closer to the Sun than Mercury, we know that what I've created would work pretty well for simulating the shadows there.
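For reference, the ~220m figure can be sanity-checked with some quick arithmetic. This sketch assumes the penumbra/occluder relation r ≈ d * tan(theta/2), i.e. a blocker at distance d in front of a receiver produces a penumbra of radius roughly d * tan(theta/2); inverting that gives the furthest occluder a given search radius can account for:

```cpp
#include <cmath>

// Assumed relation: penumbra radius r ~= occluder distance d * tan(theta/2),
// where theta is the angle subtended by the light. Inverting gives the depth
// range the blocker search needs to cover for a given search radius.
double requiredDepthRange(double searchRadiusMetres, double angleSubtendedRadians)
{
    return searchRadiusMetres / std::tan(angleSubtendedRadians * 0.5);
}

// With the 1m default search radius and the Sun's ~0.533 degree angular
// diameter this comes out around 215m, in the same ballpark as the ~220m
// quoted above.
```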
-
W.r.t. depth clamp, currently this is enabled for the whole scene graph; my plan is to specialize the shadow map rendering so depth clamp is only used for that. This doesn't change the issue you raise, but it's a heads up that I may do some work in this area.
-
W.r.t. the blocker search requiring unclamped shadow/depth values, we don't know the extents that the shadow map requires unless we run a compute bounds traversal, which is expensive and not something we'd want to do on the fly. So we could pick a conservative value for the near plane of the shadow map, pushing it back towards the light to make sure it captures all the geometry that we think is relevant. I think you are suggesting we do this, but limit how far we push the shadow map near plane back.

When originally implementing the shadow map rendering I did experiment with not clipping the depth value, with the intention of just having unbounded floating point values rendered to the shadow map. I didn't get this to work, but I may well have just been doing it wrong. If it is possible, then it might be worth pursuing that rather than resorting to using the depth clamp.
-
@robertosfield Does vsg-dev/vsgExamples@31f288f seem like the right approach?
-
I've got a performance comparison from yesterday's change, which made it so the inverse shadow matrix is computed on the CPU instead of in the fragment shader:

64 samples

8 samples

So I think it's fair to say it barely helps, if at all. I don't have an explanation for why it ends up slower for the single-light, single-shadow-map, 64-sample case, but it's consistent when I come back to it, so it doesn't appear to be down to something like a background task running on my machine while I was doing that one test.
-
Some more investigation into the impact of the stochastic rotation of the poisson disk. The most obvious impact is that it converts artefacts with patterns into noise, which should be less visually impactful. When there are only a few samples, the most obvious problem is fairly severe banding, and that gets pretty effectively turned into much less severe noise. However, it would make intuitive sense if the banding wasn't a problem in the first place when the sample count was turned up high, yet there's still clearly some noise under certain conditions when you look closely at the shadows even with sixty-four samples, so there must be some kind of signal that the rotation is scrambling to create that noise. Disabling the rotation makes its cause pretty clear:

From a little local testing, I've determined that most of the noise visible through the rotation is caused by 64 samples not being enough to totally eliminate banding, unless the resolution's really bad. That's irritating, as it's not like simply throwing more samples at the problem is an entirely viable solution. The results from the previous post demonstrate that going from eight samples to sixty-four halves the framerate, so going from 64 to 128 might well halve it again, but will only halve the intensity of the noise.

However, I don't think this justifies abandoning the current approach and switching to moment shadow maps or anything like that. I imagine that a big part of why I'm noticing this is that I've spent a long time staring at these shadows, so am hyper-aware of any defect that most people wouldn't notice. Additionally, the noise is very attenuated by a gentle blur or a small amount of scaling of the final image (in fact, you might not be able to see there's noise at all in GitHub's preview until you click the image to see it at its original resolution), so I'd expect that any application using shadows in addition to bloom, depth of field, or temporal anti-aliasing won't have visible noise. Also, most applications will be using textured meshes instead of solid colours like this test scene, and the noise will be far less significant than colour variations from textures.

8 samples, no rotation
8 samples, rotation
64 samples, no rotation
64 samples, rotation
-
Also, I just did a quick test of the various techniques that now exist on my branch.
Eight samples (for soft techniques)
| technique | average framerate |
|---|---|
| --pcss | 557.26 |
| --pcf | 646.871 |
| --hard | 711.961 |
| --none | 708.328 |
-
Thanks for all the details. It sounds like you're close to exhausting what you can do with this type of PCSS algorithm. Do you feel the implementation is close to ready to merge with the main VSG?

It's a curious finding that the CPU inverse is not noticeably faster; this suggests that read memory bandwidth is a significant cost compared to the cost of computing an inverse. This doesn't sound right, but your observations are hard to explain away otherwise. Trying different hardware might be informative - what hardware are you working on for these tests? Have you tried looking at the results with multisampling enabled?
-
Another bit of data:

The result for 0.01 units with eight samples is particularly interesting as it's faster than fixed-radius PCF at 0.05 units, despite doing extra maths and twice as many texture accesses. Together, these numbers are a strong sign that the biggest problem is simply waiting to sample bits of the texture that aren't in the cache, and there might be decent gains to be had by reordering the sample points in the poisson-like disk so that bits of the sampled texture near to each other get sampled in quick succession, reducing the chance that relevant cache lines have been evicted. The tool I used to generate the sampling disk has a feature intended for this that I haven't used yet, so I'll see if I can generate something that runs faster without sacrificing the progressive property that allows the same disk to be used for small and large sample counts.
-
It looks like the tool can either do the cache optimisation or maintain the progressive property, but can't do both (and the cache optimisation seems a little naive, as it's just sorting points by their y coordinate). However, it definitely impacted performance:
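For reference, the naive y-sort described above might look like this (an illustrative sketch, not the tool's actual code):

```cpp
#include <algorithm>
#include <vector>

struct Vec2 { float x, y; };

// Naive cache-friendliness pass: order the disk's sample points by their
// y coordinate so consecutive texture fetches tend to touch nearby rows.
// Note this reshuffles the whole disk, which is why it destroys the
// progressive property (prefixes of the disk are no longer well-distributed).
// A space-filling-curve order (Morton/Hilbert) would likely do better.
void sortSamplesByY(std::vector<Vec2>& disk)
{
    std::sort(disk.begin(), disk.end(),
              [](const Vec2& a, const Vec2& b) { return a.y < b.y; });
}
```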
-
I tried merging your vsgExamples soft-shadows branch with the main repo's vsgExamples soft-shadows branch, but I didn't get any shadows except when using --technique hard; using pcss I don't see any shadows. With soft I do see shadows, but not if I use more than one shadow map - when I use --technique soft I only seem to get shadows for what is probably the nearest shadow map. I tried using your repo's soft-shadows branch directly as well, but get the same result. Could there be a check-in missing?
-
I have merged the soft shadows work into VSG and vsgExamples as soft-shadows branches: https://github.com/vsg-dev/VulkanSceneGraph/tree/soft-shadows Let me know if I've made any mistakes in the merge. To test things I tried:

```
vsgshadow --technique hard --sm 3 --dc --direction 1 1 -1.0
vsgshadow --technique pcf --sm 3 --dc --direction 1 1 -1.0
vsgshadow --technique pcss --sm 3 --dc --direction 1 1 -1.0
```

In testing I found that all of the techniques are hitting up against clipping of the shadow map, even with depth clamp on. Unfortunately GitHub doesn't allow me to attach .vsgt files, so I've renamed them to .txt: problem_small.txt

To test it out for the large dataset I used this command line:

```
vsgshadow --technique hard --sm 3 --dc --samples 8 --shadow-samples 64 --direction 1 1 -0.5 -n 2 --large --sd 1000 -p problem_large.vsgt
```

I tested this same path and command line with VSG + vsgExamples master and see the same problem, so this clearly isn't a soft-shadows-specific issue. Low-angle light directions seem to be what provokes it.
-
As another simplification step I have modified the vsgshadow example so that it sets up the shader hints just once, then uses this for both the phong and pbr ShaderSets. This change is checked into the vsgExamples soft-shadows-simplfied branch. There is still more I'd like to do to get things ready for merging with master, but that will have to wait until I return to this tomorrow.
-
I have refactored the way vsgshadow sets up the ShadowSettings and shaderHints so it's all done more locally. I'm now wondering about how we might combine the required defines into the ShadowSettings subclasses.
-
I checked in a shader optimization to the soft-shadows-simplfied branch of vsgExamples that changes how the loops are tested and exited. With the huge_medieval_battle_scene test model:

The PCF case is a bit skewed because it's using a penumbra radius of 0.1 on a model that is so small. I've also changed the example to enable the radius to be passed in; with that I get 64fps, but I've left the same values as the original test so we can compare like for like. Restructuring the PCSS shader a bit further so that the occluder search uses the same approach I've taken for the final sampling might net an additional improvement in fps.

As part of the changes I've made today, I've changed vsgshadow so that it takes --hard, --pcss and --pcf radius command line options for toggling the technique, rather than the previous --technique command line option. This allows vsgshadow to be a bit more centralized and easier to follow. I'm seeing a regression vs VSG/vsgExamples master in vsganimation with hard shadows, but I haven't yet got to the bottom of it.
-
I have added the ability to override the ShadowSettings per Light, or as a catch-all in ViewDependentState. Changes to VSG: vsg-dev/vsgExamples@888070c This change should make it possible to have different Views with different ShadowSettings, such as reducing/increasing visual quality.
-
@AnyOldName3 I'm now close to being ready to merge the soft-shadows-simplified branch with VSG master. Are there any issues you can think of that I should wait for? I am thinking about renaming shadow_pcf.glsl and the associated VSG_SHADOW_PCF to shadow_soft.glsl and VSG_SHADOW_SOFT respectively, as PCF is a bit cryptic. We could add doxygen and shader comments explaining the algorithm details.
-
During testing I found the vsgtextureprojection example was no longer creating textures, so I've updated vsgExamples/data/shaders/textureprojection_phong.frag to be consistent with the new standard_phong.frag that utilizes the new shadow*.glsl shaders. The following change also had to be made to enable shadows:

```cpp
auto shaderHints = shaderSet->defaultShaderHints = vsg::ShaderCompileSettings::create();
if (numShadowMapsPerLight > 0)
{
    shaderHints->defines.insert("VSG_SHADOWS_HARD");
}
```

Previously hard shadows were supported out of the box without the need for any additional defines, so the soft-shadows branch work will break them as things currently stand. We could possibly just have VSG_SHADOWS_HARD always be built into the shaders. This would at least help existing applications mostly work as before when they update to VulkanSceneGraph-1.1.3.

I am thinking about adding a ShadowSettings std::string define member variable that could be used to pass on to the ShaderSet configuration. HardShadow::define would default to VSG_SHADOWS_HARD, SoftShadow::define to VSG_SHADOWS_SOFT and PercentageCloserSoftShadows::define to VSG_SHADOWS_PCSS. ShaderSets don't know about ShadowSettings, so figuring out which optional code paths to enable would still need to be done explicitly by applications, even if it could be made less hardwired than passing a specific "VSG_SHADOWS_*" string.
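A minimal sketch of that idea (hypothetical classes for illustration, not the actual VSG API):

```cpp
#include <string>
#include <utility>

// Hypothetical sketch: each ShadowSettings subclass carries the shader
// define it needs, so application code can forward it to the ShaderSet
// configuration without hardwiring "VSG_SHADOWS_*" strings itself.
struct ShadowSettings
{
    std::string define;
    explicit ShadowSettings(std::string d) : define(std::move(d)) {}
    virtual ~ShadowSettings() = default;
};

struct HardShadows : ShadowSettings
{
    HardShadows() : ShadowSettings("VSG_SHADOWS_HARD") {}
};

struct SoftShadows : ShadowSettings
{
    SoftShadows() : ShadowSettings("VSG_SHADOWS_SOFT") {}
};

struct PercentageCloserSoftShadows : ShadowSettings
{
    PercentageCloserSoftShadows() : ShadowSettings("VSG_SHADOWS_PCSS") {}
};
```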
-
I have decided to return the phong and pbr ShaderSets to enabling the VSG_SHADOWS_HARD code path by default so compatibility with older versions of the VSG is a bit more seamless. Users will still need to change `light->shadowMaps = numShadowMaps;` to `light->shadowSettings = vsg::HardShadows::create(numShadowMaps);`, but this is necessary to provide a coherent and extensible means of defining the desired shadow rendering technique, a key part of the evolution of the VSG's API to handle increasingly sophisticated types of rendering out of the box.

I have now tackled all the issues I've spotted in today's review and testing, so I'm ready to merge with VSG/vsgExamples master. I'll do one more round of review & testing, then merge.
-
I have now merged the soft-shadows-simplified branches of the VSG and vsgExamples with the respective masters:
-
@AnyOldName3 I'm replying here as trying to navigate nested replies is painfully out of chronological order. W.r.t. optimizations, we are working on different OSes, hardware and of course drivers and test models, so I'd expect some variation. The performance regression with storage buffer usage is troubling. For a variable-sized data structure a storage buffer is a better fit, but if it's hammering performance then we may need to think about making it fixed size. If it's a cache optimization issue that your hardware/driver is hitting up against, then storing the inverse matrix may be part of the problem.
-
It might be worth taking vsg-dev/vsgExamples@master...AnyOldName3:vsgExamples:speedy-bodge for a spin on hardware other than mine before I put effort into doing it properly. It seems to give me maybe a 10% boost, but I cut corners (e.g. hardcoding things) and don't know how much of the benefit will stay once everything's done right.
-
I didn't do a before and after performance test for the storage buffer change. I made this change for both the jointMatrices used for skinning and the lightData; I changed both to avoid hardwiring sizes in the shaders. I will need to create a branch to test uniform vs storage buffers so it can be tested on a range of systems. I have an Intel Linux laptop and an Intel Windows desktop, both with integrated graphics, and an AMD 5700G Linux desktop, but it has my Geforce 2060 plugged in. I will need to formalize a set of performance tests.
-
I have updated the StorageVsUniform branches of VSG and vsgExamples to enable testing of storage vs uniform buffers for the lightData. This allows us to do tests with a storage buffer (the default) or a uniform buffer (enabled with --ubo), such as:

```
vsgshadow --large -p saved_animation.vsgt -t --duration 5.0 --sm 3 --pcss
vsgshadow --large -p saved_animation.vsgt -t --duration 5.0 --sm 3 --pcss --ubo
```

The animation path I used is attached, but renamed to have a .txt extension; this just needs to be removed to use it as above.

Results on my AMD 5700G + Geforce 2060 Linux system show the uniform buffer is faster by 22% for hard shadows, 12% for PCSS and 6% for soft shadows (with a penumbra of 1.0 for the above test). I didn't see a noticeable difference in vsganimation with a skinned model, so it looks like use of a storage buffer in the vertex shader has far less impact than use in the fragment shader, so it may be that we can split the ticket. I will now test on my Intel Linux laptop and my Intel Windows 11 desktop, both with integrated graphics.
-
On Thu, 4 Apr 2024 at 16:52, Chris Djali wrote:

> > I did try using specialization constants to set the lightData[] array's size but GLSL/SPIR-V didn't allow this.
>
> One of the examples for specialization constants in the GL_KHR_vulkan_glsl spec is setting an array size, so I'm surprised to hear that, unless it's a bug in glslang which might have been fixed. I just tried
>
> ```glsl
> layout(constant_id = 17) const int arraySize = 12;
>
> layout(set = 7, binding = 0) uniform TestUniform
> {
>     mat4 matrices[arraySize];
> } mats;
> ```
>
> and it at least compiles.

It could have been a bug/limitation in an older version of glslang. It's something I tried last year and just couldn't get it to work.
-
@AnyOldName3 This morning I made a series of changes to the VSG to help optimize rendering performance, merging the changes to use a uniform buffer for LightData and then introducing a VSG_ALPHA_TEST #define into the built-in ShaderSets. The performance tests I've done showed the value of these changes, so the defaults will now perform better.

I have also created branches to test whether computing the inverse shadow map matrix on the CPU and passing it to the GPU in the LightData uniform buffer, or computing the inverse shadow map matrix on the GPU when required, was better. The branches that test this out are:

Running vsgshadow with a new --smi command line option moves the computation of the inverse shadow matrix into the PCSS shader and out of the LightData. Performance tests of hard, soft and PCSS shadows on my Linux desktop AMD 5700G, Linux Intel i7 laptop, and Windows 11 Intel i7 + Geforce 2060 desktop failed to show any performance differences between the two configurations, except for a 5.7% difference on my AMD 5700G where --smi (computing in the PCSS shader) was slower than what is in VSG master (computing the inverse shadow matrix on the CPU). I am surprised that there was no measurable penalty for the extra LightData usage on the hard and soft shadow code paths, as neither uses the inverse shadow matrix. I am also surprised that there is no measurable cost to computing the inverse shadow matrix on the Intel i7 onboard GPU on my laptop or on the Geforce 2060.

I like the simplicity of doing the inverse when required on the GPU, but making PCSS a little (~6%) slower on an AMD 5700G and similar hardware is not an easy trade. So for now I think I'll merge changes to VSG master to move the inverse computation back into the PCSS shader. Perhaps this is something to revisit in the future.
-
@AnyOldName3 I have changed the fragment shaders to use the suggested specialization constant approach for setting the lightData[] size: #1169 These changes are now part of VSG and vsgExamples master.
-
I've done some work to implement Percentage-Closer Soft Shadows into the VulkanSceneGraph. It's available on this branch of vsgExamples and requires this branch of the VSG. So far, I've not updated the precompiled shaders as the VSG branch isn't technically dependent on the vsgExamples one, so to see them in action, you'll need to use options that force the shaders to be compiled from source at runtime, or precompile them yourself.
Here are a few screenshots:
Basic scene with five blocker samples and eight shadow samples
This angle actually has several cascade boundaries, but they're well-hidden, which can be a problem when combining PCSS with cascaded shadow maps.
Basic scene with lots of samples
Large scene
Nearly everything's just been changed in the shader - the only change to the VSG itself is having ViewDependentData pass the shadow map array and samplers as separate uniforms instead of using a combined image sampler, which allows the blocker search to access depth values while the shadow samples still benefit from hardware PCF.

The basic approach (after a decent amount of prototyping - there's a description of each decision in the commit history) is:
I've kept the current VSG approach of selecting shadow cascades based on whether a sample point lies within that cascade's light-space bounds. This has made things more complicated than some implementations, which would instead pass the view-space depth range for each cascade to the shader and select based on that, but avoids ignoring a better-quality shadow map in the regions where it spills beyond its intended boundary. There'll be some performance impact of this, although I can't quantify it, but I didn't want to override decisions that had already been made just to make this a bit simpler - it can always be changed later.
Blocker search
The blocker search is conducted across each cascade before any shadow sampling is done - if another approach is taken, there's a risk of missing potential blockers beyond the bounds of a given shadow map, and calculating an inappropriate penumbra radius because of that. The sample points use the same rotated poisson-like disk as the actual shadow sampling, but with a fixed radius - I intend to turn this into a specialisation constant, define or uniform as the most sensible value depends on the scene, its scale and the number of sample points used. If the blocker search radius is too large, and sample points too sparse, it becomes likely that the most important occluders directly between the centre of the light and the shadowed object are missed, and holes appear in the shadow. There are ways to calculate the maximum necessary blocker search radius, but typically it's large and would require many, many blocker samples to ensure nothing important was missed, so most of the time, knowing the nature of the scene, a more appropriate value can be selected manually.
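The blocker-search-and-average step described above can be sketched on the CPU like this. This is a simplified, single-cascade illustration rather than the actual shader code; `sampleDepth` is a hypothetical stand-in for the shadow map texture fetch:

```cpp
#include <functional>
#include <vector>

struct Vec2 { float x, y; };

// Simplified sketch of the blocker search: sample a fixed-radius disk
// around the receiver's shadow map coordinate and average the depths of
// samples that are nearer the light than the receiver (i.e. blockers).
// Returns the average blocker depth, or a negative value if none found.
float averageBlockerDepth(const std::vector<Vec2>& disk,
                          Vec2 centre, float searchRadius, float receiverDepth,
                          const std::function<float(float, float)>& sampleDepth)
{
    float sum = 0.0f;
    int count = 0;
    for (const Vec2& p : disk)
    {
        float d = sampleDepth(centre.x + p.x * searchRadius,
                              centre.y + p.y * searchRadius);
        if (d < receiverDepth) // this sample occludes the receiver
        {
            sum += d;
            ++count;
        }
    }
    return count > 0 ? sum / count : -1.0f;
}
```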
At the moment, the blocker distances are averaged in light-space, but the average blocker for a particular shadow map is converted back to eye space. The first reason for this is that it allows the distances from different cascades with different variants of light space to be combined.
Penumbra radius
The second reason the average blocker is converted to eye space is that not all versions of light space preserve angles between lines. That means that if a technique like Light-Space Perspective Shadow Maps, or even worse plain Perspective Shadow Maps (neither of which I recommend as one of the maniacs who's previously implemented one), are used, we can't know the angle subtended by the light source in light space. We need to know that angle to know how big the penumbra from the average blocker would be. Currently, it's another constant set based on how big the Sun appears on Earth, but I'll eventually move it to the directional light data as not everything people want to render is on Earth and non-fictional. The actual radius calculation is just super-simple trigonometry.
Shadow sampling
As mentioned already, sampling is done using a rotated poisson-like disk scaled to the penumbra radius. Being poisson-like, it avoids moiré patterns and similar problems caused by having sample positions be too ordered, but unlike an actual poisson disk, it has additional properties, such as being progressive, so the same disk works for both small and large sample counts.
The literature I've historically found about shadow map sampling didn't make it clear how much of a difference a good poisson-like sampling disk would make over a bad poisson disk, so I thought this was interesting to discuss here.
The rotation for the poisson disk is calculated based on the same maths as Godot (also MIT-licenced) uses - it takes a hash of the pixel position using a formula originally from Crytek. Other techniques I tried led to obvious banding, significant moving noise patterns or noise that changed so much from frame to frame that surfaces looked like an untuned analogue CRT screen. I didn't keep screenshots, but the Godot approach just worked far better than anything else, especially with low sample counts.
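The hash-then-rotate step can be sketched like this. It's a C++ rendering of the widely circulated Crytek-style GLSL one-liner; the constants are the well-known ones, but treat the exact formula used on the branch as an assumption:

```cpp
#include <cmath>

// C++ rendering of the common GLSL idiom
//   fract(sin(dot(p, vec2(12.9898, 78.233))) * 43758.5453)
// used here to derive a per-pixel pseudo-random rotation angle.
float hashPixel(float x, float y)
{
    float s = std::sin(x * 12.9898f + y * 78.233f) * 43758.5453f;
    return s - std::floor(s); // fract: result in [0, 1)
}

// Rotate a disk sample by the pixel's pseudo-random angle, turning
// banding artefacts into less objectionable noise.
void rotateSample(float& sx, float& sy, float pixelX, float pixelY)
{
    float angle = hashPixel(pixelX, pixelY) * 6.2831853f; // 2*pi
    float c = std::cos(angle), s = std::sin(angle);
    float rx = c * sx - s * sy;
    float ry = s * sx + c * sy;
    sx = rx;
    sy = ry;
}
```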
Like with the blocker search, each shadow map in turn is tried with all the sample points until the total amount that landed within valid light space reaches the target threshold.
The result of having both the blocker search and shadow sampling work this way is really consistent penumbras across cascade boundaries - I can't find them even in screenshots where I know where they should be.