Most efficient transform upload #284

RobertMtx · 2024-03-04T12:07:13Z

What would be the most efficient way to upload transforms to the shader?

Most objects will have a single static transform (one matrix per rendered object), but characters will need 40+ dynamic transforms each, and some objects will have a single dynamic transform so they can move around, be physics-based, etc.

Even though most are 1-to-1, I could still upload them to a shader-view buffer and place an index into their instance data to access the buffer, if that would be the smarter way to handle it.

My first instinct was to just throw the transform(s) into the per-instance vertex data. But it looks like 4x4 matrices are not supported as a data type there (whereas float4x4 is supported in global space), so I would need to upload each transform as 4 float4 values. This isn't horrible, but I wasn't sure how it would play out when there are 40+ dynamic matrices controlling each character.

A 3rd method I've considered is to place all static data into static buffers, and then just offset them for each object. But doing this is the same as using dynamic buffers according to sample documentation?

I also considered switching to quaternion+float4 transforms to cut the data in half, and writing a basic quaternion shader header. Any idea how well something like this would perform? I wasn't sure if hardware was specifically designed to crunch on matrices, making quaternions a bad idea. I could test it, but it would be a lot of coding just for a test. Either way, this would be a secondary concern, because I still have to upload them (although per-instance data would fit better for this).
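To make the size trade-off concrete, here is a minimal CPU-side sketch of the quaternion+float4 layout next to a full matrix. The type and function names (`MatrixTransform`, `QuatTransform`, `transformPoint`) are hypothetical, and a shader version would do the same arithmetic in HLSL. It only shows that the data halves and that applying the transform costs two cross products; it says nothing about actual GPU performance.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical layouts, for size comparison only.
struct MatrixTransform { float m[16]; };            // 64 bytes
struct QuatTransform   { float q[4]; float t[4]; }; // 32 bytes: unit quaternion (x, y, z, w) + translation (t[3] unused)

static_assert(sizeof(MatrixTransform) == 64, "a float4x4 takes four float4 slots");
static_assert(sizeof(QuatTransform) == 32, "quaternion + float4 takes two float4 slots");

struct Vec3 { float x, y, z; };

inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Rotates v by the unit quaternion, then translates:
//   v' = v + 2w(u x v) + 2u x (u x v) + t,  where q = (u, w).
inline Vec3 transformPoint(const QuatTransform& tr, Vec3 v) {
    const Vec3  u{tr.q[0], tr.q[1], tr.q[2]};
    const float w = tr.q[3];
    // The formula is only valid for a normalized quaternion.
    assert(std::fabs(u.x * u.x + u.y * u.y + u.z * u.z + w * w - 1.0f) < 1e-3f);
    const Vec3 uv  = cross(u, v);
    const Vec3 uuv = cross(u, uv);
    return {v.x + 2.0f * (w * uv.x + uuv.x) + tr.t[0],
            v.y + 2.0f * (w * uv.y + uuv.y) + tr.t[1],
            v.z + 2.0f * (w * uv.z + uuv.z) + tr.t[2]};
}
```

Note that this representation drops scale and shear, so it only fits rigid transforms unless a scale is stored in the unused translation component.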

I will be doing instanced drawing whenever the data is convenient. But I've fallen into a trap in the past, where I uploaded everything into dynamic buffers just to allow me to instance-draw nearly everything, and I got pretty bad performance out of it (and with a $4K graphics card). Too much data to upload every frame. So I'm retooling the system to use primarily static or default buffers, except for data that changes every frame, such as animated or physics objects. I've done very little instanced drawing in past APIs and got great performance, so I'm hoping that not making that the focus will help.

I realize this question also depends on the API. I will be using DirectX12 primarily, but possibly supporting Vulkan later on.

I appreciate any advice!

Edit: I forgot to mention that I'm using a single global resource signature for all world objects. I could change this, but definitely prefer not to, due to all of the changes it would bring with it. The single resource signature makes it tricky to have individual static shader-view buffers per object or area, because I would need to call SRB->Set() for each draw call. This means I'm pretty much restricting myself to using a shared (between all objects) static buffer for the entire active area, per purpose, or associating data as per-instance vertex data.

Answered by TheMostDiligent

TheMostDiligent · 2024-03-04T19:45:04Z

Maintainer

A general rule of thumb for GPU programming is to perform as few updates, or issue as few commands for that matter, as possible.
So the most efficient way would be to upload all your transforms into a buffer (e.g. a structured buffer) and then load them from that buffer using e.g. the instance ID or another object identifier.
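As a rough illustration of this layout, the sketch below keeps a CPU-side mirror of one shared transform buffer and hands each object the index of its first matrix. `TransformTable`, `InstanceData`, and `Float4x4` are hypothetical names, not Diligent types; uploading the array to an actual structured buffer and reading it in the shader are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float4x4 { float m[16]; };

// Hypothetical CPU-side mirror of a single shared buffer holding every transform.
class TransformTable {
public:
    // Reserves `count` consecutive slots and returns the index of the first one.
    // A static object asks for 1 slot, a skinned character for 40 or more.
    std::uint32_t allocate(std::uint32_t count) {
        assert(count > 0);
        const auto first = static_cast<std::uint32_t>(data_.size());
        data_.resize(data_.size() + count, identity());
        return first;
    }

    Float4x4& at(std::uint32_t index) {
        assert(index < data_.size());
        return data_[index];
    }

    // Pointer and size of the whole array, i.e. what a single upload would copy.
    const Float4x4* data() const { return data_.data(); }
    std::size_t sizeInBytes() const { return data_.size() * sizeof(Float4x4); }

private:
    static Float4x4 identity() {
        Float4x4 r{};
        r.m[0] = r.m[5] = r.m[10] = r.m[15] = 1.0f;
        return r;
    }
    std::vector<Float4x4> data_;
};

// The only per-object value the draw needs: where its transforms start in the table.
struct InstanceData { std::uint32_t firstTransform; };
```

In the shader, an object would then read its matrix at `firstTransform`, and a character would read bone `i` at `firstTransform + i`, so the same indexing scheme covers both the one-matrix and the 40+ matrix cases.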

A 3rd method I've considered is to place all static data into static buffers, and then just offset them for each object. But doing this is the same as using dynamic buffers according to sample documentation?

Yes, there will be some overhead for each draw call, so it is better to use an index in the shader.

I also considered switching to quaternion+float4 transforms to cut the data in half, and writing a basic quaternion shader header. Any idea how well something like this would perform?

Using less data is always beneficial. How much? You never know until you measure.

But I've fallen into a trap in the past, where I uploaded everything into dynamic buffers just to allow me to instance-draw nearly everything

This may still be a viable approach if you batch draw calls. Your transform matrix buffer may be dynamic. Unless you bind a new buffer before each draw call, it should be fast too.

I forgot to mention that I'm using a single global resource signature for all world objects.

You can add your transform buffer to this global signature. It is not a problem at all if only one shader uses this buffer: there is no overhead in having unused resources in the signature.

It is hard to tell which method will work best as there are many factors that may affect the performance. Always measure how much benefit each method gives.

4 replies

RobertMtx Mar 7, 2024
Author

Thanks, I appreciate all of the advice. I was not aware of the fact that having unused resources in a signature would have no negative effects. That's great to hear.

I actually have a question about data updates that need to happen more than once per frame. This is the first time I've ever tried to update shader-view data more than once per frame, and I'm running into problems with it not working correctly. The data is a single uint value, so I stored it in a uint4-sized struct:

struct AREA
{
	uint Env;
	uint Reserved1;
	uint Reserved2;
	uint Reserved3;
};

The Env value is an index into which room/environment is currently being rendered. The data is uploaded via a dynamic buffer, twice per visible room, per frame. There are usually 1-4 environments visible at a time, and the engine iterates through them, overwrites Env to match the active room slot for each one, then renders that entire room all at once, then moves on to the next. Since I use shadowmapping, I have to render each room twice. Currently, I render all rooms to shadows, then render all rooms to normal polys. This means I have to update the uint twice per room, per frame.

As soon as I moved this index from vertex instance data (per object) to a shader-view global, the wrong index started getting used. I'm assuming the value is not changing or being updated when it's supposed to. But I'm not sure why, or if what I'm doing is even allowed. Since I'm just issuing commands-only, and the GPU doesn't actually draw things until later, I'm worried that everything (all rooms) end up drawing with whatever index was written last, and the first indices for rooms 0-3 (or whatever) are ignored.

Secondly, this might be the type of situation where I would need to use a fence. I have not had many reasons to use these yet, apart from moving resources between states (render targets) with Diligent::StateTransitionDesc -> TransitionResourceStates() (are these fences?). I'm not very experienced with them yet, so it would be easy for me to overlook their necessity. Would I need to use something like a fence in order to update a single global integer between draw calls?

Lastly, do you know if there is an easier way to do something like this? I'm not uploading a single integer to save bandwidth or anything. This was just the best way I could imagine doing this. All active environments (rooms with light data) are always uploaded/available (another data array), but each individual object needs to know which environment to use (env = Environments[Area.Env]). But this whole concept of requiring a separate buffer and segmented upload stream just to update a single integer, multiple times per frame, and it not even working correctly, is not exactly singing harmony. Do you have any advice on how to do this a better way?

TheMostDiligent Mar 7, 2024
Maintainer

I'm worried that everything (all rooms) end up drawing with whatever index was written last, and the first indices for rooms 0-3 (or whatever) are ignored.

Diligent performs proper versioning for you. When you map a dynamic buffer, it gives you new memory each time (unless you ask it not to), so draw commands recorded earlier keep reading the data that was written before them.
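The toy model below is one way to picture that versioning. It is a conceptual sketch only, not Diligent's implementation: `ToyDynamicBuffer`, `RecordedDraw`, `recordRooms`, and `execute` are hypothetical names. Every write lands in fresh memory, and each recorded draw remembers which version was current when it was recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a dynamic buffer holding a single uint (the Env index).
class ToyDynamicBuffer {
public:
    // Models "map + write": the value goes into a new slot, older slots stay intact.
    std::size_t mapAndWrite(std::uint32_t value) {
        versions_.push_back(value);
        return versions_.size() - 1;
    }

    std::uint32_t read(std::size_t version) const {
        assert(version < versions_.size());
        return versions_[version];
    }

private:
    std::vector<std::uint32_t> versions_;
};

// A recorded draw captures the buffer version that was current at record time.
struct RecordedDraw { std::size_t areaVersion; };

// CPU side: overwrite Env once per room, then record that room's draw.
std::vector<RecordedDraw> recordRooms(ToyDynamicBuffer& area,
                                      const std::vector<std::uint32_t>& envSlots) {
    std::vector<RecordedDraw> draws;
    for (std::uint32_t slot : envSlots) {
        draws.push_back({area.mapAndWrite(slot)});
    }
    return draws;
}

// "GPU" side, running later: each draw reads the version it captured, not the latest one.
std::vector<std::uint32_t> execute(const ToyDynamicBuffer& area,
                                   const std::vector<RecordedDraw>& draws) {
    std::vector<std::uint32_t> seen;
    for (const RecordedDraw& d : draws) {
        seen.push_back(area.read(d.areaVersion));
    }
    return seen;
}
```

Under this model, recording rooms with slots 2, 0, 3 and executing afterwards yields 2, 0, 3 again, rather than three draws that all see the last value written.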

Secondly, this might be the type of situation where I would need to use a fence.

No, a fence is for synchronization between the GPU and the CPU.

Diligent::StateTransitionDesc -> TransitionResourceStates() (are these fences?).

No, these are barriers, which are very different. They are for GPU->GPU synchronization. In the majority of cases Diligent can handle them for you.

Would I need to use something like a fence in order to update a single global integer between draw calls?

No

Lastly, do you know if there is an easier way to do something like this?

A dynamic buffer is OK. You can also try a DEFAULT buffer + UpdateBuffer. Measure which gives better performance.

But this whole concept of requiring a separate buffer and segmented upload stream just to update a single integer, multiple times per frame, and it not even working correctly, is not exactly singing harmony.

As an alternative, you can use single-instance draw calls and pass the index as the first index. Then, in the per-instance vertex buffer you can pass the index. You can use an instance step rate of 0 to have this value passed to all instances.
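To see why a step rate of 0 gives every instance the same value, the sketch below models which element of a per-instance vertex buffer a given instance reads. The function name `perInstanceElement` is hypothetical, and the rule follows the way Vulkan defines vertex attribute divisors; that D3D12 behaves the same way for a step rate of 0 is an assumption worth verifying on your target back end.

```cpp
#include <cassert>
#include <cstdint>

// Returns the element of a per-instance vertex buffer read by one instance.
//   firstInstance  - first instance location passed to the draw call
//   instanceInDraw - 0-based instance number within that draw call
//   stepRate       - instances that share one element; 0 means "never advance"
std::uint32_t perInstanceElement(std::uint32_t firstInstance,
                                 std::uint32_t instanceInDraw,
                                 std::uint32_t stepRate) {
    if (stepRate == 0) {
        return firstInstance; // every instance in the draw reads the same element
    }
    return firstInstance + instanceInDraw / stepRate;
}
```

With a small buffer holding one Env index per room slot, drawing a room with the first instance set to its slot would then make all of its instances read that slot's value without any buffer update between draws.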

RobertMtx Mar 7, 2024
Author

Thanks for the info! So you're saying my problem more likely lies somewhere in a flawed execution of what I'm doing, rather than just missing a step of GPU communication somewhere. I would certainly prefer that scenario. Thanks again for your help!

TheMostDiligent Mar 7, 2024
Maintainer

Yes, your scenario is very typical and does not require any special synchronization.
