Cheap layout compatibility checks #1464
Conversation
Otherwise you can't have nice things like the same view descriptor set for all shadersets.
Avoids disturbing necessary descriptor sets while still avoiding the need for full layout compatibility checks. This is achieved with a basic `vector<bool>` saying which sets a pipeline layout has and doesn't have, so layouts it doesn't have won't be bound. This could mean different sets are bound than if a full compatibility check was done, but that should only happen when the scenegraph requested invalid API usage - there should only be one `BindDescriptorSet` for each slot at the top of the state stacks, and if it's for a slot the current pipeline layout needs, it must be compatible. Therefore, we only need to worry about compatibility checks for slots the current pipeline layout doesn't use.
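As a rough illustration of the approach described above (a simplified sketch with made-up names, not the actual VSG types):

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch: record which descriptor set slots a pipeline layout
// actually defines, so binds for slots it doesn't have can be skipped.
struct LayoutSlotInfo
{
    std::vector<bool> hasSlot; // hasSlot[i] == true if set slot i exists in the layout
};

// Decide whether a BindDescriptorSet for 'slot' should be applied with the
// currently bound pipeline layout. Slots the layout lacks are just skipped,
// avoiding a full per-binding compatibility comparison.
inline bool shouldBind(const LayoutSlotInfo& layout, uint32_t slot)
{
    return slot < layout.hasSlot.size() && layout.hasSlot[slot];
}
```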
The VSG generates layouts like this when descriptor bindings are dependent on defines, and doesn't generate matching empty BindDescriptorSet instances to go with them. That disrupts the fix in the previous commit, so they must be handled as if they don't exist. Even though the pipeline layout says these sets exist, there's no need to bind anything to them as they inherently can't be statically used by the pipeline.
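A sketch of how that exclusion could look when filling in the slot flags (again simplified and hypothetical; an empty set layout counts as no set at all):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: a slot only counts as present if its set layout has at
// least one binding. Empty set layouts can't be statically used by the
// pipeline, so they're treated as if the slot doesn't exist and nothing is
// ever bound to them.
std::vector<bool> computeSlotFlags(const std::vector<std::size_t>& bindingCountsPerSet)
{
    std::vector<bool> hasSlot(bindingCountsPerSet.size(), false);
    for (std::size_t i = 0; i < bindingCountsPerSet.size(); ++i)
    {
        hasSlot[i] = bindingCountsPerSet[i] > 0;
    }
    return hasSlot;
}
```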
I have just done a first pass review and it looks good to me so far. When I'm ready for a second pass I'll create a branch and do performance testing. I'm not expecting any issues based on my initial code review :-)
What kind of timeline can be expected for this to get the full review? It's been a month since it was opened, so it's already been quite a while.
I am hoping to get most of the outstanding PRs reviewed this week.
@AnyOldName3 could you give me a heads up on which of your PRs are ready to review and merge with VSG master, the priority/order in which they would be best merged, and which ones are lower priority or just for discussion? I ask this as I have a whole bunch of work to do on many different development threads, so it's hard for me to keep track of everything.
I'll put a list of everything here to avoid having to figure out a new place.

Soon

#1464 (i.e. this one) is the highest-priority PR from my perspective. It also includes the changes from #1438, so it might be easier to review that one first.

vsg-dev/vsgXchange#224 doesn't change the API or fix a bug, so nothing has an absolute dependency on it, but it makes a significant difference to text atlas generation time, so it would be nice to have sooner rather than later.

Less soon

AnyOldName3/vsgXchange#1 is a minor docs fix that I apparently opened to the wrong repo ages ago. I think it's slightly more useful than the equivalent thing someone else submitted later, but it's not particularly important now it's not replacing a placeholder.

#1296 is only a proof-of-concept, and doesn't need immediate attention.

#1419 and vsg-dev/vsgExamples#350 are drafts. I've used them to investigate problems by rebasing them and fixing up merge conflicts each time I've needed to, which is a hassle, so I'd rather we figured out a way to make them mergeable but disabled by default eventually, but it's not a priority.

I think we've given up on cgitest, so robertosfield/cgitest#1 can be ignored.
As a first step in the review of this PR I have merged it as a branch. I will track down the old performance tests I was doing to make sure everything is performing well with this branch. On code review it looks reasonable. I might make some small tweaks for readability, but will wait till I'm more familiar with the code.
I have done the benchmarking tests I used earlier in the year and see a 1.2% slow down with this PR. This is probably the upper end of the type of slow down end users might experience. The other improvements prior to this PR still put it ahead of the v1.1.10 and v1.0.9 results, so while in the wrong direction, it doesn't rob us of all the gains my previous optimization work achieved. I will do another code review, this time looking for opportunities to lower the overhead.
That's a big surprise to me, as the simulated worst-case test I did should genuinely have been a worst case but didn't have nearly that much overhead, and I compared the emitted assembly for multiple compilers to ensure that I wasn't relying on vendor-specific optimisations. Do you have a test scene you can share so I can see what's different between it and the ones I generated? The only guess I've got is that maybe yours rebinds one material descriptor over and over and it remains in L1 cache (which would mean doing something explicitly to dirty it), so this branch would make it spill, whereas mine switched between a few materials, so they'd always need binding and be less likely to remain in L1 anyway. Either that, or your Vulkan driver has much less overhead than mine, which would be a good thing to know about for future performance testing.
I don't own the test models I am using so can't share them. They were imported from old OSG models, so probably stress different bits of the VSG than more modern PBR etc. models. Longer term I'd like to have a set of scene generator utilities to create stress-testing models programmatically, so we can create a range of tests and scale them up/down as appropriate.

The code looks like it can be optimized - there are more if statements and variable-sized containers than I think is required or optimal. The first step I've taken on this path is to reorganize the checks into two CommandBuffer::enabled(..) methods; this makes the code more readable and centralizes the checks, so rewriting them won't change the code that depends upon them.

I have put my changes so far into an optimized-layout-compatibility-checks branch: https://github.com/vsg-dev/VulkanSceneGraph/tree/optimized-layout-compatibility-checks The changes so far just introduce the CommandBuffer::enabled() methods: 334b98d

Next up I'm curious about using a fixed-size array for the enabled slots so we can avoid the checks against the container's CommandBuffer::_currentDescriptorSetSlots.size(). I'm also thinking that summing the results when iterating over _currentDescriptorSetSlots and then comparing to the size required might avoid putting if statements in a for loop. I don't know if the test models I'm using would have BindDescriptorSets with multiple DescriptorSets, so this is an optimization that might not be picked up in results.
Now I have put all the checks into two CommandBuffer methods, we can focus on where the optimizations need to be made. First up is the simple single set check:

```cpp
inline bool enabled(uint32_t firstSet) const
{
    return static_cast<uint32_t>(_currentDescriptorSetSlots.size()) > firstSet && _currentDescriptorSetSlots[firstSet];
}
```

Getting the container size and doing the comparison could be replaced by just having a fixed-size _currentDescriptorSetSlots that is big enough for all possibilities. A first step would be to have a std::array, or just use a 32-bit mask.

The multiple set variant is more complex:

```cpp
inline bool enabled(uint32_t firstSet, uint32_t numSets) const
{
    if (static_cast<uint32_t>(_currentDescriptorSetSlots.size()) < firstSet + numSets)
        return false;

    for (uint32_t slot = firstSet; slot < firstSet + numSets; ++slot)
    {
        if (!_currentDescriptorSetSlots[slot])
            return false;
    }
    return true;
}
```

It has three separate conditionals that affect control flow. We could avoid the for loop and all the conditionals by replacing it all with a set of bit mask operations.
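For example, a minimal sketch of what that could look like, assuming slots are tracked in a single 32-bit mask (the struct and member names here are hypothetical):

```cpp
#include <cstdint>

// Hypothetical sketch of the bitmask variant: each bit of the mask says
// whether that descriptor set slot is enabled, assuming at most 32 slots.
struct CommandBufferSketch
{
    uint32_t _currentDescriptorSetSlotMask = 0;

    inline bool enabled(uint32_t firstSet) const
    {
        return (_currentDescriptorSetSlotMask >> firstSet) & 1u;
    }

    inline bool enabled(uint32_t firstSet, uint32_t numSets) const
    {
        // Mask with numSets consecutive bits set, starting at bit firstSet;
        // assumes firstSet + numSets <= 32, so the shifts are well defined.
        uint32_t required = (numSets >= 32 ? ~0u : ((1u << numSets) - 1u)) << firstSet;
        // All required slots must be enabled - no loop and no branches.
        return (_currentDescriptorSetSlotMask & required) == required;
    }
};
```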
The implementation here is all vector of bool stuff, so it's already all bitmask operations under the hood, because I'd expected the loop in the multi-set check to compile down to mask tests.

If we're happy to limit the VSG to 32 or 64 descriptor slots instead of leaving the count unbounded, a fixed-size mask becomes viable.

Annoyingly, that's still not enough to convince the big three compilers to fully unroll the loop in the multi-set check.
I have replaced the use of std::vector with a uint32_t and bitmask maths, and performance on my test models is back to where it was, at 1.4% faster than the original PR. The changes require two fewer lines of code as well. The changes are in commit: 32eb8c4.

So the new code works faster, and in my preliminary testing I can't see any regressions, but I'm just testing with standard models that worked prior to this code anyway. So my next step is to check the layout tests. I don't recall them off the top of my head so will have to look back over past communications.

@AnyOldName3 could you do a code review of my https://github.com/vsg-dev/VulkanSceneGraph/tree/optimized-layout-compatibility-checks branch? Thanks.
At first glance, I'd worry it might be slower, as moving the checks to inline functions in the CommandBuffer definition stopped them being inlined with MSVC.
Inlining is a hint to the compiler to inline the code where possible; if it's a f*cking awful compiler that can't inline very simple code, which this code is, then it's an utterly shameful mess that everyone should steer clear of. This new code is written to be VERY easy for the compiler to optimize. No if statements, no loops, just simple bit maths. This is how you should have written the code in the first place if you had wanted it to be efficient.

I have merged your vsgincompatiblepipelinelayouts example as a branch of vsgExamples and updated it to compile against the recent changes to the built-in ShaderSets. Further details on: vsg-dev/vsgExamples#359

I've now done as much as I can do to get this work ready to be merged with VSG master. I now need you to provide some guidance on how to test using vsgincompatiblepipelinelayouts, and to do some testing yourself.
It turns out that Godbolt had changed some of my compiler flags at some point, so inlining of non-templates was disabled. Either way, now that's sorted out, I get nearly identical assembly with all three major compilers with your branch.

I'll do performance testing with the scenegraphs I tested before, but I'm not anticipating any problems. On all the platforms I tried, the bitset<32> alternative I looked at generated equivalent code to the uint32_t version.
I have made a similar change to PipelineLayout.cpp as [your](AnyOldName3@1917a35) fix to the ordering of slot assignment. I chose to stick with accumulating the result in a temporary variable and then assigning the completed result, to avoid any potential threading issues, as CPUs guarantee consistent results on assignment at the word level. The commit is: 3a97706

I've reviewed your changes to use bitset<32> and this is close to my uint32_t implementation; it's a bit cleaner to read, but adds an extra header include and templates for the compiler to parse. Performance-wise I expect it should end up the same as uint32_t, as it won't pay the penalty for the dynamic allocations, copying overhead, pointer indirection and extra conditions that the variable-sized vector will have had.
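A minimal sketch of that accumulate-then-assign pattern (hypothetical names, and assuming the word-level assignment guarantee described above):

```cpp
#include <cstdint>

struct PipelineLayoutSketch
{
    uint32_t _descriptorSetSlotMask = 0;

    // Placeholder for the real check of whether set 'slot' has any bindings.
    bool setLayoutHasBindings(uint32_t /*slot*/) const { return true; }

    void updateSlotMask(uint32_t numSets)
    {
        // Accumulate the full result in a local first...
        uint32_t mask = 0;
        for (uint32_t slot = 0; slot < numSets && slot < 32; ++slot)
        {
            if (setLayoutHasBindings(slot)) mask |= (1u << slot);
        }
        // ...then publish it with a single word-sized store, so a concurrent
        // reader sees either the old or the new complete mask, never a
        // half-updated one.
        _descriptorSetSlotMask = mask;
    }
};
```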
I can confirm that that branch works, and using the same tests as before, the overhead is around 0.14%, so a bit better than the branch from this PR.
It's been a couple of weeks since I confirmed that the optimized-layout-compatibility-checks branch works. As far as I'm aware, there isn't anything else I'm supposed to be doing to move it forward, so I'm still waiting for you to finish your performance tests and merge it.
Sorry. I have ended up having to complete other work first. That is almost complete. Hoping it'll be out today or tomorrow.
I have done more testing comparing master with AnyOldName3-vague-layout-compatibility-checks (vector based) and optimized-layout-compatibility-checks (uint32_t based), and depending upon the test model I see between a 0.5% and 3% slow down. The 3% slow down is for an OpenFlight city model that contains billboarded trees which will be dropped into the transparent bin, so I think this is the sticking point. I haven't dug into specific causes of the slow down, so can't suggest possible optimizations to resolve it yet. What I am thinking is that we need a couple of tools to help with testing.

Such capabilities are an open-ended task that one could easily spend months working on, but for the purposes of catching slow downs and helping figure out bottlenecks we need something sooner, so we will need to start very simple. I'm thinking of starting with a simple terrain grid with trees distributed over it and a camera path to do the unit tests with, along the lines sketched below. If it's synthetic then there will be no issues with copyright or hosting databases.
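For illustration, the kind of deterministic placement such a generator could start from (purely hypothetical code, not anything that exists in the VSG yet):

```cpp
#include <cstddef>
#include <vector>

struct Position { float x, y, z; };

// Hypothetical sketch: deterministic tree positions over a flat terrain grid,
// so the same stress-test scene can be regenerated on any machine without
// needing to host or license model databases.
std::vector<Position> generateTreePositions(int gridSize, float spacing)
{
    std::vector<Position> positions;
    positions.reserve(static_cast<std::size_t>(gridSize) * gridSize);
    for (int row = 0; row < gridSize; ++row)
    {
        for (int col = 0; col < gridSize; ++col)
        {
            positions.push_back({col * spacing, row * spacing, 0.0f});
        }
    }
    return positions;
}
```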
When I've been doing performance testing, I've generally found that using any real data has given me much, much smaller impacts than you've had from the same changes, typically far too small to measure against the variance I see from run to run. So, to try and get results that indicate what you'll see, I've been designing scenegraphs that hammer the modified code paths as often as possible, and then altering them so different iterations read different data to see how much impact that has - obviously running a hundred instructions in a loop will go at different speeds if it's only reading from L1 cache.

I've got a patch for vsgviewer that adds flags to duplicate either the whole scenegraph or nodes within it. If I don't add lots of duplicates of the same data, then the frametime is nearly entirely spent outside the code this PR touches.

Either way, 3% seems like a very large impact given what I've been able to measure. The transparent bin shouldn't mean the modified code was running more often than in my tests, as at least one of them included rebinding the pipeline and every descriptor set between each draw, just like a renderbin would. I don't think as much as 3% of instructions executed or memory accessed would be things that weren't happening before, based on what I saw when looking at the assembly etc., but there are confounding factors like whether we're pushing things out of the cache that the driver was using, and that's not going to be consistent between hardware vendors. I'll investigate further with translucent drawables and see if I can get different results from my previous testing.
I'm seeing a 3% slow down on a real-world model; it's a consistent enough result to know it's not just noise. I don't yet know the specific cause of the bottleneck, and until I have a performance benchmark written that reproduces it reliably, there is no point speculating on its cause or how to resolve it.

It really sucks that I'm seeing this performance regression, because I really could do without yet another unpaid task to follow up on before I can get on with tagging the next point release. However, a 3% regression for resolving a niche issue isn't something I can just let by. Good performance can die from a thousand paper cuts.
Pull Request Template
Description
Avoids disturbing necessary descriptor sets while still avoiding the need for full layout compatibility checks.
This is achieved with a basic `vector<bool>` saying which sets a pipeline layout has and doesn't have, so layouts it doesn't have won't be bound. This could mean different sets are bound than if a full compatibility check was done, but that should only happen when the scenegraph requested invalid API usage - there should only be one `BindDescriptorSet` for each slot at the top of the state stacks, and if it's for a slot the current pipeline statically uses, it must be compatible. Therefore, we only need to worry about compatibility checks for slots the current pipeline layout doesn't use, which the change here now does.

There's a slight caveat here - if a pipeline doesn't statically use any bindings in a descriptor set, it's legal to bind nothing to it, so someone might make a scenegraph that leaves its `BindDescriptorSet` out of a stategroup and inherits something incompatible. If a descriptor set is empty, it can't be statically used, so that's treated as if there's no descriptor set. Any handling beyond that would have a larger performance overhead, so I think use cases other than this aren't worth supporting directly. If someone wants to avoid binding useless descriptor sets, there are other means, e.g. adding a dummy statecommand or not having a pointless descriptor set in the first place.

This is based on #1438, even though technically it could be separated and used on its own. This is because my earlier attempts to fix the problem relied on that PR, and using the same baseline made comparing them more straightforward.
This fix is by far the fastest that helps every failure case I identified (before yesterday), with the performance impact of a worst-case scenario ranging from below the noise floor of my measurements to just 0.25% overhead depending on whether I cherry-pick a good run or my worst run. For a scenegraph that isn't designed specifically to make this have as large an impact as possible, it should be free - it does a few instructions extra work for each pipeline layout change and descriptor set binding, both of which take much longer than that already.
Fixes #1394
How Has This Been Tested?
The main interesting test is the incompatible layout example here: vsg-dev/vsgExamples@master...AnyOldName3:vsgExamples:incompatible-layouts. As discussed in #1394, it replicates the problem I initially discovered in a client app, plus several related situations that I discovered and thought were likely to have problems. There's lots of detail in this post: #1394 (comment). With the master branch, lots of validation errors are emitted in most test cases, but with this branch, there are no errors at all.
Obviously, I've confirmed that this fixes the problem in the original client app.
As well as this, I've done performance testing. On my machine, the overhead is much smaller than the variation in framerate or execution time from run to run, so simply timing things didn't provide meaningful results. Instead, I found that running for a couple of minutes with a sampling-based profiler (I used the one built into Visual Studio) and comparing the percentage of samples in the modified functions was much more consistent.

To get a worst-case scenario and maximise the observable overhead, I generated a scenegraph where the same basic cube is drawn thousands of times, cycling between 32 different materials, with the number of draw commands chosen to give around a hundred frames per second. Trying fewer draws and more frames just made the noise from slow functions (particularly `vkCmdBeginRenderPass` and `vkBeginCommandBuffer`) overwhelm the overhead I was trying to measure, but trying a smaller number of slower frames meant they were called less, which countered this out. Other variations on this style of scenegraph produced equivalent results.

As mentioned above, if I cherry-pick the fastest baseline result and the slowest result from this branch, and compare the number of samples in modified functions, that only accounts for 0.25% of all samples. If I don't artificially try and make things look as bad as possible, then the difference is too small to measure.