Avoid creating staging buffers in TransferTask that exceed the Vulkan physical device property limit #1473

Open

rms7326 wants to merge 2 commits into master from transfer-task-allocation
Conversation

@rms7326 commented May 21, 2025


Description

TransferTask::_transferData() can create a single staging buffer whose allocation size exceeds the Vulkan physical device maxMemoryAllocationSize property limit. Windows NVidia drivers tend to be very strict about keeping allocations at or below the stated maximum memory allocation size, whereas Linux drivers are more forgiving.

Now, if the regions to transfer collectively exceed the maximum memory allocation size, TransferTask::_transferData() creates a new staging buffer for the region that would overflow the current buffer and for any subsequent regions. If a single region by itself exceeds the maximum memory allocation size, TransferTask::_transferData() ensures that the region is placed in its own staging buffer, so that it is more likely to succeed on drivers that are less strict about memory requests exceeding the specified physical device limit. A sketch of the strategy is shown below.
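A minimal sketch of the chunking strategy; Region, Chunk, and assignRegionsToChunks are invented stand-ins for the aligned copy regions TransferTask actually tracks, and alignment padding is omitted for brevity:

#include <cstdint>
#include <vector>

struct Region { uint64_t size = 0; };
struct Chunk  { std::vector<Region> regions; uint64_t totalSize = 0; };

std::vector<Chunk> assignRegionsToChunks(const std::vector<Region>& regions,
                                         uint64_t maxMemoryAllocationSize)
{
    std::vector<Chunk> chunks;
    for (const auto& region : regions)
    {
        // Start a new staging buffer when adding this region would push the
        // current one past the limit. A region that is itself larger than the
        // limit ends up alone in its own chunk, so the oversized request is
        // isolated to a single allocation that may still succeed on lenient drivers.
        if (chunks.empty() || chunks.back().totalSize + region.size > maxMemoryAllocationSize)
        {
            chunks.emplace_back();
        }
        chunks.back().regions.push_back(region);
        chunks.back().totalSize += region.size;
    }
    return chunks;
}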

Fixes #

Failure to transfer data to the GPU, particularly on Windows platforms with NVidia cards. Even though each of our buffers that needed copying to the GPU was under the maximum memory allocation limit, collectively they were over the limit, so when TransferTask::_transferData() called vsg::createBufferAndMemory() to create an enormous staging buffer, the allocation failed.

Type of change

  • [ X ] Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I tested this extensively with small fabricated maximums to verify the edge conditions. All existing tests and example applications work as before.

Checklist:

  • [ X ] My code follows the style guidelines of this project
  • [ X ] I have performed a self-review of my own code
  • [ X ] I have commented my code, particularly in hard-to-understand areas
  • [ X ] My changes generate no new warnings

Michael Saunders added 2 commits May 21, 2025 15:46
… physical device property limit.

If multiple regions to transfer collectively exceed the maximum memory
allocation size, TransferTask::_transferData() creates a new staging
buffer for the overflowing region and any subsequent regions. If a
single region exceeds the maximum memory allocation size,
TransferTask::_transferData() ensures that the region is in its own
staging buffer so that it is more likely to succeed on drivers that are
less strict about memory requests exceeding the specified physical
device limit. Windows NVidia drivers tend to be very strict about
keeping allocations at or below the stated maximum memory allocation
size, whereas Linux is more forgiving.
@robertosfield (Collaborator)

I understand the limit issue in principle, but the changes to TransferTask are really extensive so it's not something I can easily review, and as it's a critical part of the VSG it's not something I'll risk just merging. It'll require a significant effort to come up with test cases that currently work and ones that break, compare them against this PR, and step through all the code to make sure there are no issues and that all the tests run reliably.

If you have test cases could you share these? Adding some tests to vsgExamples/tests will be a good first step.

Once I understand how things are breaking I can think through how the code should evolve to tackle this issue. It might be in the form of something like this PR, or it might be best tackled with another approach.

I do have some major work I'm tackling right now, so I really need to focus on that before I start looking at other complex topics; this might take me till the end of the month.

@rms7326 (Author) commented May 22, 2025

I can certainly come up with a VSG example that would demonstrate this issue, but the failure is very machine- and driver-dependent and would require a significant amount of data delivered to the GPU. Building a vsgExamples/tests unit test would be significantly more challenging, as the behaviors are internal to the TransferTask and mocking Vulkan would be very difficult.

Would you consider, perhaps as a first step, altering the way the TransferTask is instantiated? Currently RecordAndSubmitTask::RecordAndSubmitTask() constructs a transfer task directly. If instead a function object responsible for instantiating the TransferTask were injected into the RecordAndSubmitTask constructor via Viewer::assignRecordAndSubmitTaskAndPresentation(), we could provide our own version. The default behavior would be to instantiate a vsg::TransferTask, therefore not requiring any client code changes; see the sketch below.
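A hypothetical sketch of what that injection point could look like; neither TransferTaskFactory nor the constructor parameter exists in VSG today, and the create() signature mirrors the current TransferTask constructor as I read it:

#include <functional>
#include <vsg/all.h>

using TransferTaskFactory =
    std::function<vsg::ref_ptr<vsg::TransferTask>(vsg::ref_ptr<vsg::Device>, uint32_t)>;

// The default factory would preserve today's behavior, so existing client
// code would not need to change:
TransferTaskFactory defaultTransferTaskFactory =
    [](vsg::ref_ptr<vsg::Device> device, uint32_t numBuffers)
    {
        return vsg::TransferTask::create(device, numBuffers);
    };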

@robertosfield (Collaborator)

I'd rather not start adding callback interfaces at a fine-grained level for setting up scene graph and view elements, as this can become a rather open-ended task when folks wish to customize the behavior of different elements in different ways.

The Viewer::assignRecordAndSubmitTaskAndPresentation() method is a convenience function for setting up the internals, but it should be possible to set up the RecordAndSubmitTasks yourself and assign whatever combination of TransferTasks/Windows/etc. is ideal for your application. So perhaps this might be the route for a test program.
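In rough outline, perhaps something like this (an untested sketch; member names such as earlyTransferTask and recordAndSubmitTasks are from my reading of the current headers and may differ between VSG versions):

auto task = vsg::RecordAndSubmitTask::create(device, numFramesInFlight);
task->commandGraphs.push_back(commandGraph);
task->earlyTransferTask = vsg::TransferTask::create(device, numFramesInFlight); // or a custom subclass
viewer->recordAndSubmitTasks.push_back(task);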

I'd rather TransferTasks worked better out of the box, though; needing to delve into internals is something I'd prefer most users not to have to do. Scene graphs built on top of Vulkan aren't something we have prior art for, so it's a case of feeling our way as different folks push the scene graph into different domains and to handle different amounts of data.

@rms7326 (Author) commented May 22, 2025

I too would like to see TransferTask work better out of the box. I'll come up with a simple vsgExample that creates varying numbers of meshes of varying sizes, controlled by several command line options, so that you can reproduce this allocation failure yourself on Windows machines with NVidia GPUs.

@rms7326 (Author) commented May 22, 2025

Robert, it looks like my coworker @theodoregoetz has already developed a suitable test, vsgExamples/tests/vsgtriangles, that can be used to demonstrate this issue. Run with the following arguments on a Windows machine with an 8GB VRAM NVidia card, it creates ten 384MB device buffers, from which TransferTask creates one host visible staging buffer of 3.84GB (just under the 4GB limit), and it will succeed:

vsgtriangles -c 10 -n 200 200 1000

and with the following arguments it creates ten 576MB device buffers, from which TransferTask creates one host visible staging buffer of 5.76GB (just over the 4GB limit), and it will fail:

vsgtriangles -c 10 -n 200 200 1500

Using my branch the second test succeeds because TransferTask creates two staging buffers: one just under the 4GB limit, and another containing the remaining 1.5GB of data that didn't fit within the first staging buffer.

Note that on Linux the same NVidia card allows host visible staging buffers to be allocated larger than the specified maximum size.
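For reference, the sizes quoted above follow directly from the triangle counts if the vsgtriangles vertex layout works out to 32 bytes per vertex (an assumption on my part, but one consistent with the scene size the test itself prints): 200 x 200 x 1500 = 60,000,000 triangles, i.e. 180,000,000 vertices; at 32 bytes per vertex that is 5.76GB in total (5493.16 MiB), or 576MB per mesh across the 10 meshes.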

@robertosfield (Collaborator)

Excellent, thanks for the suggestion. It fails for me under Linux as well. I have a 6GB GeForce 2060.

$ vsgtriangles -c 10 -n 200 200 1500
number of triangles: 60000000
number of vertices:  180000000
size of scene: 5493.16 MB
vsgtriangles 
[Exception] - Error: Failed to allocate DeviceMemory. result = -2

At this point I know what part of the setup fails on my system, but armed with this test case I now have the ability to investigate further.

Does the above test pass on your system with your PR?

As a general comment, there will always be a limit to just how much we can allocate on our systems, so testing for it and building scene graphs of appropriate complexity will always be required; on the VSG side we'll need to make this process fail as gracefully as feasible when things are pushed too far.

@rms7326 (Author) commented May 22, 2025

Robert, I am able to run the vsgtriangles example on my Linux machine because I have 8GB of VRAM, which is plenty large enough to accommodate the ten 576MB device buffers. I suspect that your device buffer memory allocation is failing because you just don't have enough VRAM, and it hasn't even gotten to the transfer stage yet.

You could try:

vsgtriangles -c 10 -n 200 200 1200

That may give you just enough headroom to fit the device buffers in your 6GB of VRAM and still produce a host visible staging buffer of 4.6GB, which will succeed on Linux and fail on Windows using the master branch, and succeed on both Linux and Windows using the transfer-task-allocation branch.

@rms7326 (Author) commented May 22, 2025

By the way, as a tangential comment, you can inspect the pageable device local memory feature and enable memory paging, which will offload VRAM to RAM in extreme cases:

// query if the physical device supports pageable memory and enable if so
auto pageableMemory{physicalDevice.getFeatures<VkPhysicalDevicePageableDeviceLocalMemoryFeaturesEXT,
    VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PAGEABLE_DEVICE_LOCAL_MEMORY_FEATURES_EXT>()};
if (pageableMemory.pageableDeviceLocalMemory) {
    requestedDeviceFeatures.get<VkPhysicalDevicePageableDeviceLocalMemoryFeaturesEXT,
        VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PAGEABLE_DEVICE_LOCAL_MEMORY_FEATURES_EXT>().pageableDeviceLocalMemory = VK_TRUE;
}

This doesn't solve the TransferTask issue of requesting host visible memory allocations that are larger than permitted, but it does fix your second issue of allocating more memory than is available in VRAM. (Note that the VK_EXT_pageable_device_local_memory device extension, which in turn requires VK_EXT_memory_priority, must also be enabled for the feature to be usable.)

@robertosfield (Collaborator)

I've done a couple of quick code reviews of this PR but have yet to understand it. There are a lot of changes, and it's not easy to read what is being done and why. This is not straightforward code to begin with, and as it's so fundamental to VSG usage it really has to work reliably, so any changes need to be thoroughly checked.

Could you explain the changes from a high level, so I can review them with knowledge of your intent?

It may be that once I properly understand the issue and the proposed solution I can come up with a cleaner and less intrusive set of changes, or that the changes are perfect and can be merged as is, or something in between. At this point, though, I'm nowhere near having the confidence to merge the changes.

@rms7326 (Author) commented Jun 18, 2025

Robert, sorry for the delayed response. I was helping out my family during a crisis.

The gist of the change is to ensure that no single memory block allocation exceeds the Vulkan maxMemoryAllocationSize found in the PhysicalDevice properties (VkPhysicalDeviceMaintenance3Properties). The original version of TransferTask attempted to compute the total size required to contain all of the address-aligned image and buffer data. The maximum block size is typically around 4GB. Linux drivers for both AMD and NVidia appear to accept block sizes larger than the advertised limit without failure, but on Windows the NVidia drivers refuse, causing vsg::createBufferAndMemory() to throw an exception (actually the underlying vsg::DeviceMemory::create() throws because vkAllocateMemory() returns a failure code).
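For reference, a minimal sketch of querying that limit with plain Vulkan; I believe vsg::PhysicalDevice also offers a templated getProperties<> accessor in the same style as the getFeatures<> call shown earlier in this thread, if you prefer the wrapped form:

// 'physicalDevice' here is the raw VkPhysicalDevice handle.
VkPhysicalDeviceMaintenance3Properties maintenance3{};
maintenance3.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MAINTENANCE_3_PROPERTIES;

VkPhysicalDeviceProperties2 properties2{};
properties2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
properties2.pNext = &maintenance3;

vkGetPhysicalDeviceProperties2(physicalDevice, &properties2);
VkDeviceSize maxAllocation = maintenance3.maxMemoryAllocationSize; // typically around 4GB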

My changes to TransferTask attempt to preserve your original logic as much as possible, with the addition of keeping track of the memory block sizes. If adding an image or buffer to the "current" block would exceed the maximum block size, a new set of offsets is generated for a new block. If a single image or buffer is itself bigger than the maximum, the request is made anyway and will fail (or succeed) just as it does today, depending on the platform and drivers. I added fairly extensive comments noting this detail in the code. As far as reliability goes, we have been using these changes in our product with many varying sizes of data, with success.
