
Conversation

raphaelthegreat
Contributor

  • Removed the per-draw memory barrier. The cache already emits a post barrier on every buffer upload and shaders cannot write to DMA buffers, so I don't see any point in keeping it.
  • Removed the fillBuffer commands on buffer creation and fault buffer processing. In the first case the buffer is going to be fully copied into anyway, so zero-filling it just wastes GPU resources.
  • Switched fault buffer marking to an atomic operation. Waves of the same draw/dispatch can easily race when accessing the same word, and so can pipelined draws/dispatches, as there is no per-draw fault buffer barrier. Using an atomic ensures proper synchronization in both cases (see the sketch after this list).
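For illustration only, here is a minimal host-side C++ sketch of the lost-update race the atomic avoids; the names are hypothetical and the real marking happens in the generated shader code via OpAtomicOr, not on the CPU:

#include <atomic>
#include <cstdint>

// Hypothetical fault-buffer word covering 32 pages.
std::atomic<uint32_t> fault_word{0};

// Non-atomic read-modify-write: two waves marking different pages in the same
// word can both read the old value, and one of the ORs is then lost.
void mark_page_racy(uint32_t page_mask) {
    const uint32_t old = fault_word.load(std::memory_order_relaxed);
    fault_word.store(old | page_mask, std::memory_order_relaxed);
}

// Atomic OR: concurrent updates are merged regardless of interleaving, which
// mirrors what the emitted OpAtomicOr guarantees on the GPU.
void mark_page_atomic(uint32_t page_mask) {
    fault_word.fetch_or(page_mask, std::memory_order_relaxed);
}

int main() {
    mark_page_atomic(1u << 5);  // mark hypothetical page 5 in this word
}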

tagging @LNDF to review this

@georgemoralis requested a review from LNDF July 17, 2025 13:13
@LNDF
Member

LNDF commented Jul 17, 2025

Thanks for making this PR. I had this pending to do, but I'm currently very busy with other things in life, so I have little time.

I will check it in a few hours, when I get home.

@@ -488,7 +488,6 @@ bool Rasterizer::BindResources(const Pipeline* pipeline) {
range.upper() - range.lower());
}
}
buffer_cache.MemoryBarrier();
Member


Maybe I'm wrong on this one, but since accessing buffer content through DMA doesn't bind those buffers, how does the driver know how to synchronize accesses even if there is a barrier? Does this happen automatically when an access through a buffer device address is detected?

Contributor Author

@raphaelthegreat Jul 17, 2025


Hmm, that is an interesting question. I don't believe the process of binding buffers involves the driver doing any "smart" tracking; rather, the pipeline barrier command itself just emits sync packets that perform cache flushes or wait for specific parts of the pipeline to finish. I can check radv to make sure.
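For reference, a minimal sketch (assuming the plain Vulkan C API rather than the wrappers shadPS4 actually uses) of what such a global barrier looks like; it names no buffer at all, which is why it works the same whether memory is reached through a binding or through a device address:

#include <vulkan/vulkan.h>

// Hypothetical helper: make prior transfer writes visible to later shader
// reads with a global memory barrier that references no specific buffer.
void EmitGlobalMemoryBarrier(VkCommandBuffer cmd) {
    VkMemoryBarrier barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,      // producer stage
                         VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,  // consumer stages
                         0,             // dependency flags
                         1, &barrier,   // one global memory barrier
                         0, nullptr,    // no buffer memory barriers
                         0, nullptr);   // no image memory barriers
}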

Contributor Author

@raphaelthegreat Jul 18, 2025


It looks like radv ignores the buffer memory range: https://github.com/chaotic-cx/mesa-mirror/blob/main/src/amd/vulkan/radv_cmd_buffer.c#L13475. In general, I believe the whole buffer device address feature wouldn't be very useful if drivers relied on bindings to do synchronization.
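To illustrate why bindings can't carry the synchronization here, this is roughly how a device address reaches a shader (a sketch assuming core VK_KHR_buffer_device_address, a buffer created with the device-address usage flag, and a hypothetical push-constant layout); once queried, the address is just an opaque 64-bit value the driver cannot track per access:

#include <vulkan/vulkan.h>

// Hypothetical push-constant block carrying the raw address to the shader.
struct PushConstants {
    VkDeviceAddress dma_base; // 64-bit GPU virtual address
};

VkDeviceAddress QueryBufferAddress(VkDevice device, VkBuffer buffer) {
    VkBufferDeviceAddressInfo info{};
    info.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
    info.buffer = buffer;
    // The result is an opaque 64-bit address; shader accesses made through it
    // are invisible to descriptor/binding tracking in the driver.
    return vkGetBufferDeviceAddress(device, &info);
}

void PushDmaAddress(VkCommandBuffer cmd, VkPipelineLayout layout,
                    VkDevice device, VkBuffer buffer) {
    const PushConstants pc{QueryBufferAddress(device, buffer)};
    vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_COMPUTE_BIT,
                       0, sizeof(pc), &pc);
}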

@georgemoralis
Collaborator

Updated the branch so we can have CI and get it tested.

@squidbus
Collaborator

@LNDF Are there any remaining issues with this from your view?

@LNDF
Member

LNDF commented Jul 26, 2025

> @LNDF Are there any remaining issues with this from your view?

LGTM. Maybe do some testing to see if anything regresses.

@georgemoralis
Collaborator

> @LNDF Are there any remaining issues with this from your view?
>
> LGTM. Maybe do some testing to see if anything regresses.

You know very well that without merging, testing is quite limited.

@StevenMiller123
Collaborator

I can test on the main DMA titles in a moment here.

@LNDF
Member

LNDF commented Jul 29, 2025

> @LNDF Are there any remaining issues with this from your view?
>
> LGTM. Maybe do some testing to see if anything regresses.
>
> You know very well that without merging, testing is quite limited.

Then it's OK imo.

@StevenMiller123
Collaborator

Seems like this PR slightly reduces DMA performance over main. No graphical regressions from what I can tell though.

Main:
[screenshot: frame rate on main]

PR:
[screenshot: frame rate on this PR]

@georgemoralis
Collaborator

I thought the title said optimizations ;/

@raphaelthegreat
Contributor Author

raphaelthegreat commented Jul 29, 2025

Can you test whether this is due to the atomic operation by reverting that specific change? There isn't any other change I can think of that would regress performance. That would mean changing:

OpAtomicOr(U32[1], fault_ptr, ConstU32(u32(spv::Scope::Device)), u32_zero_value, page_mask);

to

const auto fault_value{OpLoad(U32[1], fault_ptr)};
const auto fault_value_masked{OpBitwiseOr(U32[1], fault_value, page_mask)};
OpStore(fault_ptr, fault_value_masked);

@StevenMiller123
Collaborator

Reverting the atomic operation does improve performance slightly, by around 2 frames per second.

@raphaelthegreat
Contributor Author

This is interesting, as the fault buffer should only be written for uncached memory, so the game must be hitting uncached memory constantly to stay at consistently lower performance. With the current design the atomic is technically necessary for synchronization, but it looks like things have been fine without it, so it could be removed to avoid the performance loss.

@LNDF
Member

LNDF commented Jul 30, 2025

> This is interesting, as the fault buffer should only be written for uncached memory, so the game must be hitting uncached memory constantly to stay at consistently lower performance. With the current design the atomic is technically necessary for synchronization, but it looks like things have been fine without it, so it could be removed to avoid the performance loss.

Yes. The game appears to use uncached memory pages every frame. It is like that for a while until it "gets stable" and stops doing that. Maybe it reserves a memory area for some GPU operations?

Also note that the other day @StevenMiller123 noticed that the game sometimes tries to access the eboot code region (I don't think it is supposed to do that). It only happens sometimes, and it has a higher chance of happening when the game is running faster (maybe a race condition?).
