Skip to content

Conversation

@dvijayak
Copy link

@dvijayak dvijayak commented Apr 2, 2024

This PR introduces the ability for users of libmatoya to query for errors and handle them accordingly. Currently, only GPU errors are being reported.

The proof-of-concept is done. What is pending is to clean up the implementation.

Improves on the way GPU errors are handled while drawing quads or UI. Matoya now handles these cases gracefully and resets the pipeline seamlessly in these scenarios.

Additionally, a few fixes were made as they were discovered while implementing error recovery.

WINDOWS-ONLY for now d3d11 and d3d12

Work In-Scope or Completed

  1. Changes to the platform-specific gfx modules that add better recovery in the event of fatal GPU errors.
  2. Improvement to MTY_WindowSetUITexture. Now the existing texture for the given id is only deleted if the new texture was successfully created. (Discovered this bug that was causing log-spam when GPU was reset while the app window was minimized to tray).
  3. Updated MTY_WindowGetContextState to consider device resets as deeming the window's graphics context state as MTY_CONTEXT_STATE_NEW so that matoya clients will know to resubmit textures when GPU errors/resets take place.

@dvijayak dvijayak added the 1. Coding Feature has been described inside of the parsec branch, and you are coding on it. label Apr 2, 2024
dvijayak added 2 commits April 3, 2024 08:13
… thread and not a window. Defined error codes so that matoya clients can action differently based on the error received while still masking the internal error details.
@dvijayak dvijayak force-pushed the gpu-error-handler branch from 3e74bc7 to c014061 Compare April 3, 2024 12:42
@bmcnett
Copy link

bmcnett commented Apr 5, 2024

it occurs to me that maybe there is something much simpler that solves the problem at hand: instead of a queue of MTY_Error, you may really want a bitmask, with one bit for each possible MTY_Error. if there are 64 or fewer error types, this fits into a uint64_t. it can be tested by ANDing the uint64_t with (1ULL << error_code), etc. this is because you probably don't care exactly what order the errors occurred in, but you probably do care about which combination of 2+ errors occurred.

@bmcnett bmcnett added Low priority Feature review can be safely postponed indefinitely. and removed 1. Coding Feature has been described inside of the parsec branch, and you are coding on it. labels Apr 30, 2024
@dvijayak dvijayak added the 2. Review requested You have coded your code, and can now be reviewed. label May 1, 2024
@dvijayak dvijayak force-pushed the gpu-error-handler branch from d2450da to b4f1412 Compare May 3, 2024 18:40
@dvijayak dvijayak changed the title Robust Error-Handling for Matoya Clients Robust GPU error recovery May 8, 2024
@dvijayak dvijayak changed the title Robust GPU error recovery Robust GPU error recovery [Windows] May 8, 2024
Comment on lines 97 to 108
if (!cmn->gfx_ui)
cmn->gfx_ui = GFX_UI_API[cmn->api].create(device);

if (!cmn->ui_textures)
if (!cmn->gfx_ui) {
MTY_HashDestroy(&cmn->ui_textures, NULL);

} else if (!cmn->ui_textures) {
cmn->ui_textures = MTY_HashCreate(0);
}

return cmn->gfx_ui != NULL;
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the gfx_ui state could not be created from the given device, then we want to clean out the hash of textures.

Comment on lines 220 to +222
MTY_Device *device = gfx_ctx_get_device(cmn);
if (!device)
return false;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid a bunch of error scenarios by not allowing the function to proceed if the graphics context doesn't have a good device handle. This scenario occurs when the GPU resets.

Comment on lines 225 to 245

if (r) {
void *texture = MTY_HashPopInt(cmn->ui_textures, id);
GFX_UI_API[cmn->api].destroy_texture(cmn->gfx_ui, &texture, device);
void *texture = MTY_HashGetInt(cmn->ui_textures, id);

if (rgba) {
void *texture_new = GFX_UI_API[cmn->api].create_texture(cmn->gfx_ui, device, rgba, width, height);

if (rgba)
texture = GFX_UI_API[cmn->api].create_texture(cmn->gfx_ui, device, rgba, width, height);
if (texture_new) {
GFX_UI_API[cmn->api].destroy_texture(cmn->gfx_ui, &texture, device);

if (texture)
MTY_HashSetInt(cmn->ui_textures, id, texture);
texture = texture_new;
MTY_HashSetInt(cmn->ui_textures, id, texture_new);
}

} else {
GFX_UI_API[cmn->api].destroy_texture(cmn->gfx_ui, &texture, device);
MTY_HashPopInt(cmn->ui_textures, id);
}

r = texture != NULL;
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old logic is to first destroy the existing texture for id and get rid of it from the textures hash, and then carry on with creating the new texture with the given pixel data rgba if it was provide, finally updating the hash node for this id with the new texture handle.

The problem with this is that when create_texture fails and/or returns a null texture handle, the existing perfectly-fine texture is removed. This failure can occur when the GPU resets, and because the entry is removed from the hash indiscriminately, MTY_WindowHasUITexture thereafter always returns false for the given id, which is incorrect behavior.

My fix is to replace the old/existing texture with the new one IFF the new texture was successfully created. After this change, even when the GPU resets, textures are rendered properly because the hash still maintains a reference to the original valid handle.

Comment on lines 2063 to 2074
MTY_ContextState MTY_WindowGetContextState(MTY_App *app, MTY_Window window)
{
return MTY_CONTEXT_STATE_NORMAL;
MTY_ContextState state = MTY_CONTEXT_STATE_NORMAL;

struct window_common *cmn = mty_window_get_common(app, window);
if (cmn && cmn->api != MTY_GFX_NONE && cmn->device_changed) {
state = MTY_CONTEXT_STATE_NEW;
cmn->device_changed = false;
}

return state;
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the device has changed or been reset, it is a mistake not to inform clients of matoya that textures need to be resubmitted.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes in this file implement a more robust GPU error handling. This portion concerns the main graphics context state.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes in this file implement a more robust GPU error handling. This portion concerns state pertaining to rendering UI.

Copy link

@Kodikuu Kodikuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A high-level-ish review given that I don't actually have experience dealing with DirectX

Comment on lines +232 to +233
if (texture_new) {
GFX_UI_API[cmn->api].destroy_texture(cmn->gfx_ui, &texture, device);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check if texture is NULL before destroying it. An up-front check is ideal, even if the individual APIs may check too

Comment on lines +210 to +213
if (r)
MTY_Log("d3d11_ctx reinit ok.");
else
MTY_Log("d3d11_ctx reinit FAILED!");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No braces, and I'm not certain we need to log two times every time


int32_t mty_d3d11_error_handler_default(int32_t e1, int32_t e2, void *opaque)
{
e1; e2; opaque;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually put each of these on a different line

@Kodikuu Kodikuu added 1. Coding Feature has been described inside of the parsec branch, and you are coding on it. and removed 2. Review requested You have coded your code, and can now be reviewed. labels Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

1. Coding Feature has been described inside of the parsec branch, and you are coding on it. Low priority Feature review can be safely postponed indefinitely.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants