-
-
Notifications
You must be signed in to change notification settings - Fork 349
fix(session-replay): improve multi-threading of session replay processing #5018
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Performance metrics 🚀
|
Revision | Plain | With Sentry | Diff |
---|---|---|---|
c6504da | 1232.06 ms | 1243.28 ms | 11.22 ms |
c6b920a | 1239.53 ms | 1256.53 ms | 17.00 ms |
318a891 | 1245.14 ms | 1259.02 ms | 13.88 ms |
5e769dd | 1216.24 ms | 1245.74 ms | 29.50 ms |
7cd187e | 1243.04 ms | 1244.79 ms | 1.75 ms |
4329cdb | 1206.96 ms | 1227.74 ms | 20.78 ms |
6813f7c | 1232.23 ms | 1251.47 ms | 19.24 ms |
4350d44 | 1228.75 ms | 1246.75 ms | 18.00 ms |
a176fc4 | 1226.24 ms | 1247.50 ms | 21.26 ms |
25bcc50 | 1237.69 ms | 1258.40 ms | 20.71 ms |
App size
Revision | Plain | With Sentry | Diff |
---|---|---|---|
c6504da | 20.76 KiB | 414.44 KiB | 393.69 KiB |
c6b920a | 21.58 KiB | 631.19 KiB | 609.61 KiB |
318a891 | 22.30 KiB | 750.02 KiB | 727.72 KiB |
5e769dd | 21.58 KiB | 572.21 KiB | 550.62 KiB |
7cd187e | 20.76 KiB | 401.65 KiB | 380.89 KiB |
4329cdb | 22.30 KiB | 833.10 KiB | 810.80 KiB |
6813f7c | 21.58 KiB | 614.86 KiB | 593.28 KiB |
4350d44 | 21.58 KiB | 629.82 KiB | 608.24 KiB |
a176fc4 | 22.84 KiB | 403.24 KiB | 380.39 KiB |
25bcc50 | 20.76 KiB | 427.23 KiB | 406.46 KiB |
Previous results on branch: philprime/session-replay-inversion-improvements
Startup times
Revision | Plain | With Sentry | Diff |
---|---|---|---|
9375ae6 | 1232.45 ms | 1244.33 ms | 11.88 ms |
3fc8649 | 1216.90 ms | 1228.31 ms | 11.41 ms |
57ce5ac | 1207.40 ms | 1227.88 ms | 20.48 ms |
8a0a01e | 1230.29 ms | 1241.67 ms | 11.38 ms |
64adbb3 | 1236.60 ms | 1241.93 ms | 5.33 ms |
App size
Revision | Plain | With Sentry | Diff |
---|---|---|---|
9375ae6 | 22.30 KiB | 850.53 KiB | 828.23 KiB |
3fc8649 | 22.30 KiB | 850.86 KiB | 828.55 KiB |
57ce5ac | 22.30 KiB | 850.79 KiB | 828.48 KiB |
8a0a01e | 22.30 KiB | 836.54 KiB | 814.24 KiB |
64adbb3 | 22.30 KiB | 852.79 KiB | 830.48 KiB |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5018 +/- ##
=============================================
+ Coverage 92.688% 92.780% +0.091%
=============================================
Files 677 678 +1
Lines 84405 84600 +195
Branches 29608 30763 +1155
=============================================
+ Hits 78234 78492 +258
+ Misses 6073 6008 -65
- Partials 98 100 +2
... and 29 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
…lay-inversion-improvements
This PR grew into a a rather large refactor, I'll create additional PRs to split the changes into smaller digestible parts |
…lay-inversion-improvements
Sources/Swift/Integrations/SessionReplay/SentryOnDemandReplay.swift
Outdated
Show resolved
Hide resolved
Sources/Swift/Integrations/SessionReplay/SentryOnDemandReplay.swift
Outdated
Show resolved
Hide resolved
Sources/Swift/Integrations/SessionReplay/SentryOnDemandReplay.swift
Outdated
Show resolved
Hide resolved
I'm unsure if we're choosing the right approach here by adding two more DispatchQueues. If I'm not mistaken, the root cause is using the default DispatchQueue of the sentry-cocoa/Sources/Swift/Integrations/SessionReplay/SentryOnDemandReplay.swift Lines 169 to 170 in 38fc771
sentry-cocoa/Sources/Swift/Integrations/SessionReplay/SentryOnDemandReplay.swift Lines 147 to 206 in 38fc771
I don't think we need to call sentry-cocoa/Sources/Sentry/SentrySessionReplayIntegration.m Lines 231 to 244 in 44c4af7
Instead, I think we need to use one extra dedicated queue for video processing in |
I'll try to go into your point, but yes I'll contact you regarding a call to discuss this in detail.
Yes and no. The actual problem causing the warning is this line... sentry-cocoa/Sources/Swift/Integrations/SessionReplay/SentryOnDemandReplay.swift Line 202 in 38fc771
... as it was called from a queue which was created here: sentry-cocoa/Sources/Sentry/SentrySessionReplayIntegration.m Lines 333 to 337 in 38fc771
This serial queue with priority low was still somehow on a higher priority than the one used by the asset writer in This sentry-cocoa/Sources/Sentry/SentryDispatchQueueWrapper.m Lines 112 to 113 in 44c4af7
This change alone might already fix the Xcode warning, but this PR is in general to improve multi-threading to ensure that thread inversion can not happen.
We are not adding two more queues, we are only adding the The sentry-cocoa/Sources/Sentry/SentrySessionReplayIntegration.m Lines 333 to 337 in 38fc771
But instead of using it in the sentry-cocoa/Sources/Swift/Integrations/SessionReplay/SentrySessionReplay.swift Lines 226 to 242 in 38fc771
I moved the background dispatching into the The processing queue is therefore now moved to the sentry-cocoa/Sources/Sentry/SentrySessionReplayIntegration.m Lines 134 to 139 in 44c4af7
That's what the newly added sentry-cocoa/Sources/Sentry/SentrySessionReplayIntegration.m Lines 129 to 133 in 44c4af7
As mentioned above, the ... and we need to loop through multiple frames and segments, while waiting for the The reason why If we go back to the previous implementation of sentry-cocoa/Sources/Swift/Integrations/SessionReplay/SentryOnDemandReplay.swift Lines 164 to 171 in 38fc771
... that would be the same logic as if we do the synchronization This is therefore also a shift of synchronization logic, but I don't understand why Finally the |
I found this PR comment and #4439 which I'll need to go through in detail again because the comments do include additional information on the choices made. I'll try to summarize the reasonings and add more comments to the code itself, so it is documented for future maintenance. |
…lay-inversion-improvements
…lay-inversion-improvements
…lay-inversion-improvements
…lay-inversion-improvements
I also added more logging where relevant to debug issues |
…lay-inversion-improvements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going in the right direction. I have a few questions on error handling and testing. Apart from that it already looks way better than before.
### Fixes | ||
|
||
- Fix thread inversion warning in session replay (#5018) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m
: There is already a fixes section for the unreleased section
_replayAssetWorkerQueue = [SentryDispatchQueueWrapper | ||
createBackgroundDispatchQueueWithName:"io.sentry.session-replay.asset-worker" | ||
relativePriority:-1]; | ||
// The dispatch queue is used to asynchronously wait for the asset worker queue to finish its | ||
// work. To avoid a deadlock, the priority of the processing queue must be lower than the asset | ||
// worker queue. Use a relative priority of -2 to make it lower than the asset worker queue. | ||
_replayProcessingQueue = [SentryDispatchQueueWrapper | ||
createBackgroundDispatchQueueWithName:"io.sentry.session-replay.processing" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h
: The comment mentions that setting the wrong priorities here can lead to a deadlock. I think it's essential to have a test fail when using the wrong priorities to avoid regressions. We could assert that the SentryDispatchQueueWrapper
is initialized with the correct attributes, which tests implementation details, and has the downside of having to change the test when the implementation details change. We could also have a test causing a deadlock when interacting with this class when the priorities are wrongly set, which could be a bit harder to achieve, but has the advantage of not testing implementation details.
// It is imporant that the renderVideo completion block signals the group. | ||
// The queue used by render video must have a higher priority than the processing queue to reduce thread inversion. | ||
// Otherwise, it could lead to queue starvation and a deadlock/timeout. | ||
guard group.wait(timeout: .now() + 120) == .success else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m
: 120 seems a bit high to me. Shouldn't 30 be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out! 🙏
I had to set this higher because it was set to 2 seconds, and when debugging the timeout would always hit, but forgot to revert it afterwards.
// The asset worker queue is used to work on video and frames data. | ||
// Use a relative priority of -1 to make it lower than the default background priority. | ||
_replayAssetWorkerQueue = [SentryDispatchQueueWrapper | ||
createBackgroundDispatchQueueWithName:"io.sentry.session-replay.asset-worker" | ||
relativePriority:-1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h
: That's so easy to break. If we ever change the priority of the default background priority here, we could end up with the same problem again. And when I change the priority in the SentryDispatchQueueWrapper
how should I know that I I need to change it here too?
sentry-cocoa/Sources/Sentry/SentryDispatchQueueWrapper.m
Lines 10 to 13 in 8856e5a
// DISPATCH_QUEUE_SERIAL_WITH_AUTORELEASE_POOL is requires iOS 10. Since we are targeting | |
// iOS 9 we need to manually add the autoreleasepool. | |
dispatch_queue_attr_t attributes = dispatch_queue_attr_make_with_qos_class( | |
DISPATCH_QUEUE_SERIAL, DISPATCH_QUEUE_PRIORITY_DEFAULT, 0); |
A test should fail if these priorities don't match our requirements here. If you need ideas for testing this properly, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would highly appreciate an example how you would test the priorities of queues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a factory method that I can swap out in the tests or something like that. Then I can assert that I call the factory with the correct parameters.
guard group.wait(timeout: .now() + 120) == .success else { | ||
SentryLog.error("[Session Replay] Timeout while waiting for video rendering to finish.") | ||
throw SentryOnDemandReplayError.errorRenderingVideo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m
: If we timeout, we could still return the videos we have and send what we have instead of discarding everything.
if let error = currentError { | ||
SentryLog.error("[Session Replay] Error while rendering video: \(error), cancelling video segment creation") | ||
throw error | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m
: Same here, we could still return the videos we have and try to send them instead of throwing away all of them.
// It is imporant that the renderVideo completion block signals the group. | ||
// The queue used by render video must have a higher priority than the processing queue to reduce thread inversion. | ||
// Otherwise, it could lead to queue starvation and a deadlock/timeout. | ||
guard group.wait(timeout: .now() + 120) == .success else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m
: What drove you to add the DispatchGroup inside the while frameCount < videoFrames.count {
and not outside? Now we have a timeout of 120 for every loop iteration, which can be a very long time. Shouldn't we have one timeout for the whole process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout is wrong as mentioned in the other comment. It was at 2 seconds before, which is also a very tight window to do video conversion. I will reconsider a longer timeout for the entire process instead.
📜 Description
💡 Motivation and Context
Some of the methods of the session replay video processing are dispatched to a low priority background queue, which then dispatches additional AV work on another dispatch queue.
The current implementation does not respect queue priorities and can cause thread starvation, because the processing queue and the work queue are either on the same priority, or the work queue is on a lower priority.
In the worst case, the processing queue is blocked and waits for the work queue, but the work queue has a lower priority than the processing queue and is awaiting execution on the main thread.
This is what the thread inversion warning is showing.
Closes #4730
💚 How did you test it?
Manual testing.
📝 Checklist
You have to check all boxes before merging:
sendDefaultPII
is enabled.