Skip to content

fix(session-replay): improve multi-threading of session replay processing #5018

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

philprime
Copy link
Contributor

@philprime philprime commented Mar 25, 2025

📜 Description

  • Changes the priorities of the background queues of session replay.
  • Refactors the async dispatch handling of session replay.

💡 Motivation and Context

Some of the methods of the session replay video processing are dispatched to a low priority background queue, which then dispatches additional AV work on another dispatch queue.

The current implementation does not respect queue priorities and can cause thread starvation, because the processing queue and the work queue are either on the same priority, or the work queue is on a lower priority.

In the worst case, the processing queue is blocked and waits for the work queue, but the work queue has a lower priority than the processing queue and is awaiting execution on the main thread.

This is what the thread inversion warning is showing.

Closes #4730

💚 How did you test it?

Manual testing.

📝 Checklist

You have to check all boxes before merging:

  • I added tests to verify the changes.
  • No new PII added or SDK only sends newly added PII if sendDefaultPII is enabled.
  • I updated the docs if needed.
  • I updated the wizard if needed.
  • Review from the native team if needed.
  • No breaking change or entry added to the changelog.
  • No breaking change for hybrid SDKs or communicated to hybrid SDKs.

Copy link
Contributor

github-actions bot commented Mar 25, 2025

Performance metrics 🚀

  Plain With Sentry Diff
Startup time 1224.19 ms 1243.69 ms 19.50 ms
Size 22.30 KiB 854.25 KiB 831.95 KiB

Baseline results on branch: main

Startup times

Revision Plain With Sentry Diff
c6504da 1232.06 ms 1243.28 ms 11.22 ms
c6b920a 1239.53 ms 1256.53 ms 17.00 ms
318a891 1245.14 ms 1259.02 ms 13.88 ms
5e769dd 1216.24 ms 1245.74 ms 29.50 ms
7cd187e 1243.04 ms 1244.79 ms 1.75 ms
4329cdb 1206.96 ms 1227.74 ms 20.78 ms
6813f7c 1232.23 ms 1251.47 ms 19.24 ms
4350d44 1228.75 ms 1246.75 ms 18.00 ms
a176fc4 1226.24 ms 1247.50 ms 21.26 ms
25bcc50 1237.69 ms 1258.40 ms 20.71 ms

App size

Revision Plain With Sentry Diff
c6504da 20.76 KiB 414.44 KiB 393.69 KiB
c6b920a 21.58 KiB 631.19 KiB 609.61 KiB
318a891 22.30 KiB 750.02 KiB 727.72 KiB
5e769dd 21.58 KiB 572.21 KiB 550.62 KiB
7cd187e 20.76 KiB 401.65 KiB 380.89 KiB
4329cdb 22.30 KiB 833.10 KiB 810.80 KiB
6813f7c 21.58 KiB 614.86 KiB 593.28 KiB
4350d44 21.58 KiB 629.82 KiB 608.24 KiB
a176fc4 22.84 KiB 403.24 KiB 380.39 KiB
25bcc50 20.76 KiB 427.23 KiB 406.46 KiB

Previous results on branch: philprime/session-replay-inversion-improvements

Startup times

Revision Plain With Sentry Diff
9375ae6 1232.45 ms 1244.33 ms 11.88 ms
3fc8649 1216.90 ms 1228.31 ms 11.41 ms
57ce5ac 1207.40 ms 1227.88 ms 20.48 ms
8a0a01e 1230.29 ms 1241.67 ms 11.38 ms
64adbb3 1236.60 ms 1241.93 ms 5.33 ms

App size

Revision Plain With Sentry Diff
9375ae6 22.30 KiB 850.53 KiB 828.23 KiB
3fc8649 22.30 KiB 850.86 KiB 828.55 KiB
57ce5ac 22.30 KiB 850.79 KiB 828.48 KiB
8a0a01e 22.30 KiB 836.54 KiB 814.24 KiB
64adbb3 22.30 KiB 852.79 KiB 830.48 KiB

Copy link

codecov bot commented Mar 25, 2025

Codecov Report

Attention: Patch coverage is 87.10602% with 45 lines in your changes missing coverage. Please review.

Project coverage is 92.780%. Comparing base (ea44e7f) to head (168bb3d).

Files with missing lines Patch % Lines
...egrations/SessionReplay/SentryOnDemandReplay.swift 86.666% 26 Missing and 2 partials ⚠️
...ions/SessionReplay/SentryOnDemandReplayTests.swift 76.190% 10 Missing ⚠️
...tegrations/SessionReplay/SentrySessionReplay.swift 81.818% 4 Missing ⚠️
Sources/Sentry/SentrySessionReplayIntegration.m 93.548% 2 Missing ⚠️
...tions/SessionReplay/SentrySessionReplayTests.swift 91.666% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##              main     #5018       +/-   ##
=============================================
+ Coverage   92.688%   92.780%   +0.091%     
=============================================
  Files          677       678        +1     
  Lines        84405     84600      +195     
  Branches     29608     30763     +1155     
=============================================
+ Hits         78234     78492      +258     
+ Misses        6073      6008       -65     
- Partials        98       100        +2     
Files with missing lines Coverage Δ
Sources/Sentry/SentryDispatchQueueWrapper.m 100.000% <100.000%> (ø)
.../Integrations/SessionReplay/SentryReplayType.swift 100.000% <100.000%> (ø)
...rations/SessionReplay/SentryReplayVideoMaker.swift 100.000% <ø> (ø)
...grations/SessionReplay/SentryReplayTypeTests.swift 100.000% <100.000%> (ø)
...tions/SessionReplay/SentrySessionReplayTests.swift 98.953% <91.666%> (-0.194%) ⬇️
Sources/Sentry/SentrySessionReplayIntegration.m 88.172% <93.548%> (-0.129%) ⬇️
...tegrations/SessionReplay/SentrySessionReplay.swift 96.197% <81.818%> (-1.495%) ⬇️
...ions/SessionReplay/SentryOnDemandReplayTests.swift 92.067% <76.190%> (-0.725%) ⬇️
...egrations/SessionReplay/SentryOnDemandReplay.swift 90.740% <86.666%> (-2.778%) ⬇️

... and 29 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea44e7f...168bb3d. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@philprime
Copy link
Contributor Author

This PR grew into a a rather large refactor, I'll create additional PRs to split the changes into smaller digestible parts

@philprime philprime marked this pull request as ready for review April 11, 2025 13:49
@philprime philprime moved this from In Progress to Needs Review in [DEPRECATED] Mobile SDKs Apr 11, 2025
@philipphofmann
Copy link
Member

I'm unsure if we're choosing the right approach here by adding two more DispatchQueues. If I'm not mistaken, the root cause is using the default DispatchQueue of the SentryDispatchQueueWrapper here:

group.enter()
videoWriterInput.requestMediaDataWhenReady(on: workingQueue.queue) {

SentrySessionReplayIntegration.resumePreviousSessionReplay already runs on a BG thread scheduled on the SentryDispatchQueueWrapper.queue. This resumePreviousSessionReplay then calls SentryOnDemandReplay.createVideoWithBeginning, which calls SentryOnDemandReplay.renderVideo. renderVideo then renders the video with AVAssetWriterInput.requestMediaDataWhenReady, which uses SentryDispatchQueueWrapper.queue. As this queue is serial, I'm not surprised that, especially when the SDK starts and there are a couple of envelopes that need to be sent, we receive a priority inversion warning.

private func renderVideo(with videoFrames: [SentryReplayFrame], from: inout Int, at outputFileURL: URL) throws -> SentryVideoInfo? {
guard from < videoFrames.count, let image = UIImage(contentsOfFile: videoFrames[from].imagePath) else { return nil }
let videoWidth = image.size.width * CGFloat(videoScale)
let videoHeight = image.size.height * CGFloat(videoScale)
let videoWriter = try AVAssetWriter(url: outputFileURL, fileType: .mp4)
let videoWriterInput = AVAssetWriterInput(mediaType: .video, outputSettings: createVideoSettings(width: videoWidth, height: videoHeight))
guard let currentPixelBuffer = SentryPixelBuffer(size: CGSize(width: videoWidth, height: videoHeight), videoWriterInput: videoWriterInput)
else { throw SentryOnDemandReplayError.cantCreatePixelBuffer }
videoWriter.add(videoWriterInput)
videoWriter.startWriting()
videoWriter.startSession(atSourceTime: .zero)
var lastImageSize: CGSize = image.size
var usedFrames = [SentryReplayFrame]()
let group = DispatchGroup()
var result: Result<SentryVideoInfo?, Error>?
var frameCount = from
group.enter()
videoWriterInput.requestMediaDataWhenReady(on: workingQueue.queue) {
guard videoWriter.status == .writing else {
videoWriter.cancelWriting()
result = .failure(videoWriter.error ?? SentryOnDemandReplayError.errorRenderingVideo )
group.leave()
return
}
if frameCount >= videoFrames.count {
result = self.finishVideo(outputFileURL: outputFileURL, usedFrames: usedFrames, videoHeight: Int(videoHeight), videoWidth: Int(videoWidth), videoWriter: videoWriter)
group.leave()
return
}
let frame = videoFrames[frameCount]
if let image = UIImage(contentsOfFile: frame.imagePath) {
if lastImageSize != image.size {
result = self.finishVideo(outputFileURL: outputFileURL, usedFrames: usedFrames, videoHeight: Int(videoHeight), videoWidth: Int(videoWidth), videoWriter: videoWriter)
group.leave()
return
}
lastImageSize = image.size
let presentTime = CMTime(seconds: Double(frameCount), preferredTimescale: CMTimeScale(1 / self.frameRate))
if currentPixelBuffer.append(image: image, presentationTime: presentTime) != true {
videoWriter.cancelWriting()
result = .failure(videoWriter.error ?? SentryOnDemandReplayError.errorRenderingVideo )
group.leave()
return
}
usedFrames.append(frame)
}
frameCount += 1
}
guard group.wait(timeout: .now() + 2) == .success else { throw SentryOnDemandReplayError.errorRenderingVideo }
from = frameCount
return try result?.get()
}

I don't think we need to call createVideoAsyncWithBeginning in the SentrySessionReplayIntegration. Creating the video already happens on a BG thread. Making it async requires using a DispatchGroup below, so it's more complex again.

dispatch_group_t group = dispatch_group_create();
dispatch_group_enter(group);
[resumeReplayMaker
createVideoAsyncWithBeginning:beginning
end:end
completion:^(NSArray<SentryVideoInfo *> *_Nullable resultVideos,
NSError *_Nullable resultError) {
videos = resultVideos;
error = resultError;
dispatch_group_leave(group);
}];
// Wait for the video creation to finish without a timeout, because the video creation is
// expected to finish in a reasonable time frame.
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);

Instead, I think we need to use one extra dedicated queue for video processing in SentryOnDemandReplay and this should solve the issue. Maybe I'm missing something. I'm also happy to brainstorm this on a call.

@philprime
Copy link
Contributor Author

I'll try to go into your point, but yes I'll contact you regarding a call to discuss this in detail.

If I'm not mistaken, the root cause is using the default DispatchQueue of the SentryDispatchQueueWrapper here:

Yes and no. The actual problem causing the warning is this line...

guard group.wait(timeout: .now() + 2) == .success else { throw SentryOnDemandReplayError.errorRenderingVideo }

... as it was called from a queue which was created here:

dispatch_queue_attr_t attributes = dispatch_queue_attr_make_with_qos_class(
DISPATCH_QUEUE_SERIAL, DISPATCH_QUEUE_PRIORITY_LOW, 0);
SentryDispatchQueueWrapper *dispatchQueue =
[[SentryDispatchQueueWrapper alloc] initWithName:"io.sentry.session-replay"
attributes:attributes];

This serial queue with priority low was still somehow on a higher priority than the one used by the asset writer in requestMediaDataWhenReady, while being unclear on which priority.

This DISPATCH_QUEUE_PRIORITY_LOW seems to not even exist in Swift DispatchQueue.QoSClass which is why I changed it to QOS_CLASS_BACKGROUND here:

dispatch_queue_attr_t attributes = dispatch_queue_attr_make_with_qos_class(
DISPATCH_QUEUE_SERIAL, QOS_CLASS_BACKGROUND, relativePriority);

This change alone might already fix the Xcode warning, but this PR is in general to improve multi-threading to ensure that thread inversion can not happen.

I'm unsure if we're choosing the right approach here by adding two more DispatchQueues.

I don't think we need to call createVideoAsyncWithBeginning in the SentrySessionReplayIntegration. Creating the video already happens on a BG thread. Making it async requires using a DispatchGroup below, so it's more complex again.

We are not adding two more queues, we are only adding the asset worker queue.

The processing queue already existed here:

dispatch_queue_attr_t attributes = dispatch_queue_attr_make_with_qos_class(
DISPATCH_QUEUE_SERIAL, DISPATCH_QUEUE_PRIORITY_LOW, 0);
SentryDispatchQueueWrapper *dispatchQueue =
[[SentryDispatchQueueWrapper alloc] initWithName:"io.sentry.session-replay"
attributes:attributes];

But instead of using it in the SentrySessionReplay.swift here (only use case in the class):

private func createAndCapture(startedAt: Date, replayType: SentryReplayType) {
SentryLog.debug("[Session Replay] Creating replay video started at date: \(startedAt), replayType: \(replayType)")
//Creating a video is heavy and blocks the thread
//Since this function is always called in the main thread
//we dispatch it to a background thread.
dispatchQueue.dispatchAsync {
do {
let videos = try self.replayMaker.createVideoWith(beginning: startedAt, end: self.dateProvider.date())
for video in videos {
self.newSegmentAvailable(videoInfo: video, replayType: replayType)
}
SentryLog.debug("[Session Replay] Finished replay video creation with \(videos.count) segments")
} catch {
SentryLog.debug("Could not create replay video - \(error.localizedDescription)")
}
}
}

I moved the background dispatching into the createVideoWith(beginning:) so that the class SentrySessionReplay does not do any dispatching anymore, but instead uses a callback approach to process the video segments (PR diff).

The processing queue is therefore now moved to the SentryOnDemandReplay, but instead of just creating it in the initializer, I made it injectable from SentrySessionReplayIntegration.m:

// The dispatch queue is used to asynchronously wait for the asset worker queue to finish its
// work. To avoid a deadlock, the priority of the processing queue must be lower than the asset
// worker queue. Use a relative priority of -2 to make it lower than the asset worker queue.
_replayProcessingQueue = [SentryDispatchQueueWrapper
createBackgroundDispatchQueueWithName:"io.sentry.session-replay.processing"
relativePriority:-2];

Instead, I think we need to use one extra dedicated queue for video processing in SentryOnDemandReplay and this should solve the issue. Maybe I'm missing something. I'm also happy to brainstorm this on a call.

That's what the newly added asset worker queue is for:

// The asset worker queue is used to work on video and frames data.
// Use a relative priority of -1 to make it lower than the default background priority.
_replayAssetWorkerQueue = [SentryDispatchQueueWrapper
createBackgroundDispatchQueueWithName:"io.sentry.session-replay.asset-worker"
relativePriority:-1];

As mentioned above, the createVideoAsyncWith(beginning:) is dispatching into the background processing queue to iterate all the frames and convert them into segments...

https://github.com/getsentry/sentry-cocoa/blob/44c4af72f68c588083c5c351b853ed8b76821275/Sources/Swift/Integrations/SessionReplay/SentryOnDemandReplay.swift#L140C10-L147

... and we need to loop through multiple frames and segments, while waiting for the renderVideo callback to respond, therefore it uses a semaphore to synchronize it.

The reason why renderVideo is async, is caused by the AVAssetWriter using some self-configured background queue inside which is outside our control (as I did not see any way of configuring it), therefore instead we can only pass a callback queue to requestMediaDataWhenReady. Now as the processing queue is already waiting using the semaphore, the completion block of requestMediaDataWhenReady can not be dispatched on this blocked queue, and instead we need the asset worker queue to process media, then call the callback of renderVideo to then releasing the semaphore and the dispatch queue.

If we go back to the previous implementation of renderVideo which was not async with a callback but synchronized itself using DispatchGroup...

let group = DispatchGroup()
var result: Result<SentryVideoInfo?, Error>?
var frameCount = from
group.enter()
videoWriterInput.requestMediaDataWhenReady(on: workingQueue.queue) {
guard videoWriter.status == .writing else {

... that would be the same logic as if we do the synchronization
e could go back to the previous implementation of the renderVideo which did not offer a callback but synchronized itself using dispatch group, but that would still be the same as the logic in createVideoAsyncWith.

This is therefore also a shift of synchronization logic, but I don't understand why renderVideo should be synchronized in the first place, if the used viewWriterInput is also using callback-style async.

Finally the asset worker queue has an explicitly defined higher priority than the processing queue, so when the processing queue waits for the asset worker queue the dispatcher will prefer the asset worker to finish and release the other one.

@philprime
Copy link
Contributor Author

I found this PR comment and #4439 which I'll need to go through in detail again because the comments do include additional information on the choices made.

I'll try to summarize the reasonings and add more comments to the code itself, so it is documented for future maintenance.

@philprime philprime moved this from Needs Review to In Progress in [DEPRECATED] Mobile SDKs Apr 23, 2025
@philprime philprime marked this pull request as ready for review April 30, 2025 11:29
@philprime
Copy link
Contributor Author

I also added more logging where relevant to debug issues

Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going in the right direction. I have a few questions on error handling and testing. Apart from that it already looks way better than before.

Comment on lines +21 to +23
### Fixes

- Fix thread inversion warning in session replay (#5018)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: There is already a fixes section for the unreleased section

Comment on lines +130 to +137
_replayAssetWorkerQueue = [SentryDispatchQueueWrapper
createBackgroundDispatchQueueWithName:"io.sentry.session-replay.asset-worker"
relativePriority:-1];
// The dispatch queue is used to asynchronously wait for the asset worker queue to finish its
// work. To avoid a deadlock, the priority of the processing queue must be lower than the asset
// worker queue. Use a relative priority of -2 to make it lower than the asset worker queue.
_replayProcessingQueue = [SentryDispatchQueueWrapper
createBackgroundDispatchQueueWithName:"io.sentry.session-replay.processing"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h: The comment mentions that setting the wrong priorities here can lead to a deadlock. I think it's essential to have a test fail when using the wrong priorities to avoid regressions. We could assert that the SentryDispatchQueueWrapper is initialized with the correct attributes, which tests implementation details, and has the downside of having to change the test when the implementation details change. We could also have a test causing a deadlock when interacting with this class when the priorities are wrongly set, which could be a bit harder to achieve, but has the advantage of not testing implementation details.

// It is imporant that the renderVideo completion block signals the group.
// The queue used by render video must have a higher priority than the processing queue to reduce thread inversion.
// Otherwise, it could lead to queue starvation and a deadlock/timeout.
guard group.wait(timeout: .now() + 120) == .success else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: 120 seems a bit high to me. Shouldn't 30 be enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out! 🙏
I had to set this higher because it was set to 2 seconds, and when debugging the timeout would always hit, but forgot to revert it afterwards.

Comment on lines +128 to +132
// The asset worker queue is used to work on video and frames data.
// Use a relative priority of -1 to make it lower than the default background priority.
_replayAssetWorkerQueue = [SentryDispatchQueueWrapper
createBackgroundDispatchQueueWithName:"io.sentry.session-replay.asset-worker"
relativePriority:-1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h: That's so easy to break. If we ever change the priority of the default background priority here, we could end up with the same problem again. And when I change the priority in the SentryDispatchQueueWrapper how should I know that I I need to change it here too?

// DISPATCH_QUEUE_SERIAL_WITH_AUTORELEASE_POOL is requires iOS 10. Since we are targeting
// iOS 9 we need to manually add the autoreleasepool.
dispatch_queue_attr_t attributes = dispatch_queue_attr_make_with_qos_class(
DISPATCH_QUEUE_SERIAL, DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

A test should fail if these priorities don't match our requirements here. If you need ideas for testing this properly, please let me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would highly appreciate an example how you would test the priorities of queues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a factory method that I can swap out in the tests or something like that. Then I can assert that I call the factory with the correct parameters.

Comment on lines +210 to +212
guard group.wait(timeout: .now() + 120) == .success else {
SentryLog.error("[Session Replay] Timeout while waiting for video rendering to finish.")
throw SentryOnDemandReplayError.errorRenderingVideo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: If we timeout, we could still return the videos we have and send what we have instead of discarding everything.

Comment on lines +216 to +219
if let error = currentError {
SentryLog.error("[Session Replay] Error while rendering video: \(error), cancelling video segment creation")
throw error
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: Same here, we could still return the videos we have and try to send them instead of throwing away all of them.

// It is imporant that the renderVideo completion block signals the group.
// The queue used by render video must have a higher priority than the processing queue to reduce thread inversion.
// Otherwise, it could lead to queue starvation and a deadlock/timeout.
guard group.wait(timeout: .now() + 120) == .success else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: What drove you to add the DispatchGroup inside the while frameCount < videoFrames.count { and not outside? Now we have a timeout of 120 for every loop iteration, which can be a very long time. Shouldn't we have one timeout for the whole process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout is wrong as mentioned in the other comment. It was at 2 seconds before, which is also a very tight window to do video conversion. I will reconsider a longer timeout for the entire process instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Session Replay thread priority inversions
2 participants