
Commit 572040b

jordanhunt22 (Convex, Inc.) authored and committed
[Retention] Possible fixes (#25219)
We ended up overloading the database when rolling out document retention. This PR includes some possible fixes:

1. Increase the time between batches to 10 minutes. Since pushes happen quite fast, it makes more sense to have instances query the database far less frequently.
2. Increase the jitter so that all instances aren't hitting the database at the same time.
3. Update the index used by the query. It was using the `primary` index instead of the `by_table_and_document_id` index on the documents table; it now uses the correct one. I didn't catch this when I ran `explain analyze`, but this should be much better.

Let me know if you have any other suggestions. It would also be nice if we could somehow push this out more slowly so that we can monitor better as the push is going on.

GitOrigin-RevId: 193bef96d3ec528d588bbf56cbbaca4e03bcb7d2
1 parent 3bc5b5c commit 572040b
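A rough illustration of fix 1 (not part of the commit): the old knob fed `Quota::per_minute`, so its default of 1 allowed one delete batch per minute, while the new interval-based knob works out to one batch every 10 minutes. A minimal sketch using the `governor` crate's `Quota` type, which is what the diff below appears to use:

```rust
use std::{num::NonZeroU32, time::Duration};

use governor::Quota;

fn main() {
    // Old default: at most one delete batch per minute.
    let old_quota = Quota::per_minute(NonZeroU32::new(1).unwrap());

    // New default: one delete batch every 10 minutes, so each instance
    // queries the database far less often.
    let new_quota = Quota::with_period(Duration::from_secs(10 * 60)).unwrap();

    let _ = (old_quota, new_quota);
}
```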

2 files changed: 6 additions, 9 deletions


crates/common/src/knobs.rs

Lines changed: 3 additions & 6 deletions
```diff
@@ -382,12 +382,9 @@ pub static RETENTION_FAIL_START_MULTIPLIER: LazyLock<usize> =
 pub static RETENTION_FAIL_ALL_MULTIPLIER: LazyLock<usize> =
     LazyLock::new(|| env_config("RETENTION_FAIL_ALL_MULTIPLIER", 40));
 
-/// Maximum number of batches of documents that can be deleted in a minute
-pub static DOCUMENT_RETENTION_BATCHES_PER_MINUTE: LazyLock<NonZeroU32> = LazyLock::new(|| {
-    env_config(
-        "DOCUMENT_RETENTION_BATCHES_PER_MINUTE",
-        NonZeroU32::new(1).unwrap(),
-    )
+/// Time in between batches of deletes for document retention
+pub static DOCUMENT_RETENTION_BATCH_INTERVAL_SECONDS: LazyLock<Duration> = LazyLock::new(|| {
+    Duration::from_secs(env_config("DOCUMENT_RETENTION_BATCHES_PER_MINUTE", 10 * 60))
 });
 
 /// Whether or not we run document retention in dry run mode
```
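For context, here is a minimal sketch of how a `Duration`-valued knob like this typically behaves. The `env_config` helper below is a hypothetical stand-in, not Convex's actual implementation; the default works out to 10 * 60 seconds between batches:

```rust
use std::{env, str::FromStr, time::Duration};

// Hypothetical stand-in for the crate's `env_config` helper: read an env var
// and parse it, falling back to the compiled-in default.
fn env_config<T: FromStr>(name: &str, default: T) -> T {
    env::var(name)
        .ok()
        .and_then(|raw| raw.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // Mirrors the new knob: default of 10 * 60 seconds between delete batches.
    let interval =
        Duration::from_secs(env_config("DOCUMENT_RETENTION_BATCHES_PER_MINUTE", 10 * 60));
    println!("document retention batch interval: {interval:?}");
}
```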

crates/database/src/retention.rs

Lines changed: 3 additions & 3 deletions
```diff
@@ -39,7 +39,7 @@ use common::{
     interval::Interval,
     knobs::{
         DEFAULT_DOCUMENTS_PAGE_SIZE,
-        DOCUMENT_RETENTION_BATCHES_PER_MINUTE,
+        DOCUMENT_RETENTION_BATCH_INTERVAL_SECONDS,
         DOCUMENT_RETENTION_DELAY,
         DOCUMENT_RETENTION_DRY_RUN,
         INDEX_RETENTION_DELAY,
```
```diff
@@ -476,7 +476,7 @@ impl<RT: Runtime> LeaderRetentionManager<RT> {
         // On startup wait with jitter to avoid a thundering herd. This does mean that
         // we will ignore commit timestamps for a while, but it saves us from
         // having every machine polling a very precise interval.
-        Self::wait_with_jitter(&rt, *MAX_RETENTION_DELAY_SECONDS).await;
+        Self::wait_with_jitter(&rt, *DOCUMENT_RETENTION_BATCH_INTERVAL_SECONDS).await;
 
         loop {
             {
```
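`wait_with_jitter` is internal to this crate and its body isn't shown in the diff. A hypothetical sketch of the idea it implements, assuming it sleeps for a uniformly random fraction of the given maximum so instances pushed at the same time don't all wake together:

```rust
use std::time::Duration;

use rand::Rng;

// Hypothetical sketch, not the crate's implementation: sleep for a random
// fraction of `max_delay` so instances don't poll the database in lockstep.
async fn wait_with_jitter_sketch(max_delay: Duration) {
    let fraction: f64 = rand::thread_rng().gen_range(0.0..=1.0);
    tokio::time::sleep(max_delay.mul_f64(fraction)).await;
}
```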
```diff
@@ -1140,7 +1140,7 @@ impl<RT: Runtime> LeaderRetentionManager<RT> {
 
         let rate_limiter = new_rate_limiter(
             rt.clone(),
-            Quota::per_minute(*DOCUMENT_RETENTION_BATCHES_PER_MINUTE),
+            Quota::with_period(*DOCUMENT_RETENTION_BATCH_INTERVAL_SECONDS).unwrap(),
         );
 
         loop {
```
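`new_rate_limiter` is a Convex wrapper that isn't shown here. A minimal sketch of the resulting behavior, assuming it builds a direct (non-keyed) `governor` limiter; the delete loop can then run at most one batch per interval:

```rust
use std::time::Duration;

use governor::{Quota, RateLimiter};

#[tokio::main]
async fn main() {
    // Assumed 10-minute interval, matching the new knob's default.
    let interval = Duration::from_secs(10 * 60);
    let rate_limiter = RateLimiter::direct(Quota::with_period(interval).unwrap());

    loop {
        // Block until the limiter hands out the next cell, i.e. at most one
        // delete batch per interval.
        rate_limiter.until_ready().await;
        // ... delete one batch of expired documents here ...
        break; // keep the sketch finite
    }
}
```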
