
Commit 572040b

jordanhunt22 (Convex, Inc.) authored and committed
[Retention] Possible fixes (#25219)
We ended up overloading the database when rolling out document retention. This PR includes some possible fixes:

1. Increase the time between batches to 10 minutes. Since pushes happen quite fast, it makes more sense to have instances query the database far less frequently.
2. Increase the jitter so that all instances aren't hitting the database at the same time.
3. Update the index used by the query. It was using the `primary` index instead of the `by_table_and_document_id` index on the documents table; it now uses the correct one. I didn't catch this when I ran `explain analyze`, but this should be much better.

Let me know if you have any other suggestions. It would also be nice if we could somehow push this out more slowly so that we can monitor better as the push is going on.

GitOrigin-RevId: 193bef96d3ec528d588bbf56cbbaca4e03bcb7d2
1 parent 3bc5b5c commit 572040b
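A rough illustration of fix 1 (not part of the commit): the old knob fed `Quota::per_minute`, so its default of 1 allowed one delete batch per minute, while the new interval-based knob works out to one batch every 10 minutes. A minimal sketch using the `governor` crate's `Quota` type, which is what the diff below appears to use:

```rust
use std::{num::NonZeroU32, time::Duration};

use governor::Quota;

fn main() {
    // Old default: at most one delete batch per minute.
    let old_quota = Quota::per_minute(NonZeroU32::new(1).unwrap());

    // New default: one delete batch every 10 minutes, so each instance
    // queries the database far less often.
    let new_quota = Quota::with_period(Duration::from_secs(10 * 60)).unwrap();

    let _ = (old_quota, new_quota);
}
```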

2 files changed: 6 additions, 9 deletions


crates/common/src/knobs.rs

Lines changed: 3 additions & 6 deletions
```diff
@@ -382,12 +382,9 @@ pub static RETENTION_FAIL_START_MULTIPLIER: LazyLock<usize> =
 pub static RETENTION_FAIL_ALL_MULTIPLIER: LazyLock<usize> =
     LazyLock::new(|| env_config("RETENTION_FAIL_ALL_MULTIPLIER", 40));
 
-/// Maximum number of batches of documents that can be deleted in a minute
-pub static DOCUMENT_RETENTION_BATCHES_PER_MINUTE: LazyLock<NonZeroU32> = LazyLock::new(|| {
-    env_config(
-        "DOCUMENT_RETENTION_BATCHES_PER_MINUTE",
-        NonZeroU32::new(1).unwrap(),
-    )
+/// Time in between batches of deletes for document retention
+pub static DOCUMENT_RETENTION_BATCH_INTERVAL_SECONDS: LazyLock<Duration> = LazyLock::new(|| {
+    Duration::from_secs(env_config("DOCUMENT_RETENTION_BATCHES_PER_MINUTE", 10 * 60))
 });
 
 /// Whether or not we run document retention in dry run mode
```
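For context, here is a minimal sketch of how a `Duration`-valued knob like this typically behaves. The `env_config` helper below is a hypothetical stand-in, not Convex's actual implementation; the default works out to 10 * 60 seconds between batches:

```rust
use std::{env, str::FromStr, time::Duration};

// Hypothetical stand-in for the crate's `env_config` helper: read an env var
// and parse it, falling back to the compiled-in default.
fn env_config<T: FromStr>(name: &str, default: T) -> T {
    env::var(name)
        .ok()
        .and_then(|raw| raw.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // Mirrors the new knob: default of 10 * 60 seconds between delete batches.
    let interval =
        Duration::from_secs(env_config("DOCUMENT_RETENTION_BATCHES_PER_MINUTE", 10 * 60));
    println!("document retention batch interval: {interval:?}");
}
```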

crates/database/src/retention.rs

Lines changed: 3 additions & 3 deletions
```diff
@@ -39,7 +39,7 @@ use common::{
     interval::Interval,
     knobs::{
         DEFAULT_DOCUMENTS_PAGE_SIZE,
-        DOCUMENT_RETENTION_BATCHES_PER_MINUTE,
+        DOCUMENT_RETENTION_BATCH_INTERVAL_SECONDS,
         DOCUMENT_RETENTION_DELAY,
         DOCUMENT_RETENTION_DRY_RUN,
         INDEX_RETENTION_DELAY,
```
```diff
@@ -476,7 +476,7 @@ impl<RT: Runtime> LeaderRetentionManager<RT> {
         // On startup wait with jitter to avoid a thundering herd. This does mean that
         // we will ignore commit timestamps for a while, but it saves us from
         // having every machine polling a very precise interval.
-        Self::wait_with_jitter(&rt, *MAX_RETENTION_DELAY_SECONDS).await;
+        Self::wait_with_jitter(&rt, *DOCUMENT_RETENTION_BATCH_INTERVAL_SECONDS).await;
 
         loop {
             {
```
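`wait_with_jitter` is internal to this crate and its body isn't shown in the diff. A hypothetical sketch of the idea it implements, assuming it sleeps for a uniformly random fraction of the given maximum so instances pushed at the same time don't all wake together:

```rust
use std::time::Duration;

use rand::Rng;

// Hypothetical sketch, not the crate's implementation: sleep for a random
// fraction of `max_delay` so instances don't poll the database in lockstep.
async fn wait_with_jitter_sketch(max_delay: Duration) {
    let fraction: f64 = rand::thread_rng().gen_range(0.0..=1.0);
    tokio::time::sleep(max_delay.mul_f64(fraction)).await;
}
```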
```diff
@@ -1140,7 +1140,7 @@ impl<RT: Runtime> LeaderRetentionManager<RT> {
 
         let rate_limiter = new_rate_limiter(
             rt.clone(),
-            Quota::per_minute(*DOCUMENT_RETENTION_BATCHES_PER_MINUTE),
+            Quota::with_period(*DOCUMENT_RETENTION_BATCH_INTERVAL_SECONDS).unwrap(),
         );
 
         loop {
```
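`new_rate_limiter` is a Convex wrapper that isn't shown here. A minimal sketch of the resulting behavior, assuming it builds a direct (non-keyed) `governor` limiter; the delete loop can then run at most one batch per interval:

```rust
use std::time::Duration;

use governor::{Quota, RateLimiter};

#[tokio::main]
async fn main() {
    // Assumed 10-minute interval, matching the new knob's default.
    let interval = Duration::from_secs(10 * 60);
    let rate_limiter = RateLimiter::direct(Quota::with_period(interval).unwrap());

    loop {
        // Block until the limiter hands out the next cell, i.e. at most one
        // delete batch per interval.
        rate_limiter.until_ready().await;
        // ... delete one batch of expired documents here ...
        break; // keep the sketch finite
    }
}
```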
