jeffhwang-sq
Collaborator

@jeffhwang-sq jeffhwang-sq commented May 5, 2025

Enable dynamic partition adjustments for paused backfills

  • Adds a new option to edit pkey cursor.
  • Adds a new landing page to adjust the pkey.
  • Shows warning and success messages.
Screen.Recording.2025-05-07.at.12.16.07.PM.mov
Screenshot 2025-05-05 at 12 56 57 PM Screenshot 2025-05-05 at 12 57 29 PM Screenshot 2025-05-08 at 12 19 40 PM

* master:
  Add pagination to events (#445)
  Fix DynamoDB BatchWriteItem 25-item limit in UpdateInPlaceDynamoDbBackfill (#447)
  Update BackfillCreateAction.kt (#446)
@adrw
Collaborator

adrw commented May 7, 2025

On success, it should redirect back to the Backfill Show page. Alternatively it'd be nice to have the change form inline like the other config change buttons which turn into input boxes.

@jeffhwang-sq
Collaborator Author

> On success, it should redirect back to the Backfill Show page. Alternatively it'd be nice to have the change form inline like the other config change buttons which turn into input boxes.

Changes are done but let me see if I can add some tests.

jeffhwang-sq and others added 6 commits May 7, 2025 15:15
This change introduces a Cancel button to the UI for backfills that are
currently paused. Users can now manually transition a paused backfill to
a CANCELLED state, providing better control over workflows that are no
longer needed.

Changes Include:
- Added a cancel button in the backfill details view (only shown for
paused backfills).
- Updated handling of the state column in the run partition table and
the backfillruns table so that backfills are correctly marked as
CANCELLED when cancellation is triggered.

The next PR will add a new deleted column to support hiding backfills
that are soft deleted.
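
A minimal sketch of the transition described above, assuming a simplified in-memory model. The `State`, `Partition`, and `BackfillRun` types and the `cancel` function are hypothetical stand-ins, not Backfila's actual classes, and cancelling every partition along with the run is an assumption:

```kotlin
// Hypothetical, simplified model; Backfila's real entities are database-backed.
enum class State { PAUSED, RUNNING, COMPLETE, CANCELLED }

data class Partition(val name: String, val state: State)

data class BackfillRun(val id: Long, val state: State, val partitions: List<Partition>)

/** Cancels a paused run and, by assumption, all of its partitions. */
fun cancel(run: BackfillRun): BackfillRun {
  // Mirrors the UI rule that the cancel button is only shown for paused backfills.
  check(run.state == State.PAUSED) { "Only paused backfills can be cancelled, was ${run.state}" }
  return run.copy(
    state = State.CANCELLED,
    partitions = run.partitions.map { it.copy(state = State.CANCELLED) },
  )
}
```

Rejecting any state other than PAUSED on the server side keeps the rule enforced even if a request bypasses the UI.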


https://github.com/user-attachments/assets/b0072300-d0ef-47cb-946e-b52f97e97073

![Screenshot 2025-04-28 at 2 49 28 PM](https://github.com/user-attachments/assets/5def7593-345c-4ab0-9cc1-2547ca534134)
## Problem

The DynamoDB BatchWriteItem API often has transient failures with
unprocessed items that cause entire Backfila batches to fail. These
could be more granularly retried within a run to avoid Backfila getting
stuck on a batch.

## Solution

Improve the retry mechanism to better align with [AWS BatchWriteItem
best
practices](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html):
- Collect unprocessed items across all batches and retry them together
- Use exponential backoff with jitter to handle throttling
- Only count towards retry limit when no progress is made
- Provide more detailed error reporting
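
The retry strategy listed above could be sketched roughly as follows. This is not the code from the PR: `writeWithRetries`, `backoffMillis`, and the `writeChunk` callback (which stands in for a BatchWriteItem call and returns its unprocessed items) are hypothetical names, and the base delay, cap, and stall limit are assumed values.

```kotlin
import kotlin.random.Random

/** Hypothetical helper: "full jitter" delay in ms for a 0-based retry round. */
fun backoffMillis(round: Int, baseMs: Long = 50, capMs: Long = 5_000): Long {
  // Exponential ceiling, capped; the shift is bounded to avoid overflow.
  val ceiling = minOf(capMs, baseMs shl minOf(round, 20))
  return Random.nextLong(ceiling + 1) // uniform in [0, ceiling]
}

/**
 * Writes [items] in chunks of at most 25 (the BatchWriteItem limit) and retries leftovers.
 * [writeChunk] is an assumed callback that returns the items the service left unprocessed.
 */
fun <T> writeWithRetries(
  items: List<T>,
  maxStalledRounds: Int = 3, // assumed limit
  sleep: (Long) -> Unit = { Thread.sleep(it) },
  writeChunk: (List<T>) -> List<T>,
) {
  var pending = items
  var stalledRounds = 0
  var round = 0
  while (pending.isNotEmpty()) {
    // Collect unprocessed items across all chunks so they are retried together.
    val unprocessed = pending.chunked(25).flatMap(writeChunk)
    // Only a round that made no progress counts towards the limit.
    // Assumption: the counter is cumulative and is not reset on later progress.
    if (unprocessed.size >= pending.size) {
      stalledRounds++
      if (stalledRounds >= maxStalledRounds) {
        throw IllegalStateException(
          "No progress in $stalledRounds rounds; ${unprocessed.size} items left unprocessed"
        )
      }
    }
    pending = unprocessed
    if (pending.isNotEmpty()) sleep(backoffMillis(round++))
  }
}
```

Passing `sleep` as a parameter is only a convenience of the sketch: it lets a test skip the real delay.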
…#452)

## Problem

The DynamoDB BatchWriteItem implementation in
UpdateInPlaceDynamoDbBackfill currently lacks handling for
ApiCallTimeoutException. When these timeouts occur, the entire batch
fails without any retry attempts, causing backfills to fail
unnecessarily.

## Solution

Add comprehensive timeout handling with these features:
- Track and retry chunks that experience API timeouts
- Use exponential backoff with jitter for retries
- Only increment the timeout counter when all chunks in an iteration
time out
- Reset the timeout counter if any chunk succeeds
- Maintain separate counters for timeouts vs unprocessed items
- Provide detailed error context through suppressed exceptions

The implementation is designed to be resilient to transient timeouts
while still protecting against systemic failures. It coordinates the
backoff strategy between timeout retries and unprocessed item retries.
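
As an illustration only, the timeout counter rules above might look like the sketch below. `ChunkTimeoutException` is a hypothetical stand-in for the SDK's `ApiCallTimeoutException`, and `writeChunksWithTimeoutRetries`, its parameters, and the backoff constants are assumptions rather than the PR's actual code. The separate counter for unprocessed items is left out for brevity.

```kotlin
import kotlin.random.Random

// Hypothetical stand-in for the AWS SDK's ApiCallTimeoutException.
class ChunkTimeoutException(message: String) : RuntimeException(message)

fun <T> writeChunksWithTimeoutRetries(
  chunks: List<List<T>>,
  maxConsecutiveTimeouts: Int = 3, // assumed limit
  sleep: (Long) -> Unit = { Thread.sleep(it) },
  writeChunk: (List<T>) -> Unit,
) {
  var pending = chunks
  var consecutiveTimeouts = 0
  var round = 0
  val timeouts = mutableListOf<ChunkTimeoutException>()
  while (pending.isNotEmpty()) {
    // Track the chunks that timed out so only those are retried.
    val timedOut = mutableListOf<List<T>>()
    for (chunk in pending) {
      try {
        writeChunk(chunk)
      } catch (e: ChunkTimeoutException) {
        timedOut.add(chunk)
        timeouts.add(e)
      }
    }
    if (timedOut.size < pending.size) {
      consecutiveTimeouts = 0 // at least one chunk succeeded: reset the counter
    } else {
      consecutiveTimeouts++ // every chunk in this iteration timed out
      if (consecutiveTimeouts >= maxConsecutiveTimeouts) {
        val failure = IllegalStateException(
          "All chunks timed out in $consecutiveTimeouts consecutive iterations"
        )
        // Attach each timeout as a suppressed exception for detailed error context.
        timeouts.forEach { failure.addSuppressed(it) }
        throw failure
      }
    }
    pending = timedOut
    if (pending.isNotEmpty()) {
      // Exponential backoff with full jitter: 50 ms base, 5 s cap (assumed values).
      sleep(Random.nextLong(minOf(5_000L, 50L shl minOf(round++, 20)) + 1))
    }
  }
}
```

Because the counter resets whenever any chunk succeeds, a few slow chunks do not fail the batch, while a run of iterations in which every chunk times out still does.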
* master:
  Add ApiCallTimeoutException handling to UpdateInPlaceDynamoDbBackfill (#452)
  Improve BatchWriteItem retry handling (#449)
  Add "Cancelled" Button for Paused Backfills (#442)
@jeffhwang-sq jeffhwang-sq requested review from adrw and mpawliszyn May 8, 2025 15:21
@jeffhwang-sq jeffhwang-sq merged commit 70a001c into master May 8, 2025
5 checks passed