Skip to content

*(dm): retry statement execution on invalid connections in some scenarios (#12078) #12170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: release-8.5
Choose a base branch
from

Conversation

cgtz
Copy link
Contributor

@cgtz cgtz commented May 15, 2025

This is a manual cherry-pick of #12078

What problem does this PR solve?

This PR tries to address a case where, when a TiDB machine is restarted, stale connections will live on in the connection pool. When DM attempts to use these connections, it will receive an "Invalid Connection" error, which will cause the task to pause and increase replication lag while waiting to reschedule it.

Issue Number: close #11805

What is changed and how it works?

When we get an invalid connection error on a begin statement, we can safely retry after resetting the connection. The reset connection call can also fail due to similar reasons or due to a load balancer layer refusing the connection, so we can also do a retry with backoff here.

Check List

Tests

Tested on our organization's deployment of DM.

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Retry when the DM syncer receives an invalid connection error while at the beginning of a transaction to avoid full task pauses that increase replication lag

Copy link
Contributor

ti-chi-bot bot commented May 15, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added do-not-merge/cherry-pick-not-approved do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 15, 2025
Copy link
Contributor

ti-chi-bot bot commented May 15, 2025

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must be approved by the approvers firstly.
  2. AFTER it has been approved by approvers, please wait for the cherry-pick merging approval from triage owners.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the contribution This PR is from a community contributor. label May 15, 2025
Copy link
Contributor

ti-chi-bot bot commented May 15, 2025

Hi @cgtz. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 15, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @cgtz, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Gemini here, providing a summary for this pull request. This PR is a manual cherry-pick aimed at improving the resilience of the DM (Data Migration) syncer when dealing with invalid database connections. The core problem addressed is that when a downstream TiDB instance restarts, the DM syncer might hold onto stale connections in its pool. Attempting to use these stale connections results in an "Invalid Connection" error, which previously caused the task to pause and increased replication lag. This change introduces retry logic specifically for BEGIN statements when an invalid connection error occurs. It also adds retries for the connection reset operation itself, acknowledging that resetting can also fail.

Highlights

  • Error Handling: Introduced a new specific error code (ErrDBExecuteFailedBegin) for failures occurring during the BEGIN statement execution, allowing for more granular error handling and retry logic.
  • Retry Logic: Implemented retry logic for BEGIN statements when an invalid connection error is encountered. This allows the syncer to attempt to re-establish a valid connection and continue processing without pausing the task.
  • Connection Reset Retry: Added retry capability for the ResetConn operation itself, handling cases where the attempt to reset a connection might also fail due to network issues or load balancer behavior.
  • Test Updates: Updated existing tests and test scripts (all_mode/run.sh, many_tables/run.sh) to incorporate the new error code and adjust sleep times to accommodate the introduced retry delays.

Changelog

Click here to see the changelog
  • dm/_utils/terror_gen/errors_release.txt
    • Added a new error code ErrDBExecuteFailedBegin (code 10007) for begin statement execution failures.
  • dm/errors.toml
    • Defined the new error code DM-database-10007 with the message "execute statement failed: begin".
  • dm/pkg/conn/baseconn.go
    • Changed the error returned when BeginTx fails from ErrDBExecuteFailed to the new ErrDBExecuteFailedBegin (line 168).
  • dm/pkg/conn/baseconn_test.go
    • Updated the test case to assert against the new ErrDBExecuteFailedBegin when ExpectBegin returns an error (line 84).
  • dm/pkg/retry/errors.go
    • Imported the strings package (line 18).
    • Added IsRetryableConnectionErrorOnBeginOrSafeMode function to check if an error is retryable for BEGIN statements or in safe mode (lines 105-110).
    • Added IsRetryableErrorOnResetConn function to check if an error is retryable specifically for connection reset attempts (lines 112-119).
  • dm/pkg/retry/strategy.go
    • Added logging for context cancellation before retry backoff (line 98).
    • Added logging for non-retryable errors (line 104).
  • dm/pkg/terror/errcode_string.go
    • Added the new error code codeDBExecuteFailedBegin to the _ function (line 17).
    • Updated the _ErrCode_name constant string to include the name for the new error code (line 603).
    • Updated the _ErrCode_map to map the new error code to its string representation (line 612).
  • dm/pkg/terror/error_list.go
    • Added the codeDBExecuteFailedBegin constant to the error code list (line 24).
    • Defined the ErrDBExecuteFailedBegin variable with the new error code and message (line 746).
  • dm/syncer/dbconn/db.go
    • Captured the number of retries and the total cost of the ApplyRetryStrategy call (line 173).
    • Added numRetries and costSeconds to the error log when execute statements fail after retry (lines 206-207).
    • Modified the retryableFn to specifically check for ErrDBExecuteFailedBegin or ErrInvalidConn (in safe mode) and added a retry loop for conn.ResetConn within this function (lines 247-280).
    • Added a new function resetConnRetryableFn to define the retryable conditions for the ResetConn operation (lines 293-304).
  • dm/tests/all_mode/run.sh
    • Increased the sleep duration after running an SQL file from 2 seconds to 45 seconds to allow time for connection retries (line 484).
  • dm/tests/many_tables/run.sh
    • Increased the sleep duration after incremental data load from 30 seconds to 75 seconds (line 157).
    • Adjusted the expected minimum number of 'unit process error' logs from 4 to 2, reflecting the change in retry behavior (line 163).
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Connection lost, a dreaded sight,
But retry logic shines so bright.
Begin again, the task resumes,
Dispelling replication glooms.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

ti-chi-bot bot commented May 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lance6716 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the area/dm Issues or PRs related to DM. label May 15, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses the issue of stale connections in DM causing task pauses by introducing a retry mechanism for BEGIN statements on "Invalid Connection" errors and for the connection reset operation itself. The changes are well-targeted, and the introduction of a specific error code for BEGIN failures (ErrDBExecuteFailedBegin) improves error clarity. Logging has also been enhanced for better observability during retry attempts.

The core logic for retrying appears sound, and the handling of different retryable scenarios for connection resets (including connection refused) is a good improvement. The updates to test scripts to accommodate potentially longer retry durations are noted and seem reasonable for ensuring test stability with the new resilience features.

Summary of Findings

  • Logging Consistency: In dm/syncer/dbconn/db.go, within the retryableFn function, when logging an error after a failed connection reset attempt (around line 273), log.ShortError(resetErr) is used. Other parts of the PR have updated similar logging to use zap.Error(resetErr) for structured logging. For consistency, this instance could also be updated. This was not commented on directly as it's considered low severity.
  • Log Level for Retry Attempts: In dm/syncer/dbconn/db.go, the resetConnRetryableFn logs at Error level when retrying a connection reset. This might be too noisy for transient, recoverable errors. A Warn level is suggested for better operational clarity (see review comment).

Merge Readiness

The pull request introduces valuable resiliency improvements. The core logic is sound. There is one medium-severity suggestion regarding logging practices in dm/syncer/dbconn/db.go that would enhance operational clarity. Addressing this point would be beneficial before merging.

As an AI, I am not authorized to approve pull requests. Please ensure further review and approval from authorized maintainers before merging.

Comment on lines +296 to +300
tctx.L().Error("retrying reset connection", zap.Int("retry", retryTime),
zap.Int("resetRetries", retryTime),
zap.String("queries", utils.TruncateInterface(queries, -1)),
zap.String("arguments", utils.TruncateInterface(args, -1)),
log.ShortError(err))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logging for retrying a connection reset could be improved for clarity and to reduce potential log noise:

  1. Log Level: Using tctx.L().Error(...) for a retry attempt of an operation that is deemed retryable (by IsRetryableErrorOnResetConn) might be too strong. If these are expected transient errors that eventually recover, logging them as errors can lead to alert fatigue or misinterpretation of system health. Consider using tctx.L().Warn(...) instead, as warnings are generally more appropriate for recoverable issues that are being actively handled.
  2. Redundant Log Field: The log includes zap.Int("retry", retryTime) and zap.Int("resetRetries", retryTime). Since both fields use retryTime, one of them is redundant. It's clearer to use just one, for example, zap.Int("attempt", retryTime).

How about adjusting the log level and simplifying the fields for better operational clarity?

			tctx.L().Warn("retrying connection reset due to retryable error", zap.Int("attempt", retryTime),
				zap.String("queries", utils.TruncateInterface(queries, -1)),
				zap.String("arguments", utils.TruncateInterface(args, -1)),
				log.ShortError(err))

Copy link
Contributor

ti-chi-bot bot commented May 15, 2025

@gemini-code-assist[bot]: adding LGTM is restricted to approvers and reviewers in OWNERS files.

In response to this:

Code Review

This pull request effectively addresses the issue of stale connections in DM causing task pauses by introducing a retry mechanism for BEGIN statements on "Invalid Connection" errors and for the connection reset operation itself. The changes are well-targeted, and the introduction of a specific error code for BEGIN failures (ErrDBExecuteFailedBegin) improves error clarity. Logging has also been enhanced for better observability during retry attempts.

The core logic for retrying appears sound, and the handling of different retryable scenarios for connection resets (including connection refused) is a good improvement. The updates to test scripts to accommodate potentially longer retry durations are noted and seem reasonable for ensuring test stability with the new resilience features.

Summary of Findings

  • Logging Consistency: In dm/syncer/dbconn/db.go, within the retryableFn function, when logging an error after a failed connection reset attempt (around line 273), log.ShortError(resetErr) is used. Other parts of the PR have updated similar logging to use zap.Error(resetErr) for structured logging. For consistency, this instance could also be updated. This was not commented on directly as it's considered low severity.
  • Log Level for Retry Attempts: In dm/syncer/dbconn/db.go, the resetConnRetryableFn logs at Error level when retrying a connection reset. This might be too noisy for transient, recoverable errors. A Warn level is suggested for better operational clarity (see review comment).

Merge Readiness

The pull request introduces valuable resiliency improvements. The core logic is sound. There is one medium-severity suggestion regarding logging practices in dm/syncer/dbconn/db.go that would enhance operational clarity. Addressing this point would be beneficial before merging.

As an AI, I am not authorized to approve pull requests. Please ensure further review and approval from authorized maintainers before merging.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cgtz
Copy link
Contributor Author

cgtz commented May 15, 2025

@lance6716 @GMHDBJD , here is the new cherry pick PR for DM invalid connection retries. Please take a look. The mac os build seems to be failing because of an outdated github actions/cache version and not related to this change.

@lance6716
Copy link
Contributor

/cc @OliverS929

Copy link
Contributor

ti-chi-bot bot commented May 16, 2025

@lance6716: GitHub didn't allow me to request PR reviews from the following users: OliverS929.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @OliverS929

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/dm Issues or PRs related to DM. contribution This PR is from a community contributor. do-not-merge/cherry-pick-not-approved do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants