-
Notifications
You must be signed in to change notification settings - Fork 291
*(dm): retry statement execution on invalid connections in some scenarios (#12078) #12170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release-8.5
Are you sure you want to change the base?
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This cherry pick PR is for a release branch and has not yet been approved by triage owners. To merge this cherry pick:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hi @cgtz. Thanks for your PR. I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @cgtz, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Gemini here, providing a summary for this pull request. This PR is a manual cherry-pick aimed at improving the resilience of the DM (Data Migration) syncer when dealing with invalid database connections. The core problem addressed is that when a downstream TiDB instance restarts, the DM syncer might hold onto stale connections in its pool. Attempting to use these stale connections results in an "Invalid Connection" error, which previously caused the task to pause and increased replication lag. This change introduces retry logic specifically for BEGIN
statements when an invalid connection error occurs. It also adds retries for the connection reset operation itself, acknowledging that resetting can also fail.
Highlights
- Error Handling: Introduced a new specific error code (
ErrDBExecuteFailedBegin
) for failures occurring during theBEGIN
statement execution, allowing for more granular error handling and retry logic. - Retry Logic: Implemented retry logic for
BEGIN
statements when an invalid connection error is encountered. This allows the syncer to attempt to re-establish a valid connection and continue processing without pausing the task. - Connection Reset Retry: Added retry capability for the
ResetConn
operation itself, handling cases where the attempt to reset a connection might also fail due to network issues or load balancer behavior. - Test Updates: Updated existing tests and test scripts (
all_mode/run.sh
,many_tables/run.sh
) to incorporate the new error code and adjust sleep times to accommodate the introduced retry delays.
Changelog
Click here to see the changelog
- dm/_utils/terror_gen/errors_release.txt
- Added a new error code
ErrDBExecuteFailedBegin
(code 10007) for begin statement execution failures.
- Added a new error code
- dm/errors.toml
- Defined the new error code
DM-database-10007
with the message "execute statement failed: begin".
- Defined the new error code
- dm/pkg/conn/baseconn.go
- Changed the error returned when
BeginTx
fails fromErrDBExecuteFailed
to the newErrDBExecuteFailedBegin
(line 168).
- Changed the error returned when
- dm/pkg/conn/baseconn_test.go
- Updated the test case to assert against the new
ErrDBExecuteFailedBegin
whenExpectBegin
returns an error (line 84).
- Updated the test case to assert against the new
- dm/pkg/retry/errors.go
- Imported the
strings
package (line 18). - Added
IsRetryableConnectionErrorOnBeginOrSafeMode
function to check if an error is retryable forBEGIN
statements or in safe mode (lines 105-110). - Added
IsRetryableErrorOnResetConn
function to check if an error is retryable specifically for connection reset attempts (lines 112-119).
- Imported the
- dm/pkg/retry/strategy.go
- Added logging for context cancellation before retry backoff (line 98).
- Added logging for non-retryable errors (line 104).
- dm/pkg/terror/errcode_string.go
- Added the new error code
codeDBExecuteFailedBegin
to the_
function (line 17). - Updated the
_ErrCode_name
constant string to include the name for the new error code (line 603). - Updated the
_ErrCode_map
to map the new error code to its string representation (line 612).
- Added the new error code
- dm/pkg/terror/error_list.go
- Added the
codeDBExecuteFailedBegin
constant to the error code list (line 24). - Defined the
ErrDBExecuteFailedBegin
variable with the new error code and message (line 746).
- Added the
- dm/syncer/dbconn/db.go
- Captured the number of retries and the total cost of the
ApplyRetryStrategy
call (line 173). - Added
numRetries
andcostSeconds
to the error log when execute statements fail after retry (lines 206-207). - Modified the
retryableFn
to specifically check forErrDBExecuteFailedBegin
orErrInvalidConn
(in safe mode) and added a retry loop forconn.ResetConn
within this function (lines 247-280). - Added a new function
resetConnRetryableFn
to define the retryable conditions for theResetConn
operation (lines 293-304).
- Captured the number of retries and the total cost of the
- dm/tests/all_mode/run.sh
- Increased the sleep duration after running an SQL file from 2 seconds to 45 seconds to allow time for connection retries (line 484).
- dm/tests/many_tables/run.sh
- Increased the sleep duration after incremental data load from 30 seconds to 75 seconds (line 157).
- Adjusted the expected minimum number of 'unit process error' logs from 4 to 2, reflecting the change in retry behavior (line 163).
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command>
or @gemini-code-assist <command>
. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/
folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Connection lost, a dreaded sight,
But retry logic shines so bright.
Begin again, the task resumes,
Dispelling replication glooms.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively addresses the issue of stale connections in DM causing task pauses by introducing a retry mechanism for BEGIN
statements on "Invalid Connection" errors and for the connection reset operation itself. The changes are well-targeted, and the introduction of a specific error code for BEGIN
failures (ErrDBExecuteFailedBegin
) improves error clarity. Logging has also been enhanced for better observability during retry attempts.
The core logic for retrying appears sound, and the handling of different retryable scenarios for connection resets (including connection refused
) is a good improvement. The updates to test scripts to accommodate potentially longer retry durations are noted and seem reasonable for ensuring test stability with the new resilience features.
Summary of Findings
- Logging Consistency: In
dm/syncer/dbconn/db.go
, within theretryableFn
function, when logging an error after a failed connection reset attempt (around line 273),log.ShortError(resetErr)
is used. Other parts of the PR have updated similar logging to usezap.Error(resetErr)
for structured logging. For consistency, this instance could also be updated. This was not commented on directly as it's considered low severity. - Log Level for Retry Attempts: In
dm/syncer/dbconn/db.go
, theresetConnRetryableFn
logs atError
level when retrying a connection reset. This might be too noisy for transient, recoverable errors. AWarn
level is suggested for better operational clarity (see review comment).
Merge Readiness
The pull request introduces valuable resiliency improvements. The core logic is sound. There is one medium-severity suggestion regarding logging practices in dm/syncer/dbconn/db.go
that would enhance operational clarity. Addressing this point would be beneficial before merging.
As an AI, I am not authorized to approve pull requests. Please ensure further review and approval from authorized maintainers before merging.
tctx.L().Error("retrying reset connection", zap.Int("retry", retryTime), | ||
zap.Int("resetRetries", retryTime), | ||
zap.String("queries", utils.TruncateInterface(queries, -1)), | ||
zap.String("arguments", utils.TruncateInterface(args, -1)), | ||
log.ShortError(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logging for retrying a connection reset could be improved for clarity and to reduce potential log noise:
- Log Level: Using
tctx.L().Error(...)
for a retry attempt of an operation that is deemed retryable (byIsRetryableErrorOnResetConn
) might be too strong. If these are expected transient errors that eventually recover, logging them as errors can lead to alert fatigue or misinterpretation of system health. Consider usingtctx.L().Warn(...)
instead, as warnings are generally more appropriate for recoverable issues that are being actively handled. - Redundant Log Field: The log includes
zap.Int("retry", retryTime)
andzap.Int("resetRetries", retryTime)
. Since both fields useretryTime
, one of them is redundant. It's clearer to use just one, for example,zap.Int("attempt", retryTime)
.
How about adjusting the log level and simplifying the fields for better operational clarity?
tctx.L().Warn("retrying connection reset due to retryable error", zap.Int("attempt", retryTime),
zap.String("queries", utils.TruncateInterface(queries, -1)),
zap.String("arguments", utils.TruncateInterface(args, -1)),
log.ShortError(err))
@gemini-code-assist[bot]: adding LGTM is restricted to approvers and reviewers in OWNERS files. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@lance6716 @GMHDBJD , here is the new cherry pick PR for DM invalid connection retries. Please take a look. The mac os build seems to be failing because of an outdated github actions/cache version and not related to this change. |
/cc @OliverS929 |
@lance6716: GitHub didn't allow me to request PR reviews from the following users: OliverS929. Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This is a manual cherry-pick of #12078
What problem does this PR solve?
This PR tries to address a case where, when a TiDB machine is restarted, stale connections will live on in the connection pool. When DM attempts to use these connections, it will receive an "Invalid Connection" error, which will cause the task to pause and increase replication lag while waiting to reschedule it.
Issue Number: close #11805
What is changed and how it works?
When we get an invalid connection error on a begin statement, we can safely retry after resetting the connection. The reset connection call can also fail due to similar reasons or due to a load balancer layer refusing the connection, so we can also do a retry with backoff here.
Check List
Tests
Tested on our organization's deployment of DM.
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note