fix: improved socket error logging for connection diagnostics #5062

Open: wants to merge 8 commits into base: main

Contributor

@vyavdoshenko vyavdoshenko commented May 5, 2025

Fixes: #5041

This PR improves error diagnostics for failed replica connections. It adds a utility function that extracts detailed TCP socket information from /proc/net/tcp and /proc/net/tcp6 on Linux systems and formats it for logging.
The function converts the raw socket data into a human-readable format with TCP state names and formatted IP addresses. It is integrated with the Replica, Connection, and Migration components to provide better context when connections fail.
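
For illustration, here is a minimal sketch of how one IPv4 row of /proc/net/tcp might be turned into the "State: ..., Local: ..., Remote: ..., Inode: ..." form seen in the logs further down. All function names here (TcpStateName, FormatV4Endpoint, DescribeTcpLine) are hypothetical and not taken from the PR; the sketch covers IPv4 rows only and assumes the usual column order (sl, local_address, rem_address, st, ..., inode).

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>

#include <cstdint>
#include <cstdlib>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical names throughout; the PR's actual helpers may differ.

// Maps the hex "st" column of /proc/net/tcp to a TCP state name.
inline const char* TcpStateName(unsigned st) {
  static const char* const kNames[] = {
      "UNKNOWN",   "ESTABLISHED", "SYN_SENT",   "SYN_RECV", "FIN_WAIT1", "FIN_WAIT2",
      "TIME_WAIT", "CLOSE",       "CLOSE_WAIT", "LAST_ACK", "LISTEN",    "CLOSING"};
  return st < sizeof(kNames) / sizeof(kNames[0]) ? kNames[st] : "UNKNOWN";
}

// Converts an IPv4 "ADDR:PORT" hex pair to dotted form, e.g. on a
// little-endian host "0100007F:0277" becomes "127.0.0.1:631".
inline std::string FormatV4Endpoint(const std::string& hex) {
  const auto colon = hex.find(':');
  if (colon == std::string::npos) return "?";
  in_addr addr{};
  // The kernel prints the address as a native integer that holds
  // network-order bytes, so storing the parsed value back into s_addr
  // restores the original byte layout.
  addr.s_addr = static_cast<uint32_t>(std::strtoul(hex.c_str(), nullptr, 16));
  const unsigned long port = std::strtoul(hex.c_str() + colon + 1, nullptr, 16);
  char buf[INET_ADDRSTRLEN] = {0};
  if (!inet_ntop(AF_INET, &addr, buf, sizeof(buf))) return "?";
  return std::string(buf) + ":" + std::to_string(port);
}

// Returns a description of `line` if its inode column equals `inode`,
// otherwise std::nullopt (this also rejects the header row and short rows).
inline std::optional<std::string> DescribeTcpLine(const std::string& line,
                                                  unsigned long inode) {
  std::istringstream in(line);
  std::string sl, local, remote, st, queues, timer, retrnsmt, uid, timeout, ino;
  if (!(in >> sl >> local >> remote >> st >> queues >> timer >> retrnsmt >> uid >>
        timeout >> ino))
    return std::nullopt;
  if (std::strtoul(ino.c_str(), nullptr, 10) != inode) return std::nullopt;
  const unsigned state = static_cast<unsigned>(std::strtoul(st.c_str(), nullptr, 16));
  return "State: " + std::string(TcpStateName(state)) +
         ", Local: " + FormatV4Endpoint(local) +
         ", Remote: " + FormatV4Endpoint(remote) + ", Inode: " + ino;
}
```

A caller would read /proc/net/tcp line by line and stop at the first row for which DescribeTcpLine returns a value. Rows in /proc/net/tcp6 carry 128-bit addresses and would need a separate formatter.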

@vyavdoshenko vyavdoshenko force-pushed the bobik/better_logging_around_replica branch from 28f8e9b to 3279654 May 6, 2025 16:13
@vyavdoshenko vyavdoshenko requested a review from romange May 6, 2025 16:19
@vyavdoshenko vyavdoshenko force-pushed the bobik/better_logging_around_replica branch 3 times, most recently from 1a6f6c2 to 4385fd8 May 7, 2025 10:42
@vyavdoshenko vyavdoshenko force-pushed the bobik/better_logging_around_replica branch from 4385fd8 to 9f7ee25 May 8, 2025 07:31
Collaborator

@romange romange left a comment

Have you tried killing master/replica in the middle of replication and seeing what is printed?


auto tcp_info = io::ReadTcpInfo(sock_stat.st_ino);
if (!tcp_info) {
  auto tcp6_info = io::ReadTcp6Info(sock_stat.st_ino);
Collaborator

Before trying tcp6, you can get the socket family:

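#include <sys/socket.h>

// Returns the socket's address family (e.g. AF_INET or AF_INET6), or -1 on error.
int get_socket_family(int fd) {
    struct sockaddr_storage ss;
    socklen_t len = sizeof(ss);

    // getsockname fills ss with the local address bound to fd.
    if (getsockname(fd, (struct sockaddr *)&ss, &len) == -1) {
        return -1; // Indicate an error
    }

    return ss.ss_family;
}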

Contributor Author

fixed


// Add additional information about the reason for cancellation, if possible
// Possible to extract from system errors or context
std::error_code sys_err = std::error_code(errno, std::system_category());
Collaborator

Have you actually seen that this code helps?
We do not rely on errno, because io_uring does not use it.
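
For context, io_uring reports a failed operation as a negative errno value in the completion result rather than by setting the thread-local errno. A minimal sketch of building a std::error_code from such a result follows; ErrorFromResult is a hypothetical helper name, and how the project actually propagates these codes is not shown in this thread.

```cpp
#include <cerrno>
#include <system_error>

// Hypothetical helper: `res` is the signed result of a completed operation,
// with failures encoded as -errno (the convention io_uring uses for cqe->res).
inline std::error_code ErrorFromResult(int res) {
  if (res >= 0) return {};  // success: the value is a byte count or similar
  return std::error_code(-res, std::system_category());
}
```

With this approach the error travels with the completion itself, so it cannot be overwritten by an unrelated call that happens to set errno in between.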

Contributor Author

removed

@vyavdoshenko
Contributor Author

> Have you tried killing master/replica in the middle of replication and seeing what is printed?

Restarting master:

I20250508 15:07:13.919483 14049 replica.cc:705] Transitioned into stable sync
W20250508 15:08:34.295888 14056 common.cc:413] ReportError: Software caused connection abort
E20250508 15:08:34.296075 14056 replica.cc:693] DflyStream error in phase STABLE_SYNC with 192.168.100.3:6379, error: Software caused connection abort, socket state: State: CLOSE_WAIT, Local: 192.168.100.2:55124, Remote: 192.168.100.3:6379, Inode: 233203
W20250508 15:08:34.317605 14056 protocol_client.cc:238] Socket error: Software caused connection abort in 192.168.100.3:6379, socket info: State: CLOSE_WAIT, Local: 192.168.100.2:55124, Remote: 192.168.100.3:6379, Inode: 233203
I20250508 15:08:34.335923 14049 replica.cc:729] Exit stable sync
W20250508 15:08:34.335968 14049 replica.cc:271] Error stable sync with 192.168.100.3:6379 (phase: TCP_CONNECTING): system:103 Software caused connection abort, socket state: socket not found in /proc/net/tcp or /proc/net/tcp6
E20250508 15:08:34.844139 14049 protocol_client.cc:194] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
W20250508 15:08:34.844228 14049 replica.cc:219] Error connecting to 192.168.100.3:6379 (phase: TCP_CONNECTING): system:111, reason: Connection refused
E20250508 15:08:35.345132 14049 protocol_client.cc:194] Error while calling sock_->Connect(server_context_.endpoint): Connection refused
W20250508 15:08:35.345216 14049 replica.cc:219] Error connecting to 192.168.100.3:6379 (phase: TCP_CONNECTING): system:111, reason: Connection refused
I20250508 15:08:35.854527 14049 replica.cc:587] Started full sync with 192.168.100.3:6379
I20250508 15:08:35.855365 14049 replica.cc:607] full sync finished in 8 ms
I20250508 15:08:35.855481 14049 replica.cc:705] Transitioned into stable sync

Restarting replica:

I20250508 15:08:35.846104  1447 dflycmd.cc:696] Registered replica 192.168.100.2:6379
I20250508 15:08:35.854362  1447 dflycmd.cc:393] Started sync with replica 192.168.100.2:6379
I20250508 15:08:35.855278  1447 dflycmd.cc:433] Transitioned into stable sync with replica 192.168.100.2:6379
I20250508 15:09:22.736678  1447 dflycmd.cc:127] Disconnecting from replica 192.168.100.2:6379
W20250508 15:09:22.736735  1447 common.cc:413] ReportError: Operation canceled: ExecutionState cancelled
I20250508 15:09:22.737004  1447 dflycmd.cc:686] Replication error: Operation canceled: ExecutionState cancelled
I20250508 15:09:35.609882  1450 dflycmd.cc:696] Registered replica 192.168.100.2:6379
I20250508 15:09:35.634145  1450 dflycmd.cc:393] Started sync with replica 192.168.100.2:6379
I20250508 15:09:36.694377  1450 dflycmd.cc:433] Transitioned into stable sync with replica 192.168.100.2:6379

@vyavdoshenko vyavdoshenko requested a review from romange May 8, 2025 12:18
Successfully merging this pull request may close these issues.

better logging around replica socket errors
2 participants