Description
In high-concurrency scenarios where the connection pool is exhausted, the go-redis library currently uses the incoming request's context (reqCtx) for the entire connection creation process (dialing, TLS handshake, AUTH, etc.) in ConnPool.Get. This creates a critical issue that is at odds with established best practices implemented in the Go standard library.
If the request times out while the connection is being established, the dialing operation is canceled, often when the connection is only milliseconds away from completion. This leads to several detrimental effects, which are exacerbated by common conditions.
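To make the failure mode concrete, here is a minimal sketch (hypothetical code, not the actual go-redis source) of what dialing under the request's context amounts to:

```go
import (
	"context"
	"net"
)

// dialWithRequestContext illustrates the current pattern: the request context,
// already shortened by upstream deadlines and by the wait for a pool slot,
// governs the entire dial (and, in the real client, the TLS handshake and AUTH).
func dialWithRequestContext(reqCtx context.Context, addr string) (net.Conn, error) {
	var d net.Dialer
	// If reqCtx expires mid-handshake, the dial is aborted even though the
	// connection might have completed a few milliseconds later.
	return d.DialContext(reqCtx, "tcp", addr)
}
```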
Three key factors compound this problem:
- PoolTimeout Compression: The configured PoolTimeout period is consumed while waiting for an available connection. By the time a new connection creation is initiated, a significant portion of the request's original timeout budget may already be exhausted, leaving an insufficient window for the dial to complete successfully. This makes connection establishment far more likely to fail (a sketch after the outcomes list below quantifies this compression).
- Asymmetric Timeout Configurations: For latency-sensitive services, the timeout on the request context (dictated by upstream service-level objectives) is often significantly shorter than a reasonable DialTimeout (which must account for network latency and the TCP/TLS handshakes). This mismatch means the request is likely to time out well before the underlying dial has had a fair chance to succeed.
- Propagated Context Timeout Decay: In distributed systems, the timeout on a request's context is typically set by the initial upstream client and decays as it propagates through the service chain. By the time it reaches a downstream dependency like Redis, the remaining time is frequently very short, often just milliseconds. This leaves virtually no time to establish a new connection from scratch, making failures almost certain under pool contention.
These factors lead directly to the following negative outcomes:
- Unnecessary connection establishment failures: Connection attempts that would have succeeded are aborted by short client-side timeouts.
- Connection pool starvation: A storm of timed-out requests can prevent the pool from ever being refilled, causing a cascade of failures for subsequent requests.
- Diagnostic ambiguity and operational overhead: It becomes extremely difficult to distinguish between genuine server-side/network connectivity issues and client-side timeout pressure. Observing a high rate of connection timeouts could indicate a problem with the Redis server, or it could simply be a symptom of the client's own aggressive timeouts cancelling viable connections. This ambiguity makes alerting, debugging, and root cause analysis significantly more complex and time-consuming.
- Resource waste: Server-side resources (TCP slots, CPU for TLS handshakes) are used without any benefit.
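A small, self-contained example with made-up numbers (a 100ms request deadline, 80ms of it spent waiting for a pool slot) shows how little budget is left for the dial by the time connection creation starts:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Hypothetical request deadline set by an upstream caller.
	reqCtx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	// Stand-in for the time spent waiting on PoolTimeout for a free slot.
	time.Sleep(80 * time.Millisecond)

	// Roughly 20ms remain for TCP connect + TLS handshake + AUTH, even though a
	// realistic DialTimeout would be on the order of seconds.
	deadline, _ := reqCtx.Deadline()
	fmt.Printf("budget left for the dial: %v\n", time.Until(deadline))
}
```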
Prior Art in Go Stdlib and the Robust Solution
This is a well-understood problem in connection pool design. The Go standard library's database/sql and net/http packages have already implemented robust solutions that prevent request timeouts from destabilizing the connection pool.
1. database/sql.DB
The sql.DB pool decouples connection creation from request lifecycles. It uses a long-lived, dedicated goroutine (connectionOpener) to serialize connection creation. Crucially, it uses a background context for the connection establishment itself.
- Source Code References:
  - The connectionOpener goroutine, launched in sql.OpenDB
  - The openNewConnection method, which uses the opener's context: (*DB).openNewConnection
  - The mysql driver's useConnect method, which creates a new context under the opener's context using connectionTimeout
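For reference, the shape of that design can be paraphrased as follows (a simplified sketch, not the stdlib source; the names are illustrative):

```go
import (
	"context"
	"net"
)

// opener owns connection creation for the pool. Its context is tied to the
// pool's lifetime, so a canceled request cannot abort an in-flight open.
type opener struct {
	openerCtx context.Context // canceled only when the pool itself is closed
	requests  chan struct{}   // each send means "please open one more connection"
}

func (o *opener) run(connect func(context.Context) (net.Conn, error), put func(net.Conn, error)) {
	for range o.requests {
		// The dial uses the opener's long-lived context rather than the
		// requesting caller's context; the result is handed back to the pool.
		cn, err := connect(o.openerCtx)
		put(cn, err)
	}
}
```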
2. net/http.Transport
The implementation in http.Transport takes much the same approach. A dialing operation has its own timeout (net.Dialer.Timeout), but it also listens for cancellation from the request context to avoid wasted work.
- Source Code References:
  - The getConn method creating a new dial context: (*Transport).getConn
  - The canonical example of context propagation: the select statement with <-ctx.Done() and dialCancel() inside getConn
  - The dialConn method being invoked with a context that is detached from the request context's cancellation signal: (*Transport).dialConnFor
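The waiting side of that pattern can be paraphrased like this (a condensed sketch, not the actual net/http source):

```go
import (
	"context"
	"net"
)

type dialResult struct {
	cn  net.Conn
	err error
}

// waitForDial mirrors the idea in (*Transport).getConn: the caller waits on
// either its own request context or the dial result, while the dial itself
// runs in a separate goroutine under its own timeout.
func waitForDial(reqCtx context.Context, result <-chan dialResult) (net.Conn, error) {
	select {
	case <-reqCtx.Done():
		// The request gave up; the in-flight dial is not torn down here. When it
		// completes, its connection can be handed to the pool for other callers
		// (or the dial can be canceled if nobody wants it).
		return nil, reqCtx.Err()
	case r := <-result:
		return r.cn, r.err
	}
}
```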
Expected Behavior
The library should adopt the same context strategy used by the standard library. The creation of a new connection, which is a shared resource, should be resilient to the cancellation of the individual request that triggered it.
The expected behavior is:
- Connection creation should primarily be governed by an independent, reasonable timeout (e.g., the configured DialTimeout).
- It should also listen for cancellation from the original request's context so resources can be released promptly when no caller is waiting.
- Most importantly, if connection establishment succeeds after the original request has been canceled, that healthy connection should still be placed into the pool for use by other requests.
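Concretely, the dial context could be derived roughly like this (a minimal sketch; the function name is illustrative, and context.WithoutCancel requires Go 1.21+):

```go
import (
	"context"
	"time"
)

// buildDialContext keeps the request's values but drops its cancellation
// signal, and bounds the dial by DialTimeout instead of whatever is left of
// the request's own deadline.
func buildDialContext(reqCtx context.Context, dialTimeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(reqCtx), dialTimeout)
}
```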
Willingness to Contribute
The solution involves modifying the connection establishment logic (in internal/pool) to implement this pattern. The following sketch imitates the standard library's approach:
```go
// dialsInProgress is assumed to be a new buffered channel field on ConnPool
// that limits the number of concurrent dials.

type wantConn struct {
	mu        sync.Mutex          // protects ctx, done and sending of the result
	ctx       context.Context     // context for dial, cleared after delivered or canceled
	cancelCtx context.CancelFunc
	done      bool                // true after delivered or canceled
	result    chan wantConnResult // channel to deliver connection or error
}

// tryDeliver hands the dial result to the waiting caller. It reports false if
// the caller has already given up, in which case nothing is sent.
func (w *wantConn) tryDeliver(cn *Conn, err error) bool {
	w.mu.Lock()
	defer w.mu.Unlock()

	if w.done {
		return false
	}
	w.done = true
	w.ctx = nil
	w.result <- wantConnResult{cn: cn, err: err}
	close(w.result)
	return true
}

// cancel marks the waiter as done. If a connection was already delivered but
// never consumed, it is returned to the pool rather than leaked.
func (w *wantConn) cancel(ctx context.Context, p *ConnPool) {
	w.mu.Lock()
	var cn *Conn
	if w.done {
		select {
		case result := <-w.result:
			cn = result.cn
		default:
		}
	} else {
		close(w.result)
	}
	w.done = true
	w.ctx = nil
	w.mu.Unlock()

	if cn != nil {
		p.Put(ctx, cn)
	}
}

type wantConnResult struct {
	cn  *Conn
	err error
}

func (p *ConnPool) asyncNewConn(ctx context.Context) (*Conn, error) {
	// First try to acquire permission to create a connection.
	select {
	case p.dialsInProgress <- struct{}{}:
		// Got permission, proceed to create the connection.
	case <-ctx.Done():
		p.freeTurn()
		return nil, ctx.Err()
	}

	// The dial is bounded by DialTimeout and keeps the request's values, but it
	// is detached from the request's cancellation signal.
	dialCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), p.cfg.DialTimeout)
	w := &wantConn{
		ctx:       dialCtx,
		cancelCtx: cancel,
		result:    make(chan wantConnResult, 1),
	}

	var err error
	defer func() {
		if err != nil {
			w.cancel(ctx, p)
		}
	}()

	go func(w *wantConn) {
		defer w.cancelCtx()
		defer func() { <-p.dialsInProgress }() // release connection creation permission

		// Use dialCtx directly: w.ctx may be cleared concurrently by
		// tryDeliver/cancel once the caller stops waiting.
		cn, cnErr := p.newConn(dialCtx, true)
		delivered := w.tryDeliver(cn, cnErr)
		if cnErr == nil && delivered {
			return
		} else if cnErr == nil && !delivered {
			// The requester gave up; keep the healthy connection for other callers.
			p.Put(dialCtx, cn)
		} else {
			// The dial failed: free the turn held by the requester.
			p.freeTurn()
		}
	}(w)

	select {
	case <-ctx.Done():
		err = ctx.Err()
		return nil, err
	case result := <-w.result:
		err = result.err
		return result.cn, err
	}
}
```
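As a rough idea of the integration (entirely hypothetical; ConnPool.Get's real structure differs, and popIdleConn below is a stand-in rather than an existing helper):

```go
// getConnForRequest stands in for the branch of ConnPool.Get that runs after a
// turn is acquired: reuse an idle connection if one is available, otherwise
// create a new one via the cancellation-resilient path instead of
// p.newConn(ctx, true).
func (p *ConnPool) getConnForRequest(ctx context.Context) (*Conn, error) {
	if cn := p.popIdleConn(); cn != nil { // popIdleConn is a hypothetical stand-in
		return cn, nil
	}
	return p.asyncNewConn(ctx)
}
```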
I am willing to work on a Pull Request to implement this change and would appreciate guidance from the maintainers on the preferred approach.