Describe the bug
JACCL distributed init fails on Apple Silicon Macs (M3/M4) connected via Thunderbolt 5 RDMA with [jaccl] Changing queue pair to RTR failed with errno 22 (EINVAL). The runner crashes immediately on ConnectToGroup; the QP cannot transition to Ready-To-Receive state.
Affects every mlx version since the JACCL refactor (#3412, merged 2026-04-15, commit 4400504a). The bug is still present at the current main HEAD, e8ebdebe.
Root cause
In mlx/distributed/jaccl/lib/jaccl/rdma.cpp, Connection::info() selects the local GID for the QP destination handle. Pre-refactor, the code used query_gid(ctx, 1, 1, &gid) (always GID index 1). The refactor replaced this with a scan-and-filter loop:
    ibv_gid gid;
    for (int i = 0; i < port_attr.gid_tbl_len; i++) {
      ibv_gid tmp;
      if (ibv().query_gid(ctx, 1, i, &tmp) == 0) {
        if (*(uint64_t*)&tmp.raw[0] == 0 && *(uint16_t*)&tmp.raw[8] == 0 &&
            *(uint16_t*)&tmp.raw[10] == 0xffff) {
          gid = tmp;
          break;
        }
      }
    }
The filter accepts only IPv4-mapped IPv6 GIDs (the ::ffff:x.x.x.x format used by RoCE v2). Apple Thunderbolt RDMA exposes only link-local IPv6 GIDs (fe80::...):
$ ibv_devinfo -d rdma_en3 -v
hca_id: rdma_en3
transport: Thunderbolt (100)
...
GID[0]: fe80::3474:d9ff:fe9d:cc84
GID[1]: fe80::1042:a3d5:410e:95e9
Neither GID matches the filter (raw[0..9]==0 && raw[10..11]==0xffff), so the loop never assigns gid and leaves it uninitialized. The garbage value is sent to the peer via the side channel, the peer programs its QP with a garbage destination GID, and the kernel rejects the local QP's RTR transition with EINVAL.
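To make the mismatch concrete, here is a minimal standalone sketch of the same predicate applied to the two GIDs from the ibv_devinfo output above. `Gid`, `is_ipv4_mapped`, and the constants are hypothetical names used for illustration only (a plain byte array stands in for `ibv_gid::raw`), and the RoCE-style address is a made-up example, not taken from any real device.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the 16 raw bytes of an ibv_gid.
using Gid = std::array<std::uint8_t, 16>;

// Byte-wise equivalent of the refactored filter: bytes 0..9 must be zero
// and bytes 10..11 must both be 0xff (the ::ffff:a.b.c.d layout).
bool is_ipv4_mapped(const Gid& g) {
  const bool prefix_zero = std::all_of(
      g.begin(), g.begin() + 10, [](std::uint8_t b) { return b == 0; });
  return prefix_zero && g[10] == 0xff && g[11] == 0xff;
}

// GID[0] = fe80::3474:d9ff:fe9d:cc84 (from the report above).
const Gid kTbGid0 = {0xfe, 0x80, 0, 0, 0, 0, 0, 0,
                     0x34, 0x74, 0xd9, 0xff, 0xfe, 0x9d, 0xcc, 0x84};
// GID[1] = fe80::1042:a3d5:410e:95e9 (from the report above).
const Gid kTbGid1 = {0xfe, 0x80, 0, 0, 0, 0, 0, 0,
                     0x10, 0x42, 0xa3, 0xd5, 0x41, 0x0e, 0x95, 0xe9};
// Illustrative RoCE v2 style GID, ::ffff:192.168.1.10 (assumed example).
const Gid kRoceGid = {0, 0, 0, 0, 0, 0, 0, 0,
                      0, 0, 0xff, 0xff, 192, 168, 1, 10};

// True when no entry in a two-entry table passes the filter, i.e. the
// situation in which the original loop would leave gid unassigned.
bool filter_rejects_all(const Gid& a, const Gid& b) {
  return !is_ipv4_mapped(a) && !is_ipv4_mapped(b);
}
```

Under these assumptions, `filter_rejects_all(kTbGid0, kTbGid1)` is true because both entries start with `fe80`, while `is_ipv4_mapped(kRoceGid)` is true.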
Reproduction
- 2 Mac Studio (M3 Ultra or M4 Max) connected via Thunderbolt 5
- macOS 26.x, RDMA enabled (rdma_ctl status → enabled), Thunderbolt Bridge disabled, individual EXO Thunderbolt N services per port
- Any mlx.distributed.init(backend="jaccl") over a 2-node mesh
- Crash: ValueError: [jaccl] Changing queue pair to RTR failed with errno 22
Fix
Initialize gid to zero, prefer the IPv4-mapped GID (existing RoCE v2 behavior), and otherwise fall back to a non-zero GID, preferring index 1 (the pre-refactor behavior, which corresponds to the RDMA port's primary link-local GID on Apple TB).
PR follows.
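The selection order described above could look roughly like the following. This is a sketch under stated assumptions, not the code in the PR: `Gid`, `is_zero`, `is_ipv4_mapped`, and `select_gid` are hypothetical names, the GID table is modelled as a plain vector instead of being read through `query_gid`, and the step of trying any remaining non-zero entry after index 1 is one reading of "fall back to a non-zero GID".

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the 16 raw bytes of an ibv_gid.
using Gid = std::array<std::uint8_t, 16>;

bool is_zero(const Gid& g) {
  return std::all_of(g.begin(), g.end(),
                     [](std::uint8_t b) { return b == 0; });
}

// Bytes 0..9 zero and bytes 10..11 == 0xff, i.e. ::ffff:a.b.c.d.
bool is_ipv4_mapped(const Gid& g) {
  const bool prefix_zero = std::all_of(
      g.begin(), g.begin() + 10, [](std::uint8_t b) { return b == 0; });
  return prefix_zero && g[10] == 0xff && g[11] == 0xff;
}

// Picks a GID from an already-queried table, or nullopt if none is usable.
std::optional<Gid> select_gid(const std::vector<Gid>& table) {
  // 1. Keep the existing RoCE v2 behavior: first IPv4-mapped entry wins.
  for (const Gid& g : table) {
    if (is_ipv4_mapped(g)) return g;
  }
  // 2. Otherwise prefer index 1, matching the pre-refactor behavior.
  if (table.size() > 1 && !is_zero(table[1])) return table[1];
  // 3. Otherwise take any non-zero entry.
  for (const Gid& g : table) {
    if (!is_zero(g)) return g;
  }
  // Nothing usable: the caller should report an error rather than send
  // an all-zero or uninitialized GID to the peer.
  return std::nullopt;
}
```

Returning `std::optional` makes the "no usable GID" case explicit, so a caller can fail with a clear message before the side-channel exchange instead of hitting EINVAL at the RTR transition.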
Environment
- macOS 26.4.1 (build 25E253)
- Apple M4 Max, 2 nodes
- Thunderbolt 5 mesh, rdma_en3 PORT_ACTIVE, no Thunderbolt Bridge
- mlx commit cc3f3e60 (rltakashige fork merged from upstream e8ebdebe)
Related