[BUG] JACCL "Changing queue pair to RTR failed with errno 22" on Apple Thunderbolt RDMA — GID selection regression in #3412 #3467

@danielkristofik

Describe the bug

JACCL distributed init fails on Apple Silicon Macs (M3/M4) connected via Thunderbolt 5 RDMA with the error "[jaccl] Changing queue pair to RTR failed with errno 22" (EINVAL). The runner crashes immediately on ConnectToGroup because the QP cannot transition to the Ready-To-Receive (RTR) state.

This affects every mlx version since the JACCL refactor (#3412, merged 2026-04-15, commit 4400504a) and is still present at current main HEAD e8ebdebe.

Root cause

In mlx/distributed/jaccl/lib/jaccl/rdma.cpp, Connection::info() selects the local GID for the QP destination handle. Pre-refactor, the code used query_gid(ctx, 1, 1, &gid) (always GID index 1). The refactor replaced this with a scan-and-filter loop:

ibv_gid gid;
for (int i = 0; i < port_attr.gid_tbl_len; i++) {
  ibv_gid tmp;
  if (ibv().query_gid(ctx, 1, i, &tmp) == 0) {
    if (*(uint64_t*)&tmp.raw[0] == 0 && *(uint16_t*)&tmp.raw[8] == 0 &&
        *(uint16_t*)&tmp.raw[10] == 0xffff) {
      gid = tmp;
      break;
    }
  }
}

The filter accepts only IPv4-mapped IPv6 GIDs (the ::ffff:x.x.x.x format used by RoCE v2). Apple Thunderbolt RDMA exposes only link-local IPv6 GIDs (fe80::...):

$ ibv_devinfo -d rdma_en3 -v
hca_id: rdma_en3
  transport: Thunderbolt (100)
  ...
  GID[0]: fe80::3474:d9ff:fe9d:cc84
  GID[1]: fe80::1042:a3d5:410e:95e9

Neither GID satisfies the filter (raw[0..9] == 0 and raw[10..11] == 0xffff), so the loop never assigns gid and it stays uninitialized. That garbage value is sent to the peer via the side channel, the peer programs its QP with a garbage destination GID, and the kernel rejects the local QP's RTR transition with EINVAL.
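To make the mismatch concrete, here is a minimal sketch that applies the same byte test to the two GIDs listed above. It deliberately avoids libibverbs: `Gid`, `parse_gid`, and `is_ipv4_mapped` are hypothetical helper names introduced only for illustration, and the sketch assumes a POSIX system for `inet_pton`.

```cpp
#include <arpa/inet.h>   // inet_pton (POSIX)
#include <netinet/in.h>  // in6_addr
#include <sys/socket.h>  // AF_INET6

#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for ibv_gid::raw: 16 bytes in network order.
using Gid = std::array<std::uint8_t, 16>;

// Parse an IPv6 text address into a 16-byte GID.
Gid parse_gid(const std::string& text) {
  in6_addr addr;
  if (inet_pton(AF_INET6, text.c_str(), &addr) != 1) {
    throw std::invalid_argument("not an IPv6 address: " + text);
  }
  Gid gid;
  std::memcpy(gid.data(), addr.s6_addr, gid.size());
  return gid;
}

// Same condition as the refactored filter, written byte by byte:
// bytes 0..9 are zero and bytes 10..11 are 0xff (the ::ffff:a.b.c.d layout).
bool is_ipv4_mapped(const Gid& gid) {
  static const std::uint8_t kPrefix[12] = {0, 0, 0, 0, 0,    0,
                                           0, 0, 0, 0, 0xff, 0xff};
  return std::equal(kPrefix, kPrefix + 12, gid.begin());
}
```

With these helpers, both `fe80::3474:d9ff:fe9d:cc84` and `fe80::1042:a3d5:410e:95e9` fail the test, while an address such as `::ffff:192.168.1.10` passes it, which is why the scan finds nothing to assign on a Thunderbolt RDMA device.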

Reproduction

  • Two Mac Studios (M3 Ultra or M4 Max) connected via Thunderbolt 5
  • macOS 26.x, RDMA enabled (rdma_ctl status reports enabled), Thunderbolt Bridge disabled, individual EXO Thunderbolt N services per port
  • Any mlx.distributed.init(backend="jaccl") over a 2-node mesh
  • Crash: ValueError: [jaccl] Changing queue pair to RTR failed with errno 22

Fix

Initialize gid to zero, prefer the IPv4-mapped GID (the existing RoCE v2 behavior), and otherwise fall back to a non-zero GID, preferring index 1 (the pre-refactor behavior, which on Apple Thunderbolt corresponds to the RDMA port's primary link-local GID).
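The selection order described here could look roughly like the sketch below. It is not the code from the PR: it operates on a plain vector of 16-byte GIDs instead of calling `query_gid`, and `Gid`, `is_zero`, `is_ipv4_mapped`, and `select_gid_index` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ibv_gid::raw: 16 bytes in network order.
using Gid = std::array<std::uint8_t, 16>;

bool is_zero(const Gid& gid) {
  return std::all_of(gid.begin(), gid.end(),
                     [](std::uint8_t b) { return b == 0; });
}

// Bytes 0..9 zero and bytes 10..11 0xff: the ::ffff:a.b.c.d layout.
bool is_ipv4_mapped(const Gid& gid) {
  static const std::uint8_t kPrefix[12] = {0, 0, 0, 0, 0,    0,
                                           0, 0, 0, 0, 0xff, 0xff};
  return std::equal(kPrefix, kPrefix + 12, gid.begin());
}

// Returns the index of the GID to advertise, or -1 if the table holds
// no usable (non-zero) entry.
int select_gid_index(const std::vector<Gid>& table) {
  // 1. Prefer the first IPv4-mapped GID (existing RoCE v2 behavior).
  for (std::size_t i = 0; i < table.size(); ++i) {
    if (is_ipv4_mapped(table[i])) return static_cast<int>(i);
  }
  // 2. Otherwise prefer index 1 if it is populated (pre-refactor behavior).
  if (table.size() > 1 && !is_zero(table[1])) return 1;
  // 3. Otherwise take the first non-zero GID.
  for (std::size_t i = 0; i < table.size(); ++i) {
    if (!is_zero(table[i])) return static_cast<int>(i);
  }
  return -1;
}
```

Returning an explicit "not found" value lets the caller fail with a clear error instead of sending an uninitialized GID to the peer.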

PR follows.

Environment

  • macOS 26.4.1 (build 25E253)
  • Apple M4 Max, 2 nodes
  • Thunderbolt 5 mesh, rdma_en3 PORT_ACTIVE, no Thunderbolt Bridge
  • mlx commit cc3f3e60 (rltakashige fork merged from upstream e8ebdebe)
