Repeated send/recv gets stuck #139

@Mellich


Repeated calls of send/recv of the following form get stuck after several iterations on two ranks:

  for (int i = 0; i < 2000; i++) {
    accl.send(*op_buf, count, next_rank, 0, GLOBAL_COMM, true);
    accl.recv(*res_buf, count, prev_rank, 0, GLOBAL_COMM, true);
  }

The behavior seems to be non-deterministic and may only appear with large message sizes and high numbers of repetitions.
It happens more frequently for FPGA-to-FPGA communication, as shown in the example, but I have also observed it for CPU-to-CPU communication via ACCL. It occurs with both the UDP and the TCP stack.

Setting a sufficiently long sleep between the iterations seems to increase stability.

I modified the XRT tests to show the described behavior here: https://github.com/Mellich/ACCL/blob/f805e8f87a91878228173668553ce25f9b9eaa31/test/host/xrt/test.cpp#L347

Using the branch above, the test gets stuck reliably for me when executed with the following command:

mpirun -n 2 --tag-output  test -f -x bitstream.xclbin -s 262144 -u -b 1024 -d -n 2000

Example dump of CMAC and network layer status of the UDP version after the execution got stuck:

Message size: 1MB

Output of test:

Start send recv test...
Start send recv test...
Sending data on 0 to 1...
Receiving data on 0 from 1...
Sending data on 1 to 0...
Sending data on 0 to 1...
Receiving data on 1 from 0...
Receiving data on 0 from 1...
Sending data on 1 to 0...
Receiving data on 1 from 0...
Sending data on 0 to 1...
Receiving data on 0 from 1...

Rank 0:

{'tx': {'packets': 4610,
  'good_packets': 4610,
  'bytes': 6512924,
  'good_bytes': 6512924,
  'packets_64B': 512,
  'packets_65_127B': 0,
  'packets_128_255B': 0,
  'packets_256_511B': 0,
  'packets_512_1023B': 3,
  'packets_1024_1518B': 0,
  'packets_1519_1522B': 0,
  'packets_1523_1548B': 0,
  'packets_1549_2047B': 4095,
  'packets_2048_4095B': 0,
  'packets_4096_8191B': 0,
  'packets_8192_9215B': 0,
  'packets_large': 0,
  'packets_small': 0,
  'bad_fcs': 0,
  'pause': 0,
  'user_pause': 0},
 'rx': {'packets': 3244,
  'good_packets': 3244,
  'bytes': 4352872,
  'good_bytes': 4352872,
  'packets_64B': 512,
  'packets_65_127B': 0,
  'packets_128_255B': 0,
  'packets_256_511B': 0,
  'packets_512_1023B': 2,
  'packets_1024_1518B': 0,
  'packets_1519_1522B': 0,
  'packets_1523_1548B': 0,
  'packets_1549_2047B': 2730,
  'packets_2048_4095B': 0,
  'packets_4096_8191B': 0,
  'packets_8192_9215B': 0,
  'packets_large': 0,
  'packets_small': 0,
  'packets_undersize': 0,
  'packets_fragmented': 0,
  'packets_oversize': 0,
  'packets_toolong': 0,
  'packets_jabber': 0,
  'bad_fcs': 0,
  'packets_bad_fcs': 0,
  'stomped_fcs': 0,
  'pause': 0,
  'user_pause': 0},
 'cycle_count': 21535156494}

{'tx_path': {'arp': {'packets': 512, 'bytes': 30720, 'cycles': 240149},
  'icmp': {'packets': 0, 'bytes': 0, 'cycles': 0},
  'ethernet_header_inserter': {'packets': 4098,
   'bytes': 6463764,
   'cycles': 520519},
  'ethernet': {'packets': 4610, 'bytes': 6494484, 'cycles': 57800444},
  'udp': {'packets': 4098, 'bytes': 6406392, 'cycles': 520513},
  'app': {'packets': 4098, 'bytes': 6291648, 'cycles': 520224}},
 'rx_path': {'ethernet': {'packets': 3244,
   'bytes': 4339896,
   'cycles': 57683890},
  'packet_handler': {'packets': 3244, 'bytes': 4301648, 'cycles': 57683891},
  'arp': {'packets': 512, 'bytes': 30720, 'cycles': 147925},
  'icmp': {'packets': 0, 'bytes': 0, 'cycles': 0},
  'udp': {'packets': 2732, 'bytes': 4270928, 'cycles': 280258},
  'app': {'packets': 2732, 'bytes': 4194432, 'cycles': 280255}}}

Rank 1:

{'tx': {'packets': 3244,
  'good_packets': 3244,
  'bytes': 4352872,
  'good_bytes': 4352872,
  'packets_64B': 512,
  'packets_65_127B': 0,
  'packets_128_255B': 0,
  'packets_256_511B': 0,
  'packets_512_1023B': 2,
  'packets_1024_1518B': 0,
  'packets_1519_1522B': 0,
  'packets_1523_1548B': 0,
  'packets_1549_2047B': 2730,
  'packets_2048_4095B': 0,
  'packets_4096_8191B': 0,
  'packets_8192_9215B': 0,
  'packets_large': 0,
  'packets_small': 0,
  'bad_fcs': 0,
  'pause': 0,
  'user_pause': 0},
 'rx': {'packets': 4610,
  'good_packets': 4610,
  'bytes': 6512924,
  'good_bytes': 6512924,
  'packets_64B': 512,
  'packets_65_127B': 0,
  'packets_128_255B': 0,
  'packets_256_511B': 0,
  'packets_512_1023B': 3,
  'packets_1024_1518B': 0,
  'packets_1519_1522B': 0,
  'packets_1523_1548B': 0,
  'packets_1549_2047B': 4095,
  'packets_2048_4095B': 0,
  'packets_4096_8191B': 0,
  'packets_8192_9215B': 0,
  'packets_large': 0,
  'packets_small': 0,
  'packets_undersize': 0,
  'packets_fragmented': 0,
  'packets_oversize': 0,
  'packets_toolong': 0,
  'packets_jabber': 0,
  'bad_fcs': 0,
  'packets_bad_fcs': 0,
  'stomped_fcs': 0,
  'pause': 0,
  'user_pause': 0},
 'cycle_count': 48459584151}

{'tx_path': {'arp': {'packets': 512, 'bytes': 30720, 'cycles': 147923},
  'icmp': {'packets': 0, 'bytes': 0, 'cycles': 0},
  'ethernet_header_inserter': {'packets': 2732,
   'bytes': 4309176,
   'cycles': 279613},
  'ethernet': {'packets': 3244, 'bytes': 4339896, 'cycles': 57683218},
  'udp': {'packets': 2732, 'bytes': 4270928, 'cycles': 279606},
  'app': {'packets': 2732, 'bytes': 4194432, 'cycles': 279407}},
 'rx_path': {'ethernet': {'packets': 2138,
   'bytes': 2596932,
   'cycles': 57538733},
  'packet_handler': {'packets': 2138, 'bytes': 2572824, 'cycles': 57538728},
  'arp': {'packets': 512, 'bytes': 30720, 'cycles': 240150},
  'icmp': {'packets': 0, 'bytes': 0, 'cycles': 0},
  'udp': {'packets': 1626, 'bytes': 2542104, 'cycles': 258776},
  'app': {'packets': 1626, 'bytes': 2496576, 'cycles': 258771}}}

All sent packets are also received by the network layer of the other rank, so no data seems to get lost over the link. However, there is a discrepancy between the number of sent and received packets on both ranks. Shouldn't the packet counts be the same in this scenario?

The recv should block the subsequent send, so rx and tx should stay in balance.
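
To make the imbalance concrete, here is a minimal Python sketch that compares the packet counters pairwise for each direction of the link. The numbers are copied from the dumps above; the dictionary layout, the short key names (`tx_eth`, `rx_app`, and so on) and the helper `link_deltas` are only illustrative and not part of ACCL or the dump tooling.

```python
# Packet counters copied from the dumps above (UDP run, after the hang).
# CMAC-level counters per rank.
cmac = {
    0: {"tx": 4610, "rx": 3244},
    1: {"tx": 3244, "rx": 4610},
}
# Network-layer counters per rank ('ethernet' and 'app' stages only).
net = {
    0: {"tx_eth": 4610, "tx_app": 4098, "rx_eth": 3244, "rx_app": 2732},
    1: {"tx_eth": 3244, "tx_app": 2732, "rx_eth": 2138, "rx_app": 1626},
}


def link_deltas(cmac, net):
    """Return sender-minus-receiver packet differences for each direction."""
    deltas = {}
    for src, dst in ((0, 1), (1, 0)):
        deltas[f"rank{src}->rank{dst}"] = {
            "cmac": cmac[src]["tx"] - cmac[dst]["rx"],
            "net_eth": net[src]["tx_eth"] - net[dst]["rx_eth"],
            "net_app": net[src]["tx_app"] - net[dst]["rx_app"],
        }
    return deltas


deltas = link_deltas(cmac, net)
for direction, d in deltas.items():
    print(direction, d)
```

With these numbers, the CMAC counters match in both directions and the rank 1 to rank 0 direction is balanced at every stage. In the rank 0 to rank 1 direction, however, the network-layer rx counters on rank 1 are 2472 packets lower than the corresponding tx counters on rank 0, at both the 'ethernet' and the 'app' stage.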


Labels: bug
