Example is getting caught in a loop #27

Open · jcalfee opened this issue Aug 23, 2024 · 2 comments

jcalfee commented Aug 23, 2024

Have you come across this situation?

It hangs when adding a 4th node: bin/zr-config.js -c config/example.hjson -a tcp://127.0.0.1:8347/4

[screenshot: terminal output of the hanging add command]

The info and www interfaces usually report that the 4th node was added. But, more often than not, when I re-try and remove or add the 4th node again, one of the other 3 running nodes gets caught in a loop:

[screenshots: the looping log output]

That starts the long-running loop you see above (presumably forever).

I can press ctrl+c to stop the looping node, and the issue moves to the next Leader. If I bring the cluster down to 2 nodes the looping stops; when I bring the 3rd node back up, the looping continues in whichever node is the new Leader. It does not matter which node I start and stop, the loop just continues in the new Leader.

The lastIndex is 400 (the same number of messages created in the test data before I issued .stop). The queue array has a single item: a buffer of size 400.

[screenshot: debugger showing the queue state]
...
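For reference, here is a log line I could drop in to watch that state without the debugger (the logging is mine, not zmq-raft code; I'm assuming lastIndex, queue and nextToSendIdx are all in scope there, as in the snippets below):

// Hypothetical debug helper (mine): dump the state described above.
// Assumes lastIndex, queue and nextToSendIdx are in scope at that point.
const dumpState = (label) => {
  console.log('%s: lastIndex=%d queue.length=%d firstItem=%d nextToSendIdx=%d',
    label, lastIndex, queue.length,
    queue[0] ? queue[0].length : -1, nextToSendIdx);
};
// e.g. dumpState('ondata') at the top of ondata, dumpState('cancel') inside cancel.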

It is clear that it is calling cancel and removing the ondata listener:

const cancel = () => {
  entriesRequests.delete(requestKey);
  if (stream) {
    stream.cancel();
    stream.removeListener('data', ondata);
    stream.removeListener('end', onend);
    stream = null;
  }
};
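One stream behavior that might be relevant here (this is my own standalone repro sketch, plain Node.js streams, nothing zmq-raft specific): removing a 'data' listener does not discard chunks already buffered in a paused stream; they are delivered to whichever 'data' listener is attached when the stream resumes.

// replay.js - standalone sketch. Chunks buffered while paused replay to
// whatever 'data' listener is present when the stream resumes, even though
// the original listener was removed in between.
const { PassThrough } = require('stream');

const stream = new PassThrough();
const ondata = (chunk) => console.log('old listener:', chunk.toString());

stream.on('data', ondata);        // switches the stream to flowing mode
stream.write('chunk-1');          // delivered to ondata

setImmediate(() => {
  stream.pause();                           // like the push back in ondata
  stream.write('chunk-2');                  // buffered inside the stream, not emitted
  stream.removeListener('data', ondata);    // like cancel()
  stream.on('data', (chunk) => console.log('new listener:', chunk.toString()));
  stream.resume();                          // buffered chunk-2 replays here
});

If I'm reading the stream docs right, this prints "old listener: chunk-1" then "new listener: chunk-2". So if the same underlying stream were ever reused for a retried request after cancel(), stale chunks could come back, which would match the data reappearing in the queue.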

It must be calling ondata again to get that data back into the queue:

const ondata = (chunk) => {
  queue.push(chunk);
  pipe();
  if (stream && queue.length > nextToSendIdx) {
    /* push back stream */
    // debug('push back stream');
    stream.pause();
  }
};
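One way I could confirm this (just a debugging sketch on my side, reusing the names from the snippet above; the cancelled flag is something I'd add, it's not in zmq-raft):

// Hypothetical guard: if this warning ever fires, ondata is still being
// invoked after cancel() and is pushing stale chunks back into the queue.
let cancelled = false;

const ondata = (chunk) => {
  if (cancelled) {
    console.warn('ondata after cancel: dropping chunk of %d bytes', chunk.length);
    return;
  }
  queue.push(chunk);
  pipe();
  if (stream && queue.length > nextToSendIdx) {
    stream.pause();   /* push back stream */
  }
};

// ...and in cancel(), before removing the listeners:
// cancelled = true;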

I don't have any local changes, and I can re-create the issue almost every time, even after removing the tmp directory. I think it goes back to the -a add command hanging; I also wonder why that keeps affecting the Leader.

This appeared on node v18.19.0, but I see the same issue with v20 as well; switching versions meant I had to remove node_modules and re-install (pnpm i) so zeromq would work. The only other thing: I run codium in Docker (x11docker, if you're interested) and start all my processes from codium (everything in docker). I doubt this is the cause though; I just wanted to help rule out the node version, because this may have something to do with streams and timing.

royaltm (Owner) commented Aug 24, 2024

Hi @jcalfee ,

Thanks for the report.

At first glance, the "request entries done" message originates from the server part of the raft engine. That code does not take part in maintaining inter-node raft communication or cluster integrity; rather, it handles communication with clients.

The last time I extensively tested zmq-raft was with node v16. I'm still using it with node v18, but in a limited fashion: I'm using an in-process raft node only, so there are no external entry requests.

Perhaps you are right and something changed in the node stream API and now the code might be doing something wrong.

I'll look into it and let you know my findings. In the meantime, perhaps try to repeat your exercise with all the raft clients disconnected from the cluster and let me know the outcome.

jcalfee (Author) commented Aug 27, 2024

I'm not sure how to run the 3+1 example and have clients that are not connected to a cluster.

I tried v14 with a newly installed node_modules and no ./tmp, with the same results. Test data was not needed. Just to re-iterate, this is in a docker container, and I don't fully understand the ramifications; the container is otherwise very stable, and I do all my development work all day long inside it. If I rebooted, who knows, something might change or get fixed within the docker container. I have not rebooted because everything else is generally stable, so there is usually something to learn, and using node-zmq-raft in production would likely involve containerd or dockerd anyway.
