Example is getting caught in a loop #27

Open · jcalfee opened this issue Aug 23, 2024 · 2 comments

jcalfee commented Aug 23, 2024

Have you come across this situation?

It hangs when adding a 4th node: bin/zr-config.js -c config/example.hjson -a tcp://127.0.0.1:8347/4

[screenshot: terminal output of the hanging add command]

The info and www interfaces usually report that the 4th node was added. But, more often than not, when I re-try and remove or add the 4th node again, one of the other 3 running nodes gets caught in a loop:

[screenshots: the looping log output]

That starts the long-running loop you see above (presumably forever).

I can press ctrl+c to stop the looping node, and the issue moves to the next Leader. If I bring the cluster down to 2 nodes the looping stops; when I bring the 3rd node back up, the looping continues in whichever node is the new Leader. It does not matter which node I start and stop, the loop just continues in the new Leader.

The lastIndex is 400 (the same number of messages created in the test data before I issued .stop). The queue array has a single item: a buffer of size 400.

[screenshot: debugger showing the queue state]
...
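For reference, here is a log line I could drop in to watch that state without the debugger (the logging is mine, not zmq-raft code; I'm assuming lastIndex, queue and nextToSendIdx are all in scope there, as in the snippets below):

// Hypothetical debug helper (mine): dump the state described above.
// Assumes lastIndex, queue and nextToSendIdx are in scope at that point.
const dumpState = (label) => {
  console.log('%s: lastIndex=%d queue.length=%d firstItem=%d nextToSendIdx=%d',
    label, lastIndex, queue.length,
    queue[0] ? queue[0].length : -1, nextToSendIdx);
};
// e.g. dumpState('ondata') at the top of ondata, dumpState('cancel') inside cancel.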

It is clear that it is calling cancel and removing the ondata listener:

const cancel = () => {
  entriesRequests.delete(requestKey);
  if (stream) {
    stream.cancel();
    stream.removeListener('data', ondata);
    stream.removeListener('end', onend);
    stream = null;
  }
};
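One stream behavior that might be relevant here (this is my own standalone repro sketch, plain Node.js streams, nothing zmq-raft specific): removing a 'data' listener does not discard chunks already buffered in a paused stream; they are delivered to whichever 'data' listener is attached when the stream resumes.

// replay.js - standalone sketch. Chunks buffered while paused replay to
// whatever 'data' listener is present when the stream resumes, even though
// the original listener was removed in between.
const { PassThrough } = require('stream');

const stream = new PassThrough();
const ondata = (chunk) => console.log('old listener:', chunk.toString());

stream.on('data', ondata);        // switches the stream to flowing mode
stream.write('chunk-1');          // delivered to ondata

setImmediate(() => {
  stream.pause();                           // like the push back in ondata
  stream.write('chunk-2');                  // buffered inside the stream, not emitted
  stream.removeListener('data', ondata);    // like cancel()
  stream.on('data', (chunk) => console.log('new listener:', chunk.toString()));
  stream.resume();                          // buffered chunk-2 replays here
});

If I'm reading the stream docs right, this prints "old listener: chunk-1" then "new listener: chunk-2". So if the same underlying stream were ever reused for a retried request after cancel(), stale chunks could come back, which would match the data reappearing in the queue.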

It must be calling ondata again to get that data back into the queue:

const ondata = (chunk) => {
  queue.push(chunk);
  pipe();
  if (stream && queue.length > nextToSendIdx) {
    /* push back stream */
    // debug('push back stream');
    stream.pause();
  }
};
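One way I could confirm this (just a debugging sketch on my side, reusing the names from the snippet above; the cancelled flag is something I'd add, it's not in zmq-raft):

// Hypothetical guard: if this warning ever fires, ondata is still being
// invoked after cancel() and is pushing stale chunks back into the queue.
let cancelled = false;

const ondata = (chunk) => {
  if (cancelled) {
    console.warn('ondata after cancel: dropping chunk of %d bytes', chunk.length);
    return;
  }
  queue.push(chunk);
  pipe();
  if (stream && queue.length > nextToSendIdx) {
    stream.pause();   /* push back stream */
  }
};

// ...and in cancel(), before removing the listeners:
// cancelled = true;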

I don't have any local changes, and I can re-create the issue almost every time, even after removing the tmp directory. I think it goes back to the -a add command hanging; I also wonder why that keeps affecting the Leader.

This appeared on node v18.19.0, but I see the same issue with v20 as well; switching versions meant I had to remove node_modules and re-install (pnpm i) so zeromq would work. The only other thing: I run codium in Docker (x11docker, if you're interested) and start all my processes from codium (everything in docker). I doubt this is the cause though; I just wanted to help rule out the node version, because this may have something to do with streams and timing.

royaltm (Owner) commented Aug 24, 2024

Hi @jcalfee ,

Thanks for the report.

At first glance, the "request entries done" message originates from the server part of the raft engine. That code does not take part in maintaining inter-node raft communication or cluster integrity; rather, it handles communication with clients.

The last time I extensively tested zmq-raft was with node v16. I'm still using it with node v18, but in a limited fashion: I'm using an in-process raft node only, so there are no external entry requests.

Perhaps you are right and something changed in the node stream API and now the code might be doing something wrong.

I'll look into it and let you know my findings. In the meantime, perhaps try to repeat your exercise with all the raft clients disconnected from the cluster and let me know the outcome.

jcalfee (Author) commented Aug 27, 2024

I'm not sure how to run the 3+1 example and have clients that are not connected to a cluster.

I tried v14 with a newly installed node_modules and no ./tmp, with the same results. Test data was not needed. Just to re-iterate, this is in a docker container, and I don't fully understand the ramifications; the container is otherwise very stable, and I do all my development work all day long inside it. If I rebooted, who knows, something might change or get fixed within the docker container. I have not rebooted because everything else is generally stable, so there is usually something to learn, and using node-zmq-raft in production would likely involve containerd or dockerd anyway.
