zmq_bind occasionally fails (EADDRINUSE) on Windows

Description

I'm using 0MQ in a test suite to synchronize distributed tests. During an overnight run, calls to zmq_bind() randomly fail with EADDRINUSE. I've simplified the steps to reproduce down to a single thread running this code: http://pastebin.com/Sy95wPHe

On my machine, the issue happens after around 16,000 successful iterations of creating, binding, and closing a socket. This doesn't seem to be an issue with the port actually being in use as the error suggests. I believe there is some kind of bug in 0MQ (or maybe Windows?). Here's why:
1) In the overnight test, ports are chosen randomly from a list of ports, and so the same ports are not re-used for some time. In my simple example, I run into the same problem when just re-using the same port. This makes it seem like it's not a problem with the port being released.
2) In the simple example, the bind succeeds over 16K times before it fails, and the failure reproduces at around 16K iterations on every run. Nobody else should be using that port except me.
3) If I modify the example to immediately retry zmq_bind() after an error, it still fails. If I sleep a second, and retry, it still fails. If I loop for 2 minutes retrying, it still fails. Once it fails, it continues to fail. However, if I loop to the next iteration where I create a new context and socket, the call to zmq_bind() on the same port immediately succeeds!
4) Related to point #3, the retry workaround only works if I don't have any other sockets open, even ones on different ports in a different context. If any exist, I have to destroy them for the retry to succeed. This only appears to affect sockets in the same thread.
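
The create, bind, close cycle described above can be sketched with plain TCP sockets. This is not the code behind the pastebin link: it uses Python's standard socket module rather than 0MQ, and the helper name, loopback address, example port, and iteration count are all assumptions made for illustration. It only reports the iteration at which a bind first fails, if any.

```python
import errno
import socket


def first_bind_failure(port, iterations, host="127.0.0.1"):
    """Create, bind, and close a TCP socket `iterations` times on one port.

    Returns (iteration, errno) for the first failed bind, or None if every
    bind succeeded. Hypothetical helper, not part of 0MQ.
    """
    for i in range(iterations):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError as exc:
            return i, exc.errno
        finally:
            # Always release the socket before the next iteration.
            sock.close()
    return None


if __name__ == "__main__":
    # 5555 is an arbitrary example port; 20000 is a little above the ~16K mark.
    result = first_bind_failure(5555, 20000)
    if result is None:
        print("all binds succeeded")
    else:
        iteration, code = result
        print("bind failed at iteration", iteration, errno.errorcode.get(code, code))
```

Whether and when this loop fails depends on the operating system, so it is a starting point for narrowing the problem down rather than a guaranteed reproduction.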

This kind of problem makes it hard to consider the 0MQ library reliable, and I'm contemplating writing my own library from scratch to ensure dependability in our use case.

Environment

Windows 7

Activity

Ian Barber
March 21, 2013, 10:36 PM

Which version of zeromq is this with? It may be worth setting the ZMQ_LINGER sockopt to 0 to see if that has any effect on the occurrence of the bug.

Daniel Marcotte
March 21, 2013, 10:53 PM

This is version 3.2.2, the latest available stable release. I've tried messing with the ZMQ_LINGER options, and there appears to be no change in the result.

Ian Barber
March 22, 2013, 8:06 PM

OK, that's interesting. WRT point 4, you're saying that if there are any other sockets, even if they are in a different context but in the same thread, you have to remove all of them for it to work? This feels like it's hitting a limit somewhere. Could you try the test again, but change the endpoint to inproc://foo or similar, and see if you get a similar issue? That will at least check whether it's ZMQ on Windows or something more socket related.

Daniel Marcotte
April 1, 2013, 3:35 PM

The problem is not with libzmq, but with my code and Windows. I'm using a static port number, which occasionally fails to bind even when using native Winsock programming. However, if I switch to using port 0 (having the OS allocate a port), the problem goes away.
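
Letting the OS allocate the port can be sketched with a plain TCP socket: bind to port 0, then ask the socket which port it was given. The sketch uses Python's standard socket module rather than 0MQ or Winsock directly, and the helper name and loopback address are assumptions for illustration.

```python
import socket


def bind_to_os_assigned_port(host="127.0.0.1"):
    """Bind a TCP socket to port 0 and return (socket, assigned_port).

    Hypothetical helper: port 0 asks the OS to pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))
    except OSError:
        sock.close()
        raise
    # getsockname() reports the address and port the OS actually assigned.
    return sock, sock.getsockname()[1]


if __name__ == "__main__":
    sock, port = bind_to_os_assigned_port()
    print("OS assigned port", port)
    sock.close()
```

Because the port is not known in advance, it has to be passed to the peers that need to connect by some other channel.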

Cannot Reproduce

Assignee

Unassigned

Reporter

Daniel Marcotte

Priority

Major