ZeroMQ sockets work on machines with multiple NICs. If a client is disconnected while it still has queued messages, it automatically tries to reconnect over the other available NICs. However, there is a problem if the client, say a DEALER socket, has set its identity (using ZMQ_IDENTITY), because the identity does not change on reconnect. A server with a ROUTER socket that receives the client's new connection sees the same ZMQ_IDENTITY again. Because the new connection comes from a different address (a different NIC) but carries an identity that is already registered, the ROUTER socket automatically places it into a list of "anonymous" connections. The client then has no way of receiving responses.
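The bookkeeping described above can be illustrated with a small toy model. This is only a sketch of the behaviour as reported here, not libzmq code: the class `RouterModel`, the `outpipes` dictionary, the `route` helper, and the connection labels `"nic-a"` and `"nic-b"` are hypothetical names chosen for illustration.

```python
class RouterModel:
    """Toy model of a ROUTER's identity table (hypothetical, not libzmq code)."""

    def __init__(self):
        self.outpipes = {}         # identity -> connection label
        self.anonymous_pipes = []  # connections that could not claim an identity

    def identify_peer(self, identity, connection):
        """Return True if the connection claimed the identity, False otherwise."""
        if identity in self.outpipes:
            # Identity already taken: park the newcomer as anonymous.
            self.anonymous_pipes.append(connection)
            return False
        self.outpipes[identity] = connection
        return True

    def route(self, identity):
        """Return the connection a reply addressed to this identity would use."""
        return self.outpipes.get(identity)


router = RouterModel()
first = router.identify_peer(b"client-1", "nic-a")   # original connection
second = router.identify_peer(b"client-1", "nic-b")  # reconnect over the other NIC
reply_target = router.route(b"client-1")             # still the dead "nic-a" link
```

In this model, replies addressed to `client-1` keep going to the original connection, which is why the reconnected client never sees them.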
The behaviour I have in mind is very simple: when the ROUTER socket encounters a connection with an identity that is already taken, it allows the new connection to assume that identity.
Ideally, any queued messages would be transferred to the queue for the new connection rather than dropped.
This new behaviour could be enabled via a new option for zmq_setsockopt. However, it would have to assume a trusted/secured network, because any peer presenting an existing identity would be able to take it over.
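A minimal sketch of the proposed behaviour, again as a toy model rather than the actual patch. The class `HandoverRouterModel`, its `handover` flag, and the `send` helper are hypothetical names, and transferring the pending queue reflects the "ideal" case described above, which the real implementation may or may not do.

```python
from collections import deque


class HandoverRouterModel:
    """Toy model of the proposed identity takeover (hypothetical, not libzmq code)."""

    def __init__(self, handover=False):
        self.handover = handover   # stands in for the proposed socket option
        self.outpipes = {}         # identity -> (connection label, pending queue)
        self.anonymous_pipes = []

    def identify_peer(self, identity, connection):
        """Register a connection; return True if it now owns the identity."""
        existing = self.outpipes.get(identity)
        if existing is None:
            self.outpipes[identity] = (connection, deque())
            return True
        if not self.handover:
            # Behaviour reported in this issue: duplicate becomes anonymous.
            self.anonymous_pipes.append(connection)
            return False
        # Proposed behaviour: the new connection takes over the identity
        # and inherits the messages still queued for the old one.
        _, pending = existing
        self.outpipes[identity] = (connection, pending)
        return True

    def send(self, identity, message):
        """Queue a message for the identity; return False if it is unknown."""
        entry = self.outpipes.get(identity)
        if entry is None:
            return False
        entry[1].append(message)
        return True


router = HandoverRouterModel(handover=True)
router.identify_peer(b"client-1", "nic-a")
router.send(b"client-1", b"reply-1")                    # queued on the old link
took_over = router.identify_peer(b"client-1", "nic-b")  # reconnect over second NIC
connection, pending = router.outpipes[b"client-1"]
```

With `handover=False` the same sequence reproduces the current behaviour, so the flag keeps the existing semantics as the default and makes the takeover an explicit opt-in for trusted networks.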
Please see zeromq-dev mailing list discussions with Pieter Hintjens regarding this issue at:
I discovered the issue on Windows Server 2012, though I don't think the problem is limited to that OS.
In terms of hardware, the machine that connects to the ROUTER socket must have multiple NICs (in my case, it runs a DEALER socket with an identity).
The router should actually know that the original connection was broken, so that a new connection with the same identity can recover it. Duplicate identities should only be forced into the anonymous list when the original client connection is still alive.
This is not what I am seeing (I'm using v3.2.3).
Here is my setup:
On one machine, I have an application that has a ROUTER socket.
On a virtual machine, I have another application using a DEALER socket that sets an identity. The virtual machine is set up with 2 NICs connected to a virtual switch.
They connect and everything is good.
From the Hypervisor (in this case, Hyper-V), I disconnect one of the virtual machine's NICs from the virtual switch. This is basically emulating disconnecting a cable in a physical setup.
The ROUTER socket does not know that there has been a disconnection.
The DEALER socket eventually switches over to the second NIC and reconnects to the ROUTER.
The ROUTER sees the same identity, so the router_t::identify_peer method returns false and the connection goes into the anonymous_pipes list.
I have implemented the fix I suggested above. I will create a pull request later this week for your review and consideration.
Proposed fix submitted as a pull request (#729) on Oct 31, 2013.
The code change, test, and documentation have been merged into master.