ROUTER socket does not support reconnection from clients with multiple NICs

Description

ZeroMQ sockets work on hosts with multiple NICs. If a client is disconnected while it still has queued messages, it automatically tries to reconnect through another available NIC. A problem arises when the client, say a DEALER socket, has set an identity with ZMQ_IDENTITY, because the identity does not change on reconnect. The server's ROUTER socket receives the new connection and sees the same identity again. Since the incoming connection comes from a different address (a different NIC) but carries an identity that is already in use, the ROUTER socket places it in its list of "anonymous" connections. From that point on, the client has no way of receiving responses.
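
To make the failure concrete, here is a toy model of the behaviour described above. It is only a sketch: ToyRouter, its fields, and the addresses are hypothetical and are not libzmq code. It assumes the router keeps one addressable connection per identity and parks any duplicate in an anonymous list.

```python
class ToyRouter:
    """Simplified model of the ROUTER behaviour described above (not libzmq code)."""

    def __init__(self):
        self.by_identity = {}   # identity -> connection that replies are routed to
        self.anonymous = []     # connections the router cannot address by identity

    def accept(self, identity, address):
        conn = {"identity": identity, "address": address}
        if identity in self.by_identity:
            # Identity already taken: the newcomer is parked as anonymous.
            self.anonymous.append(conn)
            return False
        self.by_identity[identity] = conn
        return True

    def route(self, identity):
        """Return the address a reply to `identity` would be sent to."""
        return self.by_identity[identity]["address"]


router = ToyRouter()
router.accept(b"client-1", "10.0.0.5")   # first connection, via NIC 1
router.accept(b"client-1", "10.0.1.5")   # reconnect via NIC 2, same identity
print(router.route(b"client-1"))          # replies still target the NIC 1 address
```

In this model the reconnected client sits in the anonymous list while replies keep going to the stale entry, which matches the symptom reported here.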

Suggested Fix:

The behaviour I have in mind is simple: when the ROUTER socket encounters a new connection whose identity is already taken, it lets the new connection take over that identity.

Ideally, any messages queued for the old connection would be transferred to the new connection's queue rather than dropped.

This behaviour could be enabled through a new zmq_setsockopt option. It would have to assume a trusted or secured network, however, since any peer presenting a known identity could take over that identity's connection.
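
The proposal can be sketched in the same toy style. Everything here is an assumption made for illustration: HandoverRouter, the handover flag (a stand-in for the proposed socket option), and the outbox queue are hypothetical names, not the libzmq implementation.

```python
from collections import deque


class HandoverRouter:
    """Toy model of the suggested fix (hypothetical names, not libzmq code)."""

    def __init__(self, handover=False):
        self.handover = handover   # stand-in for the proposed socket option
        self.pipes = {}            # identity -> {"address", "outbox"}
        self.anonymous = []

    def accept(self, identity, address):
        new_pipe = {"address": address, "outbox": deque()}
        old_pipe = self.pipes.get(identity)
        if old_pipe is not None:
            if not self.handover:
                # Default behaviour: duplicate identity becomes anonymous.
                self.anonymous.append(new_pipe)
                return False
            # Handover: move queued messages to the new connection, in order.
            new_pipe["outbox"].extend(old_pipe["outbox"])
            old_pipe["outbox"].clear()
        self.pipes[identity] = new_pipe
        return True

    def send(self, identity, message):
        self.pipes[identity]["outbox"].append(message)


router = HandoverRouter(handover=True)
router.accept(b"client-1", "10.0.0.5")    # original connection via NIC 1
router.send(b"client-1", b"reply-1")      # queued while NIC 1 is unreachable
router.accept(b"client-1", "10.0.1.5")    # reconnect via NIC 2 takes over
pipe = router.pipes[b"client-1"]
print(pipe["address"], list(pipe["outbox"]))
```

Keeping the flag off by default preserves the existing behaviour, so only deployments that trust their peers would opt in.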

Please see zeromq-dev mailing list discussions with Pieter Hintjens regarding this issue at:

http://lists.zeromq.org/pipermail/zeromq-dev/2013-June/021982.html
http://lists.zeromq.org/pipermail/zeromq-dev/2013-July/022044.html

Environment

I discovered the issue on Windows Server 2012, though I don't think the problem is limited to that OS.

In terms of hardware, the machine that connects to the ROUTER socket must have multiple NICs (in my case, the connecting socket is a DEALER socket with an identity).

Activity

PieterP
September 20, 2013, 7:44 AM

The router should actually know that the original connection was broken, so that a new connection with the same identity can recover it. The forcing of duplicate identities to anonymous should only happen when the original client connection is still alive.
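
The expectation in this comment can be modelled as a liveness check on the existing connection. This is a sketch only: LivenessAwareRouter, the alive field, and mark_broken are hypothetical names, and the model assumes the router is actually told when a connection breaks.

```python
class LivenessAwareRouter:
    """Toy model: a duplicate identity is anonymous only while the original is alive."""

    def __init__(self):
        self.pipes = {}       # identity -> {"address", "alive"}
        self.anonymous = []

    def accept(self, identity, address):
        new_pipe = {"address": address, "alive": True}
        old_pipe = self.pipes.get(identity)
        if old_pipe is not None and old_pipe["alive"]:
            # Original client still connected: treat the newcomer as anonymous.
            self.anonymous.append(new_pipe)
            return False
        # No previous connection, or it is known to be broken: recover the identity.
        self.pipes[identity] = new_pipe
        return True

    def mark_broken(self, identity):
        # Assumption: something (e.g. a detected disconnect) notifies the router.
        self.pipes[identity]["alive"] = False


router = LivenessAwareRouter()
router.accept(b"client-1", "10.0.0.5")
still_alive = router.accept(b"client-1", "10.0.1.5")    # original still alive
router.mark_broken(b"client-1")
after_break = router.accept(b"client-1", "10.0.1.5")    # original known broken
print(still_alive, after_break)
```

The model only recovers the identity if mark_broken is called, so it depends entirely on the router noticing the disconnection.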

Mark Barbisan
October 24, 2013, 6:28 PM

This is not what I am seeing (I'm using v3.2.3).

Here is my setup:

  • On one machine, I have an application that has a ROUTER socket.

  • On a virtual machine, I have another application using a DEALER socket that sets an identity. The virtual machine is set up with 2 NICs connected to a virtual switch.

  • They connect and everything is good.

  • From the hypervisor (in this case, Hyper-V), I disconnect one of the virtual machine's NICs from the virtual switch. This emulates unplugging a cable in a physical setup.

  • The ROUTER socket does not know that there has been a disconnection.

  • The DEALER socket eventually switches over to the second NIC and reconnects to the ROUTER.

  • The ROUTER sees the same identity, so the router_t::identify_peer method returns false and the connection goes into the anonymous_pipes list.

------------
I have implemented the fix I suggested above. I will create a pull request later this week for your review and consideration.

Mark Barbisan
October 31, 2013, 5:46 PM

Proposed fix submitted as a pull request (#729) on Oct 31, 2013.

Mark Barbisan
November 7, 2013, 4:30 PM

The code change, test, and documentation have been merged into master.

Fixed

Assignee

Unassigned

Reporter

Mark Barbisan

Labels

None

Components

Affects versions

Priority

Major