Pub socket sending fail issue

Description

1. Symptom:

  • Three nodes, A, B, and C, are connected to each other with zmq.

  • A's Pub socket is connected to the Sub sockets of B and C.

  • When node B terminates abnormally and re-establishes its zmq connection repeatedly, A's existing Pub socket sometimes can no longer send to C's Sub socket.

2. Cause:

  • In the Pub socket, write() failed on one pipe because that pipe was not active. The pipe then entered the terminated() function.

  • After that pipe was erased, a remaining normal (active) pipe sometimes ended up at an array index equal to or greater than eligible.

  • That normal pipe can then no longer be sent to, because it fails the condition in match(): if (pipes.index (pipe_) >= eligible) return.
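The displacement described above can be sketched on a toy model. This is a simplified illustration, not the code in dist.cpp: `ToyDist`, `terminated_original`, `can_match`, `make_scenario`, and the pipe names are hypothetical, and the index layout is an assumption stated in the comments.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for the pipe bookkeeping (hypothetical, not the libzmq code).
// Assumed layout: indices [0, matching) are matching, [matching, active) are
// active, [active, eligible) are eligible, and the rest are not eligible.
struct ToyDist {
    std::vector<std::string> pipes;
    std::size_t matching = 0, active = 0, eligible = 0;

    std::size_t index(const std::string &pipe) const {
        for (std::size_t i = 0; i < pipes.size(); ++i)
            if (pipes[i] == pipe)
                return i;
        assert(false && "unknown pipe");
        return pipes.size();
    }

    // Erase by moving the last element into the freed slot.
    void erase(std::size_t i) {
        if (i != pipes.size() - 1)
            pipes[i] = pipes.back();
        pipes.pop_back();
    }

    // Model of the original behaviour: adjust the counters, then erase.
    void terminated_original(const std::string &pipe) {
        const std::size_t i = index(pipe);
        if (i < matching) --matching;
        if (i < active) --active;
        if (i < eligible) --eligible;
        erase(i);
    }

    // Mirrors the quoted condition: a pipe at index >= eligible is ignored.
    bool can_match(const std::string &pipe) const {
        return index(pipe) < eligible;
    }
};

// Assumed scenario: "B" is about to terminate, "C" is healthy, and "X" is a
// pipe that already dropped out of the eligible range.
ToyDist make_scenario() {
    ToyDist d;
    d.pipes = {"B", "C", "X"};
    d.matching = 0;
    d.active = 2;
    d.eligible = 2;
    return d;
}
```

Calling `terminated_original("B")` on the result of `make_scenario()` moves "X" into index 0 and shrinks eligible to 1, so `can_match("C")` returns false even though "C" never failed.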

3. Solution:
A pipe in the active state must sit at an array index lower than eligible.
To guarantee that, we added a swap before the erase in the terminated() function.
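A minimal sketch of the swap-before-erase idea, again on a toy model rather than the actual patch. `ToyDist`, `terminated_fixed`, `category`, and `fix_preserves_categories` are hypothetical names, and the erase is assumed to work by moving the last element into the freed slot. The helper at the end checks, for every small layout, that each surviving pipe stays in the range it was in before.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model (hypothetical): pipes are plain ints, and the layout is
// [0, matching) matching, [matching, active) active,
// [active, eligible) eligible, the rest not eligible.
struct ToyDist {
    std::vector<int> pipes;
    std::size_t matching = 0, active = 0, eligible = 0;

    std::size_t index(int pipe) const {
        for (std::size_t i = 0; i < pipes.size(); ++i)
            if (pipes[i] == pipe)
                return i;
        assert(false && "unknown pipe");
        return pipes.size();
    }

    void swap_at(std::size_t i, std::size_t j) {
        if (i != j)
            std::swap(pipes[i], pipes[j]);
    }

    // Assumed erase semantics: the last element fills the freed slot.
    void erase(std::size_t i) {
        if (i != pipes.size() - 1)
            pipes[i] = pipes.back();
        pipes.pop_back();
    }

    // Swap the terminating pipe to the end of each range it belongs to
    // before shrinking that range, then erase it.
    void terminated_fixed(int pipe) {
        if (index(pipe) < matching) { swap_at(index(pipe), matching - 1); --matching; }
        if (index(pipe) < active)   { swap_at(index(pipe), active - 1);   --active; }
        if (index(pipe) < eligible) { swap_at(index(pipe), eligible - 1); --eligible; }
        erase(index(pipe));
    }

    // 0 = matching, 1 = active, 2 = eligible, 3 = not eligible.
    int category(int pipe) const {
        const std::size_t i = index(pipe);
        if (i < matching) return 0;
        if (i < active) return 1;
        if (i < eligible) return 2;
        return 3;
    }
};

// Tries every layout with matching <= active <= eligible <= n and every
// victim, and reports whether all survivors keep their category.
bool fix_preserves_categories(std::size_t max_size) {
    for (std::size_t n = 1; n <= max_size; ++n)
        for (std::size_t e = 0; e <= n; ++e)
            for (std::size_t a = 0; a <= e; ++a)
                for (std::size_t m = 0; m <= a; ++m)
                    for (std::size_t victim = 0; victim < n; ++victim) {
                        ToyDist d;
                        for (std::size_t p = 0; p < n; ++p)
                            d.pipes.push_back(static_cast<int>(p));
                        d.matching = m; d.active = a; d.eligible = e;
                        std::vector<int> before(n);
                        for (std::size_t p = 0; p < n; ++p)
                            before[p] = d.category(static_cast<int>(p));
                        d.terminated_fixed(static_cast<int>(victim));
                        if (d.pipes.size() != n - 1)
                            return false;
                        for (int p : d.pipes)
                            if (d.category(p) != before[static_cast<std::size_t>(p)])
                                return false;
                    }
    return true;
}
```

In this model the swaps mean the slot freed by erase always lies at or beyond eligible, so the element that erase moves in was already not eligible.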

---The original code

dist.cpp

---The modified code

4. For example:
Suppose the array size is 4 and, initially, matching is 2, eligible is 4, and active is 4.
This state is written below as m(2), e(4), a(4).

---In the original code

---In the modified code
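One way such an example could play out, sketched on a simplified model that starts from m(2), e(4), a(4). Everything here is an assumption made for illustration: the pipe names P0 to P3, the choice of which pipe fails and which one terminates, the `ToyDist` type, and the `write_failed` step, which is modeled as moving the pipe out of the matching, active, and eligible ranges. It is not the dist.cpp code.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy model (hypothetical). Layout: [0, m) matching, [m, a) active,
// [a, e) eligible, the rest not eligible.
struct ToyDist {
    std::vector<std::string> pipes{"P0", "P1", "P2", "P3"};
    std::size_t m = 2, a = 4, e = 4;

    std::size_t index(const std::string &pipe) const {
        for (std::size_t i = 0; i < pipes.size(); ++i)
            if (pipes[i] == pipe)
                return i;
        assert(false && "unknown pipe");
        return pipes.size();
    }

    void swap_at(std::size_t i, std::size_t j) {
        if (i != j)
            std::swap(pipes[i], pipes[j]);
    }

    // Assumed erase semantics: the last element fills the freed slot.
    void erase(std::size_t i) {
        if (i != pipes.size() - 1)
            pipes[i] = pipes.back();
        pipes.pop_back();
    }

    // Assumed failure handling for a matching pipe: push it out of the
    // matching, active, and eligible ranges in turn.
    void write_failed(const std::string &pipe) {
        swap_at(index(pipe), m - 1); --m;
        swap_at(index(pipe), a - 1); --a;
        swap_at(index(pipe), e - 1); --e;
    }

    // swap_first == false models the original code, true the modified code.
    void terminated(const std::string &pipe, bool swap_first) {
        if (index(pipe) < m) { if (swap_first) swap_at(index(pipe), m - 1); --m; }
        if (index(pipe) < a) { if (swap_first) swap_at(index(pipe), a - 1); --a; }
        if (index(pipe) < e) { if (swap_first) swap_at(index(pipe), e - 1); --e; }
        erase(index(pipe));
    }

    std::string show() const {
        std::ostringstream out;
        out << "[";
        for (std::size_t i = 0; i < pipes.size(); ++i)
            out << (i ? " " : "") << pipes[i];
        out << "] m(" << m << "), e(" << e << "), a(" << a << ")";
        return out.str();
    }
};

// Assumed sequence: write() fails on P1, then P0 terminates.
std::string run_example(bool swap_first) {
    ToyDist d;
    d.write_failed("P1");
    d.terminated("P0", swap_first);
    return d.show();
}
```

In this model `run_example(false)` returns `[P1 P3 P2] m(0), e(2), a(2)`: the failed pipe P1 lands at index 0, and the healthy pipe P2 is left at index 2, outside the eligible range. `run_example(true)` returns `[P2 P3 P1] m(0), e(2), a(2)`, where both healthy pipes stay below eligible.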

Environment

arm android

Activity


Christian Kamm June 29, 2013 at 4:04 PM
Edited

WooSung Lee June 29, 2013 at 3:42 AM
Edited

Christian: I made a pull request.
It uses the code I first suggested.
I referred to the earlier mailing-list discussion that Christian Kamm pointed out (http://lists.zeromq.org/pipermail/zeromq-dev/2013-March/020897.html) and to the code for the failure case in zmq::dist_t::write().

Christian Kamm June 27, 2013 at 4:42 PM

WooSung: Yes, but note that I'm new to zmq myself. It seems like moving the to-be-deleted pipe into the not-eligible range would suffice. Then the erase() call can only move another non-eligible pipe into the deleted one's spot.

WooSung Lee June 27, 2013 at 12:05 PM

Thanks, Christian and David.
I'll make a pull request.

Christian, do you want the swap only for eligible-1?

Christian Kamm June 27, 2013 at 9:26 AM

Okay, so the problem is that dist::pipe_terminate() doesn't realize array::erase(N) works by reordering, moving the last element into the Nth spot in the array. That way pipe_terminate can accidentally move an ineligible pipe into the matching, active or eligible part of the pipes array, displacing an active one.

Example:

WooSung Lee, your patch looks good, will you make a pull request? Since matching <= active <= eligible, you could probably get away with only adding the swap to eligible-1.

This should be easy to reproduce with three subscribers: make the third hit the HWM and then terminate the first. The second should stop getting data. I'll give that a try.

Fixed

Details

Created June 26, 2013 at 5:57 AM
Updated September 20, 2013 at 9:53 AM
Resolved September 20, 2013 at 9:53 AM