push/pull + HWM + slow puller + zmq_close = lost message

Description

I have two processes performing a simple push/pull flow of data from a pusher to a puller. The puller has an HWM set because it is I/O bound on disk writes, and the number and size of messages handled would exceed memory limits with no HWM set. When the pusher has finished sending all the data messages, it sends a message telling the puller there is no more data. Once this termination message is sent, the pusher is finished and can exit. The puller continues to process messages until it receives the termination message, whereupon it too exits.

If the HWM has been reached on the pull side when the pusher sends the termination message and calls zmq_close on its end of the socket, the termination message is lost and the puller blocks indefinitely, expecting more messages. Note that this happens at zmq_close time (as demonstrated by flipping the WAIT_FOR_CHILD_BEFORE_TERM define in the example code), not at zmq_term.

This seems to contradict the documentation for zmq_close and zmq_term. I would have expected the final message to remain buffered after zmq_close and only be dropped by zmq_term according to the ZMQ_LINGER setting (the default of -1 is used in the example code, but explicitly setting it to -1 has also been tried).

It is also worth noting that setting the pusher's HWM to 1 and waiting for ZMQ_EVENTS to indicate that a message can be sent without blocking does not help in this situation either.

Also note that this behavior is exhibited with both the tcp and ipc transports.

Which side of the PUSH/PULL pair binds the endpoint also makes no difference (although this is not easily switched in the attached example code).

If this is expected behavior and not a bug, is there a known workaround for termination messages? I could have the puller ACK back to the pusher via another socket that it has received the message, but the same issue would seem to exist with the puller shutting down before the pusher received the ACK. I suppose that could be avoided by not setting an HWM on the ACK socket. Still, doubling the number of sockets and greatly increasing the complexity of a simple termination method intuitively suggests this is the wrong approach (e.g. Unix pipes would be extremely frustrating, and far less used, without EOF).

The attached example code creates two processes via fork and connects them with a PUSH/PULL socket pair. There are #defines at the top to control the key parameters, and the endpoint to use is passed as a command-line argument. The puller/child emulates a slow receiver by sleeping for one second between receives. The current define values are the minimum needed to always reproduce this behavior on my laptop.

Thanks

Environment

Linux on amd64. Debian SID. ZMQ compiled from source with default options.

Attachments

3
  • 17 Jul 2011, 08:04 AM
  • 17 Jul 2011, 08:04 AM
  • 13 Jul 2011, 02:28 PM

Activity

Martin Hurton November 14, 2012 at 8:32 PM

Could you please test with tip of zeromq3-x (https://github.com/zeromq/zeromq3-x)?

Timothy M. Shead November 14, 2012 at 8:18 PM

Martin:

Could you clarify which version of libzmq you mean? 3.2.1-RC2? I'm anxious to test this out.

Thanks,
Tim

Martin Hurton November 14, 2012 at 8:04 PM

Fixed by 776563fcffe975774c713ade357ea2b83d22da7c.

Martin Hurton November 14, 2012 at 8:03 PM

Mika, do you hit this issue with the last version of libzmq? If so, can you file a new issue? Thanks!

Mika Fischer February 14, 2012 at 1:43 PM

We can't reproduce the case with more than one message lost with a simple test case. So it's probably an issue with our code. But we can reproduce the issue with HWM=1 and one lost message reliably.

Fixed

Details

Created July 13, 2011 at 2:28 PM
Updated August 29, 2013 at 4:36 PM
Resolved November 14, 2012 at 8:04 PM