libzmq: LIBZMQ-286

HWM management on publisher side

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 3.0.2, 3.2.0
    • Fix Version/s: None
    • Component/s: core
    • Labels: None
    • Environment:

      Linux boxes (Ubuntu)

    Description

      > I am using ZMQ 3.0.x on Linux boxes with the PUB/SUB pattern.
      > I have only one subscriber, which is very slow: it needs 1 second every
      > time a message is read.
      > I have set the HWM on the publisher side to 10.
      >
      > Each message carries a counter which is incremented for every message.
      > My messages are relatively small (150 bytes).
      > The publisher prints the date (gettimeofday) each time it sends a
      > message.
      > On the same host where the publisher is running, Wireshark captures
      > the network packets.
      >
      > With Wireshark, I see that ZMQ drops messages number 11 to 39. I don't
      > understand why.
      > All the previous messages (number 1 to 10) have been sent on the
      > network, because I see them in Wireshark.
      > The time reported by Wireshark is consistent with the time printed by
      > the publisher.
      >
      > Message 8 sent by publisher at xxx623,205547
      > Message 8 seen by wireshark at xxx623,205556
      >
      > Message 9 sent by publisher at xxx623,205575
      > Message 9 seen by wireshark at xxx623,205584
      >
      > Message 10 sent by publisher at xxx623,205603
      > Message 10 seen by wireshark at xxx623,205611
      >
      > Message 11 sent by publisher at xxx623,205629
      > Message 12 sent by publisher at xxx623,205654
      > Message 13 sent by publisher at xxx623,205704
      > Message 14 sent by publisher at xxx623,205729
      > ....
      >
      > These messages are not seen by Wireshark, presumably because ZMQ
      > decided to drop them.
      > But why did it take that decision? I don't think there are messages in
      > the queue, because I have seen them on the wire!
      >

      From what I have understood, the problem is the following.
      The pipe used for communication between my application's publisher
      thread and the ZMQ I/O thread is effectively marked as full
      (msgs_written - peers_msgs_read == uint64_t (hwm) in the check_write
      method of the pipe.cpp file), even though I have seen my messages on
      the wire (as shown by Wireshark).
      The I/O thread has actually sent the messages on the wire and has sent
      the "activate_write" command to the pipe. When the publisher thread
      sends a message, it processes the pipe command list (method
      socket_base_t::process_commands()), but the "activate_write" commands
      (there are several) are not executed immediately.
      This is due to the code in socket_base_t::process_commands() just
      before the loop that processes the commands: there is an optimization
      of command processing which decides to return from the method before
      the commands are processed. If I comment out this part of the code,
      things work much better and I no longer notice dropped messages.

      1. pub.cpp (2 kB, attached by Emmanuel Taurel)
      2. sub.cpp (1 kB, attached by Emmanuel Taurel)

        Activity

        Emmanuel Taurel added a comment -

        This bug also affects ZMQ 3.1.1.
        I have attached two files that allow reproducing it.
        File pub.cpp is the publisher, which publishes 150 messages in a loop. It has a HWM set to 10 and prints the date when each message is handed to ZMQ.
        File sub.cpp is the subscriber. Each message is a multi-part message; the second part is a counter which is incremented each time a message is sent. The publisher sends the messages after
        a key is pressed on the keyboard. To reproduce the problem:
        1 - Start the publisher
        2 - Start the subscriber
        3 - Start Wireshark (with a suitable capture filter) on the publisher host
        4 - Ask the publisher to send messages by pressing a key
        5 - You will notice that the subscriber receives the first 10 messages (numbers 0 to 9) and then most of the remaining messages are lost. This is because ZMQ has dropped them on the pub side, thinking the pub
        HWM has been reached. In the publisher window, look at the date when message number 9 was sent and compare it with the messages sent on the wire as reported by Wireshark.
        ZMQ does not notice that messages number 0 to 9 have already been sent on the wire and
        wrongly thinks that the publisher HWM is reached.

        Thanks for your help

        Emmanuel

        Martin Hurton added a comment -

        Your analysis is correct.
        The HWM can be thought of as the pipe capacity.
        It takes some time for the signal that the pipe can accept more messages to reach the producer.
        Why do you think this is a problem?

        Emmanuel Taurel added a comment -

        Hi Martin,

        For me, the problem is that ZMQ drops too many messages. ZMQ still thinks that the pipe is full when it is not, because I have seen the messages on the wire (with Wireshark). A message cannot be both in the pipe and on the wire!
        Because ZMQ thinks the pipe is full, it drops messages when it could transfer them (the pipe is not full).

        I guess this is due to some optimization. As I explained in the bug report, if I comment out the optimization
        code in the pipe.cpp file, there are far fewer dropped packets, but I guess I am losing performance somewhere.

        Cheers

        Martin Hurton added a comment -

        Exactly. There is an optimization: for a short time we ignore the signal from the consumer telling the producer that it can write more messages to the socket.
        You may want to tune the max_command_delay option in config.hpp for your use case, if you are willing to trade away some throughput and latency.
        Let us know if this works for you.
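The throttle Martin describes can be sketched roughly as follows. This is a simplified model, not the libzmq source: the issue places the delay constant max_command_delay in config.hpp and the skip logic in socket_base_t::process_commands(); the real implementation measures elapsed CPU ticks, which a plain counter stands in for here:

```cpp
#include <cstdint>

// Simplified model of the command-processing throttle: to avoid the cost of
// polling the command pipe on every send, the socket only checks for pending
// commands (including activate_write) once enough "ticks" have elapsed since
// the last check. Until then, acknowledgments from the I/O thread sit
// unprocessed and the pipe keeps looking full to the sender.
struct socket_model {
    uint64_t max_command_delay;   // stands in for the config.hpp constant
    uint64_t last_processed = 0;
    int commands_processed = 0;

    explicit socket_model(uint64_t d) : max_command_delay(d) {}

    // Called on every send; 'now' stands in for the CPU tick counter.
    void process_commands(uint64_t now, bool pending) {
        if (now - last_processed <= max_command_delay)
            return;                // the optimization: too soon, skip polling
        last_processed = now;
        if (pending)
            ++commands_processed;  // activate_write etc. handled here
    }
};
```

Lowering the delay makes the sender notice activate_write sooner (fewer spurious drops at the HWM) at the cost of polling the command pipe more often, which is exactly the throughput/latency trade-off mentioned above.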

        Emmanuel Taurel added a comment -

        I am not able to reproduce this issue with release 3.2.0 rc1. But I have to say that I have upgraded my hardware platform to a brand new i7 computer. Because the issue was related to some optimization, I may simply not notice the problem on this new hardware; my previous computer was quite old (more than 5 years). Anyway, is there a link between the problem reported here and this thread (and patch) on the Crossroads mailing list?

        http://groups.crossroads.io/groups/crossroads-dev/messages/topic/jJ5x65jr3FTaiZ8Fjwqe5#post-iZXqUrsBSwluZpzjCYA6l

        I have asked the question on the Crossroads mailing list but got no answer.

        Cheers

        Martin Hurton added a comment -

        I think there is a link between those issues. Would you mind preparing a patch for this? Thanks!


          People

          • Assignee: Martin Sustrik
          • Reporter: Emmanuel Taurel
          • Votes: 0
          • Watchers: 2
