libzmq / LIBZMQ-323

Unexpected behaviour of an epgm pub/sub socket under intensive use

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: 2.1.11, 3.2.0
    • Fix Version/s: None
    • Component/s: core
    • Labels:
    • Environment: Win7, x86/x64

      Description

      It seems that applications which use the 0MQ library intensively (>1k requests/sec) behave in an unexpected way.

      When I try to send many messages (~1M) between 2 computers using epgm broadcast (pub/sub), at some point (after >200-500k messages have been sent and received successfully) something goes wrong and one computer stops receiving all messages from the other. I tried stopping the sender for a while and waiting; nothing helps.
      Only if I restart one socket (no matter which one, on the first or on the second computer) does everything work again at first, but then the problem repeats.

      You can find a small test application in the attachment.
      To test:
      Run the application on 2 computers on the same local network.
      Expected output:
      "Send {0} messages" and "Received multi {0}"
      Actual output:
      At the beginning, as expected; after some time, only "Send {0} messages"

      P.S.
      Since I use the .NET binding, I initially created this issue on the clrzmq tracker (https://github.com/zeromq/clrzmq/issues/45), but they advised me to report the bug here.
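
      For reference, the following is a minimal sketch (not the attached test application) of an epgm PUB/SUB pair written against the libzmq 3.2 C API; the interface name, multicast group, port, message count and rate below are placeholders/assumptions.

          #include <zmq.h>
          #include <stdio.h>
          #include <string.h>

          /* Usage: epgm_test pub  |  epgm_test sub */
          int main (int argc, char *argv [])
          {
              void *ctx = zmq_ctx_new ();
              int rate = 1000000;            /* ZMQ_RATE is in kbit/s: 1 Gbps */
              /* "eth0" and the multicast group/port are placeholders */
              const char *endpoint = "epgm://eth0;239.192.1.1:5555";
              char buf [64];

              if (argc > 1 && strcmp (argv [1], "pub") == 0) {
                  void *pub = zmq_socket (ctx, ZMQ_PUB);
                  zmq_setsockopt (pub, ZMQ_RATE, &rate, sizeof rate);
                  zmq_connect (pub, endpoint);   /* for pgm/epgm, connect and bind are equivalent */
                  for (int i = 1; i <= 1000000; i++) {
                      int len = snprintf (buf, sizeof buf, "message %d", i);
                      zmq_send (pub, buf, len, 0);
                      if (i % 10000 == 0)
                          printf ("Send %d messages\n", i);
                  }
                  zmq_close (pub);
              }
              else {
                  void *sub = zmq_socket (ctx, ZMQ_SUB);
                  zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0);
                  zmq_setsockopt (sub, ZMQ_RATE, &rate, sizeof rate);
                  zmq_connect (sub, endpoint);
                  int received = 0;
                  while (1) {
                      if (zmq_recv (sub, buf, sizeof buf, 0) >= 0 && ++received % 10000 == 0)
                          printf ("Received multi %d\n", received);
                  }
              }
              zmq_ctx_destroy (ctx);
              return 0;
          }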


          Attachments

            Activity

            suilevap Pavel Raliuk added a comment -

            Is there any update on this issue?

            suilevap Pavel Raliuk added a comment - edited

            Issue was reproduced in clrzmq 3.0.0-alpha1 (libzmq 3.1).

            hurtonm Martin Hurton added a comment -

            You set the maximum multicast data rate on the socket to 1 Gbps.
            Are you sure your network can manage this traffic?
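
            For context, ZMQ_RATE is expressed in kilobits per second, so a 1 Gbps limit corresponds to a value of 1000000. A hedged sketch against the libzmq 3.x C API (the 2.x option takes an int64_t instead of an int); "socket" and the lower value are placeholders:

                /* ZMQ_RATE must be set before zmq_connect()/zmq_bind() */
                int rate = 1000000;        /* kbit/s: 1 Gbps, the value used in the test */
                zmq_setsockopt (socket, ZMQ_RATE, &rate, sizeof rate);

                /* a more conservative limit, e.g. 100 Mbps */
                int lower_rate = 100000;   /* kbit/s */
                zmq_setsockopt (socket, ZMQ_RATE, &lower_rate, sizeof lower_rate);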

            suilevap Pavel Raliuk added a comment -

            Yes, all networks run at about 2-3 Gbps.

            steve-o Steven McCoy added a comment -

            That doesn't make sense; do you mean provisioned to only 2-3 Gbps on a 10GigE NIC?

            With recovery sharing the same network you probably want to limit to 40% capacity. For high bandwidth broadcast you ultimately need a separate NIC for recovery. You might have to be a little creative for such a configuration, although a few options spring to mind.
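
            A hedged sketch of the rate limiting suggested above, using libzmq 3.x socket options; the 40% figure follows this comment, while the 2.5 Gbps provisioned capacity and the "pub" socket variable are assumptions, and ZMQ_RECOVERY_IVL here is the 3.x millisecond form:

                int rate = 1000000;          /* kbit/s: roughly 40% of an assumed 2.5 Gbps link */
                zmq_setsockopt (pub, ZMQ_RATE, &rate, sizeof rate);

                int recovery_ivl = 10000;    /* ms: window of sent data kept for PGM repair requests */
                zmq_setsockopt (pub, ZMQ_RECOVERY_IVL, &recovery_ivl, sizeof recovery_ivl);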

            suilevap Pavel Raliuk added a comment -

            One network adapter is used for several networks.

            Is it possible that insufficient network capacity can lead to such behaviour?
            It does not seem OK that, after we stop sending messages, communication is still broken and recovers only after restarting the socket.

            P.S.
            As a result, we didn't have time to spend more on this problem and decided to use a custom solution without ZeroMQ.

            pieterh Pieter Hintjens added a comment -

            Not clear what the cause is.

            suilevap Pavel Raliuk added a comment -

            A sample was attached.
            Does this mean the issue could not be reproduced?

            steve-o Steven McCoy added a comment -

            It is a known characteristic of reliable multicast systems. To send at full speed, i.e. at network saturation, some form of congestion control is required. OpenPGM does not ship with a supported congestion control mechanism.

            PGMCC is one congestion control mechanism included within OpenPGM, but empirical evidence has shown that the protocol fails above 1,000 messages per second. Further academic research is required in this area for high-speed networks sending 10,000-1,000,000 messages per second.

            In the absence of fully operational congestion control, the alternative is to implement rate limiting. ZeroMQ provides control of PGM rate limiting via its socket options. One can set the maximum capacity for "original data" traffic of the PGM channel and then test for the capacity of repair data.

            Somewhere the architect has to decide that a certain data rate becomes intolerable, e.g. if there is 90% packet loss, neither PGM nor TCP is going to be productive.
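
            One way to apply the rate limiting described above is to derive the ZMQ_RATE value from a target original-data message rate and then test how much repair traffic the link can absorb; a hedged sketch, where the message size, target rate, overhead factor and the "pub" socket are assumptions:

                const int msg_bytes  = 512;       /* assumed average payload size */
                const int msgs_per_s = 100000;    /* assumed target original-data message rate */

                /* kbit/s, with ~10% headroom for PGM/UDP/IP framing; repair-data capacity
                   still has to be validated by testing */
                int rate = (int) ((long long) msg_bytes * 8 * msgs_per_s / 1000 * 110 / 100);
                zmq_setsockopt (pub, ZMQ_RATE, &rate, sizeof rate);    /* set before zmq_connect() */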


              People

              • Assignee: Unassigned
              • Reporter: suilevap Pavel Raliuk
              • Votes: 1
              • Watchers: 5

                Dates

                • Created:
                • Updated:
                • Resolved: