libzmq / LIBZMQ-323

Unexpected behaviour of epgm pub/sub socket under intensive use

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: 2.1.11, 3.2.0
    • Fix Version/s: None
    • Component/s: core
    • Labels:
    • Environment: Win7, x86/x64

      Description

      It seems that applications which use the 0MQ library intensively (>1k requests/sec) behave in an unexpected way.

      When I try to send many messages (~1M) between 2 computers using epgm broadcast (pub/sub), at some moment (after >200-500k messages have been sent and received successfully) something goes wrong and one computer stops receiving all messages from the other. I tried stopping sending for a while and waiting some time; it was all useless.
      Only if I restart one socket (no matter which one: on the first or on the second computer) does everything work again at first, but then the problem repeats.

      A small test application can be found in the attachment.
      To test:
      Run the application on 2 computers on the same local network.
      Expected output:
      "Send {0} messages" and "Received multi {0}"
      Actual output:
      At the beginning, as expected; after some time, only "Send {0} messages".
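
      A minimal sketch of the kind of test described above, written against the plain libzmq 3.2-era C API rather than the attached clrzmq/.NET sample; the interface name, multicast group, port and message counts are placeholders, and libzmq must be built with OpenPGM support:

      /* pgm_test.c - rough approximation of the attached test (assumptions:
       * 3.2-era C API, placeholder endpoint, libzmq built with --with-pgm).
       * Build: gcc pgm_test.c -lzmq -o pgm_test
       * Run "pgm_test pub" on one machine and "pgm_test" on the other. */
      #include <stdio.h>
      #include <string.h>
      #include <zmq.h>

      #define ENDPOINT "epgm://eth0;239.192.1.1:5555"   /* placeholder NIC/group */

      int main (int argc, char *argv[])
      {
          void *ctx = zmq_ctx_new ();
          int rate = 1000000;              /* ZMQ_RATE is in kbit/s: ~1 Gbps */

          if (argc > 1 && strcmp (argv[1], "pub") == 0) {
              void *pub = zmq_socket (ctx, ZMQ_PUB);
              zmq_setsockopt (pub, ZMQ_RATE, &rate, sizeof rate);
              zmq_connect (pub, ENDPOINT);
              for (long i = 1; i <= 1000000; i++) {     /* ~1M messages */
                  char buf [64];
                  int len = snprintf (buf, sizeof buf, "msg %ld", i);
                  zmq_send (pub, buf, len, 0);
                  if (i % 10000 == 0)
                      printf ("Send %ld messages\n", i);
              }
              zmq_close (pub);
          } else {
              void *sub = zmq_socket (ctx, ZMQ_SUB);
              zmq_setsockopt (sub, ZMQ_RATE, &rate, sizeof rate);
              zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0);   /* all messages */
              zmq_connect (sub, ENDPOINT);
              for (long i = 1; ; i++) {
                  char buf [64];
                  zmq_recv (sub, buf, sizeof buf, 0);
                  if (i % 10000 == 0)
                      printf ("Received multi %ld\n", i);
              }
          }
          zmq_ctx_destroy (ctx);
          return 0;
      }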

      P.S.
      Since I use a .NET binding, I initially created this issue in the clrzmq issue tracker (https://github.com/zeromq/clrzmq/issues/45), but they advised me to report the bug here.

        Activity

        Pavel Raliuk added a comment -

        Is there any update on this issue?

        Pavel Raliuk added a comment - edited

        The issue was reproduced in clrzmq 3.0.0-alpha1 (libzmq 3.1).

        Martin Hurton added a comment -

        You set the maximum multicast data rate on the socket to 1 Gbps.
        Are you sure your network can manage this traffic?

        Pavel Raliuk added a comment -

        Yes, all networks run at about 2-3 Gbps.

        Steven McCoy added a comment -

        That doesn't make sense; do you mean provisioned to only 2-3 Gbps on a 10GigE NIC?

        With recovery sharing the same network you probably want to limit to 40% of capacity. For high-bandwidth broadcast you ultimately need a separate NIC for recovery. You might have to be a little creative for such a configuration, although a few options spring to mind.
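
        For reference, the network interface an epgm socket uses is chosen in the endpoint string itself ("epgm://interface;multicast-group:port"). A minimal sketch, assuming the 3.2-era C API and placeholder interface/group names; note that splitting repair traffic onto a second NIC is not a simple socket option, so this only shows how a particular adapter is selected:

        /* Sketch: selecting the NIC for PGM traffic via the endpoint string.
         * Interface names and multicast groups below are placeholders. */
        #include <zmq.h>

        void *open_pub_on_interface (void *ctx, const char *endpoint)
        {
            /* e.g. endpoint = "epgm://eth1;239.192.1.1:5555" to pin this
             * socket's PGM traffic to the eth1 adapter (an interface name
             * or that adapter's unicast IP address can be used). */
            void *pub = zmq_socket (ctx, ZMQ_PUB);
            zmq_connect (pub, endpoint);
            return pub;
        }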

        Pavel Raliuk added a comment -

        One network adapter is used for several networks.

        Is it possible that insufficient network capacity can lead to such behaviour?
        It doesn't seem OK that, after we stop sending messages, communication is still broken and only recovers after restarting the socket.

        P.S.
        In the end we didn't have time to spend more on this problem and decided to use a custom solution without ZeroMQ.

        Pieter Hintjens added a comment -

        Not clear what the cause is.

        Pavel Raliuk added a comment -

        A sample was attached.
        Does this mean the issue could not be reproduced?

        Steven McCoy added a comment -

        It is a known characteristic of reliable multicast systems. To send at full speed, i.e. network saturation, some form of congestion control is required. OpenPGM does not ship with a supported congestion control mechanism.

        PGMCC is one congestion control mechanism included with OpenPGM, but empirical evidence has shown that the protocol fails above 1,000 messages per second. Further academic research is required in this area for high-speed networks sending 10,000-1,000,000 messages per second.

        In the absence of fully operational congestion control, the alternative is to implement rate limiting. ZeroMQ provides control of PGM rate limiting via its socket options. One can set the maximum capacity for "original data" traffic of the PGM channel and then test the capacity available for repair data.

        Somewhere the architect has to decide that a certain data rate becomes intolerable; e.g. if there is 90% packet loss, neither PGM nor TCP is going to be productive.
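
        A minimal sketch of that rate-limiting approach, assuming the 3.2-era C API; the 40% cap, recovery interval, interface and multicast group below are placeholders rather than recommended values:

        /* Cap "original data" traffic and bound how long repair data is kept.
         * ZMQ_RATE is in kilobits per second; ZMQ_RECOVERY_IVL is in
         * milliseconds on 3.x (it was seconds on 2.x). */
        #include <zmq.h>

        void *open_rate_limited_pub (void *ctx)
        {
            void *pub = zmq_socket (ctx, ZMQ_PUB);

            /* Provisioned capacity ~1 Gbps; leave headroom for NAK/repair
             * traffic by capping original data at roughly 40% of it. */
            int rate = 400000;          /* kbit/s */
            zmq_setsockopt (pub, ZMQ_RATE, &rate, sizeof rate);

            /* How much history the sender keeps for retransmission requests. */
            int recovery_ivl = 10000;   /* ms */
            zmq_setsockopt (pub, ZMQ_RECOVERY_IVL, &recovery_ivl, sizeof recovery_ivl);

            zmq_connect (pub, "epgm://eth0;239.192.1.1:5555");  /* placeholder */
            return pub;
        }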


          People

          • Assignee: Unassigned
          • Reporter: Pavel Raliuk
          • Votes: 1
          • Watchers: 5

            Dates

            • Created:
            • Updated:
            • Resolved: