Unexpected behaviour of epgm pub/sub socket under intensive use
Description
Environment
Win7, x86/x64
Attachments
Activity
Steven McCoy May 28, 2012 at 1:46 PM
It is a known characteristic of reliable multicast systems. To send at full speed, i.e. at network saturation, some form of congestion control is required. OpenPGM does not ship with a supported congestion control mechanism.
PGMCC is one congestion control mechanism included within OpenPGM, but empirical evidence has shown that the protocol fails above 1,000 messages per second. Further academic research is required in this area for high-speed networks sending 10,000-1,000,000 messages per second.
In the absence of fully operational congestion control, the alternative is to implement rate limiting. ZeroMQ provides control of PGM rate limiting via its socket options. One can set the maximum capacity for "original data" traffic on the PGM channel and then test how much capacity remains for repair data.
At some point the architect has to decide that a certain data rate becomes intolerable, e.g. with 90% packet loss neither PGM nor TCP is going to be productive.
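As an illustration of the socket options mentioned above, here is a minimal sketch using the libzmq C API; the rate value, interface name and multicast group are placeholders, not recommendations:

    /* Sketch: cap the PGM "original data" rate before connecting a PUB socket.
     * ZMQ_RATE is in kilobits per second; ZMQ_RECOVERY_IVL is the maximum time
     * (ms) a receiver can be absent before unrecoverable loss occurs.
     * The interface name and multicast group below are placeholders. */
    #include <zmq.h>
    #include <assert.h>

    int main (void)
    {
        void *ctx = zmq_ctx_new ();
        void *pub = zmq_socket (ctx, ZMQ_PUB);

        int rate_kbps = 400000;       /* ~400 Mbit/s, e.g. 40% of a 1 GigE link */
        int recovery_ms = 10000;
        zmq_setsockopt (pub, ZMQ_RATE, &rate_kbps, sizeof rate_kbps);
        zmq_setsockopt (pub, ZMQ_RECOVERY_IVL, &recovery_ms, sizeof recovery_ms);

        /* Options must be set before connect for them to take effect on PGM. */
        int rc = zmq_connect (pub, "epgm://eth0;239.192.1.1:5555");
        assert (rc == 0);

        /* ... send messages ... */

        zmq_close (pub);
        zmq_ctx_term (ctx);
        return 0;
    }

The receiving SUB socket would typically set the same ZMQ_RATE and ZMQ_RECOVERY_IVL values before connecting, since the options apply to both send and receive sides of multicast transports.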
Pavel Raliuk May 28, 2012 at 9:17 AM
A sample was attached.
Does that mean the issue could not be reproduced?
PieterP May 28, 2012 at 9:05 AM
It is not clear what the cause is.
Pavel Raliuk May 28, 2012 at 7:23 AM
One network adapter is used for several networks.
Is it possible that insufficient network capacity could lead to such behaviour?
It does not seem OK that, after we stop sending messages, communication is still broken and recovers only after restarting the socket.
P.S.
As a result, we didn't have time to spend more on this problem and decided to use a custom solution without ZeroMQ.
Steven McCoy May 20, 2012 at 2:05 AM
That doesn't make sense; you mean provisioned to only 2-3 Gbps on a 10GigE NIC?
With recovery sharing the same network you probably want to limit to 40% of capacity. For high-bandwidth broadcast you ultimately need a separate NIC for recovery. You might have to be a little creative for such a configuration, although a few options spring to mind.
It seems that applications which use the 0MQ library intensively (>1k requests/sec) behave in an unexpected way.
When I try to send many messages (~1M) between 2 computers using epgm broadcast (pub/sub), at some moment (after >200-500k messages have been sent and received successfully) something goes wrong and one computer stops receiving all messages from the other. I tried to stop sending messages for a while and wait some time, but it was all useless.
Only if I restart one of the sockets (no matter which one: on the first or on the second computer) does everything work again at first, but then the problem repeats.
You can find a small test application in the attachment.
To test:
Run the application on 2 computers on the same local network.
Expected output:
"Send {0} messages"
and "Received multi {0}"
Real output:
At the beginning: as expected; after some time, only "Send {0} messages".
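The actual sample is the attached .NET application; purely to illustrate the traffic pattern described above, a rough equivalent using the libzmq C API might look like the sketch below (the epgm endpoint and the rate value are placeholders, since the settings used by the original sample are not known):

    /* Rough illustration of the repro pattern: a PUB sender pushing ~1M messages
     * over epgm and a SUB receiver printing what arrives.
     * Endpoint and rate are placeholders. */
    #include <zmq.h>
    #include <stdio.h>
    #include <string.h>

    #define ENDPOINT "epgm://192.168.0.10;239.192.1.1:5555"  /* placeholder */

    static void set_rate (void *sock)
    {
        int rate_kbps = 100000;   /* raised from the low default, for illustration */
        zmq_setsockopt (sock, ZMQ_RATE, &rate_kbps, sizeof rate_kbps);
    }

    static void run_sender (void *ctx)
    {
        void *pub = zmq_socket (ctx, ZMQ_PUB);
        set_rate (pub);
        zmq_connect (pub, ENDPOINT);
        for (int i = 0; i < 1000000; i++) {
            char buf [64];
            int len = snprintf (buf, sizeof buf, "multi %d", i);
            zmq_send (pub, buf, (size_t) len, 0);
            if (i % 1000 == 0)
                printf ("Send %d messages\n", i);
        }
        zmq_close (pub);
    }

    static void run_receiver (void *ctx)
    {
        void *sub = zmq_socket (ctx, ZMQ_SUB);
        set_rate (sub);
        zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0);   /* subscribe to everything */
        zmq_connect (sub, ENDPOINT);
        while (1) {
            char buf [64];
            int len = zmq_recv (sub, buf, sizeof buf - 1, 0);
            if (len < 0)
                break;
            if (len > (int) sizeof buf - 1)
                len = (int) sizeof buf - 1;            /* message was truncated */
            buf [len] = '\0';
            printf ("Received %s\n", buf);
        }
        zmq_close (sub);
    }

    int main (int argc, char *argv [])
    {
        void *ctx = zmq_ctx_new ();
        if (argc > 1 && strcmp (argv [1], "recv") == 0)
            run_receiver (ctx);
        else
            run_sender (ctx);
        zmq_ctx_term (ctx);
        return 0;
    }

Run one instance as the sender and one as the receiver ("recv" argument) on two machines; in the reported failure mode the receiver's output stops after a few hundred thousand messages while the sender keeps printing.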
P.S.
Since I use the .NET binding, I initially created an issue in the clrzmq queue (https://github.com/zeromq/clrzmq/issues/45), but they advised me to report the bug here.