Multiple bind() calls to the same ipc address cause silent bad behaviour.
Description
Environment
Linux, probably all other unix.
Activity
PieterP October 8, 2013 at 8:32 AM
Martin Sustrik September 28, 2011 at 10:36 AM
Yes. There's an abstract namespace for names starting with binary zero. However, it's a Linux-specific behaviour. Maybe it can be implemented as an optimisation for Linux, while leaving the other platforms intact. Would you like to give it a try?
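For reference, a minimal sketch of what binding in the Linux abstract namespace could look like with plain AF_UNIX sockets. This is not libzmq code; the helper name `bind_abstract` is made up for illustration, and the whole thing assumes Linux. The name is placed after a leading zero byte in `sun_path`, so no file is created and there is nothing to unlink.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: bind a listening stream socket to `name` in the
 * Linux abstract namespace. Returns the fd, or -1 with errno set. */
int bind_abstract(const char *name)
{
    struct sockaddr_un addr;
    size_t len = strlen(name);

    /* One byte of sun_path is used by the leading zero. */
    assert(len > 0 && len < sizeof addr.sun_path);

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    /* sun_path[0] stays 0: that selects the abstract namespace. */
    memcpy(addr.sun_path + 1, name, len);

    /* The address length must cover exactly the zero byte plus the name. */
    socklen_t addrlen =
        (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + len);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (bind(fd, (struct sockaddr *)&addr, addrlen) < 0 ||
        listen(fd, 1) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

With this, a second `bind_abstract("some-name")` while the first socket is still open should fail with EADDRINUSE, and the name becomes free again as soon as the owning socket is closed, with no unlink involved.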
Adrian Ratnapala September 28, 2011 at 10:29 AM
Hi Pieter, I've started an example for you, but won't get back to it until I finish with my day job today.
Martin, I think that Unix domain sockets under Linux can be created under some name space that is not attached to the file system, and therefore presumably disappear when closed. I don't know if this is practical or not, and I don't know what support other OSes have.
Martin Sustrik September 28, 2011 at 8:25 AM
I have no idea what the best solution is either.
If we don't unlink the file, failed applications will leave dangling endpoints, i.e. an attempt to restart the application would fail with EADDRINUSE.
If we do unlink the file we get the bad behaviour as described above.
Maybe there's a way to create temporary files with lifetime bound to the lifetime of the owner process?
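To make the dangling-endpoint case concrete, here is a small sketch using plain AF_UNIX sockets rather than libzmq. The function names `bind_no_unlink` and `rebind_after_crash` are hypothetical. It binds without unlinking first, simulates a crash by closing the socket without removing the file, and then tries to bind again.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: bind to `path` WITHOUT unlinking it first.
 * Returns the fd, or -1 with errno set. */
int bind_no_unlink(const char *path)
{
    struct sockaddr_un addr;
    assert(strlen(path) < sizeof addr.sun_path);

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Simulates "server crashes, then restarts". Returns 0 if the restart
 * could bind, otherwise the errno from the failed bind. */
int rebind_after_crash(const char *path)
{
    unlink(path);                 /* clean slate for the demo only */

    int fd = bind_no_unlink(path);
    if (fd < 0)
        return errno;
    close(fd);                    /* "crash": socket gone, file stays */

    int fd2 = bind_no_unlink(path);
    int err = (fd2 < 0) ? errno : 0;
    if (fd2 >= 0)
        close(fd2);

    unlink(path);                 /* tidy up after the demo */
    return err;
}
```

On Linux the second bind is expected to fail with EADDRINUSE, because the socket file outlives the socket that created it.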
PieterP September 27, 2011 at 9:46 PM
Adrian, would you create a minimal test case? Thanks.
If two processes bind() to the same IPC address, the second one will not immediately fail with EADDRINUSE. Instead it will appear to work correctly, but can (will?) cause hangups once one of the sockets is closed.
The reason we get no error is that zmq_bind() deletes any file that potentially conflicts with the given address. This is bad, because an unlink is also done at close. So you can get:
server1: zmq_bind()
  - creates a socket and also a file NODE1
server2: zmq_bind()
  - unlinks NODE1, which continues to exist but has no name
  - creates a socket and also a file NODE2, no error
server1: close()
  - closes NODE1, but unlinks NODE2! BADNESS
I am not sure what the best fix is. Presumably dangling nodes are a real problem (that's why the first unlink is done).
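The sequence can be reproduced without libzmq. The sketch below only imitates the unlink-before-bind and unlink-at-close behaviour described here, using plain AF_UNIX sockets in a single process; the names `bind_with_unlink`, `close_with_unlink`, `try_connect` and `reproduce_race` are made up for the example and are not libzmq functions.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill in a sockaddr_un for a filesystem path. */
static void make_addr(struct sockaddr_un *addr, const char *path)
{
    assert(strlen(path) < sizeof addr->sun_path);
    memset(addr, 0, sizeof *addr);
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);
}

/* Imitates the bind side: unlink whatever is at `path`, then bind. */
int bind_with_unlink(const char *path)
{
    struct sockaddr_un addr;
    make_addr(&addr, path);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    unlink(path);   /* silently removes another server's node */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Imitates the close side: close the socket, then unlink the path. */
void close_with_unlink(int fd, const char *path)
{
    close(fd);
    unlink(path);   /* may remove a node this socket never created */
}

/* Returns 0 if a client can connect to `path`, otherwise errno. */
int try_connect(const char *path)
{
    struct sockaddr_un addr;
    make_addr(&addr, path);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;
    int err = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0)
        err = errno;
    close(fd);
    return err;
}

/* server1 binds, server2 binds, server1 closes. Returns what a new
 * client sees afterwards (0 or an errno value), or -1 on setup failure. */
int reproduce_race(const char *path)
{
    int s1 = bind_with_unlink(path);   /* creates NODE1 */
    int s2 = bind_with_unlink(path);   /* unlinks NODE1, creates NODE2 */
    if (s1 < 0 || s2 < 0)
        return -1;

    close_with_unlink(s1, path);       /* unlinks NODE2 */
    int err = try_connect(path);       /* server2 is still listening */
    close(s2);
    return err;
}
```

After server1 closes, server2 still holds a listening socket, but new clients should get ENOENT because the path it was bound to no longer exists.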