libzmq / LIBZMQ-281

Crash on heavy socket creation: Device or resource busy (mutex.hpp:91)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      zeromq 2.1.9, 2.1.10
      FreeBSD 8.2, max-fds 11095
      CentOS 5.5 (2.6.18-194.11.4.el5), max-fds 1024

      Description

      Under heavy socket creation I hit an assert when pthread_mutex_destroy is called in mutex.hpp.

      Attached is a piece of code which demonstrates this.
      Compiled with: gcc -O zmqtest.c -o zmqtest -lzmq -lpthread

      When tested on a Linux box:

      ./zmqtest
      ...
      pong 18200
      Device or resource busy (mutex.hpp:91)
      zsh: abort (core dumped) ./zmqtest

      On repeated tests the counter has reached anywhere from 9k to 70k before crashing, and it crashes every time.

      On the FreeBSD box I have a harder time reproducing it; in some cases zmq_socket returns NULL in the hammer threads instead (with errno 24: Too many open files). It is still reproducible, though, with the counter reaching anywhere from 80k to 200k before crashing.

      Note that the Too many open files problem never occurs on the Linux box: I always hit the assert first, even though the max-fd limit is much lower there (see Environment).

      This problem was noticed in a scenario where a local thread REQ'ed via inproc to a ROUTER thread and then closed the socket. When benchmarking and hammer-testing this code I found this problem. In my code I've solved this by keeping the socket in a thread-local container.
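      For reference, the message "Device or resource busy" is the Linux text for EBUSY, which pthread_mutex_destroy may report when the mutex is still locked. The following is a minimal sketch of that condition only; it is not the attached zmqtest.c, and the function name destroy_while_locked is hypothetical. POSIX leaves destroying a locked mutex undefined, so the EBUSY result is an assumption that holds for glibc-style implementations and may not hold elsewhere.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: tries to destroy a mutex that is still locked and
   returns whatever pthread_mutex_destroy reports. glibc is expected to
   return EBUSY; other C libraries may return 0 (behavior is undefined
   by POSIX). */
int destroy_while_locked(void)
{
    pthread_mutex_t m;
    int rc;

    rc = pthread_mutex_init(&m, NULL);
    assert(rc == 0);
    rc = pthread_mutex_lock(&m);
    assert(rc == 0);

    /* The mutex is held here, as it would be if another thread were
       still inside a method of the object that owns it. */
    int busy = pthread_mutex_destroy(&m);

    if (busy != 0) {
        /* Destruction was refused: release the lock and destroy properly. */
        pthread_mutex_unlock(&m);
        pthread_mutex_destroy(&m);
    }
    return busy;
}
```

      Calling destroy_while_locked() from a test program built with -lpthread and comparing the result against EBUSY shows whether the local C library reports the condition at all.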

        Activity

        Pieter Hintjens added a comment -

        Mika, if you find the cause and can make a successful patch, send me a pull request for 2-1 and I'll get that into the next release. Thanks!

        Mika Fischer added a comment -

        I worked around the symptoms of this issue: https://github.com/mika-fischer/zeromq2-1/commit/601d8167f025dad48a881334121073b80bd17a74

        If you don't want the RAII stuff I can send a patch without it tomorrow.

        However, I really think there's a deeper problem here as it seems very bad to me that other threads are still running methods of an object that is in the process of being destroyed. The patch waits until they're finished but the better fix would be to:
        1) not destroy an object that's still in use, and
        2) not call methods of an object that's about to be destroyed.

        Unfortunately, I don't understand the ZMQ codebase nearly enough to figure out what's actually going on here, so I can't really help fix the actual cause of this issue. OTOH, if it's fixed in 3.1, and 2.1 will be deprecated at some point, maybe this is good enough...
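        The idea of waiting until other threads have finished before destroying can be sketched in plain pthreads as follows. This is an illustration under stated assumptions, not the code from the linked commit: struct guarded, guarded_send, guarded_destroy and run_demo are all hypothetical names. The destroy step takes and releases the lock once, so any thread already inside the locked section has left before pthread_mutex_destroy runs. It does not stop a thread from entering afterwards, which is the deeper problem described above.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an object whose methods take an internal lock. */
struct guarded {
    pthread_mutex_t sync;
    atomic_int entered;   /* set once the worker holds the lock */
    int sent;             /* written by the worker under the lock */
};

/* Hypothetical "method" run by another thread while the owner shuts down. */
static void *guarded_send(void *arg)
{
    struct guarded *g = arg;
    pthread_mutex_lock(&g->sync);
    atomic_store(&g->entered, 1);
    for (int i = 0; i < 100; i++)
        sched_yield();    /* simulate work while holding the lock */
    g->sent = 1;
    pthread_mutex_unlock(&g->sync);
    return NULL;
}

/* Waits for any thread currently holding the lock, then destroys it.
   Returns the result of pthread_mutex_destroy. */
static int guarded_destroy(struct guarded *g)
{
    pthread_mutex_lock(&g->sync);
    pthread_mutex_unlock(&g->sync);
    return pthread_mutex_destroy(&g->sync);
}

/* Returns 1 if the lock was destroyed cleanly after the worker finished. */
int run_demo(void)
{
    struct guarded g;
    pthread_t t;
    int rc;

    g.sent = 0;
    atomic_init(&g.entered, 0);
    rc = pthread_mutex_init(&g.sync, NULL);
    assert(rc == 0);
    rc = pthread_create(&t, NULL, guarded_send, &g);
    assert(rc == 0);

    /* Make sure the worker is inside the locked section before destroying. */
    while (!atomic_load(&g.entered))
        sched_yield();

    rc = guarded_destroy(&g);
    pthread_join(t, NULL);
    return rc == 0 && g.sent == 1;
}
```

        Removing the lock/unlock pair from guarded_destroy turns this into a destroy-while-locked case, which is the situation the assert in this report complains about.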

        Pieter Hintjens added a comment -

        Maybe you can discuss your patch on the list, see if people want it in 2.1. It's worth IMO making that version as stable as we can, even if we don't always fix the core issues. 2.1 will be deprecated in six months or so, as 3.1 becomes stable.

        Mika Fischer added a comment -

        I've sent a pull request with just the workaround: https://github.com/zeromq/zeromq2-1/pull/32

        Please apply this to 2.1. I don't think the RAII stuff needs to go into 2.1, I'll have a look whether it should be done in 3.1.

        Do you know when you'll release the next version of ZMQ 2.1.x? We need to know whether to include a patched version of ZMQ in our product.

        Pieter Hintjens added a comment -

        Mika, thanks for that pull request. I've applied it to master. We'll make a new release of 2.1 RSN, probably this weekend.


          People

          • Assignee:
            Pieter Hintjens
            Reporter:
            Johan Ström
          • Votes:
            1
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved: