Patchwork [PULL,1/3] iothread: fix iothread hang when stop too soon

login
register
mail settings
Submitter Stefan Hajnoczi
Date Feb. 12, 2019, 4:01 a.m.
Message ID <20190212040136.30371-2-stefanha@redhat.com>
Download mbox | patch
Permalink /patch/723499/
State New
Headers show

Comments

Stefan Hajnoczi - Feb. 12, 2019, 4:01 a.m.
From: Peter Xu <peterx@redhat.com>

Lukas reported an hard to reproduce QMP iothread hang on s390 that
QEMU might hang at pthread_join() of the QMP monitor iothread before
quitting:

  Thread 1
  #0  0x000003ffad10932c in pthread_join
  #1  0x0000000109e95750 in qemu_thread_join
      at /home/thuth/devel/qemu/util/qemu-thread-posix.c:570
  #2  0x0000000109c95a1c in iothread_stop
  #3  0x0000000109bb0874 in monitor_cleanup
  #4  0x0000000109b55042 in main

While the iothread is still in the main loop:

  Thread 4
  #0  0x000003ffad0010e4 in ??
  #1  0x000003ffad553958 in g_main_context_iterate.isra.19
  #2  0x000003ffad553d90 in g_main_loop_run
  #3  0x0000000109c9585a in iothread_run
      at /home/thuth/devel/qemu/iothread.c:74
  #4  0x0000000109e94752 in qemu_thread_start
      at /home/thuth/devel/qemu/util/qemu-thread-posix.c:502
  #5  0x000003ffad10825a in start_thread
  #6  0x000003ffad00dcf2 in thread_start

IMHO it's because there's a race between the main thread and iothread
when stopping the thread in following sequence:

    main thread                       iothread
    ===========                       ==============
                                      aio_poll()
    iothread_get_g_main_context
      set iothread->worker_context
    iothread_stop
      schedule iothread_stop_bh
                                        execute iothread_stop_bh [1]
                                          set iothread->running=false
                                          (since main_loop==NULL so
                                           skip to quit main loop.
                                           Note: although main_loop is
                                           NULL but worker_context is
                                           not!)
                                      atomic_read(&iothread->worker_context) [2]
                                        create main_loop object
                                        g_main_loop_run() [3]
    pthread_join() [4]

We can see that when execute iothread_stop_bh() at [1] it's possible
that main_loop is still NULL because it's only created until the first
check of the worker_context later at [2].  Then the iothread will hang
in the main loop [3] and it'll starve the main thread too [4].

Here the simple solution should be that we check again the "running"
variable before check against worker_context.

CC: Thomas Huth <thuth@redhat.com>
CC: Dr. David Alan Gilbert <dgilbert@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Lukáš Doktor <ldoktor@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Lukáš Doktor <ldoktor@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Message-id: 20190129051432.22023-1-peterx@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 iothread.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Patch

diff --git a/iothread.c b/iothread.c
index 2fb1cdf55d..e615b7ae52 100644
--- a/iothread.c
+++ b/iothread.c
@@ -63,7 +63,11 @@  static void *iothread_run(void *opaque)
     while (iothread->running) {
         aio_poll(iothread->ctx, true);
 
-        if (atomic_read(&iothread->worker_context)) {
+        /*
+         * We must check the running state again in case it was
+         * changed in previous aio_poll()
+         */
+        if (iothread->running && atomic_read(&iothread->worker_context)) {
             GMainLoop *loop;
 
             g_main_context_push_thread_default(iothread->worker_context);