[Xenomai] Xenomai-forge: Deadlock in finalize_thread()

Kim De Mey kim.demey at gmail.com
Tue Aug 5 16:16:53 CEST 2014


2014-08-05 12:38 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
> On 08/05/2014 12:35 PM, Philippe Gerum wrote:
>> On 08/05/2014 09:37 AM, Kim De Mey wrote:
>>> 2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>>>>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey at gmail.com>:
>>>>>>>>>>
>>>>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>>>>> deadlock would help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> TIA,
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Below are the backtraces of all the threads. This is a case with two
>>>>>>>>>>>> "worker" tasks in deadlock; the "main" and "create_delete" tasks kept
>>>>>>>>>>>> running their "tm_wkafter" loops.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>
>>>>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>>
>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see
>>>>>>>>>>> this happening would be that the thread prologue somehow gets hit by an
>>>>>>>>>>> async cancellation signal while holding its own lock. If so, the patch
>>>>>>>>>>> below would cause an assertion to trigger in that circumstance. You will
>>>>>>>>>>> have to turn on --enable-assert in your configuration, while leaving
>>>>>>>>>>> debugging entirely off so as not to significantly affect the current
>>>>>>>>>>> timings.
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>>>>
>>>>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>>>>> +#ifndef NDEBUG
>>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>>>>  #else
>>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>>>>  {
>>>>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>>>>> +       int ret;
>>>>>>>>>>>
>>>>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>>>>                 return;
>>>>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>>>>> free */
>>>>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>>>>          */
>>>>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>>>>> +       assert(ret == 0);
>>>>>>>>>>> +       (void)ret;
>>>>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> And...the assert kicks in!
>>>>>>>>>>
>>>>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
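
For context, and as far as I understand the patch above: with
--enable-assert the thread lock becomes a PTHREAD_MUTEX_ERRORCHECK mutex,
and an errorcheck mutex refuses a second lock by its current owner with
EDEADLK instead of blocking forever, which is what lets the assertion
fire rather than hang silently. A minimal standalone illustration of that
behaviour (hypothetical code, not taken from the Xenomai tree):

#include <assert.h>
#include <errno.h>
#include <pthread.h>

int main(void)
{
        pthread_mutexattr_t attr;
        pthread_mutex_t m;
        int ret;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&m, &attr);

        pthread_mutex_lock(&m);
        ret = pthread_mutex_lock(&m);   /* relock by the current owner */
        assert(ret == EDEADLK);         /* a NORMAL mutex would hang here */

        pthread_mutex_unlock(&m);
        pthread_mutex_destroy(&m);
        pthread_mutexattr_destroy(&attr);
        return 0;
}

With the default PTHREAD_MUTEX_NORMAL type the second lock simply blocks,
which is the silent deadlock seen in the backtraces.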
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Philippe,
>>>>>>>>>
>>>>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>>>>
>>>>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>>>>> issue doesn't occur anymore.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> I have not been able to reproduce this race yet, but the following patch would still make sense:
>>>>>>>>
>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>> index 05bb6cb..48aa032 100644
>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>>>>                 if (status & mask)
>>>>>>>>                         break;
>>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>>         }
>>>>>>>>
>>>>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>>>>          */
>>>>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>>         }
>>>>>>>>
>>>>>>>
>>>>>>> I have tried this patch but it does not seem to make any difference.
>>>>>>> The threads are still deadlocked at the same location.
>>>>>>>
>>>>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>>>>> far; I'll also check with the newer version.
>>>>>>>
>>>>>>
>>>>>> Ok, thanks for testing. I would not bother testing with another
>>>>>> toolchain; I don't buy the toolchain-bug theory in this case. This looks
>>>>>> like a common race due to the async nature of the cancellation event,
>>>>>> which is delivered while the caller holds its own lock, without proper
>>>>>> cleanup actions in place. I'll dig deeper.
>>>>>>
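To make the suspected race more concrete, here is a minimal POSIX sketch
of the failure mode as I understand it (hypothetical code, not the actual
copperplate implementation; finalize() and worker() are made-up stand-ins
for finalize_thread() and the task body). If an asynchronous cancel lands
while the victim still owns a mutex that one of its TSD destructors
re-acquires, the dying thread deadlocks against itself inside
__nptl_deallocate_tsd(), which is exactly where the backtraces above are
stuck. Whether this sample actually hangs on a given run depends on where
the cancel happens to land:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t key;

/* TSD destructor, standing in for finalize_thread(): it runs from
   __nptl_deallocate_tsd() while the cancelled thread unwinds. */
static void finalize(void *p)
{
        (void)p;
        pthread_mutex_lock(&lock);   /* self-deadlock if this (dying)
                                        thread still owns 'lock' */
        pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
        int oldtype;

        pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &oldtype);
        pthread_setspecific(key, arg);   /* non-NULL, so the destructor runs */

        for (;;) {
                pthread_mutex_lock(&lock);
                /* An async cancel landing anywhere in this window leaves
                   'lock' owned by the dying thread when finalize() runs. */
                pthread_mutex_unlock(&lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_key_create(&key, finalize);
        pthread_create(&tid, NULL, worker, (void *)1);
        sleep(1);
        pthread_cancel(tid);        /* may land while 'lock' is held */
        pthread_join(tid, NULL);    /* hangs if finalize() deadlocked */
        puts("no deadlock on this run");
        return 0;
}

This would also match the earlier observation that building with
CONFIG_XENO_ASYNC_CANCEL off makes the problem go away: with deferred
cancellation the cancel is only acted upon at well-defined cancellation
points, and there is none inside the locked section here.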
>>>>>
>>>>> Okay, in case there is anything you want me to try out, don't hesitate
>>>>> to tell me.
>>>>>
>>>>> Thanks a lot already!
>>>>>
>>>>>
>>>>
>>>> I could eventually reproduce the issue with your exact test code, by
>>>> dropping the cpu affinity setting (--cpu-affinity) from the test options
>>>> on a multi-core system. The patch below should paper over the current
>>>> manifestation of this bug, but the root issue that bites here lives way
>>>> below this API layer and is clearly SMP-related. Are you running a
>>>> multi-core board?
>>>
>>> I have tried the test with your patch and the issue no longer occurs.
>>> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
>>> cores).
>>> The issue does not occur if I run the board with only 1 core activated.
>>>
>>
>> It looks like something may be wrong in the way glibc handles the
>> cancellability state of threads in SMP mode. Unless I'm completely
>> off-base reading the glibc code, this opens a nasty race for code running
>> SMP with async cancellation enabled that relies solely on
>> pthread_setcancelstate() to turn it off temporarily. Granted, async
>> cancellation is inherently unsafe, but still, something looks wrong.
>>
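For reference, the pattern being discussed looks roughly like this (my
own hypothetical sketch, not the actual copperplate code): the only thing
shielding the locked region from an asynchronous cancel is the
cancellability state switch, so if the switch to PTHREAD_CANCEL_DISABLE
is not honoured reliably for a cancel issued from another CPU, the thread
can still be unwound inside the window with the lock held.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void guarded_section(void)
{
        int oldstate;

        /* Rely solely on the cancellability state to keep an async
           cancel out of the locked section. */
        pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
        pthread_mutex_lock(&lock);

        /* ... work assumed to be safe from cancellation ... */

        pthread_mutex_unlock(&lock);
        pthread_setcancelstate(oldstate, NULL);
}

If a cancel can slip through anyway, the conventional defence is what the
earlier patch adds around pthread_cond_wait(): register a cleanup handler
(push_cleanup_lock()/pop_cleanup_lock(), presumably a pthread_cleanup_push()
of the matching unlock) so the lock is released even when the thread is
cancelled inside the window.
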
>> Anyway, please pull
>> http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=7c4162315b8b970f9b82d53ddec8c910e15ff81a,
>> and let's see how this goes.
>>
>> Thanks,
>>
>
> This one fixes the commit log,
>
> http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=75cc5dac959c06bac3a5730ae279ae7a0a4ad271
>

I have pulled this revision and the test is now good. I will continue
working with this revision.

Thanks for the work-around. Is the bug already known to the glibc
developers?



