[Xenomai] Xenomai-forge: Deadlock in finalize_thread()

Philippe Gerum rpm at xenomai.org
Tue Aug 5 12:35:39 CEST 2014


On 08/05/2014 09:37 AM, Kim De Mey wrote:
> 2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey at gmail.com>:
>>>>>>>>
>>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>>
>>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>>>>>>>
>>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>>> ...
>>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>>> deadlock would help.
>>>>>>>>>>>
>>>>>>>>>>> TIA,
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>>
>>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>
>>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>
>>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>
>>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>
>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>>
>>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>>
>>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>>> +#ifndef NDEBUG
>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>>  #else
>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>>  {
>>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>>> +       int ret;
>>>>>>>>>
>>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>>                 return;
>>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>>> free */
>>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>>          */
>>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>>> +       assert(ret == 0);
>>>>>>>>> +       (void)ret;
>>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>>
>>>>>>>>
>>>>>>>> And...the assert kicks in!
>>>>>>>>
>>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>>
>>>>>>>
>>>>>>> Philippe,
>>>>>>>
>>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>>
>>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>>> issue doesn't occur anymore.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>>>
>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>> index 05bb6cb..48aa032 100644
>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>>                 if (status & mask)
>>>>>>                         break;
>>>>>>                 oldstate = thobj->cancel_state;
>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>         }
>>>>>>
>>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>>          */
>>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>>                 oldstate = thobj->cancel_state;
>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>         }
>>>>>>
>>>>>
>>>>> I have tried this patch but it does not seem to make any difference.
>>>>> The threads are still deadlocked at the same location.
>>>>>
>>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>>> far. I'll check also with the newer version.
>>>>>
>>>>
>>>> Ok, thanks for testing. I would not bother testing with another
>>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>>> common race due to the async nature of the cancellation event which is
>>>> released while the caller holds its own lock without proper cleanup
>>>> actions. I'll dig deeper.
>>>>
>>>
>>> Okay, in case there is anything you want me to try out, don't hesitate
>>> to tell me.
>>>
>>> Thanks a lot already!
>>>
>>>
>>
>> I could reproduce the issue eventually on your exact test code, by
>> dropping the cpu affinity from the test options (--cpu-affinity) on a
>> multi-core system. The patch below should paper over the current
>> manifestation of this bug, but the root issue which bites here lives way
>> below this API layer, and is clearly SMP-related. Are you running a
>> multi-core board?
> 
> I have tried the test with your patch and the issue no longer occurs.
> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
> cores).
> The issue does not occur if I run the board with only 1 core activated.
> 

It looks like something may be wrong in the way glibc handles the
cancellability state of threads in SMP mode. Unless I'm completely
off-base reading the glibc code, this issue opens a nasty race with code
running SMP with async cancellation enabled, solely relying on
pthread_setcancelstate() to turn it off temporarily. Granted, async
cancellation is inherently unsafe, but still, something looks wrong.

Anyway, please pull
http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=7c4162315b8b970f9b82d53ddec8c910e15ff81a,
and let's see how this goes.

Thanks,

-- 
Philippe.




More information about the Xenomai mailing list