[Xenomai] Xenomai-forge: Deadlock in finalize_thread()

Philippe Gerum rpm at xenomai.org
Tue Aug 5 16:33:26 CEST 2014


On 08/05/2014 04:16 PM, Kim De Mey wrote:
> 2014-08-05 12:38 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>> On 08/05/2014 12:35 PM, Philippe Gerum wrote:
>>> On 08/05/2014 09:37 AM, Kim De Mey wrote:
>>>> 2014-08-04 19:06 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>>>>>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>>>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.demey at gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <rpm at xenomai.org>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>>>>>> deadlock would help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> TIA,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315
>>>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315
>>>>>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310
>>>>>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>>>>>
>>>>>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could see this
>>>>>>>>>>>> happening would be that the thread prologue somehow gets hit by an async
>>>>>>>>>>>> cancellation signal while holding its own lock. If so, then the patch below
>>>>>>>>>>>> would cause an assertion to trigger in such circumstance. You will have to
>>>>>>>>>>>> turn on --enable-assert in your configuration, leaving debugging entirely
>>>>>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>>>>>
>>>>>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>>>>>> +#ifndef NDEBUG
>>>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>>>>>  #else
>>>>>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>>>>>  {
>>>>>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>>>>>> +       int ret;
>>>>>>>>>>>>
>>>>>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>>>>>                 return;
>>>>>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock
>>>>>>>>>>>> free */
>>>>>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>>>>>          */
>>>>>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>>>>>> +       assert(ret == 0);
>>>>>>>>>>>> +       (void)ret;
>>>>>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> And...the assert kicks in!
>>>>>>>>>>>
>>>>>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Philippe,
>>>>>>>>>>
>>>>>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>>>>>
>>>>>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>>>>>> issue doesn't occur anymore.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have not been able to reproduce this race yet, but still the following patch would make sense:
>>>>>>>>>
>>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>>> index 05bb6cb..48aa032 100644
>>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, int mask)
>>>>>>>>>                 if (status & mask)
>>>>>>>>>                         break;
>>>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* thobj->lock held */
>>>>>>>>>          */
>>>>>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>>>>>                 oldstate = thobj->cancel_state;
>>>>>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>>>>>                 __threadobj_tag_locked(thobj);
>>>>>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>>>>>                 thobj->cancel_state = oldstate;
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>
>>>>>>>> I have tried this patch but it does not seem to make any difference.
>>>>>>>> The threads are still deadlocked at the same location.
>>>>>>>>
>>>>>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>>>>>> far. I'll check also with the newer version.
>>>>>>>>
>>>>>>>
>>>>>>> Ok, thanks for testing. I would not bother testing with another
>>>>>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>>>>>> common race due to the async nature of the cancellation event which is
>>>>>>> released while the caller holds its own lock without proper cleanup
>>>>>>> actions. I'll dig deeper.
>>>>>>>
>>>>>>
>>>>>> Okay, in case there is anything you want me to try out, don't hesitate
>>>>>> to tell me.
>>>>>>
>>>>>> Thanks a lot already!
>>>>>>
>>>>>>
>>>>>
>>>>> I could reproduce the issue eventually on your exact test code, by
>>>>> dropping the cpu affinity from the test options (--cpu-affinity) on a
>>>>> multi-core system. The patch below should paper over the current
>>>>> manifestation of this bug, but the root issue which bites here lives way
>>>>> below this API layer, and is clearly SMP-related. Are you running a
>>>>> multi-core board?
>>>>
>>>> I have tried the test with your patch and the issue no longer occurs.
>>>> Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
>>>> cores).
>>>> The issue does not occur if I run the board with only 1 core activated.
>>>>
>>>
>>> It looks like something may be wrong in the way glibc handles the
>>> cancellability state of threads in SMP mode. Unless I'm completely
>>> off-base reading the glibc code, this issue opens a nasty race with code
>>> running SMP with async cancellation enabled, solely relying on
>>> pthread_setcancelstate() to turn it off temporarily. Granted, async
>>> cancellation is inherently unsafe, but still, something looks wrong.
>>>
>>> Anyway, please pull
>>> http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=7c4162315b8b970f9b82d53ddec8c910e15ff81a,
>>> and let's see how this goes.
>>>
>>> Thanks,
>>>
>>
>> This one fixes the commit log,
>>
>> http://git.xenomai.org/xenomai-forge.git/commit/?h=next&id=75cc5dac959c06bac3a5730ae279ae7a0a4ad271
>>
> 
> Pulled to this revision and the test with this revision is good. I
> will continue to work with this revision now.
> 
> Thanks for the work-around. Is the bug known yet at glibc?
> 

I did not find any mention of such issue, only an old nptl bug with
PTHREAD_CANCEL_DISABLE which surfaced in 2004, and obviously fixed many
moons ago. I have to create a self-contained and obvious enough test to
report it, which does not seem trivial.

-- 
Philippe.




More information about the Xenomai mailing list