Tools for Multithreaded Programming on Linux

Helgrind and DRD

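Both tools are part of Valgrind and serve the same purpose: detecting threading errors. They use different detection strategies, however, so running them in turn can help reveal hidden bugs.
The following example is copied from Binary Hacks: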

#include <pthread.h>
static int count = 1;
/* Bug 1: count++ is an unsynchronized read-modify-write, so two threads
   running this function race on count. */
void *incr_count(void *p) {
        count++;
        return 0;
}
static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;
/* Acquires m1, then m2. */
void *lock_m1_then_m2(void *p) {
        pthread_mutex_lock(&m1);
        pthread_mutex_lock(&m2);
        pthread_mutex_unlock(&m2);
        pthread_mutex_unlock(&m1);
        return 0;
}
/* Bug 2: acquires the same locks in the opposite order (m2, then m1).
   If each thread grabs its first lock, both wait forever (deadlock). */
void *lock_m2_then_m1(void *p) {
        pthread_mutex_lock(&m2);
        pthread_mutex_lock(&m1);
        pthread_mutex_unlock(&m1);
        pthread_mutex_unlock(&m2);
        return 0;
}
int main() {
        pthread_t t1, t2, t3, t4;
        pthread_create(&t1, NULL, incr_count, NULL);
        pthread_create(&t2, NULL, incr_count, NULL);
        pthread_create(&t3, NULL, lock_m1_then_m2, NULL);
        pthread_create(&t4, NULL, lock_m2_then_m1, NULL);
        pthread_join(t4, NULL);
        pthread_join(t3, NULL);
        pthread_join(t2, NULL);
        pthread_join(t1, NULL);
        return count;   /* count is returned as the exit status */
}

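The code contains two bugs. The first is that count is modified by multiple threads without any protection; this kind of bug can also be detected by ThreadSanitizer, described below.
The second is that the two lock functions acquire m1 and m2 in opposite orders, which can lead to a deadlock.
Compile and run: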

$ gcc demo.c -o demo -lpthread
$ valgrind --tool=helgrind ./demo

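The output is long, so only the interesting parts are listed below. (These reports come from Helgrind; DRD is invoked the same way with --tool=drd but words its reports differently.)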

==5172== Possible data race during write of size 4 at 0x600C90 by thread #3
==5172== Locks held: none
==5172== at 0x40065F: incr_count (in /home/hungming/a)
==5172== by 0x4C2DB38: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5172== by 0x4E3BE99: start_thread (pthread_create.c:308)
==5172==
==5172== This conflicts with a previous write of size 4 by thread #2
==5172== Locks held: none
==5172== at 0x40065F: incr_count (in /home/hungming/a)
==5172== by 0x4C2DB38: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5172== by 0x4E3BE99: start_thread (pthread_create.c:308)

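This part reports a possible data race: threads #2 and #3 both write the same 4-byte variable (count) while holding no locks.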

==5172== Thread #5: lock order "0x600CA0 before 0x600CC8" violated
==5172==
==5172== Observed (incorrect) order is: acquisition of lock at 0x600CC8
==5172== at 0x4C2DFCD: pthread_mutex_lock (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5172== by 0x4006FB: lock_m2_then_m1 (in /home/hungming/a)
==5172== by 0x4C2DB38: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5172== by 0x4E3BE99: start_thread (pthread_create.c:308)
==5172==
==5172== followed by a later acquisition of lock at 0x600CA0
==5172== at 0x4C2DFCD: pthread_mutex_lock (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5172== by 0x40070B: lock_m2_then_m1 (in /home/hungming/a)
==5172== by 0x4C2DB38: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5172== by 0x4E3BE99: start_thread (pthread_create.c:308)

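This part tells us the lock order is wrong: lock_m2_then_m1 acquires the lock at 0x600CC8 (m2) before the one at 0x600CA0 (m1), the reverse of the order "0x600CA0 before 0x600CC8" that Helgrind had already observed.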
For more usage details, see:
Helgrind manual
DRD manual

ThreadSanitizer

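ThreadSanitizer is now part of LLVM and is built along with LLVM; GCC also supports it starting with version 4.8.
Unlike Helgrind and DRD, which run an unmodified binary under Valgrind, ThreadSanitizer instruments the program at compile time, and its focus is on detecting data races.
Here is a sample program: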

#include <pthread.h>
int Global;
void* Thread1(void* x) {
        Global++;
        return NULL;
}

void* Thread2(void* x) {
        Global--;
        return NULL;
}

int main() {
        pthread_t t[2];
        pthread_create(&t[0], NULL, Thread1, NULL);
        pthread_create(&t[1], NULL, Thread2, NULL);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        return 0;
}

This example is simple: Global is incremented in one thread and decremented in another with no synchronization, so the two accesses race.
Compile and run it. Note that the -fsanitize=thread flag is required:

$ clang simple_race.c -fsanitize=thread -g
$ ./a.out

Again, only the parts we care about are listed:

WARNING: ThreadSanitizer: data race (pid=4441)
Location is global 'Global' of size 4 at 0x7f4d31e90ad8 (a+0x0000016caad8)
SUMMARY: ThreadSanitizer: data race ??:0 Thread2

With these tools in hand, the key skill becomes analyzing their logs to pinpoint where the problem actually lies.