Memory Model for C/C++11

其實是看了The C++ Memory Model之後，對於之前懵懂的點有點茅塞頓開，寫下來記錄。

Pre-C/C++11

先來看以下這段Code

int count = 0;
bool flag = false;
void thread1()
{
	count = 1;  // (1)
	flag = true; // (2)
}
void thread2()
{
	while(!flag);
	r0 = count;
}

就直覺上來說，r0拿到的值會是1，而事實往往不會這麼簡單。Compiler有可能把(1)和(2)的指令重排，因為對Single thread來說，如此重排不匯兌結果產生任何影響，如果我們就算強迫Compiler禁止指令重排，CPU也會有機會做這件事。

由於Java之前就遇過這樣的問題，因此在這方面的工作已經十分完善，所以C++直接借鑒其努力，推出了atomics Library和新的Memory Model。
先定義幾個名詞

Sequential Consistency

the operations of all threads are executed in some sequential order

the operations of each thread appear in this sequence in the order specified by their program

用Code來解釋上面的情形

int x = y = 0;
void thread1()
{
	x = 1;
	r1 = y;
}
void thread2()
{
	y = 1;
	r2 = x;
}

關於上述第一點，每個Thread執行的順序就是Program order。所以x = 1就是會在r1 = y之前執行，而y = 1亦會在r2 = x執行。關於第二點比較難以理解，每個Thread執行順序可以是任意的，不過以Total Sequence看來，每個Thread看到的情況是一樣的。
如上面的Code有幾種Execution posiible

x = 1 -> r1 = y -> y = 1 -> r2 = x r1 = 0, r2 = 1
y = 1 -> r2 = s -> x = 1 -> r1 = y r1 = 1, r2 = 0
x = 1 -> y = 1 -> r1 = y -> r2 = x r1 = 1, r2 = 1

酸然這不代表所有可能的執行順遜，不過一旦執行順序確定，每個Thread看到的執行順序都是相同的。而每個各自的Thread獨自的執行順序是有序的。因此保證沒有Data race的可能性。
因此C++11藉由此定義了A Memory Model for C++: Sequential Consistency for Race-Free Programs而這是最簡單的

Synchronize

在C++有訂定某些operation具有synchronizing的特質，例如

void thread1()
{
	X();
	A();
}
void thread2()
{
	B();
	Y();
}

若A()和B()之間有synchronizes with關係的話，那麼X() happen before Y()
以最容易理解的Lock來當範例

std::mutex mtx;
bool bDataReady = false;
void thread1()
{
	mtx.lock();
	PrepareData();
	bDataReady = true;
	mtx.unlock();
}
void thread2()
{
	mtx.lock();
	if (bDataReady) {
		ConsumeData();
	}
	mtx.unlock();
}

根據Mutex所提供的synchronization

Mutex provides inter-thread synchronization:
unlock() synchronizes with calls to
lock() on the same mutex object.

以這個範例來看的話，PrepareData() happen before ConsumeData()，。所以可以拿到希望的資料。
由於Mutex有時候有點殺雞用牛刀的感覺，因此有更輕量的std::atomic來提供大部分的synchronizes with語意。

Atomic

對於Atomic的Memory Model又分三種

sequentially consistent

如同上面說的那些一樣，也是atomic的預設Memory model，不多說了。

relaxed

關於Relaxed model的描述

Each memory location has a total modification order(however, this order cannot be observed directly)

Memory operations performed by the same thread on the same memory location are not reordered with respect to the modification order.

每個記憶體位根據時間有一定性的修改順序。因此一個記憶體內的值根據時間軸來表示的話，可能是
v0 -> v1 -> v2 -> v3 -> v4 -> ..... -> vn
而在同一個Thread，他的Memory order不能被修改，因此在這個Thread看到的可能是
v0 -> v2 -> v3 -> vn ，或者是 v1 -> v2 -> v3，絕不可能是 v1 -> v0 -> v2這種情形
不過不同Thread的順序不受此限制，因此當某個Thread已經從v0->v1->v2時，另外一個Thread允許看到v0的存在。
因此下面這段Code就會有問題

atomic<bool> f=false;
atomic<bool> g=false;
void thread1()
{
	f.store(true, memory_order_relaxed);
	g.store(true, memory_order_relaxed);
}
void thread2()
{
	while(!g.load(memory_order_relaxed));
	assert(f.load(memory_order_relaxed));
}

以上麵的例子來說，雖然Thread1已經將f從flast->true了，不過thread2的f允許讀到false，所以assert可能匯出錯。
所以relaxed model只是核當Atomic operation，例如Counter，不是合作Thread間的同步。為了解決這問題，引進了Acquire-Release model

Acquire-Release (Consume-Release)

Consume-Release是Acquire-Release的弱化版，這邊就不提了。來看Acquire-Release的定義

a store-release operation synchronizes with all load-acquire operations reading the stored value.

All Operations in the releasing thread preceding the store-release happen before all operations following the load-acquire in the acquiring thread.

以下的程式碼因此獲得保證

int n;
atomic<bool> g=false;
void thread1()
{
	n = 23;
	g.store(true, memory_order_relaxed);
}
void thread2()
{
	while(!g.load(memory_order_relaxed));
	assert(n == 23);
}

在這邊我們就不需要用atomic來保護n了。

Dekker’s algorithm revisited

雖然Acquire-Release可以解決大部分的問題，不過還是有例外的，例如

atomic<bool> f1=false;
atomic<bool> f2=false;
void thread1()
{
	f1.store(true, memory_order_release);
	if (!f2.load(memory_order_acquire)) {
  	// critical section
	}
}
void thread2()
{
	f2.store(true, memory_order_release);
	if (!f1.load(memory_order_acquire)) {
  	// critical section
	}
}

由於f1跟f2不屬於同一個Memory location，因此load的operation有可能被compiler或是cpu提前至store之前，於是兩個Thread都有可能同時進入 critical section，這部是我們想要的，因此只能回到SC Model。正確的程式馬廄只能這樣寫

atomic<bool> f1=false;
atomic<bool> f2=false;
void thread1()
{
	f1.store(true);
	if (!f2.load()) {
  	// critical section
	}
}
void thread2()
{
	f2.store(true);
	if (!f1.load()) {
  	// critical section
	}
}

Memory fence

除了上面這些之外，C/C++11還提供了顯式的atomic_thread_fence，目的是為了不同Memory model中的溝通。
例如Cppreference這個範例

const int num_mailboxes = 32;
std::atomic<int> mailbox[num_mailboxes];
 
// The writer threads update non-atomic shared data and then update mailbox[i] as follows
std::atomic_store_explicit(&mailbox[i], receiver_id, std::memory_order_release);
 
// Reader thread needs to check all mailbox[i], but only needs to sync with one
for (int i = 0; i < num_mailboxes; ++i) {
    if (std::atomic_load_explicit(&mailbox[i], std::memory_order_relaxed) == my_id) {
        std::atomic_thread_fence(std::memory_order_acquire); // synchronize with just one writer
        do_work(i); // guaranteed to observe everything done in the writer thread before
                    // the atomic_store_explicit()
    }
 }

在Main thread用Release，不過由於在Reader thread用Acquire可能會造成效率上的損失，因此先用relaxeed來檢查，如果發現有mail之後在用上memory fence，後面的reader動作不會被移至前面。

結論

這篇是我對Memory Model的總體認知，打了一大堆還是不能解釋的很清楚，這本來就有夠複雜..

Reference

The C++ Memory Model Slide
Herb Sutter - atomic<> Weapons 1 2 Slide
[Acquire and Release Fences Don’t Work the Way You’d Expect] (http://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/)
《C++0x漫谈》系列之：多线程内存模型