Sharing data between threads is annoying. Really annoying. If you do it wrong, the data is likely to become corrupted. Even more error-prone is ensuring that the shared data is perceived in the right order (relative to other data reads/writes).

There are, as always, two ways to deal with this: The easy way, and the hard way.

The easy way

Use the locks (e.g. mutexes, critical sections, or equivalent) that come with your platform. They're not too hard to understand conceptually, and even easier to use. You don't have to worry about ordering problems, because the libraries/OS take care of that for you. The only general problem with locks is that they tend to be slow ("slow" being relative here; for general use, they're plenty speedy enough).
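
As a concrete illustration, here's a minimal sketch of the easy way using C++11's standard library (the shared counter and the function name are made up for this example):

#include <mutex>

int sharedCounter = 0;      // data shared between threads
std::mutex counterMutex;    // protects sharedCounter

void increment()
{
    // lock_guard acquires the mutex here and releases it automatically
    // when it goes out of scope -- even if an exception is thrown
    std::lock_guard<std::mutex> guard(counterMutex);
    ++sharedCounter;
}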

The catch

I was looking into doing some audio programming, where the audio callback (which gets called when the device needs more samples to fill its buffer) was on another thread. The goal with audio programming is to never let the audio "glitch" -- which means your callback code must be as close to real-time programming as you can get on a conventional OS. So, the faster things compute, the better, but what really matters with real-time programming is how deterministic the computation time will be -- will your code reliably take 5ms or could there be an occasional 300ms spike?

What does this mean? Well, for starters you can't allocate heap memory -- because that might involve an OS call, which might take a non-deterministic amount of time to complete (e.g. if you have to wait for a page to be swapped to disk). In fact, it's best to avoid calling into the kernel at all. But, this actually isn't the main problem.

The main problem is synchronizing the audio thread with the main thread. We can't use locks, because aside from non-negligible overhead (considering the frequency of callbacks), they can cause priority inversion -- you don't want a high-priority audio thread waiting to enter a critical section that some background thread is obliviously taking its time releasing, causing the audio to glitch.

The hard way

Enter lock-free programming.

Lock-free programming is a way of writing thread-safe code such that in the case of contention, the system is guaranteed to advance as a whole. "Wait-free" programming takes this a step further: the code is set up such that each thread can always advance regardless of what the other is doing. This also has the benefit of avoiding expensive locks.

Why doesn't everyone use lock-free programming? Well, it's not easy to get right. You have to be very careful never to corrupt data under any circumstances, regardless of what each of the threads is doing. But even harder to get right are the guarantees related to ordering. Consider this example (a and b start at 0):

Thread A    Thread B
b = 42
a = 1
            print(a)
            print(b)

What are the possible sets of values that thread B could output? It could be any one of {0 0}, {1 42}, {0 42}, and {1 0}. {0 0}, {1 42}, and {0 42} make sense (which one actually results depends on timing), but how can {1 0} happen? This is because the compiler is allowed to re-order the loads and stores (or even remove some or invent others) in order to improve efficiency, as long as it appears to do the same thing on that thread. But when another thread starts interacting with the first, this re-ordering can become apparent.

It gets worse. You can force the compiler not to reorder specific reads or writes, but the CPU is allowed to (and does, all the time!) re-order instructions too, again as long as it appears to do the same thing on the core that the instructions are running on.
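
To make that concrete, here's a sketch of the example above in C++11. Using std::atomic with memory_order_relaxed makes the re-ordering explicit and legal (with plain non-atomic ints the program would have a data race, which is undefined behaviour):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> a(0), b(0);

void threadA()
{
    // Relaxed stores: the compiler and the CPU are both free to reorder these
    b.store(42, std::memory_order_relaxed);
    a.store(1, std::memory_order_relaxed);
}

void threadB()
{
    // Relaxed loads: all four outcomes, including {1 0}, are allowed
    std::printf("%d ", a.load(std::memory_order_relaxed));
    std::printf("%d\n", b.load(std::memory_order_relaxed));
}

int main()
{
    std::thread t1(threadA), t2(threadB);
    t1.join();
    t2.join();
}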

Memory barriers

Fortunately, you can force certain ordering guarantees all the way down to the CPU level using memory barriers. Unfortunately, they are extremely tricky to use properly. In fact, a major problem with lock-free code is that it's prone to bugs that can only be reproduced by specific non-deterministic interleavings of thread operations. This means a lock-free algorithm with a bug may work 95% of the time and only fail the other 5%. Or it may work all of the time on a developer's CPU, but fail occasionally on another CPU.

This is why lock-free programming is generally considered hard: It's easy to get something that looks right, but it requires much more effort and thought to create something which is guaranteed to work under all circumstances. Luckily, there are a few tools to help verify implementations. One such tool is Relacy: It works by running unit tests (that you write) under every possible permutation of thread interleaving, which is really cool.

Back to memory barriers: At the lowest level of the CPU, different cores each have their own cache, and these caches need to be kept in sync with each other (a sort of eventual consistency). This is accomplished with a kind of internal messaging system between the various CPU cores (sounds slow? It is. So it was highly optimized with, you guessed it, more caching!). Since the cores are running code at the same time, this can lead to some interesting orderings of memory loads/stores (like in the example above).

Stronger ordering guarantees can be built on top of this messaging system; that's what memory barriers do. The types of memory barriers that I found most useful were the acquire and release memory barriers. A release memory barrier tells the CPU that any writes that came before the barrier must be made visible to other cores whenever any of the writes after the barrier become visible, provided that the other cores perform a read (acquire) barrier after reading the data that was written after the release barrier. In other words, if thread B can see a new value that was written after a release barrier on a separate thread A, then after doing an acquire barrier (on thread B), all writes that occurred before the release barrier on thread A are guaranteed to be visible on thread B.

One nice feature of acquire and release semantics is that they boil down to no-ops on x86 -- every write implicitly has release semantics, and every read implicitly has acquire semantics! You still need to add the barriers, though, because they prevent compiler re-ordering (and they'll generate the right assembly code on other processor architectures which don't have as strong a memory-ordering model).
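
Applied to the example from earlier, a release store on thread A paired with an acquire load on thread B rules out the {1 0} outcome. In C++11 atomics (which emit the appropriate barrier instructions for whatever architecture you compile for), a sketch looks like this:

// Thread A
b.store(42, std::memory_order_relaxed);
a.store(1, std::memory_order_release);  // everything written before this store
                                        // is visible to anyone who sees a == 1

// Thread B
int localA = a.load(std::memory_order_acquire);  // pairs with the release store
int localB = b.load(std::memory_order_relaxed);
// If localA is 1, then localB is guaranteed to be 42 -- {1 0} can no longer happen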

If this sounds confusing, it is! But don't give up hope, because there are many well-explained resources out there; a good starting point is Jeff Preshing's very helpful articles on the subject. There's also a short list of links in this answer on Stack Overflow.

A wait-free single-producer, single-consumer queue

Since I apparently find tangents irresistible, I, of course, set out to build my own lock-free data structure. I went for a wait-free queue for a single-producer, single-consumer architecture (which means there are only ever two threads involved). If you need a data structure that is safe to use from several threads at once, you'll need to find another implementation (MSVC++ comes with one). I chose this restriction because it's much easier to work with than a free-for-all of concurrent threads, and because it was all I needed.

I had looked at using some existing libraries (particularly liblfds), but I wasn't satisfied with the memory management (I wanted a queue that wouldn't allocate any memory at all on the critical thread, making it suitable for real-time programming). So, I ignored the copious advice about only doing lock-free programming if you're already an expert in it (how does anyone become an expert like that?), and was able to successfully implement a queue!

The design is as follows:

A contiguous circular buffer is used to store elements in the queue. This allows memory to be allocated up front, and may provide better cache usage. I call this buffer a "block".

To allow the queue to grow dynamically without needing to copy all the existing elements into a new block when the block becomes too small (which isn't lock-free friendly anyway), multiple blocks (of independent size) are chained together in a circular linked list. This makes for a queue of queues, essentially.

The block at which elements are currently being inserted is called the "tail block". The block at which elements are currently being consumed is called the "front block". Similarly, within each block there is a "front" index and a "tail" index. The front index indicates the next full slot to read from; the tail index indicates the next empty slot to insert into. If these two indices are equal, the block is empty (exactly one slot is left empty when a block is full, in order to avoid ambiguity between a full block and an empty block both having equal front and tail indices).
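
A sketch of what a block might look like in C++11 (the member names here are my own, not necessarily those of the real implementation):

#include <atomic>
#include <cstddef>

template<typename T>
struct Block
{
    T* data;                         // contiguous storage for 'size' element slots
    std::size_t size;                // capacity of this block
    std::atomic<std::size_t> front;  // next slot to read from (advanced only by the consumer)
    std::atomic<std::size_t> tail;   // next slot to insert into (advanced only by the producer)
    Block* next;                     // link in the circular list of blocks

    bool empty() const
    {
        // front == tail means empty; since one slot is always left unused,
        // "full" is unambiguously (tail + 1) % size == front
        return front.load(std::memory_order_relaxed) ==
               tail.load(std::memory_order_relaxed);
    }
};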

In order to keep the queue consistent while two threads are operating on the same data structure, we can take advantage of the fact that the producing thread always goes in one direction in the queue, and the same can be said for the consuming thread. This means that even if a variable's value is out of date on a given thread, we know what range it can possibly be in. For example, when we dequeue from a block, we can check the value of the tail index (owned by the other thread) in order to compare it against the front index (which the dequeue thread owns, and is hence always up-to-date). We may get a stale value for the tail index from the CPU cache; the enqueue thread could have added more elements since then. But, we know that the tail will never go backwards -- more elements could have been added, but as long as the tail was not equal to the front at the time we checked it, there's guaranteed to be at least one element to dequeue.

Using these sort of guarantees, and some memory barriers to prevent things like the tail from being incremented before (or perceived as being incremented before) the element is actually added to the queue, it's possible to design a simple algorithm to safely enqueue and dequeue elements under all possible thread interleavings. Here's the pseudocode:

# Enqueue
If room in tail block, add to tail
Else check next block
    If next block is not the front block, enqueue on next block
    Else create a new block and enqueue there
    Advance tail to the block we just enqueued to

# Dequeue
Remember where the tail block is
If the front block has an element in it, dequeue it
Else
    If front block was the tail block when we entered the function, return false
    Else advance to next block and dequeue the item there

The algorithms for enqueueing and dequeueing within a single block are even simpler since the block is of fixed size:

# Enqueue within a block (assuming we've already checked that there's room in the block)
Copy/move the element into the block's contiguous memory
Increment tail index (wrapping as needed)

# Dequeue from a block (assuming we've already checked that it's not empty)
Copy/move the element from the block's contiguous memory into the output parameter
Increment the front index (wrapping as needed)

Obviously, this pseudocode glosses over some stuff (like where memory barriers are placed), but the actual code is not much more complex if you want to check out the lower-level details.
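
To give an idea of where those barriers go, here's what the within-block operations might look like with C++11 atomics. This is a simplified sketch of the technique, not the actual implementation; data, size, front, and tail are the hypothetical block members sketched earlier:

// Producer side: fails (returns false) if the block is full
template<typename T>
bool try_enqueue(Block<T>& block, T const& element)
{
    std::size_t t = block.tail.load(std::memory_order_relaxed);  // producer owns tail
    std::size_t nextTail = (t + 1) % block.size;
    // A stale front is harmless: front only ever advances, so at worst
    // we report "full" slightly pessimistically
    if (nextTail == block.front.load(std::memory_order_acquire))
        return false;
    block.data[t] = element;                                // write the element first...
    block.tail.store(nextTail, std::memory_order_release);  // ...then publish it
    return true;
}

// Consumer side: fails (returns false) if the block is empty
template<typename T>
bool try_dequeue(Block<T>& block, T& result)
{
    std::size_t f = block.front.load(std::memory_order_relaxed);  // consumer owns front
    // A stale tail is harmless: tail never goes backwards, so if front != tail
    // here, at least one element is guaranteed to be present
    if (f == block.tail.load(std::memory_order_acquire))
        return false;
    result = block.data[f];                                 // read the element first...
    block.front.store((f + 1) % block.size,                 // ...then free up the slot
                      std::memory_order_release);
    return true;
}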

When thinking about this data structure, it's helpful to make the distinction between variables that are "owned" by (i.e. written to exclusively by) the consumer or producer thread. A variable that is owned by a given thread is never out of date on that thread. A variable that is owned by one thread may have a stale value when it is read on a different thread, but by using memory barriers carefully we can guarantee that the rest of the data is at least as up to date as the variable's value indicates when reading it from the non-owning thread.

To allow the queue to potentially be created and/or destroyed by any thread (independent from the two doing the producing and consuming), a full memory barrier (memory_order_seq_cst) is used at the end of the constructor and at the beginning of the destructor; this effectively forces all the CPU cores to synchronize outstanding changes. Obviously, the producer and consumer must have stopped using the queue before the destructor can be safely called.
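
In C++11 terms that boils down to something like the following (a sketch; SpscQueue is a hypothetical name for the queue class):

SpscQueue::SpscQueue()
{
    // ... allocate and initialize the first block ...
    // Make the fully-constructed queue visible to whichever
    // threads end up producing and consuming
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

SpscQueue::~SpscQueue()
{
    // Make sure this thread sees all outstanding changes from the
    // producer and consumer (which must already have stopped) ...
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // ... before freeing the blocks
}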
