Wednesday 22 April 2015

Porting Pitfalls: Turning D. Vyukov's MPSC Wait-Free Queue into a j.u.Queue

{This post is part of a long running series on lock free queues, check out the full index to get more context here}
D. Vyukov is an awesome lock-free dude, and I often refer to his instructive and invaluable website 1024cores.net in my posts. On his site he covers lock-free queue implementations and in particular a wait-free MPSC linked node queue. This is really rather special when you consider that normally MP would imply lock-free rather than wait-free guarantees. I've ported his algorithm to Java (and so have many others: Netty/Akka/RxJava etc.), and had to tweak it to match the Queue interface. In this post I'd like to explain the algorithm, its translation to Java, and the implications of making it a j.u.Queue.

Lock free vs. Wait free

Let's review the definitions:
  • Wait-free: per-thread progress is guaranteed; every operation finishes in a bounded number of steps.
  • Lock-free: global progress is guaranteed, though a particular thread may make no progress (it can be starved indefinitely while others complete).

An example of a transition from lock free to wait free is available with JDK8 changes to AtomicReference::getAndSet(). The change was made by utilizing the newly available Unsafe::getAndSetObject intrinsic which translates directly to XCHG (on x86). So where we used to have for AtomicReference:
T getAndSet(T newValue) {
  T currentValue;
  do {
    currentValue = val; // val is a volatile field
  } while (!UNSAFE.compareAndSwapObject(this, VAL_FIELD_OFFSET, currentValue, newValue));
  return currentValue;
}
Now we have:
T getAndSet(T newValue) {
  return (T) UNSAFE.getAndSetObject(this, VAL_FIELD_OFFSET, newValue);
}
I discussed a similar change to AtomicLong.getAndAdd in a previous post, replacing the CAS loop with LOCK XADD.

The Vyukov Wait Free MPSC queue

This is a LinkedList type structure and the interesting methods are offer and poll, here's the original (I did some formatting):
Awesome in its simplicity, deceptive in its genius. Be aware that the head/tail meaning is the other way around from what most people are used to. I personally go for producer/consumerNode (head = producer side, tail = consumer side in the above snippet) in my code, but for consistency I'll stick with Mr. V's notation for the porting exercise.
But how do we manage the same memory barriers in Java? We can be nice about it and use an AtomicReferenceFieldUpdater, or be more brutal and use Unsafe. I find you get better performance with Unsafe, but you should consider how appropriate this is for you. In any case, here's what we end up with:
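A minimal sketch along these lines (keeping Vyukov's head = producer end / tail = consumer end naming; I'm using an AtomicReferenceFieldUpdater for the head swap to keep the sketch self-contained, where the Unsafe version would use UNSAFE.getAndSetObject instead):

import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

class MpscNode<E> {
  volatile MpscNode<E> next;
  E value;
  MpscNode(E value) { this.value = value; }
}

class MpscLinkedQueue<E> {
  private static final AtomicReferenceFieldUpdater<MpscLinkedQueue, MpscNode> HEAD =
      AtomicReferenceFieldUpdater.newUpdater(MpscLinkedQueue.class, MpscNode.class, "head");

  private volatile MpscNode<E> head; // producer end
  protected MpscNode<E> tail;        // consumer end

  MpscLinkedQueue() {
    MpscNode<E> stub = new MpscNode<>(null);
    head = stub;
    tail = stub;
  }

  // wait-free: one atomic swap and one plain store per call, no retry loop
  public boolean offer(E e) {
    MpscNode<E> n = new MpscNode<>(e);
    MpscNode<E> prev = HEAD.getAndSet(this, n); // swap head (an XCHG), the serialization point
    prev.next = n;                              // link the old head to the new node
    return true;
  }

  // Vyukov semantics: null means "next element not visible yet", not necessarily "queue empty"
  public E poll() {
    MpscNode<E> next = tail.next;
    if (next == null) {
      return null;
    }
    tail = next;
    E e = next.value;
    next.value = null; // the new tail acts as the stub, don't retain the element
    return e;
  }
}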
The code is pretty similar. Now if we wanted to complete the j.u.Queue we could extend AbstractQueue and implement only size()/peek() and we're done (I'm not going to bother with the iterator()):
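Something along these lines (a sketch again, assuming the class above is changed to extend AbstractQueue<E>):

@Override
public E peek() {
  MpscNode<E> next = tail.next;      // peek at the next node from the consumer end
  return next == null ? null : next.value;
}

@Override
public int size() {
  int size = 0;
  MpscNode<E> n = tail;
  while ((n = n.next) != null) {     // count nodes until we run out of next links
    size++;
  }
  return size;
}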
There we go, seems reasonable don't it? Off to the pub to celebrate some kick-ass lock-free programming!

Vyukov::poll() ain't Queue::poll!

I've discussed this particular annoyance in a previous post which has some overlap with this one. The names are the same, but as it turns out the guarantees are quite different. While Vyukov is doing a fine job implementing a queue, not every queue is a j.u.Queue. In particular for this case, the poll() method has a different contract:
  • j.u.Queue: Retrieves and removes the head of this queue, or returns null if this queue is empty.
  • Vyukov: Retrieves and removes the head of this queue, or returns null if next element is not available.
Why wouldn't the next element be available? Doesn't that mean the queue is empty? Sadly this ain't the case. Imagine for instance there are 2 producers and we run both threads step by step in a debugger. We break at line 33 (the 4th line of offer in the Java version, where head is swapped):
  1. Producer 1: we step 1 line, so we've just replaced head with our node n. We are suspended before executing line 35 (the prev.next = n store).
  2. Producer 2: we let the program continue. It replaced head with its own node and linked the previous head (Producer 1's node) to it.
What state are we in? Let's look at head/tail nodes and where they lead (I number the nodes in order of assignment to head):
  • head = Node[2], this is the node created by Producer 2. We also know that Node[1].next = Node[2], because we let offer run its course on Producer 2.
  • tail = Node[0], the node we allocated in the constructor. This is the node head was before the first producer came along. This is what prev is equal to for Producer 1, but because we suspended that thread it never set its next value. Node[0].next is still null!

If a consumer came along now they would get a null from poll(), indicating the queue is empty. But the queue is obviously not empty!
So it seems we cannot deduce that the queue is empty by looking at tail.next. Here are 2 valid indicators that the queue is empty (a tiny isEmpty() sketch follows the list):
  • head == tail : this is the starting point set in the constructor and where the consumer ends up after consuming the last element
  • head.val == null : head can only have a value of null if it is tail
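For instance (continuing the sketch above):

public boolean isEmpty() {
  // true for the freshly constructed queue and for a fully drained one
  return head == tail;
}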
Here's a solution to a correct poll() and the knock-on effect on peek():
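Roughly speaking, the fix looks like this (a sketch of the idea; the JCTools version is the one to use for real):

@Override
public E poll() {
  MpscNode<E> next = tail.next;
  if (next == null && tail != head) {
    // a producer has swapped head but not yet linked its node: the queue is NOT empty,
    // so wait for the link to become visible
    while ((next = tail.next) == null) { }
  }
  if (next == null) {
    return null; // genuinely empty: tail == head
  }
  tail = next;
  E e = next.value;
  next.value = null;
  return e;
}

@Override
public E peek() {
  MpscNode<E> next = tail.next;
  if (next == null && tail != head) {
    while ((next = tail.next) == null) { }
  }
  return next == null ? null : next.value;
}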
This is a bit annoying because now we have a wait-free offer(), but poll() and peek() are only lock-free (the consumer may spin waiting for a single producer to finish linking its node, but producers can always make progress).
This pitfall is tricky enough that not only did I fall for it (on another queue algorithm, but the same mistake), it also took Mr. Manes by surprise when writing an interesting variant on this queue for his high performance cache implementation (I filed an issue which he promptly fixed), and it struck the notorious Dave Dice when considering Vyukov's MPMC queue (see comments for discussion).
So is this it? Are we done?
Almost... size() is still broken.

A Good Size

It's not really surprising that size() is broken, given that the size loop relied on next == null as its terminating condition. Size was also broken in 2 other subtle ways:
  • The interface for size() dictates that it returns an int (Collection.size() specifies returning Integer.MAX_VALUE if there are more elements than that). But given that the queue is unbounded it is possible (though very unlikely) for it to have more than 2^31 elements. This would require that the linked list consume over 64GB (16b + 8b + 8b = 32b per element; refs are 8b since this requires more than a 32GB heap, so no compressed oops). Unlikely, but not impossible. This edge condition is handled for ConcurrentLinkedQueue (same as here), and for LinkedBlockingQueue (by bounding its size to MAX_INT, so it's not really unbounded after all), but not for LinkedList (size is maintained in an int and is blindly incremented).
  • Size is chasing a moving target and as such can in theory never terminate as producers keep adding nodes. We should limit the scope of the measurement to the size of the queue at the time of the call. This is not done for CLQ and should perhaps be considered.
Here's the end result:
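Along these lines (a sketch; the JCTools implementation is the authoritative one):

@Override
public int size() {
  MpscNode<E> chaser = tail;  // consumer end
  MpscNode<E> target = head;  // producer end, captured once so we measure the queue as it was at call time
  int size = 0;
  while (chaser != target && size < Integer.MAX_VALUE) { // cap the result at MAX_INT
    MpscNode<E> next;
    while ((next = chaser.next) == null) { } // a producer may not have linked its node yet, wait for it
    chaser = next;
    size++;
  }
  return size;
}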
If you need a full SPSC/MPSC linked queue implementation they are available in JCTools for your pleasure, enjoy!

Special thanks to Doug and Ben for reviewing!

Monday 13 April 2015

On Arrays.fill, Intrinsics, SuperWord and SIMD instructions

{This post turned rather long, if you get lazy feel free to skip to the summary}
Let's start at the very beginning, a very good place to start... My very first post on this blog was a short rant on intrinsics, and how they ain't what they seem. In that post I made the following statement:
"intrinsic functions show up as normal methods or native methods"
Which is correct. An intrinsic function is applied as a method substitution. A method call will appear in the code and the compiler will replace its apparent source-level implementation with a pre-cooked implementation. In some cases intrinsics are sort of compilation cheats, the idea being that some bits of functionality are both very important (i.e. worth optimizing) and can benefit from a hand crafted solution that will be better than what the compiler can achieve. The end result can be in one of a few flavours:
  1. Method call replaced with a call to a JVM runtime method: E.g. System.arrayCopy is replaced with a call to a method stub generated by the runtime for all array types. This method call is not a JNI call, but it is a static method call that is not inlined.
  2. Method call replaced with one or more instructions inlined: E.g. Unsafe.getByte/compareAndSet/Math.max
  3. Method call replaced with compiler IR implementation: E.g. java.lang.reflect.Array.getLength
  4. A mix of the above: E.g. String.equals is partially implemented in IR, but the array comparison is a call to a method stub.
The intrinsics are all set up in vmSymbols.hpp and if you look, you'll see Arrays.fill is NOT on the list. So why am I talking about Chewbacca? Because it is something like an intrinsic...

The Arrays.fill SIMD Opportunity

Arrays.fill is the Java memset (it fills an array with a given value), and just like System.arrayCopy (memcpy in C lingo) it is worth the effort to optimize and offers the same kind of opportunity. What opportunity might that be, you ask? The opportunity to use SIMD (Single Instruction Multiple Data) instructions when the underlying CPU offers them (I assume for the sake of discussion AVX enabled CPUs, i.e. Sandy Bridge onwards; I find this listing of Intel intrinsics useful for explaining and sorting through the available instructions). These instructions allow the CPU to operate on chunks of data up to 256 bits (soon 512 bits) wide, turning 32 single-byte MOV instructions into one 32-byte-wide MOV (e.g. the Intel C intrinsic _mm256_storeu_si256 or the corresponding instruction vmovdqu). SIMD instructions are good for all sorts of operations on vectors of data, or arrays, which is why the process of transforming element-by-element operations into SIMD instructions is also referred to as vectorization.
The actual assembly stub is generated dependent on CPU and available instruction set. For x86 the code is generated by the macroAssembler_x86.cpp, and the observant digger into the code will find it makes use of the widest memory instructions it can identify the processor is capable of. Wider is better baby! If you are not morbidly curious about what the implementation looks like, skip the next wall of assembly and you'll be back in Java land shortly.
Here's what the assembly boils down to when UseAVX>=2/UseSSE>=2/UseUnalignedLoadStores=true:

Roughly speaking the algorithm above is:
  1. Fill up an XMM register with the intended value
  2. Use the XMM register to write 64 byte chunks (2 vmovdqu) until no more are available
  3. Write leftover 32 byte chunk (skipped if no matching leftovers)
  4. Write leftover 8 byte chunks (skipped if no matching leftovers)
  5. Write leftover 4 bytes (skipped if no matching leftovers)
  6. Write leftover 2 bytes (skipped if no matching leftovers)
  7. Write leftover 1 byte (skipped if no matching leftovers)
It ain't nice, but we do what we gotta for performance! There are variations of the above described across the internets as the done thing for a memset implementation; it might seem complex, but it's pretty standard... anyway, moving right along.

The Arrays.fill 'intrinsic'

Arrays.fill is different from System.arrayCopy because, as its absence from vmSymbols suggests, it's not a method substitution kind of intrinsic (so technically not an intrinsic). What is it then? Arrays.fill is a code pattern substitution kind of compiler shortcut, basically looking for this kind of loop:
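Something of this shape (a representative sketch; the matcher has further conditions, e.g. a simple ascending int index and a loop-invariant value):

void manualFillBytes(byte[] a, byte v) {
  for (int i = 0; i < a.length; i++) {
    a[i] = v;
  }
}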
And replacing it with a call into the JVM memset implementation (I recently learnt the same thing is done by GCC as well, see code to assembly here). The pattern matching bit is done in loopTransform.cpp. This feels enough like an intrinsic grey area that the method doing the pattern match and replace is called intrinsify_fill.
Pattern matching makes this optimization potentially far more powerful than method substitution as the programmer doesn't have to use a special JDK method to convey meaning, they can just express their meaning in code and the compiler 'knows' that this simple loop means 'fill'. Compare that with System.arrayCopy where rolling your own leads to performance that is much worse than that offered by the intrinsic.
Let's prove me right (my favourite thing, beats kittens and all that crap). Here's a JMH (see the JMH reference page for more JMH info/examples) benchmark comparing Arrays.fill to a hand rolled fill, and System.arrayCopy to a hand rolled array copy:
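The benchmark looks something like this (a sketch rather than the exact gist; the method names match the results below, while the harness annotations and the 32KB size are my reconstruction):

import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ArrayFill {
  static final int SIZE = 32 * 1024;
  byte[] bytes = new byte[SIZE];
  byte[] copyTarget = new byte[SIZE];
  byte value = 1;

  @Benchmark
  public void fillBytes() {
    Arrays.fill(bytes, value);
  }

  @Benchmark
  public void manualFillBytes() {
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = value;
    }
  }

  @Benchmark
  public void manualReversedFillBytes() {
    for (int i = bytes.length - 1; i >= 0; i--) {
      bytes[i] = value;
    }
  }

  @Benchmark
  public void copyBytes() {
    System.arraycopy(bytes, 0, copyTarget, 0, SIZE);
  }

  @Benchmark
  public void manualCopyBytes() {
    for (int i = 0; i < SIZE; i++) {
      copyTarget[i] = bytes[i];
    }
  }
}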
And the results are (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu, array is 32K in size)?
ArrayFill.fillBytes                561.540 ± 10.814 ns/op
ArrayFill.manualFillBytes          557.901 ± 5.255  ns/op
ArrayFill.manualReversedFillBytes 1017.856 ± 0.425  ns/op
ArrayFill.copyBytes               1300.313 ± 13.482 ns/op
ArrayFill.manualCopyBytes         1477.442 ± 13.030 ns/op

We can verify that the call out to the JVM fill method happens for fillBytes/manualFillBytes by printing out the assembly:

So what have we learnt so far:
  • Use System.arrayCopy, it is better than your handrolled loop. But surprisingly not hugely better, hmmm.
  • You don't have to use Arrays.fill, you can roll your own and it works the same. Notice the call out to the fill method. But...
  • Don't get too creative rolling your own. If you get too funky (like filling the array backwards) it'll fall apart and the 'intrinsic' won't happen. But do note that the reverse fill still has some of that good SIMD stuff going, we'll get to that in a sec.

Are The Other Types Filling The Love?

It all sounds great don't it? Let's see how this pans out for other types. We'll be filling an array of 32KB. To be uniform across data types that means a 16K chars/shorts array, an 8K ints/floats array and a 4K array of longs. I added an 8K array of objects, which is the same size for compressed oops on the Oracle JVM (reference size is 4 bytes, same as an int).
The JMH benchmark code is as you'd expect:
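Along these lines (a sketch continuing the ArrayFill class above; the values being filled are my own choice):

char[]   chars   = new char[SIZE / 2];   // 16K chars   = 32KB
short[]  shorts  = new short[SIZE / 2];  // 16K shorts  = 32KB
int[]    ints    = new int[SIZE / 4];    // 8K ints     = 32KB
float[]  floats  = new float[SIZE / 4];  // 8K floats   = 32KB
long[]   longs   = new long[SIZE / 8];   // 4K longs    = 32KB
double[] doubles = new double[SIZE / 8]; // 4K doubles  = 32KB
Object[] objects = new Object[SIZE / 4]; // 8K refs     = 32KB with compressed oops
Object   anObject = new Object();

@Benchmark public void fillChars()   { Arrays.fill(chars, (char) 1); }
@Benchmark public void fillShorts()  { Arrays.fill(shorts, (short) 1); }
@Benchmark public void fillInts()    { Arrays.fill(ints, 1); }
@Benchmark public void fillFloats()  { Arrays.fill(floats, 1f); }
@Benchmark public void fillLongs()   { Arrays.fill(longs, 1L); }
@Benchmark public void fillDoubles() { Arrays.fill(doubles, 1d); }
@Benchmark public void fillObjects() { Arrays.fill(objects, anObject); }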
Here are some reasonable expectations:
  • If no optimizations are present, wider writes are more efficient. It follows that the longFill would be the fastest. But...
  • Given a clever compiler the fill loop is replaced with the widest writes possible, so there should be no significant difference. But the fill optimization does not cover double/long/object arrays, so we might expect longFill to be the worst performer.
  • An objects array is not that different from an int array, so performance should be similar. Sure there's a write barrier, but it need only be done once per card (not once for the whole array as I thought initially, god bless Shipilev and PrintAssembly), so that's an extra byte write per card's worth of elements filled. A card covers 512 bytes and each element is 4 bytes, so that's one card per 128 elements. Given there is no fill method implemented for it we may expect it to be slightly worse than the longFill.
  • We should not rely on expectations, because performance is best measured.
As you'd expect the results are somewhat different than the expectations (Oracle JDK8u40/i7-4770@3.40GHz/Ubuntu):
ArrayFill.fillBytes     561.540 ± 10.814  ns/op
ArrayFill.fillChars     541.901 ±  4.833  ns/op
ArrayFill.fillShorts    532.936 ±  4.508  ns/op
ArrayFill.fillInts      543.165 ±  3.788  ns/op
ArrayFill.fillFloats    537.599 ±  2.323  ns/op
ArrayFill.fillLongs     326.770 ±  3.267  ns/op
ArrayFill.fillDoubles   346.840 ±  5.786  ns/op
ArrayFill.fillObjects  4388.242 ± 11.945  ns/op

Say WOT?
For bytes/chars/shorts/ints/floats Arrays.fill performs very similarly. This much is as expected from the second point above. But filling an array of longs/doubles is better than the others. The funny thing is, there's no fill function implemented for the long array, so how come it is so darn quick? Also, why does the objects fill suck quite so badly when compared with the rest (I will not be addressing this last question! I refuse! this post is too fucking long as it is!)?
This is what happens when we turn off the OptimizeFill flag:
ArrayFill.fillBytes    1013.803 ± 0.227  ns/op
ArrayFill.fillChars     323.806 ± 3.879  ns/op
ArrayFill.fillShorts    323.689 ± 4.499  ns/op
ArrayFill.fillInts      326.336 ± 1.559  ns/op
ArrayFill.fillFloats    319.449 ± 2.048  ns/op
ArrayFill.fillLongs     328.692 ± 3.282  ns/op
ArrayFill.fillDoubles   345.035 ± 6.362  ns/op
ArrayFill.fillObjects  4397.130 ± 7.161  ns/op

Strange, innit? Now we've got char/int/long arrays all performing similarly. In fact, with the exception of the byte array, everything is better than it was with the optimization.


Superword to the rescue! 

Turns out the JIT compiler is clued up on the topic of SIMD parallelisation by way of Superword Level Parallelism (see the original paper here):
In some respects, superword level parallelism is a restricted form of ILP (Instruction Level Parallelism). ILP techniques have been very successful in the general purpose computing arena, partly because of their ability to find parallelism within basic blocks. In the same way that loop unrolling translates loop level parallelism into ILP, vector parallelism can be transformed into SLP. This realization allows for the parallelization of vectorizable loops using the same basic block analysis. As a result, our algorithm does not require any of the complicated loop transformations typically associated with vectorization. In fact, vector parallelism alone can be uncovered using a simplified version of the SLP compiler algorithm.
...
Superword level parallelism is defined as short SIMD parallelism in which the source and result operands of a SIMD operation are packed in a storage location.
...
Vector parallelism is a subset of superword level parallelism.
The Hotspot compiler implements SLP optimizations in superword.cpp and you are invited to dive into the implementation if you like. I'm going to focus on its impact here, and to do that I only need to know how to turn it on and off (a core competency for any software person). It's on by default, so the above results are what happens when it is on; here's what life looks like when it is off too (so -XX:-OptimizeFill -XX:-UseSuperWord):
ArrayFill.fillBytes   8501.270 ±  2.896  ns/op
ArrayFill.fillChars   4286.161 ±  4.935  ns/op
ArrayFill.fillShorts  4286.168 ±  3.146  ns/op
ArrayFill.fillInts    2152.489 ±  2.653  ns/op
ArrayFill.fillFloats  2140.649 ±  2.587  ns/op
ArrayFill.fillLongs   1105.926 ±  2.228  ns/op
ArrayFill.fillDoubles 1105.820 ±  2.393  ns/op
ArrayFill.fillObjects 4392.506 ± 11.678  ns/op


Life is revealed in all its sucky splendour! This is what happens when the compiler shows you no love... did I say no love? Hang on, things can get a bit worse.

Detour: Unsafe? We don't serve your kind here

To all the Unsafe fans, I got some sad news for y'all. Unsafe 'fill' loops are not well loved by the compiler. This is the price of stepping off the beaten path I guess. Consider the following benchmark:
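Something like this, added to the ArrayFill class above (the reflective Unsafe lookup and the field names are mine, not the original gist):

static final sun.misc.Unsafe UNSAFE = getUnsafe();

static sun.misc.Unsafe getUnsafe() {
  try {
    java.lang.reflect.Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    return (sun.misc.Unsafe) f.get(null);
  } catch (Exception e) {
    throw new RuntimeException(e);
  }
}

final long offheapAddress = UNSAFE.allocateMemory(SIZE);
final long byteArrayOffset = UNSAFE.arrayBaseOffset(byte[].class);

@Benchmark
public void unsafeFillOffheapBytes() {
  for (int i = 0; i < SIZE; i++) {
    UNSAFE.putByte(offheapAddress + i, value); // address-based put, off-heap
  }
}

@Benchmark
public void unsafeFillOnHeapBytes() {
  for (int i = 0; i < SIZE; i++) {
    UNSAFE.putByte(bytes, byteArrayOffset + i, value); // base + offset put, on-heap
  }
}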
The results are:
ArrayFill.unsafeFillOffheapBytes  9742.621 ±  2.270  ns/op
ArrayFill.unsafeFillOnHeapBytes  12640.019 ±  1.977  ns/op
ArrayFill.fillBytes (for reference)  561.540 ± 10.814  ns/op

The Unsafe variants do not enjoy the 'fill' pattern matching magic, nor do they get the SuperWord optimizations. What can you do? For this kind of thing you should use the Unsafe.setMemory method instead:
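For example (same caveats as the sketch above):

@Benchmark
public void unsafeSetOffheapBytes() {
  UNSAFE.setMemory(null, offheapAddress, SIZE, value); // the (Object, offset, bytes, value) overload
}

@Benchmark
public void unsafeSetOnHeapBytes() {
  UNSAFE.setMemory(bytes, byteArrayOffset, SIZE, value);
}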
With the result:
ArrayFill.unsafeSetOffheapBytes   1259.281 ± 21.294  ns/op
ArrayFill.unsafeSetOnHeapBytes    1275.158 ± 27.950  ns/op
Not quite there, still ~2x worse (why? how come it doesn't just call the bytes fill method? a bit of digging shows it ends up calling the underlying platform's memset...) but beats being 20-25x worse like the handrolled method is.

Summary and Musings

It's the circle of life!
So what did we learn:
  • There's another kind of 'intrinsic'-like optimization, which uses pattern matching to swap a block of code rather than a method call. This is employed to 'intrinsify' memset-like memory fill loops (in particular Arrays.fill). It's not technically an intrinsic, but you know what I fucking mean.
  • System.arrayCopy/Arrays.fill implementations utilize SIMD instructions to improve their efficiency. These instructions are not available in plain Java, so some compiler intervention is required.
  • The JIT compiler is also able to use SuperWord Level Parallelism to derive SIMD code from 'normal' sequential code.
  • In the case of Arrays.fill, it looks like the SuperWord optimized code is faster than the fill specialized implementation for all types except bytes (on the system under test).
  • If you use Unsafe you will be excluded from these optimizations.
So I look at this process and I imagine history went something like this:
We want to use SIMD instructions, but the JIT compiler isn't really clever enough to generate them by itself. Memset implementations are rather specialized after all. Let's make life a bit easier for the compiler by creating an intrinsic. We'll even go the extra mile and make an effort to automatically identify opportunities to use this intrinsic, so now it's not really an intrinsic any more. The Arrays.fill optimization is available on Oracle JDK6u45 (the oldest I keep around, maybe it was there a while before that) and on that JVM it is twice as fast as the SLP generated code.
Over time, SLP gets better and eventually the compiler is now good enough to optimize the fill loop by itself and beat the specialized method. That is an awesome thing. We just need to remove the training wheels now.
And there's a final punch line to this story. Memset/memcpy are such common and important optimization targets that Intel has decided to offer an assembly 'recipe' for them, saving everyone the effort of writing their own:
3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)
Beginning with processors based on Intel microarchitecture code name Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1. - [From the Intel Optimization Manual (September 2014)]
From the manual it seems that this method of implementing memcpy/memset can perform well, but like anything else, YMMV (the Intel manual's discussion of the performance differences is in itself interesting, both on the results and the methodology level). One obvious advantage of this method is that it results in much, much smaller code that should be trivial to inline into callers. This will however put the SuperWord method at a slight disadvantage, and the tide will change again.
[UPDATE 14/03/2015: It seems the good folks of Oracle have considered and rejected the use of REP MOVSB for array copy.]
Thanks go to the kind reviewers Peter 'Massive' Hughes, Darrach and the Shipster