Tuesday 6 August 2013

JMM Cookbook Footnote: NOOP Memory Barriers on x86 are NOT FREE

In Doug Lea's excellent JMM Cookbook the following observation is made about memory barriers on different architectures:
Processor LoadStore LoadLoad StoreStore StoreLoad Data
dependency
orders loads?
Atomic
Conditional
Other
Atomics
Atomics
provide
barrier?
sparc-TSO no-op no-op no-op membar (StoreLoad) yes CAS: casa swap,ldstub full
x86 no-op no-op no-op mfence or
cpuid or
locked insn
yes CAS:
cmpxchg
xchg,
locked insn
full
ia64 combine
with

st.rel or
ld.acq
ld.acq st.rel mf yes CAS:
cmpxchg
xchg,
fetchadd
target +
acq/rel
arm dmb
(see below)
dmb
(see below)
dmb-st dmb indirection
only
LL/SC:
ldrex/strex

target
only
This leads quite a few people to conclude that there is no performance cost to the memory barriers that are marked no-op, and on occasion leads people to believe they can just not put them in at all. After all, what is the point in a no-op? surely a program with the no-ops removed is the same program, right? Wrong...
After having this debate again recently I thought I'd document it here to avoid repetition, and to get some second opinions I got some help from the good people on the concurrency-interest mailing list.

What's a Memory Barrier?

This is Noop
From the Cookbook:
"Compilers and processors must both obey reordering rules. No particular effort is required to ensure that uniprocessors maintain proper ordering, since they all guarantee "as-if-sequential" consistency. But on multiprocessors, guaranteeing conformance often requires emitting barrier instructions...
...Memory barrier instructions directly control only the interaction of a CPU with its cache, with its write-buffer that holds stores waiting to be flushed to memory, and/or its buffer of waiting loads or speculatively executed instructions."
We need memory barriers to help the compiler and CPU make sense of our intentions in terms of concurrency. We need to hold them back from optimizing away code that can be eliminated or re-ordered if optimal single threaded execution is the only consideration. The last bit suggests that barriers are really about memory ordering.

Why isn't it free?

Volatile reads (LoadLoad) and lazySet/putOrdered (StoreStore) are both "No-op"s on x86. Mike Barker (LMAX dude, maintainer and contributor to the Disruptor, here's his blog) replied to my request for clarification:
"The place to look (as definitive as it gets for the hardware) is section 8.2.2, volume 3A of the Intel programmer manual.  It lists the rules that are applied regarding the reordering of instructions under the X86 memory model.  I've summarised the relevant ones here:
- Reads are not reordered with other reads.
- Writes are not reordered with older reads.
- Writes to memory are not reordered with other writes.
- Reads may be reordered with older writes to different locations but not with older writes to the same location.
The only reordering that will occur with x86 is allowing reads to be executed before writes (to other locations), hence the need for a LOCKed instruction to enforce the store/load barrier.  As you can see with the above rules, store are not reordered with older stores and loads are not reordered with older loads so a series of MOV instructions is sufficient for a store/store or a load/load barrier."
So since a volatile read and a lazySet both boil down to a MOV from/to memory, no further special instruction is required (hence no-op) as the underlying architecture memory model obeys the rules required by the JMM (Note: lazySet is not in the JMM, but is as good as in it, I go into the origins of lazySet here). But while that does mean these barriers are cheap it doesn't make them free. The above rules Mike is quoting are for the generated assembly, but the barriers impact the compiler as well. Memory barriers are there to stop your code being interpreted a particular way by both the compiler and the CPU. The JMM guarantees that memory barriers are respected by BOTH. This is important because the JMM Cookbook is talking about CPU no-ops, not compiler no-ops. Vitaly Davidovich from concurrency interest replied:
"Yes, volatile loads and lazySet do not cause any cpu fence/barrier instructions to be generated - in that sense, they're noop at the hardware level.  However, they are also compiler barriers, which is where the "cheap but ain't free" phrase may apply. The compiler cannot reorder these instructions in ways that violate their documented/spec'd memory ordering effects. So for example, a plain store followed by lazySet cannot actually be moved after the lazySet; whereas if you have two plain stores, the compiler can technically reorder them as it sees fit (if we look at just them two and disregard other surrounding code). So, it may happen that compiler cannot do certain code motion/optimizations due to these compiler fences and therefore you have some penalty vs using plain load and stores.  For volatile loads, compiler cannot enregister the value like it would with plain load, but even this may not have noticeable perf diff if the data is in L1 dcache, for example."
The difference in performance between using a plain write and a lazySet has been demonstrated in this post if you look at the difference between the putOrdered and plain writes versions. I expect this difference is far more pronounced in a benchmark than in a full blown system, but the point is it's a demonstrable difference. Similarly here Mr. Brooker demonstrates there is a difference between normal reads and volatile reads. There's a flaw in the experiment in that the volatile writes are making the volatile reads slower, but his point is valid as a demonstration of inhibited optimization (loop is not unrolled, value is not enregistered) and the uncontended read tells that story well.

So... I can't just drop it out of my program?

The TSO model for x86 dictates how your code will behave, if you were writing assembly. If you are writing Java, you are at the mercy of the JIT compiler first, and only then the CPU. If you fail to inform the compiler about your intentions, your instructions may never hit the CPU, or may get there out of order. Your contract is with the JMM first.
It's like punctuation: "helping your uncle Jack off a horse" != "helping your uncle jack off a horse" => punctuation is not a no-op ;-)