So why do we need a volatile keyword
Most of the time I am working on a highload JVM based
backends and for sure I know the keyword
volatile
. If you google the meaning
, then probably you would find something related to
caching. I did too, and it didn't satisfy my
curiosity ,so I had to go deeper, way deeper. So why not
share my findings
here so curious readers can get a better view on the
concurrent
nature of Java.
It starts with caches
Before explaining what volatile does it would be better to understand what problem it solves in the first place. Let's start with a small code sample
Straight forward counter that allows you to get and increment
the number. But what's going on in terms of CPU ? The
silicon mafia
ships around 4 CPU cache levels between CPU and RAM.
First level is located as close to the CPU as possible and
usually has the smallest size(around 16 kilobytes).
The most frequently used variables are stored there.
Each next level has a bigger size but has a higher access
time than each previous level.
Moreover, nowadays most hardware is shipped with multiple
CPU Cores.
CPU cores which may or may not share the same CPU caches.
Other than that, each CPU has its own registers. CPU
Register is a unit of storage that is accessed primarily
by instructions that directly refer to it by name(for
example R0 or register 0)
Now
let's return to the example above that performs a simple
increment this.cnt++
. To read a single
variable means to check if it's present in CPU
level 1 cache, most probably it isn't there if we just
started
the program. Next, the second level cache has to be checked
and
so
on. In a Single Core environment the process would end here
and
Cpu will have to load the variable from RAM. Just for you to
understand the latency it takes for CPU to check each
individual
cache
and RAM here is the nice diagram.
1 ns for level 1 Cache read versus 100 ns for RAM read, looks impressive, isn't it ? . However, for Multi-CPU systems things become more complicated. Before incrementing the counter, CPU has to ask other cores using system bus whether they have a modified version of this counter in their own caches. If one CPU comes back and says , yes I do, then all cores have to be synchronized with this modified counter(some hardware producers actually make it possible to read this value directly from system bus which eliminate the path through main memory). Now CPU can increment the counter.One thing that I misunderstood is that this variable has to be flushed to main memory after each synchronization which isn't correct(will talk about it later). There is a thing called MESI(I know what you think, but it's another Mesi). It's a cache coherence protocol that basically represents the states that a cached variables can have
- Modified - M. The cache line present only in the current cache, and it was modified
- Exclusive - E. The cache line present only in the current cache, and it wasn't modified
- Shared - S. The cache line may be stored in other Core's caches. But all of them matches the value from Main memory
- Invalid - I. When Cache line was marked as Modified(M) or Exclusive(E), the copies of this variable in other caches will be marked as invalid
MESI makes all caches to be coherent.
So now let's google a volatile keyword in Java and the first example you will encounter covers volatile boolean variable
We have a Thread(first) that does some job in an infinite
loop,
and
then let's assume that another Thread from completely
different Core calls tellMeStop
that will
change
the value of boolean variable which will stop the infinite
loop the first thread is running
. If you don't declare this variable as volatile ,then the
first
Thread can cache this boolean variable in CPU cache and
reread it from there, the thing that another thread changes
the variable won't affect the first thread because the
variable
will still be read from local cache. That's how most
articles explain the volatile
keyword, use it
or
otherwise Thread will cache the variable.
But wait, didn't I just told you about this
MESI thing and that CPU can't modify
variable unless all other copies are synchronized ?
How is it even possible that Thread can cache a value with a
cache coherence protocol? That's why I was confused, and it
took me
a
while to get an answer so here it is. The thing is,it's not
even about CPU caches, it's about CPU registers. When
you compile and run your Java application , javac and then
JIT makes a completely valid assumption that your program
will be executed in a single Threaded environment. As a
result, the compiler can make a really great optimization.
Let's
take a look at the class above after compilation
Compiler is smart enough to see
that main thread doesn't call tellMeStop
which
means there is no way a boolean variable can be modified.
This variable will then be stored in the CPU Register. CPU
register is not the same as CPU cache, it's a small storage
which is located
Within the CPU. The access from this
storage in much faster than reading from CPU cache.
But what if we are in a multithreading environment? Still,
even if an application uses multiple threads, compiler is
allowed to make an assumption that those threads won't
communicate with each other using Object variables so each
Thread is allowed to create a local copy of every
Object/primitive they are working with and store them in CPU
registers where those threads are running.
Remember, all caches are coherent but CPU registers are not.
If thread writes something to a variable and it happens that
this variable is still inside of a CPU register then there
is no way other Threads which are running on different CPUs
will see the change.
Is there a way to prevent the compiler storing this variable
in the Register ? Yes , and it's volatile
Volatile
So now when you know what MESI is, what is a CPU Register
and how the program works
within
multiple Cores , let's talk about volatile.
The volatile
keyword is nothing more than
telling compiler that storing variable in the Register is
prohibited. If thread wants to read a variable it has to
read it from cache(unless the variable isn't there in which
case Main memory would be involved), if Thread wants to
modify the variable then it has to write it back to the CPU
cache where it was stored and then MESI will synchronize
caches from other CPU Cores.
Ok, so what If
you annotate all variables as volatile in a single threaded
program ? As I said , the Thread won't be able to store the
variable in the Registers but there is another problem.
Volatile in Java has another property which
prevents instruction reordering.
Let's look at the code above.For a moment , forget about CPU
registers. Two variables are incremented
sequentially, in terms of a processor, foo
will
be loaded into
the cache and incremented by one, but let's say the cache is
full now(which is not , remember level 1 cache is 16 kb and
there are different cache levels, but let's pretend we can
only save one int in the cache) and CPU can't load
bar
into the cache
so OS decided to flush foo
back to main memory.
Then bar
was loaded and incremented by two and
again, we
can't load foo
into the cache to increment it
by
two
because bar
occupied entire cache size so it
has to be flushed. It
involves a lot of rounds between RAM and CPU caches.
Can we make it better ? Of course , compiler will
reorder your instructions to make it faster.
So foo
will be loaded only once and be
incremented by 3
in one instruction.Because for single threaded program it
doesn't
matter whether
you increment foo
by 3 in one instruction or
with two
separate ones, the end result is the same. However,
for multithreading programs it's not the case. If compiler
reorder instructions the second thread can see the
following state entering the method {foo = 3, bar =
0}
which doesn't make any sense if you look at
the source code. So the second property of
volatile
apart from visibility is preventing
instruction reordering
So to answer the previous question, volatile in single threaded environment will prevent reordering which will make your program slower, so my advice is , do not use it as long as you don't use multithreading
Happens Before
Apart from volatile there are other ways to make changes from one thread to be visible to others . All these ways are combined into a set rules called Happens-Before guarantee. Let's take a look at some of them
- Synchronized block/method - All writes inside of a synchronized method/block will be visible to another Thread that starts executing this block/method
- All write to volatile variables happens before read from this volatile variable. And this one is handy because it also works for all nonvolatile writes that happened before write to a single volatile variable
-
All writes happened before
Thread.start()
will be visible to newly created Thread
False Sharing
When I described the MESI protocol, I was talking about this thing called a cache line. The thing is , CPU doesn't load data from RAM in terms of variables, it loads data in terms of cache lines. Each cache line represents a contiguous list of memory with predefined size(for example on Intel architecture the cache line is 64 bytes). False sharing is when different variables lie on the same cache line. The False Sharing is a source of performance degradation. So hoes does it look like ?
Let's take a look at the code snipped above. We have a class with two long variables, both of them are volatile. The instance of this class will take up to 32 bytes(8*2=16 bytes for two long variables, plus 12 bytes for object metadata plus the alignment to 8 bytes). Now we have two Threads running on different cores each calling a method that increments a single counter. So without cache lines it will work the following way:
- First thread cached
firstCnt
in Level 1 cache and started incrementing it - Second thread cached
secondCnt
in Level 1 cache of another CPU core and started incrementing it
However, CPU works in terms of cache lines and because the instance of a class takes only 32 bytes, it will fit into a single cache line(which is 64 bytes). So now let's see how both Threads will work with same cache line. First thread updates the first counter, but because second variable is stored on the same cache line MESI has to update cache of the second Core and same will happen when second thread updates the second variable. It's a big problem which literally kills the performance because it involves a lot of IO between different Cores. How do we deal with it ? Of course using annotations
Meet @Contended
In order to solve the problem above we can put one counter
into a separate cache line, but counter is a long, and it
only takes 8 bytes. To deal with it, we can annotate the
property of class as @Contended
so the single
variable will occupy the whole cache line and with wasted
space
In this case , firstCnt
will lie in a separate
cache line from a secondCnt
and so each Thread
can work on it's counter without having to synchronized with
one another
Finale
I hope that now readers can understand what's going on at a hardware level and how some compiler optimization can ruin the correctness of a program execution. JVM is a great platform that encapsulates all kind of complicated things related to hardware, however, in order to understand performance degradation of your app it's obligatory to understand how these things work on bare metal. CPU caches solve the gap between the fast CPU and slow main memory , and data structures which are nicely aligned in cache lines will always benefit from caching . Or as was said by Scott Meyers "Show me your fancy data structure, and I promise , simple array will be faster". One thing I want to mention is that JVM specification doesn't talk in terms of CPU caches or registers. It specifies the semantics based on an abstract machine. It's up to the Java Runtime Environment to execute the code on a concrete machine in such a way, that it fulfils the specification There are some resources I found useful related to CPU cache in terms of Hardware and how JVM works with them. Check them out