OpenJDK NMethod Entry Barriers
An nmethod, which I’ll refer to as NMethod, is HotSpot’s representation of a natively compiled Java method, the result of a JIT compilation, typically produced by the C1 or C2 compilers. NMethods are interesting because they require cooperation from the Garbage Collector (GC) to execute efficiently while fitting into Java’s automatic memory management model.
In this post, we’ll deep dive into how NMethods are designed and the mechanism used to enable concurrent execution and cooperation between application threads and GC workers. We’ll cover things like barriers (multiple kinds!), x86 and AArch64 assembly, self- and cross-modifying code, data dependencies, and much more. The underlying theory in this area is a gold mine of interesting technical stuff!
To keep the scope focused on the most interesting aspects of NMethods, we’ll mainly look at how NMethod entry barriers work for concurrent GCs, like ZGC. Stop-the-World (STW) GCs like Serial GC and Parallel GC can simplify some of these problems because Java execution is stopped at a safepoint while GC work is being performed.
All code references are from the JDK 27 codebase.
Background on Barriers
Let’s start with a few concepts that help in understanding this area. A central concept in this post is a barrier. We’ll see HotSpot’s NMethod entry barriers and store/load barriers, as well as AArch64’s instruction synchronization barriers (isb).
In HotSpot, a barrier is a small piece of code that runs in connection with some operation. Thus, an NMethod entry barrier is a piece of code that runs when entering an NMethod, and a store or load barrier is a piece of code that runs when storing or loading an oop, a reference to an object on the Java heap. These barriers differ from memory barriers and instruction synchronization barriers (isb), which are usually specific machine instructions.
HotSpot’s barriers usually have two or more paths: a fast-path, a slow-path, and sometimes a medium-path that sits in between. Barriers are designed so that the fast-path is as fast as possible and taken as often as possible. The fast-path is intended to have low overhead, both in terms of execution speed and code size, which often correlate. The slow-path is taken when the fast-path cannot prove that it is safe to continue directly. In that case, we need to perform a more involved operation so that future executions can hopefully go back to executing the fast-path. Some barriers also have a medium-path, which is normally faster than the slow-path, but slower than the fast-path.
Code and structure for HotSpot’s barriers will be presented in x86 or AArch64 assembly or through HotSpot’s macro assembler syntax, which prefixes instructions with two underscores, e.g., __ cmpl_imm32.
NMethod Entry Barriers
An NMethod entry barrier is a piece of code that runs when entering an NMethod. But why do we need a barrier here? Why can’t we just execute the NMethod immediately? For the purposes of this post, there are two reasons we care about, and both have to do with the changing state of the GC:
- Embedded oops:
An NMethod may contain oops embedded directly in the compiled code. Those embedded oops are considered roots into the object graph. When marking through the object graph, all those embedded oops need to be visited to make sure that the objects referenced by those oops are kept alive.
Additionally, the objects the embedded oops refer to may move during relocation. To avoid stale embedded oops inside an NMethod, we need a way to ensure that oops are up-to-date with the current GC state before the NMethod is entered and executed.
- Store/load barriers:
Regardless of whether an NMethod contains embedded oops, it may contain code that loads or stores oop fields, in which case store/load barriers might be required. For ZGC, those generated store/load barriers contain immediate values derived from the current state of the GC cycle. When the GC state changes and the immediates become stale, they need to be patched.
For example, a version of a strong ZGC load barrier in C2 looks roughly like the following on AArch64:
ldr  xRef, [xObj, #field_offset]    // load colored zpointer
tbnz xRef, #remap_bit, .LslowPath   // if bad bit set, go to slow-path
lsr  xRef, xRef, #16                // uncolor: colored zpointer -> address
b    .Ldone
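The shape of that load barrier can be mirrored in plain C++. This is only a sketch under stated assumptions: the bit position in `kBadBit`, the helper `slow_path_heal`, and the function `load_barrier` are hypothetical names chosen for illustration, and the "healing" is reduced to clearing one bit.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for illustration only: the low 16 bits carry color
// (metadata) bits and the address sits above them.
constexpr unsigned kColorShift = 16;
// Assumed position of the single bit that the fast-path tests.
constexpr uint64_t kBadBit = uint64_t{1} << 3;

// Stand-in for the slow-path. A real barrier would heal the reference;
// the sketch simply clears the bad bit.
inline uint64_t slow_path_heal(uint64_t colored) {
    return colored & ~kBadBit;
}

// Mirrors the assembly: test one bit, optionally take the slow-path,
// then shift right to turn the colored pointer into an address.
inline uint64_t load_barrier(uint64_t colored, bool* took_slow_path) {
    assert(took_slow_path != nullptr);
    *took_slow_path = (colored & kBadBit) != 0;  // corresponds to tbnz
    if (*took_slow_path) {
        colored = slow_path_heal(colored);
    }
    return colored >> kColorShift;               // corresponds to lsr #16
}
```

A pointer without the assumed bad bit goes straight to the shift, which is the whole fast-path: one test and one shift after the load.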
Guard Value
We’ve established that to execute an NMethod, any embedded oops and store/load barriers need to be up-to-date with the current GC state. Going back to our definition of barriers, we want the fast-path to be as fast as possible, so we need a minimal check that tells us if the NMethod is safe to enter or not. This is where the guard value comes into play.
An NMethod is always in one of two states: armed or disarmed. If an NMethod is armed, that means that extra work needs to be done to fix embedded oops and store/load barriers, which is what we refer to as the slow-path. If the NMethod is disarmed, we don’t need to fix embedded oops and store/load barriers and can usually just enter the NMethod directly via the fast-path.
Whether an NMethod is armed or disarmed is encoded in a 32-bit value (4 bytes) inside the NMethod entry barrier itself. This 32-bit value is referred to as the guard value. The fast-path compares the guard value with a thread-local disarmed guard value. If they match, we may enter the NMethod and execute it. If not, the slow-path determines what work is required to continue executing the NMethod safely.
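To make the two paths concrete, here is a small C++ model of the check. It is a sketch, not HotSpot code: `NMethodModel`, `ThreadModel`, and `enter` are hypothetical names, and the slow-path work is reduced to a counter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an nmethod: just the guard value and a
// counter for how often the slow-path had to process it.
struct NMethodModel {
    uint32_t guard = 0;      // NMethod-local guard value
    int slow_path_runs = 0;  // number of slow-path visits (illustrative)
};

// Hypothetical stand-in for a thread's view of the disarmed value.
struct ThreadModel {
    uint32_t disarmed_guard = 0;
};

// Returns true if the fast-path was taken, false if the slow-path ran.
inline bool enter(NMethodModel& nm, const ThreadModel& thread) {
    if (nm.guard == thread.disarmed_guard) {
        return true;  // fast-path: values match, enter directly
    }
    // Slow-path: a real VM would fix embedded oops and barrier immediates
    // here. The sketch only counts the visit and disarms the nmethod.
    nm.slow_path_runs++;
    nm.guard = thread.disarmed_guard;
    assert(nm.guard == thread.disarmed_guard);  // now disarmed for this thread
    return false;
}
```

Calling `enter` twice on the same armed model takes the slow-path once and the fast-path afterwards, which is the behavior the guard value is meant to provide.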
The thread-local disarmed guard value for most GCs is stored at runtime/thread.hpp#L123:
// On AArch64, the high order 32 bits are used by a "patching epoch" number
// which reflects if this thread has executed the required fences, after
// an nmethod gets disarmed. The low order 32 bits denote the disarmed value.
uint64_t _nmethod_disarmed_guard_value;
ZGC has its own storage for the thread-local disarmed guard value, which encodes the epoch and guard value the same way as the shared value. Since ZGC only supports 64-bit platforms, the size of the uintptr_t is always 64 bits. See gc/z/zThreadLocalData.hpp#L42:
uintptr_t _nmethod_disarmed;
We’ll go into more detail about how the epoch number works in #Epoch Mechanism. For now, the important part is that the low 32 bits hold the thread-local disarmed guard value.
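The layout described in the comment above (epoch in the high 32 bits, disarmed value in the low 32 bits) can be written down as a few helpers. The names `pack`, `epoch_of`, and `disarmed_of` are hypothetical and exist only for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Pack a patching epoch (high 32 bits) and a disarmed guard value
// (low 32 bits) into one 64-bit word.
constexpr uint64_t pack(uint32_t epoch, uint32_t disarmed) {
    return (static_cast<uint64_t>(epoch) << 32) | disarmed;
}

// Extract the patching epoch from the high 32 bits.
constexpr uint32_t epoch_of(uint64_t word) {
    return static_cast<uint32_t>(word >> 32);
}

// Extract the disarmed guard value from the low 32 bits.
constexpr uint32_t disarmed_of(uint64_t word) {
    return static_cast<uint32_t>(word);
}

static_assert(pack(1, 2) == 0x0000000100000002ULL, "epoch high, guard low");
```

On a platform that does not use the epoch, the high half would simply stay zero and only the low 32 bits take part in the comparison.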
ZGC Phase Changes
Now that we know how armed and disarmed states are represented, let’s look at what causes an NMethod to become armed in the first place.
When ZGC enters the Mark or Relocate phase, it changes the current good color during a short stop-the-world pause, a safepoint. For NMethod entry barriers, the global disarmed guard value is derived from the low 32 bits of ZPointerStoreGoodMask. Changing the good color therefore changes the disarmed value. Existing NMethods still contain their previous guard value, so their entry barriers no longer match the global value and implicitly become armed. The thread-local disarmed value is updated lazily after the safepoint, making sure that the guard-value comparison will fail for unpatched NMethods.
So why are NMethods armed at the start of the Mark and Relocate phases? The reason ties back into our earlier definition in #NMethod Entry Barriers, which has to do with embedded oops and store/load barriers. Store/load barriers compare metadata bits in oops to the current GC state, the good color in ZGC, and since that color changes in a phase transition like this, they need to be updated. Embedded oops have different requirements depending on the phase. To reiterate:
- In the Mark phase, objects considered reachable from a set of roots are marked as “alive”. Embedded oops in NMethods make up part of the root set, so they must be visited to traverse the object graph during marking.
- In the Relocate phase, the heap is compacted to free up memory by moving (a subset of) objects that were marked as reachable in the Mark phase into compact regions in the Java heap. This means that objects might receive new locations and old oops become stale and need to be updated, a process referred to as remapping.
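The implicit arming described in this section can be sketched in a few lines of C++. The mask values below are made up, and `disarmed_value_from`, `is_armed`, and `count_armed` are hypothetical helpers. The only thing taken from the text is that the disarmed value is the low 32 bits of a 64-bit good mask.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Derive the 32-bit disarmed value from a 64-bit good mask by keeping
// its low 32 bits.
inline uint32_t disarmed_value_from(uint64_t store_good_mask) {
    return static_cast<uint32_t>(store_good_mask);
}

// An nmethod is armed whenever its guard differs from the disarmed value.
inline bool is_armed(uint32_t nmethod_guard, uint32_t disarmed) {
    return nmethod_guard != disarmed;
}

// Count how many guards no longer match after the good mask changed.
// No nmethod is written to: the mismatch alone arms it.
inline int count_armed(const std::vector<uint32_t>& guards, uint64_t good_mask) {
    const uint32_t disarmed = disarmed_value_from(good_mask);
    int armed = 0;
    for (uint32_t guard : guards) {
        if (is_armed(guard, disarmed)) {
            armed++;
        }
    }
    return armed;
}
```

The point of the model is that arming costs nothing per NMethod: changing one value invalidates every guard that still holds the previous one.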
Assembly Code Structure
Up to now we’ve discussed the guard value comparison in the fast-path, patching of embedded oops and store/load barriers in the slow-path, as well as an overview of when NMethods are armed. Now we’ll move on to what the NMethod entry barrier looks like in assembly code, focusing on how the guard value is encoded, the body of the entry barrier, and which optimizations reduce the footprint of the fast-path.
Guard Value Encoding
On x86, we leverage the fact that instructions are variably sized by storing the guard value as an immediate inside a cmpl instruction.
__ cmpl_imm32(disarmed_addr, 0); // guard value encoded as an immediate inside the instruction
It’s a little bit trickier on AArch64 since instructions have a fixed size of 32 bits (4 bytes), meaning we can’t fit the entire 32-bit guard value as an immediate in a single instruction like on x86. To get around this, the guard value is emitted directly into the instruction stream. This requires some extra utility to get right, which we’ll get into later.
__ emit_int32(0); // guard value
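In both encodings the guard ends up as four bytes somewhere inside the generated code. A rough C++ model of reading and patching those bytes is shown below. The buffer, the offset parameter, and the helpers `read_guard` and `write_guard` are assumptions for illustration. The `memcpy` is a simplification as well, since a real VM has to write the value atomically.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Read a 32-bit guard value stored at a byte offset inside a code buffer.
inline uint32_t read_guard(const std::vector<uint8_t>& code, std::size_t guard_offset) {
    assert(guard_offset + sizeof(uint32_t) <= code.size());
    uint32_t value = 0;
    std::memcpy(&value, code.data() + guard_offset, sizeof(value));
    return value;
}

// Overwrite the 32-bit guard value in place, leaving neighbouring bytes alone.
// Not atomic: this only models where the bytes live, not how they are published.
inline void write_guard(std::vector<uint8_t>& code, std::size_t guard_offset, uint32_t value) {
    assert(guard_offset + sizeof(uint32_t) <= code.size());
    std::memcpy(code.data() + guard_offset, &value, sizeof(value));
}
```

Whether those four bytes are an immediate operand (x86) or a standalone data word (AArch64) changes how the barrier reaches them, not how they are patched in this model.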
Entry Barrier
The logical pattern of the x86 NMethod entry barrier is quite straightforward: we compare the NMethod’s guard value to the thread’s disarmed guard value, and if they are not equal, we call StubRoutines::method_entry_barrier(). This is a forwarding call to the slow-path in the runtime, which likely patches the NMethod and disarms it.
Address disarmed_addr(thread, in_bytes(...));
__ cmpl_imm32(disarmed_addr, 0);
__ jccb(Assembler::equal, done);
__ call(RuntimeAddress(StubRoutines::method_entry_barrier()));
__ bind(done);
The corresponding AArch64 code follows the same high-level shape, but the details are a little different.
The load of the guard value is omitted in this example to make the code more approachable for now. We’ll cover how the guard value is loaded along with the epoch in #Epoch Mechanism. A notable difference compared to x86 is that the guard value is emitted as data inside the instruction stream. Since that 32-bit value is not an instruction, both the fast-path and the slow-path have to branch around it.
// guard value combined with epoch is loaded into rscratch1
Address thread_disarmed_and_epoch_addr(rthread, in_bytes(...));
__ ldr(rscratch2, thread_disarmed_and_epoch_addr);
__ cmp(rscratch1, rscratch2);
__ br(condition, skip_barrier);
__ lea(rscratch1, RuntimeAddress(StubRoutines::method_entry_barrier()));
__ blr(rscratch1);
__ b(skip_barrier);
__ bind(local_guard);
__ emit_int32(0);
__ bind(skip_barrier);
Out-of-Line Stub
In the assembly snippets above, the forwarding slow-path call is inlined into the NMethod entry barrier.
The NMethod entry barrier is hot code, run every time an NMethod is entered. The slow-path code is rarely taken compared to the fast-path, so keeping it inlined here results in worse instruction cache performance. On AArch64, this is especially visible because the guard value is emitted in the instruction stream, requiring us to branch around it in both the fast-path and slow-path.
To mitigate the impact of the extra code, the slow-path code is moved out-of-line to a stub for NMethods emitted by the C2 compiler. For compilations by C1 and the runtime, we still keep the slow-path code inline, since C1 code and runtime-generated wrappers are not considered “hot” enough to justify the extra plumbing of the stub.
The stub for the slow-path of the NMethod entry barrier is put inside the NMethod’s stub area, meaning each C2 NMethod gets its own out-of-line slow-path. Here is what the slow-path code in the stub looks like on AArch64 (cpu/aarch64/c2_CodeStubs_aarch64.cpp#L55-L64):
void C2EntryBarrierStub::emit(C2_MacroAssembler& masm) {
__ bind(entry()); // slow_path label
__ lea(rscratch1, RuntimeAddress(StubRoutines::method_entry_barrier()));
__ blr(rscratch1);
__ b(continuation()); // continuation label
__ bind(guard());
__ relocate(entry_guard_Relocation::spec());
__ emit_int32(0); // nmethod guard value
}
The effect of moving the slow-path code out-of-line is modest on x86, but on AArch64 the code is noticeably smaller, resulting in a smaller fast-path footprint and better instruction-cache behavior. With the stub in place, the inline part of the AArch64 barrier shrinks to the following:
// guard value combined with epoch is loaded into rscratch1
Address thread_disarmed_and_epoch_addr(rthread, in_bytes(...));
__ ldr(rscratch2, thread_disarmed_and_epoch_addr);
__ cmp(rscratch1, rscratch2);
__ br(condition, *slow_path);
__ bind(*continuation);
Modifying Code
There are two main reasons for the simpler design of the NMethod entry barrier on x86. The first one is that instructions can be variably sized, allowing a more efficient encoding of the guard value. Second, x86 processors generally provide instruction- to data-cache coherency, which AArch64 processors typically don’t. Let’s shed some more light on the second reason.
When the slow-path is entered to patch an NMethod, a per-NMethod lock is taken to make sure only one thread patches the NMethod. After patching embedded oops and store/load barriers, the changes are published via a releasing store that writes the NMethod-local disarmed guard value.
On x86, a naturally atomic write of the disarmed guard value is observed either in its entirety or not at all. If a thread observes the new disarmed value, it will also observe all preceding edits to the instruction stream, because the release store orders all preceding stores before it. Since x86 provides coherent instruction and data caches, writes to memory (data-cache) made while patching the NMethod are reflected in the instruction stream (instruction-cache) of that thread from that point onward.
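The publish step has a direct analogue in the C++ memory model. The sketch below is an analogy for ordinary data only. It says nothing about instruction fetch, and the acquire load on the reader side is what C++ requires rather than a description of the x86 entry barrier. `PublishedCode`, `patch_and_disarm`, `try_read`, and the two constants are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kArmed = 1;     // arbitrary values for the sketch
constexpr uint32_t kDisarmed = 2;

struct PublishedCode {
    int patched_value = 0;                // stands in for patched oops and immediates
    std::atomic<uint32_t> guard{kArmed};  // stands in for the NMethod-local guard
};

// Writer side: make the edits first, then publish them with a release store.
inline void patch_and_disarm(PublishedCode& code) {
    code.patched_value = 42;
    code.guard.store(kDisarmed, std::memory_order_release);
}

// Reader side: if the acquire load sees the disarmed guard, the earlier
// write to patched_value is visible as well. Returns false while armed.
inline bool try_read(const PublishedCode& code, int* out) {
    assert(out != nullptr);
    if (code.guard.load(std::memory_order_acquire) != kDisarmed) {
        return false;
    }
    *out = code.patched_value;
    return true;
}
```

If `patch_and_disarm` runs on one thread and `try_read` on another, a reader that gets `true` is guaranteed to read 42, which is the data-side version of "observe the guard, observe the edits".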
AArch64 is again trickier: we generally cannot rely on changes becoming observable to other threads until they execute an instruction synchronization barrier (isb), even though the write of the disarmed guard value is still atomic. A thread might observe some of the edited instructions or none at all. For example, it might observe the disarmed guard value but none of the other changes, or the other way around.
The first part to get this right on AArch64 is for the thread making the patches to take care of cache maintenance, forcing coherency between the instruction and data caches. Such operations generally end with an isb, allowing the patching thread to observe the effects of the cache maintenance operations it just performed. For a detailed explanation of the mechanism(s) behind this on AArch64, I refer you to my blog post “JIT Compilers and Cache Coherency”.
For other threads to observe the instruction changes made by the patching thread, they also need to execute an isb, which is where the epoch mechanism comes into play. The epoch is stored in a global variable, and also in a thread-local value. The global epoch represents the current epoch and is incremented just before disarming an NMethod. The thread-local epoch represents the latest epoch for which the thread has executed an isb.
On AArch64, the guard value comparison in the fast-path of the NMethod entry barrier also compares the global epoch to the thread-local epoch. Even if a thread observes only the disarmed guard value and no other changes, the global epoch won’t match the thread-local epoch, so the thread will execute an isb to “acquire” earlier patches. This allows the thread to observe all instruction changes to the NMethod.
The theory in this area is quite complex and makes up a pretty narrow problem surface, which makes it harder to reason about this and get it right. My colleagues Erik Österlund and John Rose have a paper titled “How HotSpot cross-modifies code – a summary” detailing the theory behind this. Beware that it is quite heavy, but an awesome read if you want to understand this area on a deeper level.
Epoch Mechanism
Executing an isb instruction on every NMethod entry would be expensive for a Java thread. To get around this, a single isb can “acquire” the patched instructions of all NMethods patched up to that point. This amortization is the main benefit of the epoch mechanism.
Looking back at the thread-local disarmed guard value in #Guard Value, it stores both the patching epoch and the disarmed guard value, in the high 32 bits and low 32 bits respectively. When comparing the NMethod’s local guard value with the thread-local disarmed guard value, the local guard value is combined with the global epoch. After combining both values, we compare the result against the thread-local disarmed guard value.
With the addition of the epoch in the NMethod entry barrier fast-path comparison, the local guard value might be disarmed, but the epoch might be old/stale, in which case the comparison fails. When that happens, the slow-path is taken primarily to perform the required synchronization, not necessarily to patch the full NMethod. This still has the same effect as on x86, only allowing disarmed NMethods to be entered, but also requires the thread entering the NMethod to have acquired all instruction edits up to the current epoch.
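A small C++ model shows how the combined comparison amortizes the synchronization. Everything here is a sketch: `EpochThread`, `combine`, and `enter_with_epoch` are hypothetical, the isb is represented by a counter, and patching is reduced to overwriting the guard.

```cpp
#include <cassert>
#include <cstdint>

struct EpochThread {
    uint32_t epoch = 0;     // latest epoch this thread has synchronized with
    uint32_t disarmed = 0;  // thread-local disarmed guard value
    int isb_count = 0;      // how often the modelled isb was executed
};

// Epoch in the high 32 bits, guard value in the low 32 bits.
constexpr uint64_t combine(uint32_t epoch, uint32_t guard) {
    return (static_cast<uint64_t>(epoch) << 32) | guard;
}

// Returns true if the fast-path was taken, false if the slow-path ran.
inline bool enter_with_epoch(EpochThread& thread, uint32_t& nmethod_guard,
                             uint32_t global_epoch) {
    if (combine(global_epoch, nmethod_guard) ==
        combine(thread.epoch, thread.disarmed)) {
        return true;  // guard disarmed and epoch current
    }
    // Slow-path: synchronize (an isb in real code), remember the epoch,
    // and disarm the nmethod if it was still armed (patching omitted).
    thread.isb_count++;
    thread.epoch = global_epoch;
    nmethod_guard = thread.disarmed;
    return false;
}
```

After one slow-path visit the thread has caught up with the global epoch, so every other already-disarmed NMethod passes the fast-path without another isb until the epoch advances again.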
Guard Value and Epoch Combination
On AArch64, the local guard value is combined with the global patching epoch at the very start of the NMethod entry barrier.
The combination applies a cool trick of embedding an artificial data dependency to order the guard load before the epoch load. Since rscratch1 was loaded with ldrw, its upper 32 bits are zero, so rscratch1 shifted right by 32 is always zero. The computed address is therefore unchanged, but it still depends on the previously loaded guard value. This forces the guard load to be ordered before the epoch load without changing the address being loaded from.
__ ldrw(rscratch1, *guard);
__ lea(rscratch2, ExternalAddress((address)&_patching_epoch));
// Embed an artificial data dependency to order the guard load
// before the epoch load.
__ orr(rscratch2, rscratch2, rscratch1, Assembler::LSR, 32);
// Read the global epoch value.
__ ldrw(rscratch2, rscratch2);
// Combine the guard value (low order) with the epoch value (high order).
__ orr(rscratch1, rscratch1, rscratch2, Assembler::LSL, 32);
// rscratch1 now contains the guard value combined with the patching epoch
The guard load must be ordered before the global epoch load because the writer publishes them in the opposite order: it increments the global epoch before release-storing the disarmed local guard value in the NMethod. That way, if the reader observes a disarmed guard value, it will also observe an updated global epoch. Without this guarantee, the reader could observe a disarmed guard value and a stale epoch value, which could lead to it proceeding into the NMethod without performing the necessary isb.
The artificial data dependency does not pay the full cost of an acquire fence, which would be too costly here in the fast-path. Instead, we get the narrower guarantee needed here: that the load of the guard value is ordered before the load of the global epoch.
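The arithmetic behind the two orr instructions can be checked in C++. This sketch shows only the values involved. A C++ compiler is free to fold the shift to zero and drop the dependency, so it is not a way to get the hardware ordering from C++. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors "orr addr, addr, guard, LSR #32" where guard was loaded with a
// 32-bit load and therefore zero-extended to 64 bits.
inline uint64_t address_with_dependency(uint64_t epoch_addr, uint32_t guard32) {
    const uint64_t guard = guard32;     // upper 32 bits are zero
    return epoch_addr | (guard >> 32);  // ORs in zero: address unchanged
}

// Mirrors "orr guard, guard, epoch, LSL #32": guard value in the low
// half, epoch in the high half.
inline uint64_t combine_guard_and_epoch(uint32_t guard32, uint32_t epoch32) {
    return static_cast<uint64_t>(guard32) |
           (static_cast<uint64_t>(epoch32) << 32);
}
```

Even a guard with all 32 bits set leaves the address untouched, which is why the trick is safe for any guard value.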
Interaction with the Garbage Collector
Now that we have a solid picture of how NMethod entry barriers work with the help of the guard value, epoch mechanism, and how they tie NMethod entry barriers into the GC, let’s look at how Java threads and GC workers cooperate to process every NMethod.
NMethod Tracking
When NMethods are created, they are registered in the GC by calling CollectedHeap::register_nmethod(...), which forwards a call to the specific GC implementation, in this case ZGC, which records it in a ZNMethodTable. The ZNMethodTable keeps track of NMethods and allows for concurrent traversal and handling of removal/addition of NMethods during iteration.
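As a rough mental model of such a table, here is a tiny registry in C++. It is not how ZNMethodTable is implemented: `NMethodRegistry` and `TrackedNMethod` are hypothetical, and the snapshot-under-a-lock approach is a simplification I picked to keep iteration safe while entries are added or removed.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <vector>

struct TrackedNMethod {
    int id;
};

class NMethodRegistry {
public:
    void register_nmethod(TrackedNMethod* nm) {
        assert(nm != nullptr);
        std::lock_guard<std::mutex> lock(_mutex);
        _entries.push_back(nm);
    }

    void unregister_nmethod(TrackedNMethod* nm) {
        std::lock_guard<std::mutex> lock(_mutex);
        _entries.erase(std::remove(_entries.begin(), _entries.end(), nm),
                       _entries.end());
    }

    // Visit a snapshot of the entries so that registrations and removals
    // during the walk cannot invalidate the iteration.
    template <typename Visitor>
    void nmethods_do(Visitor visit) const {
        std::vector<TrackedNMethod*> snapshot;
        {
            std::lock_guard<std::mutex> lock(_mutex);
            snapshot = _entries;
        }
        for (TrackedNMethod* nm : snapshot) {
            visit(nm);
        }
    }

private:
    mutable std::mutex _mutex;
    std::vector<TrackedNMethod*> _entries;
};
```

A snapshot can still contain an entry that was unregistered mid-walk, so a real table also has to deal with the lifetime of removed NMethods, which this sketch ignores.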
Visit All Registered NMethods
Before finishing the Mark phase, the entire object graph must have been visited so that we know what objects are reachable (alive), which is a prerequisite for continuing to the Relocate phase. In the Relocate phase, objects are moved, which requires oops to be remapped. To remap all oops, we generally need to visit all of them, which is an expensive operation. Instead, remapping is done lazily and full remapping is deferred until the next Mark phase, when the object graph is visited as part of marking.
Between the transition into the Mark or Relocate phase and the completion of marking or remapping, NMethods may (likely will) be executed while ZGC is running concurrently with Java threads. Java threads help by taking the slow-path when the guard value comparison fails, to allow them to continue executing.
Some NMethods will be handled by Java threads, and some “cold” methods that have not been executed need to be handled by the GC. To make sure that all relevant registered NMethods are visited by the time the Mark phase ends, which includes both marking and remapping oops, ZGC visits the cold NMethods in a bulk traversal.
The call chain of the bulk traversal looks like this for the young generation, and is similar for the old generation but with some extra work.
ZGenerationYoung::collect
  ZGenerationYoung::concurrent_mark
    ZGenerationYoung::mark_roots
      ZMark::mark_young_roots
        ZMarkYoungRootsTask::work
          ZRootsIteratorAllUncolored::apply
            ZNMethodsIteratorImpl::apply
              ZNMethod::nmethods_do
                ZNMethodTable::nmethods_do
                  ZNMethodTableIteration::nmethods_do
                    ZMarkYoungNMethodClosure::do_nmethod
                      // Process the NMethod for young marking
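The division of labor between Java threads and the bulk traversal can be modelled in C++. This is a sketch under assumptions: `CoopNMethod`, `process_if_armed`, `java_thread_enter`, and `gc_bulk_traversal` are hypothetical names, and "processing" is a counter standing in for marking or remapping oops and patching barriers.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

struct CoopNMethod {
    std::atomic<uint32_t> guard{0};  // NMethod-local guard value
    int times_processed = 0;         // protected by lock
    std::mutex lock;                 // per-nmethod lock
};

// Shared by both sides: whoever takes the lock first does the work,
// the other side finds the nmethod already disarmed and returns.
inline void process_if_armed(CoopNMethod& nm, uint32_t disarmed) {
    std::lock_guard<std::mutex> holder(nm.lock);
    if (nm.guard.load(std::memory_order_relaxed) == disarmed) {
        return;
    }
    nm.times_processed++;
    nm.guard.store(disarmed, std::memory_order_release);  // publish and disarm
}

// Java thread: the guard comparison is the fast-path, processing is the slow-path.
inline void java_thread_enter(CoopNMethod& nm, uint32_t disarmed) {
    if (nm.guard.load(std::memory_order_acquire) != disarmed) {
        process_if_armed(nm, disarmed);
    }
}

// GC worker: walk every registered nmethod and handle the ones that no
// Java thread has entered since the phase change.
inline void gc_bulk_traversal(const std::vector<CoopNMethod*>& all, uint32_t disarmed) {
    for (CoopNMethod* nm : all) {
        assert(nm != nullptr);
        process_if_armed(*nm, disarmed);
    }
}
```

In this model a hot NMethod is processed by the first Java thread that enters it and a cold one by the traversal, and each is processed exactly once per phase change.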
Summary
An NMethod entry barrier is a small piece of code that is executed before entering an NMethod, a Java method compiled to native code by one of HotSpot’s JIT compilers. The entry barrier primarily makes sure that object references (oops) and store/load barriers are up to date by comparing an NMethod-local guard value with a thread-local disarmed guard value. If the two values are equal, the NMethod can be entered safely. If not, a slow-path must be taken to patch oops and store/load barriers so that the NMethod becomes safe to enter, which also disarms the NMethod by updating the NMethod-local guard.
OpenJDK/HotSpot uses different approaches for different platforms (x86 and AArch64 among others). On a platform with weaker guarantees like AArch64, more complex approaches are generally needed. The NMethod entry barriers are designed to make the fast-path as fast as possible, while also enabling concurrent execution of Java threads and GC workers, for the GCs that need it. They also let Java threads cooperate concurrently in patching NMethods, with the GC taking care of the rest in a bulk traversal at fixed points.
JIT compilation is a really tricky subject to get right, and to keep things sane, I’ve omitted some details in favor of presenting the big picture of NMethod entry barriers as clearly as possible.
Thank you for reading! I hope you found this deep dive interesting and learned something new.