JIT Compilers and Cache Coherency
Cache coherency is notoriously tricky to reason about, and even trickier to debug. With that challenge in mind, we’ll take a deep dive into the cache coherency considerations in the design and development of a JIT compiler I’ve been working on.
The JIT compiler targets the Brainfuck programming language, which is intentionally minimal, letting me focus on the interesting aspects of implementing it, such as compiler design and cache coherency. Brainfuck has a very simple instruction set and memory model, consisting of just eight single-character instructions. The full specification and some background can be found on its Wikipedia page: https://en.wikipedia.org/wiki/Brainfuck.
The primary goal of creating my own JIT compiler is to learn by doing, failing, debugging by hand, and through that develop intuition and deepen my understanding through experience. If you’re interested, the code is available here: https://github.com/joelsiks/bfjit
Background
My implementation, dubbed bfjit, is an optimizing interpreter and Just-In-Time (JIT) compiler, with an Ahead-Of-Time (AOT) mode and support for tiered compilation. It supports both the x86 and AArch64 platforms, and can emit native code for loops in JIT mode, or for the entire program in AOT mode.
In JIT mode, execution starts in the interpreter, which means instructions can begin executing quickly. When loops become “hot” enough, a compilation request is sent to a concurrent thread that compiles the loop asynchronously into a compiled method containing native instructions. Once the loop is compiled, the compiler thread installs a reference to the compiled method, which the interpreter picks up and transfers execution to. This installation step, as we’ll see, is where cache coherency becomes critical.
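The installation step can be sketched with C11 atomics. This is a hypothetical illustration, not bfjit’s actual code: the slot, the type names, and the one-slot-per-loop assumption are mine. The point is the pairing of a release store with an acquire load, so that everything written before publication is ordered before the pointer becomes visible to the interpreter:

```c
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical compiled-method handle; bfjit's real types differ.
typedef struct {
    void (*entry)(unsigned char *tape); // native entry point
} compiled_method;

// One publication slot per hot loop (assumed for illustration).
static _Atomic(compiled_method *) loop_slot;

// Compiler thread: called after code emission (and, as we'll see,
// after the required cache maintenance).
static void install(compiled_method *m) {
    atomic_store_explicit(&loop_slot, m, memory_order_release);
}

// Interpreter: polled on each loop entry; NULL means "keep interpreting".
static compiled_method *lookup(void) {
    return atomic_load_explicit(&loop_slot, memory_order_acquire);
}
```

The release/acquire pair orders the ordinary stores that wrote the method’s contents, but as the rest of the post shows, on AArch64 it does not by itself make the emitted instructions visible to instruction fetch.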
In AOT mode, the same compilation pipeline is used, but code is generated for the entire program synchronously up front before execution begins.
To bridge the interpreter and JIT compiler, Brainfuck instructions are parsed into an intermediate representation, implemented as a simple AST. This representation is shared between the interpreter and JIT, with the JIT emitting instructions for each node in the AST. This allows execution to move between interpreted and compiled code with minimal state transition. The intermediate representation also makes it straightforward to implement optimizations by applying passes over the AST.
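As a rough sketch of what such an IR can look like (the names and layout here are illustrative, not bfjit’s actual types), each Brainfuck construct maps to one node kind, with runs of `>`/`<` and `+`/`-` folded into a single offset or delta by an optimization pass:

```c
// Illustrative IR node for a Brainfuck AST; bfjit's actual
// representation may differ.
typedef enum {
    NODE_MOVE, // runs of '>' and '<' folded into one pointer offset
    NODE_ADD,  // runs of '+' and '-' folded into one cell delta
    NODE_OUT,  // '.'
    NODE_IN,   // ','
    NODE_LOOP  // '[' ... ']', body held as a child list
} node_kind;

typedef struct node {
    node_kind kind;
    int value;         // offset (MOVE) or delta (ADD)
    struct node *body; // first child, used by NODE_LOOP
    struct node *next; // next sibling in the instruction sequence
} node;
```

The interpreter can walk this sibling list directly, while the JIT emits native instructions for each node, so both agree on program structure.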
Development and Testing
Initially, bfjit started with only an interpreter, to have a baseline mode that is known to work correctly and against which JIT-compiled execution can be compared. I am running Linux x86_64 locally, so I started with the x86 implementation of the JIT compiler.
I then refactored the code to support multiple platforms by adding platform detection and code selection via the Makefile. This allowed me to move on to the AArch64 implementation. To run AArch64 code on my local machine, I used QEMU, more specifically user-mode emulation via the qemu-aarch64 binary. What is important to note about user-mode emulation is that it translates instructions for the target platform into native instructions. This means that I’m still always executing on my x86 system, meaning AArch64-specific cache behavior is never actually observed.
With both the x86 and AArch64 implementations done, I tested them for correctness and performance using a Brainfuck program that renders a Mandelbrot set to the terminal. Everything looked good on my Linux x86_64 machine, both natively and when cross-running via qemu-aarch64. However, from my experience working on the OpenJDK HotSpot implementation, I knew that memory models and overall design differ significantly between x86 and AArch64, so I was eager to test the AArch64 implementation on real hardware.
All my machines at home are Intel or AMD x86_64, except for my phone, a Samsung S22+ running an Exynos 2200 based on the ARMv9.0-A architecture. After setting up SSH, installing a C++ compiler, and compiling bfjit on the phone via Termux, everything seemed to be working. However, roughly every tenth run, I would get a SIGILL and the program would crash:
$ ./bfjit "$(cat programs/mandelbrot.b)"
Illegal instruction
Illegal Instruction, What?
The “Illegal instruction” message alone doesn’t tell us much, so I installed a signal handler to catch SIGILL and print more information about what’s going on.
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static void sigill_handler(int sig, siginfo_t *info, void *ctx) {
    ucontext_t *uc = (ucontext_t *)ctx;
    printf("SIGILL at PC: %p, fault addr: %p\n",
           (void *)uc->uc_mcontext.pc,
           info->si_addr);

    // Print the word at the faulting PC so we can see what was fetched
    uint32_t *pc = (uint32_t *)uc->uc_mcontext.pc;
    printf("Fetched instruction word: 0x%08x\n", *pc);
    exit(1);
}
Running bfjit again with the signal handler installed, and in debug mode so that we get info on compiled methods, we see the following:
$ ./bfjit --debug "$(cat programs/mandelbrot.b)"
...
Compiled method 0x72bc417120 size 572:
e1 03 00 aa 20 00 40 39
60 00 00 35 e0 03 01 aa
...
SIGILL at PC: 0x72bc417120, fault addr: 0x72bc417120
Fetched instruction word: 0xaa0003e1
Disassembling the observed instruction, we see an orr, which I know is what the “mov (register)” instruction is aliased to. This is entirely expected, since the first instruction in any compiled method moves the first and only argument of the compiled method from x0 into x1. But this instruction isn’t “illegal” at all, so what’s going on here?
0: aa0003e1 orr x1, xzr, x0, lsl #0
; Which is actually
mov x1, x0
Cache (In)Coherency
The problem is cache coherency! What we’ve observed above is the real-world effect of incoherent data and instruction caches.
In most modern hardware, the L1 cache is split into two separate caches, one for data and one for instructions, often referred to as L1d (data cache) and L1i (instruction cache). This is referred to as a (Modified) Harvard Architecture. The boundary where data and instruction caches are guaranteed to see the same copy of a memory location (on a specific core) is called the “Point of Unification”, which is often the L2 cache.
On an x86 system, the hardware guarantees that the data and instruction caches are coherent in its microarchitecture, requiring no intervention from software. However, on AArch64 systems, cache coherency between data and instruction caches is not guaranteed in hardware and might require software intervention.
Incoherent caches are really only a problem when you have self-modifying code. From the hardware’s perspective, a JIT compiler writing instructions to memory that will later be executed is exactly that.
The problem is that instructions are fetched from memory and cached in the instruction cache, while the instructions the JIT compiler writes to memory land in the data cache. This creates two versions of the instructions: a stale one in the instruction cache and the freshly written one in the data cache. When the CPU continues executing, it uses the stale bytes from the instruction cache, which likely decode to an illegal instruction, instead of the new instructions sitting in the data cache.
This explains why we’re getting a SIGILL above. It also explains why, even though we executed an “illegal” instruction, we see a valid instruction when reading the memory: the read goes through the data cache, not the instruction cache.
Ensure Cache Coherency
To fix this we need a way to make sure that when we write new instructions to memory (e.g., from the JIT compiler), the instruction cache also sees that write, so the new instructions get executed. From C/C++ this is straightforward with GCC’s __builtin___clear_cache(begin, end), which flushes the relevant cache(s) for the given region of memory.
On systems that do not need cache flushes, __builtin___clear_cache() does nothing. See gcc/builtins.cc, where CLEAR_INSN_CACHE is undefined for x86 and defined for AArch64.
AArch64 Cache Coherency
As we’ve established by now, AArch64 requires more work to achieve cache coherency. To explain this fully, we’ll walk through each step of GCC’s implementation of __builtin___clear_cache(), which can be found in libgcc/config/aarch64/sync-cache.c. It starts by reading the system register “CTR_EL0”, which: “provides information about the architecture of the caches” (see ARM documentation).
mrs x0, CTR_EL0
“CTR” stands for “Cache Type Register” and “EL0” for “exception level 0”, which is the lowest exception level for unprivileged operations, used by user-space programs.
ARM provides a good visual of the CTR_EL0 register in their documentation, which shows the fieldset of the register. What’s interesting to us are four things: the DIC bit, the IDC bit, DminLine and IminLine:
- DIC: Instruction cache invalidation requirements for data to instruction coherence. This bit tells us if we need to, via software, invalidate the instruction cache to have data to instruction coherence. A 1 tells us this is handled in hardware and no software support is needed.
- IDC: Data cache clean requirements for instruction to data coherence. This bit tells us if we need to, via software, clean the data cache to have data to instruction coherence. A 1 tells us this is handled in hardware and no software support is needed. There are some exceptions to this which are listed clearly in the documentation, which we’ll ignore here.
- DminLine: The size of the smallest cache line of all data caches.
- IminLine: The size of the smallest cache line of all instruction caches.
As a side note, while working with this I constantly get confused about what the IDC and DIC bits refer to, since IDC starts with an I but is for cleaning the data cache, not the instruction cache. As a rule of thumb:
- IDC = Instruction to Data Coherence. About cleaning the data cache so the instruction cache can see it.
- DIC = Data to Instruction Coherence. About invalidating the instruction cache so it can see what’s inside the data cache.
Assembly Instructions for Cache Coherency
With information from the CTR_EL0 register we can figure out how to proceed to make sure that the data and instruction caches are coherent. Given that we want to know how this works, we assume for now that the hardware has no support for cache coherency, so we must use software. Let’s go over the steps:
- Cleaning the data cache (IDC): Cleaning the data cache here means cleaning it to the Point of Unification, i.e. pushing the contents of L1d to L2, where L2 is the Point of Unification. This is achieved with the following assembly, where dc cvau is executed for each data cache line in the memory we want to clean:

  dc cvau, <address>
  dsb ish

  After dc cvau has been executed for all affected cache lines, we execute dsb ish once. The dsb ish instruction is a “data synchronization barrier” (where “ish” stands for “inner shareable domain”, see ARM’s “Memory Attributes” for more information). It makes sure that by the time it is finished, all instruction and data cache operations are complete (and also branch predictor and TLB maintenance). This guarantees that all the dc cvau operations we just issued have finished before we continue.

- Invalidating the instruction cache (DIC): Invalidating the instruction cache means invalidating it to the Point of Unification, essentially throwing everything away up to the point where the data and instruction caches are unified. This is achieved with the following assembly, where ic ivau is executed for each instruction cache line in the memory we want to invalidate:

  ic ivau, <address>
  dsb ish

  Similarly, we execute dsb ish once to guarantee that the instruction cache invalidation(s) are complete.

- Instruction Synchronization Barrier (ISB): After the data cache is cleaned and the instruction cache is invalidated, we execute an isb instruction, an “instruction synchronization barrier”. The isb flushes the processor pipeline, so that all instructions following it are fetched from cache or memory.

  isb

  Regardless of whether software intervention is required for data cache cleaning or instruction cache invalidation, an isb must be executed to make sure that new instructions really are fetched from cache or memory.
With the cache cleaning and invalidation done and having executed the instruction synchronization barrier, all instructions are refetched from the L2 cache, which is now guaranteed to contain the up-to-date instructions that the JIT compiler has emitted.
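The three steps above can be sketched as one C function, modeled loosely on GCC’s libgcc implementation, assuming the full software path (DIC = 0 and IDC = 0). On non-AArch64 targets the body compiles to nothing, mirroring the fact that x86 needs no maintenance:

```c
#include <stdint.h>

// Sketch of manual D-to-I cache maintenance on AArch64, modeled on
// GCC's libgcc sync-cache.c. Assumes DIC = 0 and IDC = 0, i.e. both
// the data cache clean and the instruction cache invalidate are needed.
void sync_icache_range(void *start, void *end) {
#if defined(__aarch64__)
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    uintptr_t dline = 4u << ((ctr >> 16) & 0xF); // DminLine in bytes
    uintptr_t iline = 4u << (ctr & 0xF);         // IminLine in bytes
    uintptr_t s = (uintptr_t)start, e = (uintptr_t)end;

    // 1. Clean each data cache line to the Point of Unification.
    for (uintptr_t a = s & ~(dline - 1); a < e; a += dline)
        __asm__ volatile("dc cvau, %0" :: "r"(a) : "memory");
    __asm__ volatile("dsb ish" ::: "memory");

    // 2. Invalidate each instruction cache line to the Point of Unification.
    for (uintptr_t a = s & ~(iline - 1); a < e; a += iline)
        __asm__ volatile("ic ivau, %0" :: "r"(a) : "memory");
    __asm__ volatile("dsb ish" ::: "memory");

    // 3. Flush the pipeline so subsequent fetches see the new code.
    __asm__ volatile("isb" ::: "memory");
#else
    (void)start; (void)end; // x86: D and I caches are coherent in hardware.
#endif
}
```

Note how the loop strides are taken from CTR_EL0 at runtime, so the same binary does the right amount of maintenance on cores with different cache line sizes.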
Hardware Support
Let’s check the state of the DIC and IDC bits in the CTR_EL0 register on my Exynos 2200 processor with a small C program:
#include <stdio.h>
#include <stdint.h>
int main() {
uint64_t ctr_el0 = 0;
__asm__ ("mrs %0, ctr_el0" : "=r" (ctr_el0));
printf("DIC: %d\n", (int)((ctr_el0 >> 29) & 1));
printf("IDC: %d\n", (int)((ctr_el0 >> 28) & 1));
return 0;
}
$ ./ctr_el0_printer
DIC: 0
IDC: 1
From the output we can see that the Exynos 2200 processor handles instruction to data coherence in hardware (IDC = 1), so there is no need to execute dc cvau to clean the affected data cache lines. However, DIC = 0, so we still need ic ivau to invalidate the affected instruction cache lines, and as always (regardless of DIC/IDC status) we must end with an isb to flush the CPU pipeline. But even with hardware support, things can go wrong.
Hardware Bugs (Neoverse N1)
As with any software or hardware design, there is always a risk of bugs. One exceptionally interesting example is Neoverse N1 erratum 1542419.
The Neoverse N1 CPU is designed with hardware support for automatic instruction and data cache coherency, meaning both DIC and IDC should be set to 1. However, in rare cases the caches might not be coherent, which can lead to exactly the kind of SIGILL we observed earlier, even when the user has done everything right.
The mitigation for the erratum consists of two steps and requires cooperation from both software and firmware.
- “Hide” the fact that the CPU announces automatic cache coherency by setting the DIC bit to 0, signaling to user-space programs that software intervention is required for data to instruction coherence. Let’s look at how this is done in the Linux kernel. Access to the CTR_EL0 register is first trapped (arch/arm64/kernel/cpu_errata.c):

  /* ... or if the system is affected by an erratum */
  if (cap->capability == ARM64_WORKAROUND_1542419)
      enable_uct_trap = true;

  In the trap handler, the DIC bit is hidden by setting it to 0, and the minimum instruction cache line size is also increased to get fewer calls to ic ivau (arch/arm64/kernel/traps.c):

  if (cpus_have_final_cap(ARM64_WORKAROUND_1542419)) {
      /* Hide DIC so that we can trap the unnecessary maintenance...*/
      val &= ~BIT(CTR_EL0_DIC_SHIFT);

      /* ... and fake IminLine to reduce the number of traps. */
      val &= ~CTR_EL0_IminLine_MASK;
      val |= (PAGE_SHIFT - 2) & CTR_EL0_IminLine_MASK;
  }

- In firmware, trap the ic ivau instruction to perform a much more complete (and expensive) instruction cleanse. ARM has provided a firmware fix for this to TrustedFirmware: https://github.com/ARM-software/arm-trusted-firmware/commit/8094262. Instead of executing ic ivau, the trap handler performs a global TLB invalidation:

  func neoverse_n1_errata_ic_trap_handler
      cmp  x1, #NEOVERSE_N1_EC_IC_TRAP
      b.ne 1f
      tlbi vae3is, xzr
      dsb  sy

  Normally, instruction cache coherency is handled in hardware on Neoverse N1 (DIC set to 1). Due to the erratum, however, stale instruction fetch state can survive across an isb. An ic ivau is normally unnecessary on this CPU, and even if you try to use it, it does not address the issue. To solve this we have to resort to a much bigger hammer: tlbi vae3is, which forces the CPU to refetch instructions from a clean slate.
If you’re interested in knowing how a bug like this might affect performance in real-world applications, I recommend the blog post “It’s Never a Hardware Bug, Until it is.” by Charles Connell, which details how this bug affects Java performance negatively.
Conclusion and Summary
Coming back to bfjit: adding a call to __builtin___clear_cache() before installing a compiled method solves the SIGILL issue we observed earlier in the post. The instructions that were written to memory are now visible to the CPU when it fetches them, because the data and instruction caches have been made coherent first.
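The ordering is the important part: cache maintenance must happen before the compiled method is published. A minimal sketch of that publication path (the names and types here are illustrative, not bfjit’s actual code):

```c
#include <stdatomic.h>
#include <stddef.h>

// Hypothetical method handle and publication slot, for illustration only.
typedef struct { void *entry; } method;
static _Atomic(method *) slot;

static void publish(method *m, char *code, size_t len) {
    // The native instructions in [code, code + len) were just written,
    // so they may still only be visible through the data cache.
    // 1. Make them visible to instruction fetch (a no-op on x86).
    __builtin___clear_cache(code, code + len);
    // 2. Only then publish the method so the interpreter can jump to it.
    atomic_store_explicit(&slot, m, memory_order_release);
}
```

Swapping these two steps reintroduces the race: the interpreter could see the pointer and jump to code whose instruction cache lines are still stale.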
We’ve learned that on AArch64, JIT compilers must explicitly synchronize instruction and data caches. This requires a bit of work if you want to do it step-by-step yourself, but using GCC’s __builtin___clear_cache() is a portable and easy alternative.
This blog post opened up a real rabbit hole for me; investigating and learning more about caches and cache coherency has been really fun. Personally, caches have always been somewhat abstract, and getting hands-on experience in how caches work has been very valuable. It’s also exciting to learn more about how my own hardware works, especially since I can test things out and see differences across platforms and architectures.
I must admit that cache coherency is a tricky subject, and one that is not immediately obvious as the culprit when encountering an illegal instruction signal. Now that we’re at the end, I hope that you’ve learned something new or drawn a connection that you wouldn’t have otherwise. Thank you for reading and I hope you found this deep dive as interesting as I did!