Inspecting OpenJDK performance on Linux using perf
Analyzing performance can be challenging, especially when diagnosing regressions. Finding the right tools to investigate performance issues is often time-consuming and complex. You may have heard of perf, a powerful performance-measurement tool that ships with the Linux kernel.
This post will guide you through using perf to analyze performance on Linux, serving as a practical “quick start” guide for those new to profiling Java applications with perf. It’s not meant to be a comprehensive tutorial, as many of those already exist in the extensive documentation of perf and other excellent sources. However, the examples provided may be all you need or, at the very least, give you an idea of where to look for deeper insights.
Profiling
When profiling, you typically choose between two approaches:
- Profiling the entire run of an application by executing a command.
- Profiling an already running process by specifying its PID.
In this post, I will focus on the first approach - profiling an entire run by providing a command for perf to execute. However, if you need to profile a running process, you can do so with:
perf record -p PID
Let’s walk through a practical example of using `perf record` to profile a Java application.
```
perf record -F 99 -k 1 -e cycles -g -o baseline.data java \
    -agentpath:/usr/lib64/libperf-jvmti.so \
    -XX:+PreserveFramePointer \
    HelloWorld.java

perf inject -i baseline.data --jit -o baseline.data.jitted
```
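Note that a trivial HelloWorld exits too quickly to yield many samples at 99 Hz. If you want the profiler to have something to see, a small CPU-bound class can stand in for HelloWorld.java. The following `Workload.java` is a made-up example for illustration, not part of perf or the JDK:

```java
// Workload.java - a hypothetical CPU-bound stand-in for HelloWorld.java,
// kept busy long enough for perf to collect a useful number of samples.
public class Workload {
    // Sums the first n odd numbers; mathematically this equals n * n.
    static long sumOdds(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            sum += 2 * i + 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        // Repeat the work so the run lasts long enough to sample.
        for (int round = 0; round < 200; round++) {
            total += sumOdds(1_000_000);
        }
        System.out.println("total = " + total); // prints "total = 200000000000000"
    }
}
```

Launch it with the same `perf record` line as above, substituting `Workload.java` for `HelloWorld.java`.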
The `perf record` command samples the program at a set interval and saves the data to a file called `perf.data`. Some of the parameters you likely want to configure are:
- Sampling frequency (`-F 99`) - This sets the sampling frequency to 99 samples per second. The default frequency is 4000 Hz, but the kernel may throttle it to 1000 Hz (see the warning message below). Higher sampling frequencies (e.g., 4000 Hz) provide more detailed data but introduce higher overhead, making them more suitable for pinpointing small, frequent events. Lower frequencies (e.g., 99 Hz) reduce overhead, making them ideal for minimizing impact on the system, but they might miss finer performance details. The choice depends on a trade-off between precision and system load.

  ```
  warning: Maximum frequency rate (1,000 Hz) exceeded, throttling from 4,000 Hz to 1,000 Hz.
  The limit can be raised via /proc/sys/kernel/perf_event_max_sample_rate.
  The kernel will lower it when perf's interrupts take too long.
  Use --strict-freq to disable this throttling, refusing to record.
  ```

- Clock type (`-k 1`) - This option selects the clock used for sampling, mapped from a list of available clocks. In most cases, you’ll want to use either `CLOCK_MONOTONIC` (ID 1) or `CLOCK_MONOTONIC_RAW` (ID 4). For a complete list of clock IDs, refer to `man clock_gettime`.
- Event type (`-e cycles`) - Specifies the type of event to measure, in this case the number of CPU cycles. For a complete list of event types, refer to `perf list`.
- Call graph info (`-g` or `--call-graph <type>`) - Includes call graph information in the recording and optionally specifies the type of data to be used. By default, frame pointers are used, so `-g` is equivalent to `--call-graph fp`.
- Changing the default output file (`-o FILE`) - By default, `perf record` saves the recording to `perf.data`. Changing the output file can be useful when recording multiple runs, such as when comparing different versions of a program.
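If you run into the throttling warning above, you can inspect the kernel's ceiling on the sampling frequency and, as root, raise it. A minimal sketch, assuming a Linux system that exposes the usual sysctl:

```shell
# Read the current ceiling on the sampling frequency (in Hz);
# falls back to a message on systems without this sysctl.
sysctl -n kernel.perf_event_max_sample_rate 2>/dev/null \
  || echo "sysctl not available"

# As root, raise the ceiling (uncomment to apply):
# sysctl -w kernel.perf_event_max_sample_rate=4000
```

Keep in mind the kernel may lower the limit again on its own if perf's interrupts take too long.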
The `perf inject` command adds symbol data for Just-In-Time (JIT) compiled code. For a more detailed explanation, see the “Fixing frame pointers” section below.
When profiling a Java application, data is collected from at least three different sources: the JVM (native C/C++ code), Java code (typically JIT-compiled), and some OS-level code (native C).
The next two sections on fixing symbols and frame pointers are inspired by a blog post by Brendan Gregg and Martin Spier.
Fixing symbols
The code from the JVM, which I will assume is the OpenJDK from here on, is bundled with debug symbols, allowing perf to map symbols to the executed code. However, this is not the case for Just-In-Time (JIT) compiled Java code, which is generated dynamically at runtime. Without these symbols, perf cannot associate performance data with specific methods or functions in the JIT-compiled code, leading to missing information.
To address this, perf comes bundled with a shared object that can be loaded as a JVMTI agent to collect symbols, enabling perf to inspect the JIT-compiled code. Typically, the agent is located at `/usr/lib64/libperf-jvmti.so`.
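The agent's exact location varies by distribution and perf package; the paths below are common guesses, not guarantees. A quick way to hunt for it:

```shell
# Search common library directories for the perf JVMTI agent;
# prints the path if found, nothing otherwise.
find /usr/lib64 /usr/lib -name 'libperf-jvmti.so' 2>/dev/null || true
```

If the search comes up empty, check whether your distribution packages the agent separately (for example, in a perf or linux-tools package).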
Fixing frame pointers
OpenJDK is compiled with `-fno-omit-frame-pointer` by default, which ensures that stack frames can be unwound efficiently and accurately, without relying solely on debug symbols. However, JIT-compiled Java code does not retain this information by default, similar to how C/C++ code compiled without `-fno-omit-frame-pointer` behaves. To retain frame pointers in JIT-compiled Java code, run the application with `-XX:+PreserveFramePointer`.
Note that adding `-XX:+PreserveFramePointer` may introduce a small overhead, typically less than 1% (source). If this overhead is a concern, async-profiler might be worth a look.
Inspecting collected data
Using `perf report` will get you very far and might provide all the information you need or want. By default, `perf report` examines the default `perf.data` file, but if you want to analyze a different file, you can use `perf report -i FILE`. If you’re collecting data from multiple runs and want to compare them, consider using `perf diff`.
Another effective way to present the collected data is by visualizing it with a flame graph.
Flame Graphs
To generate flame graphs, we will use Brendan Gregg’s FlameGraph tool, available on GitHub.
```
# Normal use case
perf script -i baseline.data.jitted \
    | FlameGraph/stackcollapse-perf.pl --all \
    | FlameGraph/flamegraph.pl --cp --colors java > baseline.svg

# Filter frames: only include stacks that match "mmap"
perf script -i baseline.data.jitted | FlameGraph/stackcollapse-perf.pl --all > baseline.folded
grep "mmap" baseline.folded | FlameGraph/flamegraph.pl --cp --colors java > baseline.svg
```
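To see why plain `grep` works for filtering, it helps to know the intermediate folded format: one stack per line, frames separated by semicolons, followed by a sample count. The stack below is a synthetic, made-up example, not real stackcollapse-perf.pl output:

```shell
# A synthetic folded stack line: semicolon-separated frames, then a count.
echo 'java;start_thread;thread_native_entry;os::commit_memory;mmap64 42' > demo.folded

# Filtering with grep keeps whole stacks whose text contains "mmap":
grep "mmap" demo.folded

rm -f demo.folded
```

Because each line is a complete stack, grep-based filtering keeps or drops entire stacks, never individual frames.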
The output is an interactive SVG flame graph.
Brendan Gregg has an excellent blog post covering how to generate, interpret, and customize flame graphs; I recommend checking it out for more detailed guidance.
Great resources
You might want to check out async-profiler.
Everything by Brendan Gregg (Wikipedia, Website), contributor of -XX:+PreserveFramePointer (together with Zoltán Májo), creator of the FlameGraph tool/visualizer, and author of many insightful blog posts and resources on his website. Below are a few I found particularly informative and helpful:
- https://www.brendangregg.com/perf.html
- https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#perf
- https://www.brendangregg.com/Slides/KernelRecipes_Perf_Events.pdf (awesome presentation)
- https://www.brendangregg.com/Slides/JavaOne2016_JavaFlameGraphs.pdf
Additionally, fjermic over at OpenJ9 wrote a technical, example-based post: https://blog.openj9.org/2019/07/18/inspecting-openj9-performance-with-perf-on-linux-jit-compiled-methods/.