How MADV_COLLAPSE chooses a NUMA node

Introduction

MADV_COLLAPSE is a madvise(2) advice (added in Linux 6.1) that requests a best-effort, synchronous collapse of native pages in a range into Transparent Huge Pages (THPs). As the man page states:

Perform a best-effort synchronous collapse of the native pages mapped by the memory range into Transparent Huge Pages (THPs). MADV_COLLAPSE operates on the current state of memory of the calling process and makes no persistent changes or guarantees on how pages will be mapped, constructed, or faulted in the future.

In other words, MADV_COLLAPSE gives applications explicit control over when THPs are formed, rather than leaving it to background kernel mechanisms like khugepaged. This ensures memory can be backed by THPs at predictable points in time, which is especially valuable for performance-sensitive workloads where reducing TLB misses and improving memory locality matter.

The man page also notes, somewhat briefly, how NUMA placement affects collapse:

When the system has multiple NUMA nodes, the hugepage will be allocated from the node providing the most native pages.

In this post, I’ll explore how MADV_COLLAPSE behaves in practice, highlighting its interactions with NUMA.

Machine

NUMA topology of the host:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node distances:
node     0    1
   0:   10   20
   1:   20   10

OS: Oracle Linux 9, with kernel version:

$ uname -r
6.12.0-102.36.5.2.el9uek.x86_64

I used libnuma to set NUMA memory policies. Install it on Oracle Linux with:

sudo dnf install numactl-devel.x86_64

To avoid interference from background THP, I set THP mode to “never” for both anonymous and shared memory so MADV_COLLAPSE is the only path to THPs:

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled

Requirements and Caveats with MADV_COLLAPSE

From madvise(2):

… for every eligible hugepage-aligned/sized region to be collapsed, at least one page must currently be backed by physical memory

If no page is backed by physical memory when calling madvise(), you get EINVAL, as described in mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse:

EINVAL: Other error: No PMD found, subpage doesn’t have Present bit set, “Special” page not backed by struct page, VMA incorrectly sized, address not page-aligned, …

Additionally, MADV_COLLAPSE ignores the sysfs settings stored in /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/shmem_enabled. For example, even if the mode is set to never in /sys/kernel/mm/transparent_hugepage/enabled, MADV_COLLAPSE will still collapse small pages into huge page(s). There are several notes describing this behavior in the kernel documentation for Transparent Hugepage Support.

MADV_COLLAPSE and NUMA

Now to the interesting part: how MADV_COLLAPSE interacts with NUMA. I use a small test program to exercise the cases below (see madv_collapse_tests.cpp).

As noted earlier, at least one small page must be backed by physical memory when calling madvise(MADV_COLLAPSE); otherwise the call fails with errno set to EINVAL. A page becomes backed either by touching it (typically via a write, sometimes a read) or, for file-backed mappings, via fallocate().

Working with a single VMA and a single 2 MB mmapped range, I observe the following behavior. N denotes the NUMA node on which the base page was backed (for example, N0 = NUMA node 0):

Case  Backed small pages (4 KB) in 2 MB chunk  Observed collapse node  Notes
1     [N0]                                     N0                      Single backed page; majority on N0
2     [N1]                                     N1                      Single backed page; majority on N1
3     [N0, N1]                                 N0                      Tie; observed collapse on N0
4     [N1, N0]                                 N0                      Tie; observed collapse on N0
5     [N0, N0, N1]                             N0                      Majority on N0
6     [N0, N1, N1]                             N1                      Majority on N1

Any range larger than 2 MB but still a multiple of 2 MB (e.g. 4 MB, 6 MB, and so on) behaves like the 2 MB mapping above, just chunk by chunk: at least one page must be backed in each 2 MB-aligned chunk, and each chunk collapses onto the NUMA node that is most represented within it.

With multiple VMAs instead of a single one, I observe the same behavior as with a single VMA, provided each VMA is at least 2 MB.

Not necessarily NUMA-related, but if a VMA is not a multiple of 2 MB, say 3 MB, the remaining 1 MB won't be collapsed, yet the call to madvise(MADV_COLLAPSE) still succeeds.

An interesting observation: when NUMA nodes are equally represented in a chunk, collapse consistently lands on N0. The source of the kernel I'm running (v6.12) shows why: when multiple nodes tie for the maximum coverage, the collapsed huge page is allocated on the lowest-numbered of those nodes. That matches the table above, where a tie between N0 and N1 collapses on N0.

Kernel Source Code

The kernel defines the collapse_control struct, which stores state about a collapse operation; note the node_load[] array.

struct collapse_control {
	bool is_khugepaged;

	/* Num pages scanned per node */
	u32 node_load[MAX_NUMNODES];

	/* nodemask for allocation fallback */
	nodemask_t alloc_nmask;
};

When walking VMAs, node_load[] is updated (anonymous mappings, file mappings).

cc->node_load[node]++;

Finally, hpage_collapse_find_target_node() is called to determine what NUMA node the collapsed huge page should end up on.

static int hpage_collapse_find_target_node(struct collapse_control *cc)
{
	...
	/* find first node with max normal pages hit */
	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (cc->node_load[nid] > max_value) {
			max_value = cc->node_load[nid];
			target_node = nid;
		}
	...

	return target_node;
}

Summary

MADV_COLLAPSE lets applications synchronously form THPs and operates independently of THP sysfs settings. On NUMA systems, collapse is evaluated per 2 MB-aligned chunk and requires at least one page in the chunk to be backed by physical memory. The resulting hugepage is allocated on the NUMA node with the majority of base pages in that chunk. Larger mappings collapse chunk-by-chunk, and multiple VMAs (>=2 MB and hugepage-aligned) behave the same way.