Huge Pages on Linux
When I first found out about huge pages I was a bit confused. Why do we need larger pages? How do I configure this? Is it always on? I had many questions. The first thing you should know is that the concept of larger pages has many names, although they all achieve the same thing: increase the page size, usually to something larger than 4096 bytes (4 KB), which is the default on many systems.
To summarise, the benefit of using huge pages is that there is less pressure on the Translation Lookaside Buffer (TLB), which means better performance in many programs and situations. This optimization matters more and more as physical memory becomes bigger and more readily available. You can read more about what huge pages are here.
The concept of huge pages exists on most operating systems; on Linux it is called huge pages, which is what I will call it from here on out.
Note that most operations mentioned from this point on require root privileges.
Huge Pages (Linux)
Kernel Support
For the Linux kernel to support huge pages it needs to be built with the CONFIG_HUGETLBFS (present under “File systems”) and CONFIG_HUGETLB_PAGE (selected automatically when CONFIG_HUGETLBFS is selected) configuration options.
Persistent Huge Pages (at boot)
The most reliable method of allocating huge pages is to do it at boot, by specifying the hugepages=N parameter on the kernel command line, where N is the number of requested huge pages. Boot is the most appropriate time to allocate huge pages since memory has not yet become fragmented. Some platforms also support multiple huge page sizes, which can be configured with the hugepagesz=<size> parameter, where size is in bytes and can optionally be suffixed with a scale [kKmMgG]. If you want to configure a specific size, it must come before the corresponding hugepages boot parameter. An example config is:
hugepagesz=1G hugepages=16
or you can just specify the number of huge pages, which will be of the default huge page size of the system:
hugepages=16
It is also possible to allocate persistent huge pages of multiple sizes, and to set the default huge page size, by combining these parameters (source):
default_hugepagesz=1G hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=1024
Dynamic Allocation (at runtime)
The number of huge pages can also be configured when the system is running, although this will not be as reliable since the system needs to find contiguous memory where the huge pages can be allocated. Configuration can be done in a couple of different ways. Arguably the best way is to configure the number of huge pages of a specific size, say 2MB, by running the following command:
$ echo 9216 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
If you instead want to configure the number of 1GB huge pages you would run the following (as stated previously, allocating huge pages requires contiguous memory to be available, which it may not be):
$ echo 18 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
It is also possible to configure the number of huge pages for a specific NUMA node by running:
$ echo 9 | sudo tee /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
Typically, the default huge page size on most systems is 2MB, so when you modify /proc/sys/vm/nr_hugepages, it affects the value in /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages. Huge pages of a different size (e.g., 1GB) must be managed separately under their corresponding directory in /sys/kernel/mm/hugepages.
Note on reliably allocating pages
Again, if you need a reliable method for allocating huge pages, configure them during boot to ensure that enough huge pages are reserved for your program. This approach is more cumbersome than dynamically allocating huge pages at runtime using the */nr_hugepages files, but dynamic allocation may fail, take time, or succeed only for the pages to end up used by other programs.
Overcommitting Number of Huge Pages
Overcommitting allows the pool of huge pages to grow beyond the configured size when applications request more huge pages than what is configured in /proc/sys/vm/nr_hugepages. This can be enabled by writing a non-zero value into /proc/sys/vm/nr_overcommit_hugepages. As the overcommitted pages become unused over time, they are freed back to the kernel’s pool of normal pages. For example:
$ echo 10 | sudo tee /proc/sys/vm/nr_overcommit_hugepages
Interesting functionality: if the number of persistent huge pages is shrunk (by lowering nr_hugepages for some configuration) below the number of huge pages currently in use, the difference is converted to overcommitted (surplus) pages. Note that this might exceed the overcommit value.
Ubuntu Configuration
Edit /etc/default/grub
and add/change the following line:
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=2M hugepagesz=2M hugepages=10"
and then run:
$ sudo update-grub
See Huge Page Usage
You can check for current huge page usage in the /proc/meminfo
file:
$ cat /proc/meminfo | grep -i hugepages
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 1024
HugePages_Free: 1024
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize:       2048 kB
Hugepagesize
shows the default huge page size configuration.
Interesting side notes:
- Linux >= 6.9 parallelized HugeTLB setup (source).
- A Reddit user stumbled upon not being able to allocate the desired number of huge pages during runtime (source).
Transparent Huge Pages (THP)
THP is one way to achieve larger page sizes on Linux and is commonly enabled by default.
THP allows the kernel to automatically promote regular memory pages into huge pages. By default THP is in the madvise
mode, which means it’s necessary to call madvise(...MADV_HUGEPAGE)
to specifically enable THP for a range of memory (since Linux 2.6.38). madvise only operates on whole pages, therefore the address passed to it must be page-aligned.
See the current mode by looking at the /sys/kernel/mm/transparent_hugepage/enabled
:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ # The current mode is madvise
and change the value by running:
$ echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
If you want to persistently turn off THP, you can edit your kernel boot commands and add the flag transparent_hugepage=never
(check how GRUB is configured above for Ubuntu).
Linux THP support does not guarantee that huge pages will be allocated. If you use posix_memalign
to allocate memory aligned to the huge page size it’s more likely that memory will be backed by huge pages.
THP Modes
When THP is enabled system-wide, the kernel tries to assign huge pages to processes when it is possible to allocate huge pages and the process is using a large enough contiguous virtual memory area.
Modes:
- System-wide: THP are used automatically by any application.
$ echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
- Per-process (default mode on Linux): THP are only used if the application calls madvise() with the flag MADV_HUGEPAGE to mark that certain memory regions should be backed by huge pages.
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
- Never use THP: THP are never used.
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
Defrag (and modes)
As previously mentioned, there must be enough contiguous physical memory available for huge pages to be created for THP. If there isn’t, the kernel will try to defrag memory to create huge pages. Defrag can be configured in a few different modes by editing the /sys/kernel/mm/transparent_hugepage/defrag
file. Modes:
- Always: an application requesting THP will stall on allocation failure while the kernel reclaims and compacts memory to create a huge page as soon as possible.
$ echo always | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
- Defer: the application does not stall; the kernel instead wakes background threads (kswapd and khugepaged) to reclaim memory and collapse the region into huge pages later.
$ echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
- Defer+madvise: works like the “always” mode, but only for memory regions that have used madvise(MADV_HUGEPAGE); regions not marked by madvise behave as in “defer”.
$ echo defer+madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
- Never: the kernel never defrags memory to create huge pages.
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Read more here.
Example in C
Huge Pages
The following example was made on a virtual machine running Ubuntu 24.04 (noble), running Linux kernel 6.8.0-41-generic with 8 GiB of memory available.
I have configured /etc/default/grub
with GRUB_CMDLINE_LINUX_DEFAULT="hugepages=1024"
, which will give me 1024 pages that are 2 MiB each, totalling 2 GiB of huge pages. I can verify that this works by checking /proc/meminfo
, which shows that my persistent huge pages have been allocated:
$ cat /proc/meminfo | grep -i hugepages_total
HugePages_Total: 1024
The simplest way to allocate memory backed by huge pages is to use mmap(2)
. Here is a small program that checks that mmap works by allocating a chunk of anonymous memory and looping forever:
#include <stdio.h>
#include <stdlib.h>
#include <sys/errno.h>
#include <sys/mman.h>
int main() {
    size_t page_size = 2 * 1024 * 1024; // 2MiB
    size_t num_pages = 1;
    size_t allocation = num_pages * page_size;
    void *mem = mmap(NULL, allocation, PROT_READ | PROT_WRITE,
                     MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap failed");
        exit(1);
    }
    printf("mmap address: %p\n", mem);
    while (1);
    return 0;
}
Sure enough, running this we get an address from mmap:
$ gcc hugepage.c
$ ./a.out
mmap address: 0x7e1f0ba00000
In another terminal we can check that a huge page has been reserved:
$ cat /proc/meminfo | grep -i hugepages_
HugePages_Total: 1024
HugePages_Free: 1024
HugePages_Rsvd: 1
HugePages_Surp: 0
Touch (allocate) pages
The reason that the page has only been reserved in the example above, and not allocated, is that we have not touched the memory we mapped with mmap. This behavior is called demand paging: the virtual memory mappings are set up immediately, but physical pages are only allocated to back that memory when it is first accessed. To back the memory we just mapped, we need to touch it in some way, either by reading or writing to it. Many systems trigger page allocation only on a write, which is what we will do in the following example; on my VM, however, reading works just as well.
Let’s modify the example program to also touch the pages by writing to them:
#include <stdio.h>
#include <stdlib.h>
#include <sys/errno.h>
#include <sys/mman.h>
void touch_pages(void *start, size_t page_size, size_t num_pages) {
    for (size_t i = 0; i < num_pages; i++) {
        // Cast to char *: pointer arithmetic on void * is a GNU extension.
        char *addr = (char *)start + i * page_size;
        *addr = 0; // writing a single byte faults in the whole huge page
    }
}

int main() {
    size_t page_size = 2 * 1024 * 1024; // 2MiB
    size_t num_pages = 1;
    size_t allocation = num_pages * page_size;
    void *mem = mmap(NULL, allocation, PROT_READ | PROT_WRITE,
                     MAP_HUGETLB | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap failed");
        exit(1);
    }
    printf("mmap address: %p\n", mem);
    touch_pages(mem, page_size, num_pages);
    while (1);
    return 0;
}
If we compile and run the program we can see that the page has been backed/allocated, as the number of free pages has decreased by one:
$ cat /proc/meminfo | grep -i hugepages_
HugePages_Total: 1024
HugePages_Free: 1023
HugePages_Rsvd: 0
HugePages_Surp: 0
Fail immediately if not enough huge pages available
If you attempt to allocate more pages than are available in the huge page pool, mmap will fail immediately. Running the example program with the following configuration produces this output:
size_t page_size = 2 * 1024 * 1024; // 2MiB
size_t num_pages = 1025;
size_t allocation = num_pages * page_size;
$ ./a.out
mmap failed: Cannot allocate memory
Thank you for reading, I hope this gives you some idea of how to use huge pages on Linux :)
See also
[1] Description of the user-friendly tool hugeadm for configuring huge pages, including configuration for commonly used programs.
[2] Technical details on cache misses and TLB loads when using different huge page configurations and sizes.
[3] Huge pages from the perspective of the Java Virtual Machine (JVM).