| .. _admin_guide_transhuge: | 
 |  | 
 | ============================ | 
 | Transparent Hugepage Support | 
 | ============================ | 
 |  | 
 | Objective | 
 | ========= | 
 |  | 
 | Performance critical computing applications dealing with large memory | 
 | working sets are already running on top of libhugetlbfs and in turn | 
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
using huge pages for the backing of virtual memory. It supports the
automatic promotion and demotion of page sizes and does not have the
shortcomings of hugetlbfs.
 |  | 
Currently THP only works for anonymous memory mappings and tmpfs/shmem,
but in the future it may be extended to other filesystems.
 |  | 
 | .. note:: | 
   In the examples below we presume that the basic page size is 4K and
 |    the huge page size is 2M, although the actual numbers may vary | 
 |    depending on the CPU architecture. | 
 |  | 
Applications run faster because of two factors. The first factor is
almost completely irrelevant and not of significant interest, because
it also has the downside of requiring larger clear-page and copy-page
operations in page faults, which is a potentially negative effect: it
consists of taking a single page fault for each 2M virtual region
touched by userland (thus reducing the enter/exit kernel frequency by
a factor of 512). This only matters the first time the memory is
accessed in the lifetime of a memory mapping. The second, long lasting
and much more important factor affects all subsequent accesses to the
memory for the whole runtime of the application. This second factor
consists of two components:
 |  | 
 | 1) the TLB miss will run faster (especially with virtualization using | 
 |    nested pagetables but almost always also on bare metal without | 
 |    virtualization) | 
 |  | 
2) a single TLB entry will be mapping a much larger amount of virtual
   memory, in turn reducing the number of TLB misses. With
   virtualization and nested pagetables the TLB entries can map the
   larger size only if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages, just because the TLB miss is going to
   run faster.
 |  | 
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.
 |  | 
THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and via the madvise(2) and prctl(2) system calls.
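
For example, a task that wants to opt out of THP for its whole address
space can do so with prctl(2). A minimal sketch, assuming a kernel
recent enough to support ``PR_SET_THP_DISABLE`` (Linux 3.15 or later);
error handling is abbreviated::

	#include <stdio.h>
	#include <sys/prctl.h>

	int main(void)
	{
		/* Disable THP for all current and future mappings of this task. */
		if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
			perror("prctl");

		/* Read the flag back: 1 means THP is disabled for this task. */
		printf("THP disabled: %d\n",
		       prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
		return 0;
	}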
 |  | 
Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require a reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows
paging and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.
 |  | 
Applications however can be further optimized to take advantage of
this feature, like for example they've been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by no means mandatory and khugepaged can already take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.
 |  | 
In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it, in which case a 2M page
might be allocated instead of a 4k page for no good reason. This is
why it's possible to disable hugepages system-wide and to only have
them inside MADV_HUGEPAGE madvise regions.
 |  | 
Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting precious bytes of memory and to make
sure the feature can only improve performance.
 |  | 
Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
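
As a sketch, such a region can be set up as follows; the 64M region
size and the anonymous private mapping are illustrative assumptions::

	#include <stddef.h>
	#include <sys/mman.h>

	#define REGION_SIZE (64UL << 20)	/* assumed working set: 64M */

	void *alloc_thp_region(void)
	{
		void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return NULL;
		/*
		 * Hint that this range should be backed by huge pages. On
		 * kernels without THP the hint simply fails and the mapping
		 * stays usable with regular pages.
		 */
		madvise(p, REGION_SIZE, MADV_HUGEPAGE);
		return p;
	}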
 |  | 
 | .. _thp_sysfs: | 
 |  | 
 | sysfs | 
 | ===== | 
 |  | 
 | Global THP controls | 
 | ------------------- | 
 |  | 
 | Transparent Hugepage Support for anonymous memory can be entirely disabled | 
 | (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE | 
 | regions (to avoid the risk of consuming more memory resources) or enabled | 
 | system wide. This can be achieved with one of:: | 
 |  | 
 | 	echo always >/sys/kernel/mm/transparent_hugepage/enabled | 
 | 	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled | 
 | 	echo never >/sys/kernel/mm/transparent_hugepage/enabled | 
 |  | 
It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages, in case they're not immediately free, to madvise
regions, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available. Clearly if
we spend CPU time to defrag memory, we would expect to gain even more
by the fact we use hugepages later instead of regular pages. This
isn't always guaranteed, but it may be more likely in case the
allocation is for a MADV_HUGEPAGE region.
 |  | 
 | :: | 
 |  | 
 | 	echo always >/sys/kernel/mm/transparent_hugepage/defrag | 
 | 	echo defer >/sys/kernel/mm/transparent_hugepage/defrag | 
 | 	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag | 
 | 	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag | 
 | 	echo never >/sys/kernel/mm/transparent_hugepage/defrag | 
 |  | 
 | always | 
 | 	means that an application requesting THP will stall on | 
 | 	allocation failure and directly reclaim pages and compact | 
 | 	memory in an effort to allocate a THP immediately. This may be | 
 | 	desirable for virtual machines that benefit heavily from THP | 
 | 	use and are willing to delay the VM start to utilise them. | 
 |  | 
 | defer | 
 | 	means that an application will wake kswapd in the background | 
 | 	to reclaim pages and wake kcompactd to compact memory so that | 
 | 	THP is available in the near future. It's the responsibility | 
 | 	of khugepaged to then install the THP pages later. | 
 |  | 
 | defer+madvise | 
 | 	will enter direct reclaim and compaction like ``always``, but | 
 | 	only for regions that have used madvise(MADV_HUGEPAGE); all | 
 | 	other regions will wake kswapd in the background to reclaim | 
 | 	pages and wake kcompactd to compact memory so that THP is | 
 | 	available in the near future. | 
 |  | 
 | madvise | 
 | 	will enter direct reclaim like ``always`` but only for regions | 
	that have used madvise(MADV_HUGEPAGE). This is the default
 | 	behaviour. | 
 |  | 
 | never | 
 | 	should be self-explanatory. | 
 |  | 
By default the kernel tries to use the huge zero page on read page
faults to anonymous mappings. It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1::
 |  | 
 | 	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page | 
 | 	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page | 
 |  | 
 | Some userspace (such as a test program, or an optimized memory allocation | 
 | library) may want to know the size (in bytes) of a transparent hugepage:: | 
 |  | 
 | 	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size | 
 |  | 
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".
 |  | 
 | Khugepaged controls | 
 | ------------------- | 
 |  | 
khugepaged usually runs at low frequency so while one may not want to
 | invoke defrag algorithms synchronously during the page faults, it | 
 | should be worth invoking defrag at least in khugepaged. However it's | 
 | also possible to disable defrag in khugepaged by writing 0 or enable | 
 | defrag in khugepaged by writing 1:: | 
 |  | 
 | 	echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag | 
 | 	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag | 
 |  | 
 | You can also control how many pages khugepaged should scan at each | 
 | pass:: | 
 |  | 
 | 	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan | 
 |  | 
 | and how many milliseconds to wait in khugepaged between each pass (you | 
 | can set this to 0 to run khugepaged at 100% utilization of one core):: | 
 |  | 
 | 	/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs | 
 |  | 
and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure, to throttle the next allocation attempt::
 |  | 
 | 	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs | 
 |  | 
 | The khugepaged progress can be seen in the number of pages collapsed (note | 
 | that this counter may not be an exact count of the number of pages | 
 | collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping | 
 | being replaced by a PMD mapping, or (2) All 4K physical pages replaced by | 
 | one 2M hugepage. Each may happen independently, or together, depending on | 
 | the type of memory and the failures that occur. As such, this value should | 
 | be interpreted roughly as a sign of progress, and counters in /proc/vmstat | 
 | consulted for more accurate accounting):: | 
 |  | 
 | 	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed | 
 |  | 
 | for each pass:: | 
 |  | 
 | 	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans | 
 |  | 
 | ``max_ptes_none`` specifies how many extra small pages (that are | 
 | not already mapped) can be allocated when collapsing a group | 
 | of small pages into one large page:: | 
 |  | 
 | 	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none | 
 |  | 
A higher value leads to programs using additional memory, since more
not-yet-mapped small pages can be instantiated by a collapse. A lower
value means fewer regions qualify for collapse and thus less of the
THP performance gain is obtained. The CPU time cost of this setting
is negligible and can be ignored.
 |  | 
 | ``max_ptes_swap`` specifies how many pages can be brought in from | 
 | swap when collapsing a group of pages into a transparent huge page:: | 
 |  | 
 | 	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap | 
 |  | 
A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.
 |  | 
``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. Exceeding this number blocks the collapse::
 |  | 
 | 	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared | 
 |  | 
 | A higher value may increase memory footprint for some workloads. | 
 |  | 
 | Boot parameter | 
 | ============== | 
 |  | 
 | You can change the sysfs boot time defaults of Transparent Hugepage | 
 | Support by passing the parameter ``transparent_hugepage=always`` or | 
 | ``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` | 
 | to the kernel command line. | 
 |  | 
 | Hugepages in tmpfs/shmem | 
 | ======================== | 
 |  | 
You can control hugepage allocation policy in tmpfs with the mount
option ``huge=``. It can have the following values:
 |  | 
 | always | 
 |     Attempt to allocate huge pages every time we need a new page; | 
 |  | 
 | never | 
 |     Do not allocate huge pages; | 
 |  | 
 | within_size | 
 |     Only allocate huge page if it will be fully within i_size. | 
 |     Also respect fadvise()/madvise() hints; | 
 |  | 
 | advise | 
 |     Only allocate huge pages if requested with fadvise()/madvise(); | 
 |  | 
 | The default policy is ``never``. | 
 |  | 
 | ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting | 
 | ``huge=never`` will not attempt to break up huge pages at all, just stop more | 
 | from being allocated. | 
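
The same policy can also be selected programmatically with mount(2); a
minimal sketch, assuming ``/mnt/huge`` is an existing mount point::

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/*
		 * Mount a tmpfs instance that allocates huge pages only
		 * when they fit fully within i_size.
		 */
		if (mount("tmpfs", "/mnt/huge", "tmpfs", 0,
			  "huge=within_size")) {
			perror("mount");
			return 1;
		}
		return 0;
	}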
 |  | 
There's also a sysfs knob to control hugepage allocation policy for the
internal shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled.
This mount is used for SysV SHM, memfds, shared anonymous mmaps (of
/dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, and Ashmem.
 |  | 
 | In addition to policies listed above, shmem_enabled allows two further | 
 | values: | 
 |  | 
 | deny | 
 |     For use in emergencies, to force the huge option off from | 
 |     all mounts; | 
 | force | 
 |     Force the huge option on for all - very useful for testing; | 
 |  | 
 | Need of application restart | 
 | =========================== | 
 |  | 
 | The transparent_hugepage/enabled values and tmpfs mount option only affect | 
 | future behavior. So to make them effective you need to restart any | 
 | application that could have been using hugepages. This also applies to the | 
 | regions registered in khugepaged. | 
 |  | 
 | Monitoring usage | 
 | ================ | 
 |  | 
 | The number of anonymous transparent huge pages currently used by the | 
 | system is available by reading the AnonHugePages field in ``/proc/meminfo``. | 
 | To identify what applications are using anonymous transparent huge pages, | 
 | it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields | 
 | for each mapping. | 
 |  | 
The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.
 |  | 
 | Note that reading the smaps file is expensive and reading it | 
 | frequently will incur overhead. | 
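
A minimal sketch of that accounting, summing the AnonHugePages fields
(reported in kB) of a single process; the helper name and the
caller-supplied PID string are illustrative::

	#include <stdio.h>

	/* Sum the AnonHugePages fields in /proc/PID/smaps, in kB. */
	long anon_huge_kb(const char *pid)
	{
		char path[64], line[256];
		long kb, total = 0;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
		f = fopen(path, "r");
		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
				total += kb;
		fclose(f);
		return total;
	}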
 |  | 
 | There are a number of counters in ``/proc/vmstat`` that may be used to | 
 | monitor how successfully the system is providing huge pages for use. | 
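
As a quick way to inspect them all, this sketch prints every
``/proc/vmstat`` line whose counter name begins with ``thp_`` (the
exact set of counters varies between kernel versions)::

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[128];
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f) {
			perror("/proc/vmstat");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "thp_", 4))
				fputs(line, stdout);
		fclose(f);
		return 0;
	}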
 |  | 
 | thp_fault_alloc | 
 | 	is incremented every time a huge page is successfully | 
 | 	allocated to handle a page fault. | 
 |  | 
 | thp_collapse_alloc | 
 | 	is incremented by khugepaged when it has found | 
 | 	a range of pages to collapse into one huge page and has | 
 | 	successfully allocated a new huge page to store the data. | 
 |  | 
 | thp_fault_fallback | 
 | 	is incremented if a page fault fails to allocate | 
 | 	a huge page and instead falls back to using small pages. | 
 |  | 
 | thp_fault_fallback_charge | 
 | 	is incremented if a page fault fails to charge a huge page and | 
 | 	instead falls back to using small pages even though the | 
 | 	allocation was successful. | 
 |  | 
 | thp_collapse_alloc_failed | 
 | 	is incremented if khugepaged found a range | 
 | 	of pages that should be collapsed into one huge page but failed | 
 | 	the allocation. | 
 |  | 
 | thp_file_alloc | 
 | 	is incremented every time a file huge page is successfully | 
 | 	allocated. | 
 |  | 
 | thp_file_fallback | 
 | 	is incremented if a file huge page is attempted to be allocated | 
 | 	but fails and instead falls back to using small pages. | 
 |  | 
 | thp_file_fallback_charge | 
 | 	is incremented if a file huge page cannot be charged and instead | 
 | 	falls back to using small pages even though the allocation was | 
 | 	successful. | 
 |  | 
 | thp_file_mapped | 
 | 	is incremented every time a file huge page is mapped into | 
 | 	user address space. | 
 |  | 
 | thp_split_page | 
 | 	is incremented every time a huge page is split into base | 
 | 	pages. This can happen for a variety of reasons but a common | 
 | 	reason is that a huge page is old and is being reclaimed. | 
	This action implies splitting all PMDs the page is mapped with.
 |  | 
 | thp_split_page_failed | 
	is incremented if the kernel fails to split a huge
	page. This can happen if the page was pinned by somebody.
 |  | 
 | thp_deferred_split_page | 
 | 	is incremented when a huge page is put onto split | 
 | 	queue. This happens when a huge page is partially unmapped and | 
 | 	splitting it would free up some memory. Pages on split queue are | 
 | 	going to be split under memory pressure. | 
 |  | 
 | thp_split_pmd | 
	is incremented every time a PMD is split into a table of PTEs.
	This can happen, for instance, when an application calls mprotect()
	or munmap() on part of a huge page. It doesn't split the huge page,
	only the page table entry.
 |  | 
 | thp_zero_page_alloc | 
 | 	is incremented every time a huge zero page used for thp is | 
 | 	successfully allocated. Note, it doesn't count every map of | 
 | 	the huge zero page, only its allocation. | 
 |  | 
 | thp_zero_page_alloc_failed | 
 | 	is incremented if kernel fails to allocate | 
 | 	huge zero page and falls back to using small pages. | 
 |  | 
 | thp_swpout | 
	is incremented every time a huge page is swapped out in one
	piece without splitting.
 |  | 
 | thp_swpout_fallback | 
	is incremented if a huge page has to be split before swapout,
	usually because the kernel failed to allocate some contiguous
	swap space for the huge page.
 |  | 
 | As the system ages, allocating huge pages may be expensive as the | 
 | system uses memory compaction to copy data around memory to free a | 
 | huge page for use. There are some counters in ``/proc/vmstat`` to help | 
 | monitor this overhead. | 
 |  | 
 | compact_stall | 
 | 	is incremented every time a process stalls to run | 
 | 	memory compaction so that a huge page is free for use. | 
 |  | 
 | compact_success | 
 | 	is incremented if the system compacted memory and | 
 | 	freed a huge page for use. | 
 |  | 
 | compact_fail | 
 | 	is incremented if the system tries to compact memory | 
 | 	but failed. | 
 |  | 
It is possible to establish how long the stalls were by using the
function tracer to record how long was spent in __alloc_pages() and
by using the mm_page_alloc tracepoint to identify which allocations
were for huge pages.
 |  | 
 | Optimizing the applications | 
 | =========================== | 
 |  | 
To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be naturally hugepage
aligned. posix_memalign() can provide that guarantee.
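
A minimal sketch that reads the huge page size from sysfs instead of
hard-coding 2M and allocates one naturally aligned region; the 2M
fallback value is an assumption matching the examples above, and error
handling is abbreviated::

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t hpage_size = 2UL << 20;	/* assumed fallback: 2M */
		FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
		void *buf;

		if (f) {
			fscanf(f, "%zu", &hpage_size);
			fclose(f);
		}
		/*
		 * A hugepage-aligned, hugepage-sized buffer is eligible for
		 * an immediate huge page mapping at first fault.
		 */
		if (posix_memalign(&buf, hpage_size, hpage_size))
			return 1;
		madvise(buf, hpage_size, MADV_HUGEPAGE);	/* optional hint */
		return 0;
	}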
 |  | 
 | Hugetlbfs | 
 | ========= | 
 |  | 
You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine, as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
the usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.