Linux HugePages Configuration and Its Relationship to Oracle Performance

阿新 • Published: 2018-11-19

I. HugePages Overview

1.1 Introduction to HugePages

HugePages is a feature integrated into the Linux kernel with release 2.6. This feature basically provides the alternative to the 4K page size (16K for IA64) providing bigger pages.

Several technical terms are associated with HugePages:

（1） Page Table: A page table is the data structure of a virtual memory system in an operating system to store the mapping between virtual addresses and physical addresses. This means that on a virtual memory system, the memory is accessed by first accessing a page table and then accessing the actual memory location implicitly.

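-- In other words, the page table is the data structure of the operating system's virtual memory system that stores the mapping between virtual and physical memory addresses. Every memory access on such a system first consults the page table and is then implicitly redirected, through that mapping, to the physical memory location.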

（2） TLB: A Translation Lookaside Buffer (TLB) is a buffer (or cache) in a CPU that contains parts of the page table. This is a fixed size buffer being used to do virtual address translation faster.

-- The TLB (Translation Lookaside Buffer) is a fixed-size buffer, or cache, inside the CPU. It holds part of the page table and is used to speed up virtual address translation.

(3) hugetlb: This is an entry in the TLB that points to a HugePage (a large/big page larger than regular 4K and predefined in size). HugePages are implemented via hugetlb entries, i.e. we can say that a HugePage is handled by a "hugetlb page entry". The "hugetlb" term is also (and mostly) used synonymously with a HugePage (See Note 261889.1). In this document the term "HugePage" is going to be used but keep in mind that mostly "hugetlb" refers to the same concept.

-- hugetlb is a TLB entry that points to a HugePage (a large page, bigger than 4K, of predefined size). HugePages are implemented through hugetlb entries, that is, a HugePage is handled by a hugetlb page entry. In MOS Note 261889.1 the two terms mean almost the same thing.

（4） hugetlbfs: This is a new in-memory filesystem like tmpfs and is presented by 2.6 kernel. Pages allocated on hugetlbfs type filesystem are allocated in HugePages.

-- hugetlbfs is a new in-memory filesystem introduced in the 2.6 kernel, similar to tmpfs.

1.2 Common Misconceptions

WRONG: HugePages is a method to be able to use large SGA on 32-bit VLM systems.
RIGHT: HugePages is a method to have larger pages, which is useful for working with very large memory. It is useful in both 32- and 64-bit configurations.

WRONG: HugePages cannot be used without USE_INDIRECT_DATA_BUFFERS.
RIGHT: HugePages can be used without indirect buffers. 64-bit systems do not need to use indirect buffers to have a large buffer cache for the RDBMS instance, and HugePages can be used there too.

WRONG: hugetlbfs means hugetlb.
RIGHT: hugetlbfs is a filesystem type **BUT** hugetlb is the mechanism employed in the back, and hugetlb can be employed WITHOUT hugetlbfs.

WRONG: hugetlbfs means hugepages.
RIGHT: hugetlbfs is a filesystem type **BUT** HugePages is the mechanism employed in the back (synonymously with hugetlb), and HugePages can be employed WITHOUT hugetlbfs.

1.3 Regular Pages and HugePages

When a single process works with a piece of memory, the pages that the process uses are referenced in a local page table for the specific process. The entries in this table also contain references to the System-Wide Page Table, which actually has references to actual physical memory addresses. So theoretically a user mode process (i.e. Oracle processes) follows its local page table to access the system page table and can then reference the actual physical memory virtually. As you can see below, it is also possible (and very common to Oracle RDBMS due to SGA use) that two different O/S processes can point to the same entry in the system-wide page table.

-- When a process works with a piece of memory, the pages it uses are referenced from its local page table. Entries in the local page table in turn reference entries in the System-Wide Page Table, which point to the actual physical memory addresses.

So, in theory, a user process (such as an Oracle process) follows the entry in its local page table to the entry in the system page table, and that entry points to the actual physical memory.

It is also possible for two different O/S processes to point to the same entry in the system-wide page table, as shown in the figure below. The most common cause is use of the Oracle SGA.

When HugePages are in play, the usual page tables are still employed. The basic difference is that the entries in both the process page table and the system page table have attributes about huge pages. So any page in a page table can be a huge page or a regular page. The following diagram illustrates 4096K hugepages but the diagram would be the same for any huge page size.

-- Once HugePages are configured, the basic difference is that entries in both the process page table and the system page table carry huge page attributes, so any page in a page table may be either a huge page or a regular page.

1.4 Some HugePages Facts/Features

(1) HugePages can be allocated on-the-fly but they must be reserved during system startup. Otherwise the allocation might fail as the memory is already paged in 4K mostly.

(2) HugePage sizes vary from 2MB to 256MB based on kernel version and HW architecture (See related section below.)

(3) HugePages are not subject to reservation / release after the system startup unless there is system administrator intervention, basically changing the hugepages configuration (i.e. number of pages available or pool size)

1.5 Advantages of HugePages Over Normal Sharing Or AMM

(1) Not swappable

HugePages are not swappable. Therefore there is no page-in/page-out mechanism overhead. HugePages are universally regarded as pinned.

(2) Relief of TLB pressure

1) HugePages use fewer pages to cover the physical address space, so the size of the "bookkeeping" (mapping from the virtual to the physical address) decreases, requiring fewer entries in the TLB.

2) TLB entries cover a larger part of the address space when HugePages are used, so there are fewer TLB misses before all or most of the SGA is mapped in the TLB.

3) Fewer TLB entries for the SGA also means more for other parts of the address space.

(3) Decreased page table overhead

Each page table entry can be as large as 64 bytes and if we are trying to handle 50GB of RAM, the pagetable will be approximately 800MB in size, which practically will not fit in the 880MB lowmem (in 2.4 kernels; the page table is not necessarily in lowmem in 2.6 kernels) considering the other uses of lowmem. When 95% of memory is accessed via 256MB hugepages, this can work with a page table of approximately 40MB in total.

-- Each page table entry can need up to 64 bytes of memory, so managing 50GB of memory takes a page table of about 800MB. With 256MB hugepages, the same 50GB needs a page table of only about 40MB.

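Dave's note:

In normal mode each page is 4K, so the number of entries needed is: (50*1024*1024/4)

Each entry is 64 bytes, so the total memory is: (50*1024*1024/4) * 64/1024/1024 = 800M

Note that this is the page table of a single process. With 10 processes, handling these pages alone would take 800*10, about 8G of memory, out of only 50G in total. This is why HugePages matter so much on large-memory systems.

HugePage sizes range from 2MB to 256MB. Taking 2MB:

(50*1024/2)*64/1024/1024 ≈ 1.6M

That is only about 16M even for 10 processes.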
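The arithmetic in the note above can be reproduced with a short shell sketch. It assumes, as the note does, one page table entry per page and 64 bytes per entry; the function name page_table_bytes is made up for this illustration.

```shell
# Sketch: estimate the per-process page table size for a given amount of
# mapped memory. Assumption (from the note above): 64 bytes per page table
# entry, one entry per page.

# page_table_bytes MEM_KB PAGE_KB -> bytes of page table needed
page_table_bytes() {
    local mem_kb=$1 page_kb=$2 entry_bytes=64
    echo $(( mem_kb / page_kb * entry_bytes ))
}

MEM_KB=$(( 50 * 1024 * 1024 ))   # 50 GB expressed in KB

echo "4K pages: $(page_table_bytes "$MEM_KB" 4) bytes"
echo "2M pages: $(page_table_bytes "$MEM_KB" 2048) bytes"
```

The first figure, 838860800 bytes, is the 800M quoted above; the second, 1638400 bytes, is the roughly 1.6M for 2MB huge pages.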

(4) Eliminated page table lookup overhead

Since the pages are not subject to replacement, page table lookups are not required.

(5) Faster overall memory performance

On virtual memory systems each memory operation is actually two abstract memory operations. Since there are fewer pages to work on, the possible bottleneck on page table access is clearly avoided.

-- Every memory operation on a virtual memory system actually takes two memory operations. HugePages reduce the number of pages and so avoid the bottleneck on page table access.

1.6 HugePage Size

The size of a single HugePage depends on the platform:

(1) Kernel version/linux distribution

(2) HW Platform

The actual HugePage size can be checked with the following command:

$ grep Hugepagesize /proc/meminfo
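For use inside a script, the same lookup can be wrapped in a small function. This is a minimal sketch; the function name hugepage_kb and its optional file argument are assumptions made for illustration, and the only thing it relies on is the Hugepagesize line of /proc/meminfo shown above.

```shell
# Sketch: return the huge page size in KB from a meminfo-style file.
# hugepage_kb is a hypothetical helper name, not a system command.

# hugepage_kb [MEMINFO_FILE] -> huge page size in KB; non-zero exit if absent
hugepage_kb() {
    local file=${1:-/proc/meminfo}
    local kb
    kb=$(awk '/^Hugepagesize:/ {print $2}' "$file" 2>/dev/null)
    [ -n "$kb" ] || return 1
    echo "$kb"
}

if size=$(hugepage_kb); then
    echo "Huge page size: ${size} kB ($(( size / 1024 )) MB)"
else
    echo "No Hugepagesize line found; HugePages may not be supported here."
fi
```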

The table below shows the sizes of HugePages on different configurations. Note that these are general numbers taken from the most recent versions of the kernels. For a specific kernel source package, you can check for the HPAGE_SIZE macro value (based on HPAGE_SHIFT) for a different (more recent) kernel source tree.

-- The table below lists the HugePage size on each platform:

HW Platform                      Source Code Tree   Kernel 2.4   Kernel 2.6
Linux x86 (IA32)                 i386               4 MB         4 MB *
Linux x86-64 (AMD64, EM64T)      x86_64             2 MB         2 MB
Linux Itanium (IA64)             ia64               256 MB       256 MB
IBM Power Based Linux (PPC64)    ppc64/powerpc      N/A **       16 MB
IBM zSeries Based Linux          s390               N/A          N/A
IBM S/390 Based Linux            s390               N/A          N/A

* Some older packaging for the 2.6.5 kernel on SLES8 (like 2.6.5-7.97) can have 2 MB Hugepagesize.

** Oracle RDBMS is also not certified in this configuration. See Document 341507.1

1.7 HugePages and Oracle 11g Automatic Memory Management (AMM)

The AMM and HugePages are not compatible. One needs to disable AMM on 11g to be able to use HugePages. See Document 749851.1 for further information.

-- AMM in Oracle 11g is incompatible with HugePages; keep this in mind.

1.8 Risks of Not Configuring HugePages Properly

On Linux, HugePages is a delicate configuration. If it is not set up properly, the following problems may occur:

(1) HugePages not used at all (HugePages_Total = HugePages_Free), wasting the memory configured for them

(2) Poor database performance

(3) System running out of memory or excessive swapping

(4) Some or any database instance cannot be started

(5) Crucial system services failing (e.g.: CRS)

To avoid / help with such situations Bug 10153816 was filed to introduce a database initialization parameter in 11.2.0.2 (use_large_pages) to help manage which SGAs will use huge pages and potentially give warnings or not start up at all if they cannot get those pages.

1.9 Why HugePages Should Be Configured

HugePages is crucial for faster Oracle database performance on Linux if you have a large RAM and SGA. If your combined database SGAs are large (like more than 8GB, though it can matter for smaller ones too), you will need HugePages configured. Note that the size of the SGA matters. Advantages of HugePages are:

-- With large RAM and a large SGA, HugePages is very important for database performance. If the combined SGA is large, for example more than 8G, HugePages should be configured. Configuring HugePages brings the following benefits:

(1) Larger Page Size and Fewer Pages: Default page size is 4K whereas the HugeTLB size is 2048K. That means the system needs to handle 512 times fewer pages.

(2) No Page Table Lookups: Since HugePages are not subject to replacement (unlike regular pages), page table lookups are not required.

(3) Better Overall Memory Performance: On virtual memory systems (any modern OS) each memory operation is actually two abstract memory operations. With HugePages, since there are fewer pages to work on, the possible bottleneck on page table access is clearly avoided.

(4) No Swapping: Swapping must be avoided on Linux altogether (see Document 1295478.1). HugePages are not swappable (whereas regular pages are). Therefore there is no page replacement mechanism overhead. HugePages are universally regarded as pinned.

(5) No 'kswapd' Operations: kswapd will get very busy if there is a very large area to be paged (i.e. 13 million page table entries for 50GB memory) and will use an incredible amount of CPU resource. When HugePages are used, kswapd is not involved in managing them. See also Document 361670.1

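II. Configuring HugePages

2.1 Step 1: Set memlock

Add memlock limits to /etc/security/limits.conf. The value, in KB, should be slightly smaller than the installed physical RAM. For example, with 64GB of RAM it can be set as follows: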

* soft memlock 60397977

* hard memlock 60397977

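Setting this value higher than the SGA requires has no adverse effect.

If you use the oracle-validated package on Oracle Linux, or an Exadata DB compute node, this parameter is configured automatically.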
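The example value 60397977 corresponds to 90% of 64GB expressed in KB. A minimal sketch of that calculation follows; the 90% ratio is inferred from the example rather than stated as a rule, and memlock_kb is a hypothetical helper name.

```shell
# Sketch: derive a memlock value (in KB) as a percentage of physical RAM.
# Assumption: 90% is used because it reproduces the 64GB example above;
# pick whatever "slightly smaller than RAM" means for your system.

# memlock_kb TOTAL_KB [PERCENT] -> memlock limit in KB
memlock_kb() {
    local total_kb=$1 pct=${2:-90}
    echo $(( total_kb * pct / 100 ))
}

# Example: 64 GB of RAM expressed in KB. On a live system the total could
# instead be read from the MemTotal line of /proc/meminfo.
TOTAL_KB=$(( 64 * 1024 * 1024 ))
LIMIT=$(memlock_kb "$TOTAL_KB")

printf '* soft memlock %s\n' "$LIMIT"
printf '* hard memlock %s\n' "$LIMIT"
```

The two printed lines match the limits.conf entries shown above.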

2.2 Step 2: Verify memlock

Check the value with the following command:

$ ulimit -l

60397977
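The check can also be scripted. The sketch below compares the output of ulimit -l with the amount of memory the HugePages pool would occupy. The pool of 1496 pages of 2048 KB each is only an illustrative assumption, and memlock_sufficient is a hypothetical helper name.

```shell
# Sketch: decide whether the current memlock limit covers a required size.
# Assumption: the required size is the HugePages pool, here 1496 pages of
# 2048 KB (illustrative numbers only).

# memlock_sufficient CURRENT REQUIRED_KB
# CURRENT is the output of `ulimit -l`: a number of KB or "unlimited".
memlock_sufficient() {
    local current=$1 required_kb=$2
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required_kb" ]
}

REQUIRED_KB=$(( 1496 * 2048 ))
if memlock_sufficient "$(ulimit -l)" "$REQUIRED_KB"; then
    echo "memlock limit covers ${REQUIRED_KB} KB"
else
    echo "memlock limit is below ${REQUIRED_KB} KB"
fi
```

Run it as the database owner account, since limits.conf settings apply per user.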

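2.3 Step 3: Disable AMM on 11g

From Oracle 11g onward, instances created with default settings use the Automatic Memory Management (AMM) feature, which is incompatible with HugePages.

AMM must be disabled before HugePages is set up. To do so, set the initialization parameters MEMORY_TARGET and MEMORY_MAX_TARGET to 0.

With AMM, all SGA memory is allocated under /dev/shm, so HugePages are not used when the SGA is allocated. This is why AMM and HugePages are incompatible.

Also, an ASM instance uses AMM by default, but because an ASM instance does not need a large SGA, using HugePages for it brings little benefit.

To use HugePages, first make sure that the MEMORY_TARGET / MEMORY_MAX_TARGET parameters are not set.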

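2.4 Step 4: Calculate the recommended value for vm.nr_hugepages

Make sure all database instances, including ASM instances, are running. Use the hugepages_settings.sh script to obtain the recommended value for the vm.nr_hugepages kernel parameter.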

$ ./hugepages_settings.sh
...
Recommended setting: vm.nr_hugepages = 1496
$
The value can also be calculated by hand, based on your own experience.
The script is as follows:
#!/bin/bash
#
# hugepages_settings.sh
#
# Linux bash script to compute values for the
# recommended HugePages/HugeTLB configuration
#
# Note: This script does calculation for all shared memory
# segments available when the script is run, no matter it
# is an Oracle RDBMS shared memory segment or not.
#
# This script is provided by Doc ID 401749.1 from My Oracle Support 
# http://support.oracle.com
# Welcome text
echo "
This script is provided by Doc ID 401749.1 from My Oracle Support 
(http://support.oracle.com) where it is intended to compute values for 
the recommended HugePages/HugeTLB configuration for the current shared 
memory segments. Before proceeding with the execution please note following:
 * For ASM instance, it needs to configure ASMM instead of AMM.
 * The 'pga_aggregate_target' is outside the SGA and 
   you should accommodate this while calculating SGA size.
 * In case you changes the DB SGA size, 
   as the new SGA will not fit in the previous HugePages configuration, 
   it had better disable the whole HugePages, 
   start the DB with new SGA size and run the script again.
And make sure that:
 * Oracle Database instance(s) are up and running
 * Oracle Database 11g Automatic Memory Management (AMM) is not setup 
   (See Doc ID 749851.1)
 * The shared memory segments can be listed by command:
     # ipcs -m
Press Enter to proceed..."
read
# Check for the kernel version
KERN=`uname -r | awk -F. '{ printf("%d.%d\n",$1,$2); }'`
# Find out the HugePage size
HPG_SZ=`grep Hugepagesize /proc/meminfo | awk '{print $2}'`
if [ -z "$HPG_SZ" ];then
    echo "The hugepages may not be supported in the system where the script is being executed."
    exit 1
fi
# Initialize the counter
NUM_PG=0
# Cumulative number of pages required to handle the running shared memory segments
for SEG_BYTES in `ipcs -m | cut -c44-300 | awk '{print $1}' | grep "[0-9][0-9]*"`
do
    MIN_PG=`echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q`
    if [ $MIN_PG -gt 0 ]; then
        NUM_PG=`echo "$NUM_PG+$MIN_PG+1" | bc -q`
    fi
done
RES_BYTES=`echo "$NUM_PG * $HPG_SZ * 1024" | bc -q`
# An SGA less than 100MB does not make sense
# Bail out if that is the case
if [ $RES_BYTES -lt 100000000 ]; then
    echo "***********"
    echo "** ERROR **"
    echo "***********"
    echo "Sorry! There are not enough total of shared memory segments allocated for 
HugePages configuration. HugePages can only be used for shared memory segments 
that you can list by command:
    # ipcs -m
of a size that can match an Oracle Database SGA. Please make sure that:
 * Oracle Database instance is up and running 
 * Oracle Database 11g Automatic Memory Management (AMM) is not configured"
    exit 1
fi
# Finish with results
case $KERN in
    '2.4') HUGETLB_POOL=`echo "$NUM_PG*$HPG_SZ/1024" | bc -q`;
           echo "Recommended setting: vm.hugetlb_pool = $HUGETLB_POOL" ;;
    '2.6') echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
     *) echo "Unrecognized kernel version $KERN. Exiting." ;;
esac
# End
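
The core of the script is the per-segment page count: each shared memory segment is divided by the huge page size, and any segment of at least one huge page gets one extra page for the remainder. The sketch below isolates that step using shell arithmetic instead of bc. The function names and the sample segment sizes are made up for illustration.

```shell
# Sketch of the page-count step used by hugepages_settings.sh.
# pages_for_segment and total_pages are hypothetical helper names.

# pages_for_segment SEG_BYTES HPG_KB -> huge pages counted for one segment
pages_for_segment() {
    local seg_bytes=$1 hpg_kb=$2
    local min_pg=$(( seg_bytes / (hpg_kb * 1024) ))
    if [ "$min_pg" -gt 0 ]; then
        echo $(( min_pg + 1 ))   # one extra page covers any remainder
    else
        echo 0                   # segments smaller than a huge page are skipped
    fi
}

# total_pages HPG_KB SEG_BYTES... -> sum over all listed segments
total_pages() {
    local hpg_kb=$1 total=0 seg
    shift
    for seg in "$@"; do
        total=$(( total + $(pages_for_segment "$seg" "$hpg_kb") ))
    done
    echo "$total"
}

# Illustrative input: one 3 GB segment and one 4 KB segment, 2 MB huge pages
total_pages 2048 $(( 3 * 1024 * 1024 * 1024 )) 4096
```

With these inputs the 3 GB segment counts as 1537 pages and the 4 KB segment as none, so the suggested vm.nr_hugepages would be 1537.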

2.5 Step 5: Set the vm.nr_hugepages parameter in /etc/sysctl.conf

...

vm.nr_hugepages = 1496

...
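If the setting is applied by a script, it helps to replace any existing vm.nr_hugepages line rather than append a second one. The following is a minimal sketch of that idea. set_nr_hugepages is a hypothetical helper, the kernel.shmmax line is just sample content, and the demonstration deliberately works on a scratch file instead of the real /etc/sysctl.conf.

```shell
# Sketch: set vm.nr_hugepages in a sysctl-style file without duplicating it.
# set_nr_hugepages is a hypothetical helper name.

# set_nr_hugepages FILE VALUE
set_nr_hugepages() {
    local file=$1 value=$2 tmp
    tmp=$(mktemp) || return 1
    if [ -f "$file" ]; then
        # Keep every line except an existing vm.nr_hugepages assignment
        grep -v '^[[:space:]]*vm\.nr_hugepages[[:space:]]*=' "$file" > "$tmp" || true
    fi
    echo "vm.nr_hugepages = $value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstrate on a scratch file rather than the real /etc/sysctl.conf
demo=$(mktemp)
printf 'kernel.shmmax = 68719476736\nvm.nr_hugepages = 1024\n' > "$demo"
set_nr_hugepages "$demo" 1496
cat "$demo"
rm -f "$demo"
```

On a running system, sysctl -p reloads /etc/sysctl.conf as root, but the kernel may not find enough contiguous memory at that point, which is why the procedure here reboots the server.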

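2.6 Step 6: Stop all instances and reboot the server

2.7 Verify the Configuration

After the reboot, make sure all database instances are running, then check the HugePages state with the following command: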

# grep HugePages /proc/meminfo
HugePages_Total:    1496
HugePages_Free:      485
HugePages_Rsvd:      446
HugePages_Surp:        0

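For the HugePages configuration to be valid, HugePages_Free should be smaller than HugePages_Total, and some pages should normally show as reserved in HugePages_Rsvd, as in the sample above.

The HugePages_Free and HugePages_Rsvd values should be smaller than the number of pages allocated for the SGA.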
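
The first of these checks is easy to automate. The sketch below classifies a meminfo-style file by comparing HugePages_Free with HugePages_Total. The helper name hugepages_status and its three labels are assumptions for illustration, and the sample data repeats the output shown above.

```shell
# Sketch: classify HugePages usage from the HugePages_* counters.
# hugepages_status and its labels are hypothetical, not a standard tool.

# hugepages_status MEMINFO_FILE -> "unconfigured", "unused" or "in-use"
hugepages_status() {
    local file=$1 total free
    total=$(awk '/^HugePages_Total:/ {print $2}' "$file")
    free=$(awk '/^HugePages_Free:/ {print $2}' "$file")
    if [ -z "$total" ] || [ "$total" -eq 0 ]; then
        echo "unconfigured"      # no pool reserved
    elif [ "$free" -lt "$total" ]; then
        echo "in-use"            # at least one huge page has been taken
    else
        echo "unused"            # HugePages_Total = HugePages_Free
    fi
}

# Sample data copied from the /proc/meminfo output above
sample=$(mktemp)
cat > "$sample" <<'EOF'
HugePages_Total:    1496
HugePages_Free:      485
HugePages_Rsvd:      446
HugePages_Surp:        0
EOF
hugepages_status "$sample"
rm -f "$sample"
```

On a live system, pass /proc/meminfo as the argument. An "unused" result after the instances are up points to the AMM or memlock issues covered in the troubleshooting section.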

2.8 Troubleshooting

Some common problems are listed below:

Symptom: System is running out of memory or swapping
Possible Cause: Not enough HugePages to cover the SGA(s), so the area reserved for HugePages is wasted while the SGAs are allocated through regular pages.
Troubleshooting Action: Review your HugePages configuration to make sure that all SGA(s) are covered.

Symptom: Databases fail to start
Possible Cause: memlock limits are not set properly.
Troubleshooting Action: Make sure the settings in limits.conf apply to the database owner account.

Symptom: One of the databases fails to start while another is up
Possible Cause: The SGA of the specific database could not find available HugePages and the remaining RAM is not enough.
Troubleshooting Action: Make sure that the RAM and HugePages are enough to cover all your database SGAs.

Symptom: Cluster Ready Services (CRS) fail to start
Possible Cause: HugePages configured too large (maybe larger than installed RAM).
Troubleshooting Action: Make sure the total SGA is less than the installed RAM and re-calculate HugePages.

Symptom: HugePages_Total = HugePages_Free
Possible Cause: HugePages are not used at all. No database instances are up, or they are using AMM.
Troubleshooting Action: Disable AMM and make sure that the database instances are up.

Symptom: Database started successfully but performance is slow
Possible Cause: The SGA of the specific database could not find available HugePages and is therefore handled by regular pages, which leads to slow performance.
Troubleshooting Action: Make sure that there are enough HugePages to cover all your database SGAs.

2.9 Related MOS Documents

HugePages and Oracle Database 11g Automatic Memory Management (AMM) on Linux [ID 749851.1]

Hugepages are Not used by Database Buffer Cache [ID 829850.1]

Oracle Not Utilizing Hugepages [ID 803238.1]

/proc/meminfo Does Not Provide HugePages Information on Oracle Enterprise Linux (OEL5) [ID 860350.1]

HugePages Not Released On Oracle RDBMS Instance Shutdown with RHEL / EL 5 Update 1 (5.1) [ID 550443.1]

Shell Script to Calculate Values Recommended Linux HugePages / HugeTLB Configuration [ID 401749.1]

HugePages on Oracle Linux 64-bit [ID 361468.1]

HugePages on Linux: What It Is... and What It Is Not... [ID 361323.1]

Setup HugePages in an Guest Does Not Work with Oracle VM 2.1 or 2.1.1 [ID 728063.1]