Introduction

One of the most important aspects of an operating system is the Virtual Memory Management system. Virtual Memory (VM) allows an operating system to perform many of its advanced functions, such as process isolation, file caching, and swapping. As such, it is imperative that an administrator understand the functions and tunable parameters of an operating system's Virtual Memory Manager so that optimal performance for a given workload may be achieved. After reading this article, the reader should have a rudimentary understanding of the data the Red Hat Enterprise Linux 3 (RHEL3) VM controls and the algorithms it uses. Further, the reader should have a fairly good understanding of general Linux VM tuning techniques. It is important to note that Linux as an operating system has a proud legacy of overhaul. Items which no longer serve useful purposes, or which have better implementations as technology advances, are phased out. This implies that the tuning parameters described in this article may not apply if you are using a newer or older kernel. Fear not, however! With a well-grounded understanding of the general mechanics of a VM, it is fairly easy to carry knowledge of VM tuning over to another VM. The same general principles apply, and documentation for a given kernel (including its specific tunable parameters) can be found in the corresponding kernel source tree in the file Documentation/sysctl/vm.txt.

Definitions

To properly understand how a Virtual Memory Manager does its job, it helps to understand what components comprise a VM. While the low-level view of a VM is overwhelming for most, a high-level view is necessary to understand how a VM works and how it can be optimized for workloads.

What Comprises a VM

Figure 1. High Level Overview of VM Subsystem

The inner workings of the Linux virtual memory subsystem are quite complex, but it can be defined at a high level with the following components:

MMU

The Memory Management Unit (MMU) is the hardware base that makes a VM system possible. The MMU allows software to reference physical memory by aliased addresses, quite often more than one. It accomplishes this through the use of pages and page tables. The MMU uses a section of memory to translate virtual addresses into physical addresses via a series of table lookups.
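
The table lookups described above can be sketched in a few lines of Python. This is only an illustration, assuming a two-level table with 4 KB pages and 10 index bits per level (a common 32-bit layout); the names `translate` and `page_directory` are hypothetical and do not come from the kernel.

```python
PAGE_SHIFT = 12                      # assumed 4 KB pages
PAGE_SIZE = 1 << PAGE_SHIFT
INDEX_BITS = 10                      # assumed: 10 index bits per table level
INDEX_MASK = (1 << INDEX_BITS) - 1


def translate(page_directory, vaddr):
    """Walk a two-level table; return the physical address or None on a fault."""
    dir_index = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
    table_index = (vaddr >> PAGE_SHIFT) & INDEX_MASK
    offset = vaddr & (PAGE_SIZE - 1)

    page_table = page_directory.get(dir_index)
    if page_table is None:
        return None                  # no second-level table is mapped here
    frame = page_table.get(table_index)
    if frame is None:
        return None                  # no physical page is mapped here
    return (frame << PAGE_SHIFT) | offset


# Two different virtual pages aliasing the same physical frame 0x1234.
page_directory = {
    0: {1: 0x1234},
    1: {0: 0x1234},
}

print(hex(translate(page_directory, 0x00001ABC)))   # 0x1234abc
print(hex(translate(page_directory, 0x00400ABC)))   # 0x1234abc
print(translate(page_directory, 0x00800000))        # None
```

Both virtual addresses resolve to the same physical address, which is the aliasing the paragraph refers to.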

Translator's note: a page table is the data structure an operating system's virtual memory system uses to store the mapping between virtual memory and physical memory, with the memory space divided into pages.

Zoned Buddy Allocator

The Zoned Buddy Allocator is responsible for the management of page allocations for the entire system. This code manages lists of physically contiguous pages and maps them into the MMU page tables, so as to provide other kernel subsystems with valid physical address ranges when the kernel requests them (physical-to-virtual address mapping is handled by a higher layer of the VM). The name Buddy Allocator is derived from the algorithm this subsystem uses to maintain its free page lists. All physical pages in RAM are cataloged by the Buddy Allocator and grouped into lists. Each list represents clusters of 2^n pages, where n is incremented in each list. If no entries exist on the requested list, an entry from the next list up is broken into two separate clusters: one is returned to the caller, while the other is added to the next list down. When an allocation is returned to the Buddy Allocator, the reverse process happens: if the freed cluster's buddy is also free, the two are merged and moved back up a list. Note that the Buddy Allocator also manages memory zones, which define pools of memory that have different purposes. Currently there are three memory pools for which the Buddy Allocator manages access:

  • DMA — This zone consists of the first 16 MB of RAM, from which legacy devices allocate to perform direct memory operations.

  • NORMAL — This zone encompasses memory addresses from 16 MB to 1 GB and is used by the kernel for internal data structures as well as other system and user space allocations.

  • HIGHMEM — This zone includes all memory above 1 GB and is used exclusively for system allocations (file system buffers, user space allocations, etc).
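
The split-and-merge behavior of the free lists described above can be sketched as follows. This is a toy model, not the kernel's code: `MAX_ORDER`, the class name, and the use of Python sets are all assumptions made for illustration, zones are ignored, and the total page count is assumed to be a multiple of the largest cluster size.

```python
MAX_ORDER = 4   # assumed: the largest cluster holds 2**4 = 16 pages


class BuddyAllocator:
    def __init__(self, total_pages):
        # free_lists[n] holds the first page number of each free 2**n-page cluster
        self.free_lists = [set() for _ in range(MAX_ORDER + 1)]
        for start in range(0, total_pages, 1 << MAX_ORDER):
            self.free_lists[MAX_ORDER].add(start)

    def alloc(self, order):
        """Return the first page of a free 2**order-page cluster, or None."""
        for n in range(order, MAX_ORDER + 1):
            if self.free_lists[n]:
                break
        else:
            return None                      # nothing large enough is free
        start = min(self.free_lists[n])
        self.free_lists[n].remove(start)
        while n > order:
            # Split: keep the lower half, put the upper half on the next list down.
            n -= 1
            self.free_lists[n].add(start + (1 << n))
        return start

    def free(self, start, order):
        """Return a cluster, merging with its buddy for as long as the buddy is free."""
        while order < MAX_ORDER:
            buddy = start ^ (1 << order)
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].add(start)


allocator = BuddyAllocator(total_pages=16)
page = allocator.alloc(order=0)              # splits the 16-page cluster four times
print(page, [sorted(s) for s in allocator.free_lists])
allocator.free(page, order=0)                # merges all the way back up
print([sorted(s) for s in allocator.free_lists])
```

The buddy of a cluster is found by flipping one bit of its starting page number, which is why cluster sizes are restricted to powers of two.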

Slab Allocator

The Slab Allocator provides a more usable front end to the Buddy Allocator for those sections of the kernel which require memory in sizes that are more flexible than the standard 4 KB page. The Slab Allocator allows other kernel components to create caches of memory objects of a given size. The Slab Allocator is responsible for placing as many of the cache's objects on a page as possible and for tracking which objects are free and which are allocated. When an allocation is requested and no free objects are available, the Slab Allocator requests more pages from the Buddy Allocator to satisfy the request. This allows kernel components to use memory in a much simpler way: components which make use of many small portions of memory do not each have to implement their own memory management code to avoid wasting pages. The Slab Allocator may only allocate from the DMA and NORMAL zones.
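
As a rough illustration of the idea, the sketch below packs fixed-size objects onto 4 KB pages and asks a page source for a new page only when every slot is in use. The class `SlabCache` and the `get_page` callback (a stand-in for the Buddy Allocator) are hypothetical names; the real allocator is considerably more involved.

```python
import itertools

PAGE_SIZE = 4096   # the standard 4 KB page mentioned above


class SlabCache:
    """Toy cache of fixed-size objects carved out of whole pages."""

    def __init__(self, object_size, get_page):
        self.object_size = object_size
        self.per_page = PAGE_SIZE // object_size   # objects packed onto one page
        self.get_page = get_page                   # stand-in for the Buddy Allocator
        self.pages = []                            # pages owned by this cache
        self.free_slots = []                       # (page, slot) pairs not in use

    def alloc(self):
        if not self.free_slots:
            # No free object left: request one more page and carve it into slots.
            page = self.get_page()
            self.pages.append(page)
            self.free_slots.extend((page, slot) for slot in range(self.per_page))
        return self.free_slots.pop()

    def free(self, obj):
        self.free_slots.append(obj)


page_numbers = itertools.count()
cache = SlabCache(object_size=256, get_page=lambda: next(page_numbers))
objects = [cache.alloc() for _ in range(17)]
print(cache.per_page, len(cache.pages))   # 16 objects fit per page, so 17 need 2 pages
```

A caller only ever sees `alloc` and `free`; how objects are laid out on pages stays inside the cache.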

For more on the Slab Allocator, see: http://www.ibm.com/developerworks/cn/linux/l-linux-slab-allocator/

Kernel Threads

The last component in the VM subsystem is the set of kernel threads: kscand, kswapd, kupdated, and bdflush. These tasks are responsible for the recovery and management of in-use memory. All pages of memory have an associated state (for more information on the memory state machine, refer to the section called “The Life of a Page”). In general, the active tasks in the kernel related to VM usage are responsible for attempting to move pages out of RAM. Periodically they examine RAM, trying to identify and free inactive memory so that it can be put to other uses in the system.

The Life of a Page

All of the memory managed by the VM is labeled with a state. These states tell the VM what to do with a given page under various circumstances. Depending on the current needs of the system, the VM may transfer pages from one state to the next, according to the state machine in Figure 2. “VM Page State Machine”. Using these states, the VM can determine what the system is doing with a page at a given time and what actions the VM may take on the page. The states that have particular meanings are as follows:

  1. FREE — All pages available for allocation begin in this state. This indicates to the VM that the page is not being used for any purpose and is available for allocation.

  2. ACTIVE — Pages which have been allocated from the Buddy Allocator enter this state. It indicates to the VM that the page has been allocated and is actively in use by the kernel or a user process.

  3. INACTIVE DIRTY — This state indicates that the page has fallen into disuse by the entity which allocated it and thus is a candidate for removal from main memory. The kscand task periodically sweeps through all the pages in memory, taking note of the amount of time the page has been in memory since it was last accessed. If kscand finds that a page has been accessed since it last visited the page, it increments the page's age counter; otherwise, it decrements that counter. If kscand finds a page with its age counter at zero, it moves the page to the inactive dirty state. Pages in the inactive dirty state are kept in a list of pages to be laundered.

  4. INACTIVE LAUNDERED — This is an interim state that pages selected for removal from main memory enter while their contents are being written to disk. Only pages which were in the inactive dirty state can enter this state. When the disk I/O operation is complete, the page is moved to the inactive clean state, where it may be deallocated or overwritten for another purpose. If the page is accessed during the disk operation, it is moved back into the active state.

  5. INACTIVE CLEAN — Pages in this state have been laundered. This means that the contents of the page are in sync with the backed up data on disk. Thus, they may be deallocated by the VM or overwritten for other purposes.

Figure 2. VM Page State Machine
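
The transitions listed above can be written down as a small state machine. The sketch below follows only the transitions the text describes; the function names and the initial age value of 2 are assumptions made for illustration and are not taken from the kernel.

```python
from enum import Enum


class State(Enum):
    FREE = "free"
    ACTIVE = "active"
    INACTIVE_DIRTY = "inactive dirty"
    INACTIVE_LAUNDERED = "inactive laundered"
    INACTIVE_CLEAN = "inactive clean"


class Page:
    def __init__(self):
        self.state = State.FREE
        self.age = 0
        self.accessed = False


def allocate(page, initial_age=2):       # initial_age is an assumed value
    page.state = State.ACTIVE
    page.age = initial_age
    page.accessed = False


def touch(page):
    """Record an access; a page being laundered goes back to ACTIVE."""
    page.accessed = True
    if page.state is State.INACTIVE_LAUNDERED:
        page.state = State.ACTIVE


def kscand_scan(page):
    """One kscand visit: age the page up or down, deactivate it at age zero."""
    if page.state is not State.ACTIVE:
        return
    if page.accessed:
        page.age += 1
        page.accessed = False
    elif page.age > 0:
        page.age -= 1
    if page.age == 0:
        page.state = State.INACTIVE_DIRTY


def start_laundering(page):
    if page.state is State.INACTIVE_DIRTY:
        page.state = State.INACTIVE_LAUNDERED


def io_complete(page):
    if page.state is State.INACTIVE_LAUNDERED:
        page.state = State.INACTIVE_CLEAN


def reclaim(page):
    if page.state is State.INACTIVE_CLEAN:
        page.state = State.FREE


page = Page()
allocate(page)
kscand_scan(page)          # not accessed: age 2 -> 1
kscand_scan(page)          # not accessed: age 1 -> 0, so the page is deactivated
print(page.state)          # State.INACTIVE_DIRTY
start_laundering(page)
io_complete(page)
reclaim(page)
print(page.state)          # State.FREE
```

A page that keeps being touched between scans never reaches age zero, so it stays in the active state.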

Tuning the VM

Now that the VM mechanism has been described, how is it adjusted to fit certain workloads? There are two methods for changing tunable parameters in the Linux VM. The first is the sysctl interface. The sysctl interface is a programming-oriented interface, which allows software programs to modify various tunable parameters directly. It is exposed to system administrators via the sysctl utility, which allows an administrator to specify a value for any of the tunable VM parameters on the command line. For example:

sysctl -w vm.max_map_count=65535

The sysctl utility also supports the use of a configuration file (/etc/sysctl.conf), in which all the desired changes to a VM can be recorded for a system and restored after a restart of the operating system, making this access method suitable for long-term changes to a system VM. The file is straightforward in its layout, using simple key-value pairs with comments for clarity. For example:

# Adjust the min and max read-ahead for files
vm.max-readahead=64
vm.min-readahead=32
# Select the memory over-commit policy
vm.overcommit_memory=2
# Bump up the percentage of memory in use to activate bdflush
vm.bdflush="40 500 0 0 500 3000 60 20 0"
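
To make the key-value layout concrete, here is a minimal sketch of how such a file could be read. The function `parse_sysctl_conf` is a hypothetical helper, not part of the sysctl utility, and it handles only the constructs shown in the example above (`#` comments, `key=value` lines, and optional double quotes around the value).

```python
def parse_sysctl_conf(text):
    """Return a dict of the key=value settings found in sysctl.conf-style text."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue                          # skip blank lines and comments
        key, separator, value = line.partition("=")
        if not separator:
            continue                          # not a key=value line
        settings[key.strip()] = value.strip().strip('"')
    return settings


example = """\
# Adjust the min and max read-ahead for files
vm.max-readahead=64
vm.min-readahead=32
vm.bdflush="40 500 0 0 500 3000 60 20 0"
"""

for key, value in parse_sysctl_conf(example).items():
    print(key, "->", value)
```

Values are kept as strings because a single key such as vm.bdflush can carry several numbers.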

The second method of modifying VM tunable parameters is via the proc file system. This method exports every group of VM tunables as a virtual file, accessible via all the common Linux utilities used for modifying file contents. The VM tunables are available in the directory /proc/sys/vm/ and are most commonly read and modified using the cat and echo commands. For example, use the command cat /proc/sys/vm/kswapd to view the current value of the kswapd tunable. The output should be similar to:

512 32 8

Then, use the following command to modify the value of the tunable:

echo 511 31 7 > /proc/sys/vm/kswapd

Use the cat /proc/sys/vm/kswapd command again to verify that the value was modified. The output should be:

511 31 7

The proc file system interface is a convenient method for making adjustments to the VM while attempting to isolate the peak performance of a system. For convenience, the following sections list the VM tunable parameters as the filenames they are exported to in the /proc/sys/vm/ directory. Unless otherwise noted, these tunables apply to the RHEL3 2.4.21-4 kernel.
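
The two interfaces name the same tunables: a dotted sysctl key such as vm.kswapd corresponds to the file /proc/sys/vm/kswapd. The sketch below shows that mapping along with the read and write steps performed above with cat and echo. The helper names are hypothetical, and the demonstration runs against a temporary directory rather than the real /proc/sys, so it changes nothing on the system.

```python
import tempfile
from pathlib import Path


def sysctl_to_path(name, root="/proc/sys"):
    """Map a dotted sysctl key (vm.kswapd) to its file under root."""
    return Path(root, *name.split("."))


def read_tunable(name, root="/proc/sys"):
    """Return the tunable's whitespace-separated fields, like cat."""
    return sysctl_to_path(name, root).read_text().split()


def write_tunable(name, values, root="/proc/sys"):
    """Write space-separated values to the tunable, like echo with a redirect."""
    text = " ".join(str(value) for value in values) + "\n"
    sysctl_to_path(name, root).write_text(text)


print(sysctl_to_path("vm.kswapd"))

# Demonstration against a scratch directory standing in for /proc/sys.
with tempfile.TemporaryDirectory() as fake_root:
    Path(fake_root, "vm").mkdir()
    write_tunable("vm.kswapd", [512, 32, 8], root=fake_root)
    write_tunable("vm.kswapd", [511, 31, 7], root=fake_root)
    print(read_tunable("vm.kswapd", root=fake_root))   # ['511', '31', '7']
```

On a real system, writing under /proc/sys requires root privileges, and the set of files present depends on the kernel version.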

bdflush

The bdflush file contains 9 parameters, of which 6 are tunable. These parameters affect the rate at which pages in the buffer cache (the subset of pagecache which stores files in memory) are written back to disk and freed. By adjusting the various values in this file, a system can be tuned to achieve better performance in environments where large amounts of file I/O are performed. Table 1. “bdflush Parameters” defines the parameters for bdflush in the order they appear in the file.

Parameter            Description
nfract               The percentage of dirty pages in the buffer cache required to activate the bdflush task
ndirty               The maximum number of dirty pages in the buffer cache to write to disk in each bdflush execution
reserved1            Reserved for future use
reserved2            Reserved for future use
interval             The number of jiffies (10 ms periods) to delay between bdflush iterations
age_buffer           The time for a normal buffer to age before it is considered for flushing back to disk
nfract_sync          The percentage of dirty pages in the buffer cache required to make the tasks that are dirtying pages write those pages to disk themselves instead
nfract_stop_bdflush  The percentage of dirty pages in the buffer cache required to allow bdflush to return to an idle state
reserved3            Reserved for future use
Table 1. bdflush Parameters
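
Because the nine values are positional, it is easy to misread a line from the bdflush file. The sketch below pairs each value with its name from Table 1; `parse_bdflush` is a hypothetical helper, and the sample line is the one used in the /etc/sysctl.conf example earlier in this article.

```python
BDFLUSH_FIELDS = (
    "nfract", "ndirty", "reserved1", "reserved2", "interval",
    "age_buffer", "nfract_sync", "nfract_stop_bdflush", "reserved3",
)


def parse_bdflush(line):
    """Map the nine positional bdflush values to the names in Table 1."""
    values = [int(field) for field in line.split()]
    if len(values) != len(BDFLUSH_FIELDS):
        raise ValueError(f"expected {len(BDFLUSH_FIELDS)} values, got {len(values)}")
    return dict(zip(BDFLUSH_FIELDS, values))


params = parse_bdflush("40 500 0 0 500 3000 60 20 0")
print(params["nfract"])            # 40 (percent of dirty pages that wakes bdflush)
print(params["interval"] * 10)     # 5000 (ms between iterations, at 10 ms per jiffy)
```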
