Introduction
This article walks through Linux memory management, looking at memory addressing and allocation from a system-wide perspective.
Overview
Memory management is probably one of the hardest subsystems in the kernel, and it has a long history, so let's start with a quick review.
Segmentation and paging
When memory management comes up, the first things that come to mind are segmentation and paging. Early computers had neither: code operated directly on physical addresses, so the addresses used in a program mapped one-to-one onto physical memory. As multi-process scheduling appeared and physical memory stayed limited, people began to partition memory by convention: one range of memory belongs to the kernel, another range to process A, yet another to process B, as shown below.
image.png
The problem with this scheme is that every process has a fixed location and a fixed size. Say process A needs 10 MB and process D needs 5 MB; when A is swapped out and D is placed into A's slot, 5 MB of that slot sits idle — this is external fragmentation. Having processes use physical addresses directly is also unsafe: another process can modify the contents of my process's address space, which is a serious security risk. On top of that, fixed physical addresses are inconvenient for the process itself: nobody can guarantee that the assigned physical range is actually available, and portability suffers as well.
Segmentation
The security problem had to be solved first, so the segmentation mechanism was introduced. Its biggest advantage is that addressing no longer uses physical addresses; virtual addresses are enough. A register holds the address of the segment descriptor table — what later became the GDT and LDT — and at run time the code or data segment register holds an index into that table. The GDT/LDT entries store the segment's address and permission information, and the virtual address used by the program becomes an offset within the segment. Because the GDT/LDT entries also carry permission checks, the mechanism itself prevents process A from touching process B's address space. Since code now uses virtual addresses, it no longer needs to know the physical address range. And because a program can be split into different segments that are loaded on demand at run time, external fragmentation is also reduced to some degree.
Paging
Segmentation is indeed good, but it can be improved further — that improvement is paging. The weakness of segmentation is that it manages memory at segment granularity, and segments are still large relative to memory: a code or data segment can be huge, and at that granularity fragmentation remains a problem. Suppose 5 MB is free but the program's data segment is 10 MB, of which only 1 MB will actually be used — too bad, the program still cannot run, because with segment-based loading the whole 10 MB must fit, not one megabyte less. What is needed is a finer-grained mechanism, and that is paging. Paging manages the address space in page frames, usually 4 KB (other sizes exist too), which must match the physical page frame size; memory is then managed at page-frame granularity.
Now look at the address translation path. The virtual address plus the segment selector first goes through segmentation and produces a linear address. Without paging, that linear address is the physical address. With paging, the linear address has to go through the page tables: a page table records the mapping from linear addresses to physical addresses, again at page-frame granularity. Translation looks up the physical frame for the linear address's page, then adds the offset part of the linear address (for example the low 12 bits) to obtain the real physical address. The process looks like this:
image.png
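As a minimal sketch of the split into page number plus offset — not how the hardware or the kernel actually stores page tables, a single-level toy table is assumed purely for illustration:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12                      /* 4 KB pages */
#define PAGE_SIZE   (1UL << PAGE_SHIFT)
#define OFFSET_MASK (PAGE_SIZE - 1)

/* Toy single-level "page table": index = linear page number, value = physical frame base. */
static uint64_t toy_page_table[16];

static uint64_t translate(uint64_t linear)
{
    uint64_t page_number = linear >> PAGE_SHIFT;    /* which page frame */
    uint64_t offset      = linear & OFFSET_MASK;    /* low 12 bits */
    return toy_page_table[page_number] + offset;    /* frame base + offset */
}

int main(void)
{
    toy_page_table[2] = 0x80000;                    /* pretend page 2 maps to phys 0x80000 */
    printf("0x%llx\n", (unsigned long long)translate(0x2abc));  /* prints 0x80abc */
    return 0;
}
```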
That leaves two questions:
- Since paging is this fine-grained, can we do without segmentation entirely?
- Does paging completely eliminate memory fragmentation?
The answers:
- Indeed, with paging alone segmentation is not strictly needed. Linux currently implements paging on top of segmentation, largely for compatibility reasons.
- Paging only makes the management granularity smaller; it cannot eliminate fragmentation completely. It does, however, bound the fragmentation to less than one page frame, which is already a big improvement over the earlier schemes.
Physical memory management
In the kernel, physical memory is managed in page frames; each page frame has a corresponding struct page, defined as follows:
```c
struct page {
    unsigned long flags;        /* Atomic flags, some possibly
                                 * updated asynchronously */
    /*
     * Five words (20/40 bytes) are available in this union.
     * WARNING: bit 0 of the first word is used for PageTail(). That
     * means the other users of this union MUST NOT use the bit to
     * avoid collision and false-positive PageTail().
     */
    union {
        struct {    /* Page cache and anonymous pages */
            /**
             * @lru: Pageout list, eg. active_list protected by
             * lruvec->lru_lock. Sometimes used as a generic list
             * by the page owner.
             */
            union {
                struct list_head lru;
                /* Or, for the Unevictable "LRU list" slot */
                struct {
                    /* Always even, to negate PageTail */
                    void *__filler;
                    /* Count page's or folio's mlocks */
                    unsigned int mlock_count;
                };
                /* Or, free page */
                struct list_head buddy_list;
                struct list_head pcp_list;
            };
            /* See page-flags.h for PAGE_MAPPING_FLAGS */
            struct address_space *mapping;
            union {
                pgoff_t index;          /* Our offset within mapping. */
                unsigned long share;    /* share count for fsdax */
            };
            /**
             * @private: Mapping-private opaque data.
             * Usually used for buffer_heads if PagePrivate.
             * Used for swp_entry_t if PageSwapCache.
             * Indicates order in the buddy system if PageBuddy.
             */
            unsigned long private;
        };
        struct {    /* page_pool used by netstack */
            /**
             * @pp_magic: magic value to avoid recycling non
             * page_pool allocated pages.
             */
            unsigned long pp_magic;
            struct page_pool *pp;
            unsigned long _pp_mapping_pad;
            unsigned long dma_addr;
            union {
                /**
                 * dma_addr_upper: might require a 64-bit
                 * value on 32-bit architectures.
                 */
                unsigned long dma_addr_upper;
                /**
                 * For frag page support, not supported in
                 * 32-bit architectures with 64-bit DMA.
                 */
                atomic_long_t pp_frag_count;
            };
        };
        struct {    /* Tail pages of compound page */
            unsigned long compound_head;    /* Bit zero is set */
            /* First tail page only */
            unsigned char compound_dtor;
            unsigned char compound_order;
            atomic_t compound_mapcount;
            atomic_t subpages_mapcount;
            atomic_t compound_pincount;
            unsigned int compound_nr;       /* 1 << compound_order */
        };
        struct {    /* Second tail page of transparent huge page */
            unsigned long _compound_pad_1;  /* compound_head */
            unsigned long _compound_pad_2;
            /* For both global and memcg */
            struct list_head deferred_list;
        };
        struct {    /* Second tail page of hugetlb page */
            unsigned long _hugetlb_pad_1;   /* compound_head */
            void *hugetlb_subpool;
            void *hugetlb_cgroup;
            void *hugetlb_cgroup_rsvd;
            void *hugetlb_hwpoison;
            /* No more space on 32-bit: use third tail if more */
        };
        struct {    /* Page table pages */
            unsigned long _pt_pad_1;        /* compound_head */
            pgtable_t pmd_huge_pte;         /* protected by page->ptl */
            unsigned long _pt_pad_2;        /* mapping */
            union {
                struct mm_struct *pt_mm;    /* x86 pgds only */
                atomic_t pt_frag_refcount;  /* powerpc */
            };
#if ALLOC_SPLIT_PTLOCKS
            spinlock_t *ptl;
#else
            spinlock_t ptl;
#endif
        };
        struct {    /* ZONE_DEVICE pages */
            /** @pgmap: Points to the hosting device page map. */
            struct dev_pagemap *pgmap;
            void *zone_device_data;
            /*
             * ZONE_DEVICE private pages are counted as being
             * mapped so the next 3 words hold the mapping, index,
             * and private fields from the source anonymous or
             * page cache page while the page is migrated to device
             * private memory.
             * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
             * use the mapping, index, and private fields when
             * pmem backed DAX files are mapped.
             */
        };
        /** @rcu_head: You can use this to free a page by RCU. */
        struct rcu_head rcu_head;
    };

    union {     /* This union is 4 bytes in size. */
        /*
         * If the page can be mapped to userspace, encodes the number
         * of times this page is referenced by a page table.
         */
        atomic_t _mapcount;
        /*
         * If the page is neither PageSlab nor mappable to userspace,
         * the value stored here may help determine what this page
         * is used for. See page-flags.h for a list of page types
         * which are currently stored here.
         */
        unsigned int page_type;
    };

    /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
    atomic_t _refcount;

    unsigned long memcg_data;

    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
    void *virtual;              /* Kernel virtual address (NULL if
                                   not kmapped, ie. highmem) */
    /*
     * KMSAN metadata for this page:
     * - shadow page: every bit indicates whether the corresponding
     *   bit of the original page is initialized (0) or not (1);
     * - origin page: every 4 bytes contain an id of the stack trace
     *   where the uninitialized value was created.
     */
    struct page *kmsan_shadow;
    struct page *kmsan_origin;

    int _last_cpupid;
} _struct_page_alignment;
```
The flags are defined as follows:
```c
enum pageflags {
    PG_locked,          /* Page is locked. Don't touch. */
    PG_referenced,
    PG_uptodate,
    PG_dirty,
    PG_lru,
    PG_active,
    PG_workingset,
    PG_waiters,         /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
    PG_error,
    PG_slab,
    PG_owner_priv_1,    /* Owner use. If pagecache, fs may use */
    PG_arch_1,
    PG_reserved,
    PG_private,         /* If pagecache, has fs-private data */
    PG_private_2,       /* If pagecache, has fs aux data */
    PG_writeback,       /* Page is under writeback */
    PG_head,            /* A head page */
    PG_mappedtodisk,    /* Has blocks allocated on-disk */
    PG_reclaim,         /* To be reclaimed asap */
    PG_swapbacked,      /* Page is backed by RAM/swap */
    PG_unevictable,     /* Page is "unevictable" */
    PG_mlocked,         /* Page is vma mlocked */
    PG_uncached,        /* Page has been mapped as uncached */
    PG_hwpoison,        /* hardware poisoned page. Don't touch */
    PG_young,
    PG_idle,
    PG_arch_2,
    PG_skip_kasan_poison,
    __NR_PAGEFLAGS,

    PG_readahead = PG_reclaim,

    /*
     * Depending on the way an anonymous folio can be mapped into a page
     * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
     * THP), PG_anon_exclusive may be set only for the head page or for
     * tail pages of an anonymous folio. For now, we only expect it to be
     * set on tail pages for PTE-mapped THP.
     */
    PG_anon_exclusive = PG_mappedtodisk,

    /* Filesystems */
    PG_checked = PG_owner_priv_1,

    /* SwapBacked */
    PG_swapcache = PG_owner_priv_1,     /* Swap page: swp_entry_t in private */

    /*
     * Two page bits are conscripted by FS-Cache to maintain local caching
     * state. These bits are set on pages belonging to the netfs's inodes
     * when those inodes are being locally cached.
     */
    PG_fscache = PG_private_2,          /* page backed by cache */

    /* XEN */
    /* Pinned in Xen as a read-only pagetable page. */
    PG_pinned = PG_owner_priv_1,
    /* Pinned as part of domain save (see xen_mm_pin_all()). */
    PG_savepinned = PG_dirty,
    /* Has a grant mapping of another (foreign) domain's page. */
    PG_foreign = PG_owner_priv_1,
    /* Remapped by swiotlb-xen. */
    PG_xen_remapped = PG_owner_priv_1,

    /* SLOB */
    PG_slob_free = PG_private,

    /*
     * Compound pages. Stored in first tail page's flags.
     * Indicates that at least one subpage is hwpoisoned in the
     * THP.
     */
    PG_has_hwpoisoned = PG_error,

    /* non-lru isolated movable page */
    PG_isolated = PG_reclaim,

    /* Only valid for buddy pages. Used to track pages that are reported */
    PG_reported = PG_uptodate,

    /* For self-hosted memmap pages */
    PG_vmemmap_self_hosted = PG_owner_priv_1,
};
```
Looking at struct page you will notice there is no field holding a physical address, so how do we know which physical address a page corresponds to? The kernel keeps an array of struct page called mem_map; a page's index in that array is its page frame number, so the physical address can be computed from the index and does not need to be stored.
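As a rough illustration of the idea (this is essentially what the FLATMEM memory model in include/asm-generic/memory_model.h does; other models such as SPARSEMEM compute the pfn differently), the conversion between a struct page and a page frame number looks like this:

```c
/* FLATMEM-style conversion: the array index is the page frame number (pfn). */
#define __page_to_pfn(page) ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)
#define __pfn_to_page(pfn)  (mem_map + ((pfn) - ARCH_PFN_OFFSET))

/* The physical address then simply follows as pfn << PAGE_SHIFT. */
```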
Physical memory as a whole is also divided by function into zones, for example:
- ZONE_DMA: used for DMA operations
- ZONE_NORMAL: normal physical memory that is linearly mapped into the kernel
- ZONE_HIGHMEM: high memory, which cannot be linearly mapped into the kernel address space
Each such memory zone is described by struct zone:
```c
struct zone {
    /* Read-mostly fields */

    /* zone watermarks, access with *_wmark_pages(zone) macros */
    unsigned long _watermark[NR_WMARK];
    unsigned long watermark_boost;

    unsigned long nr_reserved_highatomic;

    /*
     * We don't know if the memory that we're going to allocate will be
     * freeable or/and it will be released eventually, so to avoid totally
     * wasting several GB of ram we must reserve some of the lower zone
     * memory (otherwise we risk to run OOM on the lower zones despite
     * there being tons of freeable ram on the higher zones). This array is
     * recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
     * changes.
     */
    long lowmem_reserve[MAX_NR_ZONES];

    int node;
    struct pglist_data *zone_pgdat;
    struct per_cpu_pages __percpu *per_cpu_pageset;
    struct per_cpu_zonestat __percpu *per_cpu_zonestats;
    /*
     * the high and batch values are copied to individual pagesets for
     * faster access
     */
    int pageset_high;
    int pageset_batch;

    /*
     * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
     * In SPARSEMEM, this map is stored in struct mem_section
     */
    unsigned long *pageblock_flags;

    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
    unsigned long zone_start_pfn;

    /*
     * spanned_pages is the total pages spanned by the zone, including
     * holes, which is calculated as:
     *     spanned_pages = zone_end_pfn - zone_start_pfn;
     *
     * present_pages is physical pages existing within the zone, which
     * is calculated as:
     *     present_pages = spanned_pages - absent_pages(pages in holes);
     *
     * present_early_pages is present pages existing within the zone
     * located on memory available since early boot, excluding hotplugged
     * memory.
     *
     * managed_pages is present pages managed by the buddy system, which
     * is calculated as (reserved_pages includes pages allocated by the
     * bootmem allocator):
     *     managed_pages = present_pages - reserved_pages;
     *
     * cma pages is present pages that are assigned for CMA use
     * (MIGRATE_CMA).
     *
     * So present_pages may be used by memory hotplug or memory power
     * management logic to figure out unmanaged pages by checking
     * (present_pages - managed_pages). And managed_pages should be used
     * by page allocator and vm scanner to calculate all kinds of watermarks
     * and thresholds.
     *
     * Locking rules:
     *
     * zone_start_pfn and spanned_pages are protected by span_seqlock.
     * It is a seqlock because it has to be read outside of zone->lock,
     * and it is done in the main allocator path. But, it is written
     * quite infrequently.
     *
     * The span_seq lock is declared along with zone->lock because it is
     * frequently read in proximity to zone->lock. It's good to
     * give them a chance of being in the same cacheline.
     *
     * Write access to present_pages at runtime should be protected by
     * mem_hotplug_begin/done(). Any reader who can't tolerant drift of
     * present_pages should use get_online_mems() to get a stable value.
     */
    atomic_long_t managed_pages;
    unsigned long spanned_pages;
    unsigned long present_pages;
    unsigned long present_early_pages;
    unsigned long cma_pages;

    const char *name;

    /*
     * Number of isolated pageblock. It is used to solve incorrect
     * freepage counting problem due to racy retrieving migratetype
     * of pageblock. Protected by zone->lock.
     */
    unsigned long nr_isolate_pageblock;

    /* see spanned/present_pages for more description */
    seqlock_t span_seqlock;

    int initialized;

    /* Write-intensive fields used from the page allocator */
    CACHELINE_PADDING(_pad1_);

    /* free areas of different sizes */
    struct free_area free_area[MAX_ORDER];

    /* zone flags, see below */
    unsigned long flags;

    /* Primarily protects free_area */
    spinlock_t lock;

    /* Write-intensive fields used by compaction and vmstats. */
    CACHELINE_PADDING(_pad2_);

    /*
     * When free pages are below this point, additional steps are taken
     * when reading the number of free pages to avoid per-cpu counter
     * drift allowing watermarks to be breached
     */
    unsigned long percpu_drift_mark;

    /* pfn where compaction free scanner should start */
    unsigned long compact_cached_free_pfn;
    /* pfn where compaction migration scanner should start */
    unsigned long compact_cached_migrate_pfn[ASYNC_AND_SYNC];
    unsigned long compact_init_migrate_pfn;
    unsigned long compact_init_free_pfn;

    /*
     * On compaction failure, 1<<compact_defer_shift compactions
     * are skipped before trying again. The number attempted since
     * last failure is tracked with compact_considered.
     * compact_order_failed is the minimum compaction failed order.
     */
    unsigned int compact_considered;
    unsigned int compact_defer_shift;
    int compact_order_failed;

    /* Set to true when the PG_migrate_skip bits should be cleared */
    bool compact_blockskip_flush;

    bool contiguous;

    CACHELINE_PADDING(_pad3_);
    /* Zone statistics */
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
    atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
} ____cacheline_internodealigned_in_smp;
```
Physical memory is managed with the buddy algorithm, which keeps free memory in lists grouped by page order, like this:
image.png
The functions for allocating physical pages are:
```c
static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
```
The functions for freeing physical pages are:
```c
extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
```
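A minimal, hedged sketch of how these are used from kernel code (the function name demo_buddy_alloc is made up for illustration):

```c
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Illustrative only: allocate 2^2 = 4 contiguous page frames, use them, free them. */
static int demo_buddy_alloc(void)
{
    struct page *pages = alloc_pages(GFP_KERNEL, 2);
    if (!pages)
        return -ENOMEM;

    /* For lowmem pages, page_address() returns the kernel virtual address. */
    memset(page_address(pages), 0, 4 * PAGE_SIZE);

    __free_pages(pages, 2);     /* the order must match the allocation */
    return 0;
}
```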
The slab allocator
What should the kernel do when it needs a small allocation, say 10 bytes? Handing out a whole page for that would cause severe internal fragmentation and would also be very inefficient. The natural idea is a cache that keeps objects of various sizes ready to hand out, and that is exactly what the slab allocator does. Let's look at the API first:
```c
/* create a slab cache (descriptor) */
struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
            unsigned int align, slab_flags_t flags,
            void (*ctor)(void *));
/* destroy a slab cache */
void kmem_cache_destroy(struct kmem_cache *s);
/* allocate an object from the cache */
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) __assume_slab_alignment __malloc;
/* free an object back to the cache */
void kmem_cache_free(struct kmem_cache *s, void *objp);
```
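A hedged usage sketch (the struct foo type and the cache name are made up for illustration):

```c
#include <linux/errno.h>
#include <linux/slab.h>

struct foo {
    int id;
    char name[32];
};

static struct kmem_cache *foo_cache;

static int foo_init(void)
{
    /* One cache dedicated to struct foo objects */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                                  SLAB_HWCACHE_ALIGN, NULL);
    if (!foo_cache)
        return -ENOMEM;
    return 0;
}

static void foo_use(void)
{
    struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

    if (f) {
        f->id = 1;
        kmem_cache_free(foo_cache, f);  /* object goes back to the cache, not the buddy system */
    }
}

static void foo_exit(void)
{
    kmem_cache_destroy(foo_cache);
}
```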
The kmem_cache structure (shown here as defined by the SLAB implementation) looks like this:
```c
struct kmem_cache {
    struct array_cache __percpu *cpu_cache;

    /* 1) Cache tunables. Protected by slab_mutex */
    unsigned int batchcount;
    unsigned int limit;
    unsigned int shared;

    unsigned int size;
    struct reciprocal_value reciprocal_buffer_size;

    /* 2) touched by every alloc & free from the backend */
    slab_flags_t flags;         /* constant flags */
    unsigned int num;           /* # of objs per slab */

    /* 3) cache_grow/shrink */
    /* order of pgs per slab (2^n) */
    unsigned int gfporder;

    /* force GFP flags, e.g. GFP_DMA */
    gfp_t allocflags;

    size_t colour;              /* cache colouring range */
    unsigned int colour_off;    /* colour offset */
    unsigned int freelist_size;

    /* constructor func */
    void (*ctor)(void *obj);

    /* 4) cache creation/removal */
    const char *name;
    struct list_head list;
    int refcount;
    int object_size;
    int align;

    /* 5) statistics */
    unsigned long num_active;
    unsigned long num_allocations;
    unsigned long high_mark;
    unsigned long grown;
    unsigned long reaped;
    unsigned long errors;
    unsigned long max_freeable;
    unsigned long node_allocs;
    unsigned long node_frees;
    unsigned long node_overflow;
    atomic_t allochit;
    atomic_t allocmiss;
    atomic_t freehit;
    atomic_t freemiss;

    /*
     * If debugging is enabled, then the allocator can add additional
     * fields and/or padding to every object. 'size' contains the total
     * object size including these internal fields, while 'obj_offset'
     * and 'object_size' contain the offset to the user object and its
     * size.
     */
    int obj_offset;

    struct kasan_cache kasan_info;

    unsigned int *random_seq;

    unsigned int useroffset;    /* Usercopy region offset */
    unsigned int usersize;      /* Usercopy region size */

    struct kmem_cache_node *node[MAX_NUMNODES];
};
```
The slab allocator has a local (per-CPU) cache pool and a shared cache pool. The local pool is a per-CPU variable, i.e. one pool per CPU, so when allocating an object the allocator tries the local pool first; if that fails it goes to the shared pool, and if that fails too it asks the buddy system for pages to build a new slab. Freeing works in the same direction: objects first go back to the local pool; once the local pool's free-object count exceeds a threshold, objects are moved to the shared pool, and once the shared pool exceeds its threshold, memory is handed back to the buddy system. First, the local cache object:
```c
struct array_cache {
    unsigned int avail;
    unsigned int limit;
    unsigned int batchcount;
    unsigned int touched;
    void *entry[];  /*
                     * Must have this definition in here for the proper
                     * alignment of array_cache. Also simplifies accessing
                     * the entries.
                     */
};
```
These fields correspond directly to how the local cache pool is managed. Now the shared (per-node) pool:
```c
/*
 * The slab lists for all objects.
 */
struct kmem_cache_node {
#ifdef CONFIG_SLAB
    raw_spinlock_t list_lock;
    struct list_head slabs_partial; /* partial list first, better asm code */
    struct list_head slabs_full;
    struct list_head slabs_free;
    unsigned long total_slabs;      /* length of all slab lists */
    unsigned long free_slabs;       /* length of free slab list only */
    unsigned long free_objects;
    unsigned int free_limit;
    unsigned int colour_next;       /* Per-node cache coloring */
    struct array_cache *shared;     /* shared per node */
    struct alien_cache **alien;     /* on other nodes */
    unsigned long next_reap;        /* updated without locking */
    int free_touched;               /* updated without locking */
#endif

#ifdef CONFIG_SLUB
    spinlock_t list_lock;
    unsigned long nr_partial;
    struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
    atomic_long_t nr_slabs;
    atomic_long_t total_objects;
    struct list_head full;
#endif
#endif
};
```
Again, this lines up with how the shared cache pool is managed.
kmalloc
With the slab mechanism in mind, kmalloc is easy to understand: it is essentially built from slab caches of various fixed sizes, such as kmalloc-16, kmalloc-32, kmalloc-64, and so on. Internally, kmalloc routes each request to the slab cache whose size class fits.
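A hedged sketch of typical usage; the 20-byte request below would be served from the kmalloc-32 size class:

```c
#include <linux/errno.h>
#include <linux/slab.h>

static int demo_kmalloc(void)
{
    /* 20 bytes requested: rounded up to the kmalloc-32 cache */
    char *buf = kmalloc(20, GFP_KERNEL);

    if (!buf)
        return -ENOMEM;

    /* kzalloc() is the zeroing variant; kfree() returns the object to its size-class cache */
    kfree(buf);
    return 0;
}
```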
Virtual memory management
Every process has its own virtual address space, so how is that information represented? In the kernel it is described by mm_struct:
```c
struct mm_struct {
    struct {
        struct maple_tree mm_mt;

        unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);

        unsigned long mmap_base;        /* base of mmap area */
        unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
        /* Base addresses for compatible mmap() */
        unsigned long mmap_compat_base;
        unsigned long mmap_compat_legacy_base;
        unsigned long task_size;        /* size of task vm space */
        pgd_t *pgd;

        /**
         * @membarrier_state: Flags controlling membarrier behavior.
         *
         * This field is close to @pgd to hopefully fit in the same
         * cache-line, which needs to be touched by switch_mm().
         */
        atomic_t membarrier_state;

        /**
         * @mm_users: The number of users including userspace.
         *
         * Use mmget()/mmget_not_zero()/mmput() to modify. When this
         * drops to 0 (i.e. when the task exits and there are no other
         * temporary reference holders), we also release a reference on
         * @mm_count (which may then free the &struct mm_struct if
         * @mm_count also drops to 0).
         */
        atomic_t mm_users;

        /**
         * @mm_count: The number of references to &struct mm_struct
         * (@mm_users count as 1).
         *
         * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
         * &struct mm_struct is freed.
         */
        atomic_t mm_count;

        atomic_long_t pgtables_bytes;   /* PTE page table pages */
        int map_count;                  /* number of VMAs */

        spinlock_t page_table_lock;     /* Protects page tables and some
                                         * counters
                                         */
        /*
         * With some kernel config, the current mmap_lock's offset
         * inside 'mm_struct' is at 0x120, which is very optimal, as
         * its two hot fields 'count' and 'owner' sit in 2 different
         * cachelines, and when mmap_lock is highly contended, both
         * of the 2 fields will be accessed frequently, current layout
         * will help to reduce cache bouncing.
         *
         * So please be careful with adding new fields before
         * mmap_lock, which can easily push the 2 fields into one
         * cacheline.
         */
        struct rw_semaphore mmap_lock;

        struct list_head mmlist;        /* List of maybe swapped mm's. These
                                         * are globally strung together off
                                         * init_mm.mmlist, and are protected
                                         * by mmlist_lock
                                         */

        unsigned long hiwater_rss;      /* High-watermark of RSS usage */
        unsigned long hiwater_vm;       /* High-water virtual memory usage */

        unsigned long total_vm;         /* Total pages mapped */
        unsigned long locked_vm;        /* Pages that have PG_mlocked set */
        atomic64_t pinned_vm;           /* Refcount permanently increased */
        unsigned long data_vm;          /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
        unsigned long exec_vm;          /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
        unsigned long stack_vm;         /* VM_STACK */
        unsigned long def_flags;

        /**
         * @write_protect_seq: Locked when any thread is write
         * protecting pages mapped by this mm to enforce a later COW,
         * for instance during page table copying for fork().
         */
        seqcount_t write_protect_seq;

        spinlock_t arg_lock;            /* protect the below fields */

        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack;
        unsigned long arg_start, arg_end, env_start, env_end;

        unsigned long saved_auxv[AT_VECTOR_SIZE];   /* for /proc/PID/auxv */

        struct percpu_counter rss_stat[NR_MM_COUNTERS];

        struct linux_binfmt *binfmt;

        /* Architecture-specific MM context */
        mm_context_t context;

        unsigned long flags;            /* Must use atomic bitops to access */

        spinlock_t ioctx_lock;
        struct kioctx_table __rcu *ioctx_table;

        /*
         * "owner" points to a task that is regarded as the canonical
         * user/owner of this mm. All of the following must be true in
         * order for it to be changed:
         *
         * current == mm->owner
         * current->mm != mm
         * new_owner->mm == mm
         * new_owner->alloc_lock is held
         */
        struct task_struct __rcu *owner;
        struct user_namespace *user_ns;

        /* store ref to file /proc/<pid>/exe symlink points to */
        struct file __rcu *exe_file;
        struct mmu_notifier_subscriptions *notifier_subscriptions;
        pgtable_t pmd_huge_pte;         /* protected by page_table_lock */
        /*
         * numa_next_scan is the next time that PTEs will be remapped
         * PROT_NONE to trigger NUMA hinting faults; such faults gather
         * statistics and migrate pages to new nodes if necessary.
         */
        unsigned long numa_next_scan;

        /* Restart point for scanning and remapping PTEs. */
        unsigned long numa_scan_offset;

        /* numa_scan_seq prevents two threads remapping PTEs. */
        int numa_scan_seq;
        /*
         * An operation with batched TLB flushing is going on. Anything
         * that can move process memory needs to flush the TLB when
         * moving a PROT_NONE mapped page.
         */
        atomic_t tlb_flush_pending;
        /* See flush_tlb_batched_pending() */
        atomic_t tlb_flush_batched;

        struct uprobes_state uprobes_state;
        struct rcu_head delayed_drop;
        atomic_long_t hugetlb_usage;
        struct work_struct async_put_work;
        u32 pasid;
        /*
         * Represent how many pages of this process are involved in KSM
         * merging.
         */
        unsigned long ksm_merging_pages;
        /*
         * Represent how many pages are checked for ksm merging
         * including merged and not merged.
         */
        unsigned long ksm_rmap_items;

        struct {
            /* this mm_struct is on lru_gen_mm_list */
            struct list_head list;
            /*
             * Set when switching to this mm_struct, as a hint of
             * whether it has been used since the last time per-node
             * page table walkers cleared the corresponding bits.
             */
            unsigned long bitmap;
            /* points to the memcg of "owner" above */
            struct mem_cgroup *memcg;
        } lru_gen;
    } __randomize_layout;

    /*
     * The mm_cpumask needs to be at the end of mm_struct, because it
     * is dynamically sized based on nr_cpu_ids.
     */
    unsigned long cpu_bitmap[];
};
```
task_struct holds a pointer to this structure, so there is one mm_struct per process. When an address space is shared (for example threads created with CLONE_VM), multiple task_structs point to the same mm_struct. The structure also records the ranges of the code segment, data segment, heap and stack, and the memory-mapping area. An individual memory region is represented by vm_area_struct:
```c
/*
 * This struct describes a virtual memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
    /* The first cache line has the info for VMA tree walking. */

    unsigned long vm_start;     /* Our start address within vm_mm. */
    unsigned long vm_end;       /* The first byte after our end address
                                   within vm_mm. */

    struct mm_struct *vm_mm;    /* The address space we belong to. */

    /*
     * Access permissions of this VMA.
     * See vmf_insert_mixed_prot() for discussion.
     */
    pgprot_t vm_page_prot;
    unsigned long vm_flags;     /* Flags, see mm.h. */

    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap interval tree.
     */
    struct {
        struct rb_node rb;
        unsigned long rb_subtree_last;
    } shared;

    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages. A MAP_SHARED vma
     * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
    struct list_head anon_vma_chain;    /* Serialized by mmap_lock &
                                         * page_table_lock */
    struct anon_vma *anon_vma;          /* Serialized by page_table_lock */

    /* Function pointers to deal with this struct. */
    const struct vm_operations_struct *vm_ops;

    /* Information about our backing store: */
    unsigned long vm_pgoff;     /* Offset (within vm_file) in PAGE_SIZE
                                   units */
    struct file *vm_file;       /* File we map to (can be NULL). */
    void *vm_private_data;      /* was vm_pte (shared mem) */

    /*
     * For private and shared anonymous mappings, a pointer to a null
     * terminated string containing the name given to the vma, or NULL if
     * unnamed. Serialized by mmap_sem. Use anon_vma_name to access.
     */
    struct anon_vma_name *anon_name;

    atomic_long_t swap_readahead_info;
    struct vm_region *vm_region;    /* NOMMU mapping region */
    struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
    struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;
```
The VMA operations are as follows:
```c
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
extern struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr);
extern struct vm_area_struct *find_vma_prev(struct mm_struct *mm, unsigned long addr,
                                            struct vm_area_struct **pprev);

/*
 * Look up the first VMA which intersects the interval [start_addr, end_addr)
 * NULL if none. Assume start_addr < end_addr.
 */
struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
                                             unsigned long start_addr, unsigned long end_addr);

extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
extern struct vm_area_struct *vma_merge(struct mm_struct *,
        struct vm_area_struct *prev, unsigned long addr, unsigned long end,
        unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
        struct mempolicy *, struct vm_userfaultfd_ctx, struct anon_vma_name *);
```
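A hedged sketch of how a lookup is typically used (this mirrors what happens, for instance, in the page-fault path; the helper name addr_is_mapped is made up and error handling is simplified):

```c
#include <linux/mm.h>

/* Illustrative only: check whether addr falls inside an existing VMA of mm. */
static bool addr_is_mapped(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma;
    bool mapped = false;

    mmap_read_lock(mm);                     /* VMA lookups need mmap_lock held */
    vma = find_vma(mm, addr);
    if (vma && vma->vm_start <= addr)       /* find_vma only guarantees addr < vm_end */
        mapped = true;
    mmap_read_unlock(mm);

    return mapped;
}
```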
The operations above cover VMA lookup, insertion, and merging. System-wide, mm_structs can be linked together on a list (see mmlist above), and within a process all of its vm_area_structs are linked as well (in recent kernels they are kept in the maple tree mm_mt), so the kernel can walk every virtual memory region a process uses. Next, let's look at what malloc does:
image.png
mmap and munmap
mmap is another way to allocate memory: it accesses memory by creating a mapping. With an fd it is a file mapping, tying a range of the user address space directly to a range of the file; without an fd it is an anonymous mapping, which you can simply think of as allocating a block of memory. When a malloc request is larger than 128 KB, brk is no longer used and the memory is allocated with mmap directly. mmap can also be private or shared. A private mapping is the malloc case and is the most common. For a shared mapping: if it is file-backed, modifications are written back to disk so other processes can see them; if it is anonymous, the kernel uses shmem to back it with an in-memory file, so other processes can see the written contents as well.
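A hedged userspace sketch of the two flavours (the file name /tmp/mmap-demo is made up; error handling is minimal):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* 1) Shared file mapping: the write ends up in the file. */
    int fd = open("/tmp/mmap-demo", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0)     /* file must cover the mapped range */
        return 1;
    char *file_mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (file_mem == MAP_FAILED)
        return 1;
    strcpy(file_mem, "hello via mmap");
    msync(file_mem, 4096, MS_SYNC);             /* flush to the backing file */
    munmap(file_mem, 4096);
    close(fd);

    /* 2) Private anonymous mapping: plain memory, like a large malloc. */
    char *anon_mem = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (anon_mem == MAP_FAILED)
        return 1;
    anon_mem[0] = 42;
    munmap(anon_mem, 1 << 20);
    return 0;
}
```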
The mmap flow is as follows:
image.png
Page faults
Linux allocates physical memory only when it absolutely has to. What does that mean? When we map memory with malloc or mmap, only the corresponding VMA is updated; the page-table entries and physical pages are not set up immediately. They are allocated only when the memory is written. Note: written — a read alone does not allocate physical memory either; the kernel temporarily maps in a zero-filled page instead. The core function for handling page faults is do_page_fault, and its flow is:
image.png
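A hedged userspace sketch that makes demand paging visible: map 64 MiB anonymously and watch the resident set grow only after the pages are written (the helper resident_kb is made up for illustration and reads /proc/self/statm):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Resident set size in kB, from the second field of /proc/self/statm (pages). */
static long resident_kb(void)
{
    long size, resident;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f || fscanf(f, "%ld %ld", &size, &resident) != 2)
        exit(1);
    fclose(f);
    return resident * sysconf(_SC_PAGESIZE) / 1024;
}

int main(void)
{
    size_t len = 64UL << 20;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    printf("after mmap : %ld kB resident\n", resident_kb());
    memset(p, 1, len);                  /* writes trigger the page faults */
    printf("after write: %ld kB resident\n", resident_kb());
    munmap(p, len);
    return 0;
}
```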
Page reclaim
When you notice that not much physical memory is free, there is no need to rush out for more RAM. If physical memory is available, the system will try to use it (largely as cache) because that improves performance, and it will reclaim those pages automatically when memory runs short. So a low free-memory figure does not by itself mean the machine needs more memory.
Now for the reclaim policy. The system manages physical pages with LRU lists, split into inactive and active anonymous-page lists, inactive and active file-backed-page lists, plus a list of unevictable pages. Why this distinction? Because the system prefers to reclaim file-backed pages: most file cache does not need to be written back and can simply be dropped, while anonymous pages must first be written out to swap, so reclaiming the file-backed pages is cheaper.
If page reclaim still cannot resolve the memory shortage, the OOM killer comes into play: it kills a process with a relatively large memory footprint. How is that process chosen? Several /proc entries are relevant:
- /proc/pid/oom_score_adj: accepts values from -1000 to 1000; a value of -1000 means the process will never be selected by the OOM killer.
- /proc/pid/oom_adj: values from -17 to 15; the smaller the value, the less likely the process is to be chosen, and -17 means never. This node is kept for backward compatibility and is mapped onto oom_score_adj, so oom_score_adj is the one to use.
- /proc/pid/oom_score: the process's current OOM score (higher means more likely to be killed).
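As a small hedged example, a process can lower its own chance of being picked, or exempt itself entirely with -1000 (lowering the value requires the appropriate privileges, e.g. CAP_SYS_RESOURCE):

```c
#include <stdio.h>

int main(void)
{
    /* -1000 = never OOM-kill this process; positive values make it a preferred victim */
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f)
        return 1;
    fprintf(f, "-1000\n");
    fclose(f);
    return 0;
}
```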
Memory management statistics
The first place to look for memory statistics is /proc/meminfo; cat /proc/meminfo shows something like:
```
cat /proc/meminfo
MemTotal:       20341776 kB
MemFree:        13530272 kB
MemAvailable:   17516664 kB
Buffers:          170512 kB
Cached:          4323440 kB
SwapCached:            0 kB
Active:          1822112 kB
Inactive:        3897448 kB
...
```
One thing worth explaining: MemAvailable is generally greater than or equal to MemFree, because besides free memory it also counts active file-backed pages, inactive file-backed pages, reclaimable slab pages, and other reclaimable kernel pages.
The buddy system can be inspected via /proc/pagetypeinfo and /proc/buddyinfo, and the memory zones via /proc/zoneinfo. top and vmstat also show system-level memory information.
For process-level memory information, the node to look at is /proc/pid/status.