remarkable-linux

redonkable

Andrew Shewmaker c9b1d0981f mm: limit growth of 3% hardcoded other user reserve

Add user_reserve_kbytes knob.

Limit the growth of the memory reserved for other user processes to min(3% current process size, user_reserve_kbytes). Only about 8MB is necessary to enable recovery in the default mode, and only a few hundred MB are required even when overcommit is disabled.

user_reserve_kbytes defaults to min(3% free pages, 128MB)

I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ... then adding the RSS of each.

This only affects OVERCOMMIT_NEVER mode.

Background

1. user reserve

__vm_enough_memory reserves a hardcoded 3% of the current process size for other applications when overcommit is disabled. This was done so that a user could recover if they launched a memory hogging process. Without the reserve, a user would easily run into a message such as:

bash: fork: Cannot allocate memory

2. admin reserve

Additionally, a hardcoded 3% of free memory is reserved for root in both overcommit 'guess' and 'never' modes. This was intended to prevent a scenario where root cannot log in and perform recovery operations.

Note that this reserve shrinks as free memory shrinks, so it doesn't guarantee a useful reserve.

Motivation

The two hardcoded memory reserves should be updated to account for current memory sizes.

Also, the admin reserve would be more useful if it didn't shrink too much.

When the current code was originally written, 1GB was considered "enterprise". Now the 3% reserve can grow to multiple GB on large memory systems, and it only needs to be a few hundred MB at most to enable a user or admin to recover a system with an unwanted memory hogging process.

I've found that reducing these reserves is especially beneficial for a specific type of application load:

* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running

I've run scientific clusters with this sort of load.
A long running job sometimes failed many hours (weeks of CPU time) into a calculation. The jobs weren't initializing all of their memory immediately, and they weren't using calloc, so I put systems into overcommit 'never' mode. These clusters run diskless and have no swap.

However, with the current reserves, a user wishing to allocate as much memory as possible to one process may be prevented from using, for example, almost 2GB out of 32GB.

The effect is smaller, but still significant, when a user starts a job with one process per core. I have repeatedly seen a set of processes requesting the same amount of memory fail because one of them could not allocate the amount of memory a user would expect to be able to allocate. This happens, for example, with Message Passing Interface (MPI) processes running one per core, and it is similar for other parallel programming frameworks.

Changing this reserve code will make the overcommit 'never' mode more useful by allowing applications to allocate nearly all of the available memory.

Also, the new admin_reserve_kbytes will be safer than the current behavior since the hardcoded 3% of available memory reserve can shrink to something useless in the case where applications have grabbed all available memory.

Risks

* "bash: fork: Cannot allocate memory"

  The downside of the first patch, which creates a tunable user reserve that is only used in overcommit 'never' mode, is that an admin can set it so low that a user may not be able to kill their process, even if they already have a shell prompt.

  Of course, a user can get in the same predicament with the current 3% reserve: they just have to launch processes until 3% becomes negligible.

* root-cant-log-in problem

  The second patch, adding the tunable admin_reserve_kbytes, allows the admin to shoot themselves in the foot by setting it too small. They can easily get the system into a state where root can't log in.
  However, the new admin_reserve_kbytes will be safer than the current behavior since the hardcoded 3% of available memory reserve can shrink to something useless in the case where applications have grabbed all available memory.

Alternatives

* Memory cgroups provide a more flexible way to limit application memory.

  Not everyone wants to set up cgroups or deal with their overhead.

* We could create a fourth overcommit mode which provides smaller reserves.

  The size of useful reserves may be drastically different depending on whether the system is embedded or enterprise.

* Force users to initialize all of their memory or use calloc.

  Some users don't want/expect the system to overcommit when they malloc. Overcommit 'never' mode is for this scenario, and it should work well.

The new user and admin reserve tunables are simple to use, with low overhead compared to cgroups. The patches preserve current behavior where 3% of memory is less than 128MB, except that the admin reserve doesn't shrink to an unusable size under pressure. The code allows admins to tune for embedded and enterprise usage.

FAQ

* How is the root-cant-log-in problem addressed? What happens if admin_reserve_kbytes is set to 0?

  Root is free to shoot themselves in the foot by setting admin_reserve_kbytes too low.

  On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

  admin_reserve_kbytes defaults to min(3% free memory, 8MB)

  So, anyone switching to 'never' mode needs to adjust admin_reserve_kbytes.

* How do you calculate a minimum useful reserve?

  A user or the admin needs enough memory to log in and perform recovery operations, which includes, at a minimum:

  sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

  For overcommit 'guess', we can sum resident set sizes (RSS) because we only need enough memory to handle what the recovery programs will typically use. On x86_64 this is about 8MB.

  For overcommit 'never', we can take the max of their virtual sizes (VSZ) and add the sum of their RSS. We use VSZ instead of RSS because this mode forces us to ensure we can fulfill all of the requested memory allocations, even if the programs only use a fraction of what they ask for. On x86_64 this is about 128MB.

  When swap is enabled, reserves are useful even when they are as small as 10MB, regardless of overcommit mode.

  When both swap and overcommit are disabled, then the admin should tune the reserves higher to be absolutely safe. Over 230MB each was safest in my testing.

* What happens if user_reserve_kbytes is set to 0?

  Note, this only affects overcommit 'never' mode.

  Then a user will be able to allocate all available memory minus admin_reserve_kbytes.

  However, they will easily see a message such as:

  "bash: fork: Cannot allocate memory"

  And they won't be able to recover/kill their application. The admin should be able to recover the system if admin_reserve_kbytes is set appropriately.

* What's the difference between overcommit 'guess' and 'never'?

  "Guess" allows an allocation if there are enough free + reclaimable pages. It has a hardcoded 3% of free pages reserved for root.

  "Never" allows an allocation if there is enough swap + a configurable percentage (default is 50) of physical RAM. It has a hardcoded 3% of free pages reserved for root, like "Guess" mode. It also has a hardcoded 3% of the current process size reserved for additional applications.

* Why is overcommit 'guess' not suitable even when an app eventually writes to every page? It takes free pages, file pages, available swap pages, and reclaimable slab pages into consideration. In other words, these are all the pages available, so why isn't overcommit suitable?

  Because it only looks at the present state of the system. It does not take into account the memory that other applications have malloced, but haven't initialized yet. It overcommits the system.
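
The reserve rules described in this message can be sketched in plain C. This is a minimal userspace illustration of the arithmetic only, not the code in mmap.c or nommu.c: every function name here is hypothetical, the page size is assumed to be 4 KiB, and dividing by 32 is an assumption standing in for "3%".

```c
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages, so one page is 4 kB. */
#define SKETCH_PAGE_SHIFT 12

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Convert a knob value given in kB to a number of pages. */
static unsigned long kbytes_to_pages(unsigned long kbytes)
{
	return kbytes >> (SKETCH_PAGE_SHIFT - 10);
}

/*
 * Default for the user reserve knob, in kB:
 * min(3% of free memory, 128MB). The division by 32 approximates 3%.
 */
static unsigned long default_user_reserve_kbytes(unsigned long free_kbytes)
{
	return min_ul(free_kbytes / 32, 128UL * 1024);
}

/*
 * Pages held back for other user processes in overcommit 'never' mode:
 * min(3% of the current process size, user reserve knob).
 */
static unsigned long user_reserve_pages(unsigned long process_pages,
					unsigned long user_reserve_kbytes)
{
	return min_ul(process_pages / 32,
		      kbytes_to_pages(user_reserve_kbytes));
}

/*
 * Minimum useful reserve for overcommit 'guess': the sum of the RSS of
 * the recovery programs (for example sshd, bash, top), in kB.
 */
static unsigned long min_reserve_guess(const unsigned long *rss_kb, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += rss_kb[i];
	return sum;
}

/*
 * Minimum useful reserve for overcommit 'never': the largest VSZ among
 * the recovery programs plus the sum of their RSS, in kB.
 */
static unsigned long min_reserve_never(const unsigned long *vsz_kb,
				       const unsigned long *rss_kb, size_t n)
{
	unsigned long max_vsz = 0;

	for (size_t i = 0; i < n; i++)
		if (vsz_kb[i] > max_vsz)
			max_vsz = vsz_kb[i];
	return max_vsz + min_reserve_guess(rss_kb, n);
}
```

Under these assumptions, a system with 32GB free gets a default knob capped at 128MB, so a 30GB process gives up 128MB to the user reserve rather than the roughly 960MB that an uncapped 1/32 share would take.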

Test Summary

There was little change in behavior in the default overcommit 'guess' mode with swap enabled before and after the patch. This was expected.

Systems run most predictably (i.e. no oom kills) in overcommit 'never' mode with swap enabled. This also allowed the most memory to be allocated to a user application.

Overcommit 'guess' mode without swap is a bad idea. It is easy to crash the system. None of the other tested combinations crashed. This matches my experience on the Roadrunner supercomputer.

Without the tunable user reserve, a system in overcommit 'never' mode and without swap does not allow the user to recover, although the admin can.

With the new tunable reserves, a system in overcommit 'never' mode and without swap can be configured to:

1. maximize user-allocatable memory, running close to the edge of recoverability

2. maximize recoverability, sacrificing allocatable memory to ensure that a user cannot take down a system

Test Description

Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

System is booted into multiuser console mode, with unnecessary services turned off. Caches were dropped before each test.

Hogs are user memtester processes that attempt to allocate all free memory as reported by /proc/meminfo

In overcommit 'never' mode, memory_ratio=100

Test Results

3.9.0-rc1-mm1

Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
guess      | yes  | 1    | 5432/5432     | no    | yes           | yes
guess      | yes  | 4    | 5444/5444     | 1     | yes           | yes
guess      | no   | 1    | 5302/5449     | no    | yes           | yes
guess      | no   | 4    | -             | crash | no            | no
never      | yes  | 1    | 5460/5460     | 1     | yes           | yes
never      | yes  | 4    | 5460/5460     | 1     | yes           | yes
never      | no   | 1    | 5218/5432     | no    | no            | yes
never      | no   | 4    | 5203/5448     | no    | no            | yes

3.9.0-rc1-mm1-tunablereserves

User and Admin Recovery show their respective reserves, if applicable.

Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
guess      | yes  | 1    | 5419/5419     | no    | - yes         | 8MB yes
guess      | yes  | 4    | 5436/5436     | 1     | - yes         | 8MB yes
guess      | no   | 1    | 5440/5440     | *     | - yes         | 8MB yes
guess      | no   | 4    | -             | crash | - no          | 8MB no

* process would successfully mlock, then the oom killer would pick it

Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
never      | yes  | 1    | 5446/5446     | no    | 10MB yes      | 20MB yes
never      | yes  | 4    | 5456/5456     | no    | 10MB yes      | 20MB yes
never      | no   | 1    | 5387/5429     | no    | 128MB no      | 8MB barely
never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB barely
never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB barely
never      | no   | 1    | 5359/5448     | no    | 10MB no       | 10MB barely
never      | no   | 1    | 5323/5428     | no    | 0MB no        | 10MB barely
never      | no   | 1    | 5332/5428     | no    | 0MB no        | 50MB yes
never      | no   | 1    | 5293/5429     | no    | 0MB no        | 90MB yes
never      | no   | 1    | 5001/5427     | no    | 230MB yes     | 338MB yes
never      | no   | 4*   | 4998/5424     | no    | 230MB yes     | 338MB yes

* more memtesters were launched, able to allocate approximately another 100MB

Future Work

- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for each memory cgroup?
- Some lines are slightly above 80 chars. Perhaps define a macro to convert between pages and kb? Other places in the kernel do this.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:36 -07:00
..
Kconfig	Select VIRT_TO_BUS directly where needed	2013-03-12 11:16:40 -07:00
Kconfig.debug	mm: more intensive memory corruption debugging	2012-01-10 16:30:42 -08:00
Makefile	mm: introduce a common interface for balloon pages mobility	2012-12-11 17:22:26 -08:00
backing-dev.c	bdi: allow block devices to say that they require stable page writes	2013-02-21 17:22:19 -08:00
balloon_compaction.c	mm: introduce a common interface for balloon pages mobility	2012-12-11 17:22:26 -08:00
bootmem.c	mm: Add alloc_bootmem_low_pages_nopanic()	2013-01-29 19:32:59 -08:00
bounce.c	mm: make snapshotting pages for stable writes a per-bio operation	2013-04-29 15:54:33 -07:00
cleancache.c	fs: encode_fh: return FILEID_INVALID if invalid fid_type	2013-02-26 02:46:10 -05:00
compaction.c	mm: add & use zone_end_pfn() and zone_spans_pfn()	2013-02-23 17:50:20 -08:00
debug-pagealloc.c	mm, x86: Remove debug_pagealloc_enabled	2011-12-06 09:24:07 +01:00
dmapool.c	dmapool: make DMAPOOL_DEBUG detect corruption of free marker	2012-12-11 17:22:24 -08:00
fadvise.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2013-02-26 20:16:07 -08:00
failslab.c	switch debugfs to umode_t	2012-01-03 22:54:56 -05:00
filemap.c	mm: trace filemap add and del	2013-04-29 15:54:28 -07:00
filemap_xip.c	mm: move all mmu notifier invocations to be done outside the PT lock	2012-10-09 16:22:58 +09:00
fremap.c	Revert "mm: introduce VM_POPULATE flag to better deal with racy userspace programs"	2013-03-28 17:45:51 -07:00
frontswap.c	frontswap: support exclusive gets if tmem backend is capable	2012-09-21 10:38:12 -04:00
highmem.c	Some nice cleanups, and even a patch my wife did as a "live" demo for	2012-12-20 08:37:05 -08:00
huge_memory.c	hlist: drop the node parameter from iterators	2013-02-27 19:10:24 -08:00
hugetlb.c	mm, hugetlb: include hugepages in meminfo	2013-04-29 15:54:35 -07:00
hugetlb_cgroup.c	mm/hugetlb: create hugetlb cgroup file in hugetlb_init	2012-12-18 15:02:15 -08:00
hwpoison-inject.c	memcg: rename config variables	2012-07-31 18:42:43 -07:00
init-mm.c	atomic: use <linux/atomic.h>	2011-07-26 16:49:47 -07:00
internal.h	mm: accelerate munlock() treatment of THP pages	2013-02-27 19:10:09 -08:00
interval_tree.c	mm: add CONFIG_DEBUG_VM_RB build option	2012-10-09 16:22:42 +09:00
kmemcheck.c	…
kmemleak-test.c	kmemleak: remove memset by using kzalloc	2011-01-27 18:31:51 +00:00
kmemleak.c	hlist: drop the node parameter from iterators	2013-02-27 19:10:24 -08:00
ksm.c	ksm: fix m68k build: only NUMA needs pfn_to_nid	2013-03-08 15:05:34 -08:00
maccess.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
madvise.c	mm: make madvise(MADV_WILLNEED) support swap file prefetch	2013-02-23 17:50:10 -08:00
memblock.c	memblock: add assertion for zero allocation alignment	2013-04-29 15:54:28 -07:00
memcontrol.c	memcg: do not check for do_swap_account in mem_cgroup_{read,write,reset}	2013-04-29 15:54:34 -07:00
memory-failure.c	HWPOISON: check dirty flag to match against clean page	2013-04-29 15:54:28 -07:00
memory.c	vm: add vm_iomap_memory() helper function	2013-04-16 16:45:45 -07:00
memory_hotplug.c	mm: walk_memory_range(): fix typo in comment	2013-04-29 15:54:28 -07:00
mempolicy.c	mm/mempolicy.c: fix sp_node_init() argument ordering	2013-03-08 15:05:34 -08:00
mempool.c	mempool: add @gfp_mask to mempool_create_node()	2012-06-25 11:53:47 +02:00
migrate.c	mm/migrate: fix comment typo syncronous->synchronous	2013-04-29 15:54:35 -07:00
mincore.c	swap: make each swap partition have one address_space	2013-02-23 17:50:17 -08:00
mlock.c	Revert "mm: introduce VM_POPULATE flag to better deal with racy userspace programs"	2013-03-28 17:45:51 -07:00
mm_init.c	mm: init: report on last-nid information stored in page->flags	2013-02-23 17:50:18 -08:00
mmap.c	mm: limit growth of 3% hardcoded other user reserve	2013-04-29 15:54:36 -07:00
mmu_context.c	mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat()	2012-03-21 17:54:59 -07:00
mmu_notifier.c	hlist: drop the node parameter from iterators	2013-02-27 19:10:24 -08:00
mmzone.c	mm: rename page struct field helpers	2013-02-23 17:50:18 -08:00
mprotect.c	mm/mprotect.c: coding-style cleanups	2012-12-18 15:02:15 -08:00
mremap.c	mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write()	2013-02-23 17:50:17 -08:00
msync.c	…
nobootmem.c	mm: Add alloc_bootmem_low_pages_nopanic()	2013-01-29 19:32:59 -08:00
nommu.c	mm: limit growth of 3% hardcoded other user reserve	2013-04-29 15:54:36 -07:00
oom_kill.c	memcg, oom: provide more precise dump info while memcg oom happening	2013-02-23 17:50:08 -08:00
page-writeback.c	mm: make snapshotting pages for stable writes a per-bio operation	2013-04-29 15:54:33 -07:00
page_alloc.c	page_alloc: make setup_nr_node_ids() usable for arch init code	2013-04-29 15:54:36 -07:00
page_cgroup.c	memcontrol: use N_MEMORY instead N_HIGH_MEMORY	2012-12-12 17:38:32 -08:00
page_io.c	mm: add support for direct_IO to highmem pages	2012-07-31 18:42:47 -07:00
page_isolation.c	mm: fix zone_watermark_ok_safe() accounting of isolated pages	2013-01-04 16:11:46 -08:00
pagewalk.c	thp: change split_huge_page_pmd() interface	2012-12-12 17:38:31 -08:00
percpu-km.c	percpu: clear memory allocated with the km allocator	2010-10-02 10:28:42 +03:00
percpu-vm.c	mm: fix kernel-doc warnings	2012-06-20 14:39:36 -07:00
percpu.c	mm, percpu: Make sure percpu_alloc early parameter has an argument	2012-12-02 06:23:04 -08:00
pgtable-generic.c	mm: Only flush the TLB when clearing an accessible pte	2012-12-11 14:28:34 +00:00
process_vm_access.c	Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys	2013-03-12 11:05:45 -07:00
quicklist.c	mm: delete various needless include <linux/module.h>	2011-10-31 09:20:11 -04:00
readahead.c	switch simple cases of fget_light to fdget	2012-09-26 22:20:08 -04:00
rmap.c	rmap: recompute pgoff for unmapping huge page	2013-04-29 15:54:28 -07:00
shmem.c	mm/shmem.c: remove an ifdef	2013-04-29 15:54:28 -07:00
slab.c	taint: add explicit flag to show whether lock dep is still OK.	2013-01-21 17:17:57 +10:30
slab.h	slab: propagate tunable values	2012-12-18 15:02:14 -08:00
slab_common.c	slab: propagate tunable values	2012-12-18 15:02:14 -08:00
slob.c	mm: rename page struct field helpers	2013-02-23 17:50:18 -08:00
slub.c	mm/slub.c: use register_hotmemory_notifier()	2013-04-29 15:54:36 -07:00
sparse-vmemmap.c	sparse-vmemmap: specify vmemmap population range in bytes	2013-04-29 15:54:35 -07:00
sparse.c	sparse-vmemmap: specify vmemmap population range in bytes	2013-04-29 15:54:35 -07:00
swap.c	swap: make each swap partition have one address_space	2013-02-23 17:50:17 -08:00
swap_state.c	swap: add per-partition lock for swapfile	2013-02-23 17:50:17 -08:00
swapfile.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2013-02-26 20:16:07 -08:00
truncate.c	mm: drop vmtruncate	2012-12-20 18:46:29 -05:00
util.c	swap: make each swap partition have one address_space	2013-02-23 17:50:17 -08:00
vmalloc.c	kexec, vmalloc: export additional vmalloc layer information	2013-04-29 15:54:34 -07:00
vmscan.c	mm/vmscan.c: minor cleanup for kswapd	2013-04-29 15:54:29 -07:00
vmstat.c	mm: add & use zone_end_pfn() and zone_spans_pfn()	2013-02-23 17:50:20 -08:00