Chris Down 9783aa9917 mm, memcg: proportional memory.{low,min} reclaim
cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection).  While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: cliff behaviour often
manifests the moment a cgroup becomes eligible for reclaim.  This patch implements
more intuitive and usable behaviour, where we gradually mount more
reclaim pressure as cgroups further and further exceed their protection
thresholds.
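
For reference, both thresholds are plain byte counts (or "max") written
to the cgroup v2 interface files.  A minimal userspace sketch follows;
the cgroup name, the knob choice and the 17G figure are illustrative,
not taken from this patch:

    /* set_protection.c: write a protection threshold for a cgroup */
    #include <stdio.h>

    static int set_protection(const char *cgroup, const char *knob,
                              unsigned long long bytes)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/%s",
                     cgroup, knob);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%llu\n", bytes);
            return fclose(f);
    }

    int main(void)
    {
            /* e.g. best-effort protection of 17G for a workload slice */
            return set_protection("workload.slice", "memory.low",
                                  17ULL << 30);
    }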

This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information.  Imagine the following
timeline, with the numbers being the lruvec size in this zone (a sketch
of the proportional alternative follows the footnote below):

1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
   scanned. (?!)

* Of course, we won't usually scan all available pages in the zone even
  without this patch because of scan control priority, over-reclaim
  protection, etc.  However, as shown by the tests at the end, these
  techniques don't sufficiently throttle such an extreme change in input,
  so cliff-like behaviour isn't really averted by their existence alone.
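
As a concrete model of the proportional behaviour (the sketch promised
above; the names are illustrative, and the in-tree code in mm/vmscan.c
additionally distinguishes memory.min from memory.low and clamps the
result):

    /*
     * Scan only the unprotected fraction of the lruvec, instead of
     * making the old all-or-nothing decision.
     */
    static unsigned long scan_target(unsigned long lruvec_size,
                                     unsigned long usage,
                                     unsigned long protection)
    {
            if (usage <= protection)
                    return 0;       /* fully protected: skip reclaim */

            /*
             * For step 3 above (usage=1000001, protection=1000000)
             * this yields ~1 scannable page rather than the whole
             * lruvec.
             */
            return lruvec_size - lruvec_size * protection / usage;
    }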

Here's an example of how this plays out in practice.  At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study).  In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating.  This isn't an exact science -- memory usage deemed
"comfortable" will vary over time due to user behaviour, differences in the
composition of work, and so on.  As such we need to ballpark memory.low,
but doing this is currently problematic:

1. If we end up setting it too low for the workload, it won't have
   *any* effect (see discussion above).  The group will receive the full
   weight of reclaim and won't have any priority while competing with the
   less important system software, as if we had no memory.low configured
   at all.

2. Because of this behaviour, we end up erring on the side of setting
   it too high, such that the comfort range is reliably covered.  However,
   protected memory is completely unavailable to the rest of the system,
   so we might cause undue memory and IO pressure there when we *know* we
   have some elasticity in the workload.

3. Even if we get the value totally right, smack in the middle of the
   comfort zone, we get extreme jumps between no pressure and full
   pressure that cause unpredictable pressure spikes in the workload due
   to the current binary reclaim behaviour.

With this patch, we can set it to our ballpark estimation without too much
worry.  Any undesirable behaviour, such as too much or too little reclaim
pressure on the workload or system, will be proportional to how far our
estimation is off.  This means we can set memory.low much more
conservatively and thus waste less resources *without* the risk of the
workload falling off a cliff if we overshoot.

As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance.  Having to set these thresholds so high wastes
resources and generally works against the principle of work conservation.
In addition, having proportional memory reclaim behaviour has other
benefits.  Most notably, before this patch it's basically mandatory to set
memory.low to a higher-than-desirable value, because otherwise as soon as
you exceed memory.low, all protection is lost and all pages are eligible
to be scanned again.  By contrast, having a gradual ramp in reclaim pressure
means that you still get some protection when thresholds are exceeded,
so you can be more comfortable setting memory.low to lower values
without worrying that all protection will be lost.  This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some* protection
if your workingset size grows larger than you expect increases user
confidence in setting memory.low without a huge buffer on top being
needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of
   binary.

   To test this, I experimented with stepping memory.low protection
   further and further down on a workload that floats around a 19G
   workingset when under memory.low protection, watching page scan rates
   for the workload cgroup (sampled as sketched after the results below):

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   |        21G |               0 |                  0 | N/A          |
   |        17G |             867 |               3799 | 23%          |
   |        12G |            1203 |               3543 | 34%          |
   |         8G |            2534 |               3979 | 64%          |
   |         4G |            3980 |               4147 | 96%          |
   |          0 |            3799 |               3980 | 95%          |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (containing this patch) ramps up
   page scanning significantly more gradually than the control kernel
   (without this patch).
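
   The per-cgroup scan rates above can be derived by sampling the
   pgscan counter in the cgroup's memory.stat file.  A hypothetical
   sampler (the cgroup path and the one-second interval are
   illustrative):

       /* pgscan-rate.c: sample cgroup v2 pgscan twice, print pgscan/s */
       #include <stdio.h>
       #include <unistd.h>

       static unsigned long long read_pgscan(const char *statpath)
       {
               char line[128];
               unsigned long long val = 0;
               FILE *f = fopen(statpath, "r");

               if (!f)
                       return 0;
               while (fgets(line, sizeof(line), f))
                       if (sscanf(line, "pgscan %llu", &val) == 1)
                               break;
               fclose(f);
               return val;
       }

       int main(void)
       {
               const char *p = "/sys/fs/cgroup/workload.slice/memory.stat";
               unsigned long long before = read_pgscan(p);

               sleep(1);
               printf("pgscan/s: %llu\n", read_pgscan(p) - before);
               return 0;
       }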

2. More gradual ramp up in reclaim aggression doesn't result in
   premature OOMs.

   To test this, I wrote a script that slowly increments the number of
   pages held by stress(1)'s --vm-keep mode until a production system
   entered severe overall memory contention.  This script runs in a highly
   protected slice taking up the majority of available system memory.
   Watching vmstat revealed that page scanning continued essentially
   as normal on both the test and control kernels, without forward
   reclaim progress becoming arrested.
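
   A hypothetical reconstruction of that kind of ramp (not the actual
   script; the 256M step size and ten-second pacing are made up)
   allocates an ever-larger working set and keeps re-dirtying it, in
   the spirit of --vm-keep:

       /* hog.c: grow a pinned working set step by step */
       #include <stdlib.h>
       #include <string.h>
       #include <unistd.h>

       #define STEP (256UL << 20)      /* grow by 256M per iteration */

       int main(void)
       {
               char *bufs[512];        /* caps the hog at 128G */
               int n = 0;

               while (n < 512 && (bufs[n] = malloc(STEP)) != NULL) {
                       n++;
                       /* re-dirty everything held so far so it stays
                        * hot, then pause to let reclaim respond */
                       for (int i = 0; i < n; i++)
                               memset(bufs[i], 0xa5, STEP);
                       sleep(10);
               }
               return 0;
       }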

[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
  Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 15:47:20 -07:00