remarkable-linux/mm
Hugh Dickins 9ba6929480 ksm: fix oom deadlock
There's a now-obvious deadlock in KSM's out-of-memory handling:
imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
trying to allocate a page to break KSM in an mm which becomes the
OOM victim (quite likely in the unmerge case): it's killed and goes
to exit, and hangs there waiting to acquire ksm_thread_mutex.

Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
though that made everything else: perhaps use mmap_sem somehow?
And part of the answer lies in the comments on unmerge_ksm_pages:
__ksm_exit should also leave all the rmap_item removal to ksmd.

But there's a fundamental problem, that KSM relies upon mmap_sem to
guarantee the consistency of the mm it's dealing with, yet exit_mmap
tears down an mm without taking mmap_sem.  And bumping mm_users won't
help at all, that just ensures that the pages the OOM killer assumes
are on their way to being freed will not be freed.

The best answer seems to be, to move the ksm_exit callout from just
before exit_mmap, to the middle of exit_mmap: after the mm's pages
have been freed (if the mmu_gather is flushed), but before its page
tables and vma structures have been freed; and down_write,up_write
mmap_sem there to serialize with KSM's own reliance on mmap_sem.

But KSM then needs to be careful, whenever it downs mmap_sem, to
check that the mm is not already exiting: there's a danger of using
find_vma on a layout that's being torn apart, or writing into page
tables which have been freed for reuse; and even do_anonymous_page
and __do_fault need to check they're not being called by break_ksm
to reinstate a pte after zap_pte_range has zapped that page table.

Though it might be clearer to add an exiting flag, set while holding
mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
a zapped pte.  All we need is to check whether mm_users is 0 - but
must remember that ksmd may detect that before __ksm_exit is reached.
So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.

__ksm_exit now has to leave clearing up the rmap_items to ksmd,
that needs ksm_thread_mutex; but shift the exiting mm just after the
ksm_scan cursor so that it will soon be dealt with.  __ksm_enter raise
mm_count to hold the mm_struct, ksmd's exit processing (exactly like
its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it,
similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).

But also give __ksm_exit a fast path: when there's no complication
(no rmap_items attached to mm and it's not at the ksm_scan cursor),
it can safely do all the exiting work itself.  This is not just an
optimization: when ksmd is not running, the raised mm_count would
otherwise leak mm_structs.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:32 -07:00
..
allocpercpu.c percpu: use dynamic percpu allocator as the default percpu allocator 2009-06-24 15:13:35 +09:00
backing-dev.c writeback: splice dirty inode entries to default bdi on bdi_destroy() 2009-09-16 15:18:52 +02:00
bootmem.c kmemleak: Do not report alloc_bootmem blocks as leaks 2009-08-27 14:29:17 +01:00
bounce.c block: remove some includings of blktrace_api.h 2009-06-16 11:19:36 +02:00
debug-pagealloc.c generic debug pagealloc 2009-04-01 08:59:13 -07:00
dmapool.c dmapools: protect page_list walk in show_pools() 2009-06-30 18:56:00 -07:00
fadvise.c readahead: move max_sane_readahead() calls into force_page_cache_readahead() 2009-06-16 19:47:28 -07:00
failslab.c kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c 2009-04-03 12:23:01 +02:00
filemap.c mm: oom analysis: add shmem vmstat 2009-09-22 07:17:27 -07:00
filemap_xip.c mm: do_xip_mapping_read: fix length calculation 2009-04-02 19:04:49 -07:00
fremap.c Do not account for the address space used by hugetlbfs using VM_ACCOUNT 2009-02-10 10:48:42 -08:00
highmem.c block: remove some includings of blktrace_api.h 2009-06-16 11:19:36 +02:00
hugetlb.c hugetlb: restore interleaving of bootmem huge pages 2009-09-22 07:17:26 -07:00
init-mm.c mm: consolidate init_mm definition 2009-06-16 19:47:28 -07:00
internal.h vmscan: do not unconditionally treat zones that fail zone_reclaim() as full 2009-06-16 19:47:45 -07:00
Kconfig ksm: the mm interface to ksm 2009-09-22 07:17:31 -07:00
Kconfig.debug kmemcheck: enable in the x86 Kconfig 2009-06-15 15:49:15 +02:00
kmemcheck.c kmemcheck: add hooks for the page allocator 2009-06-15 15:48:33 +02:00
kmemleak-test.c percpu: clean up percpu variable definitions 2009-06-24 15:13:48 +09:00
kmemleak.c kmemleak: Improve the "Early log buffer exceeded" error message 2009-09-11 10:42:09 +01:00
ksm.c ksm: fix oom deadlock 2009-09-22 07:17:32 -07:00
maccess.c [S390] maccess: add weak attribute to probe_kernel_write 2009-06-12 10:27:37 +02:00
madvise.c ksm: the mm interface to ksm 2009-09-22 07:17:31 -07:00
Makefile ksm: the mm interface to ksm 2009-09-22 07:17:31 -07:00
memcontrol.c cgroup avoid permanent sleep at rmdir 2009-07-29 19:10:35 -07:00
memory.c ksm: fix oom deadlock 2009-09-22 07:17:32 -07:00
memory_hotplug.c memory hotplug: update zone pcp at memory online 2009-09-22 07:17:25 -07:00
mempolicy.c mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware 2009-08-07 10:39:55 -07:00
mempool.c mempool.c: clean up type-casting 2009-08-10 08:31:16 -07:00
migrate.c mm: vmstat: add isolate pages 2009-09-22 07:17:29 -07:00
mincore.c [CVE-2009-0029] System call wrappers part 14 2009-01-14 14:15:24 +01:00
mlock.c mm: remove CONFIG_UNEVICTABLE_LRU config option 2009-06-16 19:47:42 -07:00
mm_init.c mm: mminit_loglevel cannot be __meminitdata anymore 2008-08-20 15:40:30 -07:00
mmap.c ksm: fix oom deadlock 2009-09-22 07:17:32 -07:00
mmu_notifier.c ksm: add mmu_notifier set_pte_at_notify() 2009-09-22 07:17:31 -07:00
mmzone.c [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 2009-05-18 11:22:24 +01:00
mprotect.c perf: Do the big rename: Performance Counters -> Performance Events 2009-09-21 14:28:04 +02:00
mremap.c ksm: prevent mremap move poisoning 2009-09-22 07:17:31 -07:00
msync.c [CVE-2009-0029] System call wrappers part 13 2009-01-14 14:15:23 +01:00
nommu.c nommu: fix error handling in do_mmap_pgoff() 2009-09-05 11:30:42 -07:00
oom_kill.c mm: revert "oom: move oom_adj value" 2009-08-18 16:31:13 -07:00
page-writeback.c mm: count only reclaimable lru pages 2009-09-22 07:17:30 -07:00
page_alloc.c mm: perform non-atomic test-clear of PG_mlocked on free 2009-09-22 07:17:30 -07:00
page_cgroup.c memory hotplug: alloc page from other node in memory online 2009-09-22 07:17:26 -07:00
page_io.c mm: remove file argument from swap_readpage() 2009-06-16 19:47:44 -07:00
page_isolation.c memory hotplug: fix page_zone() calculation in test_pages_isolated() 2008-11-06 15:41:19 -08:00
pagewalk.c pagemap: pass mm into pagewalkers 2008-06-12 18:05:41 -07:00
percpu.c Merge branch 'for-next' into for-linus 2009-09-15 09:57:19 +09:00
prio_tree.c spelling fixes: mm/ 2007-10-20 01:27:18 +02:00
quicklist.c percpu: cleanup percpu array definitions 2009-06-24 15:13:45 +09:00
readahead.c readahead: introduce context readahead algorithm 2009-06-16 19:47:30 -07:00
rmap.c ksm: no debug in page_dup_rmap() 2009-09-22 07:17:31 -07:00
shmem.c Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev 2009-09-15 09:50:49 -07:00
shmem_acl.c shmfs: use 'check_acl' instead of 'permission' 2009-09-08 11:08:46 -07:00
slab.c SLAB: Fix lockdep annotations 2009-06-29 09:57:10 +03:00
slob.c slab: remove duplicate kmem_cache_init_late() declarations 2009-08-06 11:36:25 +03:00
slub.c slub: Fix build error in kmem_cache_open() with !CONFIG_SLUB_DEBUG 2009-09-15 22:32:10 +03:00
sparse-vmemmap.c memory hotplug: alloc page from other node in memory online 2009-09-22 07:17:26 -07:00
sparse.c memory hotplug: alloc page from other node in memory online 2009-09-22 07:17:26 -07:00
swap.c mm: fix Committed_AS underflow on large NR_CPUS environment 2009-05-02 15:36:10 -07:00
swap_state.c writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
swapfile.c block: use blkdev_issue_discard in blk_ioctl_discard 2009-09-14 08:24:53 +02:00
thrash.c mm: pass mm to grab_swap_token 2009-06-23 12:50:05 -07:00
truncate.c mm: remove __invalidate_mapping_pages variant 2009-06-16 19:47:43 -07:00
util.c Merge branches 'slab/documentation', 'slab/fixes', 'slob/cleanups' and 'slub/fixes' into for-linus 2009-06-17 08:30:15 +03:00
vmalloc.c vmalloc.c: fix double error checking 2009-09-22 07:17:30 -07:00
vmscan.c vmscan: kill unnecessary prefetch 2009-09-22 07:17:30 -07:00
vmstat.c mm: vmstat: add isolate pages 2009-09-22 07:17:29 -07:00