alistair23-linux/mm
Mel Gorman 215ddd6664 mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.  This
is expected behaviour.  Unfortunately, if the highest zone is small, a
problem occurs.

When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable.  Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1.  It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again.  As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.

There are two problems here.  1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling.  2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.

This patch does two things that are related.  If the end_zone is
unreclaimable, this information is communicated back.  Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep.  Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08 21:14:43 -07:00
..
backing-dev.c backing-dev: Kill set but not used var in bdi_debug_stats_show() 2011-05-20 21:23:37 +02:00
bootmem.c crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn 2011-03-23 19:47:19 -07:00
bounce.c bounce: call flush_dcache_page() after bounce_copy_vec() 2010-09-09 18:57:25 -07:00
cleancache.c mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
compaction.c mm: compaction: abort compaction if too many pages are isolated and caller is asynchronous V2 2011-06-15 20:04:02 -07:00
debug-pagealloc.c generic debug pagealloc 2009-04-01 08:59:13 -07:00
dmapool.c mm/dmapool.c: use TASK_UNINTERRUPTIBLE in dma_pool_alloc() 2011-01-13 17:32:48 -08:00
fadvise.c readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM 2010-03-06 11:26:25 -08:00
failslab.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
filemap.c more conservative S_NOSEC handling 2011-06-03 18:24:58 -04:00
filemap_xip.c mm: Convert i_mmap_lock to a mutex 2011-05-25 08:39:18 -07:00
fremap.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
highmem.c mm,x86: fix kmap_atomic_push vs ioremap_32.c 2010-10-27 18:03:05 -07:00
huge_memory.c mm: remove khugepaged double thp vmstat update with CONFIG_NUMA=n 2011-06-15 20:03:58 -07:00
hugetlb.c mm: fix negative commitlimit when gigantic hugepages are allocated 2011-06-15 20:04:01 -07:00
hwpoison-inject.c Fix common misspellings 2011-03-31 11:26:23 -03:00
init-mm.c mm: convert mm->cpu_vm_cpumask into cpumask_var_t 2011-05-25 08:39:21 -07:00
internal.h mm: nommu: sort mm->mmap list properly 2011-05-25 08:39:05 -07:00
Kconfig mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
Kconfig.debug mm: debug-pagealloc: fix kconfig dependency warning 2011-03-22 17:44:02 -07:00
kmemcheck.c kmemcheck: Fix build errors due to missing slab.h 2010-03-30 22:02:32 +09:00
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c kmemleak: Do not return a pointer to an object that kmemleak did not get 2011-05-19 17:35:28 +01:00
ksm.c ksm: fix NULL pointer dereference in scan_get_next_rmap_item() 2011-06-15 20:04:02 -07:00
maccess.c maccess,probe_kernel: Make write/read src const void * 2011-05-25 19:56:23 -04:00
madvise.c thp: khugepaged: make khugepaged aware about madvise 2011-01-13 17:32:47 -08:00
Makefile mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
memblock.c mm/memblock: properly handle overlaps and fix error path 2011-03-22 17:44:09 -07:00
memcontrol.c mm: move shmem prototypes to shmem_fs.h 2011-06-27 18:00:12 -07:00
memory-failure.c mm/memory-failure.c: fix spinlock vs mutex order 2011-06-27 18:00:13 -07:00
memory.c mm: move vmtruncate_range to truncate.c 2011-06-27 18:00:12 -07:00
memory_hotplug.c mm, hotplug: protect zonelist building with zonelists_mutex 2011-06-22 21:06:48 -07:00
mempolicy.c mm: proc: move show_numa_map() to fs/proc/task_mmu.c 2011-05-25 08:39:34 -07:00
mempool.c mm: remove broken 'kzalloc' mempool 2009-09-22 07:17:35 -07:00
migrate.c migrate: don't account swapcache as shmem 2011-06-16 15:01:24 -07:00
mincore.c thp: mincore transparent hugepage support 2011-01-13 17:32:44 -08:00
mlock.c mm: don't access vm_flags as 'int' 2011-05-26 09:20:31 -07:00
mm_init.c mm: mminit_loglevel cannot be __meminitdata anymore 2008-08-20 15:40:30 -07:00
mmap.c mm: get rid of the most spurious find_vma_prev() users 2011-06-16 00:35:09 -07:00
mmu_context.c exit: fix oops in sync_mm_rss 2010-03-24 16:31:21 -07:00
mmu_notifier.c thp: mmu_notifier_test_young 2011-01-13 17:32:46 -08:00
mmzone.c mm: page allocator: adjust the per-cpu counter threshold when memory is low 2011-01-13 17:32:31 -08:00
mprotect.c thp: mprotect: transparent huge page support 2011-01-13 17:32:44 -08:00
mremap.c mm: Convert i_mmap_lock to a mutex 2011-05-25 08:39:18 -07:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nobootmem.c memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high() 2011-05-25 08:39:31 -07:00
nommu.c nommu: add page alignment to mmap 2011-05-25 08:39:38 -07:00
oom_kill.c oom: replace PF_OOM_ORIGIN with toggling oom_score_adj 2011-05-25 08:39:10 -07:00
page-writeback.c Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block 2011-03-24 10:16:26 -07:00
page_alloc.c Revert "mm: fail GFP_DMA allocations when ZONE_DMA is not configured" 2011-06-02 06:11:24 +09:00
page_cgroup.c memcg: fix init_page_cgroup nid with sparsemem 2011-06-15 20:04:01 -07:00
page_io.c block: kill off REQ_UNPLUG 2011-03-10 08:52:27 +01:00
page_isolation.c mm: page_isolation: codeclean fix comment and rm unneeded val init 2010-10-26 16:52:11 -07:00
pagewalk.c pagewalk: only split huge pages when necessary 2011-03-22 17:44:04 -07:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c mm: remove gfp mask from pcpu_get_vm_areas 2011-01-13 17:32:34 -08:00
percpu.c Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-05-24 11:53:42 -07:00
pgtable-generic.c mm/pgtable-generic.c: fix CONFIG_SWAP=n build 2011-01-26 10:49:58 +10:00
prio_tree.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
quicklist.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
readahead.c readahead: readahead page allocations are OK to fail 2011-05-25 08:39:25 -07:00
rmap.c mm/memory-failure.c: fix spinlock vs mutex order 2011-06-27 18:00:13 -07:00
shmem.c tmpfs: add shmem_read_mapping_page_gfp 2011-06-27 18:00:12 -07:00
slab.c SLAB: Record actual last user of freed objects. 2011-06-03 19:33:50 +03:00
slob.c mm: Remove support for kmem_cache_name() 2011-01-23 21:00:05 +02:00
slub.c slub: always align cpu_slab to honor cmpxchg_double requirement 2011-06-03 19:33:49 +03:00
sparse-vmemmap.c tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
sparse.c Fix common misspellings 2011-03-31 11:26:23 -03:00
swap.c mm: batch activate_page() to reduce lock contention 2011-05-25 08:39:37 -07:00
swap_state.c block: remove per-queue plugging 2011-03-10 08:52:07 +01:00
swapfile.c mm: move shmem prototypes to shmem_fs.h 2011-06-27 18:00:12 -07:00
thrash.c vmscan: implement swap token priority aging 2011-06-15 20:03:59 -07:00
truncate.c mm: fix assertion mapping->nrpages == 0 in end_writeback() 2011-06-27 18:00:13 -07:00
util.c mm: nommu: sort mm->mmap list properly 2011-05-25 08:39:05 -07:00
vmalloc.c Merge branch 'upstream/tidy-xen-mmu-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen 2011-05-26 19:01:15 -07:00
vmscan.c mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfully 2011-07-08 21:14:43 -07:00
vmstat.c mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur 2011-05-25 08:39:09 -07:00