alistair23-linux/mm
shidao.ytt a7ab400d6f mm/fadvise: discard partial page if endbyte is also EOF
During our recent testing with fadvise(FADV_DONTNEED), we found that if
the given offset/length is not page-aligned, the last page is not
discarded.  The tool we use is vmtouch (https://hoytech.com/vmtouch/).
We map a 10KB file into memory and then run this tool to evict the
whole file mapping, but the last page always stays resident in
memory:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 2.1e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 1/3  4K/12K  33.3%
         Elapsed: 5.5e-05 seconds

However, when we test with an older kernel, say 3.10, the problem does
not occur, so we wondered whether this is a regression:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 8.2e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
         Elapsed: 5e-05 seconds

After digging into this problem, we found that it does not appear to be
a regression.  Not discarding partial pages is likely intentional,
according to commit 441c228f81 ("mm: fadvise: document the
fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel
Gorman, which explains why partial pages should be preserved rather
than discarded when using fadvise(FADV_DONTNEED).

However, the interesting part is that the actual code did NOT behave as
described: the partial page was still discarded, due to a
miscalculation of the `end_index' passed to
invalidate_mapping_pages().  This mistake was not fixed until recently,
which is why we fail to reproduce our problem on old kernels.  The fix
is commit 18aba41cbf ("mm/fadvise.c: do not discard partial pages with
POSIX_FADV_DONTNEED") by Oleg Drokin.

Back to the original testing: our problem turns out to be a special
case.  If the page-unaligned `endbyte' is also the end of the file,
there is no need to preserve the last partial page, since no one else
will use the rest of it.  It should be safe enough to discard the whole
page, so this patch adds an EOF check.
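
The index arithmetic behind the three behaviours can be sketched in
userspace.  This is a simplified model, not the kernel code: the names
model_kernel and model_discarded_pages() are hypothetical, 4 KiB pages
are assumed, and the function only counts the pages that fall inside
the range handed to invalidation, chosen to be consistent with the
three test runs in this message.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/* Hypothetical labels for the three kernels compared in the tests. */
enum model_kernel {
	MODEL_REVERTED,		/* 18aba41cbf reverted: partial tail page discarded */
	MODEL_MAINLINE,		/* with 18aba41cbf: partial tail page preserved */
	MODEL_PATCHED,		/* with the EOF check: partial tail page discarded at EOF */
};

/* Number of pages in [offset, offset + len) that the model would discard. */
uint64_t model_discarded_pages(enum model_kernel k, uint64_t offset,
			       uint64_t len, uint64_t filesize)
{
	uint64_t endbyte, start_index, end_index;
	int tail_is_partial, at_eof;

	assert(len > 0 && filesize > 0);

	endbyte = offset + len - 1;		/* inclusive last byte */
	/* First full page at or after offset. */
	start_index = (offset + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
	/* Page that holds the last byte. */
	end_index = endbyte >> MODEL_PAGE_SHIFT;
	tail_is_partial = (endbyte & (MODEL_PAGE_SIZE - 1)) != MODEL_PAGE_SIZE - 1;
	at_eof = (endbyte == filesize - 1);

	if (tail_is_partial && k != MODEL_REVERTED &&
	    !(k == MODEL_PATCHED && at_eof)) {
		if (end_index == 0)
			return 0;	/* the only page is partial: keep it */
		end_index--;		/* skip the partial tail page */
	}

	return end_index >= start_index ? end_index - start_index + 1 : 0;
}
```

For a 10240-byte file and a request covering the whole file, the model
gives 3 pages for MODEL_REVERTED, 2 for MODEL_MAINLINE and 3 for
MODEL_PATCHED.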

We also found a possible real-world issue in the mainline kernel.
Assume the following scenario: a userspace backup application wants to
back up a huge number of small files (<4k) at once, and the developer
might (I guess) want to use fadvise(FADV_DONTNEED) to save memory.
However, FADV_DONTNEED has no effect, since the only page mapped is a
partial page and the kernel preserves it.  Our patch also fixes this
problem: since we know the endbyte is EOF, we discard the page.
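
The per-file pattern such an application might use can be sketched as
follows.  This is an illustration of the scenario only: read_then_drop()
is a hypothetical helper, not taken from any real backup tool, and it
treats the advice as best-effort.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to cap bytes of a small file into buf,
 * then ask the kernel to drop the file's cached pages.
 * Returns the number of bytes read, or -1 on open/fstat/read failure.
 */
ssize_t read_then_drop(const char *path, char *buf, size_t cap)
{
	struct stat st;
	ssize_t n;
	int fd;

	assert(path && buf);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}

	n = read(fd, buf, cap);
	if (n >= 0 && st.st_size > 0) {
		/*
		 * For a file smaller than one page, [0, st_size) ends on a
		 * partial page that is also EOF.  posix_fadvise() returns an
		 * errno value rather than -1; it is advisory, so the result
		 * is deliberately ignored here.
		 */
		(void)posix_fadvise(fd, 0, st.st_size, POSIX_FADV_DONTNEED);
	}

	close(fd);
	return n;
}
```

Whether the single partial page is actually dropped after this call
depends on the kernel behaviour discussed in this message.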

Here is a simple reproducer to verify each scenario described above:

  test_fadvise.c
  ==============================
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
  	int i, fd, ret, len;
  	struct stat buf;
  	void *addr;
  	unsigned char *vec;
  	char *strbuf;
  	ssize_t pagesize = getpagesize();
  	ssize_t filesize;

  	/* usage: test_fadvise <file> <size-in-bytes> */
  	if (argc < 3)
  		return -1;

  	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
  	if (fd < 0)
  		return -1;
  	filesize = strtoul(argv[2], NULL, 10);

  	strbuf = malloc(filesize);
  	memset(strbuf, 42, filesize);
  	write(fd, strbuf, filesize);
  	free(strbuf);
  	fsync(fd);

  	len = (filesize + pagesize - 1) / pagesize;
  	printf("length of pages: %d\n", len);

  	addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
  	if (addr == MAP_FAILED)
  		return -1;

  	ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
  	if (ret != 0)	/* returns an errno value on failure, not -1 */
  		return -1;

  	vec = malloc(len);
  	ret = mincore(addr, filesize, (void *)vec);
  	if (ret < 0)
  		return -1;

  	for (i = 0; i < len; i++)
  		printf("pages[%d]: %x\n", i, vec[i] & 0x1);

  	free(vec);
  	close(fd);

  	return 0;
  }
  ==============================

Test 1: running on kernel with commit 18aba41cbf reverted:

  [root@caspar ~]# uname -r
  4.15.0-rc6.revert+
  [root@caspar ~]# ./test_fadvise file1 1024
  length of pages: 1
  pages[0]: 0    # <-- partial page discarded
  [root@caspar ~]# ./test_fadvise file2 8192
  length of pages: 2
  pages[0]: 0
  pages[1]: 0
  [root@caspar ~]# ./test_fadvise file3 10240
  length of pages: 3
  pages[0]: 0
  pages[1]: 0
  pages[2]: 0    # <-- partial page discarded

Test 2: running on mainline kernel:

  [root@caspar ~]# uname -r
  4.15.0-rc6+
  [root@caspar ~]# ./test_fadvise test1 1024
  length of pages: 1
  pages[0]: 1    # <-- partial and the only page not discarded
  [root@caspar ~]# ./test_fadvise test2 8192
  length of pages: 2
  pages[0]: 0
  pages[1]: 0
  [root@caspar ~]# ./test_fadvise test3 10240
  length of pages: 3
  pages[0]: 0
  pages[1]: 0
  pages[2]: 1    # <-- partial page not discarded

Test 3: running on kernel with this patch:

  [root@caspar ~]# uname -r
  4.15.0-rc6.patched+
  [root@caspar ~]# ./test_fadvise test1 1024
  length of pages: 1
  pages[0]: 0    # <-- partial page and EOF, discarded
  [root@caspar ~]# ./test_fadvise test2 8192
  length of pages: 2
  pages[0]: 0
  pages[1]: 0
  [root@caspar ~]# ./test_fadvise test3 10240
  length of pages: 3
  pages[0]: 0
  pages[1]: 0
  pages[2]: 0    # <-- partial page and EOF, discarded

[akpm@linux-foundation.org: tweak code comment]
Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.com
Signed-off-by: shidao.ytt <shidao.ytt@alibaba-inc.com>
Signed-off-by: Caspar Zhang <jinli.zjl@alibaba-inc.com>
Reviewed-by: Oliver Yang <zhiche.yy@alibaba-inc.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
..
kasan kasan: use %px to print addresses instead of %p 2017-11-29 12:13:16 +11:00
Kconfig mm: relax deferred struct page requirements 2018-01-31 17:18:36 -08:00
Kconfig.debug kmemcheck: rip it out 2017-11-15 18:21:05 -08:00
Makefile mm: add infrastructure for get_user_pages_fast() benchmarking 2017-11-17 16:10:04 -08:00
backing-dev.c Revert "bdi: add error handle for bdi_debug_register" 2017-12-21 10:01:30 -07:00
balloon_compaction.c virtio_balloon: fix deadlock on OOM 2017-11-14 23:57:38 +02:00
bootmem.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
cleancache.c fs: switch ->s_uuid to uuid_t 2017-06-05 16:59:12 +02:00
cma.c mm/cma.c: change pr_info to pr_err for cma_alloc fail log 2017-11-15 18:21:03 -08:00
cma.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
cma_debug.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
compaction.c mm, compaction: remove unneeded pageblock_skip_persistent() checks 2017-11-17 16:10:00 -08:00
debug.c mm/debug.c: provide useful debugging information for VM_BUG 2018-01-04 16:45:09 -08:00
debug_page_ref.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
dmapool.c lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
early_ioremap.c mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep 2017-12-11 14:54:44 +01:00
fadvise.c mm/fadvise: discard partial page if endbyte is also EOF 2018-01-31 17:18:39 -08:00
failslab.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
filemap.c mm/filemap.c: remove include of hardirq.h 2018-01-31 17:18:36 -08:00
frame_vector.c mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()' 2017-12-14 16:00:48 -08:00
frontswap.c mm, frontswap: convert frontswap_enabled to static key 2016-07-26 16:19:19 -07:00
gup.c Merge branch 'work.get_user_pages_fast' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-01-31 10:01:08 -08:00
gup_benchmark.c mm: add infrastructure for get_user_pages_fast() benchmarking 2017-11-17 16:10:04 -08:00
highmem.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
hmm.c Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths" 2017-12-15 18:53:22 -08:00
huge_memory.c mm/thp: remove pmd_huge_split_prepare() 2018-01-31 17:18:38 -08:00
hugetlb.c mm, hugetlb: remove hugepages_treat_as_movable sysctl 2018-01-31 17:18:37 -08:00
hugetlb_cgroup.c mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size 2016-05-20 17:58:30 -07:00
hwpoison-inject.c mm/memory_failure: Remove unused trapno from memory_failure 2018-01-23 12:17:42 -06:00
init-mm.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
internal.h Revert "mm, thp: Do not make pmd/pud dirty without a reason" 2017-11-29 09:01:01 -08:00
interval_tree.c mm/interval_tree.c: use vma_pages() helper 2018-01-31 17:18:37 -08:00
khugepaged.c mm: thp: use down_read_trylock() in khugepaged to avoid long block 2018-01-31 17:18:38 -08:00
kmemleak-test.c mm: convert printk(KERN_<LEVEL> to pr_<level> 2016-03-17 15:09:34 -07:00
kmemleak.c mm: kmemleak: remove unused hardirq.h 2018-01-31 17:18:36 -08:00
ksm.c mm/ksm: Remove now-redundant smp_read_barrier_depends() 2017-12-04 10:52:56 -08:00
list_lru.c mm/list_lru.c: mark expected switch fall-through 2017-11-15 18:21:07 -08:00
maccess.c x86: remove more uaccess_32.h complexity 2016-05-22 17:21:27 -07:00
madvise.c mm/memory_failure: Remove unused trapno from memory_failure 2018-01-23 12:17:42 -06:00
memblock.c mm: define memblock_virt_alloc_try_nid_raw 2017-11-15 18:21:05 -08:00
memcontrol.c mm: memcontrol: fix excessive complexity in memory.stat reporting 2018-01-31 17:18:36 -08:00
memory-failure.c signal/memory-failure: Use force_sig_mceerr and send_sig_mceerr 2018-01-23 12:17:48 -06:00
memory.c mm: add unmap_mapping_pages() 2018-01-31 17:18:37 -08:00
memory_hotplug.c mm: memory_hotplug: remove second __nr_to_section in register_page_bootmem_info_section() 2018-01-31 17:18:37 -08:00
mempolicy.c mm/mempolicy: add nodes_empty check in SYSC_migrate_pages 2018-01-31 17:18:36 -08:00
mempool.c mm/mempool.c: use kmalloc_array_node() 2017-11-15 18:21:02 -08:00
memtest.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
migrate.c Revert "mm, thp: Do not make pmd/pud dirty without a reason" 2017-11-29 09:01:01 -08:00
mincore.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mlock.c mm: Eliminate cond_resched_rcu_qs() in favor of cond_resched() 2017-11-28 16:00:28 -08:00
mm_init.c mm: convert printk(KERN_<LEVEL> to pr_<level> 2016-03-17 15:09:34 -07:00
mmap.c mm, oom_reaper: fix memory corruption 2017-12-14 16:00:49 -08:00
mmu_context.c sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
mmu_notifier.c mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks 2018-01-31 17:18:38 -08:00
mmzone.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mprotect.c mm/mprotect: add a cond_resched() inside change_pmd_range() 2018-01-04 16:45:09 -08:00
mremap.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
msync.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
nobootmem.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
nommu.c mm: add unmap_mapping_pages() 2018-01-31 17:18:37 -08:00
oom_kill.c mm, oom: avoid reaping only for mm's with blockable invalidate callbacks 2018-01-31 17:18:38 -08:00
page-writeback.c Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" 2017-11-29 18:40:43 -08:00
page_alloc.c mm/page_alloc.c: fix comment in __get_free_pages() 2018-01-31 17:18:36 -08:00
page_counter.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
page_ext.c mm/page_ext.c: check if page_ext is not prepared 2017-11-15 18:21:07 -08:00
page_idle.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
page_io.c block: convert to bio_first_bvec_all & bio_first_page_all 2018-01-06 09:18:00 -07:00
page_isolation.c mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages() 2017-11-15 18:21:02 -08:00
page_owner.c mm/page_owner.c: use PTR_ERR_OR_ZERO() 2018-01-31 17:18:36 -08:00
page_poison.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
page_vma_mapped.c mm, page_vma_mapped: Introduce pfn_in_hpage() 2018-01-22 12:15:57 -08:00
pagewalk.c mm/pagewalk.c: report holes in hugetlb ranges 2017-11-15 13:12:08 -08:00
percpu-internal.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
percpu-km.c percpu: replace area map allocator with bitmap 2017-07-26 17:41:05 -04:00
percpu-stats.c percpu: fix starting offset for chunk statistics traversal 2017-09-27 14:45:57 -07:00
percpu-vm.c mm: remove __GFP_COLD 2017-11-15 18:21:06 -08:00
percpu.c percpu: hack to let the CRIS architecture to boot until they clean up 2017-11-27 12:53:12 -08:00
pgtable-generic.c mm: do not lose dirty and accessed bits in pmdp_invalidate() 2018-01-31 17:18:38 -08:00
process_vm_access.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> 2017-03-02 08:42:28 +01:00
quicklist.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
readahead.c mm: don't cap request size based on read-ahead setting 2016-12-12 18:55:08 -08:00
rmap.c mm: remove cold parameter from free_hot_cold_page* 2017-11-15 18:21:06 -08:00
rodata_test.c mm: fix RODATA_TEST failure "rodata_test: test data was not read only" 2017-10-03 17:54:24 -07:00
shmem.c shmem: add sealing support to hugetlb-backed memfd 2018-01-31 17:18:39 -08:00
slab.c mm/slab.c: remove redundant assignments for slab_state 2018-01-31 17:18:35 -08:00
slab.h mm/slab_common.c: make calculate_alignment() static 2018-01-31 17:18:35 -08:00
slab_common.c mm/slab_common.c: make calculate_alignment() static 2018-01-31 17:18:35 -08:00
slob.c slab, slub, slob: add slab_flags_t 2017-11-15 18:21:01 -08:00
slub.c slub: remove obsolete comments of put_cpu_partial() 2018-01-31 17:18:36 -08:00
sparse-vmemmap.c mm, sparse: do not swamp log with huge vmemmap allocation failures 2017-11-15 18:21:07 -08:00
sparse.c mm/sparse.c: wrong allocation for mem_section 2018-01-04 16:45:09 -08:00
swap.c mm: drop hotplug lock from lru_add_drain_all() 2018-01-31 17:18:36 -08:00
swap_cgroup.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
swap_slots.c mm/swap_slots.c: fix race conditions in swap_slots cache init 2017-11-15 18:21:03 -08:00
swap_state.c mm: remove cold parameter for release_pages 2017-11-15 18:21:06 -08:00
swapfile.c ipc, kernel, mm: annotate ->poll() instances 2017-11-27 16:20:05 -05:00
truncate.c mm: add unmap_mapping_pages() 2018-01-31 17:18:37 -08:00
usercopy.c mm/usercopy: Drop extra is_vmalloc_or_module() check 2017-04-05 12:30:18 -07:00
userfaultfd.c userfaultfd: shmem: wire up shmem_mfill_zeropage_pte 2017-09-06 17:27:28 -07:00
util.c new primitive: vmemdup_user() 2018-01-07 13:06:15 -05:00
vmacache.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
vmalloc.c Revert "vmalloc: back off when the current task is killed" 2017-10-13 16:18:32 -07:00
vmpressure.c mm, vmpressure: pass-through notification support 2017-07-10 16:32:31 -07:00
vmscan.c mm: pin address_space before dereferencing it while isolating an LRU page 2018-01-31 17:18:39 -08:00
vmstat.c mm, sysctl: make NUMA stats configurable 2017-11-15 18:21:07 -08:00
workingset.c mm, truncate: do not check mapping for every page being truncated 2017-11-15 18:21:06 -08:00
z3fold.c mm/z3fold.c: use kref to prevent page free/compact race 2017-11-17 16:10:00 -08:00
zbud.c mm/zbud.c: use list_last_entry() instead of list_tail_entry() 2016-01-15 11:40:52 -08:00
zpool.c mm: zsmalloc: constify struct zs_pool name 2015-11-06 17:50:42 -08:00
zsmalloc.c mm/zsmalloc: simplify shrinker init/destroy 2018-01-31 17:18:38 -08:00
zswap.c zswap: same-filled pages handling 2018-01-31 17:18:36 -08:00