1
0
Fork 0
alistair23-linux/include/linux/mm_types.h

418 lines
12 KiB
C
Raw Normal View History

#ifndef _LINUX_MM_TYPES_H
#define _LINUX_MM_TYPES_H
#include <linux/auxvec.h>
#include <linux/types.h>
#include <linux/threads.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
mmu-notifiers: core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect) the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establishing an updated spte or secondary-tlb-mapping on the copied page. Or it will setup a readonly spte or readonly tlb mapping if it's a guest-read, if it calls get_user_pages with write=0. This is just an example. This allows to map any page pointed by any pte (and in turn visible in the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an full MMU with both sptes and secondary-tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence needing of schedule in XPMEM code to send the invalidate to the remote node, while no need to schedule in kvm/gru as it's an immediate event like invalidating primary-mmu pte). At least for KVM without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably. Dependencies: 1) mm_take_all_locks() to register the mmu notifier when the whole VM isn't doing anything with "mm". This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken too. 2) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_take_all_locks may be interrupted by a signal and return -EINTR. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. The patch also adds a few needed but missing includes that would prevent kernel to compile after these changes on non-x86 archs (x86 didn't need them by luck). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix mm/filemap_xip.c build] [akpm@linux-foundation.org: fix mm/mmu_notifier.c build] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kanoj Sarcar <kanojsarcar@yahoo.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Izik Eidus <izike@qumranet.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28 16:46:29 -06:00
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
uprobes/core: Allocate XOL slots for uprobes use Uprobes executes the original instruction at a probed location out of line. For this, we allocate a page (per mm) upon the first uprobe hit, in the process user address space, divide it into slots that are used to store the actual instructions to be singlestepped. These slots are known as xol (execution out of line) slots. Care is taken to ensure that the allocation is in an unmapped area as close to the top of the user address space as possible, with appropriate permission settings to keep selinux like frameworks happy. Upon a uprobe hit, a free slot is acquired, and is released after the singlestep completes. Lots of improvements courtesy suggestions/inputs from Peter and Oleg. [ Folded a fix for build issue on powerpc fixed and reported by Stephen Rothwell. ] Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com> Cc: Linux-mm <linux-mm@kvack.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Anton Arapov <anton@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120330182631.10018.48175.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-30 12:26:31 -06:00
#include <linux/uprobes.h>
#include <asm/page.h>
#include <asm/mmu.h>
#ifndef AT_VECTOR_SIZE_ARCH
#define AT_VECTOR_SIZE_ARCH 0
#endif
#define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
struct address_space;
#define USE_SPLIT_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
* moment. Note that we have no way to track which tasks are using
* a page, though if it is a pagecache page, rmap structures can tell us
* who is mapping it.
*
* The objects in struct page are organized in double word blocks in
* order to allows us to use atomic double word operations on portions
* of struct page. That is currently only used by slub but the arrangement
* allows the use of atomic double word operations on the flags/mapping
* and lru list pointers also.
*/
struct page {
/* First double word block */
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
struct address_space *mapping; /* If low bit clear, points to
* inode address_space, or NULL.
* If page mapped as anonymous
* memory, low bit is set, and
* it points to anon_vma object:
* see PAGE_MAPPING_ANON below.
*/
/* Second double word */
struct {
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* slub/slob first free object */
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 17:43:58 -06:00
bool pfmemalloc; /* If set by the page allocator,
* ALLOC_NO_WATERMARKS was set
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 17:43:58 -06:00
* and the low watermark was not
* met implying that the system
* is under some pressure. The
* caller should try ensure
* this page is only used to
* free other pages.
*/
};
union {
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
/* Used for cmpxchg_double in slub */
unsigned long counters;
#else
/*
* Keep _count separate from slub cmpxchg_double data.
* As the rest of the double word is protected by
* slab_lock but _count is not.
*/
unsigned counters;
#endif
struct {
union {
mm: thp: tail page refcounting fix Michel while working on the working set estimation code, noticed that calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the pfn ended up being a tail page of a transparent hugepage under splitting by __split_huge_page_refcount(). He then found the problem could also theoretically materialize with page_cache_get_speculative() during the speculative radix tree lookups that uses get_page_unless_zero() in SMP if the radix tree page is freed and reallocated and get_user_pages is called on it before page_cache_get_speculative has a chance to call get_page_unless_zero(). So the best way to fix the problem is to keep page_tail->_count zero at all times. This will guarantee that get_page_unless_zero() can never succeed on any tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail pages of a compound page, so we can simply account the tail page references there and transfer them to tail_page->_count in __split_huge_page_refcount() (in addition to the head_page->_mapcount). While debugging this s/_count/_mapcount/ change I also noticed get_page is called by direct-io.c on pages returned by get_user_pages. That wasn't entirely safe because the two atomic_inc in get_page weren't atomic. As opposed to other get_user_page users like secondary-MMU page fault to establish the shadow pagetables would never call any superflous get_page after get_user_page returns. It's safer to make get_page universally safe for tail pages and to use get_page_foll() within follow_page (inside get_user_pages()). get_page_foll() is safe to do the refcounting for tail pages without taking any locks because it is run within PT lock protected critical sections (PT lock for pte and page_table_lock for pmd_trans_huge). The standard get_page() as invoked by direct-io instead will now take the compound_lock but still only for tail pages. The direct-io paths are usually I/O bound and the compound_lock is per THP so very finegrined, so there's no risk of scalability issues with it. A simple direct-io benchmarks with all lockdep prove locking and spinlock debugging infrastructure enabled shows identical performance and no overhead. So it's worth it. Ideally direct-io should stop calling get_page() on pages returned by get_user_pages(). The spinlock in get_page() is already optimized away for no-THP builds but doing get_page() on tail pages returned by GUP is generally a rare operation and usually only run in I/O paths. This new refcounting on page_tail->_mapcount in addition to avoiding new RCU critical sections will also allow the working set estimation code to work without any further complexity associated to the tail page refcounting with THP. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Michel Lespinasse <walken@google.com> Reviewed-by: Michel Lespinasse <walken@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 14:36:59 -06:00
/*
* Count of ptes mapped in
* mms, to show when page is
* mapped & limit reverse map
* searches.
*
* Used also for tail pages
* refcounting instead of
* _count. Tail pages cannot
* be mapped and keeping the
* tail page _count zero at
* all times guarantees
* get_page_unless_zero() will
* never succeed on tail
* pages.
*/
atomic_t _mapcount;
struct { /* SLUB */
unsigned inuse:16;
unsigned objects:15;
unsigned frozen:1;
};
int units; /* SLOB */
};
atomic_t _count; /* Usage count, see below. */
};
};
SLUB core This is a new slab allocator which was motivated by the complexity of the existing code in mm/slab.c. It attempts to address a variety of concerns with the existing implementation. A. Management of object queues A particular concern was the complex management of the numerous object queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for each allocating CPU and use objects from a slab directly instead of queueing them up. B. Storage overhead of object queues SLAB Object queues exist per node, per CPU. The alien cache queue even has a queue array that contain a queue for each processor on each node. For very large systems the number of queues and the number of objects that may be caught in those queues grows exponentially. On our systems with 1k nodes / processors we have several gigabytes just tied up for storing references to objects for those queues This does not include the objects that could be on those queues. One fears that the whole memory of the machine could one day be consumed by those queues. C. SLAB meta data overhead SLAB has overhead at the beginning of each slab. This means that data cannot be naturally aligned at the beginning of a slab block. SLUB keeps all meta data in the corresponding page_struct. Objects can be naturally aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte boundaries and can fit tightly into a 4k page with no bytes left over. SLAB cannot do this. D. SLAB has a complex cache reaper SLUB does not need a cache reaper for UP systems. On SMP systems the per CPU slab may be pushed back into partial list but that operation is simple and does not require an iteration over a list of objects. SLAB expires per CPU, shared and alien object queues during cache reaping which may cause strange hold offs. E. SLAB has complex NUMA policy layer support SLUB pushes NUMA policy handling into the page allocator. This means that allocation is coarser (SLUB does interleave on a page level) but that situation was also present before 2.6.13. SLABs application of policies to individual slab objects allocated in SLAB is certainly a performance concern due to the frequent references to memory policies which may lead a sequence of objects to come from one node after another. SLUB will get a slab full of objects from one node and then will switch to the next. F. Reduction of the size of partial slab lists SLAB has per node partial lists. This means that over time a large number of partial slabs may accumulate on those lists. These can only be reused if allocator occur on specific nodes. SLUB has a global pool of partial slabs and will consume slabs from that pool to decrease fragmentation. G. Tunables SLAB has sophisticated tuning abilities for each slab cache. One can manipulate the queue sizes in detail. However, filling the queues still requires the uses of the spin lock to check out slabs. SLUB has a global parameter (min_slab_order) for tuning. Increasing the minimum slab order can decrease the locking overhead. The bigger the slab order the less motions of pages between per CPU and partial lists occur and the better SLUB will be scaling. G. Slab merging We often have slab caches with similar parameters. SLUB detects those on boot up and merges them into the corresponding general caches. This leads to more effective memory use. About 50% of all caches can be eliminated through slab merging. This will also decrease slab fragmentation because partial allocated slabs can be filled up again. Slab merging can be switched off by specifying slub_nomerge on boot up. Note that merging can expose heretofore unknown bugs in the kernel because corrupted objects may now be placed differently and corrupt differing neighboring objects. Enable sanity checks to find those. H. Diagnostics The current slab diagnostics are difficult to use and require a recompilation of the kernel. SLUB contains debugging code that is always available (but is kept out of the hot code paths). SLUB diagnostics can be enabled via the "slab_debug" option. Parameters can be specified to select a single or a group of slab caches for diagnostics. This means that the system is running with the usual performance and it is much more likely that race conditions can be reproduced. I. Resiliency If basic sanity checks are on then SLUB is capable of detecting common error conditions and recover as best as possible to allow the system to continue. J. Tracing Tracing can be enabled via the slab_debug=T,<slabcache> option during boot. SLUB will then protocol all actions on that slabcache and dump the object contents on free. K. On demand DMA cache creation. Generally DMA caches are not needed. If a kmalloc is used with __GFP_DMA then just create this single slabcache that is needed. For systems that have no ZONE_DMA requirement the support is completely eliminated. L. Performance increase Some benchmarks have shown speed improvements on kernbench in the range of 5-10%. The locking overhead of slub is based on the underlying base allocation size. If we can reliably allocate larger order pages then it is possible to increase slub performance much further. The anti-fragmentation patches may enable further performance increases. Tested on: i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator SLUB Boot options slub_nomerge Disable merging of slabs slub_min_order=x Require a minimum order for slab caches. This increases the managed chunk size and therefore reduces meta data and locking overhead. slub_min_objects=x Mininum objects per slab. Default is 8. slub_max_order=x Avoid generating slabs larger than order specified. slub_debug Enable all diagnostics for all caches slub_debug=<options> Enable selective options for all caches slub_debug=<o>,<cache> Enable selective options for a certain set of caches Available Debug options F Double Free checking, sanity and resiliency R Red zoning P Object / padding poisoning U Track last free / alloc T Trace all allocs / frees (only use for individual slabs). To use SLUB: Apply this patch and then select SLUB as the default slab allocator. [hugh@veritas.com: fix an oops-causing locking error] [akpm@linux-foundation.org: various stupid cleanups and small fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-06 15:49:36 -06:00
};
/* Third double word block */
slub: per cpu cache for partial pages Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to partial pages. The partial page list is used in slab_free() to avoid per node lock taking. In __slab_alloc() we can then take multiple partial pages off the per node partial list in one go reducing node lock pressure. We can also use the per cpu partial list in slab_alloc() to avoid scanning partial lists for pages with free objects. The main effect of a per cpu partial list is that the per node list_lock is taken for batches of partial pages instead of individual ones. Potential future enhancements: 1. The pickup from the partial list could be perhaps be done without disabling interrupts with some work. The free path already puts the page into the per cpu partial list without disabling interrupts. 2. __slab_free() may have some code paths that could use optimization. Performance: Before After ./hackbench 100 process 200000 Time: 1953.047 1564.614 ./hackbench 100 process 20000 Time: 207.176 156.940 ./hackbench 100 process 20000 Time: 204.468 156.940 ./hackbench 100 process 20000 Time: 204.879 158.772 ./hackbench 10 process 20000 Time: 20.153 15.853 ./hackbench 10 process 20000 Time: 20.153 15.986 ./hackbench 10 process 20000 Time: 19.363 16.111 ./hackbench 1 process 20000 Time: 2.518 2.307 ./hackbench 1 process 20000 Time: 2.258 2.339 ./hackbench 1 process 20000 Time: 2.864 2.163 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09 15:12:27 -06:00
union {
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
*/
slub: per cpu cache for partial pages Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to partial pages. The partial page list is used in slab_free() to avoid per node lock taking. In __slab_alloc() we can then take multiple partial pages off the per node partial list in one go reducing node lock pressure. We can also use the per cpu partial list in slab_alloc() to avoid scanning partial lists for pages with free objects. The main effect of a per cpu partial list is that the per node list_lock is taken for batches of partial pages instead of individual ones. Potential future enhancements: 1. The pickup from the partial list could be perhaps be done without disabling interrupts with some work. The free path already puts the page into the per cpu partial list without disabling interrupts. 2. __slab_free() may have some code paths that could use optimization. Performance: Before After ./hackbench 100 process 200000 Time: 1953.047 1564.614 ./hackbench 100 process 20000 Time: 207.176 156.940 ./hackbench 100 process 20000 Time: 204.468 156.940 ./hackbench 100 process 20000 Time: 204.879 158.772 ./hackbench 10 process 20000 Time: 20.153 15.853 ./hackbench 10 process 20000 Time: 20.153 15.986 ./hackbench 10 process 20000 Time: 19.363 16.111 ./hackbench 1 process 20000 Time: 2.518 2.307 ./hackbench 1 process 20000 Time: 2.258 2.339 ./hackbench 1 process 20000 Time: 2.864 2.163 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09 15:12:27 -06:00
struct { /* slub per cpu partial pages */
struct page *next; /* Next partial slab */
#ifdef CONFIG_64BIT
int pages; /* Nr of partial slabs left */
int pobjects; /* Approximate # of objects */
#else
short int pages;
short int pobjects;
#endif
};
struct list_head list; /* slobs list of pages */
struct { /* slab fields */
struct kmem_cache *slab_cache;
struct slab *slab_page;
};
slub: per cpu cache for partial pages Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to partial pages. The partial page list is used in slab_free() to avoid per node lock taking. In __slab_alloc() we can then take multiple partial pages off the per node partial list in one go reducing node lock pressure. We can also use the per cpu partial list in slab_alloc() to avoid scanning partial lists for pages with free objects. The main effect of a per cpu partial list is that the per node list_lock is taken for batches of partial pages instead of individual ones. Potential future enhancements: 1. The pickup from the partial list could be perhaps be done without disabling interrupts with some work. The free path already puts the page into the per cpu partial list without disabling interrupts. 2. __slab_free() may have some code paths that could use optimization. Performance: Before After ./hackbench 100 process 200000 Time: 1953.047 1564.614 ./hackbench 100 process 20000 Time: 207.176 156.940 ./hackbench 100 process 20000 Time: 204.468 156.940 ./hackbench 100 process 20000 Time: 204.879 158.772 ./hackbench 10 process 20000 Time: 20.153 15.853 ./hackbench 10 process 20000 Time: 20.153 15.986 ./hackbench 10 process 20000 Time: 19.363 16.111 ./hackbench 1 process 20000 Time: 2.518 2.307 ./hackbench 1 process 20000 Time: 2.258 2.339 ./hackbench 1 process 20000 Time: 2.864 2.163 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09 15:12:27 -06:00
};
/* Remainder is not double word aligned */
union {
unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
* if PagePrivate set; used for
* swp_entry_t if PageSwapCache;
* indicates order in the buddy
* system if PG_buddy is set.
*/
#if USE_SPLIT_PTLOCKS
spinlock_t ptl;
#endif
struct kmem_cache *slab; /* SLUB: Pointer to slab */
struct page *first_page; /* Compound tail pages */
SLUB core This is a new slab allocator which was motivated by the complexity of the existing code in mm/slab.c. It attempts to address a variety of concerns with the existing implementation. A. Management of object queues A particular concern was the complex management of the numerous object queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for each allocating CPU and use objects from a slab directly instead of queueing them up. B. Storage overhead of object queues SLAB Object queues exist per node, per CPU. The alien cache queue even has a queue array that contain a queue for each processor on each node. For very large systems the number of queues and the number of objects that may be caught in those queues grows exponentially. On our systems with 1k nodes / processors we have several gigabytes just tied up for storing references to objects for those queues This does not include the objects that could be on those queues. One fears that the whole memory of the machine could one day be consumed by those queues. C. SLAB meta data overhead SLAB has overhead at the beginning of each slab. This means that data cannot be naturally aligned at the beginning of a slab block. SLUB keeps all meta data in the corresponding page_struct. Objects can be naturally aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte boundaries and can fit tightly into a 4k page with no bytes left over. SLAB cannot do this. D. SLAB has a complex cache reaper SLUB does not need a cache reaper for UP systems. On SMP systems the per CPU slab may be pushed back into partial list but that operation is simple and does not require an iteration over a list of objects. SLAB expires per CPU, shared and alien object queues during cache reaping which may cause strange hold offs. E. SLAB has complex NUMA policy layer support SLUB pushes NUMA policy handling into the page allocator. This means that allocation is coarser (SLUB does interleave on a page level) but that situation was also present before 2.6.13. SLABs application of policies to individual slab objects allocated in SLAB is certainly a performance concern due to the frequent references to memory policies which may lead a sequence of objects to come from one node after another. SLUB will get a slab full of objects from one node and then will switch to the next. F. Reduction of the size of partial slab lists SLAB has per node partial lists. This means that over time a large number of partial slabs may accumulate on those lists. These can only be reused if allocator occur on specific nodes. SLUB has a global pool of partial slabs and will consume slabs from that pool to decrease fragmentation. G. Tunables SLAB has sophisticated tuning abilities for each slab cache. One can manipulate the queue sizes in detail. However, filling the queues still requires the uses of the spin lock to check out slabs. SLUB has a global parameter (min_slab_order) for tuning. Increasing the minimum slab order can decrease the locking overhead. The bigger the slab order the less motions of pages between per CPU and partial lists occur and the better SLUB will be scaling. G. Slab merging We often have slab caches with similar parameters. SLUB detects those on boot up and merges them into the corresponding general caches. This leads to more effective memory use. About 50% of all caches can be eliminated through slab merging. This will also decrease slab fragmentation because partial allocated slabs can be filled up again. Slab merging can be switched off by specifying slub_nomerge on boot up. Note that merging can expose heretofore unknown bugs in the kernel because corrupted objects may now be placed differently and corrupt differing neighboring objects. Enable sanity checks to find those. H. Diagnostics The current slab diagnostics are difficult to use and require a recompilation of the kernel. SLUB contains debugging code that is always available (but is kept out of the hot code paths). SLUB diagnostics can be enabled via the "slab_debug" option. Parameters can be specified to select a single or a group of slab caches for diagnostics. This means that the system is running with the usual performance and it is much more likely that race conditions can be reproduced. I. Resiliency If basic sanity checks are on then SLUB is capable of detecting common error conditions and recover as best as possible to allow the system to continue. J. Tracing Tracing can be enabled via the slab_debug=T,<slabcache> option during boot. SLUB will then protocol all actions on that slabcache and dump the object contents on free. K. On demand DMA cache creation. Generally DMA caches are not needed. If a kmalloc is used with __GFP_DMA then just create this single slabcache that is needed. For systems that have no ZONE_DMA requirement the support is completely eliminated. L. Performance increase Some benchmarks have shown speed improvements on kernbench in the range of 5-10%. The locking overhead of slub is based on the underlying base allocation size. If we can reliably allocate larger order pages then it is possible to increase slub performance much further. The anti-fragmentation patches may enable further performance increases. Tested on: i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator SLUB Boot options slub_nomerge Disable merging of slabs slub_min_order=x Require a minimum order for slab caches. This increases the managed chunk size and therefore reduces meta data and locking overhead. slub_min_objects=x Mininum objects per slab. Default is 8. slub_max_order=x Avoid generating slabs larger than order specified. slub_debug Enable all diagnostics for all caches slub_debug=<options> Enable selective options for all caches slub_debug=<o>,<cache> Enable selective options for a certain set of caches Available Debug options F Double Free checking, sanity and resiliency R Red zoning P Object / padding poisoning U Track last free / alloc T Trace all allocs / frees (only use for individual slabs). To use SLUB: Apply this patch and then select SLUB as the default slab allocator. [hugh@veritas.com: fix an oops-causing locking error] [akpm@linux-foundation.org: various stupid cleanups and small fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-06 15:49:36 -06:00
};
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
unsigned long debug_flags; /* Use atomic bitops on this */
#endif
#ifdef CONFIG_KMEMCHECK
/*
* kmemcheck wants to track the status of each byte in a page; this
* is a pointer to such a status block. NULL if not tracked.
*/
void *shadow;
#endif
}
/*
* The struct page can be forced to be double word aligned so that atomic ops
* on double words work. The SLUB allocator can make use of such a feature.
*/
#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
__aligned(2 * sizeof(unsigned long))
#endif
;
struct page_frag {
struct page *page;
#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
__u32 offset;
__u32 size;
#else
__u16 offset;
__u16 size;
#endif
};
typedef unsigned long __nocast vm_flags_t;
NOMMU: Make VMAs per MM as for MMU-mode linux Make VMAs per mm_struct as for MMU-mode linux. This solves two problems: (1) In SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an exec'ing process when VM_EXECUTABLE is specified, regardless of the fact that a VMA might be shared and already have its vm_mm assigned to another process or a dead process. A new struct (vm_region) is introduced to track a mapped region and to remember the circumstances under which it may be shared and the vm_list_struct structure is discarded as it's no longer required. This patch makes the following additional changes: (1) Regions are now allocated with alloc_pages() rather than kmalloc() and with no recourse to __GFP_COMP, so the pages are not composite. Instead, each page has a reference on it held by the region. Anything else that is interested in such a page will have to get a reference on it to retain it. When the pages are released due to unmapping, each page is passed to put_page() and will be freed when the page usage count reaches zero. (2) Excess pages are trimmed after an allocation as the allocation must be made as a power-of-2 quantity of pages. (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may end up with overlapping VMAs within the tree, the VMA struct address is appended to the sort key. (4) Non-anonymous VMAs are now added to the backing inode's prio list. (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of the backing region. The VMA and region structs will be split if necessary. (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory segment instead of all the attachments at that addresss. Multiple shmat()'s return the same address under NOMMU-mode instead of different virtual addresses as under MMU-mode. (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode. (8) /proc/maps is now the global list of mapped regions, and may list bits that aren't actually mapped anywhere. (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount of RAM currently allocated by mmap to hold mappable regions that can't be mapped directly. These are copies of the backing device or file if not anonymous. These changes make NOMMU mode more similar to MMU mode. The downside is that NOMMU mode requires some extra memory to track things over NOMMU without this patch (VMAs are no longer shared, and there are now region structs). Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>
2009-01-08 05:04:47 -07:00
/*
* A region containing a mapping of a non-memory backed file under NOMMU
* conditions. These are held in a global tree and are pinned by the VMAs that
* map parts of them.
*/
struct vm_region {
struct rb_node vm_rb; /* link in global region tree */
vm_flags_t vm_flags; /* VMA vm_flags */
NOMMU: Make VMAs per MM as for MMU-mode linux Make VMAs per mm_struct as for MMU-mode linux. This solves two problems: (1) In SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an exec'ing process when VM_EXECUTABLE is specified, regardless of the fact that a VMA might be shared and already have its vm_mm assigned to another process or a dead process. A new struct (vm_region) is introduced to track a mapped region and to remember the circumstances under which it may be shared and the vm_list_struct structure is discarded as it's no longer required. This patch makes the following additional changes: (1) Regions are now allocated with alloc_pages() rather than kmalloc() and with no recourse to __GFP_COMP, so the pages are not composite. Instead, each page has a reference on it held by the region. Anything else that is interested in such a page will have to get a reference on it to retain it. When the pages are released due to unmapping, each page is passed to put_page() and will be freed when the page usage count reaches zero. (2) Excess pages are trimmed after an allocation as the allocation must be made as a power-of-2 quantity of pages. (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may end up with overlapping VMAs within the tree, the VMA struct address is appended to the sort key. (4) Non-anonymous VMAs are now added to the backing inode's prio list. (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of the backing region. The VMA and region structs will be split if necessary. (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory segment instead of all the attachments at that addresss. Multiple shmat()'s return the same address under NOMMU-mode instead of different virtual addresses as under MMU-mode. (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode. (8) /proc/maps is now the global list of mapped regions, and may list bits that aren't actually mapped anywhere. (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount of RAM currently allocated by mmap to hold mappable regions that can't be mapped directly. These are copies of the backing device or file if not anonymous. These changes make NOMMU mode more similar to MMU mode. The downside is that NOMMU mode requires some extra memory to track things over NOMMU without this patch (VMAs are no longer shared, and there are now region structs). Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>
2009-01-08 05:04:47 -07:00
unsigned long vm_start; /* start address of region */
unsigned long vm_end; /* region initialised to here */
unsigned long vm_top; /* region allocated to here */
NOMMU: Make VMAs per MM as for MMU-mode linux Make VMAs per mm_struct as for MMU-mode linux. This solves two problems: (1) In SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an exec'ing process when VM_EXECUTABLE is specified, regardless of the fact that a VMA might be shared and already have its vm_mm assigned to another process or a dead process. A new struct (vm_region) is introduced to track a mapped region and to remember the circumstances under which it may be shared and the vm_list_struct structure is discarded as it's no longer required. This patch makes the following additional changes: (1) Regions are now allocated with alloc_pages() rather than kmalloc() and with no recourse to __GFP_COMP, so the pages are not composite. Instead, each page has a reference on it held by the region. Anything else that is interested in such a page will have to get a reference on it to retain it. When the pages are released due to unmapping, each page is passed to put_page() and will be freed when the page usage count reaches zero. (2) Excess pages are trimmed after an allocation as the allocation must be made as a power-of-2 quantity of pages. (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may end up with overlapping VMAs within the tree, the VMA struct address is appended to the sort key. (4) Non-anonymous VMAs are now added to the backing inode's prio list. (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of the backing region. The VMA and region structs will be split if necessary. (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory segment instead of all the attachments at that addresss. Multiple shmat()'s return the same address under NOMMU-mode instead of different virtual addresses as under MMU-mode. (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode. (8) /proc/maps is now the global list of mapped regions, and may list bits that aren't actually mapped anywhere. (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount of RAM currently allocated by mmap to hold mappable regions that can't be mapped directly. These are copies of the backing device or file if not anonymous. These changes make NOMMU mode more similar to MMU mode. The downside is that NOMMU mode requires some extra memory to track things over NOMMU without this patch (VMAs are no longer shared, and there are now region structs). Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>
2009-01-08 05:04:47 -07:00
unsigned long vm_pgoff; /* the offset in vm_file corresponding to vm_start */
struct file *vm_file; /* the backing file or NULL */
int vm_usage; /* region usage count (access under nommu_region_sem) */
bool vm_icache_flushed : 1; /* true if the icache has been flushed for
* this region */
NOMMU: Make VMAs per MM as for MMU-mode linux Make VMAs per mm_struct as for MMU-mode linux. This solves two problems: (1) In SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an exec'ing process when VM_EXECUTABLE is specified, regardless of the fact that a VMA might be shared and already have its vm_mm assigned to another process or a dead process. A new struct (vm_region) is introduced to track a mapped region and to remember the circumstances under which it may be shared and the vm_list_struct structure is discarded as it's no longer required. This patch makes the following additional changes: (1) Regions are now allocated with alloc_pages() rather than kmalloc() and with no recourse to __GFP_COMP, so the pages are not composite. Instead, each page has a reference on it held by the region. Anything else that is interested in such a page will have to get a reference on it to retain it. When the pages are released due to unmapping, each page is passed to put_page() and will be freed when the page usage count reaches zero. (2) Excess pages are trimmed after an allocation as the allocation must be made as a power-of-2 quantity of pages. (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may end up with overlapping VMAs within the tree, the VMA struct address is appended to the sort key. (4) Non-anonymous VMAs are now added to the backing inode's prio list. (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of the backing region. The VMA and region structs will be split if necessary. (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory segment instead of all the attachments at that addresss. Multiple shmat()'s return the same address under NOMMU-mode instead of different virtual addresses as under MMU-mode. (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode. (8) /proc/maps is now the global list of mapped regions, and may list bits that aren't actually mapped anywhere. (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount of RAM currently allocated by mmap to hold mappable regions that can't be mapped directly. These are copies of the backing device or file if not anonymous. These changes make NOMMU mode more similar to MMU mode. The downside is that NOMMU mode requires some extra memory to track things over NOMMU without this patch (VMAs are no longer shared, and there are now region structs). Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>
2009-01-08 05:04:47 -07:00
};
/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
* space that has a special rule for the page-fault handlers (ie a shared
* library, the executable area etc).
*/
struct vm_area_struct {
struct mm_struct * vm_mm; /* The address space we belong to. */
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;
pgprot_t vm_page_prot; /* Access permissions of this VMA. */
unsigned long vm_flags; /* Flags, see mm.h. */
struct rb_node vm_rb;
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree, or
* linkage of vma in the address_space->i_mmap_nonlinear list.
*/
union {
struct {
struct rb_node rb;
unsigned long rb_subtree_last;
} linear;
struct list_head nonlinear;
} shared;
/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages. A MAP_SHARED vma
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
* or brk vma (with NULL file) can only be in an anon_vma list.
*/
mm: change anon_vma linking to fix multi-process server scalability issue The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [akpm@linux-foundation.org: cleanups] Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05 14:42:07 -07:00
struct list_head anon_vma_chain; /* Serialized by mmap_sem &
* page_table_lock */
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
/* Function pointers to deal with this struct. */
const struct vm_operations_struct *vm_ops;
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
#ifndef CONFIG_MMU
NOMMU: Make VMAs per MM as for MMU-mode linux Make VMAs per mm_struct as for MMU-mode linux. This solves two problems: (1) In SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an exec'ing process when VM_EXECUTABLE is specified, regardless of the fact that a VMA might be shared and already have its vm_mm assigned to another process or a dead process. A new struct (vm_region) is introduced to track a mapped region and to remember the circumstances under which it may be shared and the vm_list_struct structure is discarded as it's no longer required. This patch makes the following additional changes: (1) Regions are now allocated with alloc_pages() rather than kmalloc() and with no recourse to __GFP_COMP, so the pages are not composite. Instead, each page has a reference on it held by the region. Anything else that is interested in such a page will have to get a reference on it to retain it. When the pages are released due to unmapping, each page is passed to put_page() and will be freed when the page usage count reaches zero. (2) Excess pages are trimmed after an allocation as the allocation must be made as a power-of-2 quantity of pages. (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may end up with overlapping VMAs within the tree, the VMA struct address is appended to the sort key. (4) Non-anonymous VMAs are now added to the backing inode's prio list. (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of the backing region. The VMA and region structs will be split if necessary. (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory segment instead of all the attachments at that addresss. Multiple shmat()'s return the same address under NOMMU-mode instead of different virtual addresses as under MMU-mode. (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode. (8) /proc/maps is now the global list of mapped regions, and may list bits that aren't actually mapped anywhere. (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount of RAM currently allocated by mmap to hold mappable regions that can't be mapped directly. These are copies of the backing device or file if not anonymous. These changes make NOMMU mode more similar to MMU mode. The downside is that NOMMU mode requires some extra memory to track things over NOMMU without this patch (VMAs are no longer shared, and there are now region structs). Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>
2009-01-08 05:04:47 -07:00
struct vm_region *vm_region; /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
};
struct core_thread {
struct task_struct *task;
struct core_thread *next;
};
struct core_state {
atomic_t nr_threads;
struct core_thread dumper;
struct completion startup;
};
enum {
MM_FILEPAGES,
MM_ANONPAGES,
MM_SWAPENTS,
NR_MM_COUNTERS
};
#if USE_SPLIT_PTLOCKS && defined(CONFIG_MMU)
#define SPLIT_RSS_COUNTING
/* per-thread cached information, */
struct task_rss_stat {
int events; /* for synchronization threshold */
int count[NR_MM_COUNTERS];
};
mm: delete non-atomic mm counter implementation The problem with having two different types of counters is that developers adding new code need to keep in mind whether it's safe to use both the atomic and non-atomic implementations. For example, when adding new callers of the *_mm_counter() functions a developer needs to ensure that those paths are always executed with page_table_lock held, in case we're using the non-atomic implementation of mm counters. Hugh Dickins introduced the atomic mm counters in commit f412ac08c986 ("[PATCH] mm: fix rss and mmlist locking"). When asked why he left the non-atomic counters around he said, | The only reason was to avoid adding costly atomic operations into a | configuration that had no need for them there: the page_table_lock | sufficed. | | Certainly it would be simpler just to delete the non-atomic variant. | | And I think it's fair to say that any configuration on which we're | measuring performance to that degree (rather than "does it boot fast?" | type measurements), would already be going the split ptlocks route. Removing the non-atomic counters eases the maintenance burden because developers no longer have to mindful of the two implementations when using *_mm_counter(). Note that all architectures provide a means of atomically updating atomic_long_t variables, even if they have to revert to the generic spinlock implementation because they don't support 64-bit atomic instructions (see lib/atomic64.c). Signed-off-by: Matt Fleming <matt.fleming@linux.intel.com> Acked-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-24 18:12:36 -06:00
#endif /* USE_SPLIT_PTLOCKS */
struct mm_rss_stat {
mm: delete non-atomic mm counter implementation The problem with having two different types of counters is that developers adding new code need to keep in mind whether it's safe to use both the atomic and non-atomic implementations. For example, when adding new callers of the *_mm_counter() functions a developer needs to ensure that those paths are always executed with page_table_lock held, in case we're using the non-atomic implementation of mm counters. Hugh Dickins introduced the atomic mm counters in commit f412ac08c986 ("[PATCH] mm: fix rss and mmlist locking"). When asked why he left the non-atomic counters around he said, | The only reason was to avoid adding costly atomic operations into a | configuration that had no need for them there: the page_table_lock | sufficed. | | Certainly it would be simpler just to delete the non-atomic variant. | | And I think it's fair to say that any configuration on which we're | measuring performance to that degree (rather than "does it boot fast?" | type measurements), would already be going the split ptlocks route. Removing the non-atomic counters eases the maintenance burden because developers no longer have to mindful of the two implementations when using *_mm_counter(). Note that all architectures provide a means of atomically updating atomic_long_t variables, even if they have to revert to the generic spinlock implementation because they don't support 64-bit atomic instructions (see lib/atomic64.c). Signed-off-by: Matt Fleming <matt.fleming@linux.intel.com> Acked-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-24 18:12:36 -06:00
atomic_long_t count[NR_MM_COUNTERS];
};
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
struct vm_area_struct * mmap_cache; /* last find_vma result */
#ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
#endif
unsigned long mmap_base; /* base of mmap area */
unsigned long task_size; /* size of task vm space */
unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
spinlock_t page_table_lock; /* Protects page tables and some counters */
struct rw_semaphore mmap_sem;
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
* by mmlist_lock
*/
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
unsigned long total_vm; /* Total pages mapped */
unsigned long locked_vm; /* Pages that have PG_mlocked set */
unsigned long pinned_vm; /* Refcount permanently increased */
unsigned long shared_vm; /* Shared pages (files) */
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */
unsigned long stack_vm; /* VM_GROWSUP/DOWN */
unsigned long def_flags;
unsigned long nr_ptes; /* Page table pages */
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
/*
* Special counters, in some configurations protected by the
* page_table_lock, in other configurations by being atomic.
*/
struct mm_rss_stat rss_stat;
struct linux_binfmt *binfmt;
cpumask_var_t cpu_vm_mask_var;
/* Architecture-specific MM context */
mm_context_t context;
unsigned long flags; /* Must use atomic bitops to access the bits */
struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
spinlock_t ioctx_lock;
struct hlist_head ioctx_list;
#endif
cgroups: add an owner to the mm_struct Remove the mem_cgroup member from mm_struct and instead adds an owner. This approach was suggested by Paul Menage. The advantage of this approach is that, once the mm->owner is known, using the subsystem id, the cgroup can be determined. It also allows several control groups that are virtually grouped by mm_struct, to exist independent of the memory controller i.e., without adding mem_cgroup's for each controller, to mm_struct. A new config option CONFIG_MM_OWNER is added and the memory resource controller selects this config option. This patch also adds cgroup callbacks to notify subsystems when mm->owner changes. The mm_cgroup_changed callback is called with the task_lock() of the new task held and is called just prior to changing the mm->owner. I am indebted to Paul Menage for the several reviews of this patchset and helping me make it lighter and simpler. This patch was tested on a powerpc box, it was compiled with both the MM_OWNER config turned on and off. After the thread group leader exits, it's moved to init_css_state by cgroup_exit(), thus all future charges from runnings threads would be redirected to the init_css_set's subsystem. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: David Rientjes <rientjes@google.com>, Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 02:00:16 -06:00
#ifdef CONFIG_MM_OWNER
/*
* "owner" points to a task that is regarded as the canonical
* user/owner of this mm. All of the following must be true in
* order for it to be changed:
*
* current == mm->owner
* current->mm != mm
* new_owner->mm == mm
* new_owner->alloc_lock is held
*/
struct task_struct __rcu *owner;
#endif
/* store ref to file /proc/<pid>/exe symlink points to */
struct file *exe_file;
mmu-notifiers: core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect) the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establishing an updated spte or secondary-tlb-mapping on the copied page. Or it will setup a readonly spte or readonly tlb mapping if it's a guest-read, if it calls get_user_pages with write=0. This is just an example. This allows to map any page pointed by any pte (and in turn visible in the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an full MMU with both sptes and secondary-tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence needing of schedule in XPMEM code to send the invalidate to the remote node, while no need to schedule in kvm/gru as it's an immediate event like invalidating primary-mmu pte). At least for KVM without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably. Dependencies: 1) mm_take_all_locks() to register the mmu notifier when the whole VM isn't doing anything with "mm". This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken too. 2) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_take_all_locks may be interrupted by a signal and return -EINTR. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. The patch also adds a few needed but missing includes that would prevent kernel to compile after these changes on non-x86 archs (x86 didn't need them by luck). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix mm/filemap_xip.c build] [akpm@linux-foundation.org: fix mm/mmu_notifier.c build] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kanoj Sarcar <kanojsarcar@yahoo.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Izik Eidus <izike@qumranet.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28 16:46:29 -06:00
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
mmu-notifiers: core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect) the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establishing an updated spte or secondary-tlb-mapping on the copied page. Or it will setup a readonly spte or readonly tlb mapping if it's a guest-read, if it calls get_user_pages with write=0. This is just an example. This allows to map any page pointed by any pte (and in turn visible in the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an full MMU with both sptes and secondary-tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence needing of schedule in XPMEM code to send the invalidate to the remote node, while no need to schedule in kvm/gru as it's an immediate event like invalidating primary-mmu pte). At least for KVM without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably. Dependencies: 1) mm_take_all_locks() to register the mmu notifier when the whole VM isn't doing anything with "mm". This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken too. 2) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_take_all_locks may be interrupted by a signal and return -EINTR. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. The patch also adds a few needed but missing includes that would prevent kernel to compile after these changes on non-x86 archs (x86 didn't need them by luck). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix mm/filemap_xip.c build] [akpm@linux-foundation.org: fix mm/mmu_notifier.c build] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kanoj Sarcar <kanojsarcar@yahoo.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Izik Eidus <izike@qumranet.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28 16:46:29 -06:00
#endif
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
#endif
uprobes/core: Allocate XOL slots for uprobes use Uprobes executes the original instruction at a probed location out of line. For this, we allocate a page (per mm) upon the first uprobe hit, in the process user address space, divide it into slots that are used to store the actual instructions to be singlestepped. These slots are known as xol (execution out of line) slots. Care is taken to ensure that the allocation is in an unmapped area as close to the top of the user address space as possible, with appropriate permission settings to keep selinux like frameworks happy. Upon a uprobe hit, a free slot is acquired, and is released after the singlestep completes. Lots of improvements courtesy suggestions/inputs from Peter and Oleg. [ Folded a fix for build issue on powerpc fixed and reported by Stephen Rothwell. ] Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com> Cc: Linux-mm <linux-mm@kvack.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Anton Arapov <anton@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120330182631.10018.48175.sendpatchset@srdronam.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-30 12:26:31 -06:00
struct uprobes_state uprobes_state;
};
static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
mm->cpu_vm_mask_var = &mm->cpumask_allocation;
#endif
}
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
{
return mm->cpu_vm_mask_var;
}
#endif /* _LINUX_MM_TYPES_H */