1
0
Fork 0
alistair23-linux/arch/parisc/include/uapi/asm/mman.h

79 lines
3.4 KiB
C
Raw Normal View History

License cleanup: add SPDX license identifier to uapi header files with no license Many user space API headers are missing licensing information, which makes it hard for compliance tools to determine the correct license. By default are files without license information under the default license of the kernel, which is GPLV2. Marking them GPLV2 would exclude them from being included in non GPLV2 code, which is obviously not intended. The user space API headers fall under the syscall exception which is in the kernels COPYING file: NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work". otherwise syscall usage would not be possible. Update the files which contain no license information with an SPDX license identifier. The chosen identifier is 'GPL-2.0 WITH Linux-syscall-note' which is the officially assigned identifier for the Linux syscall exception. SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 08:08:43 -06:00
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef __PARISC_MMAN_H__
#define __PARISC_MMAN_H__
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
/* 0x01 - 0x03 are defined in linux/mman.h */
#define MAP_TYPE 0x2b /* Mask for type of mapping, includes bits 0x08 and 0x20 */
#define MAP_FIXED 0x04 /* Interpret addr exactly */
#define MAP_ANONYMOUS 0x10 /* don't use a file */
#define MAP_DENYWRITE 0x0800 /* ETXTBSY */
#define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
#define MAP_LOCKED 0x2000 /* pages are locked */
#define MAP_NORESERVE 0x4000 /* don't check for reservations */
#define MAP_GROWSDOWN 0x8000 /* stack-like segment */
#define MAP_POPULATE 0x10000 /* populate (prefault) pagetables */
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
mm: introduce MAP_FIXED_NOREPLACE Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2. This has started as a follow up discussion [3][4] resulting in the runtime failure caused by hardening patch [5] which removes MAP_FIXED from the elf loader because MAP_FIXED is inherently dangerous as it might silently clobber an existing underlying mapping (e.g. stack). The reason for the failure is that some architectures enforce an alignment for the given address hint without MAP_FIXED used (e.g. for shared or file backed mappings). One way around this would be excluding those archs which do alignment tricks from the hardening [6]. The patch is really trivial but it has been objected, rightfully so, that this screams for a more generic solution. We basically want a non-destructive MAP_FIXED. The first patch introduced MAP_FIXED_NOREPLACE which enforces the given address but unlike MAP_FIXED it fails with EEXIST if the given range conflicts with an existing one. The flag is introduced as a completely new one rather than a MAP_FIXED extension because of the backward compatibility. We really want a never-clobber semantic even on older kernels which do not recognize the flag. Unfortunately mmap sucks wrt flags evaluation because we do not EINVAL on unknown flags. On those kernels we would simply use the traditional hint based semantic so the caller can still get a different address (which sucks) but at least not silently corrupt an existing mapping. I do not see a good way around that. Except we won't export expose the new semantic to the userspace at all. It seems there are users who would like to have something like that. Jemalloc has been mentioned by Michael Ellerman [7] Florian Weimer has mentioned the following: : glibc ld.so currently maps DSOs without hints. This means that the kernel : will map right next to each other, and the offsets between them a completely : predictable. We would like to change that and supply a random address in a : window of the address space. If there is a conflict, we do not want the : kernel to pick a non-random address. Instead, we would try again with a : random address. John Hubbard has mentioned CUDA example : a) Searches /proc/<pid>/maps for a "suitable" region of available : VA space. "Suitable" generally means it has to have a base address : within a certain limited range (a particular device model might : have odd limitations, for example), it has to be large enough, and : alignment has to be large enough (again, various devices may have : constraints that lead us to do this). : : This is of course subject to races with other threads in the process. : : Let's say it finds a region starting at va. : : b) Next it does: : p = mmap(va, ...) : : *without* setting MAP_FIXED, of course (so va is just a hint), to : attempt to safely reserve that region. If p != va, then in most cases, : this is a failure (almost certainly due to another thread getting a : mapping from that region before we did), and so this layer now has to : call munmap(), before returning a "failure: retry" to upper layers. : : IMPROVEMENT: --> if instead, we could call this: : : p = mmap(va, ... MAP_FIXED_NOREPLACE ...) : : , then we could skip the munmap() call upon failure. This : is a small thing, but it is useful here. (Thanks to Piotr : Jaroszynski and Mark Hairgrove for helping me get that detail : exactly right, btw.) : : c) After that, CUDA suballocates from p, via: : : q = mmap(sub_region_start, ... MAP_FIXED ...) : : Interestingly enough, "freeing" is also done via MAP_FIXED, and : setting PROT_NONE to the subregion. Anyway, I just included (c) for : general interest. Atomic address range probing in the multithreaded programs in general sounds like an interesting thing to me. The second patch simply replaces MAP_FIXED use in elf loader by MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED should follow. Actually real MAP_FIXED usages should be docummented properly and they should be more of an exception. [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org [3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au [4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz [7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au This patch (of 2): MAP_FIXED is used quite often to enforce mapping at the particular range. The main problem of this flag is, however, that it is inherently dangerous because it unmaps existing mappings covered by the requested range. This can cause silent memory corruptions. Some of them even with serious security implications. While the current semantic might be really desiderable in many cases there are others which would want to enforce the given range but rather see a failure than a silent memory corruption on a clashing range. Please note that there is no guarantee that a given range is obeyed by the mmap even when it is free - e.g. arch specific code is allowed to apply an alignment. Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this behavior. It has the same semantic as MAP_FIXED wrt. the given address request with a single exception that it fails with EEXIST if the requested address is already covered by an existing mapping. We still do rely on get_unmaped_area to handle all the arch specific MAP_FIXED treatment and check for a conflicting vma after it returns. The flag is introduced as a completely new one rather than a MAP_FIXED extension because of the backward compatibility. We really want a never-clobber semantic even on older kernels which do not recognize the flag. Unfortunately mmap sucks wrt. flags evaluation because we do not EINVAL on unknown flags. On those kernels we would simply use the traditional hint based semantic so the caller can still get a different address (which sucks) but at least not silently corrupt an existing mapping. I do not see a good way around that. [mpe@ellerman.id.au: fix whitespace] [fail on clashing range with EEXIST as per Florian Weimer] [set MAP_FIXED before round_hint_to_min as per Khalid Aziz] Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Russell King - ARM Linux <linux@armlinux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Joel Stanley <joel@jms.id.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Jason Evans <jasone@google.com> Cc: David Goldblatt <davidtgoldblatt@gmail.com> Cc: Edward Tomasz Napierała <trasz@FreeBSD.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-10 17:35:57 -06:00
#define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
#define MS_SYNC 1 /* synchronous memory sync */
#define MS_ASYNC 2 /* sync memory asynchronously */
#define MS_INVALIDATE 4 /* invalidate the caches */
#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage The previous patch introduced a flag that specified pages in a VMA should be placed on the unevictable LRU, but they should not be made present when the area is created. This patch adds the ability to set this state via the new mlock system calls. We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall. MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED. MCL_ONFAULT should be used as a modifier to the two other mlockall flags. When used with MCL_CURRENT, all current mappings will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be marked with VM_LOCKED | VM_LOCKONFAULT. Prior to this patch, mlockall() will unconditionally clear the mm->def_flags any time it is called without MCL_FUTURE. This behavior is maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and new VMAs will be unlocked. This remains true with or without MCL_ONFAULT in either mlockall() invocation. munlock() will unconditionally clear both vma flags. munlockall() unconditionally clears for VMA flags on all VMAs and in the mm->def_flags field. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:51:39 -07:00
#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
#define MLOCK_ONFAULT 0x01 /* Lock pages in range after they are faulted in, do not prefault */
#define MADV_NORMAL 0 /* no further special treatment */
#define MADV_RANDOM 1 /* expect random page references */
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
/* common/generic parameters */
arch/*/include/uapi/asm/mman.h: : let MADV_FREE have same value for all architectures For uapi, need try to let all macros have same value, and MADV_FREE is added into main branch recently, so need redefine MADV_FREE for it. At present, '8' can be shared with all architectures, so redefine it to '8'. [sudipm.mukherjee@gmail.com: correct uniform value of MADV_FREE] Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Roland Dreier <roland@kernel.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@kernel.org> Cc: <yalin.wang2010@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Jason Evans <je@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Shaohua Li <shli@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:55:02 -07:00
#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
mm: introduce MADV_COLD Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster so we are trying to make hot start more likely than cold start. To increase hot start, Android userspace manages the order that apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd(low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements will vary based on the memory requirements of individual workloads. - Problem Naturally, cached apps were dominant consumers of memory on the system. However, they were not significant consumers of swap even though they are good candidate for swap. Under investigation, swapping out only begins once the low zone watermark is hit and kswapd wakes up, but the overall allocation rate in the system might trip lmkd thresholds and cause a cached process to be killed(we measured performance swapping out vs. zapping the memory by killing a process. Unsurprisingly, zapping is 10x times faster even though we use zram which is much faster than real storage) so kill from lmkd will often satisfy the high zone watermark, resulting in very few pages actually being moved to swap. - Approach The approach we chose was to use a new interface to allow userspace to proactively reclaim entire processes by leveraging platform information. This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages that are known to be cold from userspace and to avoid races with lmkd by reclaiming apps as soon as they entered the cached state. Additionally, it could provide many chances for platform to use much information to optimize memory efficiency. To achieve the goal, the patchset introduce two new options for madvise. One is MADV_COLD which will deactivate activated pages and the other is MADV_PAGEOUT which will reclaim private pages instantly. These new options complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way that it hints the kernel that memory region is not currently needed and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way that it hints the kernel that memory region is not currently needed and should be reclaimed when memory pressure rises. This patch (of 5): When a process expects no accesses to a certain memory range, it could give a hint to kernel that the pages can be reclaimed when memory pressure happens but data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help kernel in deciding which pages to evict early during memory pressure. It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves active file page -> inactive file LRU active anon page -> inacdtive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file LRU's head because MADV_COLD is a little bit different symantic. MADV_FREE means it's okay to discard when the memory pressure because the content of the page is *garbage* so freeing such pages is almost zero overhead since we don't need to swap out and access afterward causes just minor fault. Thus, it would make sense to put those freeable pages in inactive file LRU to compete other used-once pages. It makes sense for implmentaion point of view, too because it's not swapbacked memory any longer until it would be re-dirtied. Even, it could give a bonus to make them be reclaimed on swapless system. However, MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in in the end so it's bigger cost. Since we have designed VM LRU aging based on cost-model, anonymous cold pages would be better to position inactive anon's LRU list, not file LRU. Furthermore, it would help to avoid unnecessary scanning if system doesn't have a swap device. Let's start simpler way without adding complexity at this moment. However, keep in mind, too that it's a caveat that workloads with a lot of pages cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. * man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:49:08 -06:00
#define MADV_COLD 20 /* deactivate these pages */
ksm: define MADV_MERGEABLE and MADV_UNMERGEABLE The out-of-tree KSM used ioctls on fds cloned from /dev/ksm to register a memory area for merging: we prefer now to use an madvise(2) interface. This patch just defines MADV_MERGEABLE (to tell KSM it may merge pages in this area found identical to pages in other mergeable areas) and MADV_UNMERGEABLE (to undo that). Most architectures use asm-generic, but alpha, mips, parisc, xtensa need their own definitions: included here for mmotm convenience, but we'll probably want to split this and feed pieces to arch maintainers. Based upon earlier patches by Chris Wright and Izik Eidus. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-21 18:01:53 -06:00
#define MADV_MERGEABLE 65 /* KSM may merge identical pages */
#define MADV_UNMERGEABLE 66 /* KSM may not merge identical pages */
#define MADV_HUGEPAGE 67 /* Worth backing with hugepages */
#define MADV_NOHUGEPAGE 68 /* Not worth backing with hugepages */
#define MADV_DONTDUMP 69 /* Explicity exclude from the core dump,
overrides the coredump filter bits */
#define MADV_DODUMP 70 /* Clear the MADV_NODUMP flag */
mm,fork: introduce MADV_WIPEONFORK Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty in the child process after fork. This differs from MADV_DONTFORK in one important way. If a child process accesses memory that was MADV_WIPEONFORK, it will get zeroes. The address ranges are still valid, they are just empty. If a child process accesses memory that was MADV_DONTFORK, it will get a segmentation fault, since those address ranges are no longer valid in the child after fork. Since MADV_DONTFORK also seems to be used to allow very large programs to fork in systems with strict memory overcommit restrictions, changing the semantics of MADV_DONTFORK might break existing programs. MADV_WIPEONFORK only works on private, anonymous VMAs. The use case is libraries that store or cache information, and want to know that they need to regenerate it in the child process after fork. Examples of this would be: - systemd/pulseaudio API checks (fail after fork) (replacing a getpid check, which is too slow without a PID cache) - PKCS#11 API reinitialization check (mandated by specification) - glibc's upcoming PRNG (reseed after fork) - OpenSSL PRNG (reseed after fork) The security benefits of a forking server having a re-inialized PRNG in every child process are pretty obvious. However, due to libraries having all kinds of internal state, and programs getting compiled with many different versions of each library, it is unreasonable to expect calling programs to re-initialize everything manually after fork. A further complication is the proliferation of clone flags, programs bypassing glibc's functions to call clone directly, and programs calling unshare, causing the glibc pthread_atfork hook to not get called. It would be better to have the kernel take care of this automatically. The patch also adds MADV_KEEPONFORK, to undo the effects of a prior MADV_WIPEONFORK. This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO: https://man.openbsd.org/minherit.2 [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines] Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Florian Weimer <fweimer@redhat.com> Reported-by: Colm MacCártaigh <colm@allcosts.net> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:25:15 -06:00
#define MADV_WIPEONFORK 71 /* Zero memory on fork, child only */
#define MADV_KEEPONFORK 72 /* Undo MADV_WIPEONFORK */
#define MADV_HWPOISON 100 /* poison a page for testing */
#define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */
/* compatibility flags */
#define MAP_FILE 0
#define MAP_VARIABLE 0
x86/pkeys: Allocation/free syscalls This patch adds two new system calls: int pkey_alloc(unsigned long flags, unsigned long init_access_rights) int pkey_free(int pkey); These implement an "allocator" for the protection keys themselves, which can be thought of as analogous to the allocator that the kernel has for file descriptors. The kernel tracks which numbers are in use, and only allows operations on keys that are valid. A key which was not obtained by pkey_alloc() may not, for instance, be passed to pkey_mprotect(). These system calls are also very important given the kernel's use of pkeys to implement execute-only support. These help ensure that userspace can never assume that it has control of a key unless it first asks the kernel. The kernel does not promise to preserve PKRU (right register) contents except for allocated pkeys. The 'init_access_rights' argument to pkey_alloc() specifies the rights that will be established for the returned pkey. For instance: pkey = pkey_alloc(flags, PKEY_DENY_WRITE); will allocate 'pkey', but also sets the bits in PKRU[1] such that writing to 'pkey' is already denied. The kernel does not prevent pkey_free() from successfully freeing in-use pkeys (those still assigned to a memory range by pkey_mprotect()). It would be expensive to implement the checks for this, so we instead say, "Just don't do it" since sane software will never do it anyway. Any piece of userspace calling pkey_alloc() needs to be prepared for it to fail. Why? pkey_alloc() returns the same error code (ENOSPC) when there are no pkeys and when pkeys are unsupported. They can be unsupported for a whole host of reasons, so apps must be prepared for this. Also, libraries or LD_PRELOADs might steal keys before an application gets access to them. This allocation mechanism could be implemented in userspace. Even if we did it in userspace, we would still need additional user/kernel interfaces to tell userspace which keys are being used by the kernel internally (such as for execute-only mappings). Having the kernel provide this facility completely removes the need for these additional interfaces, or having an implementation of this in userspace at all. Note that we have to make changes to all of the architectures that do not use mman-common.h because we use the new PKEY_DENY_ACCESS/WRITE macros in arch-independent code. 1. PKRU is the Protection Key Rights User register. It is a usermode-accessible register that controls whether writes and/or access to each individual pkey is allowed or denied. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: linux-arch@vger.kernel.org Cc: Dave Hansen <dave@sr71.net> Cc: arnd@arndb.de Cc: linux-api@vger.kernel.org Cc: linux-mm@kvack.org Cc: luto@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 10:30:15 -06:00
#define PKEY_DISABLE_ACCESS 0x1
#define PKEY_DISABLE_WRITE 0x2
#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
PKEY_DISABLE_WRITE)
#endif /* __PARISC_MMAN_H__ */