From 9225e4e02936e0de920f45aff21095d453b53be3 Mon Sep 17 00:00:00 2001 From: Christina Quast Date: Wed, 11 Apr 2018 18:33:26 +0200 Subject: [PATCH 001/103] Some files were renamed from .txt to .rst, but the Documentation was not fixed yet. Signed-off-by: Christina Quast Signed-off-by: Jonathan Corbet --- Documentation/sound/alsa-configuration.rst | 4 ++-- Documentation/sound/soc/codec.rst | 2 +- Documentation/sound/soc/platform.rst | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/sound/alsa-configuration.rst b/Documentation/sound/alsa-configuration.rst index aed6b4fb8e46..ab5761148163 100644 --- a/Documentation/sound/alsa-configuration.rst +++ b/Documentation/sound/alsa-configuration.rst @@ -1062,7 +1062,7 @@ output (with ``--no-upload`` option) to kernel bugzilla or alsa-devel ML (see the section `Links and Addresses`_). ``power_save`` and ``power_save_controller`` options are for power-saving -mode. See powersave.txt for details. +mode. See powersave.rst for details. Note 2: If you get click noises on output, try the module option ``position_fix=1`` or ``2``. ``position_fix=1`` will use the SD_LPIB @@ -1133,7 +1133,7 @@ line_outs_monitor enable_monitor Enable Analog Out on Channel 63/64 by default. -See hdspm.txt for details. +See hdspm.rst for details. Module snd-ice1712 ------------------ diff --git a/Documentation/sound/soc/codec.rst b/Documentation/sound/soc/codec.rst index f87612b94812..240770ea761e 100644 --- a/Documentation/sound/soc/codec.rst +++ b/Documentation/sound/soc/codec.rst @@ -139,7 +139,7 @@ DAPM description ---------------- The Dynamic Audio Power Management description describes the codec power components and their relationships and registers to the ASoC core. -Please read dapm.txt for details of building the description. +Please read dapm.rst for details of building the description. Please also see the examples in other codec drivers. 
diff --git a/Documentation/sound/soc/platform.rst b/Documentation/sound/soc/platform.rst index d5574904d981..02c93a8b9c3b 100644 --- a/Documentation/sound/soc/platform.rst +++ b/Documentation/sound/soc/platform.rst @@ -66,7 +66,7 @@ Each SoC DAI driver must provide the following features:- 4. SYSCLK configuration 5. Suspend and resume (optional) -Please see codec.txt for a description of items 1 - 4. +Please see codec.rst for a description of items 1 - 4. SoC DSP Drivers From 1ba2211c525a205caed76adc6a328b423556f6e5 Mon Sep 17 00:00:00 2001 From: Konstantin Ryabitsev Date: Thu, 12 Apr 2018 16:44:10 -0400 Subject: [PATCH 002/103] Documentation/process: updates to the PGP guide Small tweaks to the Maintainer PGP guide: - Use --quick-addkey command that is compatible between GnuPG-2.2 and GnuPG-2.1 (which many people still have) - Add a note about the Nitrokey program - Warn that some devices can't change the passphrase before there are keys on the card (specifically, Nitrokeys) - Link to the GnuPG wiki page about gpg-agent forwarding over ssh - Tell git to use gpgv2 instead of legacy gpgv when verifying signed tags or commits Signed-off-by: Konstantin Ryabitsev Signed-off-by: Jonathan Corbet --- .../process/maintainer-pgp-guide.rst | 39 ++++++++++++++++++- 1 file changed, 37 insertions(+), 2 deletions(-) diff --git a/Documentation/process/maintainer-pgp-guide.rst b/Documentation/process/maintainer-pgp-guide.rst index b453561a7148..aff9b1a4d77b 100644 --- a/Documentation/process/maintainer-pgp-guide.rst +++ b/Documentation/process/maintainer-pgp-guide.rst @@ -219,7 +219,7 @@ Our goal is to protect your master key by moving it to offline media, so if you only have a combined **[SC]** key, then you should create a separate signing subkey:: - $ gpg --quick-add-key [fpr] ed25519 sign + $ gpg --quick-addkey [fpr] ed25519 sign Remember to tell the keyservers about this change, so others can pull down your new subkey:: @@ -450,11 +450,18 @@ functionality. 
There are several options available: others. If you want to use ECC keys, your best bet among commercially available devices is the Nitrokey Start. +.. note:: + + If you are listed in MAINTAINERS or have an account at kernel.org, + you `qualify for a free Nitrokey Start`_ courtesy of The Linux + Foundation. + .. _`Nitrokey Start`: https://shop.nitrokey.com/shop/product/nitrokey-start-6 .. _`Nitrokey Pro`: https://shop.nitrokey.com/shop/product/nitrokey-pro-3 .. _`Yubikey 4`: https://www.yubico.com/product/yubikey-4-series/ .. _Gnuk: http://www.fsij.org/doc-gnuk/ .. _`LWN has a good review`: https://lwn.net/Articles/736231/ +.. _`qualify for a free Nitrokey Start`: https://www.kernel.org/nitrokey-digital-tokens-for-kernel-developers.html Configure your smartcard device ------------------------------- @@ -482,7 +489,7 @@ there are no convenient command-line switches:: You should set the user PIN (1), Admin PIN (3), and the Reset Code (4). Please make sure to record and store these in a safe place -- especially the Admin PIN and the Reset Code (which allows you to completely wipe -the smartcard). You so rarely need to use the Admin PIN, that you will +the smartcard). You so rarely need to use the Admin PIN, that you will inevitably forget what it is if you do not record it. Getting back to the main card menu, you can also set other values (such @@ -494,6 +501,12 @@ additionally leak information about your smartcard should you lose it. Despite having the name "PIN", neither the user PIN nor the admin PIN on the card need to be numbers. +.. warning:: + + Some devices may require that you move the subkeys onto the device + before you can change the passphrase. Please check the documentation + provided by the device manufacturer. 
+ Move the subkeys to your smartcard ---------------------------------- @@ -655,6 +668,20 @@ want to import these changes back into your regular working directory:: $ gpg --export | gpg --homedir ~/.gnupg --import $ unset GNUPGHOME +Using gpg-agent over ssh +~~~~~~~~~~~~~~~~~~~~~~~~ + +You can forward your gpg-agent over ssh if you need to sign tags or +commits on a remote system. Please refer to the instructions provided +on the GnuPG wiki: + +- `Agent Forwarding over SSH`_ + +It works more smoothly if you can modify the sshd server settings on the +remote end. + +.. _`Agent Forwarding over SSH`: https://wiki.gnupg.org/AgentForwarding + Using PGP with Git ================== @@ -692,6 +719,7 @@ should be used (``[fpr]`` is the fingerprint of your key):: tell git to always use it instead of the legacy ``gpg`` from version 1:: $ git config --global gpg.program gpg2 + $ git config --global gpgv.program gpgv2 How to work with signed tags ---------------------------- @@ -731,6 +759,13 @@ If you are verifying someone else's git tag, then you will need to import their PGP key. Please refer to the ":ref:`verify_identities`" section below. +.. note:: + + If you get "``gpg: Can't check signature: unknown pubkey + algorithm``" error, you need to tell git to use gpgv2 for + verification, so it properly processes signatures made by ECC keys. + See instructions at the start of this section. + Configure git to always sign annotated tags ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From 32fb7ef69a9f1e3c8ec18a174fbc474b90ee645e Mon Sep 17 00:00:00 2001 From: Steffen Maier Date: Fri, 13 Apr 2018 17:39:15 +0200 Subject: [PATCH 003/103] Documentation: ftrace: clarify filters with dynamic ftrace and graph I fell into the trap of having set up function tracer with a very limited filter and then switched over to function_graph and was erroneously wondering why the latter did not trace what I expected, which was the full unabridged graph recursion. 
Signed-off-by: Steffen Maier Reviewed-by: Steven Rostedt (VMware) Cc: Ingo Molnar Signed-off-by: Jonathan Corbet --- Documentation/trace/ftrace.rst | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index e45f0786f3f9..9bbd3aefadb2 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -224,6 +224,8 @@ of ftrace. Here is a list of some of the key files: has a side effect of enabling or disabling specific functions to be traced. Echoing names of functions into this file will limit the trace to only those functions. + This influences the tracers "function" and "function_graph" + and thus also function profiling (see "function_profile_enabled"). The functions listed in "available_filter_functions" are what can be written into this file. @@ -265,6 +267,8 @@ of ftrace. Here is a list of some of the key files: Functions listed in this file will cause the function graph tracer to only trace these functions and the functions that they call. (See the section "dynamic ftrace" for more details). + Note, set_ftrace_filter and set_ftrace_notrace still affect + what functions are being traced. set_graph_notrace: @@ -277,7 +281,8 @@ of ftrace. Here is a list of some of the key files: This lists the functions that ftrace has processed and can trace. These are the function names that you can pass to - "set_ftrace_filter" or "set_ftrace_notrace". + "set_ftrace_filter", "set_ftrace_notrace", + "set_graph_function", or "set_graph_notrace". (See the section "dynamic ftrace" below for more details.) 
dyn_ftrace_total_info: From 438b8e24d1766cce644dc2a909fd925e2fcb4d9e Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:17 +0200 Subject: [PATCH 004/103] docs/vm: active_mm.txt convert to ReST format Just add a label for cross-referencing and indent the text to make it ``literal`` Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/active_mm.txt | 138 +++++++++++++++++---------------- 1 file changed, 73 insertions(+), 65 deletions(-) diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.txt index dbf45817405f..c84471b180f8 100644 --- a/Documentation/vm/active_mm.txt +++ b/Documentation/vm/active_mm.txt @@ -1,83 +1,91 @@ -List: linux-kernel -Subject: Re: active_mm -From: Linus Torvalds -Date: 1999-07-30 21:36:24 +.. _active_mm: -Cc'd to linux-kernel, because I don't write explanations all that often, -and when I do I feel better about more people reading them. +========= +Active MM +========= -On Fri, 30 Jul 1999, David Mosberger wrote: -> -> Is there a brief description someplace on how "mm" vs. "active_mm" in -> the task_struct are supposed to be used? (My apologies if this was -> discussed on the mailing lists---I just returned from vacation and -> wasn't able to follow linux-kernel for a while). +:: -Basically, the new setup is: + List: linux-kernel + Subject: Re: active_mm + From: Linus Torvalds + Date: 1999-07-30 21:36:24 - - we have "real address spaces" and "anonymous address spaces". The - difference is that an anonymous address space doesn't care about the - user-level page tables at all, so when we do a context switch into an - anonymous address space we just leave the previous address space - active. + Cc'd to linux-kernel, because I don't write explanations all that often, + and when I do I feel better about more people reading them. 
- The obvious use for a "anonymous address space" is any thread that - doesn't need any user mappings - all kernel threads basically fall into - this category, but even "real" threads can temporarily say that for - some amount of time they are not going to be interested in user space, - and that the scheduler might as well try to avoid wasting time on - switching the VM state around. Currently only the old-style bdflush - sync does that. + On Fri, 30 Jul 1999, David Mosberger wrote: + > + > Is there a brief description someplace on how "mm" vs. "active_mm" in + > the task_struct are supposed to be used? (My apologies if this was + > discussed on the mailing lists---I just returned from vacation and + > wasn't able to follow linux-kernel for a while). - - "tsk->mm" points to the "real address space". For an anonymous process, - tsk->mm will be NULL, for the logical reason that an anonymous process - really doesn't _have_ a real address space at all. + Basically, the new setup is: - - however, we obviously need to keep track of which address space we - "stole" for such an anonymous user. For that, we have "tsk->active_mm", - which shows what the currently active address space is. + - we have "real address spaces" and "anonymous address spaces". The + difference is that an anonymous address space doesn't care about the + user-level page tables at all, so when we do a context switch into an + anonymous address space we just leave the previous address space + active. - The rule is that for a process with a real address space (ie tsk->mm is - non-NULL) the active_mm obviously always has to be the same as the real - one. 
+ The obvious use for a "anonymous address space" is any thread that + doesn't need any user mappings - all kernel threads basically fall into + this category, but even "real" threads can temporarily say that for + some amount of time they are not going to be interested in user space, + and that the scheduler might as well try to avoid wasting time on + switching the VM state around. Currently only the old-style bdflush + sync does that. - For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the - "borrowed" mm while the anonymous process is running. When the - anonymous process gets scheduled away, the borrowed address space is - returned and cleared. + - "tsk->mm" points to the "real address space". For an anonymous process, + tsk->mm will be NULL, for the logical reason that an anonymous process + really doesn't _have_ a real address space at all. -To support all that, the "struct mm_struct" now has two counters: a -"mm_users" counter that is how many "real address space users" there are, -and a "mm_count" counter that is the number of "lazy" users (ie anonymous -users) plus one if there are any real users. + - however, we obviously need to keep track of which address space we + "stole" for such an anonymous user. For that, we have "tsk->active_mm", + which shows what the currently active address space is. -Usually there is at least one real user, but it could be that the real -user exited on another CPU while a lazy user was still active, so you do -actually get cases where you have a address space that is _only_ used by -lazy users. That is often a short-lived state, because once that thread -gets scheduled away in favour of a real thread, the "zombie" mm gets -released because "mm_users" becomes zero. + The rule is that for a process with a real address space (ie tsk->mm is + non-NULL) the active_mm obviously always has to be the same as the real + one. -Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any -more. 
"init_mm" should be considered just a "lazy context when no other -context is available", and in fact it is mainly used just at bootup when -no real VM has yet been created. So code that used to check + For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the + "borrowed" mm while the anonymous process is running. When the + anonymous process gets scheduled away, the borrowed address space is + returned and cleared. - if (current->mm == &init_mm) + To support all that, the "struct mm_struct" now has two counters: a + "mm_users" counter that is how many "real address space users" there are, + and a "mm_count" counter that is the number of "lazy" users (ie anonymous + users) plus one if there are any real users. -should generally just do + Usually there is at least one real user, but it could be that the real + user exited on another CPU while a lazy user was still active, so you do + actually get cases where you have a address space that is _only_ used by + lazy users. That is often a short-lived state, because once that thread + gets scheduled away in favour of a real thread, the "zombie" mm gets + released because "mm_users" becomes zero. - if (!current->mm) + Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any + more. "init_mm" should be considered just a "lazy context when no other + context is available", and in fact it is mainly used just at bootup when + no real VM has yet been created. So code that used to check -instead (which makes more sense anyway - the test is basically one of "do -we have a user context", and is generally done by the page fault handler -and things like that). 
+ if (current->mm == &init_mm) -Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago, -because it slightly changes the interfaces to accommodate the alpha (who -would have thought it, but the alpha actually ends up having one of the -ugliest context switch codes - unlike the other architectures where the MM -and register state is separate, the alpha PALcode joins the two, and you -need to switch both together). + should generally just do -(From http://marc.info/?l=linux-kernel&m=93337278602211&w=2) + if (!current->mm) + + instead (which makes more sense anyway - the test is basically one of "do + we have a user context", and is generally done by the page fault handler + and things like that). + + Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago, + because it slightly changes the interfaces to accommodate the alpha (who + would have thought it, but the alpha actually ends up having one of the + ugliest context switch codes - unlike the other architectures where the MM + and register state is separate, the alpha PALcode joins the two, and you + need to switch both together). + + (From http://marc.info/?l=linux-kernel&m=93337278602211&w=2) From d04f9f5a78b836cc51f8000e2049f2709c0b61f6 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:18 +0200 Subject: [PATCH 005/103] docs/vm: balance: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/balance | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/Documentation/vm/balance b/Documentation/vm/balance index 964595481af6..6a1fadf3e173 100644 --- a/Documentation/vm/balance +++ b/Documentation/vm/balance @@ -1,3 +1,9 @@ +.. _balance: + +================ +Memory Balancing +================ + Started Jan 2000 by Kanoj Sarcar Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as @@ -62,11 +68,11 @@ for non-sleepable allocations. 
Second, the HIGHMEM zone is also balanced, so as to give a fighting chance for replace_with_highmem() to get a HIGHMEM page, as well as to ensure that HIGHMEM allocations do not fall back into regular zone. This also makes sure that HIGHMEM pages -are not leaked (for example, in situations where a HIGHMEM page is in +are not leaked (for example, in situations where a HIGHMEM page is in the swapcache but is not being used by anyone) kswapd also needs to know about the zones it should balance. kswapd is -primarily needed in a situation where balancing can not be done, +primarily needed in a situation where balancing can not be done, probably because all allocation requests are coming from intr context and all process contexts are sleeping. For 2.3, kswapd does not really need to balance the highmem zone, since intr context does not request @@ -89,7 +95,8 @@ pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set. (Good) Ideas that I have heard: + 1. Dynamic experience should influence balancing: number of failed requests -for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net) + for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net) 2. Implement a replace_with_highmem()-like replace_with_regular() to preserve -dma pages. (lkd@tantalophile.demon.co.uk) + dma pages. (lkd@tantalophile.demon.co.uk) From 5ef829e056c82579329ccec67a6f5fda2f724dc7 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:19 +0200 Subject: [PATCH 006/103] docs/vm: cleancache.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/cleancache.txt | 105 +++++++++++++++++++------------- 1 file changed, 62 insertions(+), 43 deletions(-) diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt index e4b49df7a048..68cba9131c31 100644 --- a/Documentation/vm/cleancache.txt +++ b/Documentation/vm/cleancache.txt @@ -1,4 +1,11 @@ -MOTIVATION +.. 
_cleancache: + +========== +Cleancache +========== + +Motivation +========== Cleancache is a new optional feature provided by the VFS layer that potentially dramatically increases page cache effectiveness for @@ -21,9 +28,10 @@ Transcendent memory "drivers" for cleancache are currently implemented in Xen (using hypervisor memory) and zcache (using in-kernel compressed memory) and other implementations are in development. -FAQs are included below. +:ref:`FAQs <faq>` are included below. -IMPLEMENTATION OVERVIEW +Implementation Overview +======================= A cleancache "backend" that provides transcendent memory registers itself to the kernel's cleancache "frontend" by calling cleancache_register_ops, @@ -80,22 +88,33 @@ different Linux threads are simultaneously putting and invalidating a page with the same handle, the results are indeterminate. Callers must lock the page to ensure serial behavior. -CLEANCACHE PERFORMANCE METRICS +Cleancache Performance Metrics +============================== If properly configured, monitoring of cleancache is done via debugfs in -the /sys/kernel/debug/cleancache directory. The effectiveness of cleancache +the `/sys/kernel/debug/cleancache` directory. The effectiveness of cleancache can be measured (across all filesystems) with: -succ_gets - number of gets that were successful -failed_gets - number of gets that failed -puts - number of puts attempted (all "succeed") -invalidates - number of invalidates attempted +``succ_gets`` + number of gets that were successful + +``failed_gets`` + number of gets that failed + +``puts`` + number of puts attempted (all "succeed") + +``invalidates`` + number of invalidates attempted A backend implementation may provide additional metrics. -FAQ +.. _faq: -1) Where's the value? 
(Andrew Morton) Cleancache provides a significant performance benefit to many workloads in many environments with negligible overhead by improving the @@ -137,8 +156,8 @@ device that stores pages of data in a compressed state. And the proposed "RAMster" driver shares RAM across multiple physical systems. -2) Why does cleancache have its sticky fingers so deep inside the - filesystems and VFS? (Andrew Morton and Christoph Hellwig) +* Why does cleancache have its sticky fingers so deep inside the + filesystems and VFS? (Andrew Morton and Christoph Hellwig) The core hooks for cleancache in VFS are in most cases a single line and the minimum set are placed precisely where needed to maintain @@ -168,9 +187,9 @@ filesystems in the future. The total impact of the hooks to existing fs and mm files is only about 40 lines added (not counting comments and blank lines). -3) Why not make cleancache asynchronous and batched so it can - more easily interface with real devices with DMA instead - of copying each individual page? (Minchan Kim) +* Why not make cleancache asynchronous and batched so it can more + easily interface with real devices with DMA instead of copying each + individual page? (Minchan Kim) The one-page-at-a-time copy semantics simplifies the implementation on both the frontend and backend and also allows the backend to @@ -182,8 +201,8 @@ are avoided. While the interface seems odd for a "real device" or for real kernel-addressable RAM, it makes perfect sense for transcendent memory. -4) Why is non-shared cleancache "exclusive"? And where is the - page "invalidated" after a "get"? (Minchan Kim) +* Why is non-shared cleancache "exclusive"? And where is the + page "invalidated" after a "get"? (Minchan Kim) The main reason is to free up space in transcendent memory and to avoid unnecessary cleancache_invalidate calls. If you want inclusive, @@ -193,7 +212,7 @@ be easily extended to add a "get_no_invalidate" call. 
The invalidate is done by the cleancache backend implementation. -5) What's the performance impact? +* What's the performance impact? Performance analysis has been presented at OLS'09 and LCA'10. Briefly, performance gains can be significant on most workloads, @@ -206,7 +225,7 @@ single-core systems with slow memory-copy speeds, cleancache has little value, but in newer multicore machines, especially consolidated/virtualized machines, it has great value. -6) How do I add cleancache support for filesystem X? (Boaz Harrash) +* How do I add cleancache support for filesystem X? (Boaz Harrash) Filesystems that are well-behaved and conform to certain restrictions can utilize cleancache simply by making a call to @@ -217,26 +236,26 @@ not enable the optional cleancache. Some points for a filesystem to consider: -- The FS should be block-device-based (e.g. a ram-based FS such - as tmpfs should not enable cleancache) -- To ensure coherency/correctness, the FS must ensure that all - file removal or truncation operations either go through VFS or - add hooks to do the equivalent cleancache "invalidate" operations -- To ensure coherency/correctness, either inode numbers must - be unique across the lifetime of the on-disk file OR the - FS must provide an "encode_fh" function. -- The FS must call the VFS superblock alloc and deactivate routines - or add hooks to do the equivalent cleancache calls done there. -- To maximize performance, all pages fetched from the FS should - go through the do_mpag_readpage routine or the FS should add - hooks to do the equivalent (cf. btrfs) -- Currently, the FS blocksize must be the same as PAGESIZE. This - is not an architectural restriction, but no backends currently - support anything different. -- A clustered FS should invoke the "shared_init_fs" cleancache - hook to get best performance for some backends. + - The FS should be block-device-based (e.g. 
a ram-based FS such + as tmpfs should not enable cleancache) + - To ensure coherency/correctness, the FS must ensure that all + file removal or truncation operations either go through VFS or + add hooks to do the equivalent cleancache "invalidate" operations + - To ensure coherency/correctness, either inode numbers must + be unique across the lifetime of the on-disk file OR the + FS must provide an "encode_fh" function. + - The FS must call the VFS superblock alloc and deactivate routines + or add hooks to do the equivalent cleancache calls done there. + - To maximize performance, all pages fetched from the FS should + go through the do_mpag_readpage routine or the FS should add + hooks to do the equivalent (cf. btrfs) + - Currently, the FS blocksize must be the same as PAGESIZE. This + is not an architectural restriction, but no backends currently + support anything different. + - A clustered FS should invoke the "shared_init_fs" cleancache + hook to get best performance for some backends. -7) Why not use the KVA of the inode as the key? (Christoph Hellwig) +* Why not use the KVA of the inode as the key? (Christoph Hellwig) If cleancache would use the inode virtual address instead of inode/filehandle, the pool id could be eliminated. But, this @@ -251,7 +270,7 @@ of cleancache would be lost because the cache of pages in cleanache is potentially much larger than the kernel pagecache and is most useful if the pages survive inode cache removal. -8) Why is a global variable required? +* Why is a global variable required? The cleancache_enabled flag is checked in all of the frequently-used cleancache hooks. The alternative is a function call to check a static @@ -262,14 +281,14 @@ global variable allows cleancache to be enabled by default at compile time, but have insignificant performance impact when cleancache remains disabled at runtime. -9) Does cleanache work with KVM? +* Does cleanache work with KVM? 
The memory model of KVM is sufficiently different that a cleancache backend may have less value for KVM. This remains to be tested, especially in an overcommitted system. -10) Does cleancache work in userspace? It sounds useful for - memory hungry caches like web browsers. (Jamie Lokier) +* Does cleancache work in userspace? It sounds useful for + memory hungry caches like web browsers. (Jamie Lokier) No plans yet, though we agree it sounds useful, at least for apps that bypass the page cache (e.g. O_DIRECT). From 76b387bd3c4873d1420868260bc49978406276ea Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:20 +0200 Subject: [PATCH 007/103] docs/vm: frontswap.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/frontswap.txt | 59 +++++++++++++++++++++------------- 1 file changed, 37 insertions(+), 22 deletions(-) diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.txt index c71a019be600..1979f430c1c5 100644 --- a/Documentation/vm/frontswap.txt +++ b/Documentation/vm/frontswap.txt @@ -1,13 +1,20 @@ +.. _frontswap: + +========= +Frontswap +========= + Frontswap provides a "transcendent memory" interface for swap pages. In some environments, dramatic performance savings may be obtained because swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. -(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" +(Note, frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends" and the only necessary changes to the core kernel for transcendent memory; all other supporting code -- the "backends" -- is implemented as drivers. -See the LWN.net article "Transcendent memory in a nutshell" for a detailed -overview of frontswap and related kernel parts: -https://lwn.net/Articles/454795/ ) +See the LWN.net article `Transcendent memory in a nutshell`_ +for a detailed overview of frontswap and related kernel parts) + +.. 
_Transcendent memory in a nutshell: https://lwn.net/Articles/454795/ Frontswap is so named because it can be thought of as the opposite of a "backing" store for a swap device. The storage is assumed to be @@ -50,19 +57,27 @@ or the store fails AND the page is invalidated. This ensures stale data may never be obtained from frontswap. If properly configured, monitoring of frontswap is done via debugfs in -the /sys/kernel/debug/frontswap directory. The effectiveness of +the `/sys/kernel/debug/frontswap` directory. The effectiveness of frontswap can be measured (across all swap devices) with: -failed_stores - how many store attempts have failed -loads - how many loads were attempted (all should succeed) -succ_stores - how many store attempts have succeeded -invalidates - how many invalidates were attempted +``failed_stores`` + how many store attempts have failed + +``loads`` + how many loads were attempted (all should succeed) + +``succ_stores`` + how many store attempts have succeeded + +``invalidates`` + how many invalidates were attempted A backend implementation may provide additional metrics. FAQ +=== -1) Where's the value? +* Where's the value? When a workload starts swapping, performance falls through the floor. Frontswap significantly increases performance in many such workloads by @@ -117,8 +132,8 @@ A KVM implementation is underway and has been RFC'ed to lkml. And, using frontswap, investigation is also underway on the use of NVM as a memory extension technology. -2) Sure there may be performance advantages in some situations, but - what's the space/time overhead of frontswap? +* Sure there may be performance advantages in some situations, but + what's the space/time overhead of frontswap? If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into nothingness and the only overhead is a few extra bytes per swapon'ed @@ -148,8 +163,8 @@ pressure that can potentially outweigh the other advantages. 
A backend, such as zcache, must implement policies to carefully (but dynamically) manage memory limits to ensure this doesn't happen. -3) OK, how about a quick overview of what this frontswap patch does - in terms that a kernel hacker can grok? +* OK, how about a quick overview of what this frontswap patch does + in terms that a kernel hacker can grok? Let's assume that a frontswap "backend" has registered during kernel initialization; this registration indicates that this @@ -188,9 +203,9 @@ and (potentially) a swap device write are replaced by a "frontswap backend store" and (possibly) a "frontswap backend loads", which are presumably much faster. -4) Can't frontswap be configured as a "special" swap device that is - just higher priority than any real swap device (e.g. like zswap, - or maybe swap-over-nbd/NFS)? +* Can't frontswap be configured as a "special" swap device that is + just higher priority than any real swap device (e.g. like zswap, + or maybe swap-over-nbd/NFS)? No. First, the existing swap subsystem doesn't allow for any kind of swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, @@ -240,9 +255,9 @@ installation, frontswap is useless. Swapless portable devices can still use frontswap but a backend for such devices must configure some kind of "ghost" swap device and ensure that it is never used. -5) Why this weird definition about "duplicate stores"? If a page - has been previously successfully stored, can't it always be - successfully overwritten? +* Why this weird definition about "duplicate stores"? If a page + has been previously successfully stored, can't it always be + successfully overwritten? Nearly always it can, but no, sometimes it cannot. Consider an example where data is compressed and the original 4K page has been compressed @@ -254,7 +269,7 @@ the old data and ensure that it is no longer accessible. 
Since the swap subsystem then writes the new data to the read swap device, this is the correct course of action to ensure coherency. -6) What is frontswap_shrink for? +* What is frontswap_shrink for? When the (non-frontswap) swap subsystem swaps out a page to a real swap device, that page is only taking up low-value pre-allocated disk @@ -267,7 +282,7 @@ to "repatriate" pages sent to a remote machine back to the local machine; this is driven using the frontswap_shrink mechanism when memory pressure subsides. -7) Why does the frontswap patch create the new include file swapfile.h? +* Why does the frontswap patch create the new include file swapfile.h? The frontswap code depends on some swap-subsystem-internal data structures that have, over the years, moved back and forth between From eeb8a6426ec04740058447b111db1c5fc455a4a0 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:21 +0200 Subject: [PATCH 008/103] docs/vm: highmem.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/highmem.txt | 87 +++++++++++++++--------------------- 1 file changed, 36 insertions(+), 51 deletions(-) diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.txt index 4324d24ffacd..0f69a9fec34d 100644 --- a/Documentation/vm/highmem.txt +++ b/Documentation/vm/highmem.txt @@ -1,25 +1,14 @@ +.. _highmem: - ==================== - HIGH MEMORY HANDLING - ==================== +==================== +High Memory Handling +==================== By: Peter Zijlstra -Contents: +.. contents:: :local: - (*) What is high memory? - - (*) Temporary virtual mappings. - - (*) Using kmap_atomic. - - (*) Cost of temporary mappings. - - (*) i386 PAE. - - -==================== -WHAT IS HIGH MEMORY? +What Is High Memory? ==================== High memory (highmem) is used when the size of physical memory approaches or @@ -38,7 +27,7 @@ kernel entry/exit. 
This means the available virtual memory space (4GiB on i386) has to be divided between user and kernel space. The traditional split for architectures using this approach is 3:1, 3GiB for -userspace and the top 1GiB for kernel space: +userspace and the top 1GiB for kernel space:: +--------+ 0xffffffff | Kernel | @@ -58,40 +47,38 @@ and user maps. Some hardware (like some ARMs), however, have limited virtual space when they use mm context tags. -========================== -TEMPORARY VIRTUAL MAPPINGS +Temporary Virtual Mappings ========================== The kernel contains several ways of creating temporary mappings: - (*) vmap(). This can be used to make a long duration mapping of multiple - physical pages into a contiguous virtual space. It needs global - synchronization to unmap. +* vmap(). This can be used to make a long duration mapping of multiple + physical pages into a contiguous virtual space. It needs global + synchronization to unmap. - (*) kmap(). This permits a short duration mapping of a single page. It needs - global synchronization, but is amortized somewhat. It is also prone to - deadlocks when using in a nested fashion, and so it is not recommended for - new code. +* kmap(). This permits a short duration mapping of a single page. It needs + global synchronization, but is amortized somewhat. It is also prone to + deadlocks when using in a nested fashion, and so it is not recommended for + new code. - (*) kmap_atomic(). This permits a very short duration mapping of a single - page. Since the mapping is restricted to the CPU that issued it, it - performs well, but the issuing task is therefore required to stay on that - CPU until it has finished, lest some other task displace its mappings. +* kmap_atomic(). This permits a very short duration mapping of a single + page. 
Since the mapping is restricted to the CPU that issued it, it + performs well, but the issuing task is therefore required to stay on that + CPU until it has finished, lest some other task displace its mappings. - kmap_atomic() may also be used by interrupt contexts, since it is does not - sleep and the caller may not sleep until after kunmap_atomic() is called. + kmap_atomic() may also be used by interrupt contexts, since it is does not + sleep and the caller may not sleep until after kunmap_atomic() is called. - It may be assumed that k[un]map_atomic() won't fail. + It may be assumed that k[un]map_atomic() won't fail. -================= -USING KMAP_ATOMIC +Using kmap_atomic ================= When and where to use kmap_atomic() is straightforward. It is used when code wants to access the contents of a page that might be allocated from high memory (see __GFP_HIGHMEM), for example a page in the pagecache. The API has two -functions, and they can be used in a manner similar to the following: +functions, and they can be used in a manner similar to the following:: /* Find the page of interest. */ struct page *page = find_get_page(mapping, offset); @@ -109,7 +96,7 @@ Note that the kunmap_atomic() call takes the result of the kmap_atomic() call not the argument. If you need to map two pages because you want to copy from one page to -another you need to keep the kmap_atomic calls strictly nested, like: +another you need to keep the kmap_atomic calls strictly nested, like:: vaddr1 = kmap_atomic(page1); vaddr2 = kmap_atomic(page2); @@ -120,8 +107,7 @@ another you need to keep the kmap_atomic calls strictly nested, like: kunmap_atomic(vaddr1); -========================== -COST OF TEMPORARY MAPPINGS +Cost of Temporary Mappings ========================== The cost of creating temporary mappings can be quite high. The arch has to @@ -136,25 +122,24 @@ If CONFIG_MMU is not set, then there can be no temporary mappings and no highmem. 
In such a case, the arithmetic approach will also be used. -======== i386 PAE ======== The i386 arch, under some circumstances, will permit you to stick up to 64GiB of RAM into your 32-bit machine. This has a number of consequences: - (*) Linux needs a page-frame structure for each page in the system and the - pageframes need to live in the permanent mapping, which means: +* Linux needs a page-frame structure for each page in the system and the + pageframes need to live in the permanent mapping, which means: - (*) you can have 896M/sizeof(struct page) page-frames at most; with struct - page being 32-bytes that would end up being something in the order of 112G - worth of pages; the kernel, however, needs to store more than just - page-frames in that memory... +* you can have 896M/sizeof(struct page) page-frames at most; with struct + page being 32-bytes that would end up being something in the order of 112G + worth of pages; the kernel, however, needs to store more than just + page-frames in that memory... - (*) PAE makes your page tables larger - which slows the system down as more - data has to be accessed to traverse in TLB fills and the like. One - advantage is that PAE has more PTE bits and can provide advanced features - like NX and PAT. +* PAE makes your page tables larger - which slows the system down as more + data has to be accessed to traverse in TLB fills and the like. One + advantage is that PAE has more PTE bits and can provide advanced features + like NX and PAT. 
The general recommendation is that you don't use more than 8GiB on a 32-bit machine - although more might work for you and your workload, you're pretty From aa9f34e5da6b48744190156d8eca084f65a5e55a Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:22 +0200 Subject: [PATCH 009/103] docs/vm: hmm.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/hmm.txt | 66 +++++++++++++++++----------------------- 1 file changed, 28 insertions(+), 38 deletions(-) diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt index 4d3aac9f4a5d..3fafa3381730 100644 --- a/Documentation/vm/hmm.txt +++ b/Documentation/vm/hmm.txt @@ -1,4 +1,8 @@ +.. hmm: + +===================================== Heterogeneous Memory Management (HMM) +===================================== Transparently allow any component of a program to use any memory region of said program with a device without using device specific memory allocator. This is @@ -14,19 +18,10 @@ deals with how device memory is represented inside the kernel. Finaly the last section present the new migration helper that allow to leverage the device DMA engine. +.. contents:: :local: -1) Problems of using device specific memory allocator: -2) System bus, device memory characteristics -3) Share address space and migration -4) Address space mirroring implementation and API -5) Represent and manage device memory from core kernel point of view -6) Migrate to and from device memory -7) Memory cgroup (memcg) and rss accounting - - -------------------------------------------------------------------------------- - -1) Problems of using device specific memory allocator: +Problems of using device specific memory allocator +================================================== Device with large amount of on board memory (several giga bytes) like GPU have historically manage their memory through dedicated driver specific API. 
This @@ -68,9 +63,8 @@ only do-able with a share address. It is as well more reasonable to use a share address space for all the other patterns. -------------------------------------------------------------------------------- - -2) System bus, device memory characteristics +System bus, device memory characteristics +========================================= System bus cripple share address due to few limitations. Most system bus only allow basic memory access from device to main memory, even cache coherency is @@ -100,9 +94,8 @@ access any memory memory but we must also permit any memory to be migrated to device memory while device is using it (blocking CPU access while it happens). -------------------------------------------------------------------------------- - -3) Share address space and migration +Share address space and migration +================================= HMM intends to provide two main features. First one is to share the address space by duplication the CPU page table into the device page table so same @@ -140,14 +133,13 @@ leverage device memory by migrating part of data-set that is actively use by a device. -------------------------------------------------------------------------------- - -4) Address space mirroring implementation and API +Address space mirroring implementation and API +============================================== Address space mirroring main objective is to allow to duplicate range of CPU page table into a device page table and HMM helps keeping both synchronize. A device driver that want to mirror a process address space must start with the -registration of an hmm_mirror struct: +registration of an hmm_mirror struct:: int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm); @@ -156,7 +148,7 @@ registration of an hmm_mirror struct: The locked variant is to be use when the driver is already holding the mmap_sem of the mm in write mode. 
The mirror struct has a set of callback that are use -to propagate CPU page table: +to propagate CPU page table:: struct hmm_mirror_ops { /* sync_cpu_device_pagetables() - synchronize page tables @@ -187,7 +179,8 @@ be done with the update. When device driver wants to populate a range of virtual address it can use -either: +either:: + int hmm_vma_get_pfns(struct vm_area_struct *vma, struct hmm_range *range, unsigned long start, @@ -211,7 +204,7 @@ that array correspond to an address in the virtual range. HMM provide a set of flags to help driver identify special CPU page table entries. Locking with the update() callback is the most important aspect the driver must -respect in order to keep things properly synchronize. The usage pattern is : +respect in order to keep things properly synchronize. The usage pattern is:: int driver_populate_range(...) { @@ -251,9 +244,8 @@ concurrently for multiple devices. Waiting for each device to report commands as executed is serialize (there is no point in doing this concurrently). -------------------------------------------------------------------------------- - -5) Represent and manage device memory from core kernel point of view +Represent and manage device memory from core kernel point of view +================================================================= Several differents design were try to support device memory. First one use device specific data structure to keep information about migrated memory and @@ -269,14 +261,14 @@ un-aware of the difference. We only need to make sure that no one ever try to map those page from the CPU side. HMM provide a set of helpers to register and hotplug device memory as a new -region needing struct page. This is offer through a very simple API: +region needing struct page. 
This is offer through a very simple API:: struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, struct device *device, unsigned long size); void hmm_devmem_remove(struct hmm_devmem *devmem); -The hmm_devmem_ops is where most of the important things are: +The hmm_devmem_ops is where most of the important things are:: struct hmm_devmem_ops { void (*free)(struct hmm_devmem *devmem, struct page *page); @@ -294,13 +286,12 @@ second callback happens whenever CPU try to access a device page which it can not do. This second callback must trigger a migration back to system memory. -------------------------------------------------------------------------------- - -6) Migrate to and from device memory +Migrate to and from device memory +================================= Because CPU can not access device memory, migration must use device DMA engine to perform copy from and to device memory. For this we need a new migration -helper: +helper:: int migrate_vma(const struct migrate_vma_ops *ops, struct vm_area_struct *vma, @@ -319,7 +310,7 @@ such migration base on range of address the device is actively accessing. The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy()) control destination memory allocation and copy operation. Second one is there -to allow device driver to perform cleanup operation after migration. +to allow device driver to perform cleanup operation after migration:: struct migrate_vma_ops { void (*alloc_and_copy)(struct vm_area_struct *vma, @@ -353,9 +344,8 @@ bandwidth but this is considered as a rare event and a price that we are willing to pay to keep all the code simpler. 
-------------------------------------------------------------------------------- - -7) Memory cgroup (memcg) and rss accounting +Memory cgroup (memcg) and rss accounting +======================================== For now device memory is accounted as any regular page in rss counters (either anonymous if device page is use for anonymous, file if device page is use for From 148723f711d1a6b9b9d66bc54b41f3a7f1db9776 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:23 +0200 Subject: [PATCH 010/103] docs/vm: hugetlbpage.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/hugetlbpage.txt | 237 ++++++++++++++++++------------- 1 file changed, 136 insertions(+), 101 deletions(-) diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index faf077d50d42..3bb0d991f102 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -1,3 +1,11 @@ +.. _hugetlbpage: + +============= +HugeTLB Pages +============= + +Overview +======== The intent of this file is to give a brief summary of hugetlbpage support in the Linux kernel. This support is built on top of multiple page size support @@ -18,53 +26,59 @@ First the Linux kernel needs to be built with the CONFIG_HUGETLBFS automatically when CONFIG_HUGETLBFS is selected) configuration options. -The /proc/meminfo file provides information about the total number of +The ``/proc/meminfo`` file provides information about the total number of persistent hugetlb pages in the kernel's huge page pool. It also displays default huge page size and information about the number of free, reserved and surplus huge pages in the pool of huge pages of default size. The huge page size is needed for generating the proper alignment and size of the arguments to system calls that map huge page regions. -The output of "cat /proc/meminfo" will include lines like: +The output of ``cat /proc/meminfo`` will include lines like:: -..... 
-HugePages_Total: uuu -HugePages_Free: vvv -HugePages_Rsvd: www -HugePages_Surp: xxx -Hugepagesize: yyy kB -Hugetlb: zzz kB + HugePages_Total: uuu + HugePages_Free: vvv + HugePages_Rsvd: www + HugePages_Surp: xxx + Hugepagesize: yyy kB + Hugetlb: zzz kB where: -HugePages_Total is the size of the pool of huge pages. -HugePages_Free is the number of huge pages in the pool that are not yet - allocated. -HugePages_Rsvd is short for "reserved," and is the number of huge pages for - which a commitment to allocate from the pool has been made, - but no allocation has yet been made. Reserved huge pages - guarantee that an application will be able to allocate a - huge page from the pool of huge pages at fault time. -HugePages_Surp is short for "surplus," and is the number of huge pages in - the pool above the value in /proc/sys/vm/nr_hugepages. The - maximum number of surplus huge pages is controlled by - /proc/sys/vm/nr_overcommit_hugepages. -Hugepagesize is the default hugepage size (in Kb). -Hugetlb is the total amount of memory (in kB), consumed by huge - pages of all sizes. - If huge pages of different sizes are in use, this number - will exceed HugePages_Total * Hugepagesize. To get more - detailed information, please, refer to - /sys/kernel/mm/hugepages (described below). + +HugePages_Total + is the size of the pool of huge pages. +HugePages_Free + is the number of huge pages in the pool that are not yet + allocated. +HugePages_Rsvd + is short for "reserved," and is the number of huge pages for + which a commitment to allocate from the pool has been made, + but no allocation has yet been made. Reserved huge pages + guarantee that an application will be able to allocate a + huge page from the pool of huge pages at fault time. +HugePages_Surp + is short for "surplus," and is the number of huge pages in + the pool above the value in ``/proc/sys/vm/nr_hugepages``. The + maximum number of surplus huge pages is controlled by + ``/proc/sys/vm/nr_overcommit_hugepages``. 
+Hugepagesize + is the default hugepage size (in Kb). +Hugetlb + is the total amount of memory (in kB), consumed by huge + pages of all sizes. + If huge pages of different sizes are in use, this number + will exceed HugePages_Total \* Hugepagesize. To get more + detailed information, please, refer to + ``/sys/kernel/mm/hugepages`` (described below). -/proc/filesystems should also show a filesystem of type "hugetlbfs" configured -in the kernel. +``/proc/filesystems`` should also show a filesystem of type "hugetlbfs" +configured in the kernel. -/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge +``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge pages in the kernel's huge page pool. "Persistent" huge pages will be returned to the huge page pool when freed by a task. A user with root privileges can dynamically allocate more or free some persistent huge pages -by increasing or decreasing the value of 'nr_hugepages'. +by increasing or decreasing the value of ``nr_hugepages``. Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under @@ -86,10 +100,10 @@ with a huge page size selection parameter "hugepagesz=". must be specified in bytes with optional scale suffix [kKmMgG]. The default huge page size may be selected with the "default_hugepagesz=" boot parameter. -When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages +When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages`` indicates the current number of pre-allocated huge pages of the default size. Thus, one can use the following command to dynamically allocate/deallocate -default sized persistent huge pages: +default sized persistent huge pages:: echo 20 > /proc/sys/vm/nr_hugepages @@ -98,7 +112,7 @@ huge page pool to 20, allocating or freeing huge pages, as required. 
On a NUMA platform, the kernel will attempt to distribute the huge page pool over all the set of allowed nodes specified by the NUMA memory policy of the -task that modifies nr_hugepages. The default for the allowed nodes--when the +task that modifies ``nr_hugepages``. The default for the allowed nodes--when the task has default memory policy--is all on-line nodes with memory. Allowed nodes with insufficient available, contiguous memory for a huge page will be silently skipped when allocating persistent huge pages. See the discussion @@ -117,51 +131,52 @@ init files. This will enable the kernel to allocate huge pages early in the boot process when the possibility of getting physical contiguous pages is still very high. Administrators can verify the number of huge pages actually allocated by checking the sysctl or meminfo. To check the per node -distribution of huge pages in a NUMA system, use: +distribution of huge pages in a NUMA system, use:: cat /sys/devices/system/node/node*/meminfo | fgrep Huge -/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of -huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are +``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of +huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are requested by applications. Writing any non-zero value into this file indicates that the hugetlb subsystem is allowed to try to obtain that number of "surplus" huge pages from the kernel's normal page pool, when the persistent huge page pool is exhausted. As these surplus huge pages become unused, they are freed back to the kernel's normal page pool. -When increasing the huge page pool size via nr_hugepages, any existing surplus -pages will first be promoted to persistent huge pages. Then, additional +When increasing the huge page pool size via ``nr_hugepages``, any existing +surplus pages will first be promoted to persistent huge pages. 
Then, additional huge pages will be allocated, if necessary and if possible, to fulfill the new persistent huge page pool size. The administrator may shrink the pool of persistent huge pages for -the default huge page size by setting the nr_hugepages sysctl to a +the default huge page size by setting the ``nr_hugepages`` sysctl to a smaller value. The kernel will attempt to balance the freeing of huge pages -across all nodes in the memory policy of the task modifying nr_hugepages. +across all nodes in the memory policy of the task modifying ``nr_hugepages``. Any free huge pages on the selected nodes will be freed back to the kernel's normal page pool. -Caveat: Shrinking the persistent huge page pool via nr_hugepages such that +Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that it becomes less than the number of huge pages in use will convert the balance of the in-use huge pages to surplus huge pages. This will occur even if the number of surplus pages it would exceed the overcommit value. As long as -this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is +this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is increased sufficiently, or the surplus huge pages go out of use and are freed-- no more surplus huge pages will be allowed to be allocated. With support for multiple huge page pools at run-time available, much of -the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs. -The /proc interfaces discussed above have been retained for backwards -compatibility. The root huge page control directory in sysfs is: +the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in +sysfs. +The ``/proc`` interfaces discussed above have been retained for backwards +compatibility. 
The root huge page control directory in sysfs is:: /sys/kernel/mm/hugepages For each huge page size supported by the running kernel, a subdirectory -will exist, of the form: +will exist, of the form:: hugepages-${size}kB -Inside each of these directories, the same set of files will exist: +Inside each of these directories, the same set of files will exist:: nr_hugepages nr_hugepages_mempolicy @@ -176,33 +191,33 @@ which function as described above for the default huge page-sized case. Interaction of Task Memory Policy with Huge Page Allocation/Freeing =================================================================== -Whether huge pages are allocated and freed via the /proc interface or -the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA -nodes from which huge pages are allocated or freed are controlled by the -NUMA memory policy of the task that modifies the nr_hugepages_mempolicy -sysctl or attribute. When the nr_hugepages attribute is used, mempolicy +Whether huge pages are allocated and freed via the ``/proc`` interface or +the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the +NUMA nodes from which huge pages are allocated or freed are controlled by the +NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy`` +sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy is ignored. 
The recommended method to allocate or free huge pages to/from the kernel -huge page pool, using the nr_hugepages example above, is: +huge page pool, using the ``nr_hugepages`` example above, is:: numactl --interleave <node-list> echo 20 \ >/proc/sys/vm/nr_hugepages_mempolicy -or, more succinctly: +or, more succinctly:: numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy -This will allocate or free abs(20 - nr_hugepages) to or from the nodes +This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes specified in <node-list>, depending on whether number of persistent huge pages is initially less than or greater than 20, respectively. No huge pages will be allocated nor freed on any node not included in the specified <node-list>. -When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any +When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any memory policy mode--bind, preferred, local or interleave--may be used. The resulting effect on persistent huge page allocation is as follows: -1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt], +#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified. However, if a node in the policy does not contain sufficient contiguous @@ -212,7 +227,7 @@ resulting effect on persistent huge page allocation is as follows: possibly, allocation of persistent huge pages on nodes not allowed by the task's memory policy. -2) One or more nodes may be specified with the bind or interleave policy. +#. One or more nodes may be specified with the bind or interleave policy. If more than one node is specified with the preferred policy, only the lowest numeric id will be used. Local policy will select the node where the task is running at the time the nodes_allowed mask is constructed.
@@ -222,20 +237,20 @@ resulting effect on persistent huge page allocation is as follows: indeterminate. Thus, local policy is not very useful for this purpose. Any of the other mempolicy modes may be used to specify a single node. -3) The nodes allowed mask will be derived from any non-default task mempolicy, +#. The nodes allowed mask will be derived from any non-default task mempolicy, whether this policy was set explicitly by the task itself or one of its ancestors, such as numactl. This means that if the task is invoked from a shell with non-default policy, that policy will be used. One can specify a node list of "all" with numactl --interleave or --membind [-m] to achieve interleaving over all nodes in the system or cpuset. -4) Any task mempolicy specified--e.g., using numactl--will be constrained by +#. Any task mempolicy specified--e.g., using numactl--will be constrained by the resource limits of any cpuset in which the task runs. Thus, there will be no way for a task with non-default policy running in a cpuset with a subset of the system nodes to allocate huge pages outside the cpuset without first moving to a cpuset that contains all of the desired nodes. -5) Boot-time huge page allocation attempts to distribute the requested number +#. Boot-time huge page allocation attempts to distribute the requested number of huge pages over all on-lines nodes with memory. Per Node Hugepages Attributes @@ -243,22 +258,22 @@ Per Node Hugepages Attributes A subset of the contents of the root huge page control directory in sysfs, described above, will be replicated under each the system device of each -NUMA node with memory in: +NUMA node with memory in:: /sys/devices/system/node/node[0-9]*/hugepages/ Under this directory, the subdirectory for each supported huge page size -contains the following attribute files: +contains the following attribute files:: nr_hugepages free_hugepages surplus_hugepages -The free_' and surplus_' attribute files are read-only. 
They return the number +The free\_' and surplus\_' attribute files are read-only. They return the number of free and surplus [overcommitted] huge pages, respectively, on the parent node. -The nr_hugepages attribute returns the total number of huge pages on the +The ``nr_hugepages`` attribute returns the total number of huge pages on the specified node. When this attribute is written, the number of persistent huge pages on the parent node will be adjusted to the specified value, if sufficient resources exist, regardless of the task's mempolicy or cpuset constraints. @@ -273,37 +288,51 @@ Using Huge Pages If the user applications are going to request huge pages using mmap system call, then it is required that system administrator mount a file system of -type hugetlbfs: +type hugetlbfs:: mount -t hugetlbfs \ -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\ min_size=<value>,nr_inodes=<value> none /mnt/huge This command mounts a (pseudo) filesystem of type hugetlbfs on the directory -/mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid -options sets the owner and group of the root of the file system. By default -the uid and gid of the current process are taken. The mode option sets the -mode of root of file system to value & 01777. This value is given in octal. -By default the value 0755 is picked. If the platform supports multiple huge -page sizes, the pagesize option can be used to specify the huge page size and -associated pool. pagesize is specified in bytes. If pagesize is not specified -the platform's default huge page size and associated pool will be used. The -size option sets the maximum value of memory (huge pages) allowed for that -filesystem (/mnt/huge). The size option can be specified in bytes, or as a -percentage of the specified huge page pool (nr_hugepages). The size is -rounded down to HPAGE_SIZE boundary. The min_size option sets the minimum -value of memory (huge pages) allowed for the filesystem.
min_size can be -specified in the same way as size, either bytes or a percentage of the -huge page pool. At mount time, the number of huge pages specified by -min_size are reserved for use by the filesystem. If there are not enough -free huge pages available, the mount will fail. As huge pages are allocated -to the filesystem and freed, the reserve count is adjusted so that the sum -of allocated and reserved huge pages is always at least min_size. The option -nr_inodes sets the maximum number of inodes that /mnt/huge can use. If the -size, min_size or nr_inodes option is not provided on command line then -no limits are set. For pagesize, size, min_size and nr_inodes options, you -can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K -has the same meaning as size=2048. +``/mnt/huge``. Any files created on ``/mnt/huge`` uses huge pages. + +The ``uid`` and ``gid`` options sets the owner and group of the root of the +file system. By default the ``uid`` and ``gid`` of the current process +are taken. + +The ``mode`` option sets the mode of root of file system to value & 01777. +This value is given in octal. By default the value 0755 is picked. + +If the platform supports multiple huge page sizes, the ``pagesize`` option can +be used to specify the huge page size and associated pool. ``pagesize`` +is specified in bytes. If ``pagesize`` is not specified the platform's +default huge page size and associated pool will be used. + +The ``size`` option sets the maximum value of memory (huge pages) allowed +for that filesystem (``/mnt/huge``). The ``size`` option can be specified +in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``). +The size is rounded down to HPAGE_SIZE boundary. + +The ``min_size`` option sets the minimum value of memory (huge pages) allowed +for the filesystem. ``min_size`` can be specified in the same way as ``size``, +either bytes or a percentage of the huge page pool. 
+At mount time, the number of huge pages specified by ``min_size`` are reserved +for use by the filesystem. +If there are not enough free huge pages available, the mount will fail. +As huge pages are allocated to the filesystem and freed, the reserve count +is adjusted so that the sum of allocated and reserved huge pages is always +at least ``min_size``. + +The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge`` +can use. + +If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on +command line then no limits are set. + +For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can +use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. +For example, size=2K has the same meaning as size=2048. While read system calls are supported on files that reside on hugetlb file systems, write system calls are not. @@ -313,12 +342,12 @@ used to change the file attributes on hugetlbfs. Also, it is important to note that no such mount command is required if applications are going to use only shmat/shmget system calls or mmap with -MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb -below. +MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see +:ref:`map_hugetlb <map_hugetlb>` below. Users who wish to use hugetlb memory via shared memory segment should be a member of a supplementary group and system admin needs to configure that gid -into /proc/sys/vm/hugetlb_shm_group. It is possible for same or different +into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different applications to use any combination of mmaps and shm* calls, though the mount of filesystem will be required for using mmap calls without MAP_HUGETLB. @@ -332,15 +361,21 @@ a hugetlb page and the length is smaller than the hugepage size. Examples ======== -1) map_hugetlb: see tools/testing/selftests/vm/map_hugetlb.c +..
_map_hugetlb: -2) hugepage-shm: see tools/testing/selftests/vm/hugepage-shm.c +``map_hugetlb`` + see tools/testing/selftests/vm/map_hugetlb.c -3) hugepage-mmap: see tools/testing/selftests/vm/hugepage-mmap.c +``hugepage-shm`` + see tools/testing/selftests/vm/hugepage-shm.c -4) The libhugetlbfs (https://github.com/libhugetlbfs/libhugetlbfs) library - provides a wide range of userspace tools to help with huge page usability, - environment setup, and control. +``hugepage-mmap`` + see tools/testing/selftests/vm/hugepage-mmap.c + +The `libhugetlbfs`_ library provides a wide range of userspace tools +to help with huge page usability, environment setup, and control. + +.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs Kernel development regression testing ===================================== From 88ececc23cc8b4b25ac0118df00b25c403ead428 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:24 +0200 Subject: [PATCH 011/103] docs/vm: hugetlbfs_reserv.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/hugetlbfs_reserv.txt | 212 ++++++++++++++++---------- 1 file changed, 135 insertions(+), 77 deletions(-) diff --git a/Documentation/vm/hugetlbfs_reserv.txt b/Documentation/vm/hugetlbfs_reserv.txt index 9aca09a76bed..36a87a2ea435 100644 --- a/Documentation/vm/hugetlbfs_reserv.txt +++ b/Documentation/vm/hugetlbfs_reserv.txt @@ -1,6 +1,13 @@ -Hugetlbfs Reservation Overview ------------------------------- -Huge pages as described at 'Documentation/vm/hugetlbpage.txt' are typically +.. _hugetlbfs_reserve: + +===================== +Hugetlbfs Reservation +===================== + +Overview +======== + +Huge pages as described at :ref:`hugetlbpage` are typically preallocated for application use. These huge pages are instantiated in a task's address space at page fault time if the VMA indicates huge pages are to be used. 
If no huge page exists at page fault time, the task is sent @@ -17,47 +24,55 @@ describe how huge page reserve processing is done in the v4.10 kernel. Audience --------- +======== This description is primarily targeted at kernel developers who are modifying hugetlbfs code. The Data Structures -------------------- +=================== + resv_huge_pages This is a global (per-hstate) count of reserved huge pages. Reserved huge pages are only available to the task which reserved them. Therefore, the number of huge pages generally available is computed - as (free_huge_pages - resv_huge_pages). + as (``free_huge_pages - resv_huge_pages``). Reserve Map - A reserve map is described by the structure: - struct resv_map { - struct kref refs; - spinlock_t lock; - struct list_head regions; - long adds_in_progress; - struct list_head region_cache; - long region_cache_count; - }; + A reserve map is described by the structure:: + + struct resv_map { + struct kref refs; + spinlock_t lock; + struct list_head regions; + long adds_in_progress; + struct list_head region_cache; + long region_cache_count; + }; + There is one reserve map for each huge page mapping in the system. The regions list within the resv_map describes the regions within - the mapping. A region is described as: - struct file_region { - struct list_head link; - long from; - long to; - }; + the mapping. A region is described as:: + + struct file_region { + struct list_head link; + long from; + long to; + }; + The 'from' and 'to' fields of the file region structure are huge page indices into the mapping. Depending on the type of mapping, a region in the reserv_map may indicate reservations exist for the range, or reservations do not exist. Flags for MAP_PRIVATE Reservations These are stored in the bottom bits of the reservation map pointer. - #define HPAGE_RESV_OWNER (1UL << 0) Indicates this task is the - owner of the reservations associated with the mapping. 
- #define HPAGE_RESV_UNMAPPED (1UL << 1) Indicates task originally - mapping this range (and creating reserves) has unmapped a - page from this task (the child) due to a failed COW. + + ``#define HPAGE_RESV_OWNER (1UL << 0)`` + Indicates this task is the owner of the reservations + associated with the mapping. + ``#define HPAGE_RESV_UNMAPPED (1UL << 1)`` + Indicates task originally mapping this range (and creating + reserves) has unmapped a page from this task (the child) + due to a failed COW. Page Flags The PagePrivate page flag is used to indicate that a huge page reservation must be restored when the huge page is freed. More @@ -65,12 +80,14 @@ Page Flags Reservation Map Location (Private or Shared) --------------------------------------------- +============================================ + A huge page mapping or segment is either private or shared. If private, it is typically only available to a single address space (task). If shared, it can be mapped into multiple address spaces (tasks). The location and semantics of the reservation map is significantly different for two types of mappings. Location differences are: + - For private mappings, the reservation map hangs off the the VMA structure. Specifically, vma->vm_private_data. This reserve map is created at the time the mapping (mmap(MAP_PRIVATE)) is created. @@ -82,15 +99,15 @@ of mappings. Location differences are: Creating Reservations ---------------------- +===================== Reservations are created when a huge page backed shared memory segment is created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB). 
-These operations result in a call to the routine hugetlb_reserve_pages() +These operations result in a call to the routine hugetlb_reserve_pages():: -int hugetlb_reserve_pages(struct inode *inode, - long from, long to, - struct vm_area_struct *vma, - vm_flags_t vm_flags) + int hugetlb_reserve_pages(struct inode *inode, + long from, long to, + struct vm_area_struct *vma, + vm_flags_t vm_flags) The first thing hugetlb_reserve_pages() does is check for the NORESERVE flag was specified in either the shmget() or mmap() call. If NORESERVE @@ -105,6 +122,7 @@ the 'from' and 'to' arguments have been adjusted by this offset. One of the big differences between PRIVATE and SHARED mappings is the way in which reservations are represented in the reservation map. + - For shared mappings, an entry in the reservation map indicates a reservation exists or did exist for the corresponding page. As reservations are consumed, the reservation map is not modified. @@ -121,12 +139,13 @@ to indicate this VMA owns the reservations. The reservation map is consulted to determine how many huge page reservations are needed for the current mapping/segment. For private mappings, this is always the value (to - from). However, for shared mappings it is possible that some reservations may already exist within the range (to - from). See the -section "Reservation Map Modifications" for details on how this is accomplished. +section :ref:`Reservation Map Modifications <resv_map_modifications>` +for details on how this is accomplished. The mapping may be associated with a subpool. If so, the subpool is consulted to ensure there is sufficient space for the mapping. It is possible that the subpool has set aside reservations that can be used for the mapping. See the -section "Subpool Reservations" for more details. +section :ref:`Subpool Reservations <sub_pool_resv>` for more details. After consulting the reservation map and subpool, the number of needed new reservations is known.
The routine hugetlb_acct_memory() is called to check @@ -135,9 +154,11 @@ calls into routines that potentially allocate and adjust surplus page counts. However, within those routines the code is simply checking to ensure there are enough free huge pages to accommodate the reservation. If there are, the global reservation count resv_huge_pages is adjusted something like the -following. +following:: + if (resv_needed <= (resv_huge_pages - free_huge_pages)) resv_huge_pages += resv_needed; + Note that the global lock hugetlb_lock is held when checking and adjusting these counters. @@ -152,14 +173,18 @@ If hugetlb_reserve_pages() was successful, the global reservation count and reservation map associated with the mapping will be modified as required to ensure reservations exist for the range 'from' - 'to'. +.. _consume_resv: Consuming Reservations/Allocating a Huge Page ---------------------------------------------- +============================================= + Reservations are consumed when huge pages associated with the reservations are allocated and instantiated in the corresponding mapping. The allocation -is performed within the routine alloc_huge_page(). -struct page *alloc_huge_page(struct vm_area_struct *vma, - unsigned long addr, int avoid_reserve) +is performed within the routine alloc_huge_page():: + + struct page *alloc_huge_page(struct vm_area_struct *vma, + unsigned long addr, int avoid_reserve) + alloc_huge_page is passed a VMA pointer and a virtual address, so it can consult the reservation map to determine if a reservation exists. In addition, alloc_huge_page takes the argument avoid_reserve which indicates reserves @@ -170,8 +195,9 @@ page are being allocated. The helper routine vma_needs_reservation() is called to determine if a reservation exists for the address within the mapping(vma). See the section -"Reservation Map Helper Routines" for detailed information on what this -routine does. 
The value returned from vma_needs_reservation() is generally +:ref:`Reservation Map Helper Routines ` for detailed +information on what this routine does. +The value returned from vma_needs_reservation() is generally 0 or 1. 0 if a reservation exists for the address, 1 if no reservation exists. If a reservation does not exist, and there is a subpool associated with the mapping the subpool is consulted to determine if it contains reservations. @@ -180,21 +206,25 @@ However, in every case the avoid_reserve argument overrides the use of a reservation for the allocation. After determining whether a reservation exists and can be used for the allocation, the routine dequeue_huge_page_vma() is called. This routine takes two arguments related to reservations: + - avoid_reserve, this is the same value/argument passed to alloc_huge_page() - chg, even though this argument is of type long only the values 0 or 1 are passed to dequeue_huge_page_vma. If the value is 0, it indicates a reservation exists (see the section "Memory Policy and Reservations" for possible issues). If the value is 1, it indicates a reservation does not exist and the page must be taken from the global free pool if possible. + The free lists associated with the memory policy of the VMA are searched for a free page. If a page is found, the value free_huge_pages is decremented when the page is removed from the free list. If there was a reservation -associated with the page, the following adjustments are made: +associated with the page, the following adjustments are made:: + SetPagePrivate(page); /* Indicates allocating this page consumed * a reservation, and if an error is * encountered such that the page must be * freed, the reservation will be restored. */ resv_huge_pages--; /* Decrement the global reservation count */ + Note, if no huge page can be found that satisfies the VMA's memory policy an attempt will be made to allocate one using the buddy allocator. 
This brings up the issue of surplus huge pages and overcommit which is beyond @@ -222,12 +252,14 @@ mapping. In such cases, the reservation count and subpool free page count will be off by one. This rare condition can be identified by comparing the return value from vma_needs_reservation and vma_commit_reservation. If such a race is detected, the subpool and global reserve counts are adjusted to -compensate. See the section "Reservation Map Helper Routines" for more +compensate. See the section +:ref:`Reservation Map Helper Routines <resv_map_helpers>` for more information on these routines. Instantiate Huge Pages ----------------------- +====================== + After huge page allocation, the page is typically added to the page tables of the allocating task. Before this, pages in a shared mapping are added to the page cache and pages in private mappings are added to an anonymous @@ -237,7 +269,8 @@ to the global reservation count (resv_huge_pages). Freeing Huge Pages ------------------- +================== + Huge page freeing is performed by the routine free_huge_page(). This routine is the destructor for hugetlbfs compound pages. As a result, it is only passed a pointer to the page struct. When a huge page is freed, reservation @@ -247,7 +280,8 @@ on an error path where a global reserve count must be restored. The page->private field points to any subpool associated with the page. If the PagePrivate flag is set, it indicates the global reserve count should -be adjusted (see the section "Consuming Reservations/Allocating a Huge Page" +be adjusted (see the section +:ref:`Consuming Reservations/Allocating a Huge Page <consume_resv>` for information on how these are set). The routine first calls hugepage_subpool_put_pages() for the page. If this @@ -259,9 +293,11 @@ Therefore, the global resv_huge_pages counter is incremented in this case. If the PagePrivate flag was set in the page, the global resv_huge_pages counter will always be incremented. +..
_sub_pool_resv: Subpool Reservations --------------------- +==================== + There is a struct hstate associated with each huge page size. The hstate tracks all huge pages of the specified size. A subpool represents a subset of pages within a hstate that is associated with a mounted hugetlbfs @@ -295,7 +331,8 @@ the global pools. COW and Reservations --------------------- +==================== + Since shared mappings all point to and use the same underlying pages, the biggest reservation concern for COW is private mappings. In this case, two tasks can be pointing at the same previously allocated page. One task @@ -326,30 +363,36 @@ faults on a non-present page. But, the original owner of the mapping/reservation will behave as expected. +.. _resv_map_modifications: + Reservation Map Modifications ------------------------------ +============================= + The following low level routines are used to make modifications to a reservation map. Typically, these routines are not called directly. Rather, a reservation map helper routine is called which calls one of these low level routines. These low level routines are fairly well documented in the source -code (mm/hugetlb.c). These routines are: -long region_chg(struct resv_map *resv, long f, long t); -long region_add(struct resv_map *resv, long f, long t); -void region_abort(struct resv_map *resv, long f, long t); -long region_count(struct resv_map *resv, long f, long t); +code (mm/hugetlb.c). These routines are:: + + long region_chg(struct resv_map *resv, long f, long t); + long region_add(struct resv_map *resv, long f, long t); + void region_abort(struct resv_map *resv, long f, long t); + long region_count(struct resv_map *resv, long f, long t); Operations on the reservation map typically involve two operations: + 1) region_chg() is called to examine the reserve map and determine how many pages in the specified range [f, t) are NOT currently represented. 
The calling code performs global checks and allocations to determine if there are enough huge pages for the operation to succeed. -2a) If the operation can succeed, region_add() is called to actually modify - the reservation map for the same range [f, t) previously passed to - region_chg(). -2b) If the operation can not succeed, region_abort is called for the same range - [f, t) to abort the operation. +2) + a) If the operation can succeed, region_add() is called to actually modify + the reservation map for the same range [f, t) previously passed to + region_chg(). + b) If the operation can not succeed, region_abort is called for the same + range [f, t) to abort the operation. Note that this is a two step process where region_add() and region_abort() are guaranteed to succeed after a prior call to region_chg() for the same @@ -371,6 +414,7 @@ and make the appropriate adjustments. The routine region_del() is called to remove regions from a reservation map. It is typically called in the following situations: + - When a file in the hugetlbfs filesystem is being removed, the inode will be released and the reservation map freed. Before freeing the reservation map, all the individual file_region structures must be freed. In this case @@ -384,6 +428,7 @@ It is typically called in the following situations: removed, region_del() is called to remove the corresponding entry from the reservation map. In this case, region_del is passed the range [page_idx, page_idx + 1). + In every case, region_del() will return the number of pages removed from the reservation map. In VERY rare cases, region_del() can fail. This can only happen in the hole punch case where it has to split an existing file_region @@ -403,9 +448,11 @@ outstanding (outstanding = (end - start) - region_count(resv, start, end)). Since the mapping is going away, the subpool and global reservation counts are decremented by the number of outstanding reservations. +.. 
_resv_map_helpers: Reservation Map Helper Routines -------------------------------- +=============================== + Several helper routines exist to query and modify the reservation maps. These routines are only interested with reservations for a specific huge page, so they just pass in an address instead of a range. In addition, @@ -414,32 +461,40 @@ or shared) and the location of the reservation map (inode or VMA) can be determined. These routines simply call the underlying routines described in the section "Reservation Map Modifications". However, they do take into account the 'opposite' meaning of reservation map entries for private and -shared mappings and hide this detail from the caller. +shared mappings and hide this detail from the caller:: + + long vma_needs_reservation(struct hstate *h, + struct vm_area_struct *vma, + unsigned long addr) -long vma_needs_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) This routine calls region_chg() for the specified page. If no reservation -exists, 1 is returned. If a reservation exists, 0 is returned. +exists, 1 is returned. If a reservation exists, 0 is returned:: + + long vma_commit_reservation(struct hstate *h, + struct vm_area_struct *vma, + unsigned long addr) -long vma_commit_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) This calls region_add() for the specified page. As in the case of region_chg and region_add, this routine is to be called after a previous call to vma_needs_reservation. It will add a reservation entry for the page. It returns 1 if the reservation was added and 0 if not. The return value should be compared with the return value of the previous call to vma_needs_reservation. An unexpected difference indicates the reservation -map was modified between calls. 
+map was modified between calls:: + + void vma_end_reservation(struct hstate *h, + struct vm_area_struct *vma, + unsigned long addr) -void vma_end_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) This calls region_abort() for the specified page. As in the case of region_chg and region_abort, this routine is to be called after a previous call to vma_needs_reservation. It will abort/end the in progress reservation add -operation. +operation:: + + long vma_add_reservation(struct hstate *h, + struct vm_area_struct *vma, + unsigned long addr) -long vma_add_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) This is a special wrapper routine to help facilitate reservation cleanup on error paths. It is only called from the routine restore_reserve_on_error(). This routine is used in conjunction with vma_needs_reservation in an attempt @@ -453,8 +508,10 @@ be done on error paths. Reservation Cleanup in Error Paths ----------------------------------- -As mentioned in the section "Reservation Map Helper Routines", reservation +================================== + +As mentioned in the section +:ref:`Reservation Map Helper Routines <resv_map_helpers>`, reservation map modifications are performed in two steps. First vma_needs_reservation is called before a page is allocated. If the allocation is successful, then vma_commit_reservation is called. If not, vma_end_reservation is called. @@ -494,13 +551,14 @@ so that a reservation will not be leaked when the huge page is freed. Reservations and Memory Policy ------------------------------- +============================== Per-node huge page lists existed in struct hstate when git was first used to manage Linux code. The concept of reservations was added some time later. When reservations were added, no attempt was made to take memory policy into account.
While cpusets are not exactly the same as memory policy, this comment in hugetlb_acct_memory sums up the interaction between reservations -and cpusets/memory policy. +and cpusets/memory policy:: + /* * When cpuset is configured, it breaks the strict hugetlb page * reservation as the accounting is done on a global variable. Such From b53ba58845fcafcdf7b30909c056d3deb064e983 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:25 +0200 Subject: [PATCH 012/103] docs/vm: hwpoison.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/hwpoison.txt | 141 +++++++++++++++++----------------- 1 file changed, 70 insertions(+), 71 deletions(-) diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt index e912d7eee769..b1a8c241d6c2 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.txt @@ -1,7 +1,14 @@ +.. hwpoison: + +======== +hwpoison +======== + What is hwpoison? +================= Upcoming Intel CPUs have support for recovering from some memory errors -(``MCA recovery''). This requires the OS to declare a page "poisoned", +(``MCA recovery``). This requires the OS to declare a page "poisoned", kill the processes associated with it and avoid using it in the future. This patchkit implements the necessary infrastructure in the VM. @@ -46,9 +53,10 @@ address. This in theory allows other applications to handle memory failures too. The expection is that near all applications won't do that, but some very specialized ones might. ---- +Failure recovery modes +====================== -There are two (actually three) modi memory failure recovery can be in: +There are two (actually three) modes memory failure recovery can be in: vm.memory_failure_recovery sysctl set to zero: All memory failures cause a panic. Do not attempt recovery. @@ -67,9 +75,8 @@ late kill This is best for memory error unaware applications and default Note some pages are always handled as late kill. 
---- - -User control: +User control +============ vm.memory_failure_recovery See sysctl.txt @@ -79,11 +86,19 @@ vm.memory_failure_early_kill PR_MCE_KILL Set early/late kill mode/revert to system default - arg1: PR_MCE_KILL_CLEAR: Revert to system default - arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode - PR_MCE_KILL_EARLY: Early kill - PR_MCE_KILL_LATE: Late kill - PR_MCE_KILL_DEFAULT: Use system global default + + arg1: PR_MCE_KILL_CLEAR: + Revert to system default + arg1: PR_MCE_KILL_SET: + arg2 defines thread specific mode + + PR_MCE_KILL_EARLY: + Early kill + PR_MCE_KILL_LATE: + Late kill + PR_MCE_KILL_DEFAULT + Use system global default + Note that if you want to have a dedicated thread which handles the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise, @@ -92,77 +107,64 @@ PR_MCE_KILL PR_MCE_KILL_GET return current mode +Testing +======= ---- +* madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the + process for testing -Testing: +* hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/`` -madvise(MADV_HWPOISON, ....) - (as root) - Poison a page in the process for testing + corrupt-pfn + Inject hwpoison fault at PFN echoed into this file. This does + some early filtering to avoid corrupted unintended pages in test suites. + unpoison-pfn + Software-unpoison page at PFN echoed into this file. This way + a page can be reused again. This only works for Linux + injected failures, not for real memory failures. -hwpoison-inject module through debugfs + Note these injection interfaces are not stable and might change between + kernel versions -/sys/kernel/debug/hwpoison/ + corrupt-filter-dev-major, corrupt-filter-dev-minor + Only handle memory failures to pages associated with the file + system defined by block device major/minor. -1U is the + wildcard value. This should be only used for testing with + artificial injection. 
-corrupt-pfn + corrupt-filter-memcg + Limit injection to pages owned by memgroup. Specified by inode + number of the memcg. -Inject hwpoison fault at PFN echoed into this file. This does -some early filtering to avoid corrupted unintended pages in test suites. + Example:: -unpoison-pfn + mkdir /sys/fs/cgroup/mem/hwpoison -Software-unpoison page at PFN echoed into this file. This -way a page can be reused again. -This only works for Linux injected failures, not for real -memory failures. + usemem -m 100 -s 1000 & + echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks -Note these injection interfaces are not stable and might change between -kernel versions + memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') + echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg -corrupt-filter-dev-major -corrupt-filter-dev-minor + page-types -p `pidof init` --hwpoison # shall do nothing + page-types -p `pidof usemem` --hwpoison # poison its pages -Only handle memory failures to pages associated with the file system defined -by block device major/minor. -1U is the wildcard value. -This should be only used for testing with artificial injection. + corrupt-filter-flags-mask, corrupt-filter-flags-value + When specified, only poison pages if ((page_flags & mask) == + value). This allows stress testing of many kinds of + pages. The page_flags are the same as in /proc/kpageflags. The + flag bits are defined in include/linux/kernel-page-flags.h and + documented in Documentation/vm/pagemap.txt -corrupt-filter-memcg +* Architecture specific MCE injector -Limit injection to pages owned by memgroup. Specified by inode number -of the memcg. + x86 has mce-inject, mce-test -Example: - mkdir /sys/fs/cgroup/mem/hwpoison + Some portable hwpoison test programs in mce-test, see below. 
- usemem -m 100 -s 1000 & - echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks - - memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') - echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg - - page-types -p `pidof init` --hwpoison # shall do nothing - page-types -p `pidof usemem` --hwpoison # poison its pages - -corrupt-filter-flags-mask -corrupt-filter-flags-value - -When specified, only poison pages if ((page_flags & mask) == value). -This allows stress testing of many kinds of pages. The page_flags -are the same as in /proc/kpageflags. The flag bits are defined in -include/linux/kernel-page-flags.h and documented in -Documentation/vm/pagemap.txt - -Architecture specific MCE injector - -x86 has mce-inject, mce-test - -Some portable hwpoison test programs in mce-test, see blow. - ---- - -References: +References +========== http://halobates.de/mce-lc09-2.pdf Overview presentation from LinuxCon 09 @@ -174,14 +176,11 @@ git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git x86 specific injector ---- - -Limitations: - +Limitations +=========== - Not all page types are supported and never will. Most kernel internal -objects cannot be recovered, only LRU pages for now. + objects cannot be recovered, only LRU pages for now. - Right now hugepage support is missing. --- Andi Kleen, Oct 2009 - From e3f2025a574ff56b301876cc8d3ac50021066779 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:26 +0200 Subject: [PATCH 013/103] docs/vm: idle_page_tracking.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/idle_page_tracking.txt | 55 ++++++++++++++++--------- 1 file changed, 36 insertions(+), 19 deletions(-) diff --git a/Documentation/vm/idle_page_tracking.txt b/Documentation/vm/idle_page_tracking.txt index 85dcc3bb85dc..9cbe6f8d7a99 100644 --- a/Documentation/vm/idle_page_tracking.txt +++ b/Documentation/vm/idle_page_tracking.txt @@ -1,4 +1,11 @@ -MOTIVATION +.. 
_idle_page_tracking: + +================== +Idle Page Tracking +================== + +Motivation +========== The idle page tracking feature allows to track which memory pages are being accessed by a workload and which are idle. This information can be useful for @@ -8,10 +15,14 @@ or deciding where to place the workload within a compute cluster. It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. -USER API +.. _user_api: -The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently, -it consists of the only read-write file, /sys/kernel/mm/page_idle/bitmap. +User API +======== + +The idle page tracking API is located at ``/sys/kernel/mm/page_idle``. +Currently, it consists of the only read-write file, +``/sys/kernel/mm/page_idle/bitmap``. The file implements a bitmap where each bit corresponds to a memory page. The bitmap is represented by an array of 8-byte integers, and the page at PFN #i is @@ -19,8 +30,9 @@ mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is set, the corresponding page is idle. A page is considered idle if it has not been accessed since it was marked idle -(for more details on what "accessed" actually means see the IMPLEMENTATION -DETAILS section). To mark a page idle one has to set the bit corresponding to +(for more details on what "accessed" actually means see the :ref:`Implementation +Details <impl_details>` section). +To mark a page idle one has to set the bit corresponding to the page by writing to the file. A value written to the file is OR-ed with the current bitmap value. @@ -30,9 +42,9 @@ page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, and hence such pages are never reported idle. For huge pages the idle flag is set only on the head page, so one has to read -/proc/kpageflags in order to correctly count idle huge pages. +``/proc/kpageflags`` in order to correctly count idle huge pages.
-Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return +Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return -EINVAL if you are not starting the read/write on an 8-byte boundary, or if the size of the read/write is not a multiple of 8 bytes. Writing to this file beyond max PFN will return -ENXIO. @@ -41,21 +53,25 @@ That said, in order to estimate the amount of pages that are not used by a workload one should: 1. Mark all the workload's pages as idle by setting corresponding bits in - /sys/kernel/mm/page_idle/bitmap. The pages can be found by reading - /proc/pid/pagemap if the workload is represented by a process, or by - filtering out alien pages using /proc/kpagecgroup in case the workload is - placed in a memory cgroup. + ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading + ``/proc/pid/pagemap`` if the workload is represented by a process, or by + filtering out alien pages using ``/proc/kpagecgroup`` in case the workload + is placed in a memory cgroup. 2. Wait until the workload accesses its working set. - 3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. If - one wants to ignore certain types of pages, e.g. mlocked pages since they - are not reclaimable, he or she can filter them out using /proc/kpageflags. + 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set. + If one wants to ignore certain types of pages, e.g. mlocked pages since they + are not reclaimable, he or she can filter them out using + ``/proc/kpageflags``. -See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap, -/proc/kpageflags, and /proc/kpagecgroup. +See Documentation/vm/pagemap.txt for more information about +``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. -IMPLEMENTATION DETAILS +.. 
_impl_details: + +Implementation Details +====================== The kernel internally keeps track of accesses to user memory pages in order to reclaim unreferenced pages first on memory shortage conditions. A page is @@ -77,7 +93,8 @@ When a dirty page is written to swap or disk as a result of memory reclaim or exceeding the dirty memory limit, it is not marked referenced. The idle memory tracking feature adds a new page flag, the Idle flag. This flag -is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the +:ref:`User API ` section), and cleared automatically whenever a page is referenced as defined above. From 2fcbc413803f390e2ca8f82ccaf4b3634a56ec4f Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:27 +0200 Subject: [PATCH 014/103] docs/vm: ksm.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/ksm.txt | 199 ++++++++++++++++++++------------------- 1 file changed, 102 insertions(+), 97 deletions(-) diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt index 6686bd267dc9..87e7eef5ea9c 100644 --- a/Documentation/vm/ksm.txt +++ b/Documentation/vm/ksm.txt @@ -1,8 +1,11 @@ -How to use the Kernel Samepage Merging feature ----------------------------------------------- +.. _ksm: + +======================= +Kernel Samepage Merging +======================= KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, -added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation, +added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation, and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ The KSM daemon ksmd periodically scans those areas of user memory which @@ -51,110 +54,112 @@ Applications should be considerate in their use of MADV_MERGEABLE, restricting its use to areas likely to benefit. 
KSM's scans may use a lot of processing power: some installations will disable KSM for that reason. -The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, +The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``, readable by all but writable only by root: -pages_to_scan - how many present pages to scan before ksmd goes to sleep - e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan" - Default: 100 (chosen for demonstration purposes) +pages_to_scan + how many present pages to scan before ksmd goes to sleep + e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan`` Default: 100 + (chosen for demonstration purposes) -sleep_millisecs - how many milliseconds ksmd should sleep before next scan - e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" - Default: 20 (chosen for demonstration purposes) +sleep_millisecs + how many milliseconds ksmd should sleep before next scan + e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs`` Default: 20 + (chosen for demonstration purposes) -merge_across_nodes - specifies if pages from different numa nodes can be merged. - When set to 0, ksm merges only pages which physically - reside in the memory area of same NUMA node. That brings - lower latency to access of shared pages. Systems with more - nodes, at significant NUMA distances, are likely to benefit - from the lower latency of setting 0. Smaller systems, which - need to minimize memory usage, are likely to benefit from - the greater sharing of setting 1 (default). You may wish to - compare how your system performs under each setting, before - deciding on which to use. merge_across_nodes setting can be - changed only when there are no ksm shared pages in system: - set run 2 to unmerge pages first, then to 1 after changing - merge_across_nodes, to remerge according to the new setting. - Default: 1 (merging across nodes as in earlier releases) +merge_across_nodes + specifies if pages from different numa nodes can be merged. 
+ When set to 0, ksm merges only pages which physically reside + in the memory area of same NUMA node. That brings lower + latency to access of shared pages. Systems with more nodes, at + significant NUMA distances, are likely to benefit from the + lower latency of setting 0. Smaller systems, which need to + minimize memory usage, are likely to benefit from the greater + sharing of setting 1 (default). You may wish to compare how + your system performs under each setting, before deciding on + which to use. merge_across_nodes setting can be changed only + when there are no ksm shared pages in system: set run 2 to + unmerge pages first, then to 1 after changing + merge_across_nodes, to remerge according to the new setting. + Default: 1 (merging across nodes as in earlier releases) -run - set 0 to stop ksmd from running but keep merged pages, - set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", - set 2 to stop ksmd and unmerge all pages currently merged, - but leave mergeable areas registered for next run - Default: 0 (must be changed to 1 to activate KSM, - except if CONFIG_SYSFS is disabled) +run + set 0 to stop ksmd from running but keep merged pages, + set 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``, + set 2 to stop ksmd and unmerge all pages currently merged, but + leave mergeable areas registered for next run Default: 0 (must + be changed to 1 to activate KSM, except if CONFIG_SYSFS is + disabled) -use_zero_pages - specifies whether empty pages (i.e. allocated pages - that only contain zeroes) should be treated specially. - When set to 1, empty pages are merged with the kernel - zero page(s) instead of with each other as it would - happen normally. This can improve the performance on - architectures with coloured zero pages, depending on - the workload. 
Care should be taken when enabling this - setting, as it can potentially degrade the performance - of KSM for some workloads, for example if the checksums - of pages candidate for merging match the checksum of - an empty page. This setting can be changed at any time, - it is only effective for pages merged after the change. - Default: 0 (normal KSM behaviour as in earlier releases) +use_zero_pages + specifies whether empty pages (i.e. allocated pages that only + contain zeroes) should be treated specially. When set to 1, + empty pages are merged with the kernel zero page(s) instead of + with each other as it would happen normally. This can improve + the performance on architectures with coloured zero pages, + depending on the workload. Care should be taken when enabling + this setting, as it can potentially degrade the performance of + KSM for some workloads, for example if the checksums of pages + candidate for merging match the checksum of an empty + page. This setting can be changed at any time, it is only + effective for pages merged after the change. Default: 0 + (normal KSM behaviour as in earlier releases) -max_page_sharing - Maximum sharing allowed for each KSM page. This - enforces a deduplication limit to avoid the virtual - memory rmap lists to grow too large. The minimum - value is 2 as a newly created KSM page will have at - least two sharers. The rmap walk has O(N) - complexity where N is the number of rmap_items - (i.e. virtual mappings) that are sharing the page, - which is in turn capped by max_page_sharing. So - this effectively spread the the linear O(N) - computational complexity from rmap walk context - over different KSM pages. The ksmd walk over the - stable_node "chains" is also O(N), but N is the - number of stable_node "dups", not the number of - rmap_items, so it has not a significant impact on - ksmd performance. In practice the best stable_node - "dup" candidate will be kept and found at the head - of the "dups" list. 
The higher this value the - faster KSM will merge the memory (because there - will be fewer stable_node dups queued into the - stable_node chain->hlist to check for pruning) and - the higher the deduplication factor will be, but - the slowest the worst case rmap walk could be for - any given KSM page. Slowing down the rmap_walk - means there will be higher latency for certain - virtual memory operations happening during - swapping, compaction, NUMA balancing and page - migration, in turn decreasing responsiveness for - the caller of those virtual memory operations. The - scheduler latency of other tasks not involved with - the VM operations doing the rmap walk is not - affected by this parameter as the rmap walks are - always schedule friendly themselves. +max_page_sharing + Maximum sharing allowed for each KSM page. This enforces a + deduplication limit to avoid the virtual memory rmap lists to + grow too large. The minimum value is 2 as a newly created KSM + page will have at least two sharers. The rmap walk has O(N) + complexity where N is the number of rmap_items (i.e. virtual + mappings) that are sharing the page, which is in turn capped + by max_page_sharing. So this effectively spreads the linear + O(N) computational complexity from rmap walk context over + different KSM pages. The ksmd walk over the stable_node + "chains" is also O(N), but N is the number of stable_node + "dups", not the number of rmap_items, so it has not a + significant impact on ksmd performance. In practice the best + stable_node "dup" candidate will be kept and found at the head + of the "dups" list. The higher this value the faster KSM will + merge the memory (because there will be fewer stable_node dups + queued into the stable_node chain->hlist to check for pruning) + and the higher the deduplication factor will be, but the + slowest the worst case rmap walk could be for any given KSM + page. 
Slowing down the rmap_walk means there will be higher + latency for certain virtual memory operations happening during + swapping, compaction, NUMA balancing and page migration, in + turn decreasing responsiveness for the caller of those virtual + memory operations. The scheduler latency of other tasks not + involved with the VM operations doing the rmap walk is not + affected by this parameter as the rmap walks are always + schedule friendly themselves. -stable_node_chains_prune_millisecs - How frequently to walk the whole - list of stable_node "dups" linked in the - stable_node "chains" in order to prune stale - stable_nodes. Smaller milllisecs values will free - up the KSM metadata with lower latency, but they - will make ksmd use more CPU during the scan. This - only applies to the stable_node chains so it's a - noop if not a single KSM page hit the - max_page_sharing yet (there would be no stable_node - chains in such case). +stable_node_chains_prune_millisecs + How frequently to walk the whole list of stable_node "dups" + linked in the stable_node "chains" in order to prune stale + stable_nodes. Smaller millisecs values will free up the KSM + metadata with lower latency, but they will make ksmd use more + CPU during the scan. This only applies to the stable_node + chains so it's a noop if not a single KSM page hit the + max_page_sharing yet (there would be no stable_node chains in + such case). -The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/: +The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: -pages_shared - how many shared pages are being used -pages_sharing - how many more sites are sharing them i.e. 
how much saved -pages_unshared - how many pages unique but repeatedly checked for merging -pages_volatile - how many pages changing too fast to be placed in a tree -full_scans - how many times all mergeable areas have been scanned - -stable_node_chains - number of stable node chains allocated, this is - effectively the number of KSM pages that hit the - max_page_sharing limit -stable_node_dups - number of stable node dups queued into the - stable_node chains +pages_shared + how many shared pages are being used +pages_sharing + how many more sites are sharing them i.e. how much saved +pages_unshared + how many pages unique but repeatedly checked for merging +pages_volatile + how many pages changing too fast to be placed in a tree +full_scans + how many times all mergeable areas have been scanned +stable_node_chains + number of stable node chains allocated, this is effectively + the number of KSM pages that hit the max_page_sharing limit +stable_node_dups + number of stable node dups queued into the stable_node chains A high ratio of pages_sharing to pages_shared indicates good sharing, but a high ratio of pages_unshared to pages_sharing indicates wasted effort. From 16f9f7f924e020e30cac5102ec5750e86f6810bc Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:28 +0200 Subject: [PATCH 015/103] docs/vm: mmu_notifier.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/mmu_notifier.txt | 108 ++++++++++++++++-------------- 1 file changed, 57 insertions(+), 51 deletions(-) diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt index 23b462566bb7..47baa1cf28c5 100644 --- a/Documentation/vm/mmu_notifier.txt +++ b/Documentation/vm/mmu_notifier.txt @@ -1,7 +1,10 @@ +.. _mmu_notifier: + When do you need to notify inside page table lock ? 
+=================================================== When clearing a pte/pmd we are given a choice to notify the event through -(notify version of *_clear_flush call mmu_notifier_invalidate_range) under +(notify version of \*_clear_flush call mmu_notifier_invalidate_range) under the page table lock. But that notification is not necessary in all cases. For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use @@ -18,6 +21,7 @@ a page that might now be used by some completely different task. Case B is more subtle. For correctness it requires the following sequence to happen: + - take page table lock - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify()) - set page table entry to point to new page @@ -28,58 +32,60 @@ the device. Consider the following scenario (device use a feature similar to ATS/PASID): -Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume +Two address addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE we assume they are write protected for COW (other case of B apply too). 
-[Time N] -------------------------------------------------------------------- -CPU-thread-0 {try to write to addrA} -CPU-thread-1 {try to write to addrB} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {read addrA and populate device TLB} -DEV-thread-2 {read addrB and populate device TLB} -[Time N+1] ------------------------------------------------------------------ -CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} -CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+2] ------------------------------------------------------------------ -CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}} -CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+3] ------------------------------------------------------------------ -CPU-thread-0 {preempted} -CPU-thread-1 {preempted} -CPU-thread-2 {write to addrA which is a write to new page} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+3] ------------------------------------------------------------------ -CPU-thread-0 {preempted} -CPU-thread-1 {preempted} -CPU-thread-2 {} -CPU-thread-3 {write to addrB which is a write to new page} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+4] ------------------------------------------------------------------ -CPU-thread-0 {preempted} -CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+5] ------------------------------------------------------------------ -CPU-thread-0 {preempted} -CPU-thread-1 {} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {read addrA from old page} -DEV-thread-2 {read addrB from new page} +:: + + [Time N] -------------------------------------------------------------------- + CPU-thread-0 {try to write to addrA} + CPU-thread-1 {try to 
write to addrB} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {read addrA and populate device TLB} + DEV-thread-2 {read addrB and populate device TLB} + [Time N+1] ------------------------------------------------------------------ + CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} + CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+2] ------------------------------------------------------------------ + CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}} + CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+3] ------------------------------------------------------------------ + CPU-thread-0 {preempted} + CPU-thread-1 {preempted} + CPU-thread-2 {write to addrA which is a write to new page} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+3] ------------------------------------------------------------------ + CPU-thread-0 {preempted} + CPU-thread-1 {preempted} + CPU-thread-2 {} + CPU-thread-3 {write to addrB which is a write to new page} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+4] ------------------------------------------------------------------ + CPU-thread-0 {preempted} + CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+5] ------------------------------------------------------------------ + CPU-thread-0 {preempted} + CPU-thread-1 {} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {read addrA from old page} + DEV-thread-2 {read addrB from new page} So here because at time N+2 the clear page table entry was not pair with a notification to invalidate the secondary TLB, the device see the new value for From cb5e4376e5f72f539feb6830869f6135ec739c22 Mon Sep 17 00:00:00 2001 From: Mike 
Rapoport Date: Wed, 21 Mar 2018 21:22:29 +0200 Subject: [PATCH 016/103] docs/vm: numa_memory_policy.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/numa_memory_policy.txt | 465 +++++++++++++----------- 1 file changed, 249 insertions(+), 216 deletions(-) diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt index 622b927816e7..8cd942ca114e 100644 --- a/Documentation/vm/numa_memory_policy.txt +++ b/Documentation/vm/numa_memory_policy.txt @@ -1,5 +1,11 @@ +.. _numa_memory_policy: + +=================== +Linux Memory Policy +=================== What is Linux Memory Policy? +============================ In the Linux kernel, "memory policy" determines from which node the kernel will allocate memory in a NUMA system or in an emulated NUMA system. Linux has @@ -9,35 +15,36 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy support. Memory policies should not be confused with cpusets -(Documentation/cgroup-v1/cpusets.txt) +(``Documentation/cgroup-v1/cpusets.txt``) which is an administrative mechanism for restricting the nodes from which memory may be allocated by a set of processes. Memory policies are a programming interface that a NUMA-aware application can take advantage of. When both cpusets and policies are applied to a task, the restrictions of the cpuset -takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details. +takes priority. See :ref:`Memory Policies and cpusets ` +below for more details. -MEMORY POLICY CONCEPTS +Memory Policy Concepts +====================== Scope of Memory Policies +------------------------ The Linux kernel supports _scopes_ of memory policy, described here from most general to most specific: - System Default Policy: this policy is "hard coded" into the kernel. It - is the policy that governs all page allocations that aren't controlled - by one of the more specific policy scopes discussed below. 
When the - system is "up and running", the system default policy will use "local - allocation" described below. However, during boot up, the system - default policy will be set to interleave allocations across all nodes - with "sufficient" memory, so as not to overload the initial boot node - with boot-time allocations. +System Default Policy + this policy is "hard coded" into the kernel. It is the policy + that governs all page allocations that aren't controlled by + one of the more specific policy scopes discussed below. When + the system is "up and running", the system default policy will + use "local allocation" described below. However, during boot + up, the system default policy will be set to interleave + allocations across all nodes with "sufficient" memory, so as + not to overload the initial boot node with boot-time + allocations. - Task/Process Policy: this is an optional, per-task policy. When defined - for a specific task, this policy controls all page allocations made by or - on behalf of the task that aren't controlled by a more specific scope. - If a task does not define a task policy, then all page allocations that - would have been controlled by the task policy "fall back" to the System - Default Policy. +Task/Process Policy + this is an optional, per-task policy. When defined for a specific task, this policy controls all page allocations made by or on behalf of the task that aren't controlled by a more specific scope. If a task does not define a task policy, then all page allocations that would have been controlled by the task policy "fall back" to the System Default Policy. The task policy applies to the entire address space of a task. Thus, it is inheritable, and indeed is inherited, across both fork() @@ -58,56 +65,66 @@ most general to most specific: changes its task policy remain where they were allocated based on the policy at the time they were allocated. 
- VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's - virtual address space. A task may define a specific policy for a range - of its virtual address space. See the MEMORY POLICIES APIS section, - below, for an overview of the mbind() system call used to set a VMA - policy. +.. _vma_policy: - A VMA policy will govern the allocation of pages that back this region of - the address space. Any regions of the task's address space that don't - have an explicit VMA policy will fall back to the task policy, which may - itself fall back to the System Default Policy. +VMA Policy + A "VMA" or "Virtual Memory Area" refers to a range of a task's + virtual address space. A task may define a specific policy for a range + of its virtual address space. See the MEMORY POLICIES APIS section, + below, for an overview of the mbind() system call used to set a VMA + policy. - VMA policies have a few complicating details: + A VMA policy will govern the allocation of pages that back + this region of the address space. Any regions of the task's + address space that don't have an explicit VMA policy will fall + back to the task policy, which may itself fall back to the + System Default Policy. - VMA policy applies ONLY to anonymous pages. These include pages - allocated for anonymous segments, such as the task stack and heap, and - any regions of the address space mmap()ed with the MAP_ANONYMOUS flag. - If a VMA policy is applied to a file mapping, it will be ignored if - the mapping used the MAP_SHARED flag. If the file mapping used the - MAP_PRIVATE flag, the VMA policy will only be applied when an - anonymous page is allocated on an attempt to write to the mapping-- - i.e., at Copy-On-Write. + VMA policies have a few complicating details: - VMA policies are shared between all tasks that share a virtual address - space--a.k.a. threads--independent of when the policy is installed; and - they are inherited across fork(). 
However, because VMA policies refer - to a specific region of a task's address space, and because the address - space is discarded and recreated on exec*(), VMA policies are NOT - inheritable across exec(). Thus, only NUMA-aware applications may - use VMA policies. + * VMA policy applies ONLY to anonymous pages. These include + pages allocated for anonymous segments, such as the task + stack and heap, and any regions of the address space + mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is + applied to a file mapping, it will be ignored if the mapping + used the MAP_SHARED flag. If the file mapping used the + MAP_PRIVATE flag, the VMA policy will only be applied when + an anonymous page is allocated on an attempt to write to the + mapping-- i.e., at Copy-On-Write. - A task may install a new VMA policy on a sub-range of a previously - mmap()ed region. When this happens, Linux splits the existing virtual - memory area into 2 or 3 VMAs, each with it's own policy. + * VMA policies are shared between all tasks that share a + virtual address space--a.k.a. threads--independent of when + the policy is installed; and they are inherited across + fork(). However, because VMA policies refer to a specific + region of a task's address space, and because the address + space is discarded and recreated on exec*(), VMA policies + are NOT inheritable across exec(). Thus, only NUMA-aware + applications may use VMA policies. - By default, VMA policy applies only to pages allocated after the policy - is installed. Any pages already faulted into the VMA range remain - where they were allocated based on the policy at the time they were - allocated. However, since 2.6.16, Linux supports page migration via - the mbind() system call, so that page contents can be moved to match - a newly installed policy. + * A task may install a new VMA policy on a sub-range of a + previously mmap()ed region. 
When this happens, Linux splits + the existing virtual memory area into 2 or 3 VMAs, each with + its own policy. - Shared Policy: Conceptually, shared policies apply to "memory objects" - mapped shared into one or more tasks' distinct address spaces. An - application installs a shared policies the same way as VMA policies--using - the mbind() system call specifying a range of virtual addresses that map - the shared object. However, unlike VMA policies, which can be considered - to be an attribute of a range of a task's address space, shared policies - apply directly to the shared object. Thus, all tasks that attach to the - object share the policy, and all pages allocated for the shared object, - by any task, will obey the shared policy. + * By default, VMA policy applies only to pages allocated after + the policy is installed. Any pages already faulted into the + VMA range remain where they were allocated based on the + policy at the time they were allocated. However, since + 2.6.16, Linux supports page migration via the mbind() system + call, so that page contents can be moved to match a newly + installed policy. + +Shared Policy + Conceptually, shared policies apply to "memory objects" mapped + shared into one or more tasks' distinct address spaces. An + application installs shared policies the same way as VMA + policies--using the mbind() system call specifying a range of + virtual addresses that map the shared object. However, unlike + VMA policies, which can be considered to be an attribute of a + range of a task's address space, shared policies apply + directly to the shared object. Thus, all tasks that attach to + the object share the policy, and all pages allocated for the + shared object, by any task, will obey the shared policy. As of 2.6.22, only shared memory segments, created by shmget() or mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. 
When shared @@ -118,11 +135,12 @@ most general to most specific: Although hugetlbfs segments now support lazy allocation, their support for shared policy has not been completed. - As mentioned above [re: VMA policies], allocations of page cache - pages for regular files mmap()ed with MAP_SHARED ignore any VMA - policy installed on the virtual address range backed by the shared - file mapping. Rather, shared page cache pages, including pages backing - private mappings that have not yet been written by the task, follow + As mentioned above :ref:`VMA policies `, + allocations of page cache pages for regular files mmap()ed + with MAP_SHARED ignore any VMA policy installed on the virtual + address range backed by the shared file mapping. Rather, + shared page cache pages, including pages backing private + mappings that have not yet been written by the task, follow task policy, if any, else System Default Policy. The shared policy infrastructure supports different policies on subset @@ -135,164 +153,175 @@ most general to most specific: one or more ranges of the region. Components of Memory Policies +----------------------------- - A Linux memory policy consists of a "mode", optional mode flags, and an - optional set of nodes. The mode determines the behavior of the policy, - the optional mode flags determine the behavior of the mode, and the - optional set of nodes can be viewed as the arguments to the policy - behavior. +A Linux memory policy consists of a "mode", optional mode flags, and +an optional set of nodes. The mode determines the behavior of the +policy, the optional mode flags determine the behavior of the mode, +and the optional set of nodes can be viewed as the arguments to the +policy behavior. - Internally, memory policies are implemented by a reference counted - structure, struct mempolicy. Details of this structure will be discussed - in context, below, as required to explain the behavior. 
+Internally, memory policies are implemented by a reference counted +structure, struct mempolicy. Details of this structure will be +discussed in context, below, as required to explain the behavior. - Linux memory policy supports the following 4 behavioral modes: +Linux memory policy supports the following 4 behavioral modes: - Default Mode--MPOL_DEFAULT: This mode is only used in the memory - policy APIs. Internally, MPOL_DEFAULT is converted to the NULL - memory policy in all policy scopes. Any existing non-default policy - will simply be removed when MPOL_DEFAULT is specified. As a result, - MPOL_DEFAULT means "fall back to the next most specific policy scope." +Default Mode--MPOL_DEFAULT + This mode is only used in the memory policy APIs. Internally, + MPOL_DEFAULT is converted to the NULL memory policy in all + policy scopes. Any existing non-default policy will simply be + removed when MPOL_DEFAULT is specified. As a result, + MPOL_DEFAULT means "fall back to the next most specific policy + scope." - For example, a NULL or default task policy will fall back to the - system default policy. A NULL or default vma policy will fall - back to the task policy. + For example, a NULL or default task policy will fall back to the + system default policy. A NULL or default vma policy will fall + back to the task policy. - When specified in one of the memory policy APIs, the Default mode - does not use the optional set of nodes. + When specified in one of the memory policy APIs, the Default mode + does not use the optional set of nodes. - It is an error for the set of nodes specified for this policy to - be non-empty. + It is an error for the set of nodes specified for this policy to + be non-empty. - MPOL_BIND: This mode specifies that memory must come from the - set of nodes specified by the policy. Memory will be allocated from - the node in the set with sufficient free memory that is closest to - the node where the allocation takes place. 
+MPOL_BIND + This mode specifies that memory must come from the set of + nodes specified by the policy. Memory will be allocated from + the node in the set with sufficient free memory that is + closest to the node where the allocation takes place. - MPOL_PREFERRED: This mode specifies that the allocation should be - attempted from the single node specified in the policy. If that - allocation fails, the kernel will search other nodes, in order of - increasing distance from the preferred node based on information - provided by the platform firmware. +MPOL_PREFERRED + This mode specifies that the allocation should be attempted + from the single node specified in the policy. If that + allocation fails, the kernel will search other nodes, in order + of increasing distance from the preferred node based on + information provided by the platform firmware. - Internally, the Preferred policy uses a single node--the - preferred_node member of struct mempolicy. When the internal - mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and - the policy is interpreted as local allocation. "Local" allocation - policy can be viewed as a Preferred policy that starts at the node - containing the cpu where the allocation takes place. + Internally, the Preferred policy uses a single node--the + preferred_node member of struct mempolicy. When the internal + mode flag MPOL_F_LOCAL is set, the preferred_node is ignored + and the policy is interpreted as local allocation. "Local" + allocation policy can be viewed as a Preferred policy that + starts at the node containing the cpu where the allocation + takes place. - It is possible for the user to specify that local allocation is - always preferred by passing an empty nodemask with this mode. - If an empty nodemask is passed, the policy cannot use the - MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described - below. 
+ It is possible for the user to specify that local allocation + is always preferred by passing an empty nodemask with this + mode. If an empty nodemask is passed, the policy cannot use + the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags + described below. - MPOL_INTERLEAVED: This mode specifies that page allocations be - interleaved, on a page granularity, across the nodes specified in - the policy. This mode also behaves slightly differently, based on - the context where it is used: +MPOL_INTERLEAVED + This mode specifies that page allocations be interleaved, on a + page granularity, across the nodes specified in the policy. + This mode also behaves slightly differently, based on the + context where it is used: - For allocation of anonymous pages and shared memory pages, - Interleave mode indexes the set of nodes specified by the policy - using the page offset of the faulting address into the segment - [VMA] containing the address modulo the number of nodes specified - by the policy. It then attempts to allocate a page, starting at - the selected node, as if the node had been specified by a Preferred - policy or had been selected by a local allocation. That is, - allocation will follow the per node zonelist. + For allocation of anonymous pages and shared memory pages, + Interleave mode indexes the set of nodes specified by the + policy using the page offset of the faulting address into the + segment [VMA] containing the address modulo the number of + nodes specified by the policy. It then attempts to allocate a + page, starting at the selected node, as if the node had been + specified by a Preferred policy or had been selected by a + local allocation. That is, allocation will follow the per + node zonelist. - For allocation of page cache pages, Interleave mode indexes the set - of nodes specified by the policy using a node counter maintained - per task. This counter wraps around to the lowest specified node - after it reaches the highest specified node. 
This will tend to - spread the pages out over the nodes specified by the policy based - on the order in which they are allocated, rather than based on any - page offset into an address range or file. During system boot up, - the temporary interleaved system default policy works in this - mode. + For allocation of page cache pages, Interleave mode indexes + the set of nodes specified by the policy using a node counter + maintained per task. This counter wraps around to the lowest + specified node after it reaches the highest specified node. + This will tend to spread the pages out over the nodes + specified by the policy based on the order in which they are + allocated, rather than based on any page offset into an + address range or file. During system boot up, the temporary + interleaved system default policy works in this mode. - Linux memory policy supports the following optional mode flags: +Linux memory policy supports the following optional mode flags: - MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by +MPOL_F_STATIC_NODES + This flag specifies that the nodemask passed by the user should not be remapped if the task or VMA's set of allowed nodes changes after the memory policy has been defined. - Without this flag, anytime a mempolicy is rebound because of a - change in the set of allowed nodes, the node (Preferred) or - nodemask (Bind, Interleave) is remapped to the new set of - allowed nodes. This may result in nodes being used that were - previously undesired. + Without this flag, anytime a mempolicy is rebound because of a + change in the set of allowed nodes, the node (Preferred) or + nodemask (Bind, Interleave) is remapped to the new set of + allowed nodes. This may result in nodes being used that were + previously undesired. - With this flag, if the user-specified nodes overlap with the - nodes allowed by the task's cpuset, then the memory policy is - applied to their intersection. 
If the two sets of nodes do not - overlap, the Default policy is used. + With this flag, if the user-specified nodes overlap with the + nodes allowed by the task's cpuset, then the memory policy is + applied to their intersection. If the two sets of nodes do not + overlap, the Default policy is used. - For example, consider a task that is attached to a cpuset with - mems 1-3 that sets an Interleave policy over the same set. If - the cpuset's mems change to 3-5, the Interleave will now occur - over nodes 3, 4, and 5. With this flag, however, since only node - 3 is allowed from the user's nodemask, the "interleave" only - occurs over that node. If no nodes from the user's nodemask are - now allowed, the Default behavior is used. + For example, consider a task that is attached to a cpuset with + mems 1-3 that sets an Interleave policy over the same set. If + the cpuset's mems change to 3-5, the Interleave will now occur + over nodes 3, 4, and 5. With this flag, however, since only node + 3 is allowed from the user's nodemask, the "interleave" only + occurs over that node. If no nodes from the user's nodemask are + now allowed, the Default behavior is used. - MPOL_F_STATIC_NODES cannot be combined with the - MPOL_F_RELATIVE_NODES flag. It also cannot be used for - MPOL_PREFERRED policies that were created with an empty nodemask - (local allocation). + MPOL_F_STATIC_NODES cannot be combined with the + MPOL_F_RELATIVE_NODES flag. It also cannot be used for + MPOL_PREFERRED policies that were created with an empty nodemask + (local allocation). - MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed +MPOL_F_RELATIVE_NODES + This flag specifies that the nodemask passed by the user will be mapped relative to the set of the task or VMA's set of allowed nodes. The kernel stores the user-passed nodemask, and if the allowed nodes changes, then that original nodemask will be remapped relative to the new set of allowed nodes. 
- Without this flag (and without MPOL_F_STATIC_NODES), anytime a - mempolicy is rebound because of a change in the set of allowed - nodes, the node (Preferred) or nodemask (Bind, Interleave) is - remapped to the new set of allowed nodes. That remap may not - preserve the relative nature of the user's passed nodemask to its - set of allowed nodes upon successive rebinds: a nodemask of - 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of - allowed nodes is restored to its original state. + Without this flag (and without MPOL_F_STATIC_NODES), anytime a + mempolicy is rebound because of a change in the set of allowed + nodes, the node (Preferred) or nodemask (Bind, Interleave) is + remapped to the new set of allowed nodes. That remap may not + preserve the relative nature of the user's passed nodemask to its + set of allowed nodes upon successive rebinds: a nodemask of + 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of + allowed nodes is restored to its original state. - With this flag, the remap is done so that the node numbers from - the user's passed nodemask are relative to the set of allowed - nodes. In other words, if nodes 0, 2, and 4 are set in the user's - nodemask, the policy will be effected over the first (and in the - Bind or Interleave case, the third and fifth) nodes in the set of - allowed nodes. The nodemask passed by the user represents nodes - relative to task or VMA's set of allowed nodes. + With this flag, the remap is done so that the node numbers from + the user's passed nodemask are relative to the set of allowed + nodes. In other words, if nodes 0, 2, and 4 are set in the user's + nodemask, the policy will be effected over the first (and in the + Bind or Interleave case, the third and fifth) nodes in the set of + allowed nodes. The nodemask passed by the user represents nodes + relative to task or VMA's set of allowed nodes. 
- If the user's nodemask includes nodes that are outside the range - of the new set of allowed nodes (for example, node 5 is set in - the user's nodemask when the set of allowed nodes is only 0-3), - then the remap wraps around to the beginning of the nodemask and, - if not already set, sets the node in the mempolicy nodemask. + If the user's nodemask includes nodes that are outside the range + of the new set of allowed nodes (for example, node 5 is set in + the user's nodemask when the set of allowed nodes is only 0-3), + then the remap wraps around to the beginning of the nodemask and, + if not already set, sets the node in the mempolicy nodemask. - For example, consider a task that is attached to a cpuset with - mems 2-5 that sets an Interleave policy over the same set with - MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the - interleave now occurs over nodes 3,5-7. If the cpuset's mems - then change to 0,2-3,5, then the interleave occurs over nodes - 0,2-3,5. + For example, consider a task that is attached to a cpuset with + mems 2-5 that sets an Interleave policy over the same set with + MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the + interleave now occurs over nodes 3,5-7. If the cpuset's mems + then change to 0,2-3,5, then the interleave occurs over nodes + 0,2-3,5. - Thanks to the consistent remapping, applications preparing - nodemasks to specify memory policies using this flag should - disregard their current, actual cpuset imposed memory placement - and prepare the nodemask as if they were always located on - memory nodes 0 to N-1, where N is the number of memory nodes the - policy is intended to manage. Let the kernel then remap to the - set of memory nodes allowed by the task's cpuset, as that may - change over time. 
+ Thanks to the consistent remapping, applications preparing + nodemasks to specify memory policies using this flag should + disregard their current, actual cpuset imposed memory placement + and prepare the nodemask as if they were always located on + memory nodes 0 to N-1, where N is the number of memory nodes the + policy is intended to manage. Let the kernel then remap to the + set of memory nodes allowed by the task's cpuset, as that may + change over time. - MPOL_F_RELATIVE_NODES cannot be combined with the - MPOL_F_STATIC_NODES flag. It also cannot be used for - MPOL_PREFERRED policies that were created with an empty nodemask - (local allocation). + MPOL_F_RELATIVE_NODES cannot be combined with the + MPOL_F_STATIC_NODES flag. It also cannot be used for + MPOL_PREFERRED policies that were created with an empty nodemask + (local allocation). -MEMORY POLICY REFERENCE COUNTING +Memory Policy Reference Counting +================================ To resolve use/free races, struct mempolicy contains an atomic reference count field. Internal interfaces, mpol_get()/mpol_put() increment and @@ -360,60 +389,62 @@ follows: or by prefaulting the entire shared memory region into memory and locking it down. However, this might not be appropriate for all applications. -MEMORY POLICY APIs +Memory Policy APIs Linux supports 3 system calls for controlling memory policy. These APIS always affect only the calling task, the calling task's address space, or some shared object mapped into the calling task's address space. - Note: the headers that define these APIs and the parameter data types - for user space applications reside in a package that is not part of - the Linux kernel. The kernel system call interfaces, with the 'sys_' - prefix, are defined in ; the mode and flag - definitions are defined in . +.. note:: + the headers that define these APIs and the parameter data types for + user space applications reside in a package that is not part of the + Linux kernel. 
The kernel system call interfaces, with the 'sys\_' + prefix, are defined in ; the mode and flag + definitions are defined in . -Set [Task] Memory Policy: +Set [Task] Memory Policy:: long set_mempolicy(int mode, const unsigned long *nmask, unsigned long maxnode); - Set's the calling task's "task/process memory policy" to mode - specified by the 'mode' argument and the set of nodes defined - by 'nmask'. 'nmask' points to a bit mask of node ids containing - at least 'maxnode' ids. Optional mode flags may be passed by - combining the 'mode' argument with the flag (for example: - MPOL_INTERLEAVE | MPOL_F_STATIC_NODES). +Set's the calling task's "task/process memory policy" to mode +specified by the 'mode' argument and the set of nodes defined by +'nmask'. 'nmask' points to a bit mask of node ids containing at least +'maxnode' ids. Optional mode flags may be passed by combining the +'mode' argument with the flag (for example: MPOL_INTERLEAVE | +MPOL_F_STATIC_NODES). - See the set_mempolicy(2) man page for more details +See the set_mempolicy(2) man page for more details -Get [Task] Memory Policy or Related Information +Get [Task] Memory Policy or Related Information:: long get_mempolicy(int *mode, const unsigned long *nmask, unsigned long maxnode, void *addr, int flags); - Queries the "task/process memory policy" of the calling task, or - the policy or location of a specified virtual address, depending - on the 'flags' argument. +Queries the "task/process memory policy" of the calling task, or the +policy or location of a specified virtual address, depending on the +'flags' argument. 
- See the get_mempolicy(2) man page for more details +See the get_mempolicy(2) man page for more details -Install VMA/Shared Policy for a Range of Task's Address Space +Install VMA/Shared Policy for a Range of Task's Address Space:: long mbind(void *start, unsigned long len, int mode, const unsigned long *nmask, unsigned long maxnode, unsigned flags); - mbind() installs the policy specified by (mode, nmask, maxnodes) as - a VMA policy for the range of the calling task's address space - specified by the 'start' and 'len' arguments. Additional actions - may be requested via the 'flags' argument. +mbind() installs the policy specified by (mode, nmask, maxnodes) as a +VMA policy for the range of the calling task's address space specified +by the 'start' and 'len' arguments. Additional actions may be +requested via the 'flags' argument. - See the mbind(2) man page for more details. +See the mbind(2) man page for more details. -MEMORY POLICY COMMAND LINE INTERFACE +Memory Policy Command Line Interface +==================================== Although not strictly part of the Linux implementation of memory policy, a command line tool, numactl(8), exists that allows one to: @@ -428,8 +459,10 @@ containing the memory policy system call wrappers. Some distributions package the headers and compile-time libraries in a separate development package. +.. _mem_pol_and_cpusets: -MEMORY POLICIES AND CPUSETS +Memory Policies and cpusets +=========================== Memory policies work within cpusets as described above. 
For memory policies that require a node or set of nodes, the nodes are restricted to the set of From 8d83d826c2951e18217e31601fb82b41006c9162 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:30 +0200 Subject: [PATCH 017/103] docs/vm: overcommit-accounting: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/overcommit-accounting | 105 +++++++++++++------------ 1 file changed, 56 insertions(+), 49 deletions(-) diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting index cbfaaa674118..0dd54bbe4afa 100644 --- a/Documentation/vm/overcommit-accounting +++ b/Documentation/vm/overcommit-accounting @@ -1,80 +1,87 @@ +.. _overcommit_accounting: + +===================== +Overcommit Accounting +===================== + The Linux kernel supports the following overcommit handling modes -0 - Heuristic overcommit handling. Obvious overcommits of - address space are refused. Used for a typical system. It - ensures a seriously wild allocation fails while allowing - overcommit to reduce swap usage. root is allowed to - allocate slightly more memory in this mode. This is the - default. +0 + Heuristic overcommit handling. Obvious overcommits of address + space are refused. Used for a typical system. It ensures a + seriously wild allocation fails while allowing overcommit to + reduce swap usage. root is allowed to allocate slightly more + memory in this mode. This is the default. -1 - Always overcommit. Appropriate for some scientific - applications. Classic example is code using sparse arrays - and just relying on the virtual memory consisting almost - entirely of zero pages. +1 + Always overcommit. Appropriate for some scientific + applications. Classic example is code using sparse arrays and + just relying on the virtual memory consisting almost entirely + of zero pages. -2 - Don't overcommit. 
The total address space commit - for the system is not permitted to exceed swap + a - configurable amount (default is 50%) of physical RAM. - Depending on the amount you use, in most situations - this means a process will not be killed while accessing - pages but will receive errors on memory allocation as - appropriate. +2 + Don't overcommit. The total address space commit for the + system is not permitted to exceed swap + a configurable amount + (default is 50%) of physical RAM. Depending on the amount you + use, in most situations this means a process will not be + killed while accessing pages but will receive errors on memory + allocation as appropriate. - Useful for applications that want to guarantee their - memory allocations will be available in the future - without having to initialize every page. + Useful for applications that want to guarantee their memory + allocations will be available in the future without having to + initialize every page. -The overcommit policy is set via the sysctl `vm.overcommit_memory'. +The overcommit policy is set via the sysctl ``vm.overcommit_memory``. -The overcommit amount can be set via `vm.overcommit_ratio' (percentage) -or `vm.overcommit_kbytes' (absolute value). +The overcommit amount can be set via ``vm.overcommit_ratio`` (percentage) +or ``vm.overcommit_kbytes`` (absolute value). The current overcommit limit and amount committed are viewable in -/proc/meminfo as CommitLimit and Committed_AS respectively. +``/proc/meminfo`` as CommitLimit and Committed_AS respectively. Gotchas -------- +======= The C language stack growth does an implicit mremap. If you want absolute -guarantees and run close to the edge you MUST mmap your stack for the +guarantees and run close to the edge you MUST mmap your stack for the largest size you think you will need. For typical stack usage this does not matter much but it's a corner case if you really really care -In mode 2 the MAP_NORESERVE flag is ignored. 
+In mode 2 the MAP_NORESERVE flag is ignored. How It Works ------------- +============ The overcommit is based on the following rules For a file backed map - SHARED or READ-only - 0 cost (the file is the map not swap) - PRIVATE WRITABLE - size of mapping per instance + | SHARED or READ-only - 0 cost (the file is the map not swap) + | PRIVATE WRITABLE - size of mapping per instance -For an anonymous or /dev/zero map - SHARED - size of mapping - PRIVATE READ-only - 0 cost (but of little use) - PRIVATE WRITABLE - size of mapping per instance +For an anonymous or ``/dev/zero`` map + | SHARED - size of mapping + | PRIVATE READ-only - 0 cost (but of little use) + | PRIVATE WRITABLE - size of mapping per instance Additional accounting - Pages made writable copies by mmap - shmfs memory drawn from the same pool + | Pages made writable copies by mmap + | shmfs memory drawn from the same pool Status ------- +====== -o We account mmap memory mappings -o We account mprotect changes in commit -o We account mremap changes in size -o We account brk -o We account munmap -o We report the commit status in /proc -o Account and check on fork -o Review stack handling/building on exec -o SHMfs accounting -o Implement actual limit enforcement +* We account mmap memory mappings +* We account mprotect changes in commit +* We account mremap changes in size +* We account brk +* We account munmap +* We report the commit status in /proc +* Account and check on fork +* Review stack handling/building on exec +* SHMfs accounting +* Implement actual limit enforcement To Do ------ -o Account ptrace pages (this is hard) +===== +* Account ptrace pages (this is hard) From 4a832588f4ceda7054b9b89b3d46bede4e646786 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:31 +0200 Subject: [PATCH 018/103] docs/vm: page_frags convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/page_frags | 5 ++++- 1 file changed, 4 insertions(+), 1 
deletion(-) diff --git a/Documentation/vm/page_frags b/Documentation/vm/page_frags index a6714565dbf9..637cc49d1b2f 100644 --- a/Documentation/vm/page_frags +++ b/Documentation/vm/page_frags @@ -1,5 +1,8 @@ +.. _page_frags: + +============== Page fragments --------------- +============== A page fragment is an arbitrary-length arbitrary-offset area of memory which resides within a 0 or higher order compound page. Multiple From 137b45527e9d84a05b39c3501d8e4faf966cc9cb Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:32 +0200 Subject: [PATCH 019/103] docs/vm: numa: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/numa | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/Documentation/vm/numa b/Documentation/vm/numa index a31b85b9bb88..c81e7c56f0f9 100644 --- a/Documentation/vm/numa +++ b/Documentation/vm/numa @@ -1,6 +1,10 @@ +.. _numa: + Started Nov 1999 by Kanoj Sarcar +============= What is NUMA? +============= This question can be answered from a couple of perspectives: the hardware view and the Linux software view. From 25c3bf8aaf23d245f03fc8f96554cfd10b94977c Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:33 +0200 Subject: [PATCH 020/103] docs/vm: pagemap.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/pagemap.txt | 156 +++++++++++++++++++---------------- 1 file changed, 85 insertions(+), 71 deletions(-) diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index eafcefa15261..bd6d71740c88 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -1,13 +1,16 @@ -pagemap, from the userspace perspective ---------------------------------------- +.. 
_pagemap: + +====================================== +pagemap from the Userspace Perspective +====================================== pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow userspace programs to examine the page tables and related information by -reading files in /proc. +reading files in ``/proc``. There are four components to pagemap: - * /proc/pid/pagemap. This file lets a userspace process find out which + * ``/proc/pid/pagemap``. This file lets a userspace process find out which physical frame each virtual page is mapped to. It contains one 64-bit value for each virtual page, containing the following data (from fs/proc/task_mmu.c, above pagemap_read): @@ -37,24 +40,24 @@ There are four components to pagemap: determine which areas of memory are actually mapped and llseek to skip over unmapped regions. - * /proc/kpagecount. This file contains a 64-bit count of the number of + * ``/proc/kpagecount``. This file contains a 64-bit count of the number of times each page is mapped, indexed by PFN. - * /proc/kpageflags. This file contains a 64-bit set of flags for each + * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each page, indexed by PFN. - The flags are (from fs/proc/page.c, above kpageflags_read): + The flags are (from ``fs/proc/page.c``, above kpageflags_read): - 0. LOCKED - 1. ERROR - 2. REFERENCED - 3. UPTODATE - 4. DIRTY - 5. LRU - 6. ACTIVE - 7. SLAB - 8. WRITEBACK - 9. RECLAIM + 0. LOCKED + 1. ERROR + 2. REFERENCED + 3. UPTODATE + 4. DIRTY + 5. LRU + 6. ACTIVE + 7. SLAB + 8. WRITEBACK + 9. RECLAIM 10. BUDDY 11. MMAP 12. ANON @@ -72,98 +75,108 @@ There are four components to pagemap: 24. ZERO_PAGE 25. IDLE - * /proc/kpagecgroup. This file contains a 64-bit inode number of the + * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. 
Short descriptions to the page flags: +===================================== - 0. LOCKED - page is being locked for exclusive access, eg. by undergoing read/write IO - - 7. SLAB - page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator - When compound page is used, SLUB/SLQB will only set this flag on the head - page; SLOB will not flag it at all. - -10. BUDDY +0 - LOCKED + page is being locked for exclusive access, eg. by undergoing read/write IO +7 - SLAB + page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator + When compound page is used, SLUB/SLQB will only set this flag on the head + page; SLOB will not flag it at all. +10 - BUDDY a free memory block managed by the buddy system allocator The buddy system organizes free memory in blocks of various orders. An order N block has 2^N physically contiguous pages, with the BUDDY flag set for and _only_ for the first page. - -15. COMPOUND_HEAD -16. COMPOUND_TAIL +15 - COMPOUND_HEAD A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its head page and T donates its tail page(s). The major consumers of compound pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc. memory allocators and various device drivers. However in this interface, only huge/giga pages are made visible to end users. -17. HUGE +16 - COMPOUND_TAIL + A compound page tail (see description above). +17 - HUGE this is an integral part of a HugeTLB page - -19. HWPOISON +19 - HWPOISON hardware detected memory corruption on this page: don't touch the data! - -20. NOPAGE +20 - NOPAGE no page frame exists at the requested address - -21. KSM +21 - KSM identical memory pages dynamically shared between one or more processes - -22. THP +22 - THP contiguous pages which construct transparent hugepages - -23. BALLOON +23 - BALLOON balloon compaction page - -24. ZERO_PAGE +24 - ZERO_PAGE zero page for pfn_zero or huge_zero page - -25. 
IDLE +25 - IDLE page has not been accessed since it was marked idle (see Documentation/vm/idle_page_tracking.txt). Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag - is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first. + is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. - [IO related page flags] - 1. ERROR IO error occurred - 3. UPTODATE page has up-to-date data - ie. for file backed page: (in-memory data revision >= on-disk one) - 4. DIRTY page has been written to, hence contains new data - ie. for file backed page: (in-memory data revision > on-disk one) - 8. WRITEBACK page is being synced to disk +IO related page flags +--------------------- - [LRU related page flags] - 5. LRU page is in one of the LRU lists - 6. ACTIVE page is in the active LRU list -18. UNEVICTABLE page is in the unevictable (non-)LRU list - It is somehow pinned and not a candidate for LRU page reclaims, - eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments - 2. REFERENCED page has been referenced since last LRU list enqueue/requeue - 9. RECLAIM page will be reclaimed soon after its pageout IO completed -11. MMAP a memory mapped page -12. ANON a memory mapped page that is not part of a file -13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry -14. SWAPBACKED page is backed by swap/RAM +1 - ERROR + IO error occurred +3 - UPTODATE + page has up-to-date data + ie. for file backed page: (in-memory data revision >= on-disk one) +4 - DIRTY + page has been written to, hence contains new data + ie. for file backed page: (in-memory data revision > on-disk one) +8 - WRITEBACK + page is being synced to disk + +LRU related page flags +---------------------- + +5 - LRU + page is in one of the LRU lists +6 - ACTIVE + page is in the active LRU list +18 - UNEVICTABLE + page is in the unevictable (non-)LRU list It is somehow pinned and + not a candidate for LRU page reclaims, eg. 
ramfs pages, + shmctl(SHM_LOCK) and mlock() memory segments +2 - REFERENCED + page has been referenced since last LRU list enqueue/requeue +9 - RECLAIM + page will be reclaimed soon after its pageout IO completed +11 - MMAP + a memory mapped page +12 - ANON + a memory mapped page that is not part of a file +13 - SWAPCACHE + page is mapped to swap space, ie. has an associated swap entry +14 - SWAPBACKED + page is backed by swap/RAM The page-types tool in the tools/vm directory can be used to query the above flags. -Using pagemap to do something useful: +Using pagemap to do something useful +==================================== The general procedure for using pagemap to find out about a process' memory usage goes like this: - 1. Read /proc/pid/maps to determine which parts of the memory space are + 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are mapped to what. 2. Select the maps you are interested in -- all of them, or a particular library, or the stack or the heap, etc. - 3. Open /proc/pid/pagemap and seek to the pages you would like to examine. + 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. 4. Read a u64 for each page from pagemap. - 5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just - read, seek to that entry in the file, and read the data you want. + 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you + just read, seek to that entry in the file, and read the data you want. For example, to find the "unique set size" (USS), which is the amount of memory that a process is using that is not shared with any other process, @@ -171,7 +184,8 @@ you can go through every map in the process, find the PFNs, look those up in kpagecount, and tally up the number of pages that are only referenced once. 
-Other notes: +Other notes +=========== Reading from any of the files will return -EINVAL if you are not starting the read on an 8-byte boundary (e.g., if you sought an odd number of bytes From 1b7599b5de7240e5182f0d9bc7fc98aac7970251 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:34 +0200 Subject: [PATCH 021/103] docs/vm: page_migration: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/page_migration | 139 +++++++++++++++++--------------- 1 file changed, 72 insertions(+), 67 deletions(-) diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration index 0478ae2ad44a..07b67a821a12 100644 --- a/Documentation/vm/page_migration +++ b/Documentation/vm/page_migration @@ -1,5 +1,8 @@ +.. _page_migration: + +============== Page migration --------------- +============== Page migration allows the moving of the physical location of pages between nodes in a numa system while the process is running. This means that the @@ -20,7 +23,7 @@ Page migration functions are provided by the numactl package by Andi Kleen (a version later than 0.9.3 is required. Get it from ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma which provides an interface similar to other numa functionality for page -migration. cat /proc//numa_maps allows an easy review of where the +migration. cat ``/proc//numa_maps`` allows an easy review of where the pages of a process are located. See also the numa_maps documentation in the proc(5) man page. @@ -56,8 +59,8 @@ description for those trying to use migrate_pages() from the kernel (for userspace usage see the Andi Kleen's numactl package mentioned above) and then a low level description of how the low level details work. -A. In kernel use of migrate_pages() ------------------------------------ +In kernel use of migrate_pages() +================================ 1. Remove pages from the LRU. @@ -78,8 +81,8 @@ A. 
In kernel use of migrate_pages() the new page for each page that is considered for moving. -B. How migrate_pages() works ----------------------------- +How migrate_pages() works +========================= migrate_pages() does several passes over its list of pages. A page is moved if all references to a page are removable at the time. The page has @@ -142,8 +145,8 @@ Steps: 20. The new page is moved to the LRU and can be scanned by the swapper etc again. -C. Non-LRU page migration -------------------------- +Non-LRU page migration +====================== Although original migration aimed for reducing the latency of memory access for NUMA, compaction who want to create high-order page is also main customer. @@ -164,89 +167,91 @@ migration path. If a driver want to make own pages movable, it should define three functions which are function pointers of struct address_space_operations. -1. bool (*isolate_page) (struct page *page, isolate_mode_t mode); +1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);`` -What VM expects on isolate_page function of driver is to return *true* -if driver isolates page successfully. On returing true, VM marks the page -as PG_isolated so concurrent isolation in several CPUs skip the page -for isolation. If a driver cannot isolate the page, it should return *false*. + What VM expects on isolate_page function of driver is to return *true* + if driver isolates page successfully. On returing true, VM marks the page + as PG_isolated so concurrent isolation in several CPUs skip the page + for isolation. If a driver cannot isolate the page, it should return *false*. -Once page is successfully isolated, VM uses page.lru fields so driver -shouldn't expect to preserve values in that fields. + Once page is successfully isolated, VM uses page.lru fields so driver + shouldn't expect to preserve values in that fields. -2. 
int (*migratepage) (struct address_space *mapping, - struct page *newpage, struct page *oldpage, enum migrate_mode); +2. ``int (*migratepage) (struct address_space *mapping,`` +| ``struct page *newpage, struct page *oldpage, enum migrate_mode);`` -After isolation, VM calls migratepage of driver with isolated page. -The function of migratepage is to move content of the old page to new page -and set up fields of struct page newpage. Keep in mind that you should -indicate to the VM the oldpage is no longer movable via __ClearPageMovable() -under page_lock if you migrated the oldpage successfully and returns -MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver -can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time -because VM interprets -EAGAIN as "temporal migration failure". On returning -any error except -EAGAIN, VM will give up the page migration without retrying -in this time. + After isolation, VM calls migratepage of driver with isolated page. + The function of migratepage is to move content of the old page to new page + and set up fields of struct page newpage. Keep in mind that you should + indicate to the VM the oldpage is no longer movable via __ClearPageMovable() + under page_lock if you migrated the oldpage successfully and returns + MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver + can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time + because VM interprets -EAGAIN as "temporal migration failure". On returning + any error except -EAGAIN, VM will give up the page migration without retrying + in this time. -Driver shouldn't touch page.lru field VM using in the functions. + Driver shouldn't touch page.lru field VM using in the functions. -3. void (*putback_page)(struct page *); +3. 
``void (*putback_page)(struct page *);`` -If migration fails on isolated page, VM should return the isolated page -to the driver so VM calls driver's putback_page with migration failed page. -In this function, driver should put the isolated page back to the own data -structure. + If migration fails on isolated page, VM should return the isolated page + to the driver so VM calls driver's putback_page with migration failed page. + In this function, driver should put the isolated page back to the own data + structure. 4. non-lru movable page flags -There are two page flags for supporting non-lru movable page. + There are two page flags for supporting non-lru movable page. -* PG_movable + * PG_movable -Driver should use the below function to make page movable under page_lock. + Driver should use the below function to make page movable under page_lock:: void __SetPageMovable(struct page *page, struct address_space *mapping) -It needs argument of address_space for registering migration family functions -which will be called by VM. Exactly speaking, PG_movable is not a real flag of -struct page. Rather than, VM reuses page->mapping's lower bits to represent it. + It needs argument of address_space for registering migration + family functions which will be called by VM. Exactly speaking, + PG_movable is not a real flag of struct page. Rather than, VM + reuses page->mapping's lower bits to represent it. +:: #define PAGE_MAPPING_MOVABLE 0x2 page->mapping = page->mapping | PAGE_MAPPING_MOVABLE; -so driver shouldn't access page->mapping directly. Instead, driver should -use page_mapping which mask off the low two bits of page->mapping under -page lock so it can get right struct address_space. + so driver shouldn't access page->mapping directly. Instead, driver should + use page_mapping which mask off the low two bits of page->mapping under + page lock so it can get right struct address_space. -For testing of non-lru movable page, VM supports __PageMovable function. 
-However, it doesn't guarantee to identify non-lru movable page because -page->mapping field is unified with other variables in struct page. -As well, if driver releases the page after isolation by VM, page->mapping -doesn't have stable value although it has PAGE_MAPPING_MOVABLE -(Look at __ClearPageMovable). But __PageMovable is cheap to catch whether -page is LRU or non-lru movable once the page has been isolated. Because -LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also -good for just peeking to test non-lru movable pages before more expensive -checking with lock_page in pfn scanning to select victim. + For testing of non-lru movable page, VM supports __PageMovable function. + However, it doesn't guarantee to identify non-lru movable page because + page->mapping field is unified with other variables in struct page. + As well, if driver releases the page after isolation by VM, page->mapping + doesn't have stable value although it has PAGE_MAPPING_MOVABLE + (Look at __ClearPageMovable). But __PageMovable is cheap to catch whether + page is LRU or non-lru movable once the page has been isolated. Because + LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also + good for just peeking to test non-lru movable pages before more expensive + checking with lock_page in pfn scanning to select victim. -For guaranteeing non-lru movable page, VM provides PageMovable function. -Unlike __PageMovable, PageMovable functions validates page->mapping and -mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden -destroying of page->mapping. + For guaranteeing non-lru movable page, VM provides PageMovable function. + Unlike __PageMovable, PageMovable functions validates page->mapping and + mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden + destroying of page->mapping. -Driver using __SetPageMovable should clear the flag via __ClearMovablePage -under page_lock before the releasing the page. 
+ Driver using __SetPageMovable should clear the flag via __ClearMovablePage + under page_lock before the releasing the page. -* PG_isolated + * PG_isolated -To prevent concurrent isolation among several CPUs, VM marks isolated page -as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru -movable page, it can skip it. Driver doesn't need to manipulate the flag -because VM will set/clear it automatically. Keep in mind that if driver -sees PG_isolated page, it means the page have been isolated by VM so it -shouldn't touch page.lru field. -PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag -for own purpose. + To prevent concurrent isolation among several CPUs, VM marks isolated page + as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru + movable page, it can skip it. Driver doesn't need to manipulate the flag + because VM will set/clear it automatically. Keep in mind that if driver + sees PG_isolated page, it means the page have been isolated by VM so it + shouldn't touch page.lru field. + PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag + for own purpose. Christoph Lameter, May 8, 2006. Minchan Kim, Mar 28, 2016. From f227e04e90fd4947b5f8442bf6e02ef6a65a6c68 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:35 +0200 Subject: [PATCH 022/103] docs/vm: page_owner: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/page_owner.txt | 38 ++++++++++++++++++++------------- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/Documentation/vm/page_owner.txt b/Documentation/vm/page_owner.txt index ffff1439076a..0ed5ab8c7ab4 100644 --- a/Documentation/vm/page_owner.txt +++ b/Documentation/vm/page_owner.txt @@ -1,7 +1,11 @@ -page owner: Tracking about who allocated each page ------------------------------------------------------------ +.. 
_page_owner: -* Introduction +================================================== +page owner: Tracking about who allocated each page +================================================== + +Introduction +============ page owner is for the tracking about who allocated each page. It can be used to debug memory leak or to find a memory hogger. @@ -34,13 +38,15 @@ not affect to allocation performance, especially if the static keys jump label patching functionality is available. Following is the kernel's code size change due to this facility. -- Without page owner - text data bss dec hex filename - 40662 1493 644 42799 a72f mm/page_alloc.o +- Without page owner:: -- With page owner text data bss dec hex filename - 40892 1493 644 43029 a815 mm/page_alloc.o + 40662 1493 644 42799 a72f mm/page_alloc.o + +- With page owner:: + + text data bss dec hex filename + 40892 1493 644 43029 a815 mm/page_alloc.o 1427 24 8 1459 5b3 mm/page_ext.o 2722 50 0 2772 ad4 mm/page_owner.o @@ -62,21 +68,23 @@ are catched and marked, although they are mostly allocated from struct page extension feature. Anyway, after that, no page is left in un-tracking state. -* Usage +Usage +===== + +1) Build user-space helper:: -1) Build user-space helper cd tools/vm make page_owner_sort -2) Enable page owner - Add "page_owner=on" to boot cmdline. +2) Enable page owner: add "page_owner=on" to boot cmdline. 3) Do the job what you want to debug -4) Analyze information from page owner +4) Analyze information from page owner:: + cat /sys/kernel/debug/page_owner > page_owner_full.txt grep -v ^PFN page_owner_full.txt > page_owner.txt ./page_owner_sort page_owner.txt sorted_page_owner.txt - See the result about who allocated each page - in the sorted_page_owner.txt. + See the result about who allocated each page + in the ``sorted_page_owner.txt``. 
From acc9f3a35cd514392daf62404f888c060dee088b Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:36 +0200 Subject: [PATCH 023/103] docs/vm: remap_file_pages.txt: conert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/remap_file_pages.txt | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt index f609142f406a..7bef6718e3a9 100644 --- a/Documentation/vm/remap_file_pages.txt +++ b/Documentation/vm/remap_file_pages.txt @@ -1,3 +1,9 @@ +.. _remap_file_pages: + +============================== +remap_file_pages() system call +============================== + The remap_file_pages() system call is used to create a nonlinear mapping, that is, a mapping in which the pages of the file are mapped into a nonsequential order in memory. The advantage of using remap_file_pages() From 0c14398bf22c30d730fb1f3f4deb9cb6deb1b584 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:37 +0200 Subject: [PATCH 024/103] docs/vm: slub.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/slub.txt | 321 ++++++++++++++++++++------------------ 1 file changed, 170 insertions(+), 151 deletions(-) diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt index 84652419bff2..3a775fd64e2d 100644 --- a/Documentation/vm/slub.txt +++ b/Documentation/vm/slub.txt @@ -1,5 +1,8 @@ +.. _slub: + +========================== Short users guide for SLUB --------------------------- +========================== The basic philosophy of SLUB is very different from SLAB. SLAB requires rebuilding the kernel to activate debug options for all @@ -8,18 +11,19 @@ SLUB can enable debugging only for selected slabs in order to avoid an impact on overall system performance which may make a bug more difficult to find. 
-In order to switch debugging on one can add an option "slub_debug" +In order to switch debugging on one can add an option ``slub_debug`` to the kernel command line. That will enable full debugging for all slabs. -Typically one would then use the "slabinfo" command to get statistical -data and perform operation on the slabs. By default slabinfo only lists +Typically one would then use the ``slabinfo`` command to get statistical +data and perform operation on the slabs. By default ``slabinfo`` only lists slabs that have data in them. See "slabinfo -h" for more options when -running the command. slabinfo can be compiled with +running the command. ``slabinfo`` can be compiled with +:: -gcc -o slabinfo tools/vm/slabinfo.c + gcc -o slabinfo tools/vm/slabinfo.c -Some of the modes of operation of slabinfo require that slub debugging +Some of the modes of operation of ``slabinfo`` require that slub debugging be enabled on the command line. F.e. no tracking information will be available without debugging on and validation can only partially be performed if debugging was not switched on. @@ -27,14 +31,17 @@ be performed if debugging was not switched on. Some more sophisticated uses of slub_debug: ------------------------------------------- -Parameters may be given to slub_debug. If none is specified then full +Parameters may be given to ``slub_debug``. If none is specified then full debugging is enabled. Format: -slub_debug= Enable options for all slabs +slub_debug= + Enable options for all slabs slub_debug=, - Enable options only for select slabs + Enable options only for select slabs + + +Possible debug options are:: -Possible debug options are F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS Sorry SLAB legacy issues) Z Red zoning @@ -47,18 +54,18 @@ Possible debug options are - Switch all debugging off (useful if the kernel is configured with CONFIG_SLUB_DEBUG_ON) -F.e. in order to boot just with sanity checks and red zoning one would specify: +F.e. 
in order to boot just with sanity checks and red zoning one would specify:: slub_debug=FZ -Trying to find an issue in the dentry cache? Try +Trying to find an issue in the dentry cache? Try:: slub_debug=,dentry to only enable debugging on the dentry cache. Red zoning and tracking may realign the slab. We can just apply sanity checks -to the dentry cache with +to the dentry cache with:: slub_debug=F,dentry @@ -66,15 +73,15 @@ Debugging options may require the minimum possible slab order to increase as a result of storing the metadata (for example, caches with PAGE_SIZE object sizes). This has a higher liklihood of resulting in slab allocation errors in low memory situations or if there's high fragmentation of memory. To -switch off debugging for such caches by default, use +switch off debugging for such caches by default, use:: slub_debug=O In case you forgot to enable debugging on the kernel command line: It is possible to enable debugging manually when the kernel is up. Look at the -contents of: +contents of:: -/sys/kernel/slab// + /sys/kernel/slab// Look at the writable files. Writing 1 to them will enable the corresponding debug option. All options can be set on a slab that does @@ -86,98 +93,103 @@ Careful with tracing: It may spew out lots of information and never stop if used on the wrong slab. Slab merging ------------- +============ If no debug options are specified then SLUB may merge similar slabs together in order to reduce overhead and increase cache hotness of objects. -slabinfo -a displays which slabs were merged together. +``slabinfo -a`` displays which slabs were merged together. Slab validation ---------------- +=============== SLUB can validate all object if the kernel was booted with slub_debug. In -order to do so you must have the slabinfo tool. Then you can do +order to do so you must have the ``slabinfo`` tool. Then you can do +:: -slabinfo -v + slabinfo -v which will test all objects. Output will be generated to the syslog. 
This also works in a more limited way if boot was without slab debug. -In that case slabinfo -v simply tests all reachable objects. Usually +In that case ``slabinfo -v`` simply tests all reachable objects. Usually these are in the cpu slabs and the partial slabs. Full slabs are not tracked by SLUB in a non debug situation. Getting more performance ------------------------- +======================== To some degree SLUB's performance is limited by the need to take the list_lock once in a while to deal with partial slabs. That overhead is governed by the order of the allocation for each slab. The allocations can be influenced by kernel parameters: -slub_min_objects=x (default 4) -slub_min_order=x (default 0) -slub_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) +.. slub_min_objects=x (default 4) +.. slub_min_order=x (default 0) +.. slub_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) -slub_min_objects allows to specify how many objects must at least fit -into one slab in order for the allocation order to be acceptable. -In general slub will be able to perform this number of allocations -on a slab without consulting centralized resources (list_lock) where -contention may occur. +``slub_min_objects`` + allows to specify how many objects must at least fit into one + slab in order for the allocation order to be acceptable. In + general slub will be able to perform this number of + allocations on a slab without consulting centralized resources + (list_lock) where contention may occur. -slub_min_order specifies a minim order of slabs. A similar effect like -slub_min_objects. +``slub_min_order`` + specifies a minim order of slabs. A similar effect like + ``slub_min_objects``. -slub_max_order specified the order at which slub_min_objects should no -longer be checked. This is useful to avoid SLUB trying to generate -super large order pages to fit slub_min_objects of a slab cache with -large object sizes into one high order page. 
Setting command line -parameter debug_guardpage_minorder=N (N > 0), forces setting -slub_max_order to 0, what cause minimum possible order of slabs -allocation. +``slub_max_order`` + specified the order at which ``slub_min_objects`` should no + longer be checked. This is useful to avoid SLUB trying to + generate super large order pages to fit ``slub_min_objects`` + of a slab cache with large object sizes into one high order + page. Setting command line parameter + ``debug_guardpage_minorder=N`` (N > 0), forces setting + ``slub_max_order`` to 0, what cause minimum possible order of + slabs allocation. SLUB Debug output ------------------ +================= -Here is a sample of slub debug output: +Here is a sample of slub debug output:: -==================================================================== -BUG kmalloc-8: Redzone overwritten --------------------------------------------------------------------- + ==================================================================== + BUG kmalloc-8: Redzone overwritten + -------------------------------------------------------------------- -INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc -INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58 -INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58 -INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554 + INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc + INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58 + INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58 + INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554 -Bytes b4 0xc90f6d10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ - Object 0xc90f6d20: 31 30 31 39 2e 30 30 35 1019.005 - Redzone 0xc90f6d28: 00 cc cc cc . - Padding 0xc90f6d50: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ + Bytes b4 0xc90f6d10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ + Object 0xc90f6d20: 31 30 31 39 2e 30 30 35 1019.005 + Redzone 0xc90f6d28: 00 cc cc cc . 
+ Padding 0xc90f6d50: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ - [] dump_trace+0x63/0x1eb - [] show_trace_log_lvl+0x1a/0x2f - [] show_trace+0x12/0x14 - [] dump_stack+0x16/0x18 - [] object_err+0x143/0x14b - [] check_object+0x66/0x234 - [] __slab_free+0x239/0x384 - [] kfree+0xa6/0xc6 - [] get_modalias+0xb9/0xf5 - [] dmi_dev_uevent+0x27/0x3c - [] dev_uevent+0x1ad/0x1da - [] kobject_uevent_env+0x20a/0x45b - [] kobject_uevent+0xa/0xf - [] store_uevent+0x4f/0x58 - [] dev_attr_store+0x29/0x2f - [] sysfs_write_file+0x16e/0x19c - [] vfs_write+0xd1/0x15a - [] sys_write+0x3d/0x72 - [] sysenter_past_esp+0x5f/0x99 - [] 0xb7f7b410 - ======================= + [] dump_trace+0x63/0x1eb + [] show_trace_log_lvl+0x1a/0x2f + [] show_trace+0x12/0x14 + [] dump_stack+0x16/0x18 + [] object_err+0x143/0x14b + [] check_object+0x66/0x234 + [] __slab_free+0x239/0x384 + [] kfree+0xa6/0xc6 + [] get_modalias+0xb9/0xf5 + [] dmi_dev_uevent+0x27/0x3c + [] dev_uevent+0x1ad/0x1da + [] kobject_uevent_env+0x20a/0x45b + [] kobject_uevent+0xa/0xf + [] store_uevent+0x4f/0x58 + [] dev_attr_store+0x29/0x2f + [] sysfs_write_file+0x16e/0x19c + [] vfs_write+0xd1/0x15a + [] sys_write+0x3d/0x72 + [] sysenter_past_esp+0x5f/0x99 + [] 0xb7f7b410 + ======================= -FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc + FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc If SLUB encounters a corrupted object (full detection requires the kernel to be booted with slub_debug) then the following output will be dumped @@ -185,38 +197,38 @@ into the syslog: 1. Description of the problem encountered -This will be a message in the system log starting with + This will be a message in the system log starting with:: -=============================================== -BUG : ------------------------------------------------ + =============================================== + BUG : + ----------------------------------------------- -INFO: - -INFO: Slab
<address> <slab information> -INFO: Object <address> <object information> -INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by + INFO: <corruption start>-<corruption_end> <more info> + INFO: Slab <address> <slab information> + INFO: Object <address> <object information> + INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by cpu> pid=<pid of the process> -INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu> - pid=<pid of the process> + INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu> + pid=<pid of the process>
-(Object allocation / free information is only available if SLAB_STORE_USER is -set for the slab. slub_debug sets that option) + (Object allocation / free information is only available if SLAB_STORE_USER is + set for the slab. slub_debug sets that option) 2. The object contents if an object was involved. -Various types of lines can follow the BUG SLUB line: + Various types of lines can follow the BUG SLUB line:
-Bytes b4 <address> : <bytes> + Bytes b4 <address> : <bytes> Shows a few bytes before the object where the problem was detected. Can be useful if the corruption does not stop with the start of the object.
-Object <address> : <bytes> + Object <address> : <bytes> The bytes of the object. If the object is inactive then the bytes typically contain poison values. Any non-poison value shows a corruption by a write after free.
-Redzone <address> : <bytes> + Redzone <address> : <bytes> The Redzone following the object. The Redzone is used to detect writes after the object. All bytes should always have the same value. If there is any deviation then it is due to a write after @@ -225,7 +237,7 @@ Redzone <address> : <bytes> (Redzone information is only available if SLAB_RED_ZONE is set. slub_debug sets that option)
-Padding <address> : <bytes> + Padding <address> : <bytes> Unused data to fill up the space in order to get the next object properly aligned. In the debug case we make sure that there are at least 4 bytes of padding. This allows the detection of writes @@ -233,29 +245,29 @@ Padding <address>
: 3. A stackdump -The stackdump describes the location where the error was detected. The cause -of the corruption is may be more likely found by looking at the function that -allocated or freed the object. + The stackdump describes the location where the error was detected. The cause + of the corruption is may be more likely found by looking at the function that + allocated or freed the object. 4. Report on how the problem was dealt with in order to ensure the continued -operation of the system. + operation of the system. -These are messages in the system log beginning with + These are messages in the system log beginning with:: -FIX : + FIX : -In the above sample SLUB found that the Redzone of an active object has -been overwritten. Here a string of 8 characters was written into a slab that -has the length of 8 characters. However, a 8 character string needs a -terminating 0. That zero has overwritten the first byte of the Redzone field. -After reporting the details of the issue encountered the FIX SLUB message -tells us that SLUB has restored the Redzone to its proper value and then -system operations continue. + In the above sample SLUB found that the Redzone of an active object has + been overwritten. Here a string of 8 characters was written into a slab that + has the length of 8 characters. However, a 8 character string needs a + terminating 0. That zero has overwritten the first byte of the Redzone field. + After reporting the details of the issue encountered the FIX SLUB message + tells us that SLUB has restored the Redzone to its proper value and then + system operations continue. -Emergency operations: ---------------------- +Emergency operations +==================== -Minimal debugging (sanity checks alone) can be enabled by booting with +Minimal debugging (sanity checks alone) can be enabled by booting with:: slub_debug=F @@ -270,73 +282,80 @@ No guarantees. The kernel component still needs to be fixed. 
Performance may be optimized further by locating the slab that experiences corruption and enabling debugging only for that cache -I.e. +I.e.:: slub_debug=F,dentry If the corruption occurs by writing after the end of the object then it may be advisable to enable a Redzone to avoid corrupting the beginning -of other objects. +of other objects:: slub_debug=FZ,dentry Extended slabinfo mode and plotting ------------------------------------ +=================================== -The slabinfo tool has a special 'extended' ('-X') mode that includes: +The ``slabinfo`` tool has a special 'extended' ('-X') mode that includes: - Slabcache Totals - Slabs sorted by size (up to -N slabs, default 1) - Slabs sorted by loss (up to -N slabs, default 1) -Additionally, in this mode slabinfo does not dynamically scale sizes (G/M/K) -and reports everything in bytes (this functionality is also available to -other slabinfo modes via '-B' option) which makes reporting more precise and -accurate. Moreover, in some sense the `-X' mode also simplifies the analysis -of slabs' behaviour, because its output can be plotted using the -slabinfo-gnuplot.sh script. So it pushes the analysis from looking through -the numbers (tons of numbers) to something easier -- visual analysis. +Additionally, in this mode ``slabinfo`` does not dynamically scale +sizes (G/M/K) and reports everything in bytes (this functionality is +also available to other slabinfo modes via '-B' option) which makes +reporting more precise and accurate. Moreover, in some sense the `-X' +mode also simplifies the analysis of slabs' behaviour, because its +output can be plotted using the ``slabinfo-gnuplot.sh`` script. So it +pushes the analysis from looking through the numbers (tons of numbers) +to something easier -- visual analysis. 
To generate plots: -a) collect slabinfo extended records, for example: - while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done +a) collect slabinfo extended records, for example:: -b) pass stats file(-s) to slabinfo-gnuplot.sh script: - slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] + while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done -The slabinfo-gnuplot.sh script will pre-processes the collected records -and generates 3 png files (and 3 pre-processing cache files) per STATS -file: - - Slabcache Totals: FOO_STATS-totals.png - - Slabs sorted by size: FOO_STATS-slabs-by-size.png - - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png +b) pass stats file(-s) to ``slabinfo-gnuplot.sh`` script:: -Another use case, when slabinfo-gnuplot can be useful, is when you need -to compare slabs' behaviour "prior to" and "after" some code modification. -To help you out there, slabinfo-gnuplot.sh script can 'merge' the -`Slabcache Totals` sections from different measurements. To visually -compare N plots: + slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] -a) Collect as many STATS1, STATS2, .. STATSN files as you need - while [ 1 ]; do slabinfo -X >> STATS; sleep 1; done + The ``slabinfo-gnuplot.sh`` script will pre-processes the collected records + and generates 3 png files (and 3 pre-processing cache files) per STATS + file: + - Slabcache Totals: FOO_STATS-totals.png + - Slabs sorted by size: FOO_STATS-slabs-by-size.png + - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png -b) Pre-process those STATS files - slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN +Another use case, when ``slabinfo-gnuplot.sh`` can be useful, is when you +need to compare slabs' behaviour "prior to" and "after" some code +modification. To help you out there, ``slabinfo-gnuplot.sh`` script +can 'merge' the `Slabcache Totals` sections from different +measurements. 
To visually compare N plots: -c) Execute slabinfo-gnuplot.sh in '-t' mode, passing all of the -generated pre-processed *-totals - slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals +a) Collect as many STATS1, STATS2, .. STATSN files as you need:: -This will produce a single plot (png file). + while [ 1 ]; do slabinfo -X >> STATS; sleep 1; done -Plots, expectedly, can be large so some fluctuations or small spikes -can go unnoticed. To deal with that, `slabinfo-gnuplot.sh' has two -options to 'zoom-in'/'zoom-out': - a) -s %d,%d overwrites the default image width and heigh - b) -r %d,%d specifies a range of samples to use (for example, - in `slabinfo -X >> FOO_STATS; sleep 1;' case, using - a "-r 40,60" range will plot only samples collected - between 40th and 60th seconds). +b) Pre-process those STATS files:: + + slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN + +c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the + generated pre-processed \*-totals:: + + slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals + + This will produce a single plot (png file). + + Plots, expectedly, can be large so some fluctuations or small spikes + can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two + options to 'zoom-in'/'zoom-out': + + a) ``-s %d,%d`` -- overwrites the default image width and heigh + b) ``-r %d,%d`` -- specifies a range of samples to use (for example, + in ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a ``-r + 40,60`` range will plot only samples collected between 40th and + 60th seconds). 
Christoph Lameter, May 30, 2007 Sergey Senozhatsky, October 23, 2015 From 0015190af2d8578bc78cc0b52db24a6c9dddf08a Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:38 +0200 Subject: [PATCH 025/103] docs/vm: soft-dirty.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/soft-dirty.txt | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.txt index 55684d11a1e8..cb0cfd6672fa 100644 --- a/Documentation/vm/soft-dirty.txt +++ b/Documentation/vm/soft-dirty.txt @@ -1,34 +1,38 @@ - SOFT-DIRTY PTEs +.. _soft_dirty: - The soft-dirty is a bit on a PTE which helps to track which pages a task +=============== +Soft-Dirty PTEs +=============== + +The soft-dirty is a bit on a PTE which helps to track which pages a task writes to. In order to do this tracking one should 1. Clear soft-dirty bits from the task's PTEs. - This is done by writing "4" into the /proc/PID/clear_refs file of the + This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the task in question. 2. Wait some time. 3. Read soft-dirty bits from the PTEs. - This is done by reading from the /proc/PID/pagemap. The bit 55 of the + This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the 64-bit qword is the soft-dirty one. If set, the respective PTE was written to since step 1. - Internally, to do this tracking, the writable bit is cleared from PTEs +Internally, to do this tracking, the writable bit is cleared from PTEs when the soft-dirty bit is cleared. So, after this, when the task tries to modify a page at some virtual address the #PF occurs and the kernel sets the soft-dirty bit on the respective PTE. 
- Note, that although all the task's address space is marked as r/o after the +Note, that although all the task's address space is marked as r/o after the soft-dirty bits clear, the #PF-s that occur after that are processed fast. This is so, since the pages are still mapped to physical memory, and thus all the kernel does is finds this fact out and puts both writable and soft-dirty bits on the PTE. - While in most cases tracking memory changes by #PF-s is more than enough +While in most cases tracking memory changes by #PF-s is more than enough there is still a scenario when we can lose soft dirty bits -- a task unmaps a previously mapped memory region and then maps a new one at exactly the same place. When unmap is called, the kernel internally clears PTE values @@ -36,7 +40,7 @@ including soft dirty bits. To notify user space application about such memory region renewal the kernel always marks new memory regions (and expanded regions) as soft dirty. - This feature is actively used by the checkpoint-restore project. You +This feature is actively used by the checkpoint-restore project. You can find more details about it on http://criu.org From d18edf52f42b6e7453593738a86c288782c79fe3 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:39 +0200 Subject: [PATCH 026/103] docs/vm: split_page_table_lock: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/split_page_table_lock | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock index 62842a857dab..889b00be469f 100644 --- a/Documentation/vm/split_page_table_lock +++ b/Documentation/vm/split_page_table_lock @@ -1,3 +1,6 @@ +.. _split_page_table_lock: + +===================== Split page table lock ===================== @@ -11,6 +14,7 @@ access to the table. At the moment we use split lock for PTE and PMD tables. 
Access to higher level tables protected by mm->page_table_lock. There are helpers to lock/unlock a table and other accessor functions: + - pte_offset_map_lock() maps pte and takes PTE table lock, returns pointer to the taken lock; @@ -34,12 +38,13 @@ Split page table lock for PMD tables is enabled, if it's enabled for PTE tables and the architecture supports it (see below). Hugetlb and split page table lock ---------------------------------- +================================= Hugetlb can support several page sizes. We use split lock only for PMD level, but not for PUD. Hugetlb-specific helpers: + - huge_pte_lock() takes pmd split lock for PMD_SIZE page, mm->page_table_lock otherwise; @@ -47,7 +52,7 @@ Hugetlb-specific helpers: returns pointer to table lock; Support of split page table lock by an architecture ---------------------------------------------------- +=================================================== There's no need in special enabling of PTE split page table lock: everything required is done by pgtable_page_ctor() and pgtable_page_dtor(), @@ -73,7 +78,7 @@ NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must be handled properly. page->ptl ---------- +========= page->ptl is used to access split page table lock, where 'page' is struct page of page containing the table. It shares storage with page->private @@ -81,6 +86,7 @@ page of page containing the table. It shares storage with page->private To avoid increasing size of struct page and have best performance, we use a trick: + - if spinlock_t fits into long, we use page->ptr as spinlock, so we can avoid indirect access and save a cache line. 
- if size of spinlock_t is bigger then size of long, we use page->ptl as From 543199823345a3d8532d41f203477742cb2b06d8 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:40 +0200 Subject: [PATCH 027/103] docs/vm: swap_numa.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/swap_numa.txt | 53 ++++++++++++++++++++-------------- 1 file changed, 32 insertions(+), 21 deletions(-) diff --git a/Documentation/vm/swap_numa.txt b/Documentation/vm/swap_numa.txt index d5960c9124f5..e0466f2db8fa 100644 --- a/Documentation/vm/swap_numa.txt +++ b/Documentation/vm/swap_numa.txt @@ -1,5 +1,8 @@ +.. _swap_numa: + +=========================================== Automatically bind swap device to numa node -------------------------------------------- +=========================================== If the system has more than one swap device and swap device has the node information, we can make use of this information to decide which swap @@ -7,15 +10,16 @@ device to use in get_swap_pages() to get better performance. How to use this feature ------------------------ +======================= Swap device has priority and that decides the order of it to be used. To make use of automatically binding, there is no need to manipulate priority settings for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and swapB, with swapA attached to node 0 and swapB attached to node 1, are going -to be swapped on. Simply swapping them on by doing: -# swapon /dev/swapA -# swapon /dev/swapB +to be swapped on. Simply swapping them on by doing:: + + # swapon /dev/swapA + # swapon /dev/swapB Then node 0 will use the two swap devices in the order of swapA then swapB and node 1 will use the two swap devices in the order of swapB then swapA. Note @@ -24,32 +28,39 @@ that the order of them being swapped on doesn't matter. A more complex example on a 4 node machine. 
Assume 6 swap devices are going to be swapped on: swapA and swapB are attached to node 0, swapC is attached to node 1, swapD and swapE are attached to node 2 and swapF is attached to node3. -The way to swap them on is the same as above: -# swapon /dev/swapA -# swapon /dev/swapB -# swapon /dev/swapC -# swapon /dev/swapD -# swapon /dev/swapE -# swapon /dev/swapF +The way to swap them on is the same as above:: + + # swapon /dev/swapA + # swapon /dev/swapB + # swapon /dev/swapC + # swapon /dev/swapD + # swapon /dev/swapE + # swapon /dev/swapF + +Then node 0 will use them in the order of:: + + swapA/swapB -> swapC -> swapD -> swapE -> swapF -Then node 0 will use them in the order of: -swapA/swapB -> swapC -> swapD -> swapE -> swapF swapA and swapB will be used in a round robin mode before any other swap device. -node 1 will use them in the order of: -swapC -> swapA -> swapB -> swapD -> swapE -> swapF +node 1 will use them in the order of:: + + swapC -> swapA -> swapB -> swapD -> swapE -> swapF + +node 2 will use them in the order of:: + + swapD/swapE -> swapA -> swapB -> swapC -> swapF -node 2 will use them in the order of: -swapD/swapE -> swapA -> swapB -> swapC -> swapF Similaly, swapD and swapE will be used in a round robin mode before any other swap devices. 
-node 3 will use them in the order of: -swapF -> swapA -> swapB -> swapC -> swapD -> swapE +node 3 will use them in the order of:: + + swapF -> swapA -> swapB -> swapC -> swapD -> swapE Implementation details ----------------------- +====================== The current code uses a priority based list, swap_avail_list, to decide which swap device to use and if multiple swap devices share the same From 44f380fe901c8390df4f7576a3176efe65e2653c Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:41 +0200 Subject: [PATCH 028/103] docs/vm: transhuge.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/transhuge.txt | 276 +++++++++++++++++++-------------- 1 file changed, 161 insertions(+), 115 deletions(-) diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt index 4dde03b44ad1..569d182cc973 100644 --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt @@ -1,6 +1,11 @@ -= Transparent Hugepage Support = +.. _transhuge: -== Objective == +============================ +Transparent Hugepage Support +============================ + +Objective +========= Performance critical computing applications dealing with large memory working sets are already running on top of libhugetlbfs and in turn @@ -33,7 +38,8 @@ are using hugepages but a significant speedup already happens if only one of the two is using hugepages just because of the fact the TLB miss is going to run faster. -== Design == +Design +====== - "graceful fallback": mm components which don't have transparent hugepage knowledge fall back to breaking huge pmd mapping into table of ptes and, @@ -88,16 +94,17 @@ Applications that gets a lot of benefit from hugepages and that don't risk to lose memory by using hugepages, should use madvise(MADV_HUGEPAGE) on their critical mmapped regions. 
-== sysfs == +sysfs +===== Transparent Hugepage Support for anonymous memory can be entirely disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to avoid the risk of consuming more memory resources) or enabled -system wide. This can be achieved with one of: +system wide. This can be achieved with one of:: -echo always >/sys/kernel/mm/transparent_hugepage/enabled -echo madvise >/sys/kernel/mm/transparent_hugepage/enabled -echo never >/sys/kernel/mm/transparent_hugepage/enabled + echo always >/sys/kernel/mm/transparent_hugepage/enabled + echo madvise >/sys/kernel/mm/transparent_hugepage/enabled + echo never >/sys/kernel/mm/transparent_hugepage/enabled It's also possible to limit defrag efforts in the VM to generate anonymous hugepages in case they're not immediately free to madvise @@ -108,44 +115,53 @@ use hugepages later instead of regular pages. This isn't always guaranteed, but it may be more likely in case the allocation is for a MADV_HUGEPAGE region. -echo always >/sys/kernel/mm/transparent_hugepage/defrag -echo defer >/sys/kernel/mm/transparent_hugepage/defrag -echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag -echo madvise >/sys/kernel/mm/transparent_hugepage/defrag -echo never >/sys/kernel/mm/transparent_hugepage/defrag +:: -"always" means that an application requesting THP will stall on allocation -failure and directly reclaim pages and compact memory in an effort to -allocate a THP immediately. This may be desirable for virtual machines -that benefit heavily from THP use and are willing to delay the VM start -to utilise them. 
+ echo always >/sys/kernel/mm/transparent_hugepage/defrag + echo defer >/sys/kernel/mm/transparent_hugepage/defrag + echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo never >/sys/kernel/mm/transparent_hugepage/defrag -"defer" means that an application will wake kswapd in the background -to reclaim pages and wake kcompactd to compact memory so that THP is -available in the near future. It's the responsibility of khugepaged -to then install the THP pages later. +always + means that an application requesting THP will stall on + allocation failure and directly reclaim pages and compact + memory in an effort to allocate a THP immediately. This may be + desirable for virtual machines that benefit heavily from THP + use and are willing to delay the VM start to utilise them. -"defer+madvise" will enter direct reclaim and compaction like "always", but -only for regions that have used madvise(MADV_HUGEPAGE); all other regions -will wake kswapd in the background to reclaim pages and wake kcompactd to -compact memory so that THP is available in the near future. +defer + means that an application will wake kswapd in the background + to reclaim pages and wake kcompactd to compact memory so that + THP is available in the near future. It's the responsibility + of khugepaged to then install the THP pages later. -"madvise" will enter direct reclaim like "always" but only for regions -that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. +defer+madvise + will enter direct reclaim and compaction like ``always``, but + only for regions that have used madvise(MADV_HUGEPAGE); all + other regions will wake kswapd in the background to reclaim + pages and wake kcompactd to compact memory so that THP is + available in the near future. -"never" should be self-explanatory. +madvise + will enter direct reclaim like ``always`` but only for regions + that are have used madvise(MADV_HUGEPAGE). 
This is the default + behaviour. + +never + should be self-explanatory. By default kernel tries to use huge zero page on read page fault to anonymous mapping. It's possible to disable huge zero page by writing 0 -or enable it back by writing 1: +or enable it back by writing 1:: -echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page -echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page + echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page + echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page Some userspace (such as a test program, or an optimized memory allocation -library) may want to know the size (in bytes) of a transparent hugepage: +library) may want to know the size (in bytes) of a transparent hugepage:: -cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size + cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size khugepaged will be automatically started when transparent_hugepage/enabled is set to "always" or "madvise, and it'll @@ -155,84 +171,86 @@ khugepaged runs usually at low frequency so while one may not want to invoke defrag algorithms synchronously during the page faults, it should be worth invoking defrag at least in khugepaged. 
However it's also possible to disable defrag in khugepaged by writing 0 or enable -defrag in khugepaged by writing 1: +defrag in khugepaged by writing 1:: -echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag -echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag + echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag + echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag You can also control how many pages khugepaged should scan at each -pass: +pass:: -/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan + /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan and how many milliseconds to wait in khugepaged between each pass (you -can set this to 0 to run khugepaged at 100% utilization of one core): +can set this to 0 to run khugepaged at 100% utilization of one core):: -/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs + /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs and how many milliseconds to wait in khugepaged if there's an hugepage -allocation failure to throttle the next allocation attempt. +allocation failure to throttle the next allocation attempt:: -/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs + /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs -The khugepaged progress can be seen in the number of pages collapsed: +The khugepaged progress can be seen in the number of pages collapsed:: -/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed + /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed -for each pass: +for each pass:: -/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans + /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans -max_ptes_none specifies how many extra small pages (that are +``max_ptes_none`` specifies how many extra small pages (that are not already mapped) can be allocated when collapsing a group -of small pages into one large page. 
+of small pages into one large page:: -/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none + /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none A higher value leads to use additional memory for programs. A lower value leads to gain less thp performance. Value of max_ptes_none can waste cpu time very little, you can ignore it. -max_ptes_swap specifies how many pages can be brought in from -swap when collapsing a group of pages into a transparent huge page. +``max_ptes_swap`` specifies how many pages can be brought in from +swap when collapsing a group of pages into a transparent huge page:: -/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap + /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap A higher value can cause excessive swap IO and waste memory. A lower value can prevent THPs from being collapsed, resulting fewer pages being collapsed into THPs, and lower memory access performance. -== Boot parameter == +Boot parameter +============== You can change the sysfs boot time defaults of Transparent Hugepage -Support by passing the parameter "transparent_hugepage=always" or -"transparent_hugepage=madvise" or "transparent_hugepage=never" -(without "") to the kernel command line. +Support by passing the parameter ``transparent_hugepage=always`` or +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` +to the kernel command line. -== Hugepages in tmpfs/shmem == +Hugepages in tmpfs/shmem +======================== You can control hugepage allocation policy in tmpfs with mount option -"huge=". It can have following values: +``huge=``. It can have following values: - - "always": +always Attempt to allocate huge pages every time we need a new page; - - "never": +never Do not allocate huge pages; - - "within_size": +within_size Only allocate huge page if it will be fully within i_size. 
Also respect fadvise()/madvise() hints; - - "advise: +advise Only allocate huge pages if requested with fadvise()/madvise(); -The default policy is "never". +The default policy is ``never``. -"mount -o remount,huge= /mountpoint" works fine after mount: remounting -huge=never will not attempt to break up huge pages at all, just stop more +``mount -o remount,huge= /mountpoint`` works fine after mount: remounting +``huge=never`` will not attempt to break up huge pages at all, just stop more from being allocated. There's also sysfs knob to control hugepage allocation policy for internal @@ -243,110 +261,130 @@ MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. In addition to policies listed above, shmem_enabled allows two further values: - - "deny": +deny For use in emergencies, to force the huge option off from all mounts; - - "force": +force Force the huge option on for all - very useful for testing; -== Need of application restart == +Need of application restart +=========================== The transparent_hugepage/enabled values and tmpfs mount option only affect future behavior. So to make them effective you need to restart any application that could have been using hugepages. This also applies to the regions registered in khugepaged. -== Monitoring usage == +Monitoring usage +================ The number of anonymous transparent huge pages currently used by the -system is available by reading the AnonHugePages field in /proc/meminfo. +system is available by reading the AnonHugePages field in ``/proc/meminfo``. To identify what applications are using anonymous transparent huge pages, -it is necessary to read /proc/PID/smaps and count the AnonHugePages fields +it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields for each mapping. The number of file transparent huge pages mapped to userspace is available -by reading ShmemPmdMapped and ShmemHugePages fields in /proc/meminfo. 
+by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. To identify what applications are mapping file transparent huge pages, it -is necessary to read /proc/PID/smaps and count the FileHugeMapped fields +is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields for each mapping. Note that reading the smaps file is expensive and reading it frequently will incur overhead. -There are a number of counters in /proc/vmstat that may be used to +There are a number of counters in ``/proc/vmstat`` that may be used to monitor how successfully the system is providing huge pages for use. -thp_fault_alloc is incremented every time a huge page is successfully +thp_fault_alloc + is incremented every time a huge page is successfully allocated to handle a page fault. This applies to both the first time a page is faulted and for COW faults. -thp_collapse_alloc is incremented by khugepaged when it has found +thp_collapse_alloc + is incremented by khugepaged when it has found a range of pages to collapse into one huge page and has successfully allocated a new huge page to store the data. -thp_fault_fallback is incremented if a page fault fails to allocate +thp_fault_fallback + is incremented if a page fault fails to allocate a huge page and instead falls back to using small pages. -thp_collapse_alloc_failed is incremented if khugepaged found a range +thp_collapse_alloc_failed + is incremented if khugepaged found a range of pages that should be collapsed into one huge page but failed the allocation. -thp_file_alloc is incremented every time a file huge page is successfully +thp_file_alloc + is incremented every time a file huge page is successfully allocated. -thp_file_mapped is incremented every time a file huge page is mapped into +thp_file_mapped + is incremented every time a file huge page is mapped into user address space. 
-thp_split_page is incremented every time a huge page is split into base +thp_split_page + is incremented every time a huge page is split into base pages. This can happen for a variety of reasons but a common reason is that a huge page is old and is being reclaimed. This action implies splitting all PMD the page mapped with. -thp_split_page_failed is incremented if kernel fails to split huge +thp_split_page_failed + is incremented if kernel fails to split huge page. This can happen if the page was pinned by somebody. -thp_deferred_split_page is incremented when a huge page is put onto split +thp_deferred_split_page + is incremented when a huge page is put onto split queue. This happens when a huge page is partially unmapped and splitting it would free up some memory. Pages on split queue are going to be split under memory pressure. -thp_split_pmd is incremented every time a PMD split into table of PTEs. +thp_split_pmd + is incremented every time a PMD split into table of PTEs. This can happen, for instance, when application calls mprotect() or munmap() on part of huge page. It doesn't split huge page, only page table entry. -thp_zero_page_alloc is incremented every time a huge zero page is +thp_zero_page_alloc + is incremented every time a huge zero page is successfully allocated. It includes allocations which where dropped due race with other allocation. Note, it doesn't count every map of the huge zero page, only its allocation. -thp_zero_page_alloc_failed is incremented if kernel fails to allocate +thp_zero_page_alloc_failed + is incremented if kernel fails to allocate huge zero page and falls back to using small pages. As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a -huge page for use. There are some counters in /proc/vmstat to help +huge page for use. There are some counters in ``/proc/vmstat`` to help monitor this overhead. 
-compact_stall is incremented every time a process stalls to run +compact_stall + is incremented every time a process stalls to run memory compaction so that a huge page is free for use. -compact_success is incremented if the system compacted memory and +compact_success + is incremented if the system compacted memory and freed a huge page for use. -compact_fail is incremented if the system tries to compact memory +compact_fail + is incremented if the system tries to compact memory but failed. -compact_pages_moved is incremented each time a page is moved. If +compact_pages_moved + is incremented each time a page is moved. If this value is increasing rapidly, it implies that the system is copying a lot of data to satisfy the huge page allocation. It is possible that the cost of copying exceeds any savings from reduced TLB misses. -compact_pagemigrate_failed is incremented when the underlying mechanism +compact_pagemigrate_failed + is incremented when the underlying mechanism for moving a page failed. -compact_blocks_moved is incremented each time memory compaction examines +compact_blocks_moved + is incremented each time memory compaction examines a huge page aligned range of pages. It is possible to establish how long the stalls were using the function @@ -354,7 +392,8 @@ tracer to record how long was spent in __alloc_pages_nodemask and using the mm_page_alloc tracepoint to identify which allocations were for huge pages. -== get_user_pages and follow_page == +get_user_pages and follow_page +============================== get_user_pages and follow_page if run on a hugepage, will return the head or tail pages as usual (exactly as they would do on @@ -367,10 +406,11 @@ for the head page and not the tail page), it should be updated to jump to check head page instead. Taking reference on any head/tail page would prevent page from being split by anyone. 
-NOTE: these aren't new constraints to the GUP API, and they match the -same constrains that applies to hugetlbfs too, so any driver capable -of handling GUP on hugetlbfs will also work fine on transparent -hugepage backed mappings. +.. note:: + these aren't new constraints to the GUP API, and they match the + same constrains that applies to hugetlbfs too, so any driver capable + of handling GUP on hugetlbfs will also work fine on transparent + hugepage backed mappings. In case you can't handle compound pages if they're returned by follow_page, the FOLL_SPLIT bit can be specified as parameter to @@ -383,13 +423,15 @@ hugepages being returned (as it's not only checking the pfn of the page and pinning it during the copy but it pretends to migrate the memory in regular page sizes and with regular pte/pmd mappings). -== Optimizing the applications == +Optimizing the applications +=========================== To be guaranteed that the kernel will map a 2M page immediately in any memory region, the mmap region has to be hugepage naturally aligned. posix_memalign() can provide that guarantee. -== Hugetlbfs == +Hugetlbfs +========= You can use hugetlbfs on a kernel that has transparent hugepage support enabled just fine as always. No difference can be noted in @@ -397,7 +439,8 @@ hugetlbfs other than there will be less overall fragmentation. All usual features belonging to hugetlbfs are preserved and unaffected. libhugetlbfs will also work fine as usual. -== Graceful fallback == +Graceful fallback +================= Code walking pagetables but unaware about huge pmds can simply call split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by @@ -415,20 +458,21 @@ it tries to swapout the hugepage for example. split_huge_page() can fail if the page is pinned and you must handle this correctly. 
Example to make mremap.c transparent hugepage aware with a one liner -change: +change:: -diff --git a/mm/mremap.c b/mm/mremap.c ---- a/mm/mremap.c -+++ b/mm/mremap.c -@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru - return NULL; + diff --git a/mm/mremap.c b/mm/mremap.c + --- a/mm/mremap.c + +++ b/mm/mremap.c + @@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru + return NULL; - pmd = pmd_offset(pud, addr); -+ split_huge_pmd(vma, pmd, addr); - if (pmd_none_or_clear_bad(pmd)) - return NULL; + pmd = pmd_offset(pud, addr); + + split_huge_pmd(vma, pmd, addr); + if (pmd_none_or_clear_bad(pmd)) + return NULL; -== Locking in hugepage aware code == +Locking in hugepage aware code +============================== We want as much code as possible hugepage aware, as calling split_huge_page() or split_huge_pmd() has a cost. @@ -448,7 +492,8 @@ should just drop the page table lock and fallback to the old code as before. Otherwise you can proceed to process the huge pmd and the hugepage natively. Once finished you can drop the page table lock. -== Refcounts and transparent huge pages == +Refcounts and transparent huge pages +==================================== Refcounting on THP is mostly consistent with refcounting on other compound pages: @@ -510,7 +555,8 @@ clear where reference should go after split: it will stay on head page. Note that split_huge_pmd() doesn't have any limitation on refcounting: pmd can be split at any point and never fails. -== Partial unmap and deferred_split_huge_page() == +Partial unmap and deferred_split_huge_page() +============================================ Unmapping part of THP (with munmap() or other way) is not going to free memory immediately. 
Instead, we detect that a subpage of THP is not in use From a5e4da91e024677cc72d4fd8ea2bbc82217d2443 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:42 +0200 Subject: [PATCH 029/103] docs/vm: unevictable-lru.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/unevictable-lru.txt | 113 +++++++++++---------------- 1 file changed, 47 insertions(+), 66 deletions(-) diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index e14718572476..fdd84cb8d511 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -1,37 +1,13 @@ - ============================== - UNEVICTABLE LRU INFRASTRUCTURE - ============================== +.. _unevictable_lru: -======== -CONTENTS -======== +============================== +Unevictable LRU Infrastructure +============================== - (*) The Unevictable LRU - - - The unevictable page list. - - Memory control group interaction. - - Marking address spaces unevictable. - - Detecting Unevictable Pages. - - vmscan's handling of unevictable pages. - - (*) mlock()'d pages. - - - History. - - Basic management. - - mlock()/mlockall() system call handling. - - Filtering special vmas. - - munlock()/munlockall() system call handling. - - Migrating mlocked pages. - - Compacting mlocked pages. - - mmap(MAP_LOCKED) system call handling. - - munmap()/exit()/exec() system call handling. - - try_to_unmap(). - - try_to_munlock() reverse map scan. - - Page reclaim in shrink_*_list(). +.. contents:: :local: -============ -INTRODUCTION +Introduction ============ This document describes the Linux memory manager's "Unevictable LRU" @@ -46,8 +22,8 @@ details - the "what does it do?" - by reading the code. One hopes that the descriptions below add value by provide the answer to "why does it do that?". 
-=================== -THE UNEVICTABLE LRU + +The Unevictable LRU =================== The Unevictable LRU facility adds an additional LRU list to track unevictable @@ -66,17 +42,17 @@ completely unresponsive. The unevictable list addresses the following classes of unevictable pages: - (*) Those owned by ramfs. + * Those owned by ramfs. - (*) Those mapped into SHM_LOCK'd shared memory regions. + * Those mapped into SHM_LOCK'd shared memory regions. - (*) Those mapped into VM_LOCKED [mlock()ed] VMAs. + * Those mapped into VM_LOCKED [mlock()ed] VMAs. The infrastructure may also be able to handle other conditions that make pages unevictable, either by definition or by circumstance, in the future. -THE UNEVICTABLE PAGE LIST +The Unevictable Page List ------------------------- The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list @@ -118,7 +94,7 @@ the unevictable list when one task has the page isolated from the LRU and other tasks are changing the "evictability" state of the page. -MEMORY CONTROL GROUP INTERACTION +Memory Control Group Interaction -------------------------------- The unevictable LRU facility interacts with the memory control group [aka @@ -144,7 +120,9 @@ effects: the control group to thrash or to OOM-kill tasks. -MARKING ADDRESS SPACES UNEVICTABLE +.. _mark_addr_space_unevict: + +Marking Address Spaces Unevictable ---------------------------------- For facilities such as ramfs none of the pages attached to the address space @@ -152,15 +130,15 @@ may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE address space flag is provided, and this can be manipulated by a filesystem using a number of wrapper functions: - (*) void mapping_set_unevictable(struct address_space *mapping); + * ``void mapping_set_unevictable(struct address_space *mapping);`` Mark the address space as being completely unevictable. 
- (*) void mapping_clear_unevictable(struct address_space *mapping); + * ``void mapping_clear_unevictable(struct address_space *mapping);`` Mark the address space as being evictable. - (*) int mapping_unevictable(struct address_space *mapping); + * ``int mapping_unevictable(struct address_space *mapping);`` Query the address space, and return true if it is completely unevictable. @@ -177,12 +155,13 @@ These are currently used in two places in the kernel: ensure they're in memory. -DETECTING UNEVICTABLE PAGES +Detecting Unevictable Pages --------------------------- The function page_evictable() in vmscan.c determines whether a page is -evictable or not using the query function outlined above [see section "Marking -address spaces unevictable"] to check the AS_UNEVICTABLE flag. +evictable or not using the query function outlined above [see section +:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`] +to check the AS_UNEVICTABLE flag. For address spaces that are so marked after being populated (as SHM regions might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate @@ -202,7 +181,7 @@ flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED. -VMSCAN'S HANDLING OF UNEVICTABLE PAGES +Vmscan's Handling of Unevictable Pages -------------------------------------- If unevictable pages are culled in the fault path, or moved to the unevictable @@ -233,8 +212,7 @@ extra evictabilty checks should not occur in the majority of calls to putback_lru_page(). -============= -MLOCKED PAGES +MLOCKED Pages ============= The unevictable page list is also useful for mlock(), in addition to ramfs and @@ -242,7 +220,7 @@ SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in NOMMU situations, all mappings are effectively mlocked.
-HISTORY +History ------- The "Unevictable mlocked Pages" infrastructure is based on work originally @@ -263,7 +241,7 @@ replaced by walking the reverse map to determine whether any VM_LOCKED VMAs mapped the page. More on this below. -BASIC MANAGEMENT +Basic Management ---------------- mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable @@ -304,10 +282,10 @@ mlocked pages become unlocked and rescued from the unevictable list when: (4) before a page is COW'd in a VM_LOCKED VMA. -mlock()/mlockall() SYSTEM CALL HANDLING +mlock()/mlockall() System Call Handling --------------------------------------- -Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() +Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup() for each VMA in the range specified by the call. In the case of mlockall(), this is the entire active address space of the task. Note that mlock_fixup() is used for both mlocking and munlocking a range of memory. A call to mlock() @@ -351,7 +329,7 @@ mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle it later if and when it attempts to reclaim the page. -FILTERING SPECIAL VMAS +Filtering Special VMAs ---------------------- mlock_fixup() filters several classes of "special" VMAs: @@ -379,8 +357,9 @@ VM_LOCKED flag. Therefore, we won't have to deal with them later during munlock(), munmap() or task exit. Neither does mlock_fixup() account these VMAs against the task's "locked_vm". +.. _munlock_munlockall_handling: -munlock()/munlockall() SYSTEM CALL HANDLING +munlock()/munlockall() System Call Handling ------------------------------------------- The munlock() and munlockall() system calls are handled by the same functions - @@ -426,7 +405,7 @@ This is fine, because we'll catch it later if and if vmscan tries to reclaim the page. This should be relatively rare. 
-MIGRATING MLOCKED PAGES +Migrating MLOCKED Pages ----------------------- A page that is being migrated has been isolated from the LRU lists and is held @@ -451,7 +430,7 @@ list because of a race between munlock and migration, page migration uses the putback_lru_page() function to add migrated pages back to the LRU. -COMPACTING MLOCKED PAGES +Compacting MLOCKED Pages ------------------------ The unevictable LRU can be scanned for compactable regions and the default @@ -461,7 +440,7 @@ unevictable LRU is enabled, the work of compaction is mostly handled by the page migration code and the same work flow as described in MIGRATING MLOCKED PAGES will apply. -MLOCKING TRANSPARENT HUGE PAGES +MLOCKING Transparent Huge Pages ------------------------------- A transparent huge page is represented by a single entry on an LRU list. @@ -483,7 +462,7 @@ to unevictable LRU and the rest can be reclaimed. See also comment in follow_trans_huge_pmd(). -mmap(MAP_LOCKED) SYSTEM CALL HANDLING +mmap(MAP_LOCKED) System Call Handling ------------------------------------- In addition the mlock()/mlockall() system calls, an application can request @@ -514,7 +493,7 @@ memory range accounted as locked_vm, as the protections could be changed later and pages allocated into that region. -munmap()/exit()/exec() SYSTEM CALL HANDLING +munmap()/exit()/exec() System Call Handling ------------------------------------------- When unmapping an mlocked region of memory, whether by an explicit call to @@ -568,16 +547,18 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim, holepunching, and truncation of file pages and their anonymous COWed pages. -try_to_munlock() REVERSE MAP SCAN +try_to_munlock() Reverse Map Scan --------------------------------- - [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the - page_referenced() reverse map walker. +.. warning:: + [!] 
TODO/FIXME: a better name might be page_mlocked() - analogous to the + page_referenced() reverse map walker. -When munlock_vma_page() [see section "munlock()/munlockall() System Call -Handling" above] tries to munlock a page, it needs to determine whether or not -the page is mapped by any VM_LOCKED VMA without actually attempting to unmap -all PTEs from the page. For this purpose, the unevictable/mlock infrastructure +When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call +Handling <munlock_munlockall_handling>` above] tries to munlock a +page, it needs to determine whether or not the page is mapped by any +VM_LOCKED VMA without actually attempting to unmap all PTEs from the +page. For this purpose, the unevictable/mlock infrastructure introduced a variant of try_to_unmap() called try_to_munlock(). try_to_munlock() calls the same functions as try_to_unmap() for anonymous and @@ -595,7 +576,7 @@ large region or tearing down a large address space that has been mlocked via mlockall(), overall this is a fairly rare event. -PAGE RECLAIM IN shrink_*_list() +Page Reclaim in shrink_*_list() ------------------------------- shrink_active_list() culls any obviously unevictable pages - i.e. From f9451df2212bfffdde5b99d93292a49a1563a00f Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:43 +0200 Subject: [PATCH 030/103] docs/vm: userfaultfd.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/userfaultfd.txt | 60 +++++++++++++++++++------------- 1 file changed, 36 insertions(+), 24 deletions(-) diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt index bb2f945f87ab..5048cf661a8a 100644 --- a/Documentation/vm/userfaultfd.txt +++ b/Documentation/vm/userfaultfd.txt @@ -1,6 +1,11 @@ -= Userfaultfd = +.. 
_userfaultfd: -== Objective == +=========== +Userfaultfd +=========== + +Objective +========= Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various @@ -9,7 +14,8 @@ memory page faults, something otherwise only the kernel code could do. For example userfaults allows a proper and more optimal implementation of the PROT_NONE+SIGSEGV trick. -== Design == +Design +====== Userfaults are delivered and resolved through the userfaultfd syscall. @@ -41,7 +47,8 @@ different processes without them being aware about what is going on themselves on the same region the manager is already tracking, which is a corner case that would currently return -EBUSY). -== API == +API +=== When first opened the userfaultfd must be enabled invoking the UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or @@ -101,7 +108,8 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an half copied page since it'll keep userfaulting until the copy has finished. -== QEMU/KVM == +QEMU/KVM +======== QEMU/KVM is using the userfaultfd syscall to implement postcopy live migration. Postcopy live migration is one form of memory @@ -163,7 +171,8 @@ sending the same page twice (in case the userfault is read by the postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration thread). -== Non-cooperative userfaultfd == +Non-cooperative userfaultfd +=========================== When the userfaultfd is monitored by an external manager, the manager must be able to track changes in the process virtual memory @@ -172,27 +181,30 @@ the same read(2) protocol as for the page fault notifications. The manager has to explicitly enable these events by setting appropriate bits in uffdio_api.features passed to UFFDIO_API ioctl: -UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). 
When -this feature is enabled, the userfaultfd context of the parent process -is duplicated into the newly created process. The manager receives -UFFD_EVENT_FORK with file descriptor of the new userfaultfd context in -the uffd_msg.fork. +UFFD_FEATURE_EVENT_FORK + enable userfaultfd hooks for fork(). When this feature is + enabled, the userfaultfd context of the parent process is + duplicated into the newly created process. The manager + receives UFFD_EVENT_FORK with file descriptor of the new + userfaultfd context in the uffd_msg.fork. -UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap() -calls. When the non-cooperative process moves a virtual memory area to -a different location, the manager will receive UFFD_EVENT_REMAP. The -uffd_msg.remap will contain the old and new addresses of the area and -its original length. +UFFD_FEATURE_EVENT_REMAP + enable notifications about mremap() calls. When the + non-cooperative process moves a virtual memory area to a + different location, the manager will receive + UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and + new addresses of the area and its original length. -UFFD_FEATURE_EVENT_REMOVE - enable notifications about -madvise(MADV_REMOVE) and madvise(MADV_DONTNEED) calls. The event -UFFD_EVENT_REMOVE will be generated upon these calls to madvise. The -uffd_msg.remove will contain start and end addresses of the removed -area. +UFFD_FEATURE_EVENT_REMOVE + enable notifications about madvise(MADV_REMOVE) and + madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will + be generated upon these calls to madvise. The uffd_msg.remove + will contain start and end addresses of the removed area. -UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory -unmapping. The manager will get UFFD_EVENT_UNMAP with uffd_msg.remove -containing start and end addresses of the unmapped area. +UFFD_FEATURE_EVENT_UNMAP + enable notifications about memory unmapping. 
The manager will + get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and + end addresses of the unmapped area. Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP are pretty similar, they quite differ in the action expected from the From 44bc09eb3ed8d8a1701914f64c294d089f4b6c86 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:44 +0200 Subject: [PATCH 031/103] docs/vm: z3fold.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/z3fold.txt | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/Documentation/vm/z3fold.txt b/Documentation/vm/z3fold.txt index 38e4dac810b6..224e3c61d686 100644 --- a/Documentation/vm/z3fold.txt +++ b/Documentation/vm/z3fold.txt @@ -1,5 +1,8 @@ +.. _z3fold: + +====== z3fold ------- +====== z3fold is a special purpose allocator for storing compressed pages. It is designed to store up to three compressed pages per physical page. @@ -7,6 +10,7 @@ It is a zbud derivative which allows for higher compression ratio keeping the simplicity and determinism of its predecessor. The main differences between z3fold and zbud are: + * unlike zbud, z3fold allows for up to PAGE_SIZE allocations * z3fold can hold up to 3 compressed pages in its page * z3fold doesn't export any API itself and is thus intended to be used From 2a05c58bf93e2fe34cb48add0a75d0fe93ebe871 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:45 +0200 Subject: [PATCH 032/103] docs/vm: zsmalloc.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/zsmalloc.txt | 60 +++++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 24 deletions(-) diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt index 64ed63c4f69d..6e79893d6132 100644 --- a/Documentation/vm/zsmalloc.txt +++ b/Documentation/vm/zsmalloc.txt @@ -1,5 +1,8 @@ +.. 
_zsmalloc: + +======== zsmalloc --------- +======== This allocator is designed for use with zram. Thus, the allocator is supposed to work well under low memory conditions. In particular, it @@ -31,40 +34,49 @@ be mapped using zs_map_object() to get a usable pointer and subsequently unmapped using zs_unmap_object(). stat ----- +==== With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via -/sys/kernel/debug/zsmalloc/. Here is a sample of stat output: +``/sys/kernel/debug/zsmalloc/``. Here is a sample of stat output:: -# cat /sys/kernel/debug/zsmalloc/zram0/classes + # cat /sys/kernel/debug/zsmalloc/zram0/classes class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage - .. - .. + ... + ... 9 176 0 1 186 129 8 4 10 192 1 0 2880 2872 135 3 11 208 0 1 819 795 42 2 12 224 0 1 219 159 12 4 - .. - .. + ... + ... -class: index -size: object size zspage stores -almost_empty: the number of ZS_ALMOST_EMPTY zspages(see below) -almost_full: the number of ZS_ALMOST_FULL zspages(see below) -obj_allocated: the number of objects allocated -obj_used: the number of objects allocated to the user -pages_used: the number of pages allocated for the class -pages_per_zspage: the number of 0-order pages to make a zspage +class + index +size + object size zspage stores +almost_empty + the number of ZS_ALMOST_EMPTY zspages(see below) +almost_full + the number of ZS_ALMOST_FULL zspages(see below) +obj_allocated + the number of objects allocated +obj_used + the number of objects allocated to the user +pages_used + the number of pages allocated for the class +pages_per_zspage + the number of 0-order pages to make a zspage -We assign a zspage to ZS_ALMOST_EMPTY fullness group when: - n <= N / f, where -n = number of allocated objects -N = total number of objects zspage can store -f = fullness_threshold_frac(ie, 4 at the moment) +We assign a zspage to ZS_ALMOST_EMPTY fullness group when n <= N / f, where + +* n = number of allocated objects +* N = 
total number of objects zspage can store +* f = fullness_threshold_frac(ie, 4 at the moment) Similarly, we assign zspage to: - ZS_ALMOST_FULL when n > N / f - ZS_EMPTY when n == 0 - ZS_FULL when n == N + +* ZS_ALMOST_FULL when n > N / f +* ZS_EMPTY when n == 0 +* ZS_FULL when n == N From 3406bb5c64a091ad887c3fb339ad88e9e88ef938 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:46 +0200 Subject: [PATCH 033/103] docs/vm: zswap.txt: convert to ReST format Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/zswap.txt | 71 ++++++++++++++++++++++---------------- 1 file changed, 42 insertions(+), 29 deletions(-) diff --git a/Documentation/vm/zswap.txt b/Documentation/vm/zswap.txt index 0b3a1148f9f0..1444ecd40911 100644 --- a/Documentation/vm/zswap.txt +++ b/Documentation/vm/zswap.txt @@ -1,4 +1,11 @@ -Overview: +.. _zswap: + +===== +zswap +===== + +Overview +======== Zswap is a lightweight compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a @@ -7,32 +14,34 @@ for potentially reduced swap I/O.  This trade-off can also result in a significant performance improvement if reads from the compressed cache are faster than reads from a swap device. -NOTE: Zswap is a new feature as of v3.11 and interacts heavily with memory -reclaim. This interaction has not been fully explored on the large set of -potential configurations and workloads that exist. For this reason, zswap -is a work in progress and should be considered experimental. +.. note:: + Zswap is a new feature as of v3.11 and interacts heavily with memory + reclaim. This interaction has not been fully explored on the large set of + potential configurations and workloads that exist. For this reason, zswap + is a work in progress and should be considered experimental. 
+ + Some potential benefits: -Some potential benefits: * Desktop/laptop users with limited RAM capacities can mitigate the -    performance impact of swapping. + performance impact of swapping. * Overcommitted guests that share a common I/O resource can -    dramatically reduce their swap I/O pressure, avoiding heavy handed I/O - throttling by the hypervisor. This allows more work to get done with less - impact to the guest workload and guests sharing the I/O subsystem + dramatically reduce their swap I/O pressure, avoiding heavy handed I/O + throttling by the hypervisor. This allows more work to get done with less + impact to the guest workload and guests sharing the I/O subsystem * Users with SSDs as swap devices can extend the life of the device by -    drastically reducing life-shortening writes. + drastically reducing life-shortening writes. Zswap evicts pages from compressed cache on an LRU basis to the backing swap device when the compressed pool reaches its size limit. This requirement had been identified in prior community discussions. Zswap is disabled by default but can be enabled at boot time by setting -the "enabled" attribute to 1 at boot time. ie: zswap.enabled=1. Zswap +the ``enabled`` attribute to 1 at boot time. ie: ``zswap.enabled=1``. Zswap can also be enabled and disabled at runtime using the sysfs interface. An example command to enable zswap at runtime, assuming sysfs is mounted -at /sys, is: +at ``/sys``, is:: -echo 1 > /sys/module/zswap/parameters/enabled + echo 1 > /sys/module/zswap/parameters/enabled When zswap is disabled at runtime it will stop storing pages that are being swapped out. However, it will _not_ immediately write out or fault @@ -43,7 +52,8 @@ pages out of the compressed pool, a swapoff on the swap device(s) will fault back into memory all swapped out pages, including those in the compressed pool. 
-Design: +Design +====== Zswap receives pages for compression through the Frontswap API and is able to evict pages from its own compressed pool on an LRU basis and write them back to @@ -53,12 +63,12 @@ Zswap makes use of zpool for the managing the compressed memory pool. Each allocation in zpool is not directly accessible by address. Rather, a handle is returned by the allocation routine and that handle must be mapped before being accessed. The compressed memory pool grows on demand and shrinks as compressed -pages are freed. The pool is not preallocated. By default, a zpool of type -zbud is created, but it can be selected at boot time by setting the "zpool" -attribute, e.g. zswap.zpool=zbud. It can also be changed at runtime using the -sysfs "zpool" attribute, e.g. +pages are freed. The pool is not preallocated. By default, a zpool +of type zbud is created, but it can be selected at boot time by +setting the ``zpool`` attribute, e.g. ``zswap.zpool=zbud``. It can +also be changed at runtime using the sysfs ``zpool`` attribute, e.g.:: -echo zbud > /sys/module/zswap/parameters/zpool + echo zbud > /sys/module/zswap/parameters/zpool The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which means the compression ratio will always be 2:1 or worse (because of half-full @@ -83,14 +93,16 @@ via frontswap, to free the compressed entry. Zswap seeks to be simple in its policies. Sysfs attributes allow for one user controlled policy: + * max_pool_percent - The maximum percentage of memory that the compressed - pool can occupy. + pool can occupy. -The default compressor is lzo, but it can be selected at boot time by setting -the “compressor” attribute, e.g. zswap.compressor=lzo. It can also be changed -at runtime using the sysfs "compressor" attribute, e.g. +The default compressor is lzo, but it can be selected at boot time by +setting the ``compressor`` attribute, e.g. ``zswap.compressor=lzo``. 
+It can also be changed at runtime using the sysfs "compressor" +attribute, e.g.:: -echo lzo > /sys/module/zswap/parameters/compressor + echo lzo > /sys/module/zswap/parameters/compressor When the zpool and/or compressor parameter is changed at runtime, any existing compressed pages are not modified; they are left in their own zpool. When a @@ -106,11 +118,12 @@ compressed length of the page is set to zero and the pattern or same-filled value is stored. Same-value filled pages identification feature is enabled by default and can be -disabled at boot time by setting the "same_filled_pages_enabled" attribute to 0, -e.g. zswap.same_filled_pages_enabled=0. It can also be enabled and disabled at -runtime using the sysfs "same_filled_pages_enabled" attribute, e.g. +disabled at boot time by setting the ``same_filled_pages_enabled`` attribute +to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and +disabled at runtime using the sysfs ``same_filled_pages_enabled`` +attribute, e.g.:: -echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled + echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled When zswap same-filled page identification is disabled at runtime, it will stop checking for the same-value filled pages during store operation. 
However, the From ad56b738c5dd223a2f66685830f82194025a6138 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:47 +0200 Subject: [PATCH 034/103] docs/vm: rename documentation files to .rst Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/ABI/stable/sysfs-devices-node | 2 +- .../ABI/testing/sysfs-kernel-mm-hugepages | 2 +- Documentation/ABI/testing/sysfs-kernel-mm-ksm | 2 +- Documentation/ABI/testing/sysfs-kernel-slab | 4 +- .../admin-guide/kernel-parameters.txt | 12 ++-- Documentation/dev-tools/kasan.rst | 2 +- Documentation/filesystems/proc.txt | 4 +- Documentation/filesystems/tmpfs.txt | 2 +- Documentation/sysctl/vm.txt | 6 +- Documentation/vm/00-INDEX | 58 +++++++++---------- .../vm/{active_mm.txt => active_mm.rst} | 0 Documentation/vm/{balance => balance.rst} | 0 .../vm/{cleancache.txt => cleancache.rst} | 0 .../vm/{frontswap.txt => frontswap.rst} | 0 Documentation/vm/{highmem.txt => highmem.rst} | 0 Documentation/vm/{hmm.txt => hmm.rst} | 0 ...etlbfs_reserv.txt => hugetlbfs_reserv.rst} | 0 .../vm/{hugetlbpage.txt => hugetlbpage.rst} | 2 +- .../vm/{hwpoison.txt => hwpoison.rst} | 2 +- ...ge_tracking.txt => idle_page_tracking.rst} | 2 +- Documentation/vm/{ksm.txt => ksm.rst} | 0 .../vm/{mmu_notifier.txt => mmu_notifier.rst} | 0 Documentation/vm/{numa => numa.rst} | 2 +- ...mory_policy.txt => numa_memory_policy.rst} | 0 ...t-accounting => overcommit-accounting.rst} | 0 .../vm/{page_frags => page_frags.rst} | 0 .../vm/{page_migration => page_migration.rst} | 0 .../vm/{page_owner.txt => page_owner.rst} | 0 Documentation/vm/{pagemap.txt => pagemap.rst} | 6 +- ...ap_file_pages.txt => remap_file_pages.rst} | 0 Documentation/vm/{slub.txt => slub.rst} | 0 .../vm/{soft-dirty.txt => soft-dirty.rst} | 0 ...e_table_lock => split_page_table_lock.rst} | 0 .../vm/{swap_numa.txt => swap_numa.rst} | 0 .../vm/{transhuge.txt => transhuge.rst} | 0 ...nevictable-lru.txt => unevictable-lru.rst} | 0 .../vm/{userfaultfd.txt => 
userfaultfd.rst} | 0 Documentation/vm/{z3fold.txt => z3fold.rst} | 0 .../vm/{zsmalloc.txt => zsmalloc.rst} | 0 Documentation/vm/{zswap.txt => zswap.rst} | 0 MAINTAINERS | 2 +- arch/alpha/Kconfig | 2 +- arch/ia64/Kconfig | 2 +- arch/mips/Kconfig | 2 +- arch/powerpc/Kconfig | 2 +- fs/Kconfig | 2 +- fs/dax.c | 2 +- fs/proc/task_mmu.c | 4 +- include/linux/hmm.h | 2 +- include/linux/memremap.h | 4 +- include/linux/mmu_notifier.h | 2 +- include/linux/sched/mm.h | 4 +- include/linux/swap.h | 2 +- mm/Kconfig | 6 +- mm/cleancache.c | 2 +- mm/frontswap.c | 2 +- mm/hmm.c | 2 +- mm/huge_memory.c | 4 +- mm/hugetlb.c | 4 +- mm/ksm.c | 4 +- mm/mmap.c | 2 +- mm/rmap.c | 6 +- mm/util.c | 2 +- 63 files changed, 87 insertions(+), 87 deletions(-) rename Documentation/vm/{active_mm.txt => active_mm.rst} (100%) rename Documentation/vm/{balance => balance.rst} (100%) rename Documentation/vm/{cleancache.txt => cleancache.rst} (100%) rename Documentation/vm/{frontswap.txt => frontswap.rst} (100%) rename Documentation/vm/{highmem.txt => highmem.rst} (100%) rename Documentation/vm/{hmm.txt => hmm.rst} (100%) rename Documentation/vm/{hugetlbfs_reserv.txt => hugetlbfs_reserv.rst} (100%) rename Documentation/vm/{hugetlbpage.txt => hugetlbpage.rst} (99%) rename Documentation/vm/{hwpoison.txt => hwpoison.rst} (99%) rename Documentation/vm/{idle_page_tracking.txt => idle_page_tracking.rst} (98%) rename Documentation/vm/{ksm.txt => ksm.rst} (100%) rename Documentation/vm/{mmu_notifier.txt => mmu_notifier.rst} (100%) rename Documentation/vm/{numa => numa.rst} (99%) rename Documentation/vm/{numa_memory_policy.txt => numa_memory_policy.rst} (100%) rename Documentation/vm/{overcommit-accounting => overcommit-accounting.rst} (100%) rename Documentation/vm/{page_frags => page_frags.rst} (100%) rename Documentation/vm/{page_migration => page_migration.rst} (100%) rename Documentation/vm/{page_owner.txt => page_owner.rst} (100%) rename Documentation/vm/{pagemap.txt => pagemap.rst} (98%) rename 
Documentation/vm/{remap_file_pages.txt => remap_file_pages.rst} (100%) rename Documentation/vm/{slub.txt => slub.rst} (100%) rename Documentation/vm/{soft-dirty.txt => soft-dirty.rst} (100%) rename Documentation/vm/{split_page_table_lock => split_page_table_lock.rst} (100%) rename Documentation/vm/{swap_numa.txt => swap_numa.rst} (100%) rename Documentation/vm/{transhuge.txt => transhuge.rst} (100%) rename Documentation/vm/{unevictable-lru.txt => unevictable-lru.rst} (100%) rename Documentation/vm/{userfaultfd.txt => userfaultfd.rst} (100%) rename Documentation/vm/{z3fold.txt => z3fold.rst} (100%) rename Documentation/vm/{zsmalloc.txt => zsmalloc.rst} (100%) rename Documentation/vm/{zswap.txt => zswap.rst} (100%) diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index 5b2d0f08867c..b38f4b734567 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -90,4 +90,4 @@ Date: December 2009 Contact: Lee Schermerhorn Description: The node's huge page size control/query attributes. - See Documentation/vm/hugetlbpage.txt \ No newline at end of file + See Documentation/vm/hugetlbpage.rst \ No newline at end of file diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages index e21c00571cf4..5140b233356c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages +++ b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages @@ -12,4 +12,4 @@ Description: free_hugepages surplus_hugepages resv_hugepages - See Documentation/vm/hugetlbpage.txt for details. + See Documentation/vm/hugetlbpage.rst for details. 
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-ksm b/Documentation/ABI/testing/sysfs-kernel-mm-ksm index 73e653ee2481..dfc13244cda3 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-ksm +++ b/Documentation/ABI/testing/sysfs-kernel-mm-ksm @@ -40,7 +40,7 @@ Description: Kernel Samepage Merging daemon sysfs interface sleep_millisecs: how many milliseconds ksm should sleep between scans. - See Documentation/vm/ksm.txt for more information. + See Documentation/vm/ksm.rst for more information. What: /sys/kernel/mm/ksm/merge_across_nodes Date: January 2013 diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab index 2cc0a72b64be..29601d93a1c2 100644 --- a/Documentation/ABI/testing/sysfs-kernel-slab +++ b/Documentation/ABI/testing/sysfs-kernel-slab @@ -37,7 +37,7 @@ Description: The alloc_calls file is read-only and lists the kernel code locations from which allocations for this cache were performed. The alloc_calls file only contains information if debugging is - enabled for that cache (see Documentation/vm/slub.txt). + enabled for that cache (see Documentation/vm/slub.rst). What: /sys/kernel/slab/cache/alloc_fastpath Date: February 2008 @@ -219,7 +219,7 @@ Contact: Pekka Enberg , Description: The free_calls file is read-only and lists the locations of object frees if slab debugging is enabled (see - Documentation/vm/slub.txt). + Documentation/vm/slub.rst). What: /sys/kernel/slab/cache/free_fastpath Date: February 2008 diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 1d1d53f85ddd..5d6e5509c049 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3887,7 +3887,7 @@ cache (risks via metadata attacks are mostly unchanged). Debug options disable merging on their own. - For more information see Documentation/vm/slub.txt. + For more information see Documentation/vm/slub.rst. 
slab_max_order= [MM, SLAB] Determines the maximum allowed order for slabs. @@ -3901,7 +3901,7 @@ slub_debug can create guard zones around objects and may poison objects when not in use. Also tracks the last alloc / free. For more information see - Documentation/vm/slub.txt. + Documentation/vm/slub.rst. slub_memcg_sysfs= [MM, SLUB] Determines whether to enable sysfs directories for @@ -3915,7 +3915,7 @@ Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation. For more information see - Documentation/vm/slub.txt. + Documentation/vm/slub.rst. slub_min_objects= [MM, SLUB] The minimum number of objects per slab. SLUB will @@ -3924,12 +3924,12 @@ the number of objects indicated. The higher the number of objects the smaller the overhead of tracking slabs and the less frequently locks need to be acquired. - For more information see Documentation/vm/slub.txt. + For more information see Documentation/vm/slub.rst. slub_min_order= [MM, SLUB] Determines the minimum page order for slabs. Must be lower than slub_max_order. - For more information see Documentation/vm/slub.txt. + For more information see Documentation/vm/slub.rst. slub_nomerge [MM, SLUB] Same with slab_nomerge. This is supported for legacy. @@ -4285,7 +4285,7 @@ Format: [always|madvise|never] Can be used to control the default behavior of the system with respect to transparent hugepages. - See Documentation/vm/transhuge.txt for more details. + See Documentation/vm/transhuge.rst for more details. tsc= Disable clocksource stability checks for TSC. Format: diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index f7a18f274357..aabc8738b3d8 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -120,7 +120,7 @@ A typical out of bounds access report looks like this:: The header of the report discribe what kind of bug happened and what kind of access caused it. 
It's followed by the description of the accessed slub object -(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and +(see 'SLUB Debug output' section in Documentation/vm/slub.rst for details) and the description of the accessed memory page. In the last section the report shows memory state around the accessed address. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 2a84bb334894..2d3984c70feb 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -515,7 +515,7 @@ guarantees: The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG bits on both physical and virtual pages associated with a process, and the -soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details). +soft-dirty bit on pte (see Documentation/vm/soft-dirty.rst for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -536,7 +536,7 @@ Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags using /proc/kpageflags and number of times a page is mapped using -/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt. +/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.rst. The /proc/pid/numa_maps is an extension based on maps, showing the memory locality and binding policy, as well as the memory usage (in pages) of diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index a85355cf85f4..627389a34f77 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -105,7 +105,7 @@ policy for the file will revert to "default" policy. NUMA memory allocation policies have optional flags that can be used in conjunction with their modes. 
These optional flags can be specified when tmpfs is mounted by appending them to the mode before the NodeList. -See Documentation/vm/numa_memory_policy.txt for a list of all available +See Documentation/vm/numa_memory_policy.rst for a list of all available memory allocation policy mode flags and their effect on memory policy. =static is equivalent to MPOL_F_STATIC_NODES diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index ff234d229cbb..ef581a940439 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -516,7 +516,7 @@ nr_hugepages Change the minimum size of the hugepage pool. -See Documentation/vm/hugetlbpage.txt +See Documentation/vm/hugetlbpage.rst ============================================================== @@ -525,7 +525,7 @@ nr_overcommit_hugepages Change the maximum size of the hugepage pool. The maximum is nr_hugepages + nr_overcommit_hugepages. -See Documentation/vm/hugetlbpage.txt +See Documentation/vm/hugetlbpage.rst ============================================================== @@ -668,7 +668,7 @@ and don't use much of it. The default value is 0. -See Documentation/vm/overcommit-accounting and +See Documentation/vm/overcommit-accounting.rst and mm/mmap.c::__vm_enough_memory() for more information. ============================================================== diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index 0278f2c85efb..cda564d55b3c 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -1,62 +1,62 @@ 00-INDEX - this file. -active_mm.txt +active_mm.rst - An explanation from Linus about tsk->active_mm vs tsk->mm. -balance +balance.rst - various information on memory balancing. -cleancache.txt +cleancache.rst - Intro to cleancache and page-granularity victim cache. -frontswap.txt +frontswap.rst - Outline frontswap, part of the transcendent memory frontend. -highmem.txt +highmem.rst - Outline of highmem and common issues. 
-hmm.txt +hmm.rst - Documentation of heterogeneous memory management -hugetlbpage.txt +hugetlbpage.rst - a brief summary of hugetlbpage support in the Linux kernel. -hugetlbfs_reserv.txt +hugetlbfs_reserv.rst - A brief overview of hugetlbfs reservation design/implementation. -hwpoison.txt +hwpoison.rst - explains what hwpoison is -idle_page_tracking.txt +idle_page_tracking.rst - description of the idle page tracking feature. -ksm.txt +ksm.rst - how to use the Kernel Samepage Merging feature. -mmu_notifier.txt +mmu_notifier.rst - a note about clearing pte/pmd and mmu notifications -numa +numa.rst - information about NUMA specific code in the Linux vm. -numa_memory_policy.txt +numa_memory_policy.rst - documentation of concepts and APIs of the 2.6 memory policy support. -overcommit-accounting +overcommit-accounting.rst - description of the Linux kernels overcommit handling modes. -page_frags +page_frags.rst - description of page fragments allocator -page_migration +page_migration.rst - description of page migration in NUMA systems. -pagemap.txt +pagemap.rst - pagemap, from the userspace perspective -page_owner.txt +page_owner.rst - tracking about who allocated each page -remap_file_pages.txt +remap_file_pages.rst - a note about remap_file_pages() system call -slub.txt +slub.rst - a short users guide for SLUB. -soft-dirty.txt +soft-dirty.rst - short explanation for soft-dirty PTEs -split_page_table_lock +split_page_table_lock.rst - Separate per-table lock to improve scalability of the old page_table_lock. -swap_numa.txt +swap_numa.rst - automatic binding of swap device to numa node -transhuge.txt +transhuge.rst - Transparent Hugepage Support, alternative way of using hugepages. 
-unevictable-lru.txt +unevictable-lru.rst - Unevictable LRU infrastructure -userfaultfd.txt +userfaultfd.rst - description of userfaultfd system call z3fold.txt - outline of z3fold allocator for storing compressed pages -zsmalloc.txt +zsmalloc.rst - outline of zsmalloc allocator for storing compressed pages -zswap.txt +zswap.rst - Intro to compressed cache for swap pages diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.rst similarity index 100% rename from Documentation/vm/active_mm.txt rename to Documentation/vm/active_mm.rst diff --git a/Documentation/vm/balance b/Documentation/vm/balance.rst similarity index 100% rename from Documentation/vm/balance rename to Documentation/vm/balance.rst diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.rst similarity index 100% rename from Documentation/vm/cleancache.txt rename to Documentation/vm/cleancache.rst diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.rst similarity index 100% rename from Documentation/vm/frontswap.txt rename to Documentation/vm/frontswap.rst diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.rst similarity index 100% rename from Documentation/vm/highmem.txt rename to Documentation/vm/highmem.rst diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.rst similarity index 100% rename from Documentation/vm/hmm.txt rename to Documentation/vm/hmm.rst diff --git a/Documentation/vm/hugetlbfs_reserv.txt b/Documentation/vm/hugetlbfs_reserv.rst similarity index 100% rename from Documentation/vm/hugetlbfs_reserv.txt rename to Documentation/vm/hugetlbfs_reserv.rst diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.rst similarity index 99% rename from Documentation/vm/hugetlbpage.txt rename to Documentation/vm/hugetlbpage.rst index 3bb0d991f102..a5da14b05b4b 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.rst @@ -217,7 +217,7 @@ When adjusting the persistent 
hugepage count via ``nr_hugepages_mempolicy``, any memory policy mode--bind, preferred, local or interleave--may be used. The resulting effect on persistent huge page allocation is as follows: -#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt], +#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified. However, if a node in the policy does not contain sufficient contiguous diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.rst similarity index 99% rename from Documentation/vm/hwpoison.txt rename to Documentation/vm/hwpoison.rst index b1a8c241d6c2..070aa1e716b7 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.rst @@ -155,7 +155,7 @@ Testing value). This allows stress testing of many kinds of pages. The page_flags are the same as in /proc/kpageflags. The flag bits are defined in include/linux/kernel-page-flags.h and - documented in Documentation/vm/pagemap.txt + documented in Documentation/vm/pagemap.rst * Architecture specific MCE injector diff --git a/Documentation/vm/idle_page_tracking.txt b/Documentation/vm/idle_page_tracking.rst similarity index 98% rename from Documentation/vm/idle_page_tracking.txt rename to Documentation/vm/idle_page_tracking.rst index 9cbe6f8d7a99..d1c4609a5220 100644 --- a/Documentation/vm/idle_page_tracking.txt +++ b/Documentation/vm/idle_page_tracking.rst @@ -65,7 +65,7 @@ workload one should: are not reclaimable, he or she can filter them out using ``/proc/kpageflags``. -See Documentation/vm/pagemap.txt for more information about +See Documentation/vm/pagemap.rst for more information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. .. 
_impl_details:
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.rst
similarity index 100%
rename from Documentation/vm/ksm.txt
rename to Documentation/vm/ksm.rst
diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.rst
similarity index 100%
rename from Documentation/vm/mmu_notifier.txt
rename to Documentation/vm/mmu_notifier.rst
diff --git a/Documentation/vm/numa b/Documentation/vm/numa.rst
similarity index 99%
rename from Documentation/vm/numa
rename to Documentation/vm/numa.rst
index c81e7c56f0f9..aada84bc8c46 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa.rst
@@ -110,7 +110,7 @@ to improve NUMA locality using various CPU affinity command line interfaces,
 such as taskset(1) and numactl(1), and program interfaces such as
 sched_setaffinity(2). Further, one can modify the kernel's default local
 allocation behavior using Linux NUMA memory policy.
-[see Documentation/vm/numa_memory_policy.txt.]
+[see Documentation/vm/numa_memory_policy.rst.]
 System administrators can restrict the CPUs and nodes' memories that a non-
 privileged user can specify in the scheduling or NUMA commands and functions
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.rst
similarity index 100%
rename from Documentation/vm/numa_memory_policy.txt
rename to Documentation/vm/numa_memory_policy.rst
diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting.rst
similarity index 100%
rename from Documentation/vm/overcommit-accounting
rename to Documentation/vm/overcommit-accounting.rst
diff --git a/Documentation/vm/page_frags b/Documentation/vm/page_frags.rst
similarity index 100%
rename from Documentation/vm/page_frags
rename to Documentation/vm/page_frags.rst
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration.rst
similarity index 100%
rename from Documentation/vm/page_migration
rename to Documentation/vm/page_migration.rst
diff --git a/Documentation/vm/page_owner.txt b/Documentation/vm/page_owner.rst
similarity index 100%
rename from Documentation/vm/page_owner.txt
rename to Documentation/vm/page_owner.rst
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.rst
similarity index 98%
rename from Documentation/vm/pagemap.txt
rename to Documentation/vm/pagemap.rst
index bd6d71740c88..d54b4bfd3043 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.rst
@@ -18,7 +18,7 @@ There are four components to pagemap:
  * Bits 0-54 page frame number (PFN) if present
  * Bits 0-4 swap type if swapped
  * Bits 5-54 swap offset if swapped
- * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
+ * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst)
  * Bit 56 page exclusively mapped (since 4.2)
  * Bits 57-60 zero
  * Bit 61 page is file-page or shared-anon (since 3.5)
@@ -97,7 +97,7 @@ Short descriptions to the page flags:
 A compound page with order N consists of 2^N physically contiguous pages.
A compound page with order 2 takes the form of "HTTT", where H donates its head page and T donates its tail page(s). The major consumers of compound - pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc. + pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc. memory allocators and various device drivers. However in this interface, only huge/giga pages are made visible to end users. 16 - COMPOUND_TAIL @@ -118,7 +118,7 @@ Short descriptions to the page flags: zero page for pfn_zero or huge_zero page 25 - IDLE page has not been accessed since it was marked idle (see - Documentation/vm/idle_page_tracking.txt). Note that this flag may be + Documentation/vm/idle_page_tracking.rst). Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.rst similarity index 100% rename from Documentation/vm/remap_file_pages.txt rename to Documentation/vm/remap_file_pages.rst diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.rst similarity index 100% rename from Documentation/vm/slub.txt rename to Documentation/vm/slub.rst diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.rst similarity index 100% rename from Documentation/vm/soft-dirty.txt rename to Documentation/vm/soft-dirty.rst diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock.rst similarity index 100% rename from Documentation/vm/split_page_table_lock rename to Documentation/vm/split_page_table_lock.rst diff --git a/Documentation/vm/swap_numa.txt b/Documentation/vm/swap_numa.rst similarity index 100% rename from Documentation/vm/swap_numa.txt rename to Documentation/vm/swap_numa.rst diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.rst similarity index 100% rename from 
Documentation/vm/transhuge.txt
rename to Documentation/vm/transhuge.rst
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.rst
similarity index 100%
rename from Documentation/vm/unevictable-lru.txt
rename to Documentation/vm/unevictable-lru.rst
diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.rst
similarity index 100%
rename from Documentation/vm/userfaultfd.txt
rename to Documentation/vm/userfaultfd.rst
diff --git a/Documentation/vm/z3fold.txt b/Documentation/vm/z3fold.rst
similarity index 100%
rename from Documentation/vm/z3fold.txt
rename to Documentation/vm/z3fold.rst
diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.rst
similarity index 100%
rename from Documentation/vm/zsmalloc.txt
rename to Documentation/vm/zsmalloc.rst
diff --git a/Documentation/vm/zswap.txt b/Documentation/vm/zswap.rst
similarity index 100%
rename from Documentation/vm/zswap.txt
rename to Documentation/vm/zswap.rst
diff --git a/MAINTAINERS b/MAINTAINERS
index 3bdc260e36b7..575849a8343e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15406,7 +15406,7 @@ L: linux-mm@kvack.org
 S: Maintained
 F: mm/zsmalloc.c
 F: include/linux/zsmalloc.h
-F: Documentation/vm/zsmalloc.txt
+F: Documentation/vm/zsmalloc.rst
 
 ZSWAP COMPRESSED SWAP CACHING
 M: Seth Jennings
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index e96adcbcab41..f53e5060afe7 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -584,7 +584,7 @@ config ARCH_DISCONTIGMEM_ENABLE
 Say Y to support efficient handling of discontiguous physical memory,
 for architectures which are either NUMA (Non-Uniform Memory Access)
 or have huge holes in the physical address space for other reasons.
- See for more.
+ See for more.
source "mm/Kconfig" diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index bbe12a038d21..3ac9bf4cc2a0 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -397,7 +397,7 @@ config ARCH_DISCONTIGMEM_ENABLE Say Y to support efficient handling of discontiguous physical memory, for architectures which are either NUMA (Non-Uniform Memory Access) or have huge holes in the physical address space for other reasons. - See for more. + See for more. config ARCH_FLATMEM_ENABLE def_bool y diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 8128c3b68d6b..4562810857eb 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -2551,7 +2551,7 @@ config ARCH_DISCONTIGMEM_ENABLE Say Y to support efficient handling of discontiguous physical memory, for architectures which are either NUMA (Non-Uniform Memory Access) or have huge holes in the physical address space for other reasons. - See for more. + See for more. config ARCH_SPARSEMEM_ENABLE bool diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 73ce5dd07642..f8c0f10949ea 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -880,7 +880,7 @@ config PPC_MEM_KEYS page-based protections, but without requiring modification of the page tables when an application changes protection domains. - For details, see Documentation/vm/protection-keys.txt + For details, see Documentation/vm/protection-keys.rst If unsure, say y. diff --git a/fs/Kconfig b/fs/Kconfig index bc821a86d965..ba53dc2a9691 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -196,7 +196,7 @@ config HUGETLBFS help hugetlbfs is a filesystem backing for HugeTLB pages, based on ramfs. For architectures that support it, say Y here and read - for details. + for details. If unsure, say N. diff --git a/fs/dax.c b/fs/dax.c index 0276df90e86c..0eb65c34d5a6 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -618,7 +618,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, * downgrading page table protection not changing it to point * to a new page. 
* - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ if (pmdp) { #ifdef CONFIG_FS_DAX_PMD diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index ec6d2983a5cb..91d14c4ac04a 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -956,7 +956,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, /* * The soft-dirty tracker uses #PF-s to catch writes * to pages, so write-protect the pte as well. See the - * Documentation/vm/soft-dirty.txt for full description + * Documentation/vm/soft-dirty.rst for full description * of how soft-dirty works. */ pte_t ptent = *pte; @@ -1436,7 +1436,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask, * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped - * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt) + * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst) * Bit 56 page exclusively mapped * Bits 57-60 zero * Bit 61 page is file-page or shared-anon diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 325017ad9311..77be87c095f2 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -16,7 +16,7 @@ /* * Heterogeneous Memory Management (HMM) * - * See Documentation/vm/hmm.txt for reasons and overview of what HMM is and it + * See Documentation/vm/hmm.rst for reasons and overview of what HMM is and it * is for. Here we focus on the HMM API description, with some explanation of * the underlying implementation. * diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 7b4899c06f49..74ea5e2310a8 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -45,7 +45,7 @@ struct vmem_altmap { * must be treated as an opaque object, rather than a "normal" struct page. * * A more complete discussion of unaddressable memory may be found in - * include/linux/hmm.h and Documentation/vm/hmm.txt. + * include/linux/hmm.h and Documentation/vm/hmm.rst. 
* * MEMORY_DEVICE_PUBLIC: * Device memory that is cache coherent from device and CPU point of view. This @@ -67,7 +67,7 @@ enum memory_type { * page_free() * * Additional notes about MEMORY_DEVICE_PRIVATE may be found in - * include/linux/hmm.h and Documentation/vm/hmm.txt. There is also a brief + * include/linux/hmm.h and Documentation/vm/hmm.rst. There is also a brief * explanation in include/linux/memory_hotplug.h. * * The page_fault() callback must migrate page back, from device memory to diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 2d07a1ed5a31..392e6af82701 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -174,7 +174,7 @@ struct mmu_notifier_ops { * invalidate_range_start()/end() notifiers, as * invalidate_range() alread catches the points in time when an * external TLB range needs to be flushed. For more in depth - * discussion on this see Documentation/vm/mmu_notifier.txt + * discussion on this see Documentation/vm/mmu_notifier.rst * * Note that this function might be called with just a sub-range * of what was passed to invalidate_range_start()/end(), if diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 1149533aa2fa..df2c7d11f496 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -28,7 +28,7 @@ extern struct mm_struct *mm_alloc(void); * * Use mmdrop() to release the reference acquired by mmgrab(). * - * See also for an in-depth explanation + * See also for an in-depth explanation * of &mm_struct.mm_count vs &mm_struct.mm_users. */ static inline void mmgrab(struct mm_struct *mm) @@ -51,7 +51,7 @@ extern void mmdrop(struct mm_struct *mm); * * Use mmput() to release the reference acquired by mmget(). * - * See also for an in-depth explanation + * See also for an in-depth explanation * of &mm_struct.mm_count vs &mm_struct.mm_users. 
*/ static inline void mmget(struct mm_struct *mm) diff --git a/include/linux/swap.h b/include/linux/swap.h index 7b6a59f722a3..4003973deff4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -53,7 +53,7 @@ static inline int current_is_kswapd(void) /* * Unaddressable device memory support. See include/linux/hmm.h and - * Documentation/vm/hmm.txt. Short description is we need struct pages for + * Documentation/vm/hmm.rst. Short description is we need struct pages for * device memory that is unaddressable (inaccessible) by CPU, so that we can * migrate part of a process memory to device memory. * diff --git a/mm/Kconfig b/mm/Kconfig index c782e8fb7235..b9f04213a353 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -312,7 +312,7 @@ config KSM the many instances by a single page with that content, so saving memory until one or another app needs to modify the content. Recommended for use with KVM, or with other duplicative applications. - See Documentation/vm/ksm.txt for more information: KSM is inactive + See Documentation/vm/ksm.rst for more information: KSM is inactive until a program has madvised that an area is MADV_MERGEABLE, and root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set). @@ -537,7 +537,7 @@ config MEM_SOFT_DIRTY into a page just as regular dirty bit, but unlike the latter it can be cleared by hands. - See Documentation/vm/soft-dirty.txt for more details. + See Documentation/vm/soft-dirty.rst for more details. config ZSWAP bool "Compressed cache for swap pages (EXPERIMENTAL)" @@ -664,7 +664,7 @@ config IDLE_PAGE_TRACKING be useful to tune memory cgroup limits and/or for job placement within a compute cluster. - See Documentation/vm/idle_page_tracking.txt for more details. + See Documentation/vm/idle_page_tracking.rst for more details. 
# arch_add_memory() comprehends device memory config ARCH_HAS_ZONE_DEVICE diff --git a/mm/cleancache.c b/mm/cleancache.c index f7b9fdc79d97..126548b5a292 100644 --- a/mm/cleancache.c +++ b/mm/cleancache.c @@ -3,7 +3,7 @@ * * This code provides the generic "frontend" layer to call a matching * "backend" driver implementation of cleancache. See - * Documentation/vm/cleancache.txt for more information. + * Documentation/vm/cleancache.rst for more information. * * Copyright (C) 2009-2010 Oracle Corp. All rights reserved. * Author: Dan Magenheimer diff --git a/mm/frontswap.c b/mm/frontswap.c index fec8b5044040..4f5476a0f955 100644 --- a/mm/frontswap.c +++ b/mm/frontswap.c @@ -3,7 +3,7 @@ * * This code provides the generic "frontend" layer to call a matching * "backend" driver implementation of frontswap. See - * Documentation/vm/frontswap.txt for more information. + * Documentation/vm/frontswap.rst for more information. * * Copyright (C) 2009-2012 Oracle Corp. All rights reserved. * Author: Dan Magenheimer diff --git a/mm/hmm.c b/mm/hmm.c index 320545b98ff5..af176c6820cf 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -37,7 +37,7 @@ #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC) /* - * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h + * Device private memory see HMM (Documentation/vm/hmm.rst) or hmm.h */ DEFINE_STATIC_KEY_FALSE(device_private_key); EXPORT_SYMBOL(device_private_key); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 87ab9b8f56b5..6d5911673450 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1185,7 +1185,7 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, * mmu_notifier_invalidate_range_end() happens which can lead to a * device seeing memory write in different order than CPU. 
* - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); @@ -2037,7 +2037,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, * replacing a zero pmd write protected page with a zero pte write * protected page. * - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ pmdp_huge_clear_flush(vma, haddr, pmd); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7c204e3d132b..5af974abae46 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3289,7 +3289,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, * table protection not changing it to point * to a new page. * - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ huge_ptep_set_wrprotect(src, addr, src_pte); } @@ -4355,7 +4355,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * No need to call mmu_notifier_invalidate_range() we are downgrading * page table protection not changing it to point to a new page. * - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ i_mmap_unlock_write(vma->vm_file->f_mapping); mmu_notifier_invalidate_range_end(mm, start, end); diff --git a/mm/ksm.c b/mm/ksm.c index 293721f5da70..0b88698a9014 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1049,7 +1049,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, * No need to notify as we are downgrading page table to read * only not changing it to point to a new page. * - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); /* @@ -1138,7 +1138,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, * No need to notify as we are replacing a read only page with another * read only page with the same content. 
* - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ ptep_clear_flush(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, newpte); diff --git a/mm/mmap.c b/mm/mmap.c index 9efdc021ad22..39fc51d1639c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2769,7 +2769,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, unsigned long ret = -EINVAL; struct file *file; - pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/vm/remap_file_pages.txt.\n", + pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/vm/remap_file_pages.rst.\n", current->comm, current->pid); if (prot) diff --git a/mm/rmap.c b/mm/rmap.c index 47db27f8049e..854b703fbe2a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -942,7 +942,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, * downgrading page table protection not changing it to point * to a new page. * - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ if (ret) (*cleaned)++; @@ -1587,7 +1587,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * point at new page while a device still is using this * page. * - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ dec_mm_counter(mm, mm_counter_file(page)); } @@ -1597,7 +1597,7 @@ discard: * done above for all cases requiring it to happen under page * table lock before mmu_notifier_invalidate_range_end() * - * See Documentation/vm/mmu_notifier.txt + * See Documentation/vm/mmu_notifier.rst */ page_remove_rmap(subpage, PageHuge(page)); put_page(page); diff --git a/mm/util.c b/mm/util.c index c1250501364f..e857c80c6f4a 100644 --- a/mm/util.c +++ b/mm/util.c @@ -609,7 +609,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed); * succeed and -ENOMEM implies there is not. * * We currently support three overcommit policies, which are set via the - * vm.overcommit_memory sysctl. 
See Documentation/vm/overcommit-accounting + * vm.overcommit_memory sysctl. See Documentation/vm/overcommit-accounting.rst * * Strict overcommit modes added 2002 Feb 26 by Alan Cox. * Additional code 2002 Jul 20 by Robert Love. From 82381918c4712ba107d3e4ff7117751f396018f7 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 21 Mar 2018 21:22:48 +0200 Subject: [PATCH 035/103] docs/vm: add index.rst and link MM documentation to top level index Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/index.rst | 3 +- Documentation/vm/conf.py | 10 +++++++ Documentation/vm/index.rst | 56 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 68 insertions(+), 1 deletion(-) create mode 100644 Documentation/vm/conf.py create mode 100644 Documentation/vm/index.rst diff --git a/Documentation/index.rst b/Documentation/index.rst index 3b99ab931d41..fdc585703498 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -45,7 +45,7 @@ the kernel interface as seen by application developers. .. toctree:: :maxdepth: 2 - userspace-api/index + userspace-api/index Introduction to kernel development @@ -89,6 +89,7 @@ needed). 
sound/index crypto/index filesystems/index + vm/index Architecture-specific documentation ----------------------------------- diff --git a/Documentation/vm/conf.py b/Documentation/vm/conf.py new file mode 100644 index 000000000000..3b0b601af558 --- /dev/null +++ b/Documentation/vm/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "Linux Memory Management Documentation" + +tags.add("subproject") + +latex_documents = [ + ('index', 'memory-management.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst new file mode 100644 index 000000000000..6c451421a01e --- /dev/null +++ b/Documentation/vm/index.rst @@ -0,0 +1,56 @@ +===================================== +Linux Memory Management Documentation +===================================== + +This is a collection of documents about Linux memory management (mm) subsystem. + +User guides for MM features +=========================== + +The following documents provide guides for controlling and tuning +various features of the Linux memory management + +.. toctree:: + :maxdepth: 1 + + hugetlbpage + idle_page_tracking + ksm + numa_memory_policy + pagemap + transhuge + soft-dirty + swap_numa + userfaultfd + zswap + +Kernel developers MM documentation +================================== + +The below documents describe MM internals with different level of +details ranging from notes and mailing list responses to elaborate +descriptions of data structures and algorithms. + +.. 
toctree:: + :maxdepth: 1 + + active_mm + balance + cleancache + frontswap + highmem + hmm + hwpoison + hugetlbfs_reserv + mmu_notifier + numa + overcommit-accounting + page_migration + page_frags + page_owner + remap_file_pages + slub + split_page_table_lock + unevictable-lru + z3fold + zsmalloc From 6d9094862b70020c684588147e5eff52dec19c8d Mon Sep 17 00:00:00 2001 From: Heikki Krogerus Date: Fri, 6 Apr 2018 15:41:22 +0300 Subject: [PATCH 036/103] Documentation: typec.rst: Use literal-block element with ascii art Using reStructuredText literal-block element with ascii-art. That prevents the ascii art from being processed as reStructuredText. Reported-by: Masanari Iida Reviewed-and-tested-by: Jani Nikula Fixes: bdecb33af34f ("usb: typec: API for controlling USB Type-C Multiplexers") Signed-off-by: Heikki Krogerus Signed-off-by: Jonathan Corbet --- Documentation/driver-api/usb/typec.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/driver-api/usb/typec.rst b/Documentation/driver-api/usb/typec.rst index feb31946490b..48ff58095f11 100644 --- a/Documentation/driver-api/usb/typec.rst +++ b/Documentation/driver-api/usb/typec.rst @@ -210,7 +210,7 @@ If the connector is dual-role capable, there may also be a switch for the data role. USB Type-C Connector Class does not supply separate API for them. The port drivers can use USB Role Class API with those. -Illustration of the muxes behind a connector that supports an alternate mode: +Illustration of the muxes behind a connector that supports an alternate mode:: ------------------------ | Connector | From 3b443955596e8f9965dbf11bd9bc0554c8b63781 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 6 Apr 2018 14:02:35 -0700 Subject: [PATCH 037/103] Docs: tell maintainers to put [GIT PULL] in their subject lines It seems that Linus looks for [GIT PULL] in subject lines to ensure that pull requests don't get buried in the noise during merge windows. Update the docs to reflect that. 
[jc: From an impromptu post from willy, thus no SOB] Signed-off-by: Jonathan Corbet --- Documentation/process/submitting-patches.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst index f7152ed565e5..908bb55be407 100644 --- a/Documentation/process/submitting-patches.rst +++ b/Documentation/process/submitting-patches.rst @@ -761,7 +761,7 @@ requests, especially from new, unknown developers. If in doubt you can use the pull request as the cover letter for a normal posting of the patch series, giving the maintainer the option of using either. -A pull request should have [GIT] or [PULL] in the subject line. The +A pull request should have [GIT PULL] in the subject line. The request itself should include the repository name and the branch of interest on a single line; it should look something like:: From e37274fa3a2b9b96926b471bf681554ac56d5749 Mon Sep 17 00:00:00 2001 From: Masanari Iida Date: Tue, 28 Nov 2017 12:26:13 +0900 Subject: [PATCH 038/103] linux-next: ftrace/docs: Fix spelling typos in ftrace-users.rst This patch corrects some spelling typo in ftrace-users.rst Signed-off-by: Masanari Iida Acked-by: Steven Rostedt (VMware) Signed-off-by: Jonathan Corbet --- Documentation/trace/ftrace-uses.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/trace/ftrace-uses.rst b/Documentation/trace/ftrace-uses.rst index 998a60a93015..00283b6dd101 100644 --- a/Documentation/trace/ftrace-uses.rst +++ b/Documentation/trace/ftrace-uses.rst @@ -12,7 +12,7 @@ Written for: 4.14 Introduction ============ -The ftrace infrastructure was originially created to attach callbacks to the +The ftrace infrastructure was originally created to attach callbacks to the beginning of functions in order to record and trace the flow of the kernel. But callbacks to the start of a function can have other use cases. 
Either for live kernel patching, or for security monitoring. This document describes @@ -30,7 +30,7 @@ The ftrace context This requires extra care to what can be done inside a callback. A callback can be called outside the protective scope of RCU. -The ftrace infrastructure has some protections agains recursions and RCU +The ftrace infrastructure has some protections against recursions and RCU but one must still be very careful how they use the callbacks. From 9376ff9ba298c983062a12cbbafde506a4eaea71 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 25 Apr 2018 22:30:21 +0200 Subject: [PATCH 039/103] LICENSES/GPL2.0: Add GPL-2.0-only/or-later as valid identifiers Quite some files have been flagged with the new GPL-2.0-only and GPL-2.0-or-later identifiers which replace the original GPL-2.0 and GPL-2.0+ identifiers in the SPDX license identifier specification, but the identifiers are not mentioned as valid in the GPL-2.0 license file. Add them to the license file and to the Linux-syscall-note exception to make everything consistent again. 
Signed-off-by: Thomas Gleixner Reviewed-by: Greg Kroah-Hartman Reviewed-by: Mauro Carvalho Chehab Cc: Hans Verkuil Signed-off-by: Jonathan Corbet --- LICENSES/exceptions/Linux-syscall-note | 2 +- LICENSES/preferred/GPL-2.0 | 6 ++++++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/LICENSES/exceptions/Linux-syscall-note b/LICENSES/exceptions/Linux-syscall-note index 6b60b61be4e9..9abdad71fafd 100644 --- a/LICENSES/exceptions/Linux-syscall-note +++ b/LICENSES/exceptions/Linux-syscall-note @@ -1,6 +1,6 @@ SPDX-Exception-Identifier: Linux-syscall-note SPDX-URL: https://spdx.org/licenses/Linux-syscall-note.html -SPDX-Licenses: GPL-2.0, GPL-2.0+, GPL-1.0+, LGPL-2.0, LGPL-2.0+, LGPL-2.1, LGPL-2.1+ +SPDX-Licenses: GPL-2.0, GPL-2.0+, GPL-1.0+, LGPL-2.0, LGPL-2.0+, LGPL-2.1, LGPL-2.1+, GPL-2.0-only, GPL-2.0-or-later Usage-Guide: This exception is used together with one of the above SPDX-Licenses to mark user space API (uapi) header files so they can be included diff --git a/LICENSES/preferred/GPL-2.0 b/LICENSES/preferred/GPL-2.0 index b8db91d3a1cb..ff0812fd89cc 100644 --- a/LICENSES/preferred/GPL-2.0 +++ b/LICENSES/preferred/GPL-2.0 @@ -1,5 +1,7 @@ Valid-License-Identifier: GPL-2.0 +Valid-License-Identifier: GPL-2.0-only Valid-License-Identifier: GPL-2.0+ +Valid-License-Identifier: GPL-2.0-or-later SPDX-URL: https://spdx.org/licenses/GPL-2.0.html Usage-Guide: To use this license in source code, put one of the following SPDX @@ -7,8 +9,12 @@ Usage-Guide: guidelines in the licensing rules documentation. 
For 'GNU General Public License (GPL) version 2 only' use: SPDX-License-Identifier: GPL-2.0 + or + SPDX-License-Identifier: GPL-2.0-only For 'GNU General Public License (GPL) version 2 or any later version' use: SPDX-License-Identifier: GPL-2.0+ + or + SPDX-License-Identifier: GPL-2.0-or-later License-Text: GNU GENERAL PUBLIC LICENSE From 01cf721b16c83d2645348c2c01473a8dec6e0cb4 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 25 Apr 2018 22:30:22 +0200 Subject: [PATCH 040/103] LICENSES: Add X11 license Add the full text of the X11 to the kernel tree. It was copied directly from: https://spdx.org/licenses/X11.html#licenseText Signed-off-by: Thomas Gleixner Reviewed-by: Greg Kroah-Hartman Signed-off-by: Jonathan Corbet --- LICENSES/other/X11 | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) create mode 100644 LICENSES/other/X11 diff --git a/LICENSES/other/X11 b/LICENSES/other/X11 new file mode 100644 index 000000000000..fe4353fd0000 --- /dev/null +++ b/LICENSES/other/X11 @@ -0,0 +1,37 @@ +Valid-License-Identifier: X11 +SPDX-URL: https://spdx.org/licenses/X11.html +Usage-Guide: + To use the X11 put the following SPDX tag/value pair into a comment + according to the placement guidelines in the licensing rules + documentation: + SPDX-License-Identifier: X11 +License-Text: + + +X11 License + +Copyright (C) 1996 X Consortium + +Permission is hereby granted, free of charge, to any person obtaining a +copy of this software and associated documentation files (the "Software"), +to deal in the Software without restriction, including without limitation +the rights to use, copy, modify, merge, publish, distribute, sublicense, +and/or sell copies of the Software, and to permit persons to whom the +Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +X CONSORTIUM BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER +IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. + +Except as contained in this notice, the name of the X Consortium shall not +be used in advertising or otherwise to promote the sale, use or other +dealings in this Software without prior written authorization from the X +Consortium. + +X Window System is a trademark of X Consortium, Inc. From 3e2c812be1c83d6f26a5695208ab5badecfd4af7 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 25 Apr 2018 22:30:23 +0200 Subject: [PATCH 041/103] LICENSES: Add Apache 2.0 license text Add the full text of the Apache License version 2 to the kernel tree. 
It was copied directly from: https://spdx.org/licenses/Apache-2.0.html#licenseText Signed-off-by: Thomas Gleixner Reviewed-by: Greg Kroah-Hartman Reviewed-by: Kate Stewart Signed-off-by: Jonathan Corbet --- LICENSES/other/Apache-2.0 | 183 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 183 insertions(+) create mode 100644 LICENSES/other/Apache-2.0 diff --git a/LICENSES/other/Apache-2.0 b/LICENSES/other/Apache-2.0 new file mode 100644 index 000000000000..7cd903f573e5 --- /dev/null +++ b/LICENSES/other/Apache-2.0 @@ -0,0 +1,183 @@ +Valid-License-Identifier: Apache-2.0 +SPDX-URL: https://spdx.org/licenses/Apache-2.0.html +Usage-Guide: + To use the Apache License version 2.0 put the following SPDX tag/value + pair into a comment according to the placement guidelines in the + licensing rules documentation: + SPDX-License-Identifier: Apache-2.0 +License-Text: + +Apache License + +Version 2.0, January 2004 + +http://www.apache.org/licenses/ + +TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + +1. Definitions. + +"License" shall mean the terms and conditions for use, reproduction, and +distribution as defined by Sections 1 through 9 of this document. + +"Licensor" shall mean the copyright owner or entity authorized by the +copyright owner that is granting the License. + +"Legal Entity" shall mean the union of the acting entity and all other +entities that control, are controlled by, or are under common control with +that entity. For the purposes of this definition, "control" means (i) the +power, direct or indirect, to cause the direction or management of such +entity, whether by contract or otherwise, or (ii) ownership of fifty +percent (50%) or more of the outstanding shares, or (iii) beneficial +ownership of such entity. + +"You" (or "Your") shall mean an individual or Legal Entity exercising +permissions granted by this License. 
+ +"Source" form shall mean the preferred form for making modifications, +including but not limited to software source code, documentation source, +and configuration files. + +"Object" form shall mean any form resulting from mechanical transformation +or translation of a Source form, including but not limited to compiled +object code, generated documentation, and conversions to other media types. + +"Work" shall mean the work of authorship, whether in Source or Object form, +made available under the License, as indicated by a copyright notice that +is included in or attached to the work (an example is provided in the +Appendix below). + +"Derivative Works" shall mean any work, whether in Source or Object form, +that is based on (or derived from) the Work and for which the editorial +revisions, annotations, elaborations, or other modifications represent, as +a whole, an original work of authorship. For the purposes of this License, +Derivative Works shall not include works that remain separable from, or +merely link (or bind by name) to the interfaces of, the Work and Derivative +Works thereof. + +"Contribution" shall mean any work of authorship, including the original +version of the Work and any modifications or additions to that Work or +Derivative Works thereof, that is intentionally submitted to Licensor for +inclusion in the Work by the copyright owner or by an individual or Legal +Entity authorized to submit on behalf of the copyright owner. 
For the +purposes of this definition, "submitted" means any form of electronic, +verbal, or written communication sent to the Licensor or its +representatives, including but not limited to communication on electronic +mailing lists, source code control systems, and issue tracking systems that +are managed by, or on behalf of, the Licensor for the purpose of discussing +and improving the Work, but excluding communication that is conspicuously +marked or otherwise designated in writing by the copyright owner as "Not a +Contribution." + +"Contributor" shall mean Licensor and any individual or Legal Entity on +behalf of whom a Contribution has been received by Licensor and +subsequently incorporated within the Work. + +2. Grant of Copyright License. Subject to the terms and conditions of this + License, each Contributor hereby grants to You a perpetual, worldwide, + non-exclusive, no-charge, royalty-free, irrevocable copyright license to + reproduce, prepare Derivative Works of, publicly display, publicly + perform, sublicense, and distribute the Work and such Derivative Works + in Source or Object form. + +3. Grant of Patent License. Subject to the terms and conditions of this + License, each Contributor hereby grants to You a perpetual, worldwide, + non-exclusive, no-charge, royalty-free, irrevocable (except as stated in + this section) patent license to make, have made, use, offer to sell, + sell, import, and otherwise transfer the Work, where such license + applies only to those patent claims licensable by such Contributor that + are necessarily infringed by their Contribution(s) alone or by + combination of their Contribution(s) with the Work to which such + Contribution(s) was submitted. 
If You institute patent litigation + against any entity (including a cross-claim or counterclaim in a + lawsuit) alleging that the Work or a Contribution incorporated within + the Work constitutes direct or contributory patent infringement, then + any patent licenses granted to You under this License for that Work + shall terminate as of the date such litigation is filed. + +4. Redistribution. You may reproduce and distribute copies of the Work or + Derivative Works thereof in any medium, with or without modifications, + and in Source or Object form, provided that You meet the following + conditions: + + a. You must give any other recipients of the Work or Derivative Works a + copy of this License; and + + b. You must cause any modified files to carry prominent notices stating + that You changed the files; and + + c. You must retain, in the Source form of any Derivative Works that You + distribute, all copyright, patent, trademark, and attribution notices + from the Source form of the Work, excluding those notices that do not + pertain to any part of the Derivative Works; and + + d. If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained within + such NOTICE file, excluding those notices that do not pertain to any + part of the Derivative Works, in at least one of the following + places: within a NOTICE text file distributed as part of the + Derivative Works; within the Source form or documentation, if + provided along with the Derivative Works; or, within a display + generated by the Derivative Works, if and wherever such third-party + notices normally appear. The contents of the NOTICE file are for + informational purposes only and do not modify the License. 
You may + add Your own attribution notices within Derivative Works that You + distribute, alongside or as an addendum to the NOTICE text from the + Work, provided that such additional attribution notices cannot be + construed as modifying the License. + + You may add Your own copyright statement to Your modifications and may + provide additional or different license terms and conditions for use, + reproduction, or distribution of Your modifications, or for any such + Derivative Works as a whole, provided Your use, reproduction, and + distribution of the Work otherwise complies with the conditions stated + in this License. + +5. Submission of Contributions. Unless You explicitly state otherwise, any + Contribution intentionally submitted for inclusion in the Work by You to + the Licensor shall be under the terms and conditions of this License, + without any additional terms or conditions. Notwithstanding the above, + nothing herein shall supersede or modify the terms of any separate + license agreement you may have executed with Licensor regarding such + Contributions. + +6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + +7. Disclaimer of Warranty. Unless required by applicable law or agreed to + in writing, Licensor provides the Work (and each Contributor provides + its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS + OF ANY KIND, either express or implied, including, without limitation, + any warranties or conditions of TITLE, NON-INFRINGEMENT, + MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely + responsible for determining the appropriateness of using or + redistributing the Work and assume any risks associated with Your + exercise of permissions under this License. + +8. 
Limitation of Liability. In no event and under no legal theory, whether + in tort (including negligence), contract, or otherwise, unless required + by applicable law (such as deliberate and grossly negligent acts) or + agreed to in writing, shall any Contributor be liable to You for + damages, including any direct, indirect, special, incidental, or + consequential damages of any character arising as a result of this + License or out of the use or inability to use the Work (including but + not limited to damages for loss of goodwill, work stoppage, computer + failure or malfunction, or any and all other commercial damages or + losses), even if such Contributor has been advised of the possibility of + such damages. + +9. Accepting Warranty or Additional Liability. While redistributing the + Work or Derivative Works thereof, You may choose to offer, and charge a + fee for, acceptance of support, warranty, indemnity, or other liability + obligations and/or rights consistent with this License. However, in + accepting such obligations, You may act only on Your own behalf and on + Your sole responsibility, not on behalf of any other Contributor, and + only if You agree to indemnify, defend, and hold each Contributor + harmless for any liability incurred by, or claims asserted against, such + Contributor by reason of your accepting any such warranty or additional + liability. + +END OF TERMS AND CONDITIONS From f1137e96f8c5ce9b48649b8a344da451145a09a6 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 25 Apr 2018 22:30:24 +0200 Subject: [PATCH 042/103] LICENSES: Add CDDL-1.0 license text Add the full text of the CDDL-1.0 to the kernel tree. 
It was copied directly from: https://spdx.org/licenses/CDDL-1.0.html#licenseText Signed-off-by: Thomas Gleixner Reviewed-by: Greg Kroah-Hartman Reviewed-by: Kate Stewart Signed-off-by: Jonathan Corbet --- LICENSES/other/CDDL-1.0 | 364 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 364 insertions(+) create mode 100644 LICENSES/other/CDDL-1.0 diff --git a/LICENSES/other/CDDL-1.0 b/LICENSES/other/CDDL-1.0 new file mode 100644 index 000000000000..195a1687930a --- /dev/null +++ b/LICENSES/other/CDDL-1.0 @@ -0,0 +1,364 @@ +Valid-License-Identifier: CDDL-1.0 +SPDX-URL: https://spdx.org/licenses/CDDL-1.0.html +Usage-Guide: + To use the Common Development and Distribution License 1.0 put the + following SPDX tag/value pair into a comment according to the placement + guidelines in the licensing rules documentation: + SPDX-License-Identifier: CDDL-1.0 + +License-Text: + +COMMON DEVELOPMENT AND DISTRIBUTION LICENSE (CDDL) +Version 1.0 + + 1. Definitions. + + 1.1. "Contributor" means each individual or entity that creates or + contributes to the creation of Modifications. + + 1.2. "Contributor Version" means the combination of the Original + Software, prior Modifications used by a Contributor (if any), + and the Modifications made by that particular Contributor. + + 1.3. "Covered Software" means (a) the Original Software, or (b) + Modifications, or (c) the combination of files containing + Original Software with files containing Modifications, in each + case including portions thereof. + + 1.4. "Executable" means the Covered Software in any form other than + Source Code. + + 1.5. "Initial Developer" means the individual or entity that first + makes Original Software available under this License. + + 1.6. "Larger Work" means a work which combines Covered Software or + portions thereof with code not governed by the terms of this + License. + + 1.7. "License" means this document. + + 1.8. 
"Licensable" means having the right to grant, to the maximum + extent possible, whether at the time of the initial grant or + subsequently acquired, any and all of the rights conveyed herein. + + 1.9. "Modifications" means the Source Code and Executable form of + any of the following: + + A. Any file that results from an addition to, deletion from or + modification of the contents of a file containing Original + Software or previous Modifications; + + B. Any new file that contains any part of the Original Software + or previous Modification; or + + C. Any new file that is contributed or otherwise made available + under the terms of this License. + + 1.10. "Original Software" means the Source Code and Executable form + of computer software code that is originally released under + this License. + + 1.11. "Patent Claims" means any patent claim(s), now owned or + hereafter acquired, including without limitation, method, + process, and apparatus claims, in any patent Licensable by + grantor. + + 1.12. "Source Code" means (a) the common form of computer software + code in which modifications are made and (b) associated + documentation included in or with such code. + + 1.13. "You" (or "Your") means an individual or a legal entity + exercising rights under, and complying with all of the terms + of, this License. For legal entities, "You" includes any + entity which controls, is controlled by, or is under common + control with You. For purposes of this definition, "control" + means (a) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract + or otherwise, or (b) ownership of more than fifty percent + (50%) of the outstanding shares or beneficial ownership of + such entity. + + 2. License Grants. + 2.1. The Initial Developer Grant. 
+ + Conditioned upon Your compliance with Section 3.1 below and subject + to third party intellectual property claims, the Initial Developer + hereby grants You a world-wide, royalty-free, non-exclusive + license: + + (a) under intellectual property rights (other than patent or + trademark) Licensable by Initial Developer, to use, + reproduce, modify, display, perform, sublicense and + distribute the Original Software (or portions thereof), + with or without Modifications, and/or as part of a Larger + Work; and + + (b) under Patent Claims infringed by the making, using or + selling of Original Software, to make, have made, use, + practice, sell, and offer for sale, and/or otherwise + dispose of the Original Software (or portions thereof). + + (c) The licenses granted in Sections 2.1(a) and (b) are + effective on the date Initial Developer first distributes + or otherwise makes the Original Software available to a + third party under the terms of this License. + + (d) Notwithstanding Section 2.1(b) above, no patent license is + granted: (1) for code that You delete from the Original + Software, or (2) for infringements caused by: (i) the + modification of the Original Software, or (ii) the + combination of the Original Software with other software or + devices. + + 2.2. Contributor Grant. 
+ + Conditioned upon Your compliance with Section 3.1 below and subject + to third party intellectual property claims, each Contributor + hereby grants You a world-wide, royalty-free, non-exclusive + license: + + (a) under intellectual property rights (other than patent or + trademark) Licensable by Contributor to use, reproduce, + modify, display, perform, sublicense and distribute the + Modifications created by such Contributor (or portions + thereof), either on an unmodified basis, with other + Modifications, as Covered Software and/or as part of a + Larger Work; and + + (b) under Patent Claims infringed by the making, using, or + selling of Modifications made by that Contributor either + alone and/or in combination with its Contributor Version + (or portions of such combination), to make, use, sell, + offer for sale, have made, and/or otherwise dispose of: (1) + Modifications made by that Contributor (or portions + thereof); and (2) the combination of Modifications made by + that Contributor with its Contributor Version (or portions + of such combination). + + (c) The licenses granted in Sections 2.2(a) and 2.2(b) are + effective on the date Contributor first distributes or + otherwise makes the Modifications available to a third + party. + + (d) Notwithstanding Section 2.2(b) above, no patent license is + granted: (1) for any code that Contributor has deleted from + the Contributor Version; (2) for infringements caused by: + (i) third party modifications of Contributor Version, or + (ii) the combination of Modifications made by that + Contributor with other software (except as part of the + Contributor Version) or other devices; or (3) under Patent + Claims infringed by Covered Software in the absence of + Modifications made by that Contributor. + + 3. Distribution Obligations. + 3.1. Availability of Source Code. 
+ + Any Covered Software that You distribute or otherwise make + available in Executable form must also be made available in Source + Code form and that Source Code form must be distributed only under + the terms of this License. You must include a copy of this License + with every copy of the Source Code form of the Covered Software You + distribute or otherwise make available. You must inform recipients + of any such Covered Software in Executable form as to how they can + obtain such Covered Software in Source Code form in a reasonable + manner on or through a medium customarily used for software + exchange. + + 3.2. Modifications. + + The Modifications that You create or to which You contribute are + governed by the terms of this License. You represent that You + believe Your Modifications are Your original creation(s) and/or You + have sufficient rights to grant the rights conveyed by this + License. + + 3.3. Required Notices. + + You must include a notice in each of Your Modifications that + identifies You as the Contributor of the Modification. You may not + remove or alter any copyright, patent or trademark notices + contained within the Covered Software, or any notices of licensing + or any descriptive text giving attribution to any Contributor or + the Initial Developer. + + 3.4. Application of Additional Terms. + + You may not offer or impose any terms on any Covered Software in + Source Code form that alters or restricts the applicable version of + this License or the recipients' rights hereunder. You may choose to + offer, and to charge a fee for, warranty, support, indemnity or + liability obligations to one or more recipients of Covered + Software. However, you may do so only on Your own behalf, and not + on behalf of the Initial Developer or any Contributor. 
You must + make it absolutely clear that any such warranty, support, indemnity + or liability obligation is offered by You alone, and You hereby + agree to indemnify the Initial Developer and every Contributor for + any liability incurred by the Initial Developer or such Contributor + as a result of warranty, support, indemnity or liability terms You + offer. + + 3.5. Distribution of Executable Versions. + + You may distribute the Executable form of the Covered Software + under the terms of this License or under the terms of a license of + Your choice, which may contain terms different from this License, + provided that You are in compliance with the terms of this License + and that the license for the Executable form does not attempt to + limit or alter the recipient's rights in the Source Code form from + the rights set forth in this License. If You distribute the Covered + Software in Executable form under a different license, You must + make it absolutely clear that any terms which differ from this + License are offered by You alone, not by the Initial Developer or + Contributor. You hereby agree to indemnify the Initial Developer + and every Contributor for any liability incurred by the Initial + Developer or such Contributor as a result of any such terms You + offer. + + 3.6. Larger Works. + + You may create a Larger Work by combining Covered Software with + other code not governed by the terms of this License and distribute + the Larger Work as a single product. In such a case, You must make + sure the requirements of this License are fulfilled for the Covered + Software. + + 4. Versions of the License. + 4.1. New Versions. + + Sun Microsystems, Inc. is the initial license steward and may + publish revised and/or new versions of this License from time to + time. Each version will be given a distinguishing version + number. Except as provided in Section 4.3, no one other than the + license steward has the right to modify this License. + + 4.2. 
Effect of New Versions. + + You may always continue to use, distribute or otherwise make the + Covered Software available under the terms of the version of the + License under which You originally received the Covered + Software. If the Initial Developer includes a notice in the + Original Software prohibiting it from being distributed or + otherwise made available under any subsequent version of the + License, You must distribute and make the Covered Software + available under the terms of the version of the License under which + You originally received the Covered Software. Otherwise, You may + also choose to use, distribute or otherwise make the Covered + Software available under the terms of any subsequent version of the + License published by the license steward. + + 4.3. Modified Versions. + + When You are an Initial Developer and You want to create a new + license for Your Original Software, You may create and use a + modified version of this License if You: (a) rename the license and + remove any references to the name of the license steward (except to + note that the license differs from this License); and (b) otherwise + make it clear that the license contains terms which differ from + this License. + + 5. DISCLAIMER OF WARRANTY. + + COVERED SOFTWARE IS PROVIDED UNDER THIS LICENSE ON AN "AS IS" BASIS, + WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, + WITHOUT LIMITATION, WARRANTIES THAT THE COVERED SOFTWARE IS FREE OF + DEFECTS, MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE OR + NON-INFRINGING. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF + THE COVERED SOFTWARE IS WITH YOU. SHOULD ANY COVERED SOFTWARE PROVE + DEFECTIVE IN ANY RESPECT, YOU (NOT THE INITIAL DEVELOPER OR ANY OTHER + CONTRIBUTOR) ASSUME THE COST OF ANY NECESSARY SERVICING, REPAIR OR + CORRECTION. THIS DISCLAIMER OF WARRANTY CONSTITUTES AN ESSENTIAL PART + OF THIS LICENSE. NO USE OF ANY COVERED SOFTWARE IS AUTHORIZED HEREUNDER + EXCEPT UNDER THIS DISCLAIMER. + + 6. 
TERMINATION. + + 6.1. This License and the rights granted hereunder will terminate + automatically if You fail to comply with terms herein and fail to + cure such breach within 30 days of becoming aware of the + breach. Provisions which, by their nature, must remain in effect + beyond the termination of this License shall survive. + + 6.2. If You assert a patent infringement claim (excluding + declaratory judgment actions) against Initial Developer or a + Contributor (the Initial Developer or Contributor against whom You + assert such claim is referred to as "Participant") alleging that + the Participant Software (meaning the Contributor Version where the + Participant is a Contributor or the Original Software where the + Participant is the Initial Developer) directly or indirectly + infringes any patent, then any and all rights granted directly or + indirectly to You by such Participant, the Initial Developer (if + the Initial Developer is not the Participant) and all Contributors + under Sections 2.1 and/or 2.2 of this License shall, upon 60 days + notice from Participant terminate prospectively and automatically + at the expiration of such 60 day notice period, unless if within + such 60 day period You withdraw Your claim with respect to the + Participant Software against such Participant either unilaterally + or pursuant to a written agreement with Participant. + + 6.3. In the event of termination under Sections 6.1 or 6.2 above, + all end user licenses that have been validly granted by You or any + distributor hereunder prior to termination (excluding licenses + granted to You by any distributor) shall survive termination. + + 7. LIMITATION OF LIABILITY. 
+ + UNDER NO CIRCUMSTANCES AND UNDER NO LEGAL THEORY, WHETHER TORT + (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE, SHALL YOU, THE INITIAL + DEVELOPER, ANY OTHER CONTRIBUTOR, OR ANY DISTRIBUTOR OF COVERED + SOFTWARE, OR ANY SUPPLIER OF ANY OF SUCH PARTIES, BE LIABLE TO ANY + PERSON FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES + OF ANY CHARACTER INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOST + PROFITS, LOSS OF GOODWILL, WORK STOPPAGE, COMPUTER FAILURE OR + MALFUNCTION, OR ANY AND ALL OTHER COMMERCIAL DAMAGES OR LOSSES, EVEN IF + SUCH PARTY SHALL HAVE BEEN INFORMED OF THE POSSIBILITY OF SUCH + DAMAGES. THIS LIMITATION OF LIABILITY SHALL NOT APPLY TO LIABILITY FOR + DEATH OR PERSONAL INJURY RESULTING FROM SUCH PARTY'S NEGLIGENCE TO THE + EXTENT APPLICABLE LAW PROHIBITS SUCH LIMITATION. SOME JURISDICTIONS DO + NOT ALLOW THE EXCLUSION OR LIMITATION OF INCIDENTAL OR CONSEQUENTIAL + DAMAGES, SO THIS EXCLUSION AND LIMITATION MAY NOT APPLY TO YOU. + + 8. U.S. GOVERNMENT END USERS. + + The Covered Software is a "commercial item," as that term is defined in + 48 C.F.R. 2.101 (Oct. 1995), consisting of "commercial computer + software" (as that term is defined at 48 C.F.R. $ 252.227-7014(a)(1)) + and "commercial computer software documentation" as such terms are used + in 48 C.F.R. 12.212 (Sept. 1995). Consistent with 48 C.F.R. 12.212 and + 48 C.F.R. 227.7202-1 through 227.7202-4 (June 1995), all + U.S. Government End Users acquire Covered Software with only those + rights set forth herein. This U.S. Government Rights clause is in lieu + of, and supersedes, any other FAR, DFAR, or other clause or provision + that addresses Government rights in computer software under this + License. + + 9. MISCELLANEOUS. + + This License represents the complete agreement concerning subject + matter hereof. If any provision of this License is held to be + unenforceable, such provision shall be reformed only to the extent + necessary to make it enforceable. 
This License shall be governed by the + law of the jurisdiction specified in a notice contained within the + Original Software (except to the extent applicable law, if any, + provides otherwise), excluding such jurisdiction's conflict-of-law + provisions. Any litigation relating to this License shall be subject to + the jurisdiction of the courts located in the jurisdiction and venue + specified in a notice contained within the Original Software, with the + losing party responsible for costs, including, without limitation, + court costs and reasonable attorneys' fees and expenses. The + application of the United Nations Convention on Contracts for the + International Sale of Goods is expressly excluded. Any law or + regulation which provides that the language of a contract shall be + construed against the drafter shall not apply to this License. You + agree that You alone are responsible for compliance with the United + States export administration regulations (and the export control laws + and regulation of any other countries) when You use, distribute or + otherwise make available any Covered Software. + + 10. RESPONSIBILITY FOR CLAIMS. + + As between Initial Developer and the Contributors, each party is + responsible for claims and damages arising, directly or indirectly, out + of its utilization of rights under this License and You agree to work + with Initial Developer and Contributors to distribute such + responsibility on an equitable basis. Nothing herein is intended or + shall be deemed to constitute any admission of liability. From b9bf4e4ea797bb002ce11503ad2c810fe4ab93ac Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 25 Apr 2018 22:30:25 +0200 Subject: [PATCH 043/103] LICENSES: Add CC-BY-SA-4.0 license text Add the full text of the CC-BY-SA-4.0 license to the kernel tree. 
It was copied directly from: https://spdx.org/licenses/CC-BY-SA-4.0.html#licenseText Signed-off-by: Thomas Gleixner Reviewed-by: Greg Kroah-Hartman Reviewed-by: Kate Stewart Signed-off-by: Jonathan Corbet --- LICENSES/other/CC-BY-SA-4.0 | 397 ++++++++++++++++++++++++++++++++++++ 1 file changed, 397 insertions(+) create mode 100644 LICENSES/other/CC-BY-SA-4.0 diff --git a/LICENSES/other/CC-BY-SA-4.0 b/LICENSES/other/CC-BY-SA-4.0 new file mode 100644 index 000000000000..f9158e831e79 --- /dev/null +++ b/LICENSES/other/CC-BY-SA-4.0 @@ -0,0 +1,397 @@ +Valid-License-Identifier: CC-BY-SA-4.0 +SPDX-URL: https://spdx.org/licenses/CC-BY-SA-4.0 +Usage-Guide: + To use the Creative Commons Attribution Share Alike 4.0 International + license put the following SPDX tag/value pair into a comment according to + the placement guidelines in the licensing rules documentation: + SPDX-License-Identifier: CC-BY-SA-4.0 +License-Text: + +Creative Commons Attribution-ShareAlike 4.0 International + +Creative Commons Corporation ("Creative Commons") is not a law firm and +does not provide legal services or legal advice. Distribution of Creative +Commons public licenses does not create a lawyer-client or other +relationship. Creative Commons makes its licenses and related information +available on an "as-is" basis. Creative Commons gives no warranties +regarding its licenses, any material licensed under their terms and +conditions, or any related information. Creative Commons disclaims all +liability for damages resulting from their use to the fullest extent +possible. + +Using Creative Commons Public Licenses + +Creative Commons public licenses provide a standard set of terms and +conditions that creators and other rights holders may use to share original +works of authorship and other material subject to copyright and certain +other rights specified in the public license below. 
The following +considerations are for informational purposes only, are not exhaustive, and +do not form part of our licenses. + +Considerations for licensors: Our public licenses are intended for use by +those authorized to give the public permission to use material in ways +otherwise restricted by copyright and certain other rights. Our licenses +are irrevocable. Licensors should read and understand the terms and +conditions of the license they choose before applying it. Licensors should +also secure all rights necessary before applying our licenses so that the +public can reuse the material as expected. Licensors should clearly mark +any material not subject to the license. This includes other CC-licensed +material, or material used under an exception or limitation to +copyright. More considerations for licensors : +wiki.creativecommons.org/Considerations_for_licensors + +Considerations for the public: By using one of our public licenses, a +licensor grants the public permission to use the licensed material under +specified terms and conditions. If the licensor's permission is not +necessary for any reason - for example, because of any applicable exception +or limitation to copyright - then that use is not regulated by the +license. Our licenses grant only permissions under copyright and certain +other rights that a licensor has authority to grant. Use of the licensed +material may still be restricted for other reasons, including because +others have copyright or other rights in the material. A licensor may make +special requests, such as asking that all changes be marked or described. + +Although not required by our licenses, you are encouraged to respect those +requests where reasonable. 
More considerations for the public : +wiki.creativecommons.org/Considerations_for_licensees + +Creative Commons Attribution-ShareAlike 4.0 International Public License + +By exercising the Licensed Rights (defined below), You accept and agree to +be bound by the terms and conditions of this Creative Commons +Attribution-ShareAlike 4.0 International Public License ("Public +License"). To the extent this Public License may be interpreted as a +contract, You are granted the Licensed Rights in consideration of Your +acceptance of these terms and conditions, and the Licensor grants You such +rights in consideration of benefits the Licensor receives from making the +Licensed Material available under these terms and conditions. + +Section 1 - Definitions. + + a. Adapted Material means material subject to Copyright and Similar + Rights that is derived from or based upon the Licensed Material and + in which the Licensed Material is translated, altered, arranged, + transformed, or otherwise modified in a manner requiring permission + under the Copyright and Similar Rights held by the Licensor. For + purposes of this Public License, where the Licensed Material is a + musical work, performance, or sound recording, Adapted Material is + always produced where the Licensed Material is synched in timed + relation with a moving image. + + b. Adapter's License means the license You apply to Your Copyright and + Similar Rights in Your contributions to Adapted Material in + accordance with the terms and conditions of this Public License. + + c. BY-SA Compatible License means a license listed at + creativecommons.org/compatiblelicenses, approved by Creative Commons + as essentially the equivalent of this Public License. + + d. 
Copyright and Similar Rights means copyright and/or similar rights + closely related to copyright including, without limitation, + performance, broadcast, sound recording, and Sui Generis Database + Rights, without regard to how the rights are labeled or + categorized. For purposes of this Public License, the rights + specified in Section 2(b)(1)-(2) are not Copyright and Similar + Rights. + + e. Effective Technological Measures means those measures that, in the + absence of proper authority, may not be circumvented under laws + fulfilling obligations under Article 11 of the WIPO Copyright Treaty + adopted on December 20, 1996, and/or similar international + agreements. + + f. Exceptions and Limitations means fair use, fair dealing, and/or any + other exception or limitation to Copyright and Similar Rights that + applies to Your use of the Licensed Material. + + g. License Elements means the license attributes listed in the name of + a Creative Commons Public License. The License Elements of this + Public License are Attribution and ShareAlike. + + h. Licensed Material means the artistic or literary work, database, or + other material to which the Licensor applied this Public License. + + i. Licensed Rights means the rights granted to You subject to the terms + and conditions of this Public License, which are limited to all + Copyright and Similar Rights that apply to Your use of the Licensed + Material and that the Licensor has authority to license. + + j. Licensor means the individual(s) or entity(ies) granting rights + under this Public License. + + k. 
Share means to provide material to the public by any means or + process that requires permission under the Licensed Rights, such as + reproduction, public display, public performance, distribution, + dissemination, communication, or importation, and to make material + available to the public including in ways that members of the public + may access the material from a place and at a time individually + chosen by them. + + l. Sui Generis Database Rights means rights other than copyright + resulting from Directive 96/9/EC of the European Parliament and of + the Council of 11 March 1996 on the legal protection of databases, + as amended and/or succeeded, as well as other essentially equivalent + rights anywhere in the world. m. You means the individual or entity + exercising the Licensed Rights under this Public License. Your has a + corresponding meaning. + +Section 2 - Scope. + + a. License grant. + + 1. Subject to the terms and conditions of this Public License, the + Licensor hereby grants You a worldwide, royalty-free, + non-sublicensable, non-exclusive, irrevocable license to + exercise the Licensed Rights in the Licensed Material to: + + A. reproduce and Share the Licensed Material, in whole or in part; and + + B. produce, reproduce, and Share Adapted Material. + + 2. Exceptions and Limitations. For the avoidance of doubt, where + Exceptions and Limitations apply to Your use, this Public + License does not apply, and You do not need to comply with its + terms and conditions. + + 3. Term. The term of this Public License is specified in Section 6(a). + + 4. Media and formats; technical modifications allowed. The Licensor + authorizes You to exercise the Licensed Rights in all media and + formats whether now known or hereafter created, and to make + technical modifications necessary to do so. 
The Licensor waives + and/or agrees not to assert any right or authority to forbid You + from making technical modifications necessary to exercise the + Licensed Rights, including technical modifications necessary to + circumvent Effective Technological Measures. For purposes of + this Public License, simply making modifications authorized by + this Section 2(a)(4) never produces Adapted Material. + + 5. Downstream recipients. + + A. Offer from the Licensor - Licensed Material. Every recipient + of the Licensed Material automatically receives an offer + from the Licensor to exercise the Licensed Rights under the + terms and conditions of this Public License. + + B. Additional offer from the Licensor - Adapted Material. Every + recipient of Adapted Material from You automatically + receives an offer from the Licensor to exercise the Licensed + Rights in the Adapted Material under the conditions of the + Adapter's License You apply. + + C. No downstream restrictions. You may not offer or impose any + additional or different terms or conditions on, or apply any + Effective Technological Measures to, the Licensed Material + if doing so restricts exercise of the Licensed Rights by any + recipient of the Licensed Material. + + 6. No endorsement. Nothing in this Public License constitutes or + may be construed as permission to assert or imply that You are, + or that Your use of the Licensed Material is, connected with, or + sponsored, endorsed, or granted official status by, the Licensor + or others designated to receive attribution as provided in + Section 3(a)(1)(A)(i). + + b. Other rights. + + 1. 
Moral rights, such as the right of integrity, are not licensed + under this Public License, nor are publicity, privacy, and/or + other similar personality rights; however, to the extent + possible, the Licensor waives and/or agrees not to assert any + such rights held by the Licensor to the limited extent necessary + to allow You to exercise the Licensed Rights, but not otherwise. + + 2. Patent and trademark rights are not licensed under this Public + License. + + 3. To the extent possible, the Licensor waives any right to collect + royalties from You for the exercise of the Licensed Rights, + whether directly or through a collecting society under any + voluntary or waivable statutory or compulsory licensing + scheme. In all other cases the Licensor expressly reserves any + right to collect such royalties. + +Section 3 - License Conditions. + +Your exercise of the Licensed Rights is expressly made subject to the +following conditions. + + a. Attribution. + + 1. If You Share the Licensed Material (including in modified form), + You must: + + A. retain the following if it is supplied by the Licensor with + the Licensed Material: + + i. identification of the creator(s) of the Licensed + Material and any others designated to receive + attribution, in any reasonable manner requested by the + Licensor (including by pseudonym if designated); + + ii. a copyright notice; + + iii. a notice that refers to this Public License; + + iv. a notice that refers to the disclaimer of warranties; + + v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable; + + B. indicate if You modified the Licensed Material and retain an + indication of any previous modifications; and + + C. indicate the Licensed Material is licensed under this Public + License, and include the text of, or the URI or hyperlink to, + this Public License. + + 2. 
You may satisfy the conditions in Section 3(a)(1) in any + reasonable manner based on the medium, means, and context in + which You Share the Licensed Material. For example, it may be + reasonable to satisfy the conditions by providing a URI or + hyperlink to a resource that includes the required information. + + 3. If requested by the Licensor, You must remove any of the + information required by Section 3(a)(1)(A) to the extent + reasonably practicable. b. ShareAlike.In addition to the + conditions in Section 3(a), if You Share Adapted Material You + produce, the following conditions also apply. + + 1. The Adapter's License You apply must be a Creative Commons + license with the same License Elements, this version or + later, or a BY-SA Compatible License. + + 2. You must include the text of, or the URI or hyperlink to, the + Adapter's License You apply. You may satisfy this condition + in any reasonable manner based on the medium, means, and + context in which You Share Adapted Material. + + 3. You may not offer or impose any additional or different terms + or conditions on, or apply any Effective Technological + Measures to, Adapted Material that restrict exercise of the + rights granted under the Adapter's License You apply. + +Section 4 - Sui Generis Database Rights. + +Where the Licensed Rights include Sui Generis Database Rights that apply to +Your use of the Licensed Material: + + a. for the avoidance of doubt, Section 2(a)(1) grants You the right to + extract, reuse, reproduce, and Share all or a substantial portion of + the contents of the database; + + b. if You include all or a substantial portion of the database contents + in a database in which You have Sui Generis Database Rights, then + the database in which You have Sui Generis Database Rights (but not + its individual contents) is Adapted Material, including for purposes + of Section 3(b); and + + c. 
You must comply with the conditions in Section 3(a) if You Share all + or a substantial portion of the contents of the database. + + For the avoidance of doubt, this Section 4 supplements and does not + replace Your obligations under this Public License where the Licensed + Rights include other Copyright and Similar Rights. + +Section 5 - Disclaimer of Warranties and Limitation of Liability. + + a. Unless otherwise separately undertaken by the Licensor, to the + extent possible, the Licensor offers the Licensed Material as-is and + as-available, and makes no representations or warranties of any kind + concerning the Licensed Material, whether express, implied, + statutory, or other. This includes, without limitation, warranties + of title, merchantability, fitness for a particular purpose, + non-infringement, absence of latent or other defects, accuracy, or + the presence or absence of errors, whether or not known or + discoverable. Where disclaimers of warranties are not allowed in + full or in part, this disclaimer may not apply to You. + + b. To the extent possible, in no event will the Licensor be liable to + You on any legal theory (including, without limitation, negligence) + or otherwise for any direct, special, indirect, incidental, + consequential, punitive, exemplary, or other losses, costs, + expenses, or damages arising out of this Public License or use of + the Licensed Material, even if the Licensor has been advised of the + possibility of such losses, costs, expenses, or damages. Where a + limitation of liability is not allowed in full or in part, this + limitation may not apply to You. + + c. The disclaimer of warranties and limitation of liability provided + above shall be interpreted in a manner that, to the extent possible, + most closely approximates an absolute disclaimer and waiver of all + liability. + +Section 6 - Term and Termination. + + a. This Public License applies for the term of the Copyright and + Similar Rights licensed here. 
However, if You fail to comply with + this Public License, then Your rights under this Public License + terminate automatically. + + b. Where Your right to use the Licensed Material has terminated under + Section 6(a), it reinstates: + + 1. automatically as of the date the violation is cured, provided it + is cured within 30 days of Your discovery of the violation; or + + 2. upon express reinstatement by the Licensor. + + c. For the avoidance of doubt, this Section 6(b) does not affect any + right the Licensor may have to seek remedies for Your violations of + this Public License. + + d. For the avoidance of doubt, the Licensor may also offer the Licensed + Material under separate terms or conditions or stop distributing the + Licensed Material at any time; however, doing so will not terminate + this Public License. + + e. Sections 1, 5, 6, 7, and 8 survive termination of this Public License. + +Section 7 - Other Terms and Conditions. + + a. The Licensor shall not be bound by any additional or different terms + or conditions communicated by You unless expressly agreed. + + b. Any arrangements, understandings, or agreements regarding the + Licensed Material not stated herein are separate from and + independent of the terms and conditions of this Public License. + +Section 8 - Interpretation. + + a. For the avoidance of doubt, this Public License does not, and shall + not be interpreted to, reduce, limit, restrict, or impose conditions + on any use of the Licensed Material that could lawfully be made + without permission under this Public License. + + b. To the extent possible, if any provision of this Public License is + deemed unenforceable, it shall be automatically reformed to the + minimum extent necessary to make it enforceable. If the provision + cannot be reformed, it shall be severed from this Public License + without affecting the enforceability of the remaining terms and + conditions. + + c. 
No term or condition of this Public License will be waived and no + failure to comply consented to unless expressly agreed to by the + Licensor. + + d. Nothing in this Public License constitutes or may be interpreted as + a limitation upon, or waiver of, any privileges and immunities that + apply to the Licensor or You, including from the legal processes of + any jurisdiction or authority. + +Creative Commons is not a party to its public licenses. Notwithstanding, +Creative Commons may elect to apply one of its public licenses to material +it publishes and in those instances will be considered the "Licensor." The +text of the Creative Commons public licenses is dedicated to the public +domain under the CC0 Public Domain Dedication. Except for the limited +purpose of indicating that material is shared under a Creative Commons +public license or as otherwise permitted by the Creative Commons policies +published at creativecommons.org/policies, Creative Commons does not +authorize the use of the trademark "Creative Commons" or any other +trademark or logo of Creative Commons without its prior written consent +including, without limitation, in connection with any unauthorized +modifications to any of its public licenses or any other arrangements, +understandings, or agreements concerning use of licensed material. For the +avoidance of doubt, this paragraph does not form part of the public +licenses. + +Creative Commons may be contacted at creativecommons.org. From f91af1c69c57664a508fa054ce1e2cdf74741f00 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 25 Apr 2018 22:30:26 +0200 Subject: [PATCH 044/103] LICENSES: Add Linux-OpenIB license text The InfiniBand code uses a variant of the OpenIB license. This license is BSD-2-Clause with the MIT disclaimer. The Linux kernel has used this license extensively throughout the driver subsystem since 2005. Note that the OpenIB.org license is a true match to BSD-2-Clause.
The license text was copied from: https://spdx.org/licenses/Linux-OpenIB.html#licenseText Signed-off-by: Thomas Gleixner Reviewed-by: Greg Kroah-Hartman Reviewed-by: Kate Stewart Signed-off-by: Jonathan Corbet --- LICENSES/other/Linux-OpenIB | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100644 LICENSES/other/Linux-OpenIB diff --git a/LICENSES/other/Linux-OpenIB b/LICENSES/other/Linux-OpenIB new file mode 100644 index 000000000000..1ad85f6b3a89 --- /dev/null +++ b/LICENSES/other/Linux-OpenIB @@ -0,0 +1,26 @@ +Valid-License-Identifier: Linux-OpenIB +SPDX-URL: https://spdx.org/licenses/Linux-OpenIB.html +Usage-Guide: + To use the Linux Kernel Variant of OpenIB.org license put the following + SPDX tag/value pair into a comment according to the placement guidelines + in the licensing rules documentation: + SPDX-License-Identifier: Linux-OpenIB +License-Text: + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER +DEALINGS IN THE SOFTWARE. 
From 5385a295ec00eb80525ec7ff1d97e13e06ba77b7 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Thu, 26 Apr 2018 15:54:27 +0200 Subject: [PATCH 045/103] scripts: Add SPDX checker script The number of SPDX-License-Identifiers in the kernel is growing, and so is the number of broken license expressions and of license IDs which have no corresponding license text file in the LICENSES directory. Add a script which gathers information from the LICENSES directory, i.e. the various tags in the license and exception files, and then scans either input from stdin, which it treats as a single file, or, if started without arguments, the full kernel tree. It checks whether the license expression syntax is correct and also validates whether the license identifiers used in the expressions are available in the LICENSES files. scripts/spdxcheck.py -h usage: spdxcheck.py [-h] [-m MAXLINES] [-v] [path [path ...]] SPDX expression checker positional arguments: path Check path or file. If not given full git tree scan. For stdin use "-" optional arguments: -h, --help show this help message and exit -m MAXLINES, --maxlines MAXLINES Maximum number of lines to scan in a file. Default 15 -v, --verbose Verbose statistics output include/dt-bindings/reset/amlogic,meson-axg-reset.h: 9:41 Invalid License ID: BSD drivers/pinctrl/sh-pfc/pfc-r8a77965.c: 1:28 Invalid License ID: GPL-2. include/dt-bindings/reset/amlogic,meson-axg-reset.h: 9:41 Invalid License ID: BSD arch/x86/kernel/jailhouse.c: 1:28 Invalid License ID: GPL2.0 include/dt-bindings/reset/amlogic,meson-axg-reset.h: 9:41 Invalid License ID: BSD arch/arm/mach-s3c24xx/h1940-bluetooth.c: 1:28 Invalid License ID: GPL-1.0 arch/x86/kernel/jailhouse.c: 1:28 Invalid License ID: GPL2.0 drivers/pinctrl/sh-pfc/pfc-r8a77965.c: 1:28 Invalid License ID: GPL-2.
include/dt-bindings/reset/amlogic,meson-axg-reset.h: 9:41 Invalid License ID: BSD arch/x86/include/asm/jailhouse_para.h: 1:28 Invalid License ID: GPL2.0 arch/arm/mach-s3c24xx/h1940-bluetooth.c: 1:28 Invalid License ID: GPL-1.0 arch/x86/kernel/jailhouse.c: 1:28 Invalid License ID: GPL2.0 drivers/pinctrl/sh-pfc/pfc-r8a77965.c: 1:28 Invalid License ID: GPL-2. include/dt-bindings/reset/amlogic,meson-axg-reset.h: 9:41 Invalid License ID: BSD arch/x86/include/asm/jailhouse_para.h: 1:28 Invalid License ID: GPL2.0 License files: 14 Exception files: 1 License IDs 19 Exception IDs 1 Files checked: 61332 Lines checked: 669181 Files with SPDX: 16169 Files with errors: 5 real 0m2.642s user 0m2.231s sys 0m0.467s That's a full tree sweep on my laptop. Note, this runs single threaded. By default it scans the first 15 lines for an SPDX identifier; the deepest identifier currently found inside a top comment is at line 10. That's going to be faster once the identifiers are all in the first two lines as documented. The Python wizards will surely know how to do that smarter and faster, but it's at least better than no tool at all.
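As a rough illustration of the scanning step described above, the following standalone sketch looks for the tag in the first few lines of a file and reports unknown license IDs. It is not the script added by this patch: it uses a plain regular expression instead of the ply grammar, does not check expression syntax or exception IDs, and the names KNOWN_IDS, find_spdx_expression and invalid_ids are made up for the example.

```python
import re

# Hypothetical stand-in for the IDs gathered from the LICENSES directory.
KNOWN_IDS = {"GPL-2.0", "BSD-2-CLAUSE", "LINUX-OPENIB"}

# Same character class the ID token uses; operators are filtered afterwards.
TOKEN_RE = re.compile(r"[A-Za-z.0-9\-+]+")
OPERATORS = {"AND", "OR", "WITH"}


def find_spdx_expression(lines, maxlines=15):
    """Return (line_number, expression) for the first SPDX tag, or None."""
    for lineno, line in enumerate(lines, start=1):
        if lineno > maxlines:
            break
        if "SPDX-License-Identifier:" not in line:
            continue
        # Drop a trailing C comment terminator, as the real script does.
        expr = line.split(":", 1)[1].replace("*/", "").strip()
        return lineno, expr
    return None


def invalid_ids(expr, known=KNOWN_IDS):
    """List tokens that are neither operators nor known license IDs.

    The token following WITH is an exception ID and is skipped here,
    because this sketch does not validate exceptions.
    """
    bad = []
    prev = None
    for tok in TOKEN_RE.findall(expr):
        up = tok.upper()
        if up not in OPERATORS and prev != "WITH" and up not in known:
            bad.append(tok)
        prev = up
    return bad
```

Calling find_spdx_expression() on the lines of a file and passing the returned expression to invalid_ids() gives the list of IDs that would be reported as "Invalid License ID" in the output above, under the assumption that KNOWN_IDS mirrors the LICENSES directory.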
Signed-off-by: Thomas Gleixner Reviewed-by: Greg Kroah-Hartman [jc: Fixed ironically erroneous SPDX tag and did chmod +x ] Signed-off-by: Jonathan Corbet --- scripts/spdxcheck.py | 284 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 284 insertions(+) create mode 100755 scripts/spdxcheck.py diff --git a/scripts/spdxcheck.py b/scripts/spdxcheck.py new file mode 100755 index 000000000000..7deaef297f52 --- /dev/null +++ b/scripts/spdxcheck.py @@ -0,0 +1,284 @@ +#!/usr/bin/env python +# SPDX-License-Identifier: GPL-2.0 +# Copyright Thomas Gleixner + +from argparse import ArgumentParser +from ply import lex, yacc +import traceback +import sys +import git +import re +import os + +class ParserException(Exception): + def __init__(self, tok, txt): + self.tok = tok + self.txt = txt + +class SPDXException(Exception): + def __init__(self, el, txt): + self.el = el + self.txt = txt + +class SPDXdata(object): + def __init__(self): + self.license_files = 0 + self.exception_files = 0 + self.licenses = [ ] + self.exceptions = { } + +# Read the spdx data from the LICENSES directory +def read_spdxdata(repo): + + # The subdirectories of LICENSES in the kernel source + license_dirs = [ "preferred", "other", "exceptions" ] + lictree = repo.heads.master.commit.tree['LICENSES'] + + spdx = SPDXdata() + + for d in license_dirs: + for el in lictree[d].traverse(): + if not os.path.isfile(el.path): + continue + + exception = None + for l in open(el.path).readlines(): + if l.startswith('Valid-License-Identifier:'): + lid = l.split(':')[1].strip().upper() + if lid in spdx.licenses: + raise SPDXException(el, 'Duplicate License Identifier: %s' %lid) + else: + spdx.licenses.append(lid) + + elif l.startswith('SPDX-Exception-Identifier:'): + exception = l.split(':')[1].strip().upper() + spdx.exceptions[exception] = [] + + elif l.startswith('SPDX-Licenses:'): + for lic in l.split(':')[1].upper().strip().replace(' ', '').replace('\t', '').split(','): + if not lic in spdx.licenses: + raise 
SPDXException(None, 'Exception %s missing license %s' %(exception, lic)) + spdx.exceptions[exception].append(lic) + + elif l.startswith("License-Text:"): + if exception: + if not len(spdx.exceptions[exception]): + raise SPDXException(el, 'Exception %s is missing SPDX-Licenses' %exception) + spdx.exception_files += 1 + else: + spdx.license_files += 1 + break + return spdx + +class id_parser(object): + + reserved = [ 'AND', 'OR', 'WITH' ] + tokens = [ 'LPAR', 'RPAR', 'ID', 'EXC' ] + reserved + + precedence = ( ('nonassoc', 'AND', 'OR'), ) + + t_ignore = ' \t' + + def __init__(self, spdx): + self.spdx = spdx + self.lasttok = None + self.lastid = None + self.lexer = lex.lex(module = self, reflags = re.UNICODE) + # Initialize the parser. No debug file and no parser rules stored on disk + # The rules are small enough to be generated on the fly + self.parser = yacc.yacc(module = self, write_tables = False, debug = False) + self.lines_checked = 0 + self.checked = 0 + self.spdx_valid = 0 + self.spdx_errors = 0 + self.curline = 0 + self.deepest = 0 + + # Validate License and Exception IDs + def validate(self, tok): + id = tok.value.upper() + if tok.type == 'ID': + if not id in self.spdx.licenses: + raise ParserException(tok, 'Invalid License ID') + self.lastid = id + elif tok.type == 'EXC': + if id not in self.spdx.exceptions: + raise ParserException(tok, 'Invalid Exception ID') + if self.lastid not in self.spdx.exceptions[id]: + raise ParserException(tok, 'Exception not valid for license %s' %self.lastid) + self.lastid = None + elif tok.type != 'WITH': + self.lastid = None + + # Lexer functions + def t_RPAR(self, tok): + r'\)' + self.lasttok = tok.type + return tok + + def t_LPAR(self, tok): + r'\(' + self.lasttok = tok.type + return tok + + def t_ID(self, tok): + r'[A-Za-z.0-9\-+]+' + + if self.lasttok == 'EXC': + print(tok) + raise ParserException(tok, 'Missing parentheses') + + tok.value = tok.value.strip() + val = tok.value.upper() + + if val in self.reserved: + tok.type = 
val + elif self.lasttok == 'WITH': + tok.type = 'EXC' + + self.lasttok = tok.type + self.validate(tok) + return tok + + def t_error(self, tok): + raise ParserException(tok, 'Invalid token') + + def p_expr(self, p): + '''expr : ID + | ID WITH EXC + | expr AND expr + | expr OR expr + | LPAR expr RPAR''' + pass + + def p_error(self, p): + if not p: + raise ParserException(None, 'Unfinished license expression') + else: + raise ParserException(p, 'Syntax error') + + def parse(self, expr): + self.lasttok = None + self.lastid = None + self.parser.parse(expr, lexer = self.lexer) + + def parse_lines(self, fd, maxlines, fname): + self.checked += 1 + self.curline = 0 + try: + for line in fd: + self.curline += 1 + if self.curline > maxlines: + break + self.lines_checked += 1 + if line.find("SPDX-License-Identifier:") < 0: + continue + expr = line.split(':')[1].replace('*/', '').strip() + self.parse(expr) + self.spdx_valid += 1 + # + # Should we check for more SPDX ids in the same file and + # complain if there are any? 
+ # + break + + except ParserException as pe: + if pe.tok: + col = line.find(expr) + pe.tok.lexpos + tok = pe.tok.value + sys.stdout.write('%s: %d:%d %s: %s\n' %(fname, self.curline, col, pe.txt, tok)) + else: + sys.stdout.write('%s: %d:0 %s\n' %(fname, self.curline, pe.txt)) + self.spdx_errors += 1 + +def scan_git_tree(tree): + for el in tree.traverse(): + # Exclude stuff which would make pointless noise + # FIXME: Put this somewhere more sensible + if el.path.startswith("LICENSES"): + continue + if el.path.find("license-rules.rst") >= 0: + continue + if el.path == 'scripts/checkpatch.pl': + continue + if not os.path.isfile(el.path): + continue + parser.parse_lines(open(el.path), args.maxlines, el.path) + +def scan_git_subtree(tree, path): + for p in path.strip('/').split('/'): + tree = tree[p] + scan_git_tree(tree) + +if __name__ == '__main__': + + ap = ArgumentParser(description='SPDX expression checker') + ap.add_argument('path', nargs='*', help='Check path or file. If not given full git tree scan. For stdin use "-"') + ap.add_argument('-m', '--maxlines', type=int, default=15, + help='Maximum number of lines to scan in a file. 
Default 15') + ap.add_argument('-v', '--verbose', action='store_true', help='Verbose statistics output') + args = ap.parse_args() + + # Sanity check path arguments + if '-' in args.path and len(args.path) > 1: + sys.stderr.write('stdin input "-" must be the only path argument\n') + sys.exit(1) + + try: + # Use git to get the valid license expressions + repo = git.Repo(os.getcwd()) + assert not repo.bare + + # Initialize SPDX data + spdx = read_spdxdata(repo) + + # Initialize the parser + parser = id_parser(spdx) + + except SPDXException as se: + if se.el: + sys.stderr.write('%s: %s\n' %(se.el.path, se.txt)) + else: + sys.stderr.write('%s\n' %se.txt) + sys.exit(1) + + except Exception as ex: + sys.stderr.write('FAIL: %s\n' %ex) + sys.stderr.write('%s\n' %traceback.format_exc()) + sys.exit(1) + + try: + if len(args.path) and args.path[0] == '-': + parser.parse_lines(sys.stdin, args.maxlines, '-') + else: + if args.path: + for p in args.path: + if os.path.isfile(p): + parser.parse_lines(open(p), args.maxlines, p) + elif os.path.isdir(p): + scan_git_subtree(repo.head.reference.commit.tree, p) + else: + sys.stderr.write('path %s does not exist\n' %p) + sys.exit(1) + else: + # Full git tree scan + scan_git_tree(repo.head.commit.tree) + + if args.verbose: + sys.stderr.write('\n') + sys.stderr.write('License files: %12d\n' %spdx.license_files) + sys.stderr.write('Exception files: %12d\n' %spdx.exception_files) + sys.stderr.write('License IDs %12d\n' %len(spdx.licenses)) + sys.stderr.write('Exception IDs %12d\n' %len(spdx.exceptions)) + sys.stderr.write('\n') + sys.stderr.write('Files checked: %12d\n' %parser.checked) + sys.stderr.write('Lines checked: %12d\n' %parser.lines_checked) + sys.stderr.write('Files with SPDX: %12d\n' %parser.spdx_valid) + sys.stderr.write('Files with errors: %12d\n' %parser.spdx_errors) + + sys.exit(0) + + except Exception as ex: + sys.stderr.write('FAIL: %s\n' %ex) + sys.stderr.write('%s\n' %traceback.format_exc()) + sys.exit(1) From 
6dddd7a7ec34bd8680ef72de0229cf8a92bd01ab Mon Sep 17 00:00:00 2001 From: Thymo van Beers Date: Wed, 18 Apr 2018 20:51:39 +0200 Subject: [PATCH 046/103] docs: kernel-parameters.txt: Fix whitespace Some lines used spaces instead of tabs at line start. This can cause mangled lines in editors due to inconsistency. Replace spaces for tabs where appropriate. Signed-off-by: Thymo van Beers Reviewed-by: Randy Dunlap Signed-off-by: Jonathan Corbet --- .../admin-guide/kernel-parameters.txt | 136 +++++++++--------- 1 file changed, 68 insertions(+), 68 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3487be79847c..865a24e4d516 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -106,11 +106,11 @@ use by PCI Format: ,... - acpi_mask_gpe= [HW,ACPI] + acpi_mask_gpe= [HW,ACPI] Due to the existence of _Lxx/_Exx, some GPEs triggered by unsupported hardware/firmware features can result in - GPE floodings that cannot be automatically disabled by - the GPE dispatcher. + GPE floodings that cannot be automatically disabled by + the GPE dispatcher. This facility can be used to prevent such uncontrolled GPE floodings. Format: @@ -472,10 +472,10 @@ for platform specific values (SB1, Loongson3 and others). - ccw_timeout_log [S390] + ccw_timeout_log [S390] See Documentation/s390/CommonIO for details. - cgroup_disable= [KNL] Disable a particular controller + cgroup_disable= [KNL] Disable a particular controller Format: {name of the controller(s) to disable} The effects of cgroup_disable=foo are: - foo isn't auto-mounted if you mount all cgroups in @@ -641,8 +641,8 @@ hvc Use the hypervisor console device . This is for both Xen and PowerPC hypervisors. 
- If the device connected to the port is not a TTY but a braille - device, prepend "brl," before the device type, for instance + If the device connected to the port is not a TTY but a braille + device, prepend "brl," before the device type, for instance console=brl,ttyS0 For now, only VisioBraille is supported. @@ -662,7 +662,7 @@ consoleblank= [KNL] The console blank (screen saver) timeout in seconds. A value of 0 disables the blank timer. - Defaults to 0. + Defaults to 0. coredump_filter= [KNL] Change the default value for @@ -730,7 +730,7 @@ or memory reserved is below 4G. cryptomgr.notests - [KNL] Disable crypto self-tests + [KNL] Disable crypto self-tests cs89x0_dma= [HW,NET] Format: @@ -746,7 +746,7 @@ Format: , See also Documentation/input/devices/joystick-parport.rst - ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot + ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot time. See Documentation/admin-guide/dynamic-debug-howto.rst for details. Deprecated, see dyndbg. @@ -833,7 +833,7 @@ causing system reset or hang due to sending INIT from AP to BSP. - disable_ddw [PPC/PSERIES] + disable_ddw [PPC/PSERIES] Disable Dynamic DMA Window support. Use this if to workaround buggy firmware. @@ -1188,7 +1188,7 @@ parameter will force ia64_sal_cache_flush to call ia64_pal_cache_flush instead of SAL_CACHE_FLUSH. - forcepae [X86-32] + forcepae [X86-32] Forcefully enable Physical Address Extension (PAE). Many Pentium M systems disable PAE but may have a functionally usable PAE implementation. @@ -1247,7 +1247,7 @@ gamma= [HW,DRM] - gart_fix_e820= [X86_64] disable the fix e820 for K8 GART + gart_fix_e820= [X86_64] disable the fix e820 for K8 GART Format: off | on default: on @@ -1341,11 +1341,11 @@ x86-64 are 2M (when the CPU supports "pse") and 1G (when the CPU supports the "pdpe1gb" cpuinfo flag). - hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) - terminal devices. 
Valid values: 0..8 - hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs. - If specified, z/VM IUCV HVC accepts connections - from listed z/VM user IDs only. + hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) + terminal devices. Valid values: 0..8 + hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs. + If specified, z/VM IUCV HVC accepts connections + from listed z/VM user IDs only. keep_bootcon [KNL] Do not unregister boot console at start. This is only @@ -1353,11 +1353,11 @@ between unregistering the boot console and initializing the real console. - i2c_bus= [HW] Override the default board specific I2C bus speed - or register an additional I2C bus that is not - registered from board initialization code. - Format: - , + i2c_bus= [HW] Override the default board specific I2C bus speed + or register an additional I2C bus that is not + registered from board initialization code. + Format: + , i8042.debug [HW] Toggle i8042 debug mode i8042.unmask_kbd_data @@ -1386,7 +1386,7 @@ Default: only on s2r transitions on x86; most other architectures force reset to be always executed i8042.unlock [HW] Unlock (ignore) the keylock - i8042.kbdreset [HW] Reset device connected to KBD port + i8042.kbdreset [HW] Reset device connected to KBD port i810= [HW,DRM] @@ -1548,13 +1548,13 @@ programs exec'd, files mmap'd for exec, and all files opened for read by uid=0. - ima_template= [IMA] + ima_template= [IMA] Select one of defined IMA measurements template formats. Formats: { "ima" | "ima-ng" | "ima-sig" } Default: "ima-ng" ima_template_fmt= - [IMA] Define a custom template format. + [IMA] Define a custom template format. 
Format: { "field1|...|fieldN" } ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage @@ -1597,7 +1597,7 @@ inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver Format: - int_pln_enable [x86] Enable power limit notification interrupt + int_pln_enable [x86] Enable power limit notification interrupt integrity_audit=[IMA] Format: { "0" | "1" } @@ -1650,39 +1650,39 @@ 0 disables intel_idle and fall back on acpi_idle. 1 to 9 specify maximum depth of C-state. - intel_pstate= [X86] - disable - Do not enable intel_pstate as the default - scaling driver for the supported processors - passive - Use intel_pstate as a scaling driver, but configure it - to work with generic cpufreq governors (instead of - enabling its internal governor). This mode cannot be - used along with the hardware-managed P-states (HWP) - feature. - force - Enable intel_pstate on systems that prohibit it by default - in favor of acpi-cpufreq. Forcing the intel_pstate driver - instead of acpi-cpufreq may disable platform features, such - as thermal controls and power capping, that rely on ACPI - P-States information being indicated to OSPM and therefore - should be used with caution. This option does not work with - processors that aren't supported by the intel_pstate driver - or on platforms that use pcc-cpufreq instead of acpi-cpufreq. - no_hwp - Do not enable hardware P state control (HWP) - if available. - hwp_only - Only load intel_pstate on systems which support - hardware P state control (HWP) if available. - support_acpi_ppc - Enforce ACPI _PPC performance limits. If the Fixed ACPI - Description Table, specifies preferred power management - profile as "Enterprise Server" or "Performance Server", - then this feature is turned on by default. 
- per_cpu_perf_limits - Allow per-logical-CPU P-State performance control limits using - cpufreq sysfs interface + intel_pstate= [X86] + disable + Do not enable intel_pstate as the default + scaling driver for the supported processors + passive + Use intel_pstate as a scaling driver, but configure it + to work with generic cpufreq governors (instead of + enabling its internal governor). This mode cannot be + used along with the hardware-managed P-states (HWP) + feature. + force + Enable intel_pstate on systems that prohibit it by default + in favor of acpi-cpufreq. Forcing the intel_pstate driver + instead of acpi-cpufreq may disable platform features, such + as thermal controls and power capping, that rely on ACPI + P-States information being indicated to OSPM and therefore + should be used with caution. This option does not work with + processors that aren't supported by the intel_pstate driver + or on platforms that use pcc-cpufreq instead of acpi-cpufreq. + no_hwp + Do not enable hardware P state control (HWP) + if available. + hwp_only + Only load intel_pstate on systems which support + hardware P state control (HWP) if available. + support_acpi_ppc + Enforce ACPI _PPC performance limits. If the Fixed ACPI + Description Table, specifies preferred power management + profile as "Enterprise Server" or "Performance Server", + then this feature is turned on by default. + per_cpu_perf_limits + Allow per-logical-CPU P-State performance control limits using + cpufreq sysfs interface intremap= [X86-64, Intel-IOMMU] on enable Interrupt Remapping (default) @@ -2027,7 +2027,7 @@ * [no]ncqtrim: Turn off queued DSM TRIM. * nohrst, nosrst, norst: suppress hard, soft - and both resets. + and both resets. * rstonce: only attempt one reset during hot-unplug link recovery @@ -2215,7 +2215,7 @@ [KNL,SH] Allow user to override the default size for per-device physically contiguous DMA buffers. 
- memhp_default_state=online/offline + memhp_default_state=online/offline [KNL] Set the initial state for the memory hotplug onlining policy. If not specified, the default value is set according to the @@ -2762,7 +2762,7 @@ [X86,PV_OPS] Disable paravirtualized VMware scheduler clock and use the default one. - no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. + no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. steal time is computed, but won't influence scheduler behaviour @@ -2823,7 +2823,7 @@ notsc [BUGS=X86-32] Disable Time Stamp Counter nowatchdog [KNL] Disable both lockup detectors, i.e. - soft-lockup and NMI watchdog (hard-lockup). + soft-lockup and NMI watchdog (hard-lockup). nowb [ARM] @@ -2843,7 +2843,7 @@ If the dependencies are under your control, you can turn on cpu0_hotplug. - nps_mtm_hs_ctr= [KNL,ARC] + nps_mtm_hs_ctr= [KNL,ARC] This parameter sets the maximum duration, in cycles, each HW thread of the CTOP can run without interruptions, before HW switches it. @@ -2984,7 +2984,7 @@ pci=option[,option...] [PCI] various PCI subsystem options: earlydump [X86] dump PCI config space before the kernel - changes anything + changes anything off [X86] don't probe for the PCI bus bios [X86-32] force use of PCI BIOS, don't access the hardware directly. Use this if your machine @@ -3072,7 +3072,7 @@ is enabled by default. If you need to use this, please report a bug. nocrs [X86] Ignore PCI host bridge windows from ACPI. - If you need to use this, please report a bug. + If you need to use this, please report a bug. routeirq Do IRQ routing for all PCI devices. This is normally done in pci_enable_device(), so this option is a temporary workaround @@ -4391,7 +4391,7 @@ usbcore.initial_descriptor_timeout= [USB] Specifies timeout for the initial 64-byte - USB_REQ_GET_DESCRIPTOR request in milliseconds + USB_REQ_GET_DESCRIPTOR request in milliseconds (default 5000 = 5.0 seconds). 
usbcore.nousb [USB] Disable the USB subsystem From a997a703d08f805eb8767fbc586fd938f09d13cc Mon Sep 17 00:00:00 2001 From: Anders Roxell Date: Thu, 19 Apr 2018 12:28:25 +0200 Subject: [PATCH 047/103] doc: dev-tools: kselftest.rst: update contributing new tests Add a description that the kernel headers should be used as far as it is possible and then the system headers. Signed-off-by: Anders Roxell Reviewed-by: Shuah Khan (Samsung OSG) Signed-off-by: Jonathan Corbet --- Documentation/dev-tools/kselftest.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/Documentation/dev-tools/kselftest.rst b/Documentation/dev-tools/kselftest.rst index e80850eefe13..3bf371a938d0 100644 --- a/Documentation/dev-tools/kselftest.rst +++ b/Documentation/dev-tools/kselftest.rst @@ -151,6 +151,11 @@ Contributing new tests (details) TEST_FILES, TEST_GEN_FILES mean it is the file which is used by test. + * First use the headers inside the kernel source and/or git repo, and then the + system headers. Headers for the kernel release as opposed to headers + installed by the distro on the system should be the primary focus to be able + to find regressions. + Test Harness ============ From 10c2c55d9fbd9ef9150914f58660bda64e64c98a Mon Sep 17 00:00:00 2001 From: Mathieu Poirier Date: Tue, 17 Apr 2018 10:08:05 -0600 Subject: [PATCH 048/103] coresight: Remove obsolete reference to "owner" in CoreSight descriptor Field "owner" of struct coresight_desc has been removed a while back but the documentation was not updated to reflect the changes. Signed-off-by: Mathieu Poirier Signed-off-by: Jonathan Corbet --- Documentation/trace/coresight.txt | 3 --- 1 file changed, 3 deletions(-) diff --git a/Documentation/trace/coresight.txt b/Documentation/trace/coresight.txt index 6f0120c3a4f1..710c75b6c73f 100644 --- a/Documentation/trace/coresight.txt +++ b/Documentation/trace/coresight.txt @@ -187,9 +187,6 @@ that can be performed on them (see "struct coresight_ops"). The specific to that component only. 
"Implementation defined" customisations are expected to be accessed and controlled using those entries. -Last but not least, "struct module *owner" is expected to be set to reflect -the information carried in "THIS_MODULE". - How to use the tracer modules ----------------------------- From f29816b496f1024ec491203e9f3f6c6c924faf39 Mon Sep 17 00:00:00 2001 From: Mathieu Poirier Date: Tue, 17 Apr 2018 10:08:06 -0600 Subject: [PATCH 049/103] coresight: Add section for integration with the perf tools Adding a section that document how to use the Coresight framework and drivers from the perf tools. Signed-off-by: Mathieu Poirier Acked-by: Randy Dunlap Signed-off-by: Jonathan Corbet --- Documentation/trace/coresight.txt | 52 ++++++++++++++++++++++++++++++- 1 file changed, 51 insertions(+), 1 deletion(-) diff --git a/Documentation/trace/coresight.txt b/Documentation/trace/coresight.txt index 710c75b6c73f..ab0d0f2d5cec 100644 --- a/Documentation/trace/coresight.txt +++ b/Documentation/trace/coresight.txt @@ -187,10 +187,19 @@ that can be performed on them (see "struct coresight_ops"). The specific to that component only. "Implementation defined" customisations are expected to be accessed and controlled using those entries. + How to use the tracer modules ----------------------------- -Before trace collection can start, a coresight sink needs to be identify. +There are two ways to use the Coresight framework: 1) using the perf cmd line +tools and 2) interacting directly with the Coresight devices using the sysFS +interface. Preference is given to the former as using the sysFS interface +requires a deep understanding of the Coresight HW. The following sections +provide details on using both methods. + +1) Using the sysFS interface: + +Before trace collection can start, a coresight sink needs to be identified. There is no limit on the amount of sinks (nor sources) that can be enabled at any given moment. 
As a generic operation, all device pertaining to the sink class will have an "active" entry in sysfs: @@ -295,6 +304,47 @@ Instruction 13570831 0x8026B584 E28DD00C false ADD Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc} Timestamp Timestamp: 17107041535 +2) Using perf framework: + +Coresight tracers are represented using the Perf framework's Performance +Monitoring Unit (PMU) abstraction. As such the perf framework takes charge of +controlling when tracing gets enabled based on when the process of interest is +scheduled. When configured in a system, Coresight PMUs will be listed when +queried by the perf command line tool: + + linaro@linaro-nano:~$ ./perf list pmu + + List of pre-defined events (to be used in -e): + + cs_etm// [Kernel PMU event] + + linaro@linaro-nano:~$ + +Regardless of the number of tracers available in a system (usually equal to the +amount of processor cores), the "cs_etm" PMU will be listed only once. + +A Coresight PMU works the same way as any other PMU, i.e the name of the PMU is +listed along with configuration options within forward slashes '/'. Since a +Coresight system will typically have more than one sink, the name of the sink to +work with needs to be specified as an event option. Names for sink to choose +from are listed in sysFS under ($SYSFS)/bus/coresight/devices: + + root@linaro-nano:~# ls /sys/bus/coresight/devices/ + 20010000.etf 20040000.funnel 20100000.stm 22040000.etm + 22140000.etm 230c0000.funnel 23240000.etm 20030000.tpiu + 20070000.etr 20120000.replicator 220c0000.funnel + 23040000.etm 23140000.etm 23340000.etm + + root@linaro-nano:~# perf record -e cs_etm/@20070000.etr/u --per-thread program + +The syntax within the forward slashes '/' is important. The '@' character +tells the parser that a sink is about to be specified and that this is the sink +to use for the trace session. 
+ +More information on the above and other example on how to use Coresight with +the perf tools can be found in the "HOWTO.md" file of the openCSD gitHub +repository [3]. + How to use the STM module ------------------------- From 87bf4d68a4f2db9fa1664d888ebe4567ca487ea0 Mon Sep 17 00:00:00 2001 From: Mathieu Poirier Date: Tue, 17 Apr 2018 10:08:07 -0600 Subject: [PATCH 050/103] coresight: Grouping all perf tools oriented section together This patch groups together section pertaining to the perf tools. That way everything is at the same place rather than spread out. Signed-off-by: Mathieu Poirier Signed-off-by: Jonathan Corbet --- Documentation/trace/coresight.txt | 72 +++++++++++++++---------------- 1 file changed, 36 insertions(+), 36 deletions(-) diff --git a/Documentation/trace/coresight.txt b/Documentation/trace/coresight.txt index ab0d0f2d5cec..1d74ad0202b6 100644 --- a/Documentation/trace/coresight.txt +++ b/Documentation/trace/coresight.txt @@ -345,42 +345,7 @@ More information on the above and other example on how to use Coresight with the perf tools can be found in the "HOWTO.md" file of the openCSD gitHub repository [3]. -How to use the STM module -------------------------- - -Using the System Trace Macrocell module is the same as the tracers - the only -difference is that clients are driving the trace capture rather -than the program flow through the code. 
- -As with any other CoreSight component, specifics about the STM tracer can be -found in sysfs with more information on each entry being found in [1]: - -root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm -enable_source hwevent_select port_enable subsystem uevent -hwevent_enable mgmt port_select traceid -root@genericarmv8:~# - -Like any other source a sink needs to be identified and the STM enabled before -being used: - -root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink -root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source - -From there user space applications can request and use channels using the devfs -interface provided for that purpose by the generic STM API: - -root@genericarmv8:~# ls -l /dev/20100000.stm -crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm -root@genericarmv8:~# - -Details on how to use the generic STM API can be found here [2]. - -[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm -[2]. Documentation/trace/stm.txt - - -Using perf tools ----------------- +2.1) AutoFDO analysis using the perf tools: perf can be used to record and analyze trace of programs. @@ -428,3 +393,38 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tuto $ taskset -c 2 ./sort_autofdo Bubble sorting array of 30000 elements 5806 ms + + +How to use the STM module +------------------------- + +Using the System Trace Macrocell module is the same as the tracers - the only +difference is that clients are driving the trace capture rather +than the program flow through the code. 
+ +As with any other CoreSight component, specifics about the STM tracer can be +found in sysfs with more information on each entry being found in [1]: + +root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm +enable_source hwevent_select port_enable subsystem uevent +hwevent_enable mgmt port_select traceid +root@genericarmv8:~# + +Like any other source a sink needs to be identified and the STM enabled before +being used: + +root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink +root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source + +From there user space applications can request and use channels using the devfs +interface provided for that purpose by the generic STM API: + +root@genericarmv8:~# ls -l /dev/20100000.stm +crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm +root@genericarmv8:~# + +Details on how to use the generic STM API can be found here [2]. + +[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm +[2]. Documentation/trace/stm.txt +[3]. https://github.com/Linaro/perf-opencsd From fde7917fbd6fcdc35d5ca216e4d44bdeb87edb76 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 18 Apr 2018 11:07:44 +0300 Subject: [PATCH 051/103] docs/vm: hugetlbpage: minor improvements * fixed mistypes * added internal cross-references for sections Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/hugetlbpage.rst | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/Documentation/vm/hugetlbpage.rst b/Documentation/vm/hugetlbpage.rst index a5da14b05b4b..99ad5d95e916 100644 --- a/Documentation/vm/hugetlbpage.rst +++ b/Documentation/vm/hugetlbpage.rst @@ -87,7 +87,7 @@ memory pressure. Once a number of huge pages have been pre-allocated to the kernel huge page pool, a user with appropriate privilege can use either the mmap system call or shared memory system calls to use the huge pages. 
See the discussion of -Using Huge Pages, below. +:ref:`Using Huge Pages `, below. The administrator can allocate persistent huge pages on the kernel boot command line by specifying the "hugepages=N" parameter, where 'N' = the @@ -115,8 +115,9 @@ over all the set of allowed nodes specified by the NUMA memory policy of the task that modifies ``nr_hugepages``. The default for the allowed nodes--when the task has default memory policy--is all on-line nodes with memory. Allowed nodes with insufficient available, contiguous memory for a huge page will be -silently skipped when allocating persistent huge pages. See the discussion -below of the interaction of task memory policy, cpusets and per node attributes +silently skipped when allocating persistent huge pages. See the +:ref:`discussion below ` +of the interaction of task memory policy, cpusets and per node attributes with the allocation and freeing of persistent huge pages. The success or failure of huge page allocation depends on the amount of @@ -158,7 +159,7 @@ normal page pool. Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that it becomes less than the number of huge pages in use will convert the balance of the in-use huge pages to surplus huge pages. This will occur even if -the number of surplus pages it would exceed the overcommit value. As long as +the number of surplus pages would exceed the overcommit value. As long as this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is increased sufficiently, or the surplus huge pages go out of use and are freed-- no more surplus huge pages will be allowed to be allocated. @@ -187,6 +188,7 @@ Inside each of these directories, the same set of files will exist:: which function as described above for the default huge page-sized case. +.. 
_mem_policy_and_hp_alloc: Interaction of Task Memory Policy with Huge Page Allocation/Freeing =================================================================== @@ -282,6 +284,7 @@ Note that the number of overcommit and reserve pages remain global quantities, as we don't know until fault time, when the faulting task's mempolicy is applied, from which node the huge page allocation will be attempted. +.. _using_huge_pages: Using Huge Pages ================ @@ -295,7 +298,7 @@ type hugetlbfs:: min_size=,nr_inodes= none /mnt/huge This command mounts a (pseudo) filesystem of type hugetlbfs on the directory -``/mnt/huge``. Any files created on ``/mnt/huge`` uses huge pages. +``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages. The ``uid`` and ``gid`` options sets the owner and group of the root of the file system. By default the ``uid`` and ``gid`` of the current process @@ -345,8 +348,8 @@ applications are going to use only shmat/shmget system calls or mmap with MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see :ref:`map_hugetlb ` below. -Users who wish to use hugetlb memory via shared memory segment should be a -member of a supplementary group and system admin needs to configure that gid +Users who wish to use hugetlb memory via shared memory segment should be +members of a supplementary group and system admin needs to configure that gid into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different applications to use any combination of mmaps and shm* calls, though the mount of filesystem will be required for using mmap calls without MAP_HUGETLB. From 946280cdfc211b4870d54c07f8a2aa82203d6886 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 18 Apr 2018 11:07:45 +0300 Subject: [PATCH 052/103] docs/vm: hugetlbpage: move section about kernel development to hugetlbfs_reserv The hugetlbpage describes hugetlbfs from the user perspective and newer hugetlbfs_reserv document targets kernel developers. 
Hence the section about hugetlbfs kernel development naturally belongs there. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/hugetlbfs_reserv.rst | 8 ++++++++ Documentation/vm/hugetlbpage.rst | 8 -------- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/Documentation/vm/hugetlbfs_reserv.rst b/Documentation/vm/hugetlbfs_reserv.rst index 36a87a2ea435..9d200762114f 100644 --- a/Documentation/vm/hugetlbfs_reserv.rst +++ b/Documentation/vm/hugetlbfs_reserv.rst @@ -583,5 +583,13 @@ of cpusets or memory policy there is no guarantee that huge pages will be available on the required nodes. This is true even if there are a sufficient number of global reservations. +Hugetlbfs regression testing +============================ +The most complete set of hugetlb tests are in the libhugetlbfs repository. +If you modify any hugetlb related code, use the libhugetlbfs test suite +to check for regressions. In addition, if you add any new hugetlb +functionality, please add appropriate tests to libhugetlbfs. + +-- Mike Kravetz, 7 April 2017 diff --git a/Documentation/vm/hugetlbpage.rst b/Documentation/vm/hugetlbpage.rst index 99ad5d95e916..2b374d10284d 100644 --- a/Documentation/vm/hugetlbpage.rst +++ b/Documentation/vm/hugetlbpage.rst @@ -379,11 +379,3 @@ The `libhugetlbfs`_ library provides a wide range of userspace tools to help with huge page usability, environment setup, and control. .. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs - -Kernel development regression testing -===================================== - -The most complete set of hugetlb tests are in the libhugetlbfs repository. -If you modify any hugetlb related code, use the libhugetlbfs test suite -to check for regressions. In addition, if you add any new hugetlb -functionality, please add appropriate tests to libhugetlbfs. 
From 86207d9a90c27acc15e99f1148ce4f873882ea1a Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 18 Apr 2018 11:07:46 +0300 Subject: [PATCH 053/103] docs/vm: pagemap: formatting and spelling updates Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/pagemap.rst | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/Documentation/vm/pagemap.rst b/Documentation/vm/pagemap.rst index d54b4bfd3043..9644bc0d6289 100644 --- a/Documentation/vm/pagemap.rst +++ b/Documentation/vm/pagemap.rst @@ -13,7 +13,7 @@ There are four components to pagemap: * ``/proc/pid/pagemap``. This file lets a userspace process find out which physical frame each virtual page is mapped to. It contains one 64-bit value for each virtual page, containing the following data (from - fs/proc/task_mmu.c, above pagemap_read): + ``fs/proc/task_mmu.c``, above pagemap_read): * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped @@ -36,7 +36,7 @@ There are four components to pagemap: precisely which pages are mapped (or in swap) and comparing mapped pages between processes. - Efficient users of this interface will use /proc/pid/maps to + Efficient users of this interface will use ``/proc/pid/maps`` to determine which areas of memory are actually mapped and llseek to skip over unmapped regions. @@ -79,11 +79,11 @@ There are four components to pagemap: memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. -Short descriptions to the page flags: -===================================== +Short descriptions to the page flags +==================================== 0 - LOCKED - page is being locked for exclusive access, eg. by undergoing read/write IO + page is being locked for exclusive access, e.g. 
by undergoing read/write IO 7 - SLAB page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator When compound page is used, SLUB/SLQB will only set this flag on the head @@ -132,7 +132,7 @@ IO related page flags ie. for file backed page: (in-memory data revision >= on-disk one) 4 - DIRTY page has been written to, hence contains new data - ie. for file backed page: (in-memory data revision > on-disk one) + i.e. for file backed page: (in-memory data revision > on-disk one) 8 - WRITEBACK page is being synced to disk @@ -145,7 +145,7 @@ LRU related page flags page is in the active LRU list 18 - UNEVICTABLE page is in the unevictable (non-)LRU list It is somehow pinned and - not a candidate for LRU page reclaims, eg. ramfs pages, + not a candidate for LRU page reclaims, e.g. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments 2 - REFERENCED page has been referenced since last LRU list enqueue/requeue @@ -156,7 +156,7 @@ LRU related page flags 12 - ANON a memory mapped page that is not part of a file 13 - SWAPCACHE - page is mapped to swap space, ie. has an associated swap entry + page is mapped to swap space, i.e. has an associated swap entry 14 - SWAPBACKED page is backed by swap/RAM From 41ea9dd36b6b968ab83fd972bf15b2e0f8905c80 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 18 Apr 2018 11:07:47 +0300 Subject: [PATCH 054/103] docs/vm: pagemap: change document title "pagemap from the Userspace Perspective" is not very descriptive for unaware readers. Since the document describes how to examine a process page tables, let's title it "Examining Process Page Tables" Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/pagemap.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/vm/pagemap.rst b/Documentation/vm/pagemap.rst index 9644bc0d6289..7ba8cbd57ad3 100644 --- a/Documentation/vm/pagemap.rst +++ b/Documentation/vm/pagemap.rst @@ -1,8 +1,8 @@ .. 
_pagemap: -====================================== -pagemap from the Userspace Perspective -====================================== +============================= +Examining Process Page Tables +============================= pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow userspace programs to examine the page tables and related information by From 3a3f7e26e5544032a687fb05b5221883b97a59ae Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 18 Apr 2018 11:07:48 +0300 Subject: [PATCH 055/103] docs/admin-guide: introduce basic index for mm documentation Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/index.rst | 1 + Documentation/admin-guide/mm/index.rst | 19 +++++++++++++++++++ 2 files changed, 20 insertions(+) create mode 100644 Documentation/admin-guide/mm/index.rst diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 5bb9161dbe6a..cac906fb0ed0 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -63,6 +63,7 @@ configure specific aspects of kernel behavior to your liking. pm/index thunderbolt LSM/index + mm/index .. only:: subproject and html diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst new file mode 100644 index 000000000000..c47c16e13a18 --- /dev/null +++ b/Documentation/admin-guide/mm/index.rst @@ -0,0 +1,19 @@ +================= +Memory Management +================= + +Linux memory management subsystem is responsible, as the name implies, +for managing the memory in the system. This includes implemnetation of +virtual memory and demand paging, memory allocation both for kernel +internal structures and user space programms, mapping of files into +processes address space and many other cool things. + +Linux memory management is a complex system with many configurable +settings. 
Most of these settings are available via ``/proc`` +filesystem and can be queried and adjusted using ``sysctl``. These APIs +are described in Documentation/sysctl/vm.txt and in `man 5 proc`_. + +.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html + +Here we document in detail how to interact with various mechanisms in +the Linux memory management. From 1ad1335dc58646764eda7bb054b350934a1b23ec Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 18 Apr 2018 11:07:49 +0300 Subject: [PATCH 056/103] docs/admin-guide/mm: start moving here files from Documentation/vm Several documents in Documentation/vm fit quite well into the "admin/user guide" category. The documents that don't overload the reader with lots of implementation details and provide coherent description of certain feature can be moved to Documentation/admin-guide/mm. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/ABI/stable/sysfs-devices-node | 2 +- Documentation/ABI/testing/sysfs-kernel-mm-hugepages | 2 +- Documentation/{vm => admin-guide/mm}/hugetlbpage.rst | 0 .../{vm => admin-guide/mm}/idle_page_tracking.rst | 2 +- Documentation/admin-guide/mm/index.rst | 9 +++++++++ Documentation/{vm => admin-guide/mm}/pagemap.rst | 6 +++--- Documentation/{vm => admin-guide/mm}/soft-dirty.rst | 0 Documentation/{vm => admin-guide/mm}/userfaultfd.rst | 0 Documentation/filesystems/proc.txt | 6 ++++-- Documentation/sysctl/vm.txt | 4 ++-- Documentation/vm/00-INDEX | 10 ---------- Documentation/vm/hwpoison.rst | 2 +- Documentation/vm/index.rst | 5 ----- fs/Kconfig | 2 +- fs/proc/task_mmu.c | 4 ++-- mm/Kconfig | 5 +++-- 16 files changed, 28 insertions(+), 31 deletions(-) rename Documentation/{vm => admin-guide/mm}/hugetlbpage.rst (100%) rename Documentation/{vm => admin-guide/mm}/idle_page_tracking.rst (98%) rename Documentation/{vm => admin-guide/mm}/pagemap.rst (96%) rename Documentation/{vm => admin-guide/mm}/soft-dirty.rst (100%) rename Documentation/{vm =>
admin-guide/mm}/userfaultfd.rst (100%) diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index b38f4b734567..3e90e1f3bf0a 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -90,4 +90,4 @@ Date: December 2009 Contact: Lee Schermerhorn Description: The node's huge page size control/query attributes. - See Documentation/vm/hugetlbpage.rst \ No newline at end of file + See Documentation/admin-guide/mm/hugetlbpage.rst \ No newline at end of file diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages index 5140b233356c..fdaa2162fae1 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages +++ b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages @@ -12,4 +12,4 @@ Description: free_hugepages surplus_hugepages resv_hugepages - See Documentation/vm/hugetlbpage.rst for details. + See Documentation/admin-guide/mm/hugetlbpage.rst for details. diff --git a/Documentation/vm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst similarity index 100% rename from Documentation/vm/hugetlbpage.rst rename to Documentation/admin-guide/mm/hugetlbpage.rst diff --git a/Documentation/vm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst similarity index 98% rename from Documentation/vm/idle_page_tracking.rst rename to Documentation/admin-guide/mm/idle_page_tracking.rst index d1c4609a5220..92e3a25d2deb 100644 --- a/Documentation/vm/idle_page_tracking.rst +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst @@ -65,7 +65,7 @@ workload one should: are not reclaimable, he or she can filter them out using ``/proc/kpageflags``. -See Documentation/vm/pagemap.rst for more information about +See Documentation/admin-guide/mm/pagemap.rst for more information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. .. 
_impl_details: diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index c47c16e13a18..6c8b554464bb 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -17,3 +17,12 @@ are described in Documentation/sysctl/vm.txt and in `man 5 proc`_. Here we document in detail how to interact with various mechanisms in the Linux memory management. + +.. toctree:: + :maxdepth: 1 + + hugetlbpage + idle_page_tracking + pagemap + soft-dirty + userfaultfd diff --git a/Documentation/vm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst similarity index 96% rename from Documentation/vm/pagemap.rst rename to Documentation/admin-guide/mm/pagemap.rst index 7ba8cbd57ad3..053ca64fd47a 100644 --- a/Documentation/vm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -18,7 +18,7 @@ There are four components to pagemap: * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped - * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst) + * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst) * Bit 56 page exclusively mapped (since 4.2) * Bits 57-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) @@ -97,7 +97,7 @@ Short descriptions to the page flags A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its head page and T donates its tail page(s). The major consumers of compound - pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc. + pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst), the SLUB etc. memory allocators and various device drivers. However in this interface, only huge/giga pages are made visible to end users. 
16 - COMPOUND_TAIL @@ -118,7 +118,7 @@ Short descriptions to the page flags zero page for pfn_zero or huge_zero page 25 - IDLE page has not been accessed since it was marked idle (see - Documentation/vm/idle_page_tracking.rst). Note that this flag may be + Documentation/admin-guide/mm/idle_page_tracking.rst). Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. diff --git a/Documentation/vm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst similarity index 100% rename from Documentation/vm/soft-dirty.rst rename to Documentation/admin-guide/mm/soft-dirty.rst diff --git a/Documentation/vm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst similarity index 100% rename from Documentation/vm/userfaultfd.rst rename to Documentation/admin-guide/mm/userfaultfd.rst diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 2d3984c70feb..ef53f808288d 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -515,7 +515,8 @@ guarantees: The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG bits on both physical and virtual pages associated with a process, and the -soft-dirty bit on pte (see Documentation/vm/soft-dirty.rst for details). +soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst +for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -536,7 +537,8 @@ Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags using /proc/kpageflags and number of times a page is mapped using -/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.rst. +/proc/kpagecount. For detailed explanation, see +Documentation/admin-guide/mm/pagemap.rst. 
The /proc/pid/numa_maps is an extension based on maps, showing the memory locality and binding policy, as well as the memory usage (in pages) of diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index c8e6d5b031e4..697ef8c225df 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -515,7 +515,7 @@ nr_hugepages Change the minimum size of the hugepage pool. -See Documentation/vm/hugetlbpage.rst +See Documentation/admin-guide/mm/hugetlbpage.rst ============================================================== @@ -524,7 +524,7 @@ nr_overcommit_hugepages Change the maximum size of the hugepage pool. The maximum is nr_hugepages + nr_overcommit_hugepages. -See Documentation/vm/hugetlbpage.rst +See Documentation/admin-guide/mm/hugetlbpage.rst ============================================================== diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index cda564d55b3c..f8a96ca16b7a 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -12,14 +12,10 @@ highmem.rst - Outline of highmem and common issues. hmm.rst - Documentation of heterogeneous memory management -hugetlbpage.rst - - a brief summary of hugetlbpage support in the Linux kernel. hugetlbfs_reserv.rst - A brief overview of hugetlbfs reservation design/implementation. hwpoison.rst - explains what hwpoison is -idle_page_tracking.rst - - description of the idle page tracking feature. ksm.rst - how to use the Kernel Samepage Merging feature. mmu_notifier.rst @@ -34,16 +30,12 @@ page_frags.rst - description of page fragments allocator page_migration.rst - description of page migration in NUMA systems. -pagemap.rst - - pagemap, from the userspace perspective page_owner.rst - tracking about who allocated each page remap_file_pages.rst - a note about remap_file_pages() system call slub.rst - a short users guide for SLUB. 
-soft-dirty.rst - - short explanation for soft-dirty PTEs split_page_table_lock.rst - Separate per-table lock to improve scalability of the old page_table_lock. swap_numa.rst @@ -52,8 +44,6 @@ transhuge.rst - Transparent Hugepage Support, alternative way of using hugepages. unevictable-lru.rst - Unevictable LRU infrastructure -userfaultfd.rst - - description of userfaultfd system call z3fold.txt - outline of z3fold allocator for storing compressed pages zsmalloc.rst diff --git a/Documentation/vm/hwpoison.rst b/Documentation/vm/hwpoison.rst index 070aa1e716b7..09bd24a92784 100644 --- a/Documentation/vm/hwpoison.rst +++ b/Documentation/vm/hwpoison.rst @@ -155,7 +155,7 @@ Testing value). This allows stress testing of many kinds of pages. The page_flags are the same as in /proc/kpageflags. The flag bits are defined in include/linux/kernel-page-flags.h and - documented in Documentation/vm/pagemap.rst + documented in Documentation/admin-guide/mm/pagemap.rst * Architecture specific MCE injector diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 6c451421a01e..ed58cb9f9675 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -13,15 +13,10 @@ various features of the Linux memory management .. toctree:: :maxdepth: 1 - hugetlbpage - idle_page_tracking ksm numa_memory_policy - pagemap transhuge - soft-dirty swap_numa - userfaultfd zswap Kernel developers MM documentation diff --git a/fs/Kconfig b/fs/Kconfig index ba53dc2a9691..ac4ac908f001 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -196,7 +196,7 @@ config HUGETLBFS help hugetlbfs is a filesystem backing for HugeTLB pages, based on ramfs. For architectures that support it, say Y here and read - for details. + for details. If unsure, say N. 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 333cda80c3dd..ed48b6e36202 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -937,7 +937,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, /* * The soft-dirty tracker uses #PF-s to catch writes * to pages, so write-protect the pte as well. See the - * Documentation/vm/soft-dirty.rst for full description + * Documentation/admin-guide/mm/soft-dirty.rst for full description * of how soft-dirty works. */ pte_t ptent = *pte; @@ -1417,7 +1417,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask, * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped - * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst) + * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst) * Bit 56 page exclusively mapped * Bits 57-60 zero * Bit 61 page is file-page or shared-anon diff --git a/mm/Kconfig b/mm/Kconfig index 9bdb0189caaf..2d7ef6207e1e 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -530,7 +530,7 @@ config MEM_SOFT_DIRTY into a page just as regular dirty bit, but unlike the latter it can be cleared by hands. - See Documentation/vm/soft-dirty.rst for more details. + See Documentation/admin-guide/mm/soft-dirty.rst for more details. config ZSWAP bool "Compressed cache for swap pages (EXPERIMENTAL)" @@ -656,7 +656,8 @@ config IDLE_PAGE_TRACKING be useful to tune memory cgroup limits and/or for job placement within a compute cluster. - See Documentation/vm/idle_page_tracking.rst for more details. + See Documentation/admin-guide/mm/idle_page_tracking.rst for + more details. 
# arch_add_memory() comprehends device memory config ARCH_HAS_ZONE_DEVICE From e27a20f104673f8ada70c5d32430a7f4c577fe95 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 18 Apr 2018 11:07:50 +0300 Subject: [PATCH 057/103] docs/admin-guide/mm: convert plain text cross references to hyperlinks Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/mm/hugetlbpage.rst | 3 ++- .../admin-guide/mm/idle_page_tracking.rst | 5 +++-- Documentation/admin-guide/mm/pagemap.rst | 18 +++++++++++------- 3 files changed, 16 insertions(+), 10 deletions(-) diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index 2b374d10284d..a8b0806377bb 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -219,7 +219,8 @@ When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any memory policy mode--bind, preferred, local or interleave--may be used. The resulting effect on persistent huge page allocation is as follows: -#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst], +#. Regardless of mempolicy mode [see + :ref:`Documentation/vm/numa_memory_policy.rst `], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified. However, if a node in the policy does not contain sufficient contiguous diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst index 92e3a25d2deb..6f7b7ca1add3 100644 --- a/Documentation/admin-guide/mm/idle_page_tracking.rst +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst @@ -65,8 +65,9 @@ workload one should: are not reclaimable, he or she can filter them out using ``/proc/kpageflags``. -See Documentation/admin-guide/mm/pagemap.rst for more information about -``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. 
+See :ref:`Documentation/admin-guide/mm/pagemap.rst ` for more +information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and +``/proc/kpagecgroup``. .. _impl_details: diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index 053ca64fd47a..577af85beb41 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -18,7 +18,8 @@ There are four components to pagemap: * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped - * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst) + * Bit 55 pte is soft-dirty (see + :ref:`Documentation/admin-guide/mm/soft-dirty.rst `) * Bit 56 page exclusively mapped (since 4.2) * Bits 57-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) @@ -97,9 +98,11 @@ Short descriptions to the page flags A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its head page and T donates its tail page(s). The major consumers of compound - pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst), the SLUB etc. - memory allocators and various device drivers. However in this interface, - only huge/giga pages are made visible to end users. + pages are hugeTLB pages + (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst `), + the SLUB etc. memory allocators and various device drivers. + However in this interface, only huge/giga pages are made visible + to end users. 16 - COMPOUND_TAIL A compound page tail (see description above). 17 - HUGE @@ -118,9 +121,10 @@ Short descriptions to the page flags zero page for pfn_zero or huge_zero page 25 - IDLE page has not been accessed since it was marked idle (see - Documentation/admin-guide/mm/idle_page_tracking.rst). Note that this flag may be - stale in case the page was accessed via a PTE. 
To make sure the flag - is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. + :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst `). + Note that this flag may be stale in case the page was accessed via + a PTE. To make sure the flag is up-to-date one has to read + ``/sys/kernel/mm/page_idle/bitmap`` first. IO related page flags --------------------- From eacc670fa452d8ec70b2eaa6189aefaa1708dd34 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Thu, 26 Apr 2018 18:11:02 -0700 Subject: [PATCH 058/103] documentation: core-api: rearrange a few kernel-api chapters and sections Rearrange some kernel-api chapters and sections to group them together better. - move Bit Operations from Basic C Library Functions to Basic Kernel Library Functions (now adjacent to Bitmap Operations since they are not typical C library functions) - move Sorting from Math Functions to Basic Kernel Library Functions since sort functions are more Basic than Math Functions - move Text Searching from Math Functions to Basic Kernel Library Functions (keep Sorting and Searching close to each other) - combine CRC and Math functions together into the (newly named) CRC and Math Functions chapter Signed-off-by: Randy Dunlap Acked-by: Matthew Wilcox Signed-off-by: Jonathan Corbet --- Documentation/core-api/kernel-api.rst | 60 +++++++++++++-------------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index 92f30006adae..8e44aea366c2 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -39,17 +39,17 @@ String Manipulation .. kernel-doc:: lib/string.c :export: +Basic Kernel Library Functions +============================== + +The Linux kernel provides more basic utility functions. + Bit Operations -------------- .. 
kernel-doc:: arch/x86/include/asm/bitops.h :internal: -Basic Kernel Library Functions -============================== - -The Linux kernel provides more basic utility functions. - Bitmap Operations ----------------- @@ -80,6 +80,31 @@ Command-line Parsing .. kernel-doc:: lib/cmdline.c :export: +Sorting +------- + +.. kernel-doc:: lib/sort.c + :export: + +.. kernel-doc:: lib/list_sort.c + :export: + +Text Searching +-------------- + +.. kernel-doc:: lib/textsearch.c + :doc: ts_intro + +.. kernel-doc:: lib/textsearch.c + :export: + +.. kernel-doc:: include/linux/textsearch.h + :functions: textsearch_find textsearch_next \ + textsearch_get_pattern textsearch_get_pattern_len + +CRC and Math Functions in Linux +=============================== + CRC Functions ------------- @@ -103,9 +128,6 @@ CRC Functions .. kernel-doc:: lib/crc-itu-t.c :export: -Math Functions in Linux -======================= - Base 2 log and power Functions ------------------------------ @@ -127,28 +149,6 @@ Division Functions .. kernel-doc:: lib/gcd.c :export: -Sorting -------- - -.. kernel-doc:: lib/sort.c - :export: - -.. kernel-doc:: lib/list_sort.c - :export: - -Text Searching --------------- - -.. kernel-doc:: lib/textsearch.c - :doc: ts_intro - -.. kernel-doc:: lib/textsearch.c - :export: - -.. kernel-doc:: include/linux/textsearch.h - :functions: textsearch_find textsearch_next \ - textsearch_get_pattern textsearch_get_pattern_len - UUID/GUID --------- From b976583f881814195c7f0ddbc4c541c915e84ae0 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Thu, 26 Apr 2018 18:29:41 -0700 Subject: [PATCH 059/103] Documentation: driver-api: fix device_connection.rst kernel-doc error Using incorrect :functions: syntax (extra space) causes an odd kernel-doc warning, so fix that. 
Documentation/driver-api/device_connection.rst:42: ERROR: Error in "kernel-doc" directive: Signed-off-by: Randy Dunlap Reviewed-by: Heikki Krogerus Signed-off-by: Jonathan Corbet --- Documentation/driver-api/device_connection.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/driver-api/device_connection.rst b/Documentation/driver-api/device_connection.rst index affbc5566ab0..ba364224c349 100644 --- a/Documentation/driver-api/device_connection.rst +++ b/Documentation/driver-api/device_connection.rst @@ -40,4 +40,4 @@ API --- .. kernel-doc:: drivers/base/devcon.c - : functions: device_connection_find_match device_connection_find device_connection_add device_connection_remove + :functions: device_connection_find_match device_connection_find device_connection_add device_connection_remove From 5a2ca3efe6a07a155674ccbe36ad66d0840ce2c1 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 24 Apr 2018 09:40:22 +0300 Subject: [PATCH 060/103] mm/ksm: docs: extend overview comment and make it "DOC:" The existing comment provides a good overview of KSM implementation. Let's update it to reflect recent additions of "chain" and "dup" variants of the stable tree nodes and mark it as "DOC:" for inclusion into the KSM documentation. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- mm/ksm.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/mm/ksm.c b/mm/ksm.c index 16451a2bf712..7d6558f3bac9 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -51,7 +51,9 @@ #define DO_NUMA(x) do { } while (0) #endif -/* +/** + * DOC: Overview + * * A few notes about the KSM scanning process, * to make it easier to understand the data structures below: * @@ -67,6 +69,21 @@ * this tree is fully assured to be working (except when pages are unmapped), * and therefore this tree is called the stable tree. 
* + * The stable tree node includes information required for reverse + * mapping from a KSM page to virtual addresses that map this page. + * + * In order to avoid large latencies of the rmap walks on KSM pages, + * KSM maintains two types of nodes in the stable tree: + * + * * the regular nodes that keep the reverse mapping structures in a + * linked list + * * the "chains" that link nodes ("dups") that represent the same + * write protected memory content, but each "dup" corresponds to a + * different KSM page copy of that content + * + * Internally, the regular nodes, "dups" and "chains" are represented + * using the same :c:type:`struct stable_node` structure. + * * In addition to the stable tree, KSM uses a second data structure called the * unstable tree: this tree holds pointers to pages which have been found to * be "unchanged for a period of time". The unstable tree sorts these pages From db12c00f13488ecf3eaa1842e845d34f6dc1a5c5 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 24 Apr 2018 09:40:23 +0300 Subject: [PATCH 061/103] docs/vm: ksm: (mostly) formatting updates Aside from the formatting: * fixed typos * added section and sub-section headers * moved ksmd overview after the description of KSM origins Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/ksm.rst | 110 ++++++++++++++++++++++++--------------- 1 file changed, 69 insertions(+), 41 deletions(-) diff --git a/Documentation/vm/ksm.rst b/Documentation/vm/ksm.rst index 87e7eef5ea9c..786d460a0e46 100644 --- a/Documentation/vm/ksm.rst +++ b/Documentation/vm/ksm.rst @@ -4,34 +4,50 @@ Kernel Samepage Merging ======================= +Overview +======== + KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, added to the Linux kernel in 2.6.32. 
See ``mm/ksm.c`` for its implementation, and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ -The KSM daemon ksmd periodically scans those areas of user memory which -have been registered with it, looking for pages of identical content which -can be replaced by a single write-protected page (which is automatically -copied if a process later wants to update its content). - KSM was originally developed for use with KVM (where it was known as Kernel Shared Memory), to fit more virtual machines into physical memory, by sharing the data common between them. But it can be useful to any application which generates many instances of the same data. +The KSM daemon ksmd periodically scans those areas of user memory +which have been registered with it, looking for pages of identical +content which can be replaced by a single write-protected page (which +is automatically copied if a process later wants to update its +content). The amount of pages that KSM daemon scans in a single pass +and the time between the passes are configured using :ref:`sysfs +interface <ksm_sysfs>` + KSM only merges anonymous (private) pages, never pagecache (file) pages. KSM's merged pages were originally locked into kernel memory, but can now be swapped out just like other user pages (but sharing is broken when they are swapped back in: ksmd must rediscover their identity and merge again). +Controlling KSM with madvise +============================ + KSM only operates on those areas of address space which an application has advised to be likely candidates for merging, by using the madvise(2) -system call: int madvise(addr, length, MADV_MERGEABLE). +system call:: -The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel -that advice and restore unshared pages: whereupon KSM unmerges whatever -it merged in that range. Note: this unmerging call may suddenly require -more memory than is available - possibly failing with EAGAIN, but more -probably arousing the Out-Of-Memory killer.
+ int madvise(addr, length, MADV_MERGEABLE) + +The app may call + +:: + + int madvise(addr, length, MADV_UNMERGEABLE) + +to cancel that advice and restore unshared pages: whereupon KSM +unmerges whatever it merged in that range. Note: this unmerging call +may suddenly require more memory than is available - possibly failing +with EAGAIN, but more probably arousing the Out-Of-Memory killer. If KSM is not configured into the running kernel, madvise MADV_MERGEABLE and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was @@ -43,7 +59,7 @@ MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. If a region of memory must be split into at least one new MADV_MERGEABLE or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process -will exceed vm.max_map_count (see Documentation/sysctl/vm.txt). +will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt). Like other madvise calls, they are intended for use on mapped areas of the user address space: they will report ENOMEM if the specified range @@ -54,21 +70,28 @@ Applications should be considerate in their use of MADV_MERGEABLE, restricting its use to areas likely to benefit. KSM's scans may use a lot of processing power: some installations will disable KSM for that reason. +.. _ksm_sysfs: + +KSM daemon sysfs interface +========================== + The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``, readable by all but writable only by root: pages_to_scan - how many present pages to scan before ksmd goes to sleep - e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan`` Default: 100 - (chosen for demonstration purposes) + how many pages to scan before ksmd goes to sleep + e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``. + + Default: 100 (chosen for demonstration purposes) sleep_millisecs how many milliseconds ksmd should sleep before next scan - e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs`` Default: 20 - (chosen for demonstration purposes) + e.g. 
``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs`` + + Default: 20 (chosen for demonstration purposes) merge_across_nodes - specifies if pages from different numa nodes can be merged. + specifies if pages from different NUMA nodes can be merged. When set to 0, ksm merges only pages which physically reside in the memory area of same NUMA node. That brings lower latency to access of shared pages. Systems with more nodes, at @@ -77,19 +100,21 @@ merge_across_nodes minimize memory usage, are likely to benefit from the greater sharing of setting 1 (default). You may wish to compare how your system performs under each setting, before deciding on - which to use. merge_across_nodes setting can be changed only - when there are no ksm shared pages in system: set run 2 to + which to use. ``merge_across_nodes`` setting can be changed only + when there are no ksm shared pages in the system: set run 2 to unmerge pages first, then to 1 after changing - merge_across_nodes, to remerge according to the new setting. + ``merge_across_nodes``, to remerge according to the new setting. + Default: 1 (merging across nodes as in earlier releases) run - set 0 to stop ksmd from running but keep merged pages, - set 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``, - set 2 to stop ksmd and unmerge all pages currently merged, but - leave mergeable areas registered for next run Default: 0 (must - be changed to 1 to activate KSM, except if CONFIG_SYSFS is - disabled) + * set to 0 to stop ksmd from running but keep merged pages, + * set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``, + * set to 2 to stop ksmd and unmerge all pages currently merged, but + leave mergeable areas registered for next run. + + Default: 0 (must be changed to 1 to activate KSM, except if + CONFIG_SYSFS is disabled) use_zero_pages specifies whether empty pages (i.e. 
allocated pages that only @@ -102,8 +127,9 @@ use_zero_pages KSM for some workloads, for example if the checksums of pages candidate for merging match the checksum of an empty page. This setting can be changed at any time, it is only - effective for pages merged after the change. Default: 0 - (normal KSM behaviour as in earlier releases) + effective for pages merged after the change. + + Default: 0 (normal KSM behaviour as in earlier releases) max_page_sharing Maximum sharing allowed for each KSM page. This enforces a @@ -112,7 +138,7 @@ max_page_sharing page will have at least two sharers. The rmap walk has O(N) complexity where N is the number of rmap_items (i.e. virtual mappings) that are sharing the page, which is in turn capped - by max_page_sharing. So this effectively spread the the linear + by ``max_page_sharing``. So this effectively spreads the linear O(N) computational complexity from rmap walk context over different KSM pages. The ksmd walk over the stable_node "chains" is also O(N), but N is the number of stable_node @@ -140,7 +166,7 @@ stable_node_chains_prune_millisecs metadata with lower latency, but they will make ksmd use more CPU during the scan. This only applies to the stable_node chains so it's a noop if not a single KSM page hit the - max_page_sharing yet (there would be no stable_node chains in + ``max_page_sharing`` yet (there would be no stable_node chains in such case). 
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: @@ -157,27 +183,29 @@ full_scans how many times all mergeable areas have been scanned stable_node_chains number of stable node chains allocated, this is effectively - the number of KSM pages that hit the max_page_sharing limit + the number of KSM pages that hit the ``max_page_sharing`` limit stable_node_dups number of stable node dups queued into the stable_node chains -A high ratio of pages_sharing to pages_shared indicates good sharing, but -a high ratio of pages_unshared to pages_sharing indicates wasted effort. -pages_volatile embraces several different kinds of activity, but a high -proportion there would also indicate poor use of madvise MADV_MERGEABLE. +A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good +sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing`` +indicates wasted effort. ``pages_volatile`` embraces several +different kinds of activity, but a high proportion there would also +indicate poor use of madvise MADV_MERGEABLE. -The maximum possible page_sharing/page_shared ratio is limited by the -max_page_sharing tunable. To increase the ratio max_page_sharing must +The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the +``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must be increased accordingly. 
-The stable_node_dups/stable_node_chains ratio is also affected by the -max_page_sharing tunable, and an high ratio may indicate fragmentation +The ``stable_node_dups/stable_node_chains`` ratio is also affected by the +``max_page_sharing`` tunable, and an high ratio may indicate fragmentation in the stable_node dups, which could be solved by introducing fragmentation algorithms in ksmd which would refile rmap_items from -one stable_node dup to another stable_node dup, in order to freeup +one stable_node dup to another stable_node dup, in order to free up stable_node "dups" with few rmap_items in them, but that may increase the ksmd CPU usage and possibly slowdown the readonly computations on the KSM pages of the applications. +-- Izik Eidus, Hugh Dickins, 17 Nov 2009 From 064fca37bc0545fec0b5abdf9ce09136b73d7083 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 24 Apr 2018 09:40:24 +0300 Subject: [PATCH 062/103] docs/vm: ksm: add "Design" section Include the KSM description from the source code comment, add a subsection about reverse mapping and include kernel-doc references for KSM data structures. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/ksm.rst | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/Documentation/vm/ksm.rst b/Documentation/vm/ksm.rst index 786d460a0e46..0e5a085694e5 100644 --- a/Documentation/vm/ksm.rst +++ b/Documentation/vm/ksm.rst @@ -206,6 +206,45 @@ stable_node "dups" with few rmap_items in them, but that may increase the ksmd CPU usage and possibly slowdown the readonly computations on the KSM pages of the applications. +Design +====== + +Overview +-------- + +.. kernel-doc:: mm/ksm.c + :DOC: Overview + +Reverse mapping +--------------- +KSM maintains reverse mapping information for KSM pages in the stable +tree. 
+ +If a KSM page is shared between less than ``max_page_sharing`` VMAs, +the node of the stable tree that represents such KSM page points to a +list of :c:type:`struct rmap_item` and the ``page->mapping`` of the +KSM page points to the stable tree node. + +When the sharing passes this threshold, KSM adds a second dimension to +the stable tree. The tree node becomes a "chain" that links one or +more "dups". Each "dup" keeps reverse mapping information for a KSM +page with ``page->mapping`` pointing to that "dup". + +Every "chain" and all "dups" linked into a "chain" enforce the +invariant that they represent the same write protected memory content, +even if each "dup" will be pointed by a different KSM page copy of +that content. + +This way the stable tree lookup computational complexity is unaffected +if compared to an unlimited list of reverse mappings. It is still +enforced that there cannot be KSM page content duplicates in the +stable tree itself. + +Reference +--------- +.. kernel-doc:: mm/ksm.c + :functions: mm_slot ksm_scan stable_node rmap_item + -- Izik Eidus, Hugh Dickins, 17 Nov 2009 From 6570c785ea8fdb3c6e8f7591d25d33fd519f928b Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 24 Apr 2018 09:40:25 +0300 Subject: [PATCH 063/103] docs/vm: ksm: reshuffle text between "sysfs" and "design" sections The description of "max_page_sharing" sysfs attribute includes lots of implementation details that more naturally belong in the "Design" section. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/ksm.rst | 51 +++++++++++++++++++++++----------------- 1 file changed, 30 insertions(+), 21 deletions(-) diff --git a/Documentation/vm/ksm.rst b/Documentation/vm/ksm.rst index 0e5a085694e5..00961b8ab03e 100644 --- a/Documentation/vm/ksm.rst +++ b/Documentation/vm/ksm.rst @@ -133,31 +133,21 @@ use_zero_pages max_page_sharing Maximum sharing allowed for each KSM page. 
This enforces a - deduplication limit to avoid the virtual memory rmap lists to - grow too large. The minimum value is 2 as a newly created KSM - page will have at least two sharers. The rmap walk has O(N) - complexity where N is the number of rmap_items (i.e. virtual - mappings) that are sharing the page, which is in turn capped - by ``max_page_sharing``. So this effectively spreads the linear - O(N) computational complexity from rmap walk context over - different KSM pages. The ksmd walk over the stable_node - "chains" is also O(N), but N is the number of stable_node - "dups", not the number of rmap_items, so it has not a - significant impact on ksmd performance. In practice the best - stable_node "dup" candidate will be kept and found at the head - of the "dups" list. The higher this value the faster KSM will - merge the memory (because there will be fewer stable_node dups - queued into the stable_node chain->hlist to check for pruning) - and the higher the deduplication factor will be, but the - slowest the worst case rmap walk could be for any given KSM - page. Slowing down the rmap_walk means there will be higher + deduplication limit to avoid high latency for virtual memory + operations that involve traversal of the virtual mappings that + share the KSM page. The minimum value is 2 as a newly created + KSM page will have at least two sharers. The higher this value + the faster KSM will merge the memory and the higher the + deduplication factor will be, but the slower the worst case + virtual mappings traversal could be for any given KSM + page. Slowing down this traversal means there will be higher latency for certain virtual memory operations happening during swapping, compaction, NUMA balancing and page migration, in turn decreasing responsiveness for the caller of those virtual memory operations. 
The scheduler latency of other tasks not - involved with the VM operations doing the rmap walk is not - affected by this parameter as the rmap walks are always - schedule friendly themselves. + involved with the VM operations doing the virtual mappings + traversal is not affected by this parameter as these + traversals are always schedule friendly themselves. stable_node_chains_prune_millisecs How frequently to walk the whole list of stable_node "dups" @@ -240,6 +230,25 @@ if compared to an unlimited list of reverse mappings. It is still enforced that there cannot be KSM page content duplicates in the stable tree itself. +The deduplication limit enforced by ``max_page_sharing`` is required +to avoid the virtual memory rmap lists to grow too large. The rmap +walk has O(N) complexity where N is the number of rmap_items +(i.e. virtual mappings) that are sharing the page, which is in turn +capped by ``max_page_sharing``. So this effectively spreads the linear +O(N) computational complexity from rmap walk context over different +KSM pages. The ksmd walk over the stable_node "chains" is also O(N), +but N is the number of stable_node "dups", not the number of +rmap_items, so it has not a significant impact on ksmd performance. In +practice the best stable_node "dup" candidate will be kept and found +at the head of the "dups" list. + +High values of ``max_page_sharing`` result in faster memory merging +(because there will be fewer stable_node dups queued into the +stable_node chain->hlist to check for pruning) and higher +deduplication factor at the expense of slower worst case for rmap +walks for any KSM page which can happen during swapping, compaction, +NUMA balancing and page migration. + Reference --------- .. 
kernel-doc:: mm/ksm.c From 2a695ca412943b88abde5dda0d5b6876fd154ac8 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 24 Apr 2018 09:40:26 +0300 Subject: [PATCH 064/103] docs/vm: ksm: update stable_node_chains_prune_millisecs description Make the description of stable_node_chains_prune_millisecs sysfs parameter less implementation aware and add a few words about this parameter in the "Design" section. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/ksm.rst | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/Documentation/vm/ksm.rst b/Documentation/vm/ksm.rst index 00961b8ab03e..18d7c710a5f3 100644 --- a/Documentation/vm/ksm.rst +++ b/Documentation/vm/ksm.rst @@ -150,14 +150,12 @@ max_page_sharing traversals are always schedule friendly themselves. stable_node_chains_prune_millisecs - How frequently to walk the whole list of stable_node "dups" - linked in the stable_node "chains" in order to prune stale - stable_nodes. Smaller milllisecs values will free up the KSM - metadata with lower latency, but they will make ksmd use more - CPU during the scan. This only applies to the stable_node - chains so it's a noop if not a single KSM page hit the - ``max_page_sharing`` yet (there would be no stable_node chains in - such case). + specifies how frequently KSM checks the metadata of the pages + that hit the deduplication limit for stale information. + Smaller milllisecs values will free up the KSM metadata with + lower latency, but they will make ksmd use more CPU during the + scan. It's a noop if not a single KSM page hit the + ``max_page_sharing`` yet. The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: @@ -249,6 +247,11 @@ deduplication factor at the expense of slower worst case for rmap walks for any KSM page which can happen during swapping, compaction, NUMA balancing and page migration. 
+The whole list of stable_node "dups" linked in the stable_node +"chains" is scanned periodically in order to prune stale stable_nodes. +The frequency of such scans is defined by +``stable_node_chains_prune_millisecs`` sysfs tunable. + Reference --------- .. kernel-doc:: mm/ksm.c From 8b898fd11414a365b1e024d027a76f6bb0b12b6e Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 24 Apr 2018 09:40:27 +0300 Subject: [PATCH 065/103] docs/vm: ksm: udpate description of stable_node_{dups,chains} Remove implementation details from sysfs parameter descriptions. Also move the paragraph discussing fragmentation issues and their possible solution to the "Design" section. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/ksm.rst | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/Documentation/vm/ksm.rst b/Documentation/vm/ksm.rst index 18d7c710a5f3..afcf5a8fc4a5 100644 --- a/Documentation/vm/ksm.rst +++ b/Documentation/vm/ksm.rst @@ -170,10 +170,9 @@ pages_volatile full_scans how many times all mergeable areas have been scanned stable_node_chains - number of stable node chains allocated, this is effectively the number of KSM pages that hit the ``max_page_sharing`` limit stable_node_dups - number of stable node dups queued into the stable_node chains + number of duplicated KSM pages A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing`` @@ -185,15 +184,6 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the ``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must be increased accordingly. 
-The ``stable_node_dups/stable_node_chains`` ratio is also affected by the -``max_page_sharing`` tunable, and an high ratio may indicate fragmentation -in the stable_node dups, which could be solved by introducing -fragmentation algorithms in ksmd which would refile rmap_items from -one stable_node dup to another stable_node dup, in order to free up -stable_node "dups" with few rmap_items in them, but that may increase -the ksmd CPU usage and possibly slowdown the readonly computations on -the KSM pages of the applications. - Design ====== @@ -247,6 +237,15 @@ deduplication factor at the expense of slower worst case for rmap walks for any KSM page which can happen during swapping, compaction, NUMA balancing and page migration. +The ``stable_node_dups/stable_node_chains`` ratio is also affected by the +``max_page_sharing`` tunable, and a high ratio may indicate fragmentation +in the stable_node dups, which could be solved by introducing +defragmentation algorithms in ksmd which would refile rmap_items from +one stable_node dup to another stable_node dup, in order to free up +stable_node "dups" with few rmap_items in them, but that may increase +the ksmd CPU usage and possibly slow down the read-only computations on +the KSM pages of the applications. + The whole list of stable_node "dups" linked in the stable_node "chains" is scanned periodically in order to prune stale stable_nodes.
The frequency of such scans is defined by From c9161088e54b56d7ff8c92fd9e18b0fb7a20b2b3 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 24 Apr 2018 09:40:28 +0300 Subject: [PATCH 066/103] docs/vm: ksm: split userspace interface to admin-guide/mm/ksm.rst Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/mm/index.rst | 1 + Documentation/admin-guide/mm/ksm.rst | 189 +++++++++++++++++++++++++ Documentation/vm/ksm.rst | 176 +---------------------- 3 files changed, 191 insertions(+), 175 deletions(-) create mode 100644 Documentation/admin-guide/mm/ksm.rst diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 6c8b554464bb..ad28644fee35 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -23,6 +23,7 @@ the Linux memory management. hugetlbpage idle_page_tracking + ksm pagemap soft-dirty userfaultfd diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst new file mode 100644 index 000000000000..9303786632d1 --- /dev/null +++ b/Documentation/admin-guide/mm/ksm.rst @@ -0,0 +1,189 @@ +.. _admin_guide_ksm: + +======================= +Kernel Samepage Merging +======================= + +Overview +======== + +KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, +added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation, +and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ + +KSM was originally developed for use with KVM (where it was known as +Kernel Shared Memory), to fit more virtual machines into physical memory, +by sharing the data common between them. But it can be useful to any +application which generates many instances of the same data. 
+ +The KSM daemon ksmd periodically scans those areas of user memory +which have been registered with it, looking for pages of identical +content which can be replaced by a single write-protected page (which +is automatically copied if a process later wants to update its +content). The number of pages that the KSM daemon scans in a single pass +and the time between the passes are configured using :ref:`sysfs +interface <ksm_sysfs>`. + +KSM only merges anonymous (private) pages, never pagecache (file) pages. +KSM's merged pages were originally locked into kernel memory, but can now +be swapped out just like other user pages (but sharing is broken when they +are swapped back in: ksmd must rediscover their identity and merge again). + +Controlling KSM with madvise +============================ + +KSM only operates on those areas of address space which an application +has advised to be likely candidates for merging, by using the madvise(2) +system call:: + + int madvise(addr, length, MADV_MERGEABLE) + +The app may call + +:: + + int madvise(addr, length, MADV_UNMERGEABLE) + +to cancel that advice and restore unshared pages: whereupon KSM +unmerges whatever it merged in that range. Note: this unmerging call +may suddenly require more memory than is available - possibly failing +with EAGAIN, but more probably arousing the Out-Of-Memory killer. + +If KSM is not configured into the running kernel, madvise MADV_MERGEABLE +and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was +built with CONFIG_KSM=y, those calls will normally succeed: even if +the KSM daemon is not currently running, MADV_MERGEABLE still registers +the range for whenever the KSM daemon is started; even if the range +cannot contain any pages which KSM could actually merge; even if +MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
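As an illustration of the two madvise calls described above, here is a minimal C sketch. The helper names ``ksm_alloc_mergeable`` and ``ksm_release`` are hypothetical and exist only for this example; the only real interfaces used are mmap(2), madvise(2) and munmap(2). The sketch assumes a Linux system with glibc-style headers.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS and MADV_* on glibc */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous private memory and
 * advise the kernel that KSM may merge it.
 * Returns the mapping, or NULL if mmap() itself failed.
 * *advice_err receives 0 on success, or the errno left by madvise()
 * (EINVAL is expected on kernels built without CONFIG_KSM).
 */
void *ksm_alloc_mergeable(size_t len, int *advice_err)
{
    void *p;

    *advice_err = 0;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    if (madvise(p, len, MADV_MERGEABLE) != 0)
        *advice_err = errno;
    return p;
}

/*
 * Hypothetical helper: cancel the advice, then unmap the range.
 * Returns 0, or the errno left by madvise(MADV_UNMERGEABLE).
 */
int ksm_release(void *p, size_t len)
{
    int err = 0;

    if (madvise(p, len, MADV_UNMERGEABLE) != 0)
        err = errno;
    munmap(p, len);
    return err;
}
```

A caller would probably treat EINVAL from the first helper as "KSM is not built into this kernel" and carry on using the memory without merging, since the mapping itself is still valid.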
+ +If a region of memory must be split into at least one new MADV_MERGEABLE +or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process +will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt). + +Like other madvise calls, they are intended for use on mapped areas of +the user address space: they will report ENOMEM if the specified range +includes unmapped gaps (though working on the intervening mapped areas), +and might fail with EAGAIN if not enough memory for internal structures. + +Applications should be considerate in their use of MADV_MERGEABLE, +restricting its use to areas likely to benefit. KSM's scans may use a lot +of processing power: some installations will disable KSM for that reason. + +.. _ksm_sysfs: + +KSM daemon sysfs interface +========================== + +The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``, +readable by all but writable only by root: + +pages_to_scan + how many pages to scan before ksmd goes to sleep + e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``. + + Default: 100 (chosen for demonstration purposes) + +sleep_millisecs + how many milliseconds ksmd should sleep before next scan + e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs`` + + Default: 20 (chosen for demonstration purposes) + +merge_across_nodes + specifies if pages from different NUMA nodes can be merged. + When set to 0, ksm merges only pages which physically reside + in the memory area of same NUMA node. That brings lower + latency to access of shared pages. Systems with more nodes, at + significant NUMA distances, are likely to benefit from the + lower latency of setting 0. Smaller systems, which need to + minimize memory usage, are likely to benefit from the greater + sharing of setting 1 (default). You may wish to compare how + your system performs under each setting, before deciding on + which to use. 
``merge_across_nodes`` setting can be changed only + when there are no ksm shared pages in the system: set run 2 to + unmerge pages first, then to 1 after changing + ``merge_across_nodes``, to remerge according to the new setting. + + Default: 1 (merging across nodes as in earlier releases) + +run + * set to 0 to stop ksmd from running but keep merged pages, + * set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``, + * set to 2 to stop ksmd and unmerge all pages currently merged, but + leave mergeable areas registered for next run. + + Default: 0 (must be changed to 1 to activate KSM, except if + CONFIG_SYSFS is disabled) + +use_zero_pages + specifies whether empty pages (i.e. allocated pages that only + contain zeroes) should be treated specially. When set to 1, + empty pages are merged with the kernel zero page(s) instead of + with each other as it would happen normally. This can improve + the performance on architectures with coloured zero pages, + depending on the workload. Care should be taken when enabling + this setting, as it can potentially degrade the performance of + KSM for some workloads, for example if the checksums of pages + candidate for merging match the checksum of an empty + page. This setting can be changed at any time, it is only + effective for pages merged after the change. + + Default: 0 (normal KSM behaviour as in earlier releases) + +max_page_sharing + Maximum sharing allowed for each KSM page. This enforces a + deduplication limit to avoid high latency for virtual memory + operations that involve traversal of the virtual mappings that + share the KSM page. The minimum value is 2 as a newly created + KSM page will have at least two sharers. The higher this value + the faster KSM will merge the memory and the higher the + deduplication factor will be, but the slower the worst case + virtual mappings traversal could be for any given KSM + page. 
Slowing down this traversal means there will be higher + latency for certain virtual memory operations happening during + swapping, compaction, NUMA balancing and page migration, in + turn decreasing responsiveness for the caller of those virtual + memory operations. The scheduler latency of other tasks not + involved with the VM operations doing the virtual mappings + traversal is not affected by this parameter as these + traversals are always schedule friendly themselves. + +stable_node_chains_prune_millisecs + specifies how frequently KSM checks the metadata of the pages + that hit the deduplication limit for stale information. + Smaller millisecs values will free up the KSM metadata with + lower latency, but they will make ksmd use more CPU during the + scan. It's a no-op if not a single KSM page hit the + ``max_page_sharing`` yet. + +The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: + +pages_shared + how many shared pages are being used +pages_sharing + how many more sites are sharing them i.e. how much saved +pages_unshared + how many pages unique but repeatedly checked for merging +pages_volatile + how many pages changing too fast to be placed in a tree +full_scans + how many times all mergeable areas have been scanned +stable_node_chains + the number of KSM pages that hit the ``max_page_sharing`` limit +stable_node_dups + number of duplicated KSM pages + +A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good +sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing`` +indicates wasted effort. ``pages_volatile`` embraces several +different kinds of activity, but a high proportion there would also +indicate poor use of madvise MADV_MERGEABLE. + +The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the +``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must +be increased accordingly.
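The ``pages_sharing/pages_shared`` ratio discussed above could be computed from the sysfs counters roughly as follows. This is a sketch under stated assumptions, not part of the kernel tree: ``ksm_read_counter`` and ``ksm_sharing_ratio`` are hypothetical helper names, and each counter file is assumed to hold a single decimal number.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: read one decimal counter (e.g. "pages_shared")
 * from a KSM sysfs directory, normally /sys/kernel/mm/ksm.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
long ksm_read_counter(const char *dir, const char *name)
{
    char path[256];
    long value = -1;
    FILE *f;
    int n;

    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/*
 * Hypothetical helper: ratio of pages_sharing to pages_shared.
 * Returns 0.0 when nothing is shared yet or a counter is invalid,
 * which avoids dividing by zero.
 */
double ksm_sharing_ratio(long pages_sharing, long pages_shared)
{
    if (pages_shared <= 0 || pages_sharing < 0)
        return 0.0;
    return (double)pages_sharing / (double)pages_shared;
}
```

For example, with ``pages_sharing`` at 300 and ``pages_shared`` at 100 the helper yields 3.0, meaning each KSM page stands in for three further mappings on average. Because ``max_page_sharing`` caps the number of sharers per KSM page, this ratio cannot grow without also raising that tunable.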
+ +-- +Izik Eidus, +Hugh Dickins, 17 Nov 2009 diff --git a/Documentation/vm/ksm.rst b/Documentation/vm/ksm.rst index afcf5a8fc4a5..d32016d9be2c 100644 --- a/Documentation/vm/ksm.rst +++ b/Documentation/vm/ksm.rst @@ -4,185 +4,11 @@ Kernel Samepage Merging ======================= -Overview -======== - KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation, and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ -KSM was originally developed for use with KVM (where it was known as -Kernel Shared Memory), to fit more virtual machines into physical memory, -by sharing the data common between them. But it can be useful to any -application which generates many instances of the same data. - -The KSM daemon ksmd periodically scans those areas of user memory -which have been registered with it, looking for pages of identical -content which can be replaced by a single write-protected page (which -is automatically copied if a process later wants to update its -content). The amount of pages that KSM daemon scans in a single pass -and the time between the passes are configured using :ref:`sysfs -intraface ` - -KSM only merges anonymous (private) pages, never pagecache (file) pages. -KSM's merged pages were originally locked into kernel memory, but can now -be swapped out just like other user pages (but sharing is broken when they -are swapped back in: ksmd must rediscover their identity and merge again). - -Controlling KSM with madvise -============================ - -KSM only operates on those areas of address space which an application -has advised to be likely candidates for merging, by using the madvise(2) -system call:: - - int madvise(addr, length, MADV_MERGEABLE) - -The app may call - -:: - - int madvise(addr, length, MADV_UNMERGEABLE) - -to cancel that advice and restore unshared pages: whereupon KSM -unmerges whatever it merged in that range. 
Note: this unmerging call -may suddenly require more memory than is available - possibly failing -with EAGAIN, but more probably arousing the Out-Of-Memory killer. - -If KSM is not configured into the running kernel, madvise MADV_MERGEABLE -and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was -built with CONFIG_KSM=y, those calls will normally succeed: even if the -the KSM daemon is not currently running, MADV_MERGEABLE still registers -the range for whenever the KSM daemon is started; even if the range -cannot contain any pages which KSM could actually merge; even if -MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. - -If a region of memory must be split into at least one new MADV_MERGEABLE -or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process -will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt). - -Like other madvise calls, they are intended for use on mapped areas of -the user address space: they will report ENOMEM if the specified range -includes unmapped gaps (though working on the intervening mapped areas), -and might fail with EAGAIN if not enough memory for internal structures. - -Applications should be considerate in their use of MADV_MERGEABLE, -restricting its use to areas likely to benefit. KSM's scans may use a lot -of processing power: some installations will disable KSM for that reason. - -.. _ksm_sysfs: - -KSM daemon sysfs interface -========================== - -The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``, -readable by all but writable only by root: - -pages_to_scan - how many pages to scan before ksmd goes to sleep - e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``. - - Default: 100 (chosen for demonstration purposes) - -sleep_millisecs - how many milliseconds ksmd should sleep before next scan - e.g. 
``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs`` - - Default: 20 (chosen for demonstration purposes) - -merge_across_nodes - specifies if pages from different NUMA nodes can be merged. - When set to 0, ksm merges only pages which physically reside - in the memory area of same NUMA node. That brings lower - latency to access of shared pages. Systems with more nodes, at - significant NUMA distances, are likely to benefit from the - lower latency of setting 0. Smaller systems, which need to - minimize memory usage, are likely to benefit from the greater - sharing of setting 1 (default). You may wish to compare how - your system performs under each setting, before deciding on - which to use. ``merge_across_nodes`` setting can be changed only - when there are no ksm shared pages in the system: set run 2 to - unmerge pages first, then to 1 after changing - ``merge_across_nodes``, to remerge according to the new setting. - - Default: 1 (merging across nodes as in earlier releases) - -run - * set to 0 to stop ksmd from running but keep merged pages, - * set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``, - * set to 2 to stop ksmd and unmerge all pages currently merged, but - leave mergeable areas registered for next run. - - Default: 0 (must be changed to 1 to activate KSM, except if - CONFIG_SYSFS is disabled) - -use_zero_pages - specifies whether empty pages (i.e. allocated pages that only - contain zeroes) should be treated specially. When set to 1, - empty pages are merged with the kernel zero page(s) instead of - with each other as it would happen normally. This can improve - the performance on architectures with coloured zero pages, - depending on the workload. Care should be taken when enabling - this setting, as it can potentially degrade the performance of - KSM for some workloads, for example if the checksums of pages - candidate for merging match the checksum of an empty - page. 
This setting can be changed at any time, it is only - effective for pages merged after the change. - - Default: 0 (normal KSM behaviour as in earlier releases) - -max_page_sharing - Maximum sharing allowed for each KSM page. This enforces a - deduplication limit to avoid high latency for virtual memory - operations that involve traversal of the virtual mappings that - share the KSM page. The minimum value is 2 as a newly created - KSM page will have at least two sharers. The higher this value - the faster KSM will merge the memory and the higher the - deduplication factor will be, but the slower the worst case - virtual mappings traversal could be for any given KSM - page. Slowing down this traversal means there will be higher - latency for certain virtual memory operations happening during - swapping, compaction, NUMA balancing and page migration, in - turn decreasing responsiveness for the caller of those virtual - memory operations. The scheduler latency of other tasks not - involved with the VM operations doing the virtual mappings - traversal is not affected by this parameter as these - traversals are always schedule friendly themselves. - -stable_node_chains_prune_millisecs - specifies how frequently KSM checks the metadata of the pages - that hit the deduplication limit for stale information. - Smaller milllisecs values will free up the KSM metadata with - lower latency, but they will make ksmd use more CPU during the - scan. It's a noop if not a single KSM page hit the - ``max_page_sharing`` yet. - -The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: - -pages_shared - how many shared pages are being used -pages_sharing - how many more sites are sharing them i.e. 
how much saved -pages_unshared - how many pages unique but repeatedly checked for merging -pages_volatile - how many pages changing too fast to be placed in a tree -full_scans - how many times all mergeable areas have been scanned -stable_node_chains - the number of KSM pages that hit the ``max_page_sharing`` limit -stable_node_dups - number of duplicated KSM pages - -A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good -sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing`` -indicates wasted effort. ``pages_volatile`` embraces several -different kinds of activity, but a high proportion there would also -indicate poor use of madvise MADV_MERGEABLE. - -The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the -``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must -be increased accordingly. +The userspace interface of KSM is described in :ref:`Documentation/admin-guide/mm/ksm.rst ` Design ====== From 1897e8f394c50124f90d6c1be672f05846438bf8 Mon Sep 17 00:00:00 2001 From: Daniel Vetter Date: Wed, 2 May 2018 09:51:06 +0200 Subject: [PATCH 067/103] doc: botching-up-ioctls: Make it clearer why structs must be padded This came up in discussions when reviewing drm patches. Reviewed-by: Eric Anholt Signed-off-by: Daniel Vetter Signed-off-by: Jonathan Corbet --- Documentation/ioctl/botching-up-ioctls.txt | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/Documentation/ioctl/botching-up-ioctls.txt b/Documentation/ioctl/botching-up-ioctls.txt index d02cfb48901c..883fb034bd04 100644 --- a/Documentation/ioctl/botching-up-ioctls.txt +++ b/Documentation/ioctl/botching-up-ioctls.txt @@ -73,7 +73,9 @@ will have a second iteration or at least an extension for any given interface. future extensions is going right down the gutters since someone will submit an ioctl struct with random stack garbage in the yet unused parts. 
Which then bakes in the ABI that those fields can never be used for anything else - but garbage. + but garbage. This is also the reason why you must explicitly pad all + structures, even if you never use them in an array - the padding the compiler + might insert could contain garbage. * Have simple testcases for all of the above. From f318a44e15c16307b3f95751b674cb5d63789eb6 Mon Sep 17 00:00:00 2001 From: Dong Bo Date: Mon, 7 May 2018 11:02:10 +0800 Subject: [PATCH 068/103] vfio: fix documentation Update vfio_add_group_dev description to match the current API. Signed-off-by: Dong Bo Reviewed-by: Cornelia Huck Signed-off-by: Jonathan Corbet --- Documentation/vfio.txt | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index ef6a5111eaa1..f1a4d3c3ba0b 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -252,15 +252,14 @@ into VFIO core. When devices are bound and unbound to the driver, the driver should call vfio_add_group_dev() and vfio_del_group_dev() respectively:: - extern int vfio_add_group_dev(struct iommu_group *iommu_group, - struct device *dev, + extern int vfio_add_group_dev(struct device *dev, const struct vfio_device_ops *ops, void *device_data); extern void *vfio_del_group_dev(struct device *dev); vfio_add_group_dev() indicates to the core to begin tracking the -specified iommu_group and register the specified dev as owned by +iommu_group of the specified dev and register the dev as owned by a VFIO bus driver. The driver provides an ops structure for callbacks similar to a file operations structure:: From f6dbf65b6558e9587570cb5b8345b967175285b5 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Sun, 6 May 2018 11:50:29 -0700 Subject: [PATCH 069/103] Documentation: block: cmdline-partition.txt fixes and additions Make the description of the kernel command line option "blkdevparts" a bit more flowing and readable. Fix a few typos. Add the optional and suffixes. 
Note that size can be "-" to indicate all of the remaining space. Signed-off-by: Randy Dunlap Cc: Cai Zhiyong Signed-off-by: Jonathan Corbet --- Documentation/block/cmdline-partition.txt | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/Documentation/block/cmdline-partition.txt b/Documentation/block/cmdline-partition.txt index 525b9f6d7fb4..760a3f7c3ed4 100644 --- a/Documentation/block/cmdline-partition.txt +++ b/Documentation/block/cmdline-partition.txt @@ -1,7 +1,9 @@ Embedded device command line partition parsing ===================================================================== -Support for reading the block device partition table from the command line. +The "blkdevparts" command line option adds support for reading the +block device partition table from the kernel command line. + It is typically used for fixed block (eMMC) embedded devices. It has no MBR, so saves storage space. Bootloader can be easily accessed by absolute address of data on the block device. @@ -14,22 +16,27 @@ blkdevparts=[;] := [@](part-name) - block device disk name, embedded device used fixed block device, - it's disk name also fixed. such as: mmcblk0, mmcblk1, mmcblk0boot0. + block device disk name. Embedded device uses fixed block device. + Its disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0. partition size, in bytes, such as: 512, 1m, 1G. + size may contain an optional suffix of (upper or lower case): + K, M, G, T, P, E. + "-" is used to denote all remaining space. partition start address, in bytes. + offset may contain an optional suffix of (upper or lower case): + K, M, G, T, P, E. (part-name) - partition name, kernel send uevent with "PARTNAME". application can create - a link to block device partition with the name "PARTNAME". - user space application can access partition by partition name. + partition name. Kernel sends uevent with "PARTNAME". 
Application can + create a link to block device partition with the name "PARTNAME". + User space application can access partition by partition name. Example: - eMMC disk name is "mmcblk0" and "mmcblk0boot0" + eMMC disk names are "mmcblk0" and "mmcblk0boot0". bootargs: 'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)' From be99f610a11002e877cee2418466d8505b813937 Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Mon, 7 May 2018 12:43:38 +0200 Subject: [PATCH 070/103] Documentation/features: Add script that refreshes the arch support status files in place Provides the script: Documentation/features/scripts/features-refresh.sh which operates on the arch-support.txt files and refreshes them in place. This way [1], "[...] we soft- decouple the refreshing of the entries from the introduction of the features, while still making it all easy to keep sync and to extend." [1] http://lkml.kernel.org/r/20180328122211.GA25420@andrea Suggested-by: Ingo Molnar Signed-off-by: Andrea Parri Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Andrew Morton Signed-off-by: Jonathan Corbet --- .../features/scripts/features-refresh.sh | 98 +++++++++++++++++++ 1 file changed, 98 insertions(+) create mode 100755 Documentation/features/scripts/features-refresh.sh diff --git a/Documentation/features/scripts/features-refresh.sh b/Documentation/features/scripts/features-refresh.sh new file mode 100755 index 000000000000..9e72d38a0720 --- /dev/null +++ b/Documentation/features/scripts/features-refresh.sh @@ -0,0 +1,98 @@ +# +# Small script that refreshes the kernel feature support status in place. +# + +for F_FILE in Documentation/features/*/*/arch-support.txt; do + F=$(grep "^# Kconfig:" "$F_FILE" | cut -c26-) + + # + # Each feature F is identified by a pair (O, K), where 'O' can + # be either the empty string (for 'nop') or "not" (the logical + # negation operator '!'); other operators are not supported. 
+ # + O="" + K=$F + if [[ "$F" == !* ]]; then + O="not" + K=$(echo $F | sed -e 's/^!//g') + fi + + # + # F := (O, K) is 'valid' iff there is a Kconfig file (for some + # arch) which contains K. + # + # Notice that this definition entails an 'asymmetry' between + # the case 'O = ""' and the case 'O = "not"'. E.g., F may be + # _invalid_ if: + # + # [case 'O = ""'] + # 1) no arch provides support for F, + # 2) K does not exist (e.g., it was renamed/mis-typed); + # + # [case 'O = "not"'] + # 3) all archs provide support for F, + # 4) as in (2). + # + # The rationale for adopting this definition (and, thus, for + # keeping the asymmetry) is: + # + # We want to be able to 'detect' (2) (or (4)). + # + # (1) and (3) may further warn the developers about the fact + # that K can be removed. + # + F_VALID="false" + for ARCH_DIR in arch/*/; do + K_FILES=$(find $ARCH_DIR -name "Kconfig*") + K_GREP=$(grep "$K" $K_FILES) + if [ ! -z "$K_GREP" ]; then + F_VALID="true" + break + fi + done + if [ "$F_VALID" = "false" ]; then + printf "WARNING: '%s' is not a valid Kconfig\n" "$F" + fi + + T_FILE="$F_FILE.tmp" + grep "^#" $F_FILE > $T_FILE + echo " -----------------------" >> $T_FILE + echo " | arch |status|" >> $T_FILE + echo " -----------------------" >> $T_FILE + for ARCH_DIR in arch/*/; do + ARCH=$(echo $ARCH_DIR | sed -e 's/arch//g' | sed -e 's/\///g') + K_FILES=$(find $ARCH_DIR -name "Kconfig*") + K_GREP=$(grep "$K" $K_FILES) + # + # Arch support status values for (O, K) are updated according + # to the following rules. + # + # - ("", K) is 'supported by a given arch', if there is a + # Kconfig file for that arch which contains K; + # + # - ("not", K) is 'supported by a given arch', if there is + # no Kconfig file for that arch which contains K; + # + # - otherwise: preserve the previous status value (if any), + # default to 'not yet supported'. + # + # Notice that, according these rules, invalid features may be + # updated/modified. + # + if [ "$O" = "" ] && [ ! 
-z "$K_GREP" ]; then + printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE + elif [ "$O" = "not" ] && [ -z "$K_GREP" ]; then + printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE + else + S=$(grep -v "^#" "$F_FILE" | grep " $ARCH:") + if [ ! -z "$S" ]; then + echo "$S" >> $T_FILE + else + printf " |%12s: | TODO |\n" "$ARCH" \ + >> $T_FILE + fi + fi + done + echo " -----------------------" >> $T_FILE + mv $T_FILE $F_FILE +done From 7156fc292850f7841077d5bde487422794b5335a Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Mon, 7 May 2018 12:43:39 +0200 Subject: [PATCH 071/103] Documentation/features: Refresh the arch support status files in place Now that the script 'features-refresh.sh' is available, uses this script to refresh all the arch-support.txt files in place. Signed-off-by: Andrea Parri Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Andrew Morton Signed-off-by: Jonathan Corbet --- Documentation/features/core/BPF-JIT/arch-support.txt | 2 ++ .../features/core/generic-idle-thread/arch-support.txt | 4 +++- .../features/core/jump-labels/arch-support.txt | 2 ++ Documentation/features/core/tracehook/arch-support.txt | 2 ++ Documentation/features/debug/KASAN/arch-support.txt | 4 +++- .../features/debug/gcov-profile-all/arch-support.txt | 2 ++ Documentation/features/debug/kgdb/arch-support.txt | 4 +++- .../features/debug/kprobes-on-ftrace/arch-support.txt | 2 ++ Documentation/features/debug/kprobes/arch-support.txt | 4 +++- .../features/debug/kretprobes/arch-support.txt | 4 +++- .../features/debug/optprobes/arch-support.txt | 4 +++- .../features/debug/stackprotector/arch-support.txt | 2 ++ Documentation/features/debug/uprobes/arch-support.txt | 6 ++++-- .../features/debug/user-ret-profiler/arch-support.txt | 2 ++ .../features/io/dma-api-debug/arch-support.txt | 2 ++ .../features/io/dma-contiguous/arch-support.txt | 4 +++- Documentation/features/io/sg-chain/arch-support.txt | 2 ++ .../features/lib/strncasecmp/arch-support.txt | 2 ++ .../features/locking/cmpxchg-local/arch-support.txt | 4 
+++- .../features/locking/lockdep/arch-support.txt | 4 +++- .../features/locking/queued-rwlocks/arch-support.txt | 10 ++++++---- .../features/locking/queued-spinlocks/arch-support.txt | 8 +++++--- .../features/locking/rwsem-optimized/arch-support.txt | 2 ++ .../features/perf/kprobes-event/arch-support.txt | 6 ++++-- Documentation/features/perf/perf-regs/arch-support.txt | 4 +++- .../features/perf/perf-stackdump/arch-support.txt | 4 +++- .../sched/membarrier-sync-core/arch-support.txt | 2 ++ .../features/sched/numa-balancing/arch-support.txt | 6 ++++-- .../features/seccomp/seccomp-filter/arch-support.txt | 6 ++++-- .../features/time/arch-tick-broadcast/arch-support.txt | 4 +++- .../features/time/clockevents/arch-support.txt | 4 +++- .../features/time/context-tracking/arch-support.txt | 2 ++ .../features/time/irq-time-acct/arch-support.txt | 4 +++- .../features/time/modern-timekeeping/arch-support.txt | 2 ++ .../features/time/virt-cpuacct/arch-support.txt | 2 ++ Documentation/features/vm/ELF-ASLR/arch-support.txt | 4 +++- Documentation/features/vm/PG_uncached/arch-support.txt | 2 ++ Documentation/features/vm/THP/arch-support.txt | 2 ++ Documentation/features/vm/TLB/arch-support.txt | 2 ++ Documentation/features/vm/huge-vmap/arch-support.txt | 2 ++ .../features/vm/ioremap_prot/arch-support.txt | 2 ++ .../features/vm/numa-memblock/arch-support.txt | 4 +++- Documentation/features/vm/pte_special/arch-support.txt | 2 ++ 43 files changed, 117 insertions(+), 31 deletions(-) diff --git a/Documentation/features/core/BPF-JIT/arch-support.txt b/Documentation/features/core/BPF-JIT/arch-support.txt index 0b96b4e1e7d4..d277f971ccd6 100644 --- a/Documentation/features/core/BPF-JIT/arch-support.txt +++ b/Documentation/features/core/BPF-JIT/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: 
| ok | diff --git a/Documentation/features/core/generic-idle-thread/arch-support.txt b/Documentation/features/core/generic-idle-thread/arch-support.txt index 372a2b18a617..0ef6acdb991c 100644 --- a/Documentation/features/core/generic-idle-thread/arch-support.txt +++ b/Documentation/features/core/generic-idle-thread/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | - | openrisc: | TODO | + | openrisc: | ok | | parisc: | ok | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/core/jump-labels/arch-support.txt b/Documentation/features/core/jump-labels/arch-support.txt index ad97217b003b..27cbd63abfd2 100644 --- a/Documentation/features/core/jump-labels/arch-support.txt +++ b/Documentation/features/core/jump-labels/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/core/tracehook/arch-support.txt b/Documentation/features/core/tracehook/arch-support.txt index 36ee7bef5d18..f44c274e40ed 100644 --- a/Documentation/features/core/tracehook/arch-support.txt +++ b/Documentation/features/core/tracehook/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | ok | | nios2: | ok | | openrisc: | ok | | parisc: | ok | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/debug/KASAN/arch-support.txt b/Documentation/features/debug/KASAN/arch-support.txt index f5c99fa576d3..282ecc8ea1da 100644 --- a/Documentation/features/debug/KASAN/arch-support.txt +++ b/Documentation/features/debug/KASAN/arch-support.txt @@ -17,15 +17,17 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | 
TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | | um: | TODO | | unicore32: | TODO | - | x86: | ok | 64-bit only + | x86: | ok | | xtensa: | ok | ----------------------- diff --git a/Documentation/features/debug/gcov-profile-all/arch-support.txt b/Documentation/features/debug/gcov-profile-all/arch-support.txt index 5170a9934843..01b2b3004e0a 100644 --- a/Documentation/features/debug/gcov-profile-all/arch-support.txt +++ b/Documentation/features/debug/gcov-profile-all/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | ok | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | TODO | diff --git a/Documentation/features/debug/kgdb/arch-support.txt b/Documentation/features/debug/kgdb/arch-support.txt index 13b6e994ae1f..3b4dff22329f 100644 --- a/Documentation/features/debug/kgdb/arch-support.txt +++ b/Documentation/features/debug/kgdb/arch-support.txt @@ -11,16 +11,18 @@ | arm: | ok | | arm64: | ok | | c6x: | TODO | - | h8300: | TODO | + | h8300: | ok | | hexagon: | ok | | ia64: | TODO | | m68k: | TODO | | microblaze: | ok | | mips: | ok | + | nds32: | TODO | | nios2: | ok | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt b/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt index 419bb38820e7..7e963d0ae646 100644 --- a/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt +++ b/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | 
TODO | | sparc: | TODO | diff --git a/Documentation/features/debug/kprobes/arch-support.txt b/Documentation/features/debug/kprobes/arch-support.txt index 52b3ace0a030..4ada027faf16 100644 --- a/Documentation/features/debug/kprobes/arch-support.txt +++ b/Documentation/features/debug/kprobes/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | ok | | arm: | ok | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/debug/kretprobes/arch-support.txt b/Documentation/features/debug/kretprobes/arch-support.txt index 180d24419518..044e13fcca5d 100644 --- a/Documentation/features/debug/kretprobes/arch-support.txt +++ b/Documentation/features/debug/kretprobes/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | ok | | arm: | ok | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/debug/optprobes/arch-support.txt b/Documentation/features/debug/optprobes/arch-support.txt index 0a1241f45e41..dce7669c918f 100644 --- a/Documentation/features/debug/optprobes/arch-support.txt +++ b/Documentation/features/debug/optprobes/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | - | powerpc: | TODO | + | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git 
a/Documentation/features/debug/stackprotector/arch-support.txt b/Documentation/features/debug/stackprotector/arch-support.txt index 570019572383..74b89a9c8b3a 100644 --- a/Documentation/features/debug/stackprotector/arch-support.txt +++ b/Documentation/features/debug/stackprotector/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | ok | | sparc: | TODO | diff --git a/Documentation/features/debug/uprobes/arch-support.txt b/Documentation/features/debug/uprobes/arch-support.txt index 0b8d922eb799..1a3f9d3229bf 100644 --- a/Documentation/features/debug/uprobes/arch-support.txt +++ b/Documentation/features/debug/uprobes/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | TODO | | arm: | ok | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -17,13 +17,15 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | - | sparc: | TODO | + | sparc: | ok | | um: | TODO | | unicore32: | TODO | | x86: | ok | diff --git a/Documentation/features/debug/user-ret-profiler/arch-support.txt b/Documentation/features/debug/user-ret-profiler/arch-support.txt index 13852ae62e9e..1d78d1069a5f 100644 --- a/Documentation/features/debug/user-ret-profiler/arch-support.txt +++ b/Documentation/features/debug/user-ret-profiler/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/io/dma-api-debug/arch-support.txt b/Documentation/features/io/dma-api-debug/arch-support.txt index 
e438ed675623..dd2806de9b8b 100644 --- a/Documentation/features/io/dma-api-debug/arch-support.txt +++ b/Documentation/features/io/dma-api-debug/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | ok | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/io/dma-contiguous/arch-support.txt b/Documentation/features/io/dma-contiguous/arch-support.txt index 47f64a433df0..30c072d2b67c 100644 --- a/Documentation/features/io/dma-contiguous/arch-support.txt +++ b/Documentation/features/io/dma-contiguous/arch-support.txt @@ -17,11 +17,13 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | - | s390: | TODO | + | riscv: | ok | + | s390: | ok | | sh: | TODO | | sparc: | TODO | | um: | TODO | diff --git a/Documentation/features/io/sg-chain/arch-support.txt b/Documentation/features/io/sg-chain/arch-support.txt index 07f357fadbff..6554f0372c3f 100644 --- a/Documentation/features/io/sg-chain/arch-support.txt +++ b/Documentation/features/io/sg-chain/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/lib/strncasecmp/arch-support.txt b/Documentation/features/lib/strncasecmp/arch-support.txt index 4f3a6a0e4e68..6148f42c3d90 100644 --- a/Documentation/features/lib/strncasecmp/arch-support.txt +++ b/Documentation/features/lib/strncasecmp/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | 
sparc: | TODO | diff --git a/Documentation/features/locking/cmpxchg-local/arch-support.txt b/Documentation/features/locking/cmpxchg-local/arch-support.txt index 482a0b09d1f8..51704a2dc8d1 100644 --- a/Documentation/features/locking/cmpxchg-local/arch-support.txt +++ b/Documentation/features/locking/cmpxchg-local/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | TODO | | arm: | TODO | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/locking/lockdep/arch-support.txt b/Documentation/features/locking/lockdep/arch-support.txt index bb35c5ba6286..bd39c5edd460 100644 --- a/Documentation/features/locking/lockdep/arch-support.txt +++ b/Documentation/features/locking/lockdep/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | ok | | mips: | ok | + | nds32: | ok | | nios2: | TODO | - | openrisc: | TODO | + | openrisc: | ok | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/locking/queued-rwlocks/arch-support.txt b/Documentation/features/locking/queued-rwlocks/arch-support.txt index 627e9a6b2db9..da7aff3bee0b 100644 --- a/Documentation/features/locking/queued-rwlocks/arch-support.txt +++ b/Documentation/features/locking/queued-rwlocks/arch-support.txt @@ -9,21 +9,23 @@ | alpha: | TODO | | arc: | TODO | | arm: | TODO | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | | ia64: | TODO | | m68k: | TODO | | microblaze: | TODO | - | mips: | TODO | + | mips: | ok | + | nds32: | TODO | | nios2: | TODO | - | openrisc: | TODO | + | openrisc: | ok | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | 
TODO | | sh: | TODO | - | sparc: | TODO | + | sparc: | ok | | um: | TODO | | unicore32: | TODO | | x86: | ok | diff --git a/Documentation/features/locking/queued-spinlocks/arch-support.txt b/Documentation/features/locking/queued-spinlocks/arch-support.txt index 9edda216cdfb..478e9101322c 100644 --- a/Documentation/features/locking/queued-spinlocks/arch-support.txt +++ b/Documentation/features/locking/queued-spinlocks/arch-support.txt @@ -16,14 +16,16 @@ | ia64: | TODO | | m68k: | TODO | | microblaze: | TODO | - | mips: | TODO | + | mips: | ok | + | nds32: | TODO | | nios2: | TODO | - | openrisc: | TODO | + | openrisc: | ok | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | - | sparc: | TODO | + | sparc: | ok | | um: | TODO | | unicore32: | TODO | | x86: | ok | diff --git a/Documentation/features/locking/rwsem-optimized/arch-support.txt b/Documentation/features/locking/rwsem-optimized/arch-support.txt index 8d9afb10b16e..8afe24ffa3ab 100644 --- a/Documentation/features/locking/rwsem-optimized/arch-support.txt +++ b/Documentation/features/locking/rwsem-optimized/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/perf/kprobes-event/arch-support.txt b/Documentation/features/perf/kprobes-event/arch-support.txt index d01239ee34b3..7331402d1887 100644 --- a/Documentation/features/perf/kprobes-event/arch-support.txt +++ b/Documentation/features/perf/kprobes-event/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | TODO | | arm: | ok | - | arm64: | TODO | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | ok | @@ -17,13 +17,15 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | ok | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok 
| + | riscv: | TODO | | s390: | ok | | sh: | ok | - | sparc: | TODO | + | sparc: | ok | | um: | TODO | | unicore32: | TODO | | x86: | ok | diff --git a/Documentation/features/perf/perf-regs/arch-support.txt b/Documentation/features/perf/perf-regs/arch-support.txt index 458faba5311a..53feeee6cdad 100644 --- a/Documentation/features/perf/perf-regs/arch-support.txt +++ b/Documentation/features/perf/perf-regs/arch-support.txt @@ -17,11 +17,13 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | - | s390: | TODO | + | riscv: | TODO | + | s390: | ok | | sh: | TODO | | sparc: | TODO | | um: | TODO | diff --git a/Documentation/features/perf/perf-stackdump/arch-support.txt b/Documentation/features/perf/perf-stackdump/arch-support.txt index 545d01c69c88..16164348e0ea 100644 --- a/Documentation/features/perf/perf-stackdump/arch-support.txt +++ b/Documentation/features/perf/perf-stackdump/arch-support.txt @@ -17,11 +17,13 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | - | s390: | TODO | + | riscv: | TODO | + | s390: | ok | | sh: | TODO | | sparc: | TODO | | um: | TODO | diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt index 85a6c9d4571c..dbdf62907703 100644 --- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt +++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt @@ -40,10 +40,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/sched/numa-balancing/arch-support.txt 
b/Documentation/features/sched/numa-balancing/arch-support.txt index 347508863872..c68bb2c2cb62 100644 --- a/Documentation/features/sched/numa-balancing/arch-support.txt +++ b/Documentation/features/sched/numa-balancing/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | .. | | arm: | .. | - | arm64: | .. | + | arm64: | ok | | c6x: | .. | | h8300: | .. | | hexagon: | .. | @@ -17,11 +17,13 @@ | m68k: | .. | | microblaze: | .. | | mips: | TODO | + | nds32: | TODO | | nios2: | .. | | openrisc: | .. | | parisc: | .. | | powerpc: | ok | - | s390: | .. | + | riscv: | TODO | + | s390: | ok | | sh: | .. | | sparc: | TODO | | um: | .. | diff --git a/Documentation/features/seccomp/seccomp-filter/arch-support.txt b/Documentation/features/seccomp/seccomp-filter/arch-support.txt index e4fad58a05e5..d4271b493b41 100644 --- a/Documentation/features/seccomp/seccomp-filter/arch-support.txt +++ b/Documentation/features/seccomp/seccomp-filter/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | - | parisc: | TODO | - | powerpc: | TODO | + | parisc: | ok | + | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/time/arch-tick-broadcast/arch-support.txt b/Documentation/features/time/arch-tick-broadcast/arch-support.txt index 8052904b25fc..83d9e68462bb 100644 --- a/Documentation/features/time/arch-tick-broadcast/arch-support.txt +++ b/Documentation/features/time/arch-tick-broadcast/arch-support.txt @@ -17,12 +17,14 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | - | sh: | TODO | + | sh: | ok | | sparc: | TODO | | um: | TODO | | unicore32: | TODO | diff --git a/Documentation/features/time/clockevents/arch-support.txt b/Documentation/features/time/clockevents/arch-support.txt 
index 7c76b946297e..3d4908fce6da 100644 --- a/Documentation/features/time/clockevents/arch-support.txt +++ b/Documentation/features/time/clockevents/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | ok | | microblaze: | ok | | mips: | ok | + | nds32: | ok | | nios2: | ok | | openrisc: | ok | - | parisc: | TODO | + | parisc: | ok | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/time/context-tracking/arch-support.txt b/Documentation/features/time/context-tracking/arch-support.txt index 9433b3e523b3..c29974afffaa 100644 --- a/Documentation/features/time/context-tracking/arch-support.txt +++ b/Documentation/features/time/context-tracking/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/time/irq-time-acct/arch-support.txt b/Documentation/features/time/irq-time-acct/arch-support.txt index 212dde0b578c..8d73c463ec27 100644 --- a/Documentation/features/time/irq-time-acct/arch-support.txt +++ b/Documentation/features/time/irq-time-acct/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | .. | - | powerpc: | .. | + | powerpc: | ok | + | riscv: | TODO | | s390: | .. | | sh: | TODO | | sparc: | .. 
| diff --git a/Documentation/features/time/modern-timekeeping/arch-support.txt b/Documentation/features/time/modern-timekeeping/arch-support.txt index 4074028f72f7..e7c6ea6b8fb3 100644 --- a/Documentation/features/time/modern-timekeeping/arch-support.txt +++ b/Documentation/features/time/modern-timekeeping/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | ok | | mips: | ok | + | nds32: | ok | | nios2: | ok | | openrisc: | ok | | parisc: | ok | | powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/time/virt-cpuacct/arch-support.txt b/Documentation/features/time/virt-cpuacct/arch-support.txt index a394d8820517..4646457461cf 100644 --- a/Documentation/features/time/virt-cpuacct/arch-support.txt +++ b/Documentation/features/time/virt-cpuacct/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | ok | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | ok | diff --git a/Documentation/features/vm/ELF-ASLR/arch-support.txt b/Documentation/features/vm/ELF-ASLR/arch-support.txt index 082f93d5b40e..1f71d090ff2c 100644 --- a/Documentation/features/vm/ELF-ASLR/arch-support.txt +++ b/Documentation/features/vm/ELF-ASLR/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | ok | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | - | parisc: | TODO | + | parisc: | ok | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/vm/PG_uncached/arch-support.txt b/Documentation/features/vm/PG_uncached/arch-support.txt index 605e0abb756d..fbd5aa463b0a 100644 --- a/Documentation/features/vm/PG_uncached/arch-support.txt +++ b/Documentation/features/vm/PG_uncached/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO 
| | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/vm/THP/arch-support.txt b/Documentation/features/vm/THP/arch-support.txt index 7a8eb0bd5ca8..5d7ecc378f29 100644 --- a/Documentation/features/vm/THP/arch-support.txt +++ b/Documentation/features/vm/THP/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | .. | | microblaze: | .. | | mips: | ok | + | nds32: | TODO | | nios2: | .. | | openrisc: | .. | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | .. | | sparc: | ok | diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt index 35fb99b2b3ea..f7af9678eb66 100644 --- a/Documentation/features/vm/TLB/arch-support.txt +++ b/Documentation/features/vm/TLB/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | .. | | microblaze: | .. | | mips: | TODO | + | nds32: | TODO | | nios2: | .. | | openrisc: | .. 
| | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/vm/huge-vmap/arch-support.txt b/Documentation/features/vm/huge-vmap/arch-support.txt index ed8b943ad8fc..d0713ccc7117 100644 --- a/Documentation/features/vm/huge-vmap/arch-support.txt +++ b/Documentation/features/vm/huge-vmap/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | TODO | + | riscv: | TODO | | s390: | TODO | | sh: | TODO | | sparc: | TODO | diff --git a/Documentation/features/vm/ioremap_prot/arch-support.txt b/Documentation/features/vm/ioremap_prot/arch-support.txt index 589947bdf0a8..8527601a3739 100644 --- a/Documentation/features/vm/ioremap_prot/arch-support.txt +++ b/Documentation/features/vm/ioremap_prot/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | TODO | | sh: | ok | | sparc: | TODO | diff --git a/Documentation/features/vm/numa-memblock/arch-support.txt b/Documentation/features/vm/numa-memblock/arch-support.txt index 8b8bea0318a0..1a988052cd24 100644 --- a/Documentation/features/vm/numa-memblock/arch-support.txt +++ b/Documentation/features/vm/numa-memblock/arch-support.txt @@ -9,7 +9,7 @@ | alpha: | TODO | | arc: | .. | | arm: | .. | - | arm64: | .. | + | arm64: | ok | | c6x: | .. | | h8300: | .. | | hexagon: | .. | @@ -17,10 +17,12 @@ | m68k: | .. | | microblaze: | ok | | mips: | ok | + | nds32: | TODO | | nios2: | .. | | openrisc: | .. | | parisc: | .. 
| powerpc: | ok | + | riscv: | ok | | s390: | ok | | sh: | ok | | sparc: | ok | diff --git a/Documentation/features/vm/pte_special/arch-support.txt b/Documentation/features/vm/pte_special/arch-support.txt index 055004f467d2..6a608a6dcf71 100644 --- a/Documentation/features/vm/pte_special/arch-support.txt +++ b/Documentation/features/vm/pte_special/arch-support.txt @@ -17,10 +17,12 @@ | m68k: | TODO | | microblaze: | TODO | | mips: | TODO | + | nds32: | TODO | | nios2: | TODO | | openrisc: | TODO | | parisc: | TODO | | powerpc: | ok | + | riscv: | TODO | | s390: | ok | | sh: | ok | | sparc: | ok | From b466865916d554bbdeaa4d440add7e9e959a5433 Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Mon, 7 May 2018 12:43:40 +0200 Subject: [PATCH 072/103] Documentation/features/core: Add arch support status files for 'cBPF-JIT' and 'eBPF-JIT' Commit 6077776b5908e split 'HAVE_BPF_JIT' into cBPF and eBPF variants. Adds arch support status files for the new variants, and removes the status file corresponding to 'HAVE_BPF_JIT'. The new status matrices were auto-generated using the script 'features-refresh.sh'.
Signed-off-by: Andrea Parri Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Andrew Morton Signed-off-by: Jonathan Corbet --- .../features/core/cBPF-JIT/arch-support.txt | 33 +++++++++++++++++++ .../{BPF-JIT => eBPF-JIT}/arch-support.txt | 6 ++-- 2 files changed, 36 insertions(+), 3 deletions(-) create mode 100644 Documentation/features/core/cBPF-JIT/arch-support.txt rename Documentation/features/core/{BPF-JIT => eBPF-JIT}/arch-support.txt (85%) diff --git a/Documentation/features/core/cBPF-JIT/arch-support.txt b/Documentation/features/core/cBPF-JIT/arch-support.txt new file mode 100644 index 000000000000..90459cdde314 --- /dev/null +++ b/Documentation/features/core/cBPF-JIT/arch-support.txt @@ -0,0 +1,33 @@ +# +# Feature name: cBPF-JIT +# Kconfig: HAVE_CBPF_JIT +# description: arch supports cBPF JIT optimizations +# + ----------------------- + | arch |status| + ----------------------- + | alpha: | TODO | + | arc: | TODO | + | arm: | TODO | + | arm64: | TODO | + | c6x: | TODO | + | h8300: | TODO | + | hexagon: | TODO | + | ia64: | TODO | + | m68k: | TODO | + | microblaze: | TODO | + | mips: | ok | + | nds32: | TODO | + | nios2: | TODO | + | openrisc: | TODO | + | parisc: | TODO | + | powerpc: | ok | + | riscv: | TODO | + | s390: | TODO | + | sh: | TODO | + | sparc: | ok | + | um: | TODO | + | unicore32: | TODO | + | x86: | TODO | + | xtensa: | TODO | + ----------------------- diff --git a/Documentation/features/core/BPF-JIT/arch-support.txt b/Documentation/features/core/eBPF-JIT/arch-support.txt similarity index 85% rename from Documentation/features/core/BPF-JIT/arch-support.txt rename to Documentation/features/core/eBPF-JIT/arch-support.txt index d277f971ccd6..c90a0382fe66 100644 --- a/Documentation/features/core/BPF-JIT/arch-support.txt +++ b/Documentation/features/core/eBPF-JIT/arch-support.txt @@ -1,7 +1,7 @@ # -# Feature name: BPF-JIT -# Kconfig: HAVE_BPF_JIT -# description: arch supports BPF JIT optimizations +# Feature name: eBPF-JIT +# Kconfig: HAVE_EBPF_JIT +# 
description: arch supports eBPF JIT optimizations # ----------------------- | arch |status| From 0ca2840ff52d25dab84cb688f9f020a005e7dc81 Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Mon, 7 May 2018 12:43:41 +0200 Subject: [PATCH 073/103] Documentation/features/locking: Use '!RWSEM_GENERIC_SPINLOCK' as Kconfig for 'rwsem-optimized' Uses '!RWSEM_GENERIC_SPINLOCK' in place of 'Optimized asm/rwsem.h' as Kconfig for 'rwsem-optimized': the new Kconfig expresses this feature equivalently, while also enabling the script 'features-refresh.sh' to operate on the corresponding arch support status file. Also refreshes the status matrix by using the script 'features-refresh.sh'. Suggested-by: Ingo Molnar Signed-off-by: Andrea Parri Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Andrew Morton Signed-off-by: Jonathan Corbet --- .../features/locking/rwsem-optimized/arch-support.txt | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/Documentation/features/locking/rwsem-optimized/arch-support.txt b/Documentation/features/locking/rwsem-optimized/arch-support.txt index 8afe24ffa3ab..e54b1f1a8091 100644 --- a/Documentation/features/locking/rwsem-optimized/arch-support.txt +++ b/Documentation/features/locking/rwsem-optimized/arch-support.txt @@ -1,6 +1,6 @@ # # Feature name: rwsem-optimized -# Kconfig: Optimized asm/rwsem.h +# Kconfig: !RWSEM_GENERIC_SPINLOCK # description: arch provides optimized rwsem APIs # ----------------------- @@ -8,8 +8,8 @@ ----------------------- | alpha: | ok | | arc: | TODO | - | arm: | TODO | - | arm64: | TODO | + | arm: | ok | + | arm64: | ok | | c6x: | TODO | | h8300: | TODO | | hexagon: | TODO | @@ -26,7 +26,7 @@ | s390: | ok | | sh: | ok | | sparc: | ok | - | um: | TODO | + | um: | ok | | unicore32: | TODO | | x86: | ok | | xtensa: | ok | From aee17ebe002a187fa97891e6c2e7eb2f9847e3b4 Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Mon, 7 May 2018 12:43:42 +0200 Subject: [PATCH 074/103] Documentation/features/lib: Remove arch 
support status file for 'strncasecmp' Suggested-by: Ingo Molnar Signed-off-by: Andrea Parri Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Andrew Morton Signed-off-by: Jonathan Corbet --- .../features/lib/strncasecmp/arch-support.txt | 33 ------------------- 1 file changed, 33 deletions(-) delete mode 100644 Documentation/features/lib/strncasecmp/arch-support.txt diff --git a/Documentation/features/lib/strncasecmp/arch-support.txt b/Documentation/features/lib/strncasecmp/arch-support.txt deleted file mode 100644 index 6148f42c3d90..000000000000 --- a/Documentation/features/lib/strncasecmp/arch-support.txt +++ /dev/null @@ -1,33 +0,0 @@ -# -# Feature name: strncasecmp -# Kconfig: __HAVE_ARCH_STRNCASECMP -# description: arch provides an optimized strncasecmp() function -# - ----------------------- - | arch |status| - ----------------------- - | alpha: | TODO | - | arc: | TODO | - | arm: | TODO | - | arm64: | TODO | - | c6x: | TODO | - | h8300: | TODO | - | hexagon: | TODO | - | ia64: | TODO | - | m68k: | TODO | - | microblaze: | TODO | - | mips: | TODO | - | nds32: | TODO | - | nios2: | TODO | - | openrisc: | TODO | - | parisc: | TODO | - | powerpc: | TODO | - | riscv: | TODO | - | s390: | TODO | - | sh: | TODO | - | sparc: | TODO | - | um: | TODO | - | unicore32: | TODO | - | x86: | TODO | - | xtensa: | TODO | - ----------------------- From 2bef69a385b4c1c01d8abae0aa035f0ffa051f07 Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Mon, 7 May 2018 12:43:43 +0200 Subject: [PATCH 075/103] Documentation/features/vm: Remove arch support status file for 'pte_special' Suggested-by: Ingo Molnar Signed-off-by: Andrea Parri Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Andrew Morton Signed-off-by: Jonathan Corbet --- .../features/vm/pte_special/arch-support.txt | 33 ------------------- 1 file changed, 33 deletions(-) delete mode 100644 Documentation/features/vm/pte_special/arch-support.txt diff --git a/Documentation/features/vm/pte_special/arch-support.txt 
b/Documentation/features/vm/pte_special/arch-support.txt deleted file mode 100644 index 6a608a6dcf71..000000000000 --- a/Documentation/features/vm/pte_special/arch-support.txt +++ /dev/null @@ -1,33 +0,0 @@ -# -# Feature name: pte_special -# Kconfig: __HAVE_ARCH_PTE_SPECIAL -# description: arch supports the pte_special()/pte_mkspecial() VM APIs -# - ----------------------- - | arch |status| - ----------------------- - | alpha: | TODO | - | arc: | ok | - | arm: | ok | - | arm64: | ok | - | c6x: | TODO | - | h8300: | TODO | - | hexagon: | TODO | - | ia64: | TODO | - | m68k: | TODO | - | microblaze: | TODO | - | mips: | TODO | - | nds32: | TODO | - | nios2: | TODO | - | openrisc: | TODO | - | parisc: | TODO | - | powerpc: | ok | - | riscv: | TODO | - | s390: | ok | - | sh: | ok | - | sparc: | ok | - | um: | TODO | - | unicore32: | TODO | - | x86: | ok | - | xtensa: | TODO | - ----------------------- From 42f44d124e11af65b67c157ffe132d0ccf07f16b Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 8 May 2018 10:02:08 +0300 Subject: [PATCH 076/103] docs/vm: numa_memory_policy: formatting and spelling updates Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/numa_memory_policy.rst | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/Documentation/vm/numa_memory_policy.rst b/Documentation/vm/numa_memory_policy.rst index 8cd942ca114e..ac0b3967fcba 100644 --- a/Documentation/vm/numa_memory_policy.rst +++ b/Documentation/vm/numa_memory_policy.rst @@ -44,14 +44,20 @@ System Default Policy allocations. Task/Process Policy - this is an optional, per-task policy. When defined for a specific task, this policy controls all page allocations made by or on behalf of the task that aren't controlled by a more specific scope. If a task does not define a task policy, then all page allocations that would have been controlled by the task policy "fall back" to the System Default Policy. 
+ this is an optional, per-task policy. When defined for a + specific task, this policy controls all page allocations made + by or on behalf of the task that aren't controlled by a more + specific scope. If a task does not define a task policy, then + all page allocations that would have been controlled by the + task policy "fall back" to the System Default Policy. The task policy applies to the entire address space of a task. Thus, it is inheritable, and indeed is inherited, across both fork() [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task to establish the task policy for a child task exec()'d from an executable image that has no awareness of memory policy. See the - MEMORY POLICY APIS section, below, for an overview of the system call + :ref:`Memory Policy APIs ` section, + below, for an overview of the system call that a task may use to set/change its task/process policy. In a multi-threaded task, task policies apply only to the thread @@ -70,12 +76,13 @@ Task/Process Policy VMA Policy A "VMA" or "Virtual Memory Area" refers to a range of a task's virtual address space. A task may define a specific policy for a range - of its virtual address space. See the MEMORY POLICIES APIS section, + of its virtual address space. See the + :ref:`Memory Policy APIs ` section, below, for an overview of the mbind() system call used to set a VMA policy. A VMA policy will govern the allocation of pages that back - this region ofthe address space. Any regions of the task's + this region of the address space. Any regions of the task's address space that don't have an explicit VMA policy will fall back to the task policy, which may itself fall back to the System Default Policy. @@ -117,7 +124,7 @@ VMA Policy Shared Policy Conceptually, shared policies apply to "memory objects" mapped shared into one or more tasks' distinct address spaces. 
An - application installs a shared policies the same way as VMA + application installs shared policies the same way as VMA policies--using the mbind() system call specifying a range of virtual addresses that map the shared object. However, unlike VMA policies, which can be considered to be an attribute of a @@ -135,7 +142,7 @@ Shared Policy Although hugetlbfs segments now support lazy allocation, their support for shared policy has not been completed. - As mentioned above :ref:`VMA policies `, + As mentioned above in :ref:`VMA policies ` section, allocations of page cache pages for regular files mmap()ed with MAP_SHARED ignore any VMA policy installed on the virtual address range backed by the shared file mapping. Rather, @@ -245,7 +252,7 @@ MPOL_F_STATIC_NODES the user should not be remapped if the task or VMA's set of allowed nodes changes after the memory policy has been defined. - Without this flag, anytime a mempolicy is rebound because of a + Without this flag, any time a mempolicy is rebound because of a change in the set of allowed nodes, the node (Preferred) or nodemask (Bind, Interleave) is remapped to the new set of allowed nodes. This may result in nodes being used that were @@ -389,7 +396,10 @@ follows: or by prefaulting the entire shared memory region into memory and locking it down. However, this might not be appropriate for all applications. +.. _memory_policy_apis: + Memory Policy APIs +================== Linux supports 3 system calls for controlling memory policy. These APIS always affect only the calling task, the calling task's address space, or From 1174bd849c75ee51c89df56f363b33aeae78ffd7 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 8 May 2018 10:02:09 +0300 Subject: [PATCH 077/103] docs/vm: numa_memory_policy: s/Linux memory policy/NUMA memory policy/ The document describes NUMA memory policy and as it is a part of the Linux documentation it's obvious that this is Linux memory policy. 
Besides, "Linux memory policy" may refer to other policies, e.g. memory hotplug policy, and using term NUMA makes the documentation less ambiguous. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/numa_memory_policy.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/Documentation/vm/numa_memory_policy.rst b/Documentation/vm/numa_memory_policy.rst index ac0b3967fcba..d78c5b315f72 100644 --- a/Documentation/vm/numa_memory_policy.rst +++ b/Documentation/vm/numa_memory_policy.rst @@ -1,10 +1,10 @@ .. _numa_memory_policy: -=================== -Linux Memory Policy -=================== +================== +NUMA Memory Policy +================== -What is Linux Memory Policy? +What is NUMA Memory Policy? ============================ In the Linux kernel, "memory policy" determines from which node the kernel will @@ -162,7 +162,7 @@ Shared Policy Components of Memory Policies ----------------------------- -A Linux memory policy consists of a "mode", optional mode flags, and +A NUMA memory policy consists of a "mode", optional mode flags, and an optional set of nodes. The mode determines the behavior of the policy, the optional mode flags determine the behavior of the mode, and the optional set of nodes can be viewed as the arguments to the @@ -172,7 +172,7 @@ Internally, memory policies are implemented by a reference counted structure, struct mempolicy. Details of this structure will be discussed in context, below, as required to explain the behavior. -Linux memory policy supports the following 4 behavioral modes: +NUMA memory policy supports the following 4 behavioral modes: Default Mode--MPOL_DEFAULT This mode is only used in the memory policy APIs. Internally, @@ -245,7 +245,7 @@ MPOL_INTERLEAVED address range or file. During system boot up, the temporary interleaved system default policy works in this mode. 
-Linux memory policy supports the following optional mode flags: +NUMA memory policy supports the following optional mode flags: MPOL_F_STATIC_NODES This flag specifies that the nodemask passed by From 3ecf53e41a642d4172cff1f641b23fa1baaa229a Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 8 May 2018 10:02:10 +0300 Subject: [PATCH 078/103] docs/vm: move numa_memory_policy.rst to Documentation/admin-guide/mm The document describes userspace API and as such it belongs to Documentation/admin-guide/mm Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/mm/hugetlbpage.rst | 2 +- Documentation/admin-guide/mm/index.rst | 1 + Documentation/{vm => admin-guide/mm}/numa_memory_policy.rst | 0 Documentation/filesystems/proc.txt | 2 +- Documentation/filesystems/tmpfs.txt | 5 +++-- Documentation/vm/00-INDEX | 2 -- Documentation/vm/index.rst | 1 - Documentation/vm/numa.rst | 2 +- 8 files changed, 7 insertions(+), 8 deletions(-) rename Documentation/{vm => admin-guide/mm}/numa_memory_policy.rst (100%) diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index a8b0806377bb..1cc0bc78d10e 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -220,7 +220,7 @@ memory policy mode--bind, preferred, local or interleave--may be used. The resulting effect on persistent huge page allocation is as follows: #. Regardless of mempolicy mode [see - :ref:`Documentation/vm/numa_memory_policy.rst `], + :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst `], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified. 
However, if a node in the policy does not contain sufficient contiguous diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index ad28644fee35..a69aa69af255 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -24,6 +24,7 @@ the Linux memory management. hugetlbpage idle_page_tracking ksm + numa_memory_policy pagemap soft-dirty userfaultfd diff --git a/Documentation/vm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst similarity index 100% rename from Documentation/vm/numa_memory_policy.rst rename to Documentation/admin-guide/mm/numa_memory_policy.rst diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index ef53f808288d..520f6a84cf50 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -566,7 +566,7 @@ address policy mapping details Where: "address" is the starting address for the mapping; -"policy" reports the NUMA memory policy set for the mapping (see vm/numa_memory_policy.txt); +"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst); "mapping details" summarizes mapping data such as mapping type, page usage counters, node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page size, in KB, that is backing the mapping up. diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index 627389a34f77..d06e9a59a9f4 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -105,8 +105,9 @@ policy for the file will revert to "default" policy. NUMA memory allocation policies have optional flags that can be used in conjunction with their modes. These optional flags can be specified when tmpfs is mounted by appending them to the mode before the NodeList. 
-See Documentation/vm/numa_memory_policy.rst for a list of all available -memory allocation policy mode flags and their effect on memory policy. +See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of +all available memory allocation policy mode flags and their effect on +memory policy. =static is equivalent to MPOL_F_STATIC_NODES =relative is equivalent to MPOL_F_RELATIVE_NODES diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index f8a96ca16b7a..f4a4f3e884cf 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -22,8 +22,6 @@ mmu_notifier.rst - a note about clearing pte/pmd and mmu notifications numa.rst - information about NUMA specific code in the Linux vm. -numa_memory_policy.rst - - documentation of concepts and APIs of the 2.6 memory policy support. overcommit-accounting.rst - description of the Linux kernels overcommit handling modes. page_frags.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index ed58cb9f9675..8e1cc667eef1 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -14,7 +14,6 @@ various features of the Linux memory management :maxdepth: 1 ksm - numa_memory_policy transhuge swap_numa zswap diff --git a/Documentation/vm/numa.rst b/Documentation/vm/numa.rst index aada84bc8c46..185d8a568168 100644 --- a/Documentation/vm/numa.rst +++ b/Documentation/vm/numa.rst @@ -110,7 +110,7 @@ to improve NUMA locality using various CPU affinity command line interfaces, such as taskset(1) and numactl(1), and program interfaces such as sched_setaffinity(2). Further, one can modify the kernel's default local allocation behavior using Linux NUMA memory policy. -[see Documentation/vm/numa_memory_policy.rst.] +[see Documentation/admin-guide/mm/numa_memory_policy.rst.] 
System administrators can restrict the CPUs and nodes' memories that a non- privileged user can specify in the scheduling or NUMA commands and functions From 2d93404f358312c8adb0bbf975d0e30662d40c33 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 7 May 2018 06:35:39 -0300 Subject: [PATCH 079/103] docs: */index.rst: Add newer documents to their respective index.rst A number of new docs were added, but they're not yet listed in the index.rst of the section they belong to, causing Sphinx warnings. Add them. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/crypto/index.rst | 1 + Documentation/driver-api/index.rst | 1 + Documentation/process/index.rst | 1 + Documentation/security/index.rst | 2 ++ 4 files changed, 5 insertions(+) diff --git a/Documentation/crypto/index.rst b/Documentation/crypto/index.rst index 94c4786f2573..c4ff5d791233 100644 --- a/Documentation/crypto/index.rst +++ b/Documentation/crypto/index.rst @@ -20,5 +20,6 @@ for cryptographic use cases, as well as programming examples. architecture devel-algos userspace-if + crypto_engine api api-samples diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 6d8352c0f354..3ac51c94f97b 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -18,6 +18,7 @@ available subsections can be seen below. infrastructure pm/index device-io + device_connection dma-buf device_link message-based diff --git a/Documentation/process/index.rst b/Documentation/process/index.rst index 1c9fe657ed01..37bd0628b6ee 100644 --- a/Documentation/process/index.rst +++ b/Documentation/process/index.rst @@ -52,6 +52,7 @@ lack of a better place. adding-syscalls magic-number volatile-considered-harmful + clang-format ..
only:: subproject and html diff --git a/Documentation/security/index.rst b/Documentation/security/index.rst index 298a94a33f05..85492bfca530 100644 --- a/Documentation/security/index.rst +++ b/Documentation/security/index.rst @@ -9,5 +9,7 @@ Security Documentation IMA-templates keys/index LSM + LSM-sctp + SELinux-sctp self-protection tpm/index From fe8703cc0de67695e3385ba78b5dfb1091769d50 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 7 May 2018 06:35:40 -0300 Subject: [PATCH 080/103] docs: admin-guide: add bcache documentation The bcache.txt is already in ReST format. So, move it to the admin guide, where it belongs. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/00-INDEX | 2 -- Documentation/{bcache.txt => admin-guide/bcache.rst} | 0 Documentation/admin-guide/index.rst | 1 + 3 files changed, 1 insertion(+), 2 deletions(-) rename Documentation/{bcache.txt => admin-guide/bcache.rst} (100%) diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 708dc4c166e4..53699c79ee54 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -64,8 +64,6 @@ auxdisplay/ - misc. LCD driver documentation (cfag12864b, ks0108). backlight/ - directory with info on controlling backlights in flat panel displays -bcache.txt - - Block-layer cache on fast SSDs to improve slow (raid) I/O performance. block/ - info on the Block I/O (BIO) layer. blockdev/ diff --git a/Documentation/bcache.txt b/Documentation/admin-guide/bcache.rst similarity index 100% rename from Documentation/bcache.txt rename to Documentation/admin-guide/bcache.rst diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index cac906fb0ed0..52eb3408f9a0 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -60,6 +60,7 @@ configure specific aspects of kernel behavior to your liking. 
mono java ras + bcache pm/index thunderbolt LSM/index From de0f51e4b1391145e47d6aa60681dab091bcc777 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 7 May 2018 06:35:41 -0300 Subject: [PATCH 081/103] docs: core-api: add cachetlb documentation The cachetlb.txt is already in ReST format. So, move it to the core-api guide, where it belongs. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/00-INDEX | 2 -- Documentation/{cachetlb.txt => core-api/cachetlb.rst} | 0 Documentation/core-api/index.rst | 1 + Documentation/memory-barriers.txt | 2 +- Documentation/translations/ko_KR/memory-barriers.txt | 2 +- 5 files changed, 3 insertions(+), 4 deletions(-) rename Documentation/{cachetlb.txt => core-api/cachetlb.rst} (100%) diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 53699c79ee54..04074059bcdc 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -76,8 +76,6 @@ bus-devices/ - directory with info on TI GPMC (General Purpose Memory Controller) bus-virt-phys-mapping.txt - how to access I/O mapped memory from within device drivers. -cachetlb.txt - - describes the cache/TLB flushing interfaces Linux uses. cdrom/ - directory with information on the CD-ROM drivers that Linux has. 
cgroup-v1/ diff --git a/Documentation/cachetlb.txt b/Documentation/core-api/cachetlb.rst similarity index 100% rename from Documentation/cachetlb.txt rename to Documentation/core-api/cachetlb.rst diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index c670a8031786..d4d71ee564ae 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -14,6 +14,7 @@ Core utilities kernel-api assoc_array atomic_ops + cachetlb refcount-vs-atomic cpu_hotplug idr diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 6dafc8085acc..983249906fc6 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -2903,7 +2903,7 @@ is discarded from the CPU's cache and reloaded. To deal with this, the appropriate part of the kernel must invalidate the overlapping bits of the cache on each CPU. -See Documentation/cachetlb.txt for more information on cache management. +See Documentation/core-api/cachetlb.rst for more information on cache management. CACHE COHERENCY VS MMIO diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt index 0a0930ab4156..081937577c1a 100644 --- a/Documentation/translations/ko_KR/memory-barriers.txt +++ b/Documentation/translations/ko_KR/memory-barriers.txt @@ -2846,7 +2846,7 @@ CPU 의 캐시에서 RAM 으로 쓰여지는 더티 캐시 라인에 의해 덮 문제를 해결하기 위해선, 커널의 적절한 부분에서 각 CPU 의 캐시 안의 문제가 되는 비트들을 무효화 시켜야 합니다. -캐시 관리에 대한 더 많은 정보를 위해선 Documentation/cachetlb.txt 를 +캐시 관리에 대한 더 많은 정보를 위해선 Documentation/core-api/cachetlb.rst 를 참고하세요. From d8a121e3d5a503152206bfa1d16d88074b121b2a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 7 May 2018 06:35:43 -0300 Subject: [PATCH 082/103] docs: core-api: add circular-buffers documentation The circular-buffers.txt is already in ReST format. So, move it to the core-api guide, where it belongs. 
Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/00-INDEX | 2 -- .../{circular-buffers.txt => core-api/circular-buffers.rst} | 0 Documentation/core-api/index.rst | 1 + Documentation/memory-barriers.txt | 2 +- Documentation/translations/ko_KR/memory-barriers.txt | 2 +- 5 files changed, 3 insertions(+), 4 deletions(-) rename Documentation/{circular-buffers.txt => core-api/circular-buffers.rst} (100%) diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 04074059bcdc..6e141c05f3d2 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -82,8 +82,6 @@ cgroup-v1/ - cgroups v1 features, including cpusets and memory controller. cgroup-v2.txt - cgroups v2 features, including cpusets and memory controller. -circular-buffers.txt - - how to make use of the existing circular buffer infrastructure clk.txt - info on the common clock framework cma/ diff --git a/Documentation/circular-buffers.txt b/Documentation/core-api/circular-buffers.rst similarity index 100% rename from Documentation/circular-buffers.txt rename to Documentation/core-api/circular-buffers.rst diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index d4d71ee564ae..3864de589126 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -26,6 +26,7 @@ Core utilities genalloc errseq printk-formats + circular-buffers Interfaces for kernel debugging =============================== diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 983249906fc6..33b8bc9573f8 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -3083,7 +3083,7 @@ CIRCULAR BUFFERS Memory barriers can be used to implement circular buffering without the need of a lock to serialise the producer with the consumer. See: - Documentation/circular-buffers.txt + Documentation/core-api/circular-buffers.rst for details. 
diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt index 081937577c1a..2ec5fe0c9cf4 100644 --- a/Documentation/translations/ko_KR/memory-barriers.txt +++ b/Documentation/translations/ko_KR/memory-barriers.txt @@ -3023,7 +3023,7 @@ smp_mb() 가 아니라 virt_mb() 를 사용해야 합니다. 동기화에 락을 사용하지 않고 구현하는데에 사용될 수 있습니다. 더 자세한 내용을 위해선 다음을 참고하세요: - Documentation/circular-buffers.txt + Documentation/core-api/circular-buffers.rst ========= From 18bcaa4e617c04043e46e70c54753d42cf6728f4 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 7 May 2018 06:35:44 -0300 Subject: [PATCH 083/103] docs: driver-api: add clk documentation The clk.rst is already in ReST format. So, move it to the driver-api guide, where it belongs. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/00-INDEX | 2 -- Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/{clk.txt => driver-api/clk.rst} | 0 Documentation/driver-api/index.rst | 1 + 4 files changed, 2 insertions(+), 3 deletions(-) rename Documentation/{clk.txt => driver-api/clk.rst} (100%) diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 6e141c05f3d2..a50d2380b6fb 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -82,8 +82,6 @@ cgroup-v1/ - cgroups v1 features, including cpusets and memory controller. cgroup-v2.txt - cgroups v2 features, including cpusets and memory controller. -clk.txt - - info on the common clock framework cma/ - Continuous Memory Area (CMA) debugfs interface. conf.py diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 865a24e4d516..42f3e2884e7c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -518,7 +518,7 @@ those clocks in any way. 
This parameter is useful for debug and development, but should not be needed on a platform with proper driver support. For more - information, see Documentation/clk.txt. + information, see Documentation/driver-api/clk.rst. clock= [BUGS=X86-32, HW] gettimeofday clocksource override. [Deprecated] diff --git a/Documentation/clk.txt b/Documentation/driver-api/clk.rst similarity index 100% rename from Documentation/clk.txt rename to Documentation/driver-api/clk.rst diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 3ac51c94f97b..5d04296f5ce0 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -17,6 +17,7 @@ available subsections can be seen below. basics infrastructure pm/index + clk device-io device_connection dma-buf From a9251553c2255b2582654c0f239941ef4d830f18 Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Fri, 4 May 2018 23:11:49 +0200 Subject: [PATCH 084/103] Documentation: refcount-vs-atomic: Update reference to LKMM doc. The LKMM project has moved to 'tools/memory-model/'. Signed-off-by: Andrea Parri Signed-off-by: Jonathan Corbet --- Documentation/core-api/refcount-vs-atomic.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst index 83351c258cdb..322851bada16 100644 --- a/Documentation/core-api/refcount-vs-atomic.rst +++ b/Documentation/core-api/refcount-vs-atomic.rst @@ -17,7 +17,7 @@ in order to help maintainers validate their code against the change in these memory ordering guarantees. The terms used through this document try to follow the formal LKMM defined in -github.com/aparri/memory-model/blob/master/Documentation/explanation.txt +tools/memory-model/Documentation/explanation.txt. memory-barriers.txt and atomic_t.txt provide more background to the memory ordering in general and for atomic operations specifically. 
From c7c527dd6e7687a2479c8508eda6e4b19fc6aebb Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Thu, 10 May 2018 09:59:02 -0600 Subject: [PATCH 085/103] Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'" The removal of this file appears to have been premature; it's not a feature enabled by Kconfig, but it's an arch-level feature regardless. Put it back for now until some happy future time when we decide how we really want to document such features. This reverts commit 2bef69a385b4c1c01d8abae0aa035f0ffa051f07. Signed-off-by: Jonathan Corbet --- .../features/vm/pte_special/arch-support.txt | 33 +++++++++++++++++++ 1 file changed, 33 insertions(+) create mode 100644 Documentation/features/vm/pte_special/arch-support.txt diff --git a/Documentation/features/vm/pte_special/arch-support.txt b/Documentation/features/vm/pte_special/arch-support.txt new file mode 100644 index 000000000000..6a608a6dcf71 --- /dev/null +++ b/Documentation/features/vm/pte_special/arch-support.txt @@ -0,0 +1,33 @@ +# +# Feature name: pte_special +# Kconfig: __HAVE_ARCH_PTE_SPECIAL +# description: arch supports the pte_special()/pte_mkspecial() VM APIs +# + ----------------------- + | arch |status| + ----------------------- + | alpha: | TODO | + | arc: | ok | + | arm: | ok | + | arm64: | ok | + | c6x: | TODO | + | h8300: | TODO | + | hexagon: | TODO | + | ia64: | TODO | + | m68k: | TODO | + | microblaze: | TODO | + | mips: | TODO | + | nds32: | TODO | + | nios2: | TODO | + | openrisc: | TODO | + | parisc: | TODO | + | powerpc: | ok | + | riscv: | TODO | + | s390: | ok | + | sh: | ok | + | sparc: | ok | + | um: | TODO | + | unicore32: | TODO | + | x86: | ok | + | xtensa: | TODO | + ----------------------- From b6e9d06789fc37aeac36da41307d1b55a5192778 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 9 May 2018 10:18:45 -0300 Subject: [PATCH 086/103] docs: admin-guide: add cgroup-v2 documentation The cgroup-v2.txt is already in ReST format.
So, move it to the admin-guide, where it belongs. Cc: Li Zefan Cc: Johannes Weiner Acked-by: Tejun Heo Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/00-INDEX | 2 -- Documentation/{cgroup-v2.txt => admin-guide/cgroup-v2.rst} | 0 Documentation/admin-guide/index.rst | 1 + 3 files changed, 1 insertion(+), 2 deletions(-) rename Documentation/{cgroup-v2.txt => admin-guide/cgroup-v2.rst} (100%) diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index a50d2380b6fb..2754fe83f0d4 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -80,8 +80,6 @@ cdrom/ - directory with information on the CD-ROM drivers that Linux has. cgroup-v1/ - cgroups v1 features, including cpusets and memory controller. -cgroup-v2.txt - - cgroups v2 features, including cpusets and memory controller. cma/ - Continuous Memory Area (CMA) debugfs interface. conf.py diff --git a/Documentation/cgroup-v2.txt b/Documentation/admin-guide/cgroup-v2.rst similarity index 100% rename from Documentation/cgroup-v2.txt rename to Documentation/admin-guide/cgroup-v2.rst diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 52eb3408f9a0..48d70af11652 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -48,6 +48,7 @@ configure specific aspects of kernel behavior to your liking. :maxdepth: 1 initrd + cgroup-v2 serial-console braille-console parport From f27e1d244b56986e220cd24109b89cb00f87e997 Mon Sep 17 00:00:00 2001 From: Justin Skists Date: Thu, 10 May 2018 20:37:37 +0100 Subject: [PATCH 087/103] Documentation/process/posting: wrap text at 80 cols Trivial patch to adjust the text formatting to wrap at 80 columns. No actual content has changed. 
Signed-off-by: Justin Skists Signed-off-by: Jonathan Corbet --- Documentation/process/5.Posting.rst | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/Documentation/process/5.Posting.rst b/Documentation/process/5.Posting.rst index c209d70da66f..c418c5d6cae4 100644 --- a/Documentation/process/5.Posting.rst +++ b/Documentation/process/5.Posting.rst @@ -10,8 +10,8 @@ of conventions and procedures which are used in the posting of patches; following them will make life much easier for everybody involved. This document will attempt to cover these expectations in reasonable detail; more information can also be found in the files process/submitting-patches.rst, -process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel documentation -directory. +process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel +documentation directory. When to post @@ -198,8 +198,8 @@ pass it to diff with the "-X" option. The tags mentioned above are used to describe how various developers have been associated with the development of this patch. They are described in -detail in the process/submitting-patches.rst document; what follows here is a brief -summary. Each of these lines has the format: +detail in the process/submitting-patches.rst document; what follows here is a +brief summary. Each of these lines has the format: :: @@ -210,8 +210,8 @@ The tags in common use are: - Signed-off-by: this is a developer's certification that he or she has the right to submit the patch for inclusion into the kernel. It is an agreement to the Developer's Certificate of Origin, the full text of - which can be found in Documentation/process/submitting-patches.rst. Code without a - proper signoff cannot be merged into the mainline. + which can be found in Documentation/process/submitting-patches.rst. Code + without a proper signoff cannot be merged into the mainline. 
- Co-developed-by: states that the patch was also created by another developer along with the original author. This is useful at times when multiple @@ -226,8 +226,8 @@ The tags in common use are: it to work. - Reviewed-by: the named developer has reviewed the patch for correctness; - see the reviewer's statement in Documentation/process/submitting-patches.rst for more - detail. + see the reviewer's statement in Documentation/process/submitting-patches.rst + for more detail. - Reported-by: names a user who reported a problem which is fixed by this patch; this tag is used to give credit to the (often underappreciated) From 9167b942d53c96e7ddeb59ee0f84930955d7b682 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 7 May 2018 06:35:54 -0300 Subject: [PATCH 088/103] w1: w1_io.c: fix a kernel-doc warning Add a blank line to avoid this Sphinx warning: ./drivers/w1/w1_io.c:197: WARNING: Definition list ends without a blank line; unexpected unindent. Signed-off-by: Mauro Carvalho Chehab Acked-by: Evgeniy Polyakov Signed-off-by: Jonathan Corbet --- drivers/w1/w1_io.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/w1/w1_io.c b/drivers/w1/w1_io.c index 075d120e7b88..0364d3329c52 100644 --- a/drivers/w1/w1_io.c +++ b/drivers/w1/w1_io.c @@ -194,6 +194,7 @@ static u8 w1_read_bit(struct w1_master *dev) * bit 0 = id_bit * bit 1 = comp_bit * bit 2 = dir_taken + * * If both bits 0 & 1 are set, the search should be restarted. * * Return: bit fields - see above From 8d420f6c27c530e25c057a59575e193dd62bb743 Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Wed, 9 May 2018 16:23:41 +0800 Subject: [PATCH 089/103] mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback Add document for newly added thp_swpout, thp_swpout_fallback fields in /proc/vmstat. Signed-off-by: "Huang, Ying" Cc: "Kirill A. 
Shutemov" Cc: Andrea Arcangeli Cc: Johannes Weiner Signed-off-by: Jonathan Corbet --- Documentation/vm/transhuge.rst | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/Documentation/vm/transhuge.rst b/Documentation/vm/transhuge.rst index 569d182cc973..2c6867fca6ff 100644 --- a/Documentation/vm/transhuge.rst +++ b/Documentation/vm/transhuge.rst @@ -355,6 +355,15 @@ thp_zero_page_alloc_failed is incremented if kernel fails to allocate huge zero page and falls back to using small pages. +thp_swpout + is incremented every time a huge page is swapout in one + piece without splitting. + +thp_swpout_fallback + is incremented if a huge page has to be split before swapout. + Usually because failed to allocate some continuous swap space + for the huge page. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help From 02a43659e15893a6611cfc10dc7aae1746eb0cdc Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 9 May 2018 10:18:48 -0300 Subject: [PATCH 090/103] docs: uio-howto.rst: use a code block to solve a warning /devel/v4l/docs/Documentation/driver-api/uio-howto.rst:715: WARNING: Unexpected indentation. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/driver-api/uio-howto.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Documentation/driver-api/uio-howto.rst b/Documentation/driver-api/uio-howto.rst index 92056c20e070..fb2eb73be4a3 100644 --- a/Documentation/driver-api/uio-howto.rst +++ b/Documentation/driver-api/uio-howto.rst @@ -711,7 +711,8 @@ The vmbus device regions are mapped into uio device resources: If a subchannel is created by a request to host, then the uio_hv_generic device driver will create a sysfs binary file for the per-channel ring buffer. 
-For example: +For example:: + /sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-74fc1084c757/channels/21/ring Further information From d26560950b6ba6454c11cd978d3e6bb4d38430e8 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 9 May 2018 10:18:49 -0300 Subject: [PATCH 091/103] scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode The original shell script works, but: 1) it is too slow; 2) it is hard to exclude regex patterns Convert it to perl. Here, the new version is able to check the entire tree in less than a second (after cached): real 0m0,284s user 0m0,668s sys 0m0,778s The old version takes more than a minute to complete (also after cached): real 1m17,905s user 0m25,583s sys 0m55,334s It also produces fewer false positives (if any). The new script also contains an auto-fix mode. Usually, file references get lost when they're moved to some other place and/or renamed to .rst. Add an experimental mode to auto-fix those. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- scripts/documentation-file-ref-check | 123 ++++++++++++++++++++++++--- 1 file changed, 112 insertions(+), 11 deletions(-) diff --git a/scripts/documentation-file-ref-check b/scripts/documentation-file-ref-check index bc1659900e89..2520bc14ffac 100755 --- a/scripts/documentation-file-ref-check +++ b/scripts/documentation-file-ref-check @@ -1,15 +1,116 @@ -#!/bin/sh +#!/usr/bin/env perl +# SPDX-License-Identifier: GPL-2.0 +# # Treewide grep for references to files under Documentation, and report # non-existing files in stderr. -for f in $(git ls-files); do - for ref in $(grep -ho "Documentation/[A-Za-z0-9_.,~/*+-]*" "$f"); do - # presume trailing . and , are not part of the name - ref=${ref%%[.,]} +use warnings; +use strict; +use Getopt::Long qw(:config no_auto_abbrev); - # use ls to handle wildcards - if ! 
ls $ref >/dev/null 2>&1; then - echo "$f: $ref" >&2 - fi - done -done +my $scriptname = $0; +$scriptname =~ s,.*/([^/]+/),$1,; + +# Parse arguments +my $help = 0; +my $fix = 0; + +GetOptions( + 'fix' => \$fix, + 'h|help|usage' => \$help, +); + +if ($help != 0) { + print "$scriptname [--help] [--fix-rst]\n"; + exit -1; +} + +# Step 1: find broken references +print "Finding broken references. This may take a while... " if ($fix); + +my %broken_ref; + +open IN, "git grep 'Documentation/'|" + or die "Failed to run git grep"; +while (<IN>) { + next if (!m/^([^:]+):(.*)/); + + my $f = $1; + my $ln = $2; + + # Makefiles contain nasty expressions to parse docs + next if ($f =~ m/Makefile/); + # Skip this script + next if ($f eq $scriptname); + + if ($ln =~ m,\b(\S*)(Documentation/[A-Za-z0-9\_\.\,\~/\*+-]*),) { + my $prefix = $1; + my $ref = $2; + my $base = $2; + + $ref =~ s/[\,\.]+$//; + + my $fulref = "$prefix$ref"; + + $fulref =~ s/^(\ 1) { + print STDERR "WARNING: Won't auto-replace, as found multiple files close to $ref:\n"; + foreach my $j (@find) { + $j =~ s,^./,,; + print STDERR " $j\n"; + } + } else { + $f = $find[0]; + $f =~ s,^./,,; + print "INFO: Replacing $ref to $f\n"; + foreach my $j (qx(git grep -l $ref)) { + qx(sed "s\@$ref\@$f\@g" -i $j); + } + } +} From b971a90f7dd3c3acb5967d1cf1a465d82a368d47 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 9 May 2018 10:18:50 -0300 Subject: [PATCH 092/103] docs: ranoops.rst: fix location of ramoops.txt The location of the dt bindings file is wrong: it was probably badly renamed by some script.
Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/ramoops.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index 4efd7ce77565..6dbcc5481000 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -61,7 +61,7 @@ Setting the ramoops parameters can be done in several different manners: mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1 B. Use Device Tree bindings, as described in - ``Documentation/device-tree/bindings/reserved-memory/admin-guide/ramoops.rst``. + ``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``. For example:: reserved-memory { From cd4957674570d1497e4974d732fd9d6bf384cc37 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jonathan=20Neusch=C3=A4fer?= Date: Wed, 16 May 2018 14:08:00 +0200 Subject: [PATCH 093/103] Documentation: gpio: driver: Fix a typo and some odd grammar MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Jonathan Neuschäfer Signed-off-by: Jonathan Corbet --- Documentation/driver-api/gpio/driver.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/driver-api/gpio/driver.rst b/Documentation/driver-api/gpio/driver.rst index 505ee906d7d9..cbe0242842d1 100644 --- a/Documentation/driver-api/gpio/driver.rst +++ b/Documentation/driver-api/gpio/driver.rst @@ -44,7 +44,7 @@ common to each controller of that type: - methods to establish GPIO line direction - methods used to access GPIO line values - - method to set electrical configuration to a a given GPIO line + - method to set electrical configuration for a given GPIO line - method to return the IRQ number associated to a given GPIO line - flag saying whether calls to its methods may sleep - optional line names array to identify lines @@ -143,7 +143,7 @@ resistor will make the line tend to high level unless 
one of the transistors on the rail actively pulls it down. The level on the line will go as high as the VDD on the pull-up resistor, which -may be higher than the level supported by the transistor, achieveing a +may be higher than the level supported by the transistor, achieving a level-shift to the higher VDD. Integrated electronics often have an output driver stage in the form of a CMOS @@ -382,7 +382,7 @@ Real-Time compliance for GPIO IRQ chips Any provider of irqchips needs to be carefully tailored to support Real Time preemption. It is desirable that all irqchips in the GPIO subsystem keep this -in mind and does the proper testing to assure they are real time-enabled. +in mind and do the proper testing to assure they are real time-enabled. So, pay attention on above " RT_FULL:" notes, please. The following is a checklist to follow when preparing a driver for real time-compliance: From f6bf549a0be2298cf64da10f8a2c315f5a3894d0 Mon Sep 17 00:00:00 2001 From: Thomas Hebb Date: Mon, 14 May 2018 17:55:10 -0400 Subject: [PATCH 094/103] Documentation: arm: clean up Marvell Berlin family info Remove dead links, make spacing consistent, and note that the family was acquired by Synaptics in 2017. 
Signed-off-by: Thomas Hebb Signed-off-by: Jonathan Corbet --- Documentation/arm/Marvell/README | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/Documentation/arm/Marvell/README b/Documentation/arm/Marvell/README index b5bb7f518840..56ada27c53be 100644 --- a/Documentation/arm/Marvell/README +++ b/Documentation/arm/Marvell/README @@ -302,19 +302,15 @@ Berlin family (Multimedia Solutions) 88DE3010, Armada 1000 (no Linux support) Core: Marvell PJ1 (ARMv5TE), Dual-core Product Brief: http://www.marvell.com.cn/digital-entertainment/assets/armada_1000_pb.pdf - 88DE3005, Armada 1500-mini 88DE3005, Armada 1500 Mini Design name: BG2CD Core: ARM Cortex-A9, PL310 L2CC - Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini/ - 88DE3006, Armada 1500 Mini Plus - Design name: BG2CDP - Core: Dual Core ARM Cortex-A7 - Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini-plus/ + 88DE3006, Armada 1500 Mini Plus + Design name: BG2CDP + Core: Dual Core ARM Cortex-A7 88DE3100, Armada 1500 Design name: BG2 Core: Marvell PJ4B-MP (ARMv7), Tauros3 L2CC - Product Brief: http://www.marvell.com/digital-entertainment/armada-1500/assets/Marvell-ARMADA-1500-Product-Brief.pdf 88DE3114, Armada 1500 Pro Design name: BG2Q Core: Quad Core ARM Cortex-A9, PL310 L2CC @@ -324,13 +320,16 @@ Berlin family (Multimedia Solutions) 88DE3218, ARMADA 1500 Ultra Core: ARM Cortex-A53 - Homepage: http://www.marvell.com/multimedia-solutions/ + Homepage: https://www.synaptics.com/products/multimedia-solutions Directory: arch/arm/mach-berlin Comments: + * This line of SoCs is based on Marvell Sheeva or ARM Cortex CPUs with Synopsys DesignWare (IRQ, GPIO, Timers, ...) and PXA IP (SDHCI, USB, ETH, ...). + * The Berlin family was acquired by Synaptics from Marvell in 2017. 
+ CPU Cores --------- From 07a83038a35164dbc37f604c63b347b1fe46a205 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Mon, 14 May 2018 11:13:38 +0300 Subject: [PATCH 095/103] docs/vm: transhuge: change sections order so that userspace interface and implementation description will be grouped together Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/transhuge.rst | 82 +++++++++++++++++----------------- 1 file changed, 41 insertions(+), 41 deletions(-) diff --git a/Documentation/vm/transhuge.rst b/Documentation/vm/transhuge.rst index 2c6867fca6ff..56d04cbb471f 100644 --- a/Documentation/vm/transhuge.rst +++ b/Documentation/vm/transhuge.rst @@ -38,31 +38,6 @@ are using hugepages but a significant speedup already happens if only one of the two is using hugepages just because of the fact the TLB miss is going to run faster. -Design -====== - -- "graceful fallback": mm components which don't have transparent hugepage - knowledge fall back to breaking huge pmd mapping into table of ptes and, - if necessary, split a transparent hugepage. Therefore these components - can continue working on the regular pages or regular pte mappings. 
- -- if a hugepage allocation fails because of memory fragmentation, - regular pages should be gracefully allocated instead and mixed in - the same vma without any failure or significant delay and without - userland noticing - -- if some task quits and more hugepages become available (either - immediately in the buddy or through the VM), guest physical memory - backed by regular pages should be relocated on hugepages - automatically (with khugepaged) - -- it doesn't require memory reservation and in turn it uses hugepages - whenever possible (the only possible reservation here is kernelcore= - to avoid unmovable pages to fragment all the memory but such a tweak - is not specific to transparent hugepage support and it's a generic - feature that applies to all dynamic high order allocations in the - kernel) - Transparent Hugepage Support maximizes the usefulness of free memory if compared to the reservation approach of hugetlbfs by allowing all unused memory to be used as cache or other movable (or even unmovable @@ -401,6 +376,47 @@ tracer to record how long was spent in __alloc_pages_nodemask and using the mm_page_alloc tracepoint to identify which allocations were for huge pages. +Optimizing the applications +=========================== + +To be guaranteed that the kernel will map a 2M page immediately in any +memory region, the mmap region has to be hugepage naturally +aligned. posix_memalign() can provide that guarantee. + +Hugetlbfs +========= + +You can use hugetlbfs on a kernel that has transparent hugepage +support enabled just fine as always. No difference can be noted in +hugetlbfs other than there will be less overall fragmentation. All +usual features belonging to hugetlbfs are preserved and +unaffected. libhugetlbfs will also work fine as usual. 
+ +Design principles +================= + +- "graceful fallback": mm components which don't have transparent hugepage + knowledge fall back to breaking huge pmd mapping into table of ptes and, + if necessary, split a transparent hugepage. Therefore these components + can continue working on the regular pages or regular pte mappings. + +- if a hugepage allocation fails because of memory fragmentation, + regular pages should be gracefully allocated instead and mixed in + the same vma without any failure or significant delay and without + userland noticing + +- if some task quits and more hugepages become available (either + immediately in the buddy or through the VM), guest physical memory + backed by regular pages should be relocated on hugepages + automatically (with khugepaged) + +- it doesn't require memory reservation and in turn it uses hugepages + whenever possible (the only possible reservation here is kernelcore= + to avoid unmovable pages to fragment all the memory but such a tweak + is not specific to transparent hugepage support and it's a generic + feature that applies to all dynamic high order allocations in the + kernel) + get_user_pages and follow_page ============================== @@ -432,22 +448,6 @@ hugepages being returned (as it's not only checking the pfn of the page and pinning it during the copy but it pretends to migrate the memory in regular page sizes and with regular pte/pmd mappings). -Optimizing the applications -=========================== - -To be guaranteed that the kernel will map a 2M page immediately in any -memory region, the mmap region has to be hugepage naturally -aligned. posix_memalign() can provide that guarantee. - -Hugetlbfs -========= - -You can use hugetlbfs on a kernel that has transparent hugepage -support enabled just fine as always. No difference can be noted in -hugetlbfs other than there will be less overall fragmentation. All -usual features belonging to hugetlbfs are preserved and -unaffected. 
libhugetlbfs will also work fine as usual. - Graceful fallback ================= From aa00eaa9afb0cc350590668ba6a9ecd99cfd3ad7 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Mon, 14 May 2018 11:13:39 +0300 Subject: [PATCH 096/103] docs/vm: transhuge: minor updates Some formatting changes and addition of a sentence introducing khugepaged Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/transhuge.rst | 47 ++++++++++++++++++++++++++-------- 1 file changed, 36 insertions(+), 11 deletions(-) diff --git a/Documentation/vm/transhuge.rst b/Documentation/vm/transhuge.rst index 56d04cbb471f..47c7e4742bc2 100644 --- a/Documentation/vm/transhuge.rst +++ b/Documentation/vm/transhuge.rst @@ -9,14 +9,19 @@ Objective Performance critical computing applications dealing with large memory working sets are already running on top of libhugetlbfs and in turn -hugetlbfs. Transparent Hugepage Support is an alternative means of +hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of using huge pages for the backing of virtual memory with huge pages that supports the automatic promotion and demotion of page sizes and without the shortcomings of hugetlbfs. -Currently it only works for anonymous memory mappings and tmpfs/shmem. +Currently THP only works for anonymous memory mappings and tmpfs/shmem. But in the future it can expand to other filesystems. +.. note:: + in the examples below we presume that the basic page size is 4K and + the huge page size is 2M, although the actual numbers may vary + depending on the CPU architecture. + The reason applications are running faster is because of two factors. The first factor is almost completely irrelevant and it's not of significant interest because it'll also have the downside of @@ -28,15 +33,27 @@ only matters the first time the memory is accessed for the lifetime of a memory mapping. 
The second long lasting and much more important factor will affect all subsequent accesses to the memory for the whole runtime of the application. The second factor consist of two -components: 1) the TLB miss will run faster (especially with -virtualization using nested pagetables but almost always also on bare -metal without virtualization) and 2) a single TLB entry will be -mapping a much larger amount of virtual memory in turn reducing the -number of TLB misses. With virtualization and nested pagetables the -TLB can be mapped of larger size only if both KVM and the Linux guest -are using hugepages but a significant speedup already happens if only -one of the two is using hugepages just because of the fact the TLB -miss is going to run faster. +components: + +1) the TLB miss will run faster (especially with virtualization using + nested pagetables but almost always also on bare metal without + virtualization) + +2) a single TLB entry will be mapping a much larger amount of virtual + memory in turn reducing the number of TLB misses. With + virtualization and nested pagetables the TLB can be mapped of + larger size only if both KVM and the Linux guest are using + hugepages but a significant speedup already happens if only one of + the two is using hugepages just because of the fact the TLB miss is + going to run faster. + +THP can be enabled system wide or restricted to certain tasks or even +memory ranges inside task's address space. Unless THP is completely +disabled, there is ``khugepaged`` daemon that scans memory and +collapses sequences of basic pages into huge pages. + +The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>` +interface and using madvise(2) and prctl(2) system calls. 
Transparent Hugepage Support maximizes the usefulness of free memory if compared to the reservation approach of hugetlbfs by allowing all @@ -69,9 +86,14 @@ Applications that gets a lot of benefit from hugepages and that don't risk to lose memory by using hugepages, should use madvise(MADV_HUGEPAGE) on their critical mmapped regions. +.. _thp_sysfs: + sysfs ===== +Global THP controls +------------------- + Transparent Hugepage Support for anonymous memory can be entirely disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to avoid the risk of consuming more memory resources) or enabled @@ -142,6 +164,9 @@ khugepaged will be automatically started when transparent_hugepage/enabled is set to "always" or "madvise, and it'll be automatically shutdown if it's set to "never". +Khugepaged controls +------------------- + khugepaged runs usually at low frequency so while one may not want to invoke defrag algorithms synchronously during the page faults, it should be worth invoking defrag at least in khugepaged. However it's From 45c9a74f648a76e1118cf8024d11cba54bd64e37 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Mon, 14 May 2018 11:13:40 +0300 Subject: [PATCH 097/103] docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge Now that the administrative information for transparent huge pages is nicely separated, move it to its own page under the admin guide. 
Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- .../admin-guide/kernel-parameters.txt | 3 +- Documentation/admin-guide/mm/index.rst | 1 + Documentation/admin-guide/mm/transhuge.rst | 418 ++++++++++++++++++ Documentation/vm/transhuge.rst | 414 +---------------- 4 files changed, 423 insertions(+), 413 deletions(-) create mode 100644 Documentation/admin-guide/mm/transhuge.rst diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 42f3e2884e7c..8d24270644a1 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4313,7 +4313,8 @@ Format: [always|madvise|never] Can be used to control the default behavior of the system with respect to transparent hugepages. - See Documentation/vm/transhuge.rst for more details. + See Documentation/admin-guide/mm/transhuge.rst + for more details. tsc= Disable clocksource stability checks for TSC. Format: diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index a69aa69af255..8454be638108 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -27,4 +27,5 @@ the Linux memory management. numa_memory_policy pagemap soft-dirty + transhuge userfaultfd diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst new file mode 100644 index 000000000000..7ab93a8404b9 --- /dev/null +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -0,0 +1,418 @@ +.. _admin_guide_transhuge: + +============================ +Transparent Hugepage Support +============================ + +Objective +========= + +Performance critical computing applications dealing with large memory +working sets are already running on top of libhugetlbfs and in turn +hugetlbfs. 
Transparent HugePage Support (THP) is an alternative mean of +using huge pages for the backing of virtual memory with huge pages +that supports the automatic promotion and demotion of page sizes and +without the shortcomings of hugetlbfs. + +Currently THP only works for anonymous memory mappings and tmpfs/shmem. +But in the future it can expand to other filesystems. + +.. note:: + in the examples below we presume that the basic page size is 4K and + the huge page size is 2M, although the actual numbers may vary + depending on the CPU architecture. + +The reason applications are running faster is because of two +factors. The first factor is almost completely irrelevant and it's not +of significant interest because it'll also have the downside of +requiring larger clear-page copy-page in page faults which is a +potentially negative effect. The first factor consists in taking a +single page fault for each 2M virtual region touched by userland (so +reducing the enter/exit kernel frequency by a 512 times factor). This +only matters the first time the memory is accessed for the lifetime of +a memory mapping. The second long lasting and much more important +factor will affect all subsequent accesses to the memory for the whole +runtime of the application. The second factor consist of two +components: + +1) the TLB miss will run faster (especially with virtualization using + nested pagetables but almost always also on bare metal without + virtualization) + +2) a single TLB entry will be mapping a much larger amount of virtual + memory in turn reducing the number of TLB misses. With + virtualization and nested pagetables the TLB can be mapped of + larger size only if both KVM and the Linux guest are using + hugepages but a significant speedup already happens if only one of + the two is using hugepages just because of the fact the TLB miss is + going to run faster. 
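The two factors described above come down to simple arithmetic, sketched here under the 4K/2M page sizes this document presumes. The helper names and the 64-entry TLB are hypothetical values chosen only for illustration; they do not describe any particular CPU.

```python
BASE_PAGE = 4 * 1024          # basic page size presumed in this document (4K)
HUGE_PAGE = 2 * 1024 * 1024   # huge page size presumed in this document (2M)


def faults_to_touch(region_bytes, page_size):
    """Page faults needed to touch every page of a region once (ceiling division)."""
    return -(-region_bytes // page_size)


def tlb_reach(entries, page_size):
    """Bytes of virtual memory covered by `entries` TLB entries of one page size."""
    return entries * page_size


if __name__ == "__main__":
    region = 1024 * 1024 * 1024  # a 1G mapping, purely as an example
    small = faults_to_touch(region, BASE_PAGE)
    huge = faults_to_touch(region, HUGE_PAGE)
    print("faults with 4K pages:", small)
    print("faults with 2M pages:", huge)
    print("reduction factor:", small // huge)
    # 64 entries is a hypothetical TLB size used only for illustration.
    print("TLB reach, 4K pages:", tlb_reach(64, BASE_PAGE), "bytes")
    print("TLB reach, 2M pages:", tlb_reach(64, HUGE_PAGE), "bytes")
```

The page-fault saving applies only the first time each page is touched, while the larger TLB reach applies to every later access, which is why the text treats the second factor as the important one.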
+ +THP can be enabled system wide or restricted to certain tasks or even +memory ranges inside task's address space. Unless THP is completely +disabled, there is ``khugepaged`` daemon that scans memory and +collapses sequences of basic pages into huge pages. + +The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>` +interface and using madivse(2) and prctl(2) system calls. + +Transparent Hugepage Support maximizes the usefulness of free memory +if compared to the reservation approach of hugetlbfs by allowing all +unused memory to be used as cache or other movable (or even unmovable +entities). It doesn't require reservation to prevent hugepage +allocation failures to be noticeable from userland. It allows paging +and all other advanced VM features to be available on the +hugepages. It requires no modifications for applications to take +advantage of it. + +Applications however can be further optimized to take advantage of +this feature, like for example they've been optimized before to avoid +a flood of mmap system calls for every malloc(4k). Optimizing userland +is by far not mandatory and khugepaged already can take care of long +lived page allocations even for hugepage unaware applications that +deals with large amounts of memory. + +In certain cases when hugepages are enabled system wide, application +may end up allocating more memory resources. An application may mmap a +large region but only touch 1 byte of it, in that case a 2M page might +be allocated instead of a 4k page for no good. This is why it's +possible to disable hugepages system-wide and to only have them inside +MADV_HUGEPAGE madvise regions. + +Embedded systems should enable hugepages only inside madvise regions +to eliminate any risk of wasting any precious byte of memory and to +only run faster. + +Applications that gets a lot of benefit from hugepages and that don't +risk to lose memory by using hugepages, should use +madvise(MADV_HUGEPAGE) on their critical mmapped regions. + +..
_thp_sysfs: + +sysfs +===== + +Global THP controls +------------------- + +Transparent Hugepage Support for anonymous memory can be entirely disabled +(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE +regions (to avoid the risk of consuming more memory resources) or enabled +system wide. This can be achieved with one of:: + + echo always >/sys/kernel/mm/transparent_hugepage/enabled + echo madvise >/sys/kernel/mm/transparent_hugepage/enabled + echo never >/sys/kernel/mm/transparent_hugepage/enabled + +It's also possible to limit defrag efforts in the VM to generate +anonymous hugepages in case they're not immediately free to madvise +regions or to never try to defrag memory and simply fallback to regular +pages unless hugepages are immediately available. Clearly if we spend CPU +time to defrag memory, we would expect to gain even more by the fact we +use hugepages later instead of regular pages. This isn't always +guaranteed, but it may be more likely in case the allocation is for a +MADV_HUGEPAGE region. + +:: + + echo always >/sys/kernel/mm/transparent_hugepage/defrag + echo defer >/sys/kernel/mm/transparent_hugepage/defrag + echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo never >/sys/kernel/mm/transparent_hugepage/defrag + +always + means that an application requesting THP will stall on + allocation failure and directly reclaim pages and compact + memory in an effort to allocate a THP immediately. This may be + desirable for virtual machines that benefit heavily from THP + use and are willing to delay the VM start to utilise them. + +defer + means that an application will wake kswapd in the background + to reclaim pages and wake kcompactd to compact memory so that + THP is available in the near future. It's the responsibility + of khugepaged to then install the THP pages later. 
+ +defer+madvise + will enter direct reclaim and compaction like ``always``, but + only for regions that have used madvise(MADV_HUGEPAGE); all + other regions will wake kswapd in the background to reclaim + pages and wake kcompactd to compact memory so that THP is + available in the near future. + +madvise + will enter direct reclaim like ``always`` but only for regions + that are have used madvise(MADV_HUGEPAGE). This is the default + behaviour. + +never + should be self-explanatory. + +By default kernel tries to use huge zero page on read page fault to +anonymous mapping. It's possible to disable huge zero page by writing 0 +or enable it back by writing 1:: + + echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page + echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page + +Some userspace (such as a test program, or an optimized memory allocation +library) may want to know the size (in bytes) of a transparent hugepage:: + + cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size + +khugepaged will be automatically started when +transparent_hugepage/enabled is set to "always" or "madvise, and it'll +be automatically shutdown if it's set to "never". + +Khugepaged controls +------------------- + +khugepaged runs usually at low frequency so while one may not want to +invoke defrag algorithms synchronously during the page faults, it +should be worth invoking defrag at least in khugepaged. 
However it's +also possible to disable defrag in khugepaged by writing 0 or enable +defrag in khugepaged by writing 1:: + + echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag + echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag + +You can also control how many pages khugepaged should scan at each +pass:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan + +and how many milliseconds to wait in khugepaged between each pass (you +can set this to 0 to run khugepaged at 100% utilization of one core):: + + /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs + +and how many milliseconds to wait in khugepaged if there's an hugepage +allocation failure to throttle the next allocation attempt:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs + +The khugepaged progress can be seen in the number of pages collapsed:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed + +for each pass:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans + +``max_ptes_none`` specifies how many extra small pages (that are +not already mapped) can be allocated when collapsing a group +of small pages into one large page:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none + +A higher value leads to use additional memory for programs. +A lower value leads to gain less thp performance. Value of +max_ptes_none can waste cpu time very little, you can +ignore it. + +``max_ptes_swap`` specifies how many pages can be brought in from +swap when collapsing a group of pages into a transparent huge page:: + + /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap + +A higher value can cause excessive swap IO and waste +memory. A lower value can prevent THPs from being +collapsed, resulting fewer pages being collapsed into +THPs, and lower memory access performance. 
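The khugepaged knobs listed above can be read together for a quick snapshot. The following is a minimal sketch, assuming each knob file holds a single integer; the function name ``read_khugepaged_knobs`` and the ``base_dir`` parameter are hypothetical, introduced here so the helper can also be pointed at a test directory.

```python
import os

# Default location of the khugepaged knobs described above.
KHUGEPAGED_DIR = "/sys/kernel/mm/transparent_hugepage/khugepaged"

# File names taken from the text above.
KNOBS = (
    "defrag", "pages_to_scan", "scan_sleep_millisecs", "alloc_sleep_millisecs",
    "pages_collapsed", "full_scans", "max_ptes_none", "max_ptes_swap",
)


def read_khugepaged_knobs(base_dir=KHUGEPAGED_DIR):
    """Return {knob: int} for every knob file present under base_dir."""
    values = {}
    for name in KNOBS:
        path = os.path.join(base_dir, name)
        try:
            with open(path) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Knob missing, unreadable, or not a plain integer: skip it.
            continue
    return values


if __name__ == "__main__":
    for name, value in sorted(read_khugepaged_knobs().items()):
        print(f"{name} = {value}")
```

Knobs that are absent on a given kernel are skipped rather than reported as errors, so the same script can run across kernel versions that expose different sets of files.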
+ +Boot parameter +============== + +You can change the sysfs boot time defaults of Transparent Hugepage +Support by passing the parameter ``transparent_hugepage=always`` or +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` +to the kernel command line. + +Hugepages in tmpfs/shmem +======================== + +You can control hugepage allocation policy in tmpfs with mount option +``huge=``. It can have following values: + +always + Attempt to allocate huge pages every time we need a new page; + +never + Do not allocate huge pages; + +within_size + Only allocate huge page if it will be fully within i_size. + Also respect fadvise()/madvise() hints; + +advise + Only allocate huge pages if requested with fadvise()/madvise(); + +The default policy is ``never``. + +``mount -o remount,huge= /mountpoint`` works fine after mount: remounting +``huge=never`` will not attempt to break up huge pages at all, just stop more +from being allocated. + +There's also sysfs knob to control hugepage allocation policy for internal +shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount +is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or +MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. + +In addition to policies listed above, shmem_enabled allows two further +values: + +deny + For use in emergencies, to force the huge option off from + all mounts; +force + Force the huge option on for all - very useful for testing; + +Need of application restart +=========================== + +The transparent_hugepage/enabled values and tmpfs mount option only affect +future behavior. So to make them effective you need to restart any +application that could have been using hugepages. This also applies to the +regions registered in khugepaged. + +Monitoring usage +================ + +The number of anonymous transparent huge pages currently used by the +system is available by reading the AnonHugePages field in ``/proc/meminfo``. 
+To identify what applications are using anonymous transparent huge pages, +it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields +for each mapping. + +The number of file transparent huge pages mapped to userspace is available +by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. +To identify what applications are mapping file transparent huge pages, it +is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields +for each mapping. + +Note that reading the smaps file is expensive and reading it +frequently will incur overhead. + +There are a number of counters in ``/proc/vmstat`` that may be used to +monitor how successfully the system is providing huge pages for use. + +thp_fault_alloc + is incremented every time a huge page is successfully + allocated to handle a page fault. This applies to both the + first time a page is faulted and for COW faults. + +thp_collapse_alloc + is incremented by khugepaged when it has found + a range of pages to collapse into one huge page and has + successfully allocated a new huge page to store the data. + +thp_fault_fallback + is incremented if a page fault fails to allocate + a huge page and instead falls back to using small pages. + +thp_collapse_alloc_failed + is incremented if khugepaged found a range + of pages that should be collapsed into one huge page but failed + the allocation. + +thp_file_alloc + is incremented every time a file huge page is successfully + allocated. + +thp_file_mapped + is incremented every time a file huge page is mapped into + user address space. + +thp_split_page + is incremented every time a huge page is split into base + pages. This can happen for a variety of reasons but a common + reason is that a huge page is old and is being reclaimed. + This action implies splitting all PMD the page mapped with. + +thp_split_page_failed + is incremented if kernel fails to split huge + page. This can happen if the page was pinned by somebody. 
+ +thp_deferred_split_page + is incremented when a huge page is put onto split + queue. This happens when a huge page is partially unmapped and + splitting it would free up some memory. Pages on split queue are + going to be split under memory pressure. + +thp_split_pmd + is incremented every time a PMD split into table of PTEs. + This can happen, for instance, when application calls mprotect() or + munmap() on part of huge page. It doesn't split huge page, only + page table entry. + +thp_zero_page_alloc + is incremented every time a huge zero page is + successfully allocated. It includes allocations which where + dropped due race with other allocation. Note, it doesn't count + every map of the huge zero page, only its allocation. + +thp_zero_page_alloc_failed + is incremented if kernel fails to allocate + huge zero page and falls back to using small pages. + +thp_swpout + is incremented every time a huge page is swapout in one + piece without splitting. + +thp_swpout_fallback + is incremented if a huge page has to be split before swapout. + Usually because failed to allocate some continuous swap space + for the huge page. + +As the system ages, allocating huge pages may be expensive as the +system uses memory compaction to copy data around memory to free a +huge page for use. There are some counters in ``/proc/vmstat`` to help +monitor this overhead. + +compact_stall + is incremented every time a process stalls to run + memory compaction so that a huge page is free for use. + +compact_success + is incremented if the system compacted memory and + freed a huge page for use. + +compact_fail + is incremented if the system tries to compact memory + but failed. + +compact_pages_moved + is incremented each time a page is moved. If + this value is increasing rapidly, it implies that the system + is copying a lot of data to satisfy the huge page allocation. + It is possible that the cost of copying exceeds any savings + from reduced TLB misses. 
+ +compact_pagemigrate_failed + is incremented when the underlying mechanism + for moving a page failed. + +compact_blocks_moved + is incremented each time memory compaction examines + a huge page aligned range of pages. + +It is possible to establish how long the stalls were using the function +tracer to record how long was spent in __alloc_pages_nodemask and +using the mm_page_alloc tracepoint to identify which allocations were +for huge pages. + +Optimizing the applications +=========================== + +To be guaranteed that the kernel will map a 2M page immediately in any +memory region, the mmap region has to be hugepage naturally +aligned. posix_memalign() can provide that guarantee. + +Hugetlbfs +========= + +You can use hugetlbfs on a kernel that has transparent hugepage +support enabled just fine as always. No difference can be noted in +hugetlbfs other than there will be less overall fragmentation. All +usual features belonging to hugetlbfs are preserved and +unaffected. libhugetlbfs will also work fine as usual. diff --git a/Documentation/vm/transhuge.rst b/Documentation/vm/transhuge.rst index 47c7e4742bc2..a8cf6809e36e 100644 --- a/Documentation/vm/transhuge.rst +++ b/Documentation/vm/transhuge.rst @@ -4,418 +4,8 @@ Transparent Hugepage Support ============================ -Objective -========= - -Performance critical computing applications dealing with large memory -working sets are already running on top of libhugetlbfs and in turn -hugetlbfs. Transparent HugePage Support (THP) is an alternative mean of -using huge pages for the backing of virtual memory with huge pages -that supports the automatic promotion and demotion of page sizes and -without the shortcomings of hugetlbfs. - -Currently THP only works for anonymous memory mappings and tmpfs/shmem. -But in the future it can expand to other filesystems. - -.. 
note:: - in the examples below we presume that the basic page size is 4K and - the huge page size is 2M, although the actual numbers may vary - depending on the CPU architecture. - -The reason applications are running faster is because of two -factors. The first factor is almost completely irrelevant and it's not -of significant interest because it'll also have the downside of -requiring larger clear-page copy-page in page faults which is a -potentially negative effect. The first factor consists in taking a -single page fault for each 2M virtual region touched by userland (so -reducing the enter/exit kernel frequency by a 512 times factor). This -only matters the first time the memory is accessed for the lifetime of -a memory mapping. The second long lasting and much more important -factor will affect all subsequent accesses to the memory for the whole -runtime of the application. The second factor consist of two -components: - -1) the TLB miss will run faster (especially with virtualization using - nested pagetables but almost always also on bare metal without - virtualization) - -2) a single TLB entry will be mapping a much larger amount of virtual - memory in turn reducing the number of TLB misses. With - virtualization and nested pagetables the TLB can be mapped of - larger size only if both KVM and the Linux guest are using - hugepages but a significant speedup already happens if only one of - the two is using hugepages just because of the fact the TLB miss is - going to run faster. - -THP can be enabled system wide or restricted to certain tasks or even -memory ranges inside task's address space. Unless THP is completely -disabled, there is ``khugepaged`` daemon that scans memory and -collapses sequences of basic pages into huge pages. - -The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>` -interface and using madivse(2) and prctl(2) system calls.
- -Transparent Hugepage Support maximizes the usefulness of free memory -if compared to the reservation approach of hugetlbfs by allowing all -unused memory to be used as cache or other movable (or even unmovable -entities). It doesn't require reservation to prevent hugepage -allocation failures to be noticeable from userland. It allows paging -and all other advanced VM features to be available on the -hugepages. It requires no modifications for applications to take -advantage of it. - -Applications however can be further optimized to take advantage of -this feature, like for example they've been optimized before to avoid -a flood of mmap system calls for every malloc(4k). Optimizing userland -is by far not mandatory and khugepaged already can take care of long -lived page allocations even for hugepage unaware applications that -deals with large amounts of memory. - -In certain cases when hugepages are enabled system wide, application -may end up allocating more memory resources. An application may mmap a -large region but only touch 1 byte of it, in that case a 2M page might -be allocated instead of a 4k page for no good. This is why it's -possible to disable hugepages system-wide and to only have them inside -MADV_HUGEPAGE madvise regions. - -Embedded systems should enable hugepages only inside madvise regions -to eliminate any risk of wasting any precious byte of memory and to -only run faster. - -Applications that gets a lot of benefit from hugepages and that don't -risk to lose memory by using hugepages, should use -madvise(MADV_HUGEPAGE) on their critical mmapped regions. - -.. _thp_sysfs: - -sysfs -===== - -Global THP controls -------------------- - -Transparent Hugepage Support for anonymous memory can be entirely disabled -(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE -regions (to avoid the risk of consuming more memory resources) or enabled -system wide. 
This can be achieved with one of:: - - echo always >/sys/kernel/mm/transparent_hugepage/enabled - echo madvise >/sys/kernel/mm/transparent_hugepage/enabled - echo never >/sys/kernel/mm/transparent_hugepage/enabled - -It's also possible to limit defrag efforts in the VM to generate -anonymous hugepages in case they're not immediately free to madvise -regions or to never try to defrag memory and simply fallback to regular -pages unless hugepages are immediately available. Clearly if we spend CPU -time to defrag memory, we would expect to gain even more by the fact we -use hugepages later instead of regular pages. This isn't always -guaranteed, but it may be more likely in case the allocation is for a -MADV_HUGEPAGE region. - -:: - - echo always >/sys/kernel/mm/transparent_hugepage/defrag - echo defer >/sys/kernel/mm/transparent_hugepage/defrag - echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag - echo madvise >/sys/kernel/mm/transparent_hugepage/defrag - echo never >/sys/kernel/mm/transparent_hugepage/defrag - -always - means that an application requesting THP will stall on - allocation failure and directly reclaim pages and compact - memory in an effort to allocate a THP immediately. This may be - desirable for virtual machines that benefit heavily from THP - use and are willing to delay the VM start to utilise them. - -defer - means that an application will wake kswapd in the background - to reclaim pages and wake kcompactd to compact memory so that - THP is available in the near future. It's the responsibility - of khugepaged to then install the THP pages later. - -defer+madvise - will enter direct reclaim and compaction like ``always``, but - only for regions that have used madvise(MADV_HUGEPAGE); all - other regions will wake kswapd in the background to reclaim - pages and wake kcompactd to compact memory so that THP is - available in the near future. 
- -madvise - will enter direct reclaim like ``always`` but only for regions - that are have used madvise(MADV_HUGEPAGE). This is the default - behaviour. - -never - should be self-explanatory. - -By default kernel tries to use huge zero page on read page fault to -anonymous mapping. It's possible to disable huge zero page by writing 0 -or enable it back by writing 1:: - - echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page - echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page - -Some userspace (such as a test program, or an optimized memory allocation -library) may want to know the size (in bytes) of a transparent hugepage:: - - cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size - -khugepaged will be automatically started when -transparent_hugepage/enabled is set to "always" or "madvise, and it'll -be automatically shutdown if it's set to "never". - -Khugepaged controls -------------------- - -khugepaged runs usually at low frequency so while one may not want to -invoke defrag algorithms synchronously during the page faults, it -should be worth invoking defrag at least in khugepaged. 
However it's -also possible to disable defrag in khugepaged by writing 0 or enable -defrag in khugepaged by writing 1:: - - echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag - echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag - -You can also control how many pages khugepaged should scan at each -pass:: - - /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan - -and how many milliseconds to wait in khugepaged between each pass (you -can set this to 0 to run khugepaged at 100% utilization of one core):: - - /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs - -and how many milliseconds to wait in khugepaged if there's an hugepage -allocation failure to throttle the next allocation attempt:: - - /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs - -The khugepaged progress can be seen in the number of pages collapsed:: - - /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed - -for each pass:: - - /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans - -``max_ptes_none`` specifies how many extra small pages (that are -not already mapped) can be allocated when collapsing a group -of small pages into one large page:: - - /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none - -A higher value leads to use additional memory for programs. -A lower value leads to gain less thp performance. Value of -max_ptes_none can waste cpu time very little, you can -ignore it. - -``max_ptes_swap`` specifies how many pages can be brought in from -swap when collapsing a group of pages into a transparent huge page:: - - /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap - -A higher value can cause excessive swap IO and waste -memory. A lower value can prevent THPs from being -collapsed, resulting fewer pages being collapsed into -THPs, and lower memory access performance. 
- -Boot parameter -============== - -You can change the sysfs boot time defaults of Transparent Hugepage -Support by passing the parameter ``transparent_hugepage=always`` or -``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` -to the kernel command line. - -Hugepages in tmpfs/shmem -======================== - -You can control hugepage allocation policy in tmpfs with mount option -``huge=``. It can have following values: - -always - Attempt to allocate huge pages every time we need a new page; - -never - Do not allocate huge pages; - -within_size - Only allocate huge page if it will be fully within i_size. - Also respect fadvise()/madvise() hints; - -advise - Only allocate huge pages if requested with fadvise()/madvise(); - -The default policy is ``never``. - -``mount -o remount,huge= /mountpoint`` works fine after mount: remounting -``huge=never`` will not attempt to break up huge pages at all, just stop more -from being allocated. - -There's also sysfs knob to control hugepage allocation policy for internal -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. - -In addition to policies listed above, shmem_enabled allows two further -values: - -deny - For use in emergencies, to force the huge option off from - all mounts; -force - Force the huge option on for all - very useful for testing; - -Need of application restart -=========================== - -The transparent_hugepage/enabled values and tmpfs mount option only affect -future behavior. So to make them effective you need to restart any -application that could have been using hugepages. This also applies to the -regions registered in khugepaged. - -Monitoring usage -================ - -The number of anonymous transparent huge pages currently used by the -system is available by reading the AnonHugePages field in ``/proc/meminfo``. 
-To identify what applications are using anonymous transparent huge pages, -it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields -for each mapping. - -The number of file transparent huge pages mapped to userspace is available -by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. -To identify what applications are mapping file transparent huge pages, it -is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields -for each mapping. - -Note that reading the smaps file is expensive and reading it -frequently will incur overhead. - -There are a number of counters in ``/proc/vmstat`` that may be used to -monitor how successfully the system is providing huge pages for use. - -thp_fault_alloc - is incremented every time a huge page is successfully - allocated to handle a page fault. This applies to both the - first time a page is faulted and for COW faults. - -thp_collapse_alloc - is incremented by khugepaged when it has found - a range of pages to collapse into one huge page and has - successfully allocated a new huge page to store the data. - -thp_fault_fallback - is incremented if a page fault fails to allocate - a huge page and instead falls back to using small pages. - -thp_collapse_alloc_failed - is incremented if khugepaged found a range - of pages that should be collapsed into one huge page but failed - the allocation. - -thp_file_alloc - is incremented every time a file huge page is successfully - allocated. - -thp_file_mapped - is incremented every time a file huge page is mapped into - user address space. - -thp_split_page - is incremented every time a huge page is split into base - pages. This can happen for a variety of reasons but a common - reason is that a huge page is old and is being reclaimed. - This action implies splitting all PMD the page mapped with. - -thp_split_page_failed - is incremented if kernel fails to split huge - page. This can happen if the page was pinned by somebody. 
- -thp_deferred_split_page - is incremented when a huge page is put onto split - queue. This happens when a huge page is partially unmapped and - splitting it would free up some memory. Pages on split queue are - going to be split under memory pressure. - -thp_split_pmd - is incremented every time a PMD split into table of PTEs. - This can happen, for instance, when application calls mprotect() or - munmap() on part of huge page. It doesn't split huge page, only - page table entry. - -thp_zero_page_alloc - is incremented every time a huge zero page is - successfully allocated. It includes allocations which where - dropped due race with other allocation. Note, it doesn't count - every map of the huge zero page, only its allocation. - -thp_zero_page_alloc_failed - is incremented if kernel fails to allocate - huge zero page and falls back to using small pages. - -thp_swpout - is incremented every time a huge page is swapout in one - piece without splitting. - -thp_swpout_fallback - is incremented if a huge page has to be split before swapout. - Usually because failed to allocate some continuous swap space - for the huge page. - -As the system ages, allocating huge pages may be expensive as the -system uses memory compaction to copy data around memory to free a -huge page for use. There are some counters in ``/proc/vmstat`` to help -monitor this overhead. - -compact_stall - is incremented every time a process stalls to run - memory compaction so that a huge page is free for use. - -compact_success - is incremented if the system compacted memory and - freed a huge page for use. - -compact_fail - is incremented if the system tries to compact memory - but failed. - -compact_pages_moved - is incremented each time a page is moved. If - this value is increasing rapidly, it implies that the system - is copying a lot of data to satisfy the huge page allocation. - It is possible that the cost of copying exceeds any savings - from reduced TLB misses. 
- -compact_pagemigrate_failed - is incremented when the underlying mechanism - for moving a page failed. - -compact_blocks_moved - is incremented each time memory compaction examines - a huge page aligned range of pages. - -It is possible to establish how long the stalls were using the function -tracer to record how long was spent in __alloc_pages_nodemask and -using the mm_page_alloc tracepoint to identify which allocations were -for huge pages. - -Optimizing the applications -=========================== - -To be guaranteed that the kernel will map a 2M page immediately in any -memory region, the mmap region has to be hugepage naturally -aligned. posix_memalign() can provide that guarantee. - -Hugetlbfs -========= - -You can use hugetlbfs on a kernel that has transparent hugepage -support enabled just fine as always. No difference can be noted in -hugetlbfs other than there will be less overall fragmentation. All -usual features belonging to hugetlbfs are preserved and -unaffected. libhugetlbfs will also work fine as usual. +This document describes design principles Transparent Hugepage (THP) +Support and its interaction with other parts of the memory management. Design principles ================= From 8962e40c19933a11bb5c46216e36ca4d63751c3e Mon Sep 17 00:00:00 2001 From: Tim Bird Date: Wed, 23 May 2018 15:20:14 -0700 Subject: [PATCH 098/103] docs: update kernel versions and dates in tables Every once in a while, we should update the examples to reflect more recent kernel versions. Update the tables describing kernel releases, the merge window, and current longterm maintained kernel, from 2.6-era kernels to 4.x. 
Signed-off-by: Tim Bird Signed-off-by: Jonathan Corbet --- Documentation/process/2.Process.rst | 72 +++++++++++++++-------------- 1 file changed, 38 insertions(+), 34 deletions(-) diff --git a/Documentation/process/2.Process.rst b/Documentation/process/2.Process.rst index ce5561bb3f8e..a9c46dd0706b 100644 --- a/Documentation/process/2.Process.rst +++ b/Documentation/process/2.Process.rst @@ -18,17 +18,17 @@ major kernel release happening every two or three months. The recent release history looks like this: ====== ================= - 2.6.38 March 14, 2011 - 2.6.37 January 4, 2011 - 2.6.36 October 20, 2010 - 2.6.35 August 1, 2010 - 2.6.34 May 15, 2010 - 2.6.33 February 24, 2010 + 4.11 April 30, 2017 + 4.12 July 2, 2017 + 4.13 September 3, 2017 + 4.14 November 12, 2017 + 4.15 January 28, 2018 + 4.16 April 1, 2018 ====== ================= -Every 2.6.x release is a major kernel release with new features, internal -API changes, and more. A typical 2.6 release can contain nearly 10,000 -changesets with changes to several hundred thousand lines of code. 2.6 is +Every 4.x release is a major kernel release with new features, internal +API changes, and more. A typical 4.x release contains about 13,000 +changesets with changes to several hundred thousand lines of code. 4.x is thus the leading edge of Linux kernel development; the kernel uses a rolling development model which is continually integrating major changes. @@ -70,20 +70,19 @@ will get up to somewhere between -rc6 and -rc9 before the kernel is considered to be sufficiently stable and the final 2.6.x release is made. At that point the whole process starts over again.
-As an example, here is how the 2.6.38 development cycle went (all dates in -2011): +As an example, here is how the 4.16 development cycle went (all dates in +2018): ============== =============================== - January 4 2.6.37 stable release - January 18 2.6.38-rc1, merge window closes - January 21 2.6.38-rc2 - February 1 2.6.38-rc3 - February 7 2.6.38-rc4 - February 15 2.6.38-rc5 - February 21 2.6.38-rc6 - March 1 2.6.38-rc7 - March 7 2.6.38-rc8 - March 14 2.6.38 stable release + January 28 4.15 stable release + February 11 4.16-rc1, merge window closes + February 18 4.16-rc2 + February 25 4.16-rc3 + March 4 4.16-rc4 + March 11 4.16-rc5 + March 18 4.16-rc6 + March 25 4.16-rc7 + April 1 4.16 stable release ============== =============================== How do the developers decide when to close the development cycle and create @@ -99,37 +98,42 @@ release is made. In the real world, this kind of perfection is hard to achieve; there are just too many variables in a project of this size. There comes a point where delaying the final release just makes the problem worse; the pile of changes waiting for the next merge window will grow -larger, creating even more regressions the next time around. So most 2.6.x +larger, creating even more regressions the next time around. So most 4.x kernels go out with a handful of known regressions though, hopefully, none of them are serious. Once a stable release is made, its ongoing maintenance is passed off to the "stable team," currently consisting of Greg Kroah-Hartman. The stable team -will release occasional updates to the stable release using the 2.6.x.y +will release occasional updates to the stable release using the 4.x.y numbering scheme. To be considered for an update release, a patch must (1) fix a significant bug, and (2) already be merged into the mainline for the next development kernel. Kernels will typically receive stable updates for a little more than one development cycle past their initial release.
So, -for example, the 2.6.36 kernel's history looked like: +for example, the 4.13 kernel's history looked like: ============== =============================== - October 10 2.6.36 stable release - November 22 2.6.36.1 - December 9 2.6.36.2 - January 7 2.6.36.3 - February 17 2.6.36.4 + September 3 4.13 stable release + September 13 4.13.1 + September 20 4.13.2 + September 27 4.13.3 + October 5 4.13.4 + October 12 4.13.5 + ... ... + November 24 4.13.16 ============== =============================== -2.6.36.4 was the final stable update for the 2.6.36 release. +4.13.16 was the final stable update of the 4.13 release. Some kernels are designated "long term" kernels; they will receive support for a longer period. As of this writing, the current long term kernels and their maintainers are: - ====== ====================== =========================== - 2.6.27 Willy Tarreau (Deep-frozen stable kernel) - 2.6.32 Greg Kroah-Hartman - 2.6.35 Andi Kleen (Embedded flag kernel) + ====== ====================== ============================== + 3.16 Ben Hutchings (very long-term stable kernel) + 4.1 Sasha Levin + 4.4 Greg Kroah-Hartman (very long-term stable kernel) + 4.9 Greg Kroah-Hartman + 4.14 Greg Kroah-Hartman ====== ====================== =========================== The selection of a kernel for long-term support is purely a matter of a From 46ca359955fee63486dc1cfc528ae5692bb16dcd Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 29 May 2018 10:26:44 +0200 Subject: [PATCH 099/103] doc: document scope NOFS, NOIO APIs Although the api is documented in the source code Ted has pointed out that there is no mention in the core-api Documentation and there are people looking there to find answers how to use a specific API. Requested-by: "Theodore Y. 
Ts'o" Reviewed-by: Dave Chinner Signed-off-by: Michal Hocko Signed-off-by: Jonathan Corbet --- .../core-api/gfp_mask-from-fs-io.rst | 61 +++++++++++++++++++ Documentation/core-api/index.rst | 1 + include/linux/sched/mm.h | 38 ++++++++++++ 3 files changed, 100 insertions(+) create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst new file mode 100644 index 000000000000..2dc442b04a77 --- /dev/null +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst @@ -0,0 +1,61 @@ +================================= +GFP masks used from FS/IO context +================================= + +:Date: May, 2018 +:Author: Michal Hocko + +Introduction +============ + +Code paths in the filesystem and IO stacks must be careful when +allocating memory to prevent recursion deadlocks caused by direct +memory reclaim calling back into the FS or IO paths and blocking on +already held resources (e.g. locks - most commonly those used for the +transaction context). + +The traditional way to avoid this deadlock problem is to clear __GFP_FS +respectively __GFP_IO (note the latter implies clearing the first as well) in +the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be +used as shortcut. It turned out though that above approach has led to +abuses when the restricted gfp mask is used "just in case" without a +deeper consideration which leads to problems because an excessive use +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory +reclaim issues. + +New API +======== + +Since 4.12 we do have a generic scope API for both NOFS and NOIO context +``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``, +``memalloc_noio_restore`` which allow to mark a scope to be a critical +section from a filesystem or I/O point of view. 
Any allocation from that +scope will inherently drop __GFP_FS respectively __GFP_IO from the given +mask so no memory allocation can recurse back in the FS/IO. + +FS/IO code then simply calls the appropriate save function before +any critical section with respect to the reclaim is started - e.g. +lock shared with the reclaim context or when a transaction context +nesting would be possible via reclaim. The restore function should be +called when the critical section ends. All that ideally along with an +explanation what is the reclaim context for easier maintenance. + +Please note that the proper pairing of save/restore functions +allows nesting so it is safe to call ``memalloc_noio_save`` or +``memalloc_noio_restore`` respectively from an existing NOIO or NOFS +scope. + +What about __vmalloc(GFP_NOFS) +============================== + +vmalloc doesn't support GFP_NOFS semantic because there are hardcoded +GFP_KERNEL allocations deep inside the allocator which are quite non-trivial +to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is +almost always a bug. The good news is that the NOFS/NOIO semantic can be +achieved by the scope API. + +In the ideal world, upper layers should already mark dangerous contexts +and so no special care is required and vmalloc should be called without +any problems. Sometimes if the context is not really clear or there are +layering violations then the recommended way around that is to wrap ``vmalloc`` +by the scope API with a comment explaining the problem. 
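The save/restore pairing and nesting described above can be modelled in plain userspace C. This is only a sketch under stated assumptions: ``model_task_flags`` stands in for ``current->flags``, ``MODEL_PF_MEMALLOC_NOFS`` is an illustrative bit value rather than the kernel's, all ``model_*`` names are hypothetical, and the line that sets the bit in the save helper is assumed, since it falls outside the diff context quoted in this patch.

```c
#include <assert.h>

/* Illustrative bit; NOT the kernel's real PF_MEMALLOC_NOFS value. */
#define MODEL_PF_MEMALLOC_NOFS 0x1u

/* Userspace stand-in for current->flags (hypothetical). */
static unsigned int model_task_flags;

/* Enter a NOFS scope; returns the previous state of the NOFS bit. */
static unsigned int model_nofs_save(void)
{
    unsigned int flags = model_task_flags & MODEL_PF_MEMALLOC_NOFS;

    /* Assumed step: mark the task as being inside a NOFS scope. */
    model_task_flags |= MODEL_PF_MEMALLOC_NOFS;
    return flags;
}

/* Leave a NOFS scope by putting the bit back to what save returned. */
static void model_nofs_restore(unsigned int flags)
{
    model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOFS) | flags;
}

/* Would an allocation made right now be allowed to recurse into the FS? */
static int model_may_enter_fs(void)
{
    return !(model_task_flags & MODEL_PF_MEMALLOC_NOFS);
}

/* Walk through two nested scopes; returns 1 once both have ended. */
static int model_nested_scope_demo(void)
{
    unsigned int outer, inner;

    outer = model_nofs_save();
    inner = model_nofs_save();          /* nested scope */
    assert(!model_may_enter_fs());

    model_nofs_restore(inner);
    assert(!model_may_enter_fs());      /* outer scope is still active */

    model_nofs_restore(outer);
    return model_may_enter_fs();        /* restriction lifted again */
}
```

Because each restore only reinstates the bit that its own save observed, the inner restore cannot end the outer scope, which is why correctly paired calls can nest safely.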
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 3864de589126..f5a66b72f984 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -27,6 +27,7 @@ Core utilities errseq printk-formats circular-buffers + gfp_mask-from-fs-io Interfaces for kernel debugging =============================== diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 4e1411bbbcfc..76a8cb4ef178 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -170,6 +170,17 @@ static inline void fs_reclaim_acquire(gfp_t gfp_mask) { } static inline void fs_reclaim_release(gfp_t gfp_mask) { } #endif +/** + * memalloc_noio_save - Marks implicit GFP_NOIO allocation scope. + * + * This function marks the beginning of the GFP_NOIO allocation scope. + * All further allocations will implicitly drop __GFP_IO flag and so + * they are safe for the IO critical section from the allocation recursion + * point of view. Use memalloc_noio_restore to end the scope with flags + * returned by this function. + * + * This function is safe to be used from any context. + */ static inline unsigned int memalloc_noio_save(void) { unsigned int flags = current->flags & PF_MEMALLOC_NOIO; @@ -177,11 +188,30 @@ static inline unsigned int memalloc_noio_save(void) return flags; } +/** + * memalloc_noio_restore - Ends the implicit GFP_NOIO scope. + * @flags: Flags to restore. + * + * Ends the implicit GFP_NOIO scope started by memalloc_noio_save function. + * Always make sure that the given flags is the return value from the + * pairing memalloc_noio_save call. + */ static inline void memalloc_noio_restore(unsigned int flags) { current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags; } +/** + * memalloc_nofs_save - Marks implicit GFP_NOFS allocation scope. + * + * This function marks the beginning of the GFP_NOFS allocation scope.
+ * All further allocations will implicitly drop __GFP_FS flag and so + * they are safe for the FS critical section from the allocation recursion + * point of view. Use memalloc_nofs_restore to end the scope with flags + * returned by this function. + * + * This function is safe to be used from any context. + */ static inline unsigned int memalloc_nofs_save(void) { unsigned int flags = current->flags & PF_MEMALLOC_NOFS; @@ -189,6 +219,14 @@ static inline unsigned int memalloc_nofs_save(void) return flags; } +/** + * memalloc_nofs_restore - Ends the implicit GFP_NOFS scope. + * @flags: Flags to restore. + * + * Ends the implicit GFP_NOFS scope started by memalloc_nofs_save function. + * Always make sure that the given flags is the return value from the + * pairing memalloc_nofs_save call. + */ static inline void memalloc_nofs_restore(unsigned int flags) { current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags; From d43f2c98f63a961172172ce4d1b3aeea7fbc0628 Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Tue, 29 May 2018 05:44:58 -0600 Subject: [PATCH 100/103] docs: Use the kerneldoc comments for memalloc_no*() Now that we have kerneldoc comments for memalloc_no{fs,io}_{save,restore}(), go ahead and pull them into the docs. Signed-off-by: Jonathan Corbet --- Documentation/core-api/gfp_mask-from-fs-io.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst index 2dc442b04a77..e0df8f416582 100644 --- a/Documentation/core-api/gfp_mask-from-fs-io.rst +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst @@ -33,6 +33,11 @@ section from a filesystem or I/O point of view. Any allocation from that scope will inherently drop __GFP_FS respectively __GFP_IO from the given mask so no memory allocation can recurse back in the FS/IO. +.. kernel-doc:: include/linux/sched/mm.h + :functions: memalloc_nofs_save memalloc_nofs_restore +..
kernel-doc:: include/linux/sched/mm.h + :functions: memalloc_noio_save memalloc_noio_restore + FS/IO code then simply calls the appropriate save function before any critical section with respect to the reclaim is started - e.g. lock shared with the reclaim context or when a transaction context From ba22931235949cc4d552c86e29bd7cd794412032 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 29 May 2018 13:13:38 +0300 Subject: [PATCH 101/103] docs/vm: move ksm and transhuge from "user" to "internals" section. After the userspace interface description for KSM and THP was split to Documentation/admin-guide/mm, the remaining parts belong to the section describing MM internals. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/vm/index.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 8e1cc667eef1..c4ded22197ca 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -13,8 +13,6 @@ various features of the Linux memory management .. toctree:: :maxdepth: 1 - ksm - transhuge swap_numa zswap @@ -36,6 +34,7 @@ descriptions of data structures and algorithms. hmm hwpoison hugetlbfs_reserv + ksm mmu_notifier numa overcommit-accounting @@ -45,6 +44,7 @@ descriptions of data structures and algorithms. remap_file_pages slub split_page_table_lock + transhuge unevictable-lru z3fold zsmalloc From f462951e87bb1b9e5ffa95a406af38fc399d7b5a Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Tue, 29 May 2018 14:37:25 +0300 Subject: [PATCH 102/103] docs/admin-guide/mm: add high level concepts overview There are terms that seem obvious to the mm developers, but may be somewhat obscure for, say, less involved readers. The concepts overview can be seen as an "extended glossary" that introduces such terms to the readers of the kernel documentation.
Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/mm/concepts.rst | 222 ++++++++++++++++++++++ Documentation/admin-guide/mm/index.rst | 5 + 2 files changed, 227 insertions(+) create mode 100644 Documentation/admin-guide/mm/concepts.rst diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst new file mode 100644 index 000000000000..291699c810d4 --- /dev/null +++ b/Documentation/admin-guide/mm/concepts.rst @@ -0,0 +1,222 @@ +.. _mm_concepts: + +================= +Concepts overview +================= + +The memory management in Linux is a complex system that evolved over the +years and included more and more functionality to support a variety of +systems from MMU-less microcontrollers to supercomputers. The memory +management for systems without an MMU is called ``nommu`` and it +definitely deserves a dedicated document, which hopefully will be +eventually written. Yet, although some of the concepts are the same, +here we assume that an MMU is available and a CPU can translate a virtual +address to a physical address. + +.. contents:: :local: + +Virtual Memory Primer +===================== + +The physical memory in a computer system is a limited resource and +even for systems that support memory hotplug there is a hard limit on +the amount of memory that can be installed. The physical memory is not +necessarily contiguous; it might be accessible as a set of distinct +address ranges. Besides, different CPU architectures, and even +different implementations of the same architecture have different views +of how these address ranges are defined. + +All this makes dealing directly with physical memory quite complex and +to avoid this complexity a concept of virtual memory was developed.
+ +The virtual memory abstracts the details of physical memory from the +application software, allows keeping only the needed information in the +physical memory (demand paging) and provides a mechanism for the +protection and controlled sharing of data between processes. + +With virtual memory, each and every memory access uses a virtual +address. When the CPU decodes an instruction that reads (or +writes) from (or to) the system memory, it translates the `virtual` +address encoded in that instruction to a `physical` address that the +memory controller can understand. + +The physical system memory is divided into page frames, or pages. The +size of each page is architecture specific. Some architectures allow +selection of the page size from several supported values; this +selection is performed at the kernel build time by setting an +appropriate kernel configuration option. + +Each physical memory page can be mapped as one or more virtual +pages. These mappings are described by page tables that allow +translation from a virtual address used by programs to the real address in +the physical memory. The page tables are organized hierarchically. + +The tables at the lowest level of the hierarchy contain physical +addresses of actual pages used by the software. The tables at higher +levels contain physical addresses of the pages belonging to the lower +levels. The pointer to the top level page table resides in a +register. When the CPU performs the address translation, it uses this +register to access the top level page table. The high bits of the +virtual address are used to index an entry in the top level page +table. That entry is then used to access the next level in the +hierarchy with the next bits of the virtual address as the index to +that level page table. The lowest bits in the virtual address define +the offset inside the actual page.
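The way a virtual address is carved into per-level indexes and a page offset can be sketched in C. The layout here is an assumption chosen for illustration (four levels, 9 index bits per level, 4 KiB pages, as commonly used on x86-64); the text above does not fix these numbers, and every ``sketch_*`` name is hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumed: 4 KiB pages */
#define SKETCH_INDEX_BITS 9  /* assumed: 512 entries per table */
#define SKETCH_LEVELS     4  /* assumed: four-level hierarchy */

struct sketch_va_parts {
    unsigned int index[SKETCH_LEVELS]; /* index[0] selects in the top level */
    unsigned int offset;               /* byte offset inside the page */
};

/* Split a virtual address into table indexes and a page offset. */
static struct sketch_va_parts sketch_split_va(uint64_t va)
{
    struct sketch_va_parts p;
    int level;

    /* The lowest bits are the offset inside the page itself. */
    p.offset = (unsigned int)(va & ((1u << SKETCH_PAGE_SHIFT) - 1));

    /* Each level consumes the next group of bits, highest group first. */
    for (level = 0; level < SKETCH_LEVELS; level++) {
        int shift = SKETCH_PAGE_SHIFT +
                    SKETCH_INDEX_BITS * (SKETCH_LEVELS - 1 - level);

        p.index[level] = (unsigned int)((va >> shift) &
                                        ((1u << SKETCH_INDEX_BITS) - 1));
    }
    return p;
}
```

A hardware page walk would use ``index[0]`` on the table referenced by the top-level register, then each following index on the table found at the previous step, and finally add ``offset`` to the physical page address. This sketch only performs the bit slicing, not the walk.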
+ +Huge Pages +========== + +The address translation requires several memory accesses and memory +accesses are slow relative to CPU speed. To avoid spending precious +processor cycles on the address translation, CPUs maintain a cache of +such translations called the Translation Lookaside Buffer (or +TLB). Usually the TLB is a pretty scarce resource and applications with +a large memory working set will experience a performance hit because of +TLB misses. + +Many modern CPU architectures allow mapping of the memory pages +directly by the higher levels in the page table. For instance, on x86, +it is possible to map 2M and even 1G pages using entries in the second +and the third level page tables. In Linux such pages are called +`huge`. Usage of huge pages significantly reduces pressure on TLB, +improves TLB hit-rate and thus improves overall system performance. + +There are two mechanisms in Linux that enable mapping of the physical +memory with the huge pages. The first one is `HugeTLB filesystem`, or +hugetlbfs. It is a pseudo filesystem that uses RAM as its backing +store. For the files created in this filesystem the data resides in +the memory and is mapped using huge pages. The hugetlbfs is described at +:ref:`Documentation/admin-guide/mm/hugetlbpage.rst `. + +Another, more recent, mechanism that enables use of the huge pages is +called `Transparent HugePages`, or THP. Unlike the hugetlbfs that +requires users and/or system administrators to configure what parts of +the system memory should and can be mapped by the huge pages, THP +manages such mappings transparently to the user and hence the +name. See +:ref:`Documentation/admin-guide/mm/transhuge.rst ` +for more details about THP. + +Zones +===== + +Often hardware poses restrictions on how different physical memory +ranges can be accessed. In some cases, devices cannot perform DMA to +all the addressable memory.
In other cases, the size of the physical +memory exceeds the maximal addressable size of virtual memory and +special actions are required to access portions of the memory. Linux +groups memory pages into `zones` according to their possible +usage. For example, ZONE_DMA will contain memory that can be used by +devices for DMA, ZONE_HIGHMEM will contain memory that is not +permanently mapped into the kernel's address space and ZONE_NORMAL will +contain normally addressed pages. + +The actual layout of the memory zones is hardware dependent as not all +architectures define all zones, and requirements for DMA are different +for different platforms. + +Nodes +===== + +Many multi-processor machines are NUMA - Non-Uniform Memory Access - +systems. In such systems the memory is arranged into banks that have +different access latency depending on the "distance" from the +processor. Each bank is referred to as a `node` and for each node Linux +constructs an independent memory management subsystem. A node has its +own set of zones, lists of free and used pages and various statistics +counters. You can find more details about NUMA in +:ref:`Documentation/vm/numa.rst ` and in +:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst `. + +Page cache +========== + +The physical memory is volatile and the common case for getting data +into the memory is to read it from files. Whenever a file is read, the +data is put into the `page cache` to avoid expensive disk access on +the subsequent reads. Similarly, when one writes to a file, the data +is placed in the page cache and eventually gets into the backing +storage device. The written pages are marked as `dirty` and when Linux +decides to reuse them for other purposes, it makes sure to synchronize +the file contents on the device with the updated data. + +Anonymous Memory +================ + +The `anonymous memory` or `anonymous mappings` represent memory that +is not backed by a filesystem.
Such mappings are implicitly created +for a program's stack and heap or by explicit calls to the mmap(2) system +call. Usually, the anonymous mappings only define virtual memory areas +that the program is allowed to access. The read accesses will result +in creation of a page table entry that references a special physical +page filled with zeroes. When the program performs a write, a regular +physical page will be allocated to hold the written data. The page +will be marked dirty and if the kernel decides to repurpose it, +the dirty page will be swapped out. + +Reclaim +======= + +Throughout the system lifetime, a physical page can be used for storing +different types of data. It can be kernel internal data structures, +DMA'able buffers for device driver use, data read from a filesystem, +memory allocated by user space processes etc. + +Depending on the page usage it is treated differently by the Linux +memory management. The pages that can be freed at any time, either +because they cache the data available elsewhere, for instance, on a +hard disk, or because they can be swapped out, again, to the hard +disk, are called `reclaimable`. The most notable categories of the +reclaimable pages are page cache and anonymous memory. + +In most cases, the pages holding internal kernel data and used as DMA +buffers cannot be repurposed, and they remain pinned until freed by +their user. Such pages are called `unreclaimable`. However, in certain +circumstances, even pages occupied with kernel data structures can be +reclaimed. For instance, in-memory caches of filesystem metadata can +be re-read from the storage device and therefore it is possible to +discard them from the main memory when the system is under memory +pressure. + +The process of freeing the reclaimable physical memory pages and +repurposing them is called (surprise!) `reclaim`. Linux can reclaim +pages either asynchronously or synchronously, depending on the state +of the system.
When the system is not loaded, most of the memory is free +and allocation requests will be satisfied immediately from the free +pages supply. As the load increases, the amount of the free pages goes +down and when it reaches a certain threshold (low watermark), an +allocation request will awaken the ``kswapd`` daemon. It will +asynchronously scan memory pages and either just free them if the data +they contain is available elsewhere, or evict to the backing storage +device (remember those dirty pages?). As memory usage increases even +more and reaches another threshold - min watermark - an allocation +will trigger the `direct reclaim`. In this case allocation is stalled +until enough memory pages are reclaimed to satisfy the request. + +Compaction +========== + +As the system runs, tasks allocate and free the memory and it becomes +fragmented. Although with virtual memory it is possible to present +scattered physical pages as a virtually contiguous range, sometimes it is +necessary to allocate large physically contiguous memory areas. Such +need may arise, for instance, when a device driver requires a large +buffer for DMA, or when THP allocates a huge page. Memory `compaction` +addresses the fragmentation issue. This mechanism moves occupied pages +from the lower part of a memory zone to free pages in the upper part +of the zone. When a compaction scan is finished free pages are grouped +together at the beginning of the zone and allocations of large +physically contiguous areas become possible. + +Like reclaim, the compaction may happen asynchronously in the ``kcompactd`` +daemon or synchronously as a result of a memory allocation request. + +OOM killer +========== + +It may happen that on a loaded machine memory will be exhausted. When +the kernel detects that the system runs out of memory (OOM) it invokes +the `OOM killer`. Its mission is simple: all it has to do is to select a +task to sacrifice for the sake of the overall system health.
The +selected task is killed in a hope that after it exits enough memory +will be freed to continue normal operation. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8454be638108..ceead68c2df7 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -15,12 +15,17 @@ are described in Documentation/sysctl/vm.txt and in `man 5 proc`_. .. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html +Linux memory management has its own jargon and if you are not yet +familiar with it, consider reading +:ref:`Documentation/admin-guide/mm/concepts.rst `. + Here we document in detail how to interact with various mechanisms in the Linux memory management. .. toctree:: :maxdepth: 1 + concepts hugetlbpage idle_page_tracking ksm From a49d9c0ae46e149a22aefa8251d07dddd5611851 Mon Sep 17 00:00:00 2001 From: Omar Sandoval Date: Mon, 21 May 2018 11:18:17 -0700 Subject: [PATCH 103/103] Documentation: document hung_task_panic kernel parameter This parameter has been around since commit e162b39a368f ("softlockup: decouple hung tasks check from softlockup detection") in 2009 but was never documented. Signed-off-by: Omar Sandoval Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/kernel-parameters.txt | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 8d24270644a1..5385af53a8ca 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1341,12 +1341,21 @@ x86-64 are 2M (when the CPU supports "pse") and 1G (when the CPU supports the "pdpe1gb" cpuinfo flag). + hung_task_panic= + [KNL] Should the hung task detector generate panics. + Format: + + A nonzero value instructs the kernel to panic when a + hung task is detected. 
The default value is controlled + by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time + option. The value selected by this boot parameter can + be changed later by the kernel.hung_task_panic sysctl. + hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) terminal devices. Valid values: 0..8 hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs. If specified, z/VM IUCV HVC accepts connections from listed z/VM user IDs only. - keep_bootcon [KNL] Do not unregister boot console at start. This is only useful for debugging when something happens in the window