From 641fe2e9387a36f9ee01d7c69382d1fe147a5e98 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 18 Oct 2019 20:19:16 -0700 Subject: [PATCH 01/26] drivers/base/memory.c: don't access uninitialized memmaps in soft_offline_page_store() Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched. Right now, when trying to soft-offline a PFN that resides on a memory block that was never onlined, one gets a misleading error with CONFIG_PAGE_POISONING: :/# echo 5637144576 > /sys/devices/system/memory/soft_offline_page [ 23.097167] soft offline: 0x150000 page already poisoned But the actual result depends on the garbage in the memmap. soft_offline_page() can only work with online pages, it returns -EIO in case of ZONE_DEVICE. Make sure to only forward pages that are online (iow, managed by the buddy) and, therefore, have an initialized memmap. Add a check against pfn_to_online_page() and similarly return -EIO. Link: http://lkml.kernel.org/r/20191010141200.8985-1-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand Acked-by: Naoya Horiguchi Acked-by: Michal Hocko Cc: Greg Kroah-Hartman Cc: "Rafael J. Wysocki" Cc: [4.13+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 6bea4f3f8040..55907c27075b 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -540,6 +540,9 @@ static ssize_t soft_offline_page_store(struct device *dev, pfn >>= PAGE_SHIFT; if (!pfn_valid(pfn)) return -ENXIO; + /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */ + if (!pfn_to_online_page(pfn)) + return -EIO; ret = soft_offline_page(pfn_to_page(pfn), 0); return ret == 0 ? count : ret; } From aad5f69bc161af489dbb5934868bd347282f0764 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 18 Oct 2019 20:19:20 -0700 Subject: [PATCH 02/26] fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c There are three places where we access uninitialized memmaps, namely: - /proc/kpagecount - /proc/kpageflags - /proc/kpagecgroup We have initialized memmaps either when the section is online or when the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. For example, not onlining a DIMM during boot and calling /proc/kpagecount with CONFIG_PAGE_POISONING: :/# cat /proc/kpagecount > tmp.test BUG: unable to handle page fault for address: fffffffffffffffe #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 RIP: 0010:kpagecount_read+0xce/0x1e0 Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 Call Trace: proc_reg_read+0x3c/0x60 vfs_read+0xc5/0x180 ksys_read+0x68/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe For now, let's drop support for ZONE_DEVICE from the three pseudo files in order to fix this. To distinguish offline memory (with garbage memmap) from ZONE_DEVICE memory with properly initialized memmaps, we would have to check get_dev_pagemap() and pfn_zone_device_reserved() right now. The usage of both (especially, special casing devmem) is frowned upon and needs to be reworked. The fundamental issue we have is: if (pfn_to_online_page(pfn)) { /* memmap initialized */ } else if (pfn_valid(pfn)) { /* * ??? * a) offline memory. memmap garbage. * b) devmem: memmap initialized to ZONE_DEVICE. * c) devmem: reserved for driver. memmap garbage. * (d) devmem: memmap currently initializing - garbage) */ } We'll leave the pfn_zone_device_reserved() check in stable_page_flags() in place as that function is also used from memory failure. We now no longer dump information about pages that are not in use anymore - offline. Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand Reported-by: Qian Cai Acked-by: Michal Hocko Cc: Dan Williams Cc: Alexey Dobriyan Cc: Stephen Rothwell Cc: Toshiki Fukasawa Cc: Pankaj gupta Cc: Mike Rapoport Cc: Anthony Yznaga Cc: "Aneesh Kumar K.V" Cc: [4.13+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/page.c | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/fs/proc/page.c b/fs/proc/page.c index 544d1ee15aee..7c952ee732e6 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -42,10 +42,12 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf, return -EINVAL; while (count > 0) { - if (pfn_valid(pfn)) - ppage = pfn_to_page(pfn); - else - ppage = NULL; + /* + * TODO: ZONE_DEVICE support requires to identify + * memmaps that were actually initialized. + */ + ppage = pfn_to_online_page(pfn); + if (!ppage || PageSlab(ppage) || page_has_type(ppage)) pcount = 0; else @@ -216,10 +218,11 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf, return -EINVAL; while (count > 0) { - if (pfn_valid(pfn)) - ppage = pfn_to_page(pfn); - else - ppage = NULL; + /* + * TODO: ZONE_DEVICE support requires to identify + * memmaps that were actually initialized. + */ + ppage = pfn_to_online_page(pfn); if (put_user(stable_page_flags(ppage), out)) { ret = -EFAULT; @@ -261,10 +264,11 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf, return -EINVAL; while (count > 0) { - if (pfn_valid(pfn)) - ppage = pfn_to_page(pfn); - else - ppage = NULL; + /* + * TODO: ZONE_DEVICE support requires to identify + * memmaps that were actually initialized. + */ + ppage = pfn_to_online_page(pfn); if (ppage) ino = page_cgroup_ino(ppage); From 96c804a6ae8c59a9092b3d5dd581198472063184 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 18 Oct 2019 20:19:23 -0700 Subject: [PATCH 03/26] mm/memory-failure.c: don't access uninitialized memmaps in memory_failure() We should check for pfn_to_online_page() to not access uninitialized memmaps. Reshuffle the code so we don't have to duplicate the error message. Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com Signed-off-by: David Hildenbrand Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Acked-by: Naoya Horiguchi Cc: Michal Hocko Cc: [4.13+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 0ae72b6acee7..3151c87dff73 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1257,17 +1257,19 @@ int memory_failure(unsigned long pfn, int flags) if (!sysctl_memory_failure_recovery) panic("Memory failure on page %lx", pfn); - if (!pfn_valid(pfn)) { + p = pfn_to_online_page(pfn); + if (!p) { + if (pfn_valid(pfn)) { + pgmap = get_dev_pagemap(pfn, NULL); + if (pgmap) + return memory_failure_dev_pagemap(pfn, flags, + pgmap); + } pr_err("Memory failure: %#lx: memory outside kernel control\n", pfn); return -ENXIO; } - pgmap = get_dev_pagemap(pfn, NULL); - if (pgmap) - return memory_failure_dev_pagemap(pfn, flags, pgmap); - - p = pfn_to_page(pfn); if (PageHuge(p)) return memory_failure_hugetlb(pfn, flags); if (TestSetPageHWPoison(p)) { From ca210ba32ef7537b02731bfe255ed8eb1e4e2b59 Mon Sep 17 00:00:00 2001 From: Joel Colledge Date: Fri, 18 Oct 2019 20:19:26 -0700 Subject: [PATCH 04/26] scripts/gdb: fix lx-dmesg when CONFIG_PRINTK_CALLER is set When CONFIG_PRINTK_CALLER is set, struct printk_log contains an additional member caller_id. This affects the offset of the log text. Account for this by using the type information from gdb to determine all the offsets instead of using hardcoded values. This fixes following error: (gdb) lx-dmesg Python Exception embedded null character: Error occurred in Python command: embedded null character The read_u* utility functions now take an offset argument to make them easier to use. Link: http://lkml.kernel.org/r/20191011142500.2339-1-joel.colledge@linbit.com Signed-off-by: Joel Colledge Reviewed-by: Jan Kiszka Cc: Kieran Bingham Cc: Leonard Crestez Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- scripts/gdb/linux/dmesg.py | 16 ++++++++++++---- scripts/gdb/linux/utils.py | 25 +++++++++++++------------ 2 files changed, 25 insertions(+), 16 deletions(-) diff --git a/scripts/gdb/linux/dmesg.py b/scripts/gdb/linux/dmesg.py index 6d2e09a2ad2f..2fa7bb83885f 100644 --- a/scripts/gdb/linux/dmesg.py +++ b/scripts/gdb/linux/dmesg.py @@ -16,6 +16,8 @@ import sys from linux import utils +printk_log_type = utils.CachedType("struct printk_log") + class LxDmesg(gdb.Command): """Print Linux kernel log buffer.""" @@ -42,9 +44,14 @@ class LxDmesg(gdb.Command): b = utils.read_memoryview(inf, log_buf_addr, log_next_idx) log_buf = a.tobytes() + b.tobytes() + length_offset = printk_log_type.get_type()['len'].bitpos // 8 + text_len_offset = printk_log_type.get_type()['text_len'].bitpos // 8 + time_stamp_offset = printk_log_type.get_type()['ts_nsec'].bitpos // 8 + text_offset = printk_log_type.get_type().sizeof + pos = 0 while pos < log_buf.__len__(): - length = utils.read_u16(log_buf[pos + 8:pos + 10]) + length = utils.read_u16(log_buf, pos + length_offset) if length == 0: if log_buf_2nd_half == -1: gdb.write("Corrupted log buffer!\n") @@ -52,10 +59,11 @@ class LxDmesg(gdb.Command): pos = log_buf_2nd_half continue - text_len = utils.read_u16(log_buf[pos + 10:pos + 12]) - text = log_buf[pos + 16:pos + 16 + text_len].decode( + text_len = utils.read_u16(log_buf, pos + text_len_offset) + text_start = pos + text_offset + text = log_buf[text_start:text_start + text_len].decode( encoding='utf8', errors='replace') - time_stamp = utils.read_u64(log_buf[pos:pos + 8]) + time_stamp = utils.read_u64(log_buf, pos + time_stamp_offset) for line in text.splitlines(): msg = u"[{time:12.6f}] {line}\n".format( diff --git a/scripts/gdb/linux/utils.py b/scripts/gdb/linux/utils.py index bc67126118c4..ea94221dbd39 100644 --- a/scripts/gdb/linux/utils.py +++ b/scripts/gdb/linux/utils.py @@ -92,15 +92,16 @@ def read_memoryview(inf, start, length): return memoryview(inf.read_memory(start, length)) -def read_u16(buffer): +def read_u16(buffer, offset): + buffer_val = buffer[offset:offset + 2] value = [0, 0] - if type(buffer[0]) is str: - value[0] = ord(buffer[0]) - value[1] = ord(buffer[1]) + if type(buffer_val[0]) is str: + value[0] = ord(buffer_val[0]) + value[1] = ord(buffer_val[1]) else: - value[0] = buffer[0] - value[1] = buffer[1] + value[0] = buffer_val[0] + value[1] = buffer_val[1] if get_target_endianness() == LITTLE_ENDIAN: return value[0] + (value[1] << 8) @@ -108,18 +109,18 @@ def read_u16(buffer): return value[1] + (value[0] << 8) -def read_u32(buffer): +def read_u32(buffer, offset): if get_target_endianness() == LITTLE_ENDIAN: - return read_u16(buffer[0:2]) + (read_u16(buffer[2:4]) << 16) + return read_u16(buffer, offset) + (read_u16(buffer, offset + 2) << 16) else: - return read_u16(buffer[2:4]) + (read_u16(buffer[0:2]) << 16) + return read_u16(buffer, offset + 2) + (read_u16(buffer, offset) << 16) -def read_u64(buffer): +def read_u64(buffer, offset): if get_target_endianness() == LITTLE_ENDIAN: - return read_u32(buffer[0:4]) + (read_u32(buffer[4:8]) << 32) + return read_u32(buffer, offset) + (read_u32(buffer, offset + 4) << 32) else: - return read_u32(buffer[4:8]) + (read_u32(buffer[0:4]) << 32) + return read_u32(buffer, offset + 4) + (read_u32(buffer, offset) << 32) target_arch = None From a26ee565b6cd8dc2bf15ff6aa70bbb28f928b773 Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Fri, 18 Oct 2019 20:19:29 -0700 Subject: [PATCH 05/26] mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched. For example, when not onlining a memory block that is spanned by a zone and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and CONFIG_PAGE_POISONING, we can trigger a kernel BUG: :/# echo 1 > /sys/devices/system/memory/memory40/online :/# echo 1 > /sys/devices/system/memory/memory42/online :/# cat /proc/pagetypeinfo > test.file page:fffff2c585200000 is uninitialized and poisoned raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) There is not page extension available. ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:1107! invalid opcode: 0000 [#1] SMP NOPTI Please note that this change does not affect ZONE_DEVICE, because pagetypeinfo_showmixedcount_print() is called from mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and ZONE_DEVICE is never populated (zone->present_pages always 0). [david@redhat.com: move check to outer loop, add comment, rephrase description] Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319 Signed-off-by: Qian Cai Signed-off-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Thomas Gleixner Cc: "Peter Zijlstra (Intel)" Cc: Miles Chen Cc: Mike Rapoport Cc: Qian Cai Cc: Greg Kroah-Hartman Cc: [4.13+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_owner.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/page_owner.c b/mm/page_owner.c index e327bcd0380e..18ecde9f45b2 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -271,7 +271,8 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, * not matter as the mixed block count will still be correct */ for (; pfn < end_pfn; ) { - if (!pfn_valid(pfn)) { + page = pfn_to_online_page(pfn); + if (!page) { pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES); continue; } @@ -279,13 +280,13 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m, block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages); block_end_pfn = min(block_end_pfn, end_pfn); - page = pfn_to_page(pfn); pageblock_mt = get_pageblock_migratetype(page); for (; pfn < block_end_pfn; pfn++) { if (!pfn_valid_within(pfn)) continue; + /* The pageblock is online, no need to recheck. */ page = pfn_to_page(pfn); if (page_zone(page) != zone) From 00d6c019b5bc175cee3770e0e659f2b5f4804ea5 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 18 Oct 2019 20:19:33 -0700 Subject: [PATCH 06/26] mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span() We might use the nid of memmaps that were never initialized. For example, if the memmap was poisoned, we will crash the kernel in pfn_to_nid() right now. Let's use the calculated boundaries of the separate zones instead. This now also avoids having to iterate over a whole bunch of subsections again, after shrinking one zone. Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug"), the memmap was initialized to 0 and the node was set to the right value. After that commit, the node might be garbage. We'll have to fix shrink_zone_span() next. Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [d0dc12e86b319] Signed-off-by: David Hildenbrand Reported-by: Aneesh Kumar K.V Cc: Oscar Salvador Cc: David Hildenbrand Cc: Michal Hocko Cc: Pavel Tatashin Cc: Dan Williams Cc: Wei Yang Cc: Alexander Duyck Cc: Alexander Potapenko Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Christophe Leroy Cc: Damian Tometzki Cc: Dave Hansen Cc: Fenghua Yu Cc: Gerald Schaefer Cc: Greg Kroah-Hartman Cc: Halil Pasic Cc: Heiko Carstens Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Ira Weiny Cc: Jason Gunthorpe Cc: Jun Yao Cc: Logan Gunthorpe Cc: Mark Rutland Cc: Masahiro Yamada Cc: "Matthew Wilcox (Oracle)" Cc: Mel Gorman Cc: Michael Ellerman Cc: Mike Rapoport Cc: Pankaj Gupta Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Qian Cai Cc: Rich Felker Cc: Robin Murphy Cc: Steve Capper Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Tony Luck Cc: Vasily Gorbik Cc: Vlastimil Babka Cc: Wei Yang Cc: Will Deacon Cc: Yoshinori Sato Cc: Yu Zhao Cc: [4.13+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 74 ++++++++++----------------------------------- 1 file changed, 16 insertions(+), 58 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b1be791f772d..df570e5c71cc 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -436,67 +436,25 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn, zone_span_writeunlock(zone); } -static void shrink_pgdat_span(struct pglist_data *pgdat, - unsigned long start_pfn, unsigned long end_pfn) +static void update_pgdat_span(struct pglist_data *pgdat) { - unsigned long pgdat_start_pfn = pgdat->node_start_pfn; - unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */ - unsigned long pgdat_end_pfn = p; - unsigned long pfn; - int nid = pgdat->node_id; + unsigned long node_start_pfn = 0, node_end_pfn = 0; + struct zone *zone; - if (pgdat_start_pfn == start_pfn) { - /* - * If the section is smallest section in the pgdat, it need - * shrink pgdat->node_start_pfn and pgdat->node_spanned_pages. - * In this case, we find second smallest valid mem_section - * for shrinking zone. - */ - pfn = find_smallest_section_pfn(nid, NULL, end_pfn, - pgdat_end_pfn); - if (pfn) { - pgdat->node_start_pfn = pfn; - pgdat->node_spanned_pages = pgdat_end_pfn - pfn; - } - } else if (pgdat_end_pfn == end_pfn) { - /* - * If the section is biggest section in the pgdat, it need - * shrink pgdat->node_spanned_pages. - * In this case, we find second biggest valid mem_section for - * shrinking zone. - */ - pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn, - start_pfn); - if (pfn) - pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1; + for (zone = pgdat->node_zones; + zone < pgdat->node_zones + MAX_NR_ZONES; zone++) { + unsigned long zone_end_pfn = zone->zone_start_pfn + + zone->spanned_pages; + + /* No need to lock the zones, they can't change. */ + if (zone_end_pfn > node_end_pfn) + node_end_pfn = zone_end_pfn; + if (zone->zone_start_pfn < node_start_pfn) + node_start_pfn = zone->zone_start_pfn; } - /* - * If the section is not biggest or smallest mem_section in the pgdat, - * it only creates a hole in the pgdat. So in this case, we need not - * change the pgdat. - * But perhaps, the pgdat has only hole data. Thus it check the pgdat - * has only hole or not. - */ - pfn = pgdat_start_pfn; - for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SUBSECTION) { - if (unlikely(!pfn_valid(pfn))) - continue; - - if (pfn_to_nid(pfn) != nid) - continue; - - /* Skip range to be removed */ - if (pfn >= start_pfn && pfn < end_pfn) - continue; - - /* If we find valid section, we have nothing to do */ - return; - } - - /* The pgdat has no valid section */ - pgdat->node_start_pfn = 0; - pgdat->node_spanned_pages = 0; + pgdat->node_start_pfn = node_start_pfn; + pgdat->node_spanned_pages = node_end_pfn - node_start_pfn; } static void __remove_zone(struct zone *zone, unsigned long start_pfn, @@ -507,7 +465,7 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn, pgdat_resize_lock(zone->zone_pgdat, &flags); shrink_zone_span(zone, start_pfn, start_pfn + nr_pages); - shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages); + update_pgdat_span(pgdat); pgdat_resize_unlock(zone->zone_pgdat, &flags); } From 77e080e7680e1e615587352f70c87b9e98126d03 Mon Sep 17 00:00:00 2001 From: "Aneesh Kumar K.V" Date: Fri, 18 Oct 2019 20:19:39 -0700 Subject: [PATCH 07/26] mm/memunmap: don't access uninitialized memmap in memunmap_pages() Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6. This series fixes the access of uninitialized memmaps when shrinking zones/nodes and when removing memory. Also, it contains all fixes for crashes that can be triggered when removing certain namespace using memunmap_pages() - ZONE_DEVICE, reported by Aneesh. We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be more involved (we don't have SECTION_IS_ONLINE as an indicator), and shrinking is only of limited use (set_zone_contiguous() cannot detect the ZONE_DEVICE as contiguous). We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of code to a minimum. Shrinking is especially necessary to keep zone->contiguous set where possible, especially, on memory unplug of DIMMs at zone boundaries. -------------------------------------------------------------------------- Zones are now properly shrunk when offlining memory blocks or when onlining failed. This allows to properly shrink zones on memory unplug even if the separate memory blocks of a DIMM were onlined to different zones or re-onlined to a different zone after offlining. Example: :/# cat /proc/zoneinfo Node 1, zone Movable spanned 0 present 0 managed 0 :/# echo "online_movable" > /sys/devices/system/memory/memory41/state :/# echo "online_movable" > /sys/devices/system/memory/memory43/state :/# cat /proc/zoneinfo Node 1, zone Movable spanned 98304 present 65536 managed 65536 :/# echo 0 > /sys/devices/system/memory/memory43/online :/# cat /proc/zoneinfo Node 1, zone Movable spanned 32768 present 32768 managed 32768 :/# echo 0 > /sys/devices/system/memory/memory41/online :/# cat /proc/zoneinfo Node 1, zone Movable spanned 0 present 0 managed 0 This patch (of 10): With an altmap, the memmap falling into the reserved altmap space are not initialized and, therefore, contain a garbage NID and a garbage zone. Make sure to read the NID/zone from a memmap that was initialized. This fixes a kernel crash that is observed when destroying a namespace: kernel BUG at include/linux/mm.h:1107! cpu 0x1: Vector: 700 (Program Check) at [c000000274087890] pc: c0000000004b9728: memunmap_pages+0x238/0x340 lr: c0000000004b9724: memunmap_pages+0x234/0x340 ... pid = 3669, comm = ndctl kernel BUG at include/linux/mm.h:1107! devm_action_release+0x30/0x50 release_nodes+0x268/0x2d0 device_release_driver_internal+0x174/0x240 unbind_store+0x13c/0x190 drv_attr_store+0x44/0x60 sysfs_kf_write+0x70/0xa0 kernfs_fop_write+0x1ac/0x290 __vfs_write+0x3c/0x70 vfs_write+0xe4/0x200 ksys_write+0x7c/0x140 system_call+0x5c/0x68 The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f4833 ("mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I think we will never have driver reserved memory with MEMORY_DEVICE_PRIVATE (no altmap AFAIKS). [david@redhat.com: minimze code changes, rephrase description] Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com Fixes: 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to arch_remove_memory") Signed-off-by: Aneesh Kumar K.V Signed-off-by: David Hildenbrand Cc: Dan Williams Cc: Jason Gunthorpe Cc: Logan Gunthorpe Cc: Ira Weiny Cc: Damian Tometzki Cc: Alexander Duyck Cc: Alexander Potapenko Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Christophe Leroy Cc: Dave Hansen Cc: Fenghua Yu Cc: Gerald Schaefer Cc: Greg Kroah-Hartman Cc: Halil Pasic Cc: Heiko Carstens Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jun Yao Cc: Mark Rutland Cc: Masahiro Yamada Cc: "Matthew Wilcox (Oracle)" Cc: Mel Gorman Cc: Michael Ellerman Cc: Michal Hocko Cc: Mike Rapoport Cc: Oscar Salvador Cc: Pankaj Gupta Cc: Paul Mackerras Cc: Pavel Tatashin Cc: Pavel Tatashin Cc: Peter Zijlstra Cc: Qian Cai Cc: Rich Felker Cc: Robin Murphy Cc: Steve Capper Cc: Thomas Gleixner Cc: Tom Lendacky Cc: Tony Luck Cc: Vasily Gorbik Cc: Vlastimil Babka Cc: Wei Yang Cc: Wei Yang Cc: Will Deacon Cc: Yoshinori Sato Cc: Yu Zhao Cc: [5.0+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memremap.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/mm/memremap.c b/mm/memremap.c index 68204912cc0a..03ccbdfeb697 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -103,6 +103,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap) void memunmap_pages(struct dev_pagemap *pgmap) { struct resource *res = &pgmap->res; + struct page *first_page; unsigned long pfn; int nid; @@ -111,14 +112,16 @@ void memunmap_pages(struct dev_pagemap *pgmap) put_page(pfn_to_page(pfn)); dev_pagemap_cleanup(pgmap); + /* make sure to access a memmap that was actually initialized */ + first_page = pfn_to_page(pfn_first(pgmap)); + /* pages are dead and unused, undo the arch mapping */ - nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start))); + nid = page_to_nid(first_page); mem_hotplug_begin(); if (pgmap->type == MEMORY_DEVICE_PRIVATE) { - pfn = PHYS_PFN(res->start); - __remove_pages(page_zone(pfn_to_page(pfn)), pfn, - PHYS_PFN(resource_size(res)), NULL); + __remove_pages(page_zone(first_page), PHYS_PFN(res->start), + PHYS_PFN(resource_size(res)), NULL); } else { arch_remove_memory(nid, res->start, resource_size(res), pgmap_altmap(pgmap)); From b749ecfaf6c53ce79d6ab66afd2fc34189a073b1 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Fri, 18 Oct 2019 20:19:44 -0700 Subject: [PATCH 08/26] mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Karsten reported the following panic in __free_slab() happening on a s390x machine: Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0000000000000000 TEID: 0000000000000483 Fault in home space mode while using kernel ASCE. AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14 Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8 0000000000000000 000000009e3291c1 0000000000000000 0000000000000000 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600 0000000000000003 0000000000000008 000000002b478b00 000003d080a97600 000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78 Krnl Code: 00000000003cada6: e310a1400004 lg %r1,320(%r10) 00000000003cadac: c0e50046c286 brasl %r14,ca32b8 #00000000003cadb2: a7f4fe36 brc 15,3caa1e >00000000003cadb6: e32060800024 stg %r2,128(%r6) 00000000003cadbc: a7f4fd9e brc 15,3ca8f8 00000000003cadc0: c0e50046790c brasl %r14,c99fd8 00000000003cadc6: a7f4fe2c brc 15,3caa 00000000003cadc6: a7f4fe2c brc 15,3caa1e 00000000003cadca: ecb1ffff00d9 aghik %r11,%r1,-1 Call Trace: (<00000000003cabcc> __free_slab+0x49c/0x6b0) <00000000001f5886> rcu_core+0x5a6/0x7e0 <0000000000ca2dea> __do_softirq+0xf2/0x5c0 <0000000000152644> irq_exit+0x104/0x130 <000000000010d222> do_IRQ+0x9a/0xf0 <0000000000ca2344> ext_int_handler+0x130/0x134 <0000000000103648> enabled_wait+0x58/0x128 (<0000000000103634> enabled_wait+0x44/0x128) <0000000000103b00> arch_cpu_idle+0x40/0x58 <0000000000ca0544> default_idle_call+0x3c/0x68 <000000000018eaa4> do_idle+0xec/0x1c0 <000000000018ee0e> cpu_startup_entry+0x36/0x40 <000000000122df34> arch_call_rest_init+0x5c/0x88 <0000000000000000> 0x0 INFO: lockdep is turned off. Last Breaking-Event-Address: <00000000003ca8f4> __free_slab+0x1c4/0x6b0 Kernel panic - not syncing: Fatal exception in interrupt The kernel panics on an attempt to dereference the NULL memcg pointer. When shutdown_cache() is called from the kmem_cache_destroy() context, a memcg kmem_cache might have empty slab pages in a partial list, which are still charged to the memory cgroup. These pages are released by free_partial() at the beginning of shutdown_cache(): either directly or by scheduling a RCU-delayed work (if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag). The latter case is when the reported panic can happen: memcg_unlink_cache() is called immediately after shrinking partial lists, without waiting for scheduled RCU works. It sets the kmem_cache->memcg_params.memcg pointer to NULL, and the following attempt to dereference it by __free_slab() from the RCU work context causes the panic. To fix the issue, let's postpone the release of the memcg pointer to destroy_memcg_params(). It's called from a separate work context by slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier. This guarantees that all scheduled page release RCU works will complete before the memcg pointer will be zeroed. Big thanks for Karsten for the perfect report containing all necessary information, his help with the analysis of the problem and testing of the fix. Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com Fixes: fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal") Signed-off-by: Roman Gushchin Reported-by: Karsten Graul Tested-by: Karsten Graul Acked-by: Vlastimil Babka Reviewed-by: Shakeel Butt Cc: Karsten Graul Cc: Vladimir Davydov Cc: David Rientjes Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slab_common.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index c29f03adca91..f9fb27b4c843 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -178,10 +178,13 @@ static int init_memcg_params(struct kmem_cache *s, static void destroy_memcg_params(struct kmem_cache *s) { - if (is_root_cache(s)) + if (is_root_cache(s)) { kvfree(rcu_access_pointer(s->memcg_params.memcg_caches)); - else + } else { + mem_cgroup_put(s->memcg_params.memcg); + WRITE_ONCE(s->memcg_params.memcg, NULL); percpu_ref_exit(&s->memcg_params.refcnt); + } } static void free_memcg_params(struct rcu_head *rcu) @@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct kmem_cache *s) } else { list_del(&s->memcg_params.children_node); list_del(&s->memcg_params.kmem_caches_node); - mem_cgroup_put(s->memcg_params.memcg); - WRITE_ONCE(s->memcg_params.memcg, NULL); } } #else From ce750f43f5790de74c1644c39d78f684071658d1 Mon Sep 17 00:00:00 2001 From: Chengguang Xu Date: Fri, 18 Oct 2019 20:19:47 -0700 Subject: [PATCH 09/26] ocfs2: fix error handling in ocfs2_setattr() Should set transfer_to[USRQUOTA/GRPQUOTA] to NULL on error case before jumping to do dqput(). Link: http://lkml.kernel.org/r/20191010082349.1134-1-cgxu519@mykernel.net Signed-off-by: Chengguang Xu Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Gang He Cc: Jun Piao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ocfs2/file.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 2e982db3e1ae..53939bf9d7d2 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -1230,6 +1230,7 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr) transfer_to[USRQUOTA] = dqget(sb, make_kqid_uid(attr->ia_uid)); if (IS_ERR(transfer_to[USRQUOTA])) { status = PTR_ERR(transfer_to[USRQUOTA]); + transfer_to[USRQUOTA] = NULL; goto bail_unlock; } } @@ -1239,6 +1240,7 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr) transfer_to[GRPQUOTA] = dqget(sb, make_kqid_gid(attr->ia_gid)); if (IS_ERR(transfer_to[GRPQUOTA])) { status = PTR_ERR(transfer_to[GRPQUOTA]); + transfer_to[GRPQUOTA] = NULL; goto bail_unlock; } } From 6f24c8d30d08f270b54f4c2cb9b08dfccbe59c57 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Fri, 18 Oct 2019 20:19:50 -0700 Subject: [PATCH 10/26] mm/gup_benchmark: add a missing "w" to getopt string Even though gup_benchmark.c has code to handle the -w command-line option, the "w" is not part of the getopt string. It looks as if it has been missing the whole time. On my machine, this leads naturally to the following predictable result: $ sudo ./gup_benchmark -w ./gup_benchmark: invalid option -- 'w' ...which is fixed with this commit. Link: http://lkml.kernel.org/r/20191014184639.1512873-2-jhubbard@nvidia.com Signed-off-by: John Hubbard Acked-by: Kirill A. Shutemov Cc: Keith Busch Cc: Shuah Khan Cc: Christoph Hellwig Cc: "Aneesh Kumar K . V" Cc: Ira Weiny Cc: Christoph Hellwig Cc: kbuild test robot Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/gup_benchmark.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c index c0534e298b51..cb3fc09645c4 100644 --- a/tools/testing/selftests/vm/gup_benchmark.c +++ b/tools/testing/selftests/vm/gup_benchmark.c @@ -37,7 +37,7 @@ int main(int argc, char **argv) char *file = "/dev/zero"; char *p; - while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) { + while ((opt = getopt(argc, argv, "m:r:n:f:tTLUwSH")) != -1) { switch (opt) { case 'm': size = atoi(optarg) * MB; From 0cd22afdcea21fa16bb5b0d3e0508ca42072d0bd Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Fri, 18 Oct 2019 20:19:53 -0700 Subject: [PATCH 11/26] mm/gup: fix a misnamed "write" argument, and a related bug In several routines, the "flags" argument is incorrectly named "write". Change it to "flags". Also, in one place, the misnaming led to an actual bug: "flags & FOLL_WRITE" is required, rather than just "flags". (That problem was flagged by krobot, in v1 of this patch.) Also, change the flags argument from int, to unsigned int. You can see that this was a simple oversight, because the calling code passes "flags" to the fifth argument: gup_pgd_range(): ... if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr, PGDIR_SHIFT, next, flags, pages, nr)) ...which, until this patch, the callees referred to as "write". Also, change two lines to avoid checkpatch line length complaints, and another line to fix another oversight that checkpatch called out: missing "int" on pdshift. Link: http://lkml.kernel.org/r/20191014184639.1512873-3-jhubbard@nvidia.com Fixes: b798bec4741b ("mm/gup: change write parameter to flags in fast walk") Signed-off-by: John Hubbard Reported-by: kbuild test robot Suggested-by: Kirill A. Shutemov Suggested-by: Ira Weiny Acked-by: Kirill A. Shutemov Reviewed-by: Ira Weiny Cc: Christoph Hellwig Cc: Aneesh Kumar K.V Cc: Keith Busch Cc: Shuah Khan Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/gup.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 23a9f9c9d377..8f236a335ae9 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1973,7 +1973,8 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, } static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, - unsigned long end, int write, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { unsigned long pte_end; struct page *head, *page; @@ -1986,7 +1987,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, pte = READ_ONCE(*ptep); - if (!pte_access_permitted(pte, write)) + if (!pte_access_permitted(pte, flags & FOLL_WRITE)) return 0; /* hugepages are never "special" */ @@ -2023,7 +2024,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, } static int gup_huge_pd(hugepd_t hugepd, unsigned long addr, - unsigned int pdshift, unsigned long end, int write, + unsigned int pdshift, unsigned long end, unsigned int flags, struct page **pages, int *nr) { pte_t *ptep; @@ -2033,7 +2034,7 @@ static int gup_huge_pd(hugepd_t hugepd, unsigned long addr, ptep = hugepte_offset(hugepd, addr, pdshift); do { next = hugepte_addr_end(addr, end, sz); - if (!gup_hugepte(ptep, sz, addr, end, write, pages, nr)) + if (!gup_hugepte(ptep, sz, addr, end, flags, pages, nr)) return 0; } while (ptep++, addr = next, addr != end); @@ -2041,7 +2042,7 @@ static int gup_huge_pd(hugepd_t hugepd, unsigned long addr, } #else static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr, - unsigned pdshift, unsigned long end, int write, + unsigned int pdshift, unsigned long end, unsigned int flags, struct page **pages, int *nr) { return 0; @@ -2049,7 +2050,8 @@ static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr, #endif /* CONFIG_ARCH_HAS_HUGEPD */ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, - unsigned long end, unsigned int flags, struct page **pages, int *nr) + unsigned long end, unsigned int flags, + struct page **pages, int *nr) { struct page *head, *page; int refs; From b11edebbc967ebf5c55b8f9e1d5bb6d68ec3a7fd Mon Sep 17 00:00:00 2001 From: Honglei Wang Date: Fri, 18 Oct 2019 20:19:58 -0700 Subject: [PATCH 12/26] mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size Commit 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()") has made lruvec_page_state to use per-cpu counters instead of calculating it directly from lru_zone_size with an idea that this would be more effective. Tim has reported that this is not really the case for their database benchmark which is showing an opposite results where lruvec_page_state is taking up a huge chunk of CPU cycles (about 25% of the system time which is roughly 7% of total cpu cycles) on 5.3 kernels. The workload is running on a larger machine (96cpus), it has many cgroups (500) and it is heavily direct reclaim bound. Tim Chen said: : The problem can also be reproduced by running simple multi-threaded : pmbench benchmark with a fast Optane SSD swap (see profile below). : : : 6.15% 3.08% pmbench [kernel.vmlinux] [k] lruvec_lru_size : | : |--3.07%--lruvec_lru_size : | | : | |--2.11%--cpumask_next : | | | : | | --1.66%--find_next_bit : | | : | --0.57%--call_function_interrupt : | | : | --0.55%--smp_call_function_interrupt : | : |--1.59%--0x441f0fc3d009 : | _ops_rdtsc_init_base_freq : | access_histogram : | page_fault : | __do_page_fault : | handle_mm_fault : | __handle_mm_fault : | | : | --1.54%--do_swap_page : | swapin_readahead : | swap_cluster_readahead : | | : | --1.53%--read_swap_cache_async : | __read_swap_cache_async : | alloc_pages_vma : | __alloc_pages_nodemask : | __alloc_pages_slowpath : | try_to_free_pages : | do_try_to_free_pages : | shrink_node : | shrink_node_memcg : | | : | |--0.77%--lruvec_lru_size : | | : | --0.76%--inactive_list_is_low : | | : | --0.76%--lruvec_lru_size : | : --1.50%--measure_read : page_fault : __do_page_fault : handle_mm_fault : __handle_mm_fault : do_swap_page : swapin_readahead : swap_cluster_readahead : | : --1.48%--read_swap_cache_async : __read_swap_cache_async : alloc_pages_vma : __alloc_pages_nodemask : __alloc_pages_slowpath : try_to_free_pages : do_try_to_free_pages : shrink_node : shrink_node_memcg : | : |--0.75%--inactive_list_is_low : | | : | --0.75%--lruvec_lru_size : | : --0.73%--lruvec_lru_size The likely culprit is the cache traffic the lruvec_page_state_local generates. Dave Hansen says: : I was thinking purely of the cache footprint. If it's reading : pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192 : bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would : be 18k of data for the whole system and the caching would be pretty : efficient and all 18k would probably survive a tight page fault loop in : the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't : fit in the L1 and probably wouldn't survive a tight page fault loop if : both logical threads were banging on different cgroups. : : It's just a theory, but it's why I noted the number of cgroups when I : initially saw this show up in profiles Fix the regression by partially reverting the said commit and calculate the lru size explicitly. Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()") Signed-off-by: Honglei Wang Reported-by: Tim Chen Acked-by: Tim Chen Tested-by: Tim Chen Acked-by: Michal Hocko Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Roman Gushchin Cc: Tejun Heo Cc: Dave Hansen Cc: [5.2+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index c6659bb758a4..024b7e929752 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(struct zone *zone) */ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx) { - unsigned long lru_size; + unsigned long lru_size = 0; int zid; - if (!mem_cgroup_disabled()) - lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru); - else + if (!mem_cgroup_disabled()) { + for (zid = 0; zid < MAX_NR_ZONES; zid++) + lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid); + } else lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru); for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) { From f3057ad767542be7bbac44e548cb44017178a163 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Fri, 18 Oct 2019 20:20:01 -0700 Subject: [PATCH 13/26] mm: memblock: do not enforce current limit for memblock_phys* family Until commit 92d12f9544b7 ("memblock: refactor internal allocation functions") the maximal address for memblock allocations was forced to memblock.current_limit only for the allocation functions returning virtual address. The changes introduced by that commit moved the limit enforcement into the allocation core and as a result the allocation functions returning physical address also started to limit allocations to memblock.current_limit. This caused breakage of etnaviv GPU driver: etnaviv etnaviv: bound 130000.gpu (ops gpu_ops) etnaviv etnaviv: bound 134000.gpu (ops gpu_ops) etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops) etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108 etnaviv-gpu 130000.gpu: command buffer outside valid memory window etnaviv-gpu 134000.gpu: model: GC320, revision: 5007 etnaviv-gpu 134000.gpu: command buffer outside valid memory window etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215 etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0 Restore the behaviour of memblock_phys* family so that these functions will not enforce memblock.current_limit. Link: http://lkml.kernel.org/r/1570915861-17633-1-git-send-email-rppt@kernel.org Fixes: 92d12f9544b7 ("memblock: refactor internal allocation functions") Signed-off-by: Mike Rapoport Reported-by: Adam Ford Tested-by: Adam Ford [imx6q-logicpd] Cc: Catalin Marinas Cc: Christoph Hellwig Cc: Fabio Estevam Cc: Lucas Stach Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memblock.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/memblock.c b/mm/memblock.c index 7d4f61ae666a..c4b16cae2bc9 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1356,9 +1356,6 @@ static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, align = SMP_CACHE_BYTES; } - if (end > memblock.current_limit) - end = memblock.current_limit; - again: found = memblock_find_in_range_node(size, align, start, end, nid, flags); @@ -1469,6 +1466,9 @@ static void * __init memblock_alloc_internal( if (WARN_ON_ONCE(slab_is_available())) return kzalloc_node(size, GFP_NOWAIT, nid); + if (max_addr > memblock.current_limit) + max_addr = memblock.current_limit; + alloc = memblock_alloc_range_nid(size, align, min_addr, max_addr, nid); /* retry allocation without lower limit */ From f231fe4235e22e18d847e05cbe705deaca56580a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 18 Oct 2019 20:20:05 -0700 Subject: [PATCH 14/26] hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched. Let's make sure that we only consider online memory (managed by the buddy) that has initialized memmaps. ZONE_DEVICE is not applicable. page_zone() will call page_to_nid(), which will trigger VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This can be the case when an offline memory block (e.g., never onlined) is spanned by a zone. Note: As explained by Michal in [1], alloc_contig_range() will verify the range. So it boils down to the wrong access in this function. [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand Reported-by: Michal Hocko Acked-by: Michal Hocko Reviewed-by: Mike Kravetz Cc: Anshuman Khandual Cc: [4.13+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ef37c85423a5..b45a95363a84 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1084,11 +1084,10 @@ static bool pfn_range_valid_gigantic(struct zone *z, struct page *page; for (i = start_pfn; i < end_pfn; i++) { - if (!pfn_valid(i)) + page = pfn_to_online_page(i); + if (!page) return false; - page = pfn_to_page(i); - if (page_zone(page) != z) return false; From b918c43021baaa3648de09e19a4a3dd555a45f40 Mon Sep 17 00:00:00 2001 From: Yi Li Date: Fri, 18 Oct 2019 20:20:08 -0700 Subject: [PATCH 15/26] ocfs2: fix panic due to ocfs2_wq is null mount.ocfs2 failed when reading ocfs2 filesystem superblock encounters an error. ocfs2_initialize_super() returns before allocating ocfs2_wq. ocfs2_dismount_volume() triggers the following panic. Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted. Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30 ------------[ cut here ]------------ Oops: 0002 [#1] SMP NOPTI CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G E 4.14.148-200.ckv.x86_64 #1 Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018 task: ffff967af0520000 task.stack: ffffa5f05484000 RIP: 0010:mutex_lock+0x19/0x20 Call Trace: flush_workqueue+0x81/0x460 ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2] ocfs2_dismount_volume+0x84/0x400 [ocfs2] ocfs2_fill_super+0xa4/0x1270 [ocfs2] ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2] mount_bdev+0x17f/0x1c0 mount_fs+0x3a/0x160 Link: http://lkml.kernel.org/r/1571139611-24107-1-git-send-email-yili@winhong.com Signed-off-by: Yi Li Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Gang He Cc: Jun Piao Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ocfs2/journal.c | 3 ++- fs/ocfs2/localalloc.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index 930e3d388579..699a560efbb0 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -217,7 +217,8 @@ void ocfs2_recovery_exit(struct ocfs2_super *osb) /* At this point, we know that no more recovery threads can be * launched, so wait for any recovery completion work to * complete. */ - flush_workqueue(osb->ocfs2_wq); + if (osb->ocfs2_wq) + flush_workqueue(osb->ocfs2_wq); /* * Now that recovery is shut down, and the osb is about to be diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 158e5af767fd..720e9f94957e 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -377,7 +377,8 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb) struct ocfs2_dinode *alloc = NULL; cancel_delayed_work(&osb->la_enable_wq); - flush_workqueue(osb->ocfs2_wq); + if (osb->ocfs2_wq) + flush_workqueue(osb->ocfs2_wq); if (osb->local_alloc_state == OCFS2_LA_UNUSED) goto out; From ae8af4388db002bbd1df78ecee7ca31cee78e964 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Fri, 18 Oct 2019 20:20:11 -0700 Subject: [PATCH 16/26] mm/memcontrol: update lruvec counters in mem_cgroup_move_account Mapped, dirty and writeback pages are also counted in per-lruvec stats. These counters needs update when page is moved between cgroups. Currently is nobody *consuming* the lruvec versions of these counters and that there is no user-visible effect. Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure") Signed-off-by: Konstantin Khlebnikov Acked-by: Johannes Weiner Acked-by: Michal Hocko Cc: Vladimir Davydov Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index bdac56009a38..363106578876 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5420,6 +5420,8 @@ static int mem_cgroup_move_account(struct page *page, struct mem_cgroup *from, struct mem_cgroup *to) { + struct lruvec *from_vec, *to_vec; + struct pglist_data *pgdat; unsigned long flags; unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1; int ret; @@ -5443,11 +5445,15 @@ static int mem_cgroup_move_account(struct page *page, anon = PageAnon(page); + pgdat = page_pgdat(page); + from_vec = mem_cgroup_lruvec(pgdat, from); + to_vec = mem_cgroup_lruvec(pgdat, to); + spin_lock_irqsave(&from->move_lock, flags); if (!anon && page_mapped(page)) { - __mod_memcg_state(from, NR_FILE_MAPPED, -nr_pages); - __mod_memcg_state(to, NR_FILE_MAPPED, nr_pages); + __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); + __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); } /* @@ -5459,14 +5465,14 @@ static int mem_cgroup_move_account(struct page *page, struct address_space *mapping = page_mapping(page); if (mapping_cap_account_dirty(mapping)) { - __mod_memcg_state(from, NR_FILE_DIRTY, -nr_pages); - __mod_memcg_state(to, NR_FILE_DIRTY, nr_pages); + __mod_lruvec_state(from_vec, NR_FILE_DIRTY, -nr_pages); + __mod_lruvec_state(to_vec, NR_FILE_DIRTY, nr_pages); } } if (PageWriteback(page)) { - __mod_memcg_state(from, NR_WRITEBACK, -nr_pages); - __mod_memcg_state(to, NR_WRITEBACK, nr_pages); + __mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages); + __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE From f7daefe4231e57381d92c2e2ad905a899c28e402 Mon Sep 17 00:00:00 2001 From: Chenwandun Date: Fri, 18 Oct 2019 20:20:14 -0700 Subject: [PATCH 17/26] zram: fix race between backing_dev_show and backing_dev_store CPU0: CPU1: backing_dev_show backing_dev_store ...... ...... file = zram->backing_dev; down_read(&zram->init_lock); down_read(&zram->init_init_lock) file_path(file, ...); zram->backing_dev = backing_dev; up_read(&zram->init_lock); up_read(&zram->init_lock); gets the value of zram->backing_dev too early in backing_dev_show, which resultin the value being NULL at the beginning, and not NULL later. backtrace: d_path+0xcc/0x174 file_path+0x10/0x18 backing_dev_show+0x40/0xb4 dev_attr_show+0x20/0x54 sysfs_kf_seq_show+0x9c/0x10c kernfs_seq_show+0x28/0x30 seq_read+0x184/0x488 kernfs_fop_read+0x5c/0x1a4 __vfs_read+0x44/0x128 vfs_read+0xa0/0x138 SyS_read+0x54/0xb4 Link: http://lkml.kernel.org/r/1571046839-16814-1-git-send-email-chenwandun@huawei.com Signed-off-by: Chenwandun Acked-by: Minchan Kim Cc: Sergey Senozhatsky Cc: Jens Axboe Cc: [4.14+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index d58a359a6622..4285e75e52c3 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -413,13 +413,14 @@ static void reset_bdev(struct zram *zram) static ssize_t backing_dev_show(struct device *dev, struct device_attribute *attr, char *buf) { + struct file *file; struct zram *zram = dev_to_zram(dev); - struct file *file = zram->backing_dev; char *p; ssize_t ret; down_read(&zram->init_lock); - if (!zram->backing_dev) { + file = zram->backing_dev; + if (!file) { memcpy(buf, "none\n", 5); up_read(&zram->init_lock); return 5; From 444f84fd2ac7bae36f3dd3ce1d39d11211c2c72a Mon Sep 17 00:00:00 2001 From: Ben Dooks Date: Fri, 18 Oct 2019 20:20:17 -0700 Subject: [PATCH 18/26] mm: include for is_vma_temporary_stack Include for the definition of is_vma_temporary_stack to fix the following sparse warning: mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks Reviewed-by: Qian Cai Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/rmap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/rmap.c b/mm/rmap.c index d9a23bb773bf..0c7b2a9400d4 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -61,6 +61,7 @@ #include #include #include +#include #include #include #include From d0e6a5821cdf08eea91d8597dce1ad695ebeb8bc Mon Sep 17 00:00:00 2001 From: Ben Dooks Date: Fri, 18 Oct 2019 20:20:20 -0700 Subject: [PATCH 19/26] mm/filemap.c: include for generic_file_vm_ops definition The generic_file_vm_ops is defined in so include it to fix the following warning: mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20191008102311.25432-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/filemap.c b/mm/filemap.c index 1146fcfa3215..85b7d087eb45 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -40,6 +40,7 @@ #include #include #include +#include #include "internal.h" #define CREATE_TRACE_POINTS From a2ae8c0551050cbf3232fbb3d963a9bbe7b57268 Mon Sep 17 00:00:00 2001 From: "Ben Dooks (Codethink)" Date: Fri, 18 Oct 2019 20:20:24 -0700 Subject: [PATCH 20/26] mm/init-mm.c: include for vm_committed_as_batch mm_init.c needs to include for the definition of vm_committed_as_batch. Fixes the following sparse warning: mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20191016091509.26708-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/init-mm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/init-mm.c b/mm/init-mm.c index fb1e15028ef0..19603302a77f 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -5,6 +5,7 @@ #include #include #include +#include #include #include From 2be5fbf9a91c417cd178a481538baddda3e2f38c Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 18 Oct 2019 20:20:27 -0700 Subject: [PATCH 21/26] proc/meminfo: fix output alignment Patch series "Fixes for THP in page cache", v2. This patch (of 5): Add extra space for FileHugePages and FilePmdMapped, so the output is aligned with other rows. Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP") Signed-off-by: Kirill A. Shutemov Signed-off-by: Song Liu Tested-by: Song Liu Acked-by: Yang Shi Cc: Matthew Wilcox Cc: Oleg Nesterov Cc: Srikar Dronamraju Cc: William Kucharski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/meminfo.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index ac9247371871..8c1f1bb1a5ce 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -132,9 +132,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v) global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR); show_val_kb(m, "ShmemPmdMapped: ", global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR); - show_val_kb(m, "FileHugePages: ", + show_val_kb(m, "FileHugePages: ", global_node_page_state(NR_FILE_THPS) * HPAGE_PMD_NR); - show_val_kb(m, "FilePmdMapped: ", + show_val_kb(m, "FilePmdMapped: ", global_node_page_state(NR_FILE_PMDMAPPED) * HPAGE_PMD_NR); #endif From 06d3eff62d9dc6f21e7ebeb14399f2542a36cdf5 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 18 Oct 2019 20:20:30 -0700 Subject: [PATCH 22/26] mm/thp: fix node page state in split_huge_page_to_list() Make sure split_huge_page_to_list() handles the state of shmem THP and file THP properly. Link: http://lkml.kernel.org/r/20191017164223.2762148-3-songliubraving@fb.com Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP") Signed-off-by: Kirill A. Shutemov Signed-off-by: Song Liu Tested-by: Song Liu Acked-by: Yang Shi Cc: Matthew Wilcox (Oracle) Cc: Oleg Nesterov Cc: Srikar Dronamraju Cc: William Kucharski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c5cb6dcd6c69..13cc93785006 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2789,8 +2789,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) ds_queue->split_queue_len--; list_del(page_deferred_list(head)); } - if (mapping) - __dec_node_page_state(page, NR_SHMEM_THPS); + if (mapping) { + if (PageSwapBacked(page)) + __dec_node_page_state(page, NR_SHMEM_THPS); + else + __dec_node_page_state(page, NR_FILE_THPS); + } + spin_unlock(&ds_queue->split_queue_lock); __split_huge_page(page, list, end, flags); if (PageSwapCache(head)) { From 906d278d75e364f2bb85dc1e1ff6265ea46e7e43 Mon Sep 17 00:00:00 2001 From: William Kucharski Date: Fri, 18 Oct 2019 20:20:33 -0700 Subject: [PATCH 23/26] mm/vmscan.c: support removing arbitrary sized pages from mapping __remove_mapping() assumes that pages can only be either base pages or HPAGE_PMD_SIZE. Ask the page what size it is. Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: William Kucharski Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Song Liu Acked-by: Yang Shi Cc: "Kirill A. Shutemov" Cc: Oleg Nesterov Cc: Srikar Dronamraju Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 024b7e929752..ee4eecc7e1c2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -933,10 +933,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, * Note that if SetPageDirty is always performed via set_page_dirty, * and thus under the i_pages lock, then this ordering is not required. */ - if (unlikely(PageTransHuge(page)) && PageSwapCache(page)) - refcount = 1 + HPAGE_PMD_NR; - else - refcount = 2; + refcount = 1 + compound_nr(page); if (!page_ref_freeze(page, refcount)) goto cannot_free; /* note: atomic_cmpxchg in page_ref_freeze provides the smp_rmb */ From ef18a1ca847b01cb7296f11a728cb2f5ef671760 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 18 Oct 2019 20:20:36 -0700 Subject: [PATCH 24/26] mm/thp: allow dropping THP from page cache Once a THP is added to the page cache, it cannot be dropped via /proc/sys/vm/drop_caches. Fix this issue with proper handling in invalidate_mapping_pages(). Link: http://lkml.kernel.org/r/20191017164223.2762148-5-songliubraving@fb.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Kirill A. Shutemov Signed-off-by: Song Liu Tested-by: Song Liu Acked-by: Yang Shi Cc: Matthew Wilcox (Oracle) Cc: Oleg Nesterov Cc: Srikar Dronamraju Cc: William Kucharski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/truncate.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/mm/truncate.c b/mm/truncate.c index 8563339041f6..dd9ebc1da356 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -592,6 +592,16 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping, unlock_page(page); continue; } + + /* Take a pin outside pagevec */ + get_page(page); + + /* + * Drop extra pins before trying to invalidate + * the huge page. + */ + pagevec_remove_exceptionals(&pvec); + pagevec_release(&pvec); } ret = invalidate_inode_page(page); @@ -602,6 +612,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping, */ if (!ret) deactivate_file_page(page); + if (PageTransHuge(page)) + put_page(page); count += ret; } pagevec_remove_exceptionals(&pvec); From aa5de305c90c995321df452f274fe75b52bd8fcc Mon Sep 17 00:00:00 2001 From: Song Liu Date: Fri, 18 Oct 2019 20:20:40 -0700 Subject: [PATCH 25/26] kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register Attaching uprobe to text section in THP splits the PMD mapped page table into PTE mapped entries. On uprobe detach, we would like to regroup PMD mapped page table entry to regain performance benefit of THP. However, the regroup is broken For perf_event based trace_uprobe. This is because perf_event based trace_uprobe calls uprobe_unregister twice on close: first in TRACE_REG_PERF_CLOSE, then in TRACE_REG_PERF_UNREGISTER. The second call will split the PMD mapped page table entry, which is not the desired behavior. Fix this by only use FOLL_SPLIT_PMD for uprobe register case. Add a WARN() to confirm uprobe unregister never work on huge pages, and abort the operation when this WARN() triggers. Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT") Signed-off-by: Song Liu Reviewed-by: Srikar Dronamraju Cc: Kirill A. Shutemov Cc: Oleg Nesterov Cc: Matthew Wilcox (Oracle) Cc: William Kucharski Cc: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/events/uprobes.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 94d38a39d72e..c74761004ee5 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -474,14 +474,17 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, struct vm_area_struct *vma; int ret, is_register, ref_ctr_updated = 0; bool orig_page_huge = false; + unsigned int gup_flags = FOLL_FORCE; is_register = is_swbp_insn(&opcode); uprobe = container_of(auprobe, struct uprobe, arch); retry: + if (is_register) + gup_flags |= FOLL_SPLIT_PMD; /* Read the page with vaddr into memory */ - ret = get_user_pages_remote(NULL, mm, vaddr, 1, - FOLL_FORCE | FOLL_SPLIT_PMD, &old_page, &vma, NULL); + ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags, + &old_page, &vma, NULL); if (ret <= 0) return ret; @@ -489,6 +492,12 @@ retry: if (ret <= 0) goto put_old; + if (WARN(!is_register && PageCompound(old_page), + "uprobe unregister should never work on compound page\n")) { + ret = -EINVAL; + goto put_old; + } + /* We are going to replace instruction, update ref_ctr. */ if (!ref_ctr_updated && uprobe->ref_ctr_offset) { ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1); From 585d730d41120926e3f79a601edad3930fa28366 Mon Sep 17 00:00:00 2001 From: Ilya Leoshkevich Date: Fri, 18 Oct 2019 20:20:43 -0700 Subject: [PATCH 26/26] scripts/gdb: fix debugging modules on s390 Currently lx-symbols assumes that module text is always located at module->core_layout->base, but s390 uses the following layout: +------+ <- module->core_layout->base | GOT | +------+ <- module->core_layout->base + module->arch->plt_offset | PLT | +------+ <- module->core_layout->base + module->arch->plt_offset + | TEXT | module->arch->plt_size +------+ Therefore, when trying to debug modules on s390, all the symbol addresses are skewed by plt_offset + plt_size. Fix by adding plt_offset + plt_size to module_addr in load_module_symbols(). Link: http://lkml.kernel.org/r/20191017085917.81791-1-iii@linux.ibm.com Signed-off-by: Ilya Leoshkevich Reviewed-by: Jan Kiszka Cc: Kieran Bingham Cc: Heiko Carstens Cc: Vasily Gorbik Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- scripts/gdb/linux/symbols.py | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py index 34e40e96dee2..7b7c2fafbc68 100644 --- a/scripts/gdb/linux/symbols.py +++ b/scripts/gdb/linux/symbols.py @@ -15,7 +15,7 @@ import gdb import os import re -from linux import modules +from linux import modules, utils if hasattr(gdb, 'Breakpoint'): @@ -116,6 +116,12 @@ lx-symbols command.""" module_file = self._get_module_file(module_name) if module_file: + if utils.is_target_arch('s390'): + # Module text is preceded by PLT stubs on s390. + module_arch = module['arch'] + plt_offset = int(module_arch['plt_offset']) + plt_size = int(module_arch['plt_size']) + module_addr = hex(int(module_addr, 0) + plt_offset + plt_size) gdb.write("loading @{addr}: {filename}\n".format( addr=module_addr, filename=module_file)) cmdline = "add-symbol-file {filename} {addr}{sections}".format(