1
0
Fork 0
alistair23-linux/mm
David Hildenbrand 2c91f8fc6c mm/memory_hotplug: fix try_offline_node()
try_offline_node() is pretty much broken right now:

 - The node span is updated when onlining memory, not when adding it. We
   ignore memory that was mever onlined. Bad.

 - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
   trigger a kernel panic. Bad for memory that is offline but also bad
   for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
   first PFN of a section might contain garbage.

 - Sections belonging to mixed nodes are not properly considered.

As memory blocks might belong to multiple nodes, we would have to walk
all pageblocks (or at least subsections) within present sections.
However, we don't have a way to identify whether a memmap that is not
online was initialized (relevant for ZONE_DEVICE).  This makes things
more complicated.

Luckily, we can piggy pack on the node span and the nid stored in memory
blocks.  Currently, the node span is grown when calling
move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
removing memory, before calling try_offline_node().  Sysfs links are
created via link_mem_sections(), e.g., during boot or when adding
memory.

If the node still spans memory or if any memory block belongs to the
nid, we don't set the node offline.  As memory blocks that span multiple
nodes cannot get offlined, the nid stored in memory blocks is reliable
enough (for such online memory blocks, the node still spans the memory).

Introduce for_each_memory_block() to efficiently walk all memory blocks.

Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
when removing ZONE_DEVICE memory to fix similar issues (access of
garbage memmaps) - until we have a reliable way to identify whether
these memmaps were properly initialized.  This implies later, that once
a node had ZONE_DEVICE memory, we won't be able to set a node offline -
which should be acceptable.

Since commit f1dd2cd13c ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") memory that is added is not
assoziated with a zone/node (memmap not initialized).  The introducing
commit 60a5a19e74 ("memory-hotplug: remove sysfs file of node")
already missed that we could have multiple nodes for a section and that
the zone/node span is updated when onlining pages, not when adding them.

I tested this by hotplugging two DIMMs to a memory-less and cpu-less
NUMA node.  The node is properly onlined when adding the DIMMs.  When
removing the DIMMs, the node is properly offlined.

Masayoshi Mizuma reported:

: Without this patch, memory hotplug fails as panic:
:
:  BUG: kernel NULL pointer dereference, address: 0000000000000000
:  ...
:  Call Trace:
:   remove_memory_block_devices+0x81/0xc0
:   try_remove_memory+0xb4/0x130
:   __remove_memory+0xa/0x20
:   acpi_memory_device_remove+0x84/0x100
:   acpi_bus_trim+0x57/0x90
:   acpi_bus_trim+0x2e/0x90
:   acpi_device_hotplug+0x2b2/0x4d0
:   acpi_hotplug_work_fn+0x1a/0x30
:   process_one_work+0x171/0x380
:   worker_thread+0x49/0x3f0
:   kthread+0xf8/0x130
:   ret_from_fork+0x35/0x40

[david@redhat.com: v3]
  Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
Fixes: 60a5a19e74 ("memory-hotplug: remove sysfs file of node")
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visiable after d0dc12e86b
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15 18:34:00 -08:00
..
kasan mm: introduce compound_nr() 2019-09-24 15:54:08 -07:00
Kconfig mm,thp: add read-only THP support for (non-shmem) FS 2019-09-24 15:54:11 -07:00
Kconfig.debug mm, page_owner, debug_pagealloc: save and dump freeing stack trace 2019-09-24 15:54:08 -07:00
Makefile mm: silence -Woverride-init/initializer-overrides 2019-09-24 15:54:10 -07:00
backing-dev.c bdi: Do not use freezable workqueue 2019-10-06 09:11:35 -06:00
balloon_compaction.c mm/balloon_compaction: suppress allocation warnings 2019-09-04 07:42:01 -04:00
cleancache.c Driver Core and debugfs changes for 5.3-rc1 2019-07-12 12:24:03 -07:00
cma.c mm/cma.c: fail if fixed declaration can't be honored 2019-07-16 19:23:21 -07:00
cma.h
cma_debug.c mm/cma_debug.c: fix the break condition in cma_maxchunk_get() 2019-05-14 09:47:45 -07:00
compaction.c mm, compaction: fix wrong pfn handling in __reset_isolation_pfn() 2019-10-14 15:04:01 -07:00
debug.c mm: update references to page _refcount 2019-05-14 19:52:47 -07:00
debug_page_ref.c
dmapool.c mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options 2019-07-12 11:05:46 -07:00
early_ioremap.c
fadvise.c fs: Export generic_fadvise() 2019-08-30 22:43:58 -07:00
failslab.c mm/failslab.c: by default, do not fail allocations with direct reclaim only 2019-07-12 11:05:43 -07:00
filemap.c mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition 2019-10-19 06:32:32 -04:00
frame_vector.c mm: untag user pointers in get_vaddr_frames 2019-09-25 17:51:41 -07:00
frontswap.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482 2019-06-19 17:09:52 +02:00
gup.c mm/gup: fix a misnamed "write" argument, and a related bug 2019-10-19 06:32:32 -04:00
gup_benchmark.c mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM 2019-05-14 09:47:45 -07:00
highmem.c
hmm.c pagewalk: separate function pointers from iterator data 2019-09-07 04:28:04 -03:00
huge_memory.c mm/thp: fix node page state in split_huge_page_to_list() 2019-10-19 06:32:32 -04:00
hugetlb.c hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() 2019-10-19 06:32:32 -04:00
hugetlb_cgroup.c mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() 2019-11-15 18:34:00 -08:00
hwpoison-inject.c hwpoison-inject: no need to check return value of debugfs_create functions 2019-06-03 15:39:40 +02:00
init-mm.c mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch 2019-10-19 06:32:32 -04:00
internal.h mm: introduce MADV_COLD 2019-09-25 17:51:41 -07:00
interval_tree.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248 2019-06-19 17:09:08 +02:00
khugepaged.c mm,thp: recheck each page before collapsing file THP 2019-11-15 18:34:00 -08:00
kmemleak-test.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333 2019-06-05 17:37:06 +02:00
kmemleak.c kmemleak: Do not corrupt the object_list during clean-up 2019-10-14 08:56:16 -07:00
ksm.c mm: move memcmp_pages() and pages_identical() 2019-09-24 15:54:11 -07:00
list_lru.c mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages 2019-07-12 11:05:44 -07:00
maccess.c The main changes in this release include: 2019-07-18 11:51:00 -07:00
madvise.c mm: fix trying to reclaim unevictable lru page when calling madvise_pageout 2019-11-15 18:33:59 -08:00
memblock.c mm: memblock: do not enforce current limit for memblock_phys* family 2019-10-19 06:32:32 -04:00
memcontrol.c mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() 2019-11-15 18:34:00 -08:00
memfd.c mm: page cache: store only head pages in i_pages 2019-09-24 15:54:08 -07:00
memory-failure.c mm/memory-failure.c: don't access uninitialized memmaps in memory_failure() 2019-10-19 06:32:31 -04:00
memory.c mm: do not hash address in print_bad_pte() 2019-09-24 15:54:09 -07:00
memory_hotplug.c mm/memory_hotplug: fix try_offline_node() 2019-11-15 18:34:00 -08:00
mempolicy.c mm: mempolicy: fix the wrong return value and potential pages leak of mbind 2019-11-15 18:33:59 -08:00
mempool.c
memremap.c mm/memunmap: don't access uninitialized memmap in memunmap_pages() 2019-10-19 06:32:32 -04:00
memtest.c
migrate.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
mincore.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
mlock.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
mm_init.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
mmap.c mm: untag user pointers in mmap/munmap/mremap/brk 2019-09-25 17:51:41 -07:00
mmu_context.c
mmu_gather.c mm: remove quicklist page table caches 2019-09-24 15:54:09 -07:00
mmu_notifier.c mm/mmu_notifiers: use the right return code for WARN_ON 2019-11-06 08:47:50 -08:00
mmzone.c
mprotect.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
mremap.c mm: untag user pointers in mmap/munmap/mremap/brk 2019-09-25 17:51:41 -07:00
msync.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
nommu.c mm: introduce page_size() 2019-09-24 15:54:08 -07:00
oom_kill.c mm: introduce MADV_COLD 2019-09-25 17:51:41 -07:00
page-writeback.c writeback, memcg: Implement foreign dirty flushing 2019-08-27 09:22:38 -06:00
page_alloc.c mm/page_alloc.c: ratelimit allocation failure warnings more aggressively 2019-11-06 08:47:50 -08:00
page_counter.c
page_ext.c mm, page_owner: fix off-by-one error in __set_page_owner_handle() 2019-10-14 15:04:00 -07:00
page_idle.c mm/page_idle.c: fix oops because end_pfn is larger than max_pfn 2019-06-29 16:43:45 +08:00
page_io.c mm, swap: use rbtree for swap_extent 2019-07-12 11:05:43 -07:00
page_isolation.c mm/page_isolation.c: change the prototype of undo_isolate_page_range() 2019-07-12 11:05:43 -07:00
page_owner.c mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo 2019-10-19 06:32:31 -04:00
page_poison.c mm/page_poison.c: fix a typo in a comment 2019-09-24 15:54:08 -07:00
page_vma_mapped.c mm: introduce page_size() 2019-09-24 15:54:08 -07:00
pagewalk.c pagewalk: use lockdep_assert_held for locking validation 2019-09-07 04:28:04 -03:00
percpu-internal.h
percpu-km.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu-stats.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu-vm.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu.c percpu: Use struct_size() helper 2019-09-04 13:40:49 -07:00
pgtable-generic.c
process_vm_access.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
readahead.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
rmap.c mm: include <linux/huge_mm.h> for is_vma_temporary_stack 2019-10-19 06:32:32 -04:00
rodata_test.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441 2019-06-05 17:37:17 +02:00
shmem.c Merge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-10-10 08:16:44 -07:00
shuffle.c mm: fix -Wmissing-prototypes warnings 2019-10-07 15:47:19 -07:00
shuffle.h mm: maintain randomization of page free lists 2019-05-14 19:52:48 -07:00
slab.c mm/slab.c: fix kernel-doc warning for __ksize() 2019-10-14 15:04:01 -07:00
slab.h mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly 2019-11-06 08:47:50 -08:00
slab_common.c mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release 2019-10-19 06:32:32 -04:00
slob.c mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) 2019-10-07 15:47:20 -07:00
slub.c mm: slub: really fix slab walking for init_on_free 2019-11-15 18:34:00 -08:00
sparse-vmemmap.c mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap() 2019-07-18 17:08:07 -07:00
sparse.c mm: fix -Wmissing-prototypes warnings 2019-10-07 15:47:19 -07:00
swap.c mm: introduce MADV_COLD 2019-09-25 17:51:41 -07:00
swap_cgroup.c
swap_slots.c
swap_state.c mm: page cache: store only head pages in i_pages 2019-09-24 15:54:08 -07:00
swapfile.c vfs: don't allow writes to swap files 2019-08-20 07:55:16 -07:00
truncate.c mm/thp: allow dropping THP from page cache 2019-10-19 06:32:33 -04:00
usercopy.c usercopy: Avoid HIGHMEM pfn warning 2019-09-17 15:20:17 -07:00
userfaultfd.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 2019-06-19 17:09:53 +02:00
util.c arm64, mm: make randomization selected by generic topdown mmap layout 2019-09-24 15:54:11 -07:00
vmacache.c
vmalloc.c augmented rbtree: add new RB_DECLARE_CALLBACKS_MAX macro 2019-09-25 17:51:39 -07:00
vmpressure.c mm/vmpressure.c: fix a signedness bug in vmpressure_register_event() 2019-10-07 15:47:19 -07:00
vmscan.c mm/vmscan.c: support removing arbitrary sized pages from mapping 2019-10-19 06:32:32 -04:00
vmstat.c mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo 2019-11-06 08:47:50 -08:00
workingset.c mm: workingset: fix vmstat counters for shadow nodes 2019-08-13 16:06:52 -07:00
z3fold.c mm/z3fold.c: claim page in the beginning of free 2019-10-07 15:47:19 -07:00
zbud.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
zpool.c zpool: add malloc_support_movable to zpool_driver 2019-09-24 15:54:12 -07:00
zsmalloc.c mm/zsmalloc.c: fix a -Wunused-function warning 2019-09-24 15:54:12 -07:00
zswap.c zswap: do not map same object twice 2019-09-24 15:54:12 -07:00