
3123 Commits (redonkable)

Author SHA1 Message Date
Dan Williams 10dcb79ec7 x86/mm: Fix leak of pmd ptlock
commit d1c5246e08 upstream.

Commit

  28ee90fe60 ("x86/mm: implement free pmd/pte page interfaces")

introduced a new location where a pmd was released, but neglected to
run the pmd page destructor. In fact, this happened previously for a
different pmd release path and was fixed by commit:

  c283610e44 ("x86, mm: do not leak page->ptl for pmd page tables").

This issue was hidden until recently because the failure mode is silent,
but commit:

  b2b29d6d01 ("mm: account PMD tables like PTE tables")

turns the failure mode into this signature:

 BUG: Bad page state in process lt-pmem-ns  pfn:15943d
 page:000000007262ed7b refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x15943d
 flags: 0xaffff800000000()
 raw: 00affff800000000 dead000000000100 0000000000000000 0000000000000000
 raw: 0000000000000000 ffff913a029bcc08 00000000fffffbff 0000000000000000
 page dumped because: nonzero mapcount
 [..]
  dump_stack+0x8b/0xb0
  bad_page.cold+0x63/0x94
  free_pcp_prepare+0x224/0x270
  free_unref_page+0x18/0xd0
  pud_free_pmd_page+0x146/0x160
  ioremap_pud_range+0xe3/0x350
  ioremap_page_range+0x108/0x160
  __ioremap_caller.constprop.0+0x174/0x2b0
  ? memremap+0x7a/0x110
  memremap+0x7a/0x110
  devm_memremap+0x53/0xa0
  pmem_attach_disk+0x4ed/0x530 [nd_pmem]
  ? __devm_release_region+0x52/0x80
  nvdimm_bus_probe+0x85/0x210 [libnvdimm]

Given this is a repeat occurrence, it seemed prudent to look for other
places where this destructor might be missing and whether a better
helper is needed. try_to_free_pmd_page() looks like a candidate, but
testing by setting up and tearing down pmd mappings via the dax unit
tests has so far not triggered the failure.

As for a better helper, pmd_free() is close, but it is a messy fit
due to requiring an @mm arg. Also, ___pmd_free_tlb() wants to call
paravirt_tlb_remove_table() instead of free_page(), so an open-coded
pgtable_pmd_page_dtor() seems the best way forward for now.
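
For reference, a minimal sketch of the shape of the fix in pud_free_pmd_page()
(illustrative kernel-context code; the exact upstream hunk may differ):

        pmd_t *pmd = (pmd_t *)pud_page_vaddr(*pud);

        /* ... once all PTE pages under this PMD have been released ... */
        pgtable_pmd_page_dtor(virt_to_page(pmd));      /* run the pmd page destructor */
        free_page((unsigned long)pmd);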

Debugged together with Matthew Wilcox <willy@infradead.org>.

Fixes: 28ee90fe60 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/160697689204.605323.17629854984697045602.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:16:22 +01:00
Arvind Sankar 0a72e7286c x86/mm/ident_map: Check for errors from ident_pud_init()
[ Upstream commit 1fcd009102 ]

Commit

  ea3b5e60ce ("x86/mm/ident_map: Add 5-level paging support")

added ident_p4d_init() to support 5-level paging, but this function
doesn't check and return errors from ident_pud_init().

For example, the decompressor stub uses this code to create an identity
mapping. If it runs out of pages while trying to allocate a PMD
pagetable, the error is currently ignored.

Fix this to propagate errors.
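
A sketch of the propagation inside ident_p4d_init() (illustrative; the
surrounding loop is omitted):

        result = ident_pud_init(info, pud, addr, next);
        if (result)
                return result;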

 [ bp: Space out statements for better readability. ]

Fixes: ea3b5e60ce ("x86/mm/ident_map: Add 5-level paging support")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20201027230648.1885111-1-nivedita@alum.mit.edu
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:51:02 +01:00
Andy Lutomirski e3c1d51868 x86/membarrier: Get rid of a dubious optimization
commit a493d1ca1a upstream.

sync_core_before_usermode() had an incorrect optimization.  If the kernel
returns from an interrupt, it can get to usermode without IRET. It just has
to schedule to a different task in the same mm and do SYSRET.  Fortunately,
there were no callers of sync_core_before_usermode() that could have had
in_irq() or in_nmi() equal to true, because it's only ever called from the
scheduler.

While at it, clarify a related comment.
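
The removed optimization was roughly of this shape (a sketch reconstructed
from the description above; exact code may differ):

        /*
         * WRONG: an interrupt can return to a different task in the same mm
         * via SYSRET after scheduling, so in_irq()/in_nmi() does not imply
         * that usermode will be reached through IRET.
         */
        if (in_irq() || in_nmi())
                return;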

Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/5afc7632be1422f91eaf7611aaaa1b5b8580a086.1607058304.git.luto@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-16 10:56:59 +01:00
Arvind Sankar 6346ed69bc x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
commit 29ac40cbed upstream.

The PAT bit is in different locations for 4k and 2M/1G page table
entries.

Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
index for write-protected pages.
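
A sketch of the corrected definitions (names as described above; treat the
exact expansion as illustrative of the large-page PAT encoding):

        #define _PAGE_LARGE_CACHE_MASK  (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT_LARGE)

        #define PMD_FLAGS_DEC_WP        ((PMD_FLAGS_DEC & ~_PAGE_LARGE_CACHE_MASK) | \
                                         (_PAGE_PAT_LARGE | _PAGE_PWT))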

Fixes: 6ebcb06071 ("x86/mm: Add support to encrypt the kernel in-place")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201111160946.147341-1-nivedita@alum.mit.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-16 10:56:58 +01:00
Huang Ying 920d9ffcd4 x86, fakenuma: Fix invalid starting node ID
[ Upstream commit ccae0f36d5 ]

Commit:

  cc9aec03e5 ("x86/numa_emulation: Introduce uniform split capability")

uses "-1" as the starting node ID, which causes the strange kernel log as
follows, when "numa=fake=32G" is added to the kernel command line:

    Faking node -1 at [mem 0x0000000000000000-0x0000000893ffffff] (35136MB)
    Faking node 0 at [mem 0x0000001840000000-0x000000203fffffff] (32768MB)
    Faking node 1 at [mem 0x0000000894000000-0x000000183fffffff] (64192MB)
    Faking node 2 at [mem 0x0000002040000000-0x000000283fffffff] (32768MB)
    Faking node 3 at [mem 0x0000002840000000-0x000000303fffffff] (32768MB)

And finally the kernel crashes:

    BUG: Bad page state in process swapper  pfn:00011
    page:(____ptrval____) refcount:0 mapcount:1 mapping:(____ptrval____) index:0x55cd7e44b270 pfn:0x11
    failed to read mapping contents, not a valid kernel address?
    flags: 0x5(locked|uptodate)
    raw: 0000000000000005 000055cd7e44af30 000055cd7e44af50 0000000100000006
    raw: 000055cd7e44b270 000055cd7e44b290 0000000000000000 000055cd7e44b510
    page dumped because: page still charged to cgroup
    page->mem_cgroup:000055cd7e44b510
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-rc2 #1
    Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
    Call Trace:
     dump_stack+0x57/0x80
     bad_page.cold+0x63/0x94
     __free_pages_ok+0x33f/0x360
     memblock_free_all+0x127/0x195
     mem_init+0x23/0x1f5
     start_kernel+0x219/0x4f5
     secondary_startup_64+0xb6/0xc0

Fix this bug by using 0 as the starting node ID.  This restores the
original behavior before cc9aec03e5.

[ mingo: Massaged the changelog. ]

Fixes: cc9aec03e5 ("x86/numa_emulation: Introduce uniform split capability")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200904061047.612950-1-ying.huang@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-09 19:12:28 +02:00
Arvind Sankar 87ef5086a3 x86/mm: Stop printing BRK addresses
[ Upstream commit 67d631b7c0 ]

This currently leaks kernel physical addresses into userspace.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/20200229231120.1147527-1-nivedita@alum.mit.edu
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-22 09:31:08 +02:00
Nathan Chancellor 92c09e88c1 x86/mmiotrace: Use cpumask_available() for cpumask_var_t variables
[ Upstream commit d7110a26e5 ]

When building with Clang + -Wtautological-compare and
CONFIG_CPUMASK_OFFSTACK unset:

  arch/x86/mm/mmio-mod.c:375:6: warning: comparison of array 'downed_cpus'
  equal to a null pointer is always false [-Wtautological-pointer-compare]
          if (downed_cpus == NULL &&
              ^~~~~~~~~~~    ~~~~
  arch/x86/mm/mmio-mod.c:405:6: warning: comparison of array 'downed_cpus'
  equal to a null pointer is always false [-Wtautological-pointer-compare]
          if (downed_cpus == NULL || cpumask_weight(downed_cpus) == 0)
              ^~~~~~~~~~~    ~~~~
  2 warnings generated.

Commit

  f7e30f01a9 ("cpumask: Add helper cpumask_available()")

added cpumask_available() to fix warnings of this nature. Use that here
so that clang does not warn regardless of CONFIG_CPUMASK_OFFSTACK's
value.
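
The shape of the change, sketched from the warnings above:

        /* Before: always false when cpumask_var_t is a plain array */
        if (downed_cpus == NULL || cpumask_weight(downed_cpus) == 0)

        /* After: correct regardless of CONFIG_CPUMASK_OFFSTACK */
        if (!cpumask_available(downed_cpus) || cpumask_weight(downed_cpus) == 0)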

Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/982
Link: https://lkml.kernel.org/r/20200408205323.44490-1-natechancellor@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-07 13:18:51 +02:00
Rick Edgecombe f82a301322 x86/mm/cpa: Flush direct map alias during cpa
[ Upstream commit ab5130186d ]

As an optimization, cpa_flush() was changed to optionally only flush
the range in @cpa if it was small enough.  However, this range does
not include any direct map aliases changed in cpa_process_alias(). So
small set_memory_() calls that touch that alias don't get the direct
map changes flushed. This situation can happen when the virtual
address taking variants are passed an address in vmalloc or modules
space.

In these cases, force a full TLB flush.

Note this issue does not extend to cases where the set_memory_() calls are
passed a direct map address, or page array, etc, as the primary target. In
those cases the direct map would be flushed.
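
One way to express this, as a hedged sketch (the flag name and the exact
condition in cpa_flush() are illustrative, not necessarily the upstream ones):

        /* In cpa_process_alias(), when a direct map alias is changed: */
        cpa->force_flush_all = 1;

        /* In cpa_flush(): never take the small-range shortcut in that case */
        if (cpa->force_flush_all || cpa->numpages > tlb_single_page_flush_ceiling)
                flush_tlb_all();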

Fixes: 935f583982 ("x86/mm/cpa: Optimize cpa_flush_array() TLB invalidation")
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200424105343.GA20730@hirez.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27 17:46:32 +02:00
Borislav Petkov 94a4104bf1 x86/ioremap: Fix CONFIG_EFI=n build
commit 870b4333a6 upstream.

In order to use efi_mem_type(), one needs CONFIG_EFI enabled. Otherwise
that function is undefined. Use IS_ENABLED() to check and avoid the
ifdeffery as the compiler optimizes away the following unreachable code
then.
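
The pattern, sketched (the surrounding helper and its arguments come from the
earlier SEV/ioremap change; treat the details as illustrative):

        /* Replaces an #ifdef CONFIG_EFI block around the EFI memory type check: */
        if (!IS_ENABLED(CONFIG_EFI))
                return;

        /* Dead code when CONFIG_EFI=n, so no reference to efi_mem_type() survives */
        if (efi_mem_type(addr) == EFI_RUNTIME_SERVICES_DATA)
                desc->flags |= IORES_MAP_ENCRYPTED;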

Fixes: 985e537a40 ("x86/ioremap: Map EFI runtime services data as encrypted for SEV")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/7561e981-0d9b-d62c-0ef2-ce6007aff1ab@infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-01 11:01:59 +02:00
Joerg Roedel 66f28e1105 x86/mm: split vmalloc_sync_all()
commit 763802b53a upstream.

Commit 3f8fd02b1b ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
the vunmap() code-path.  While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.

Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.

To avoid the unnecessary work on x86-64 and to gain the performance
back, split up vmalloc_sync_all() into two functions:

	* vmalloc_sync_mappings(), and
	* vmalloc_sync_unmappings()

Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized.  The only exception is the new call-site added in the
above mentioned commit.
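
On x86-64 the result of the split looks roughly like this (a sketch; the
32-bit implementations keep doing the same page-table walk for both):

        void vmalloc_sync_mappings(void)
        {
                /* New p4d/pud pages must be propagated to every PGD. */
                sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END);
        }

        void vmalloc_sync_unmappings(void)
        {
                /* Unmapping never frees p4d/pud pages on x86-64: nothing to do. */
        }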

Shile Zhang directed us to a report of an 80% regression in reaim
throughput.

Fixes: 3f8fd02b1b ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Borislav Petkov <bp@suse.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25 08:25:58 +01:00
Tom Lendacky 680613f441 x86/ioremap: Map EFI runtime services data as encrypted for SEV
commit 985e537a40 upstream.

The dmidecode program fails to properly decode the SMBIOS data supplied
by OVMF/UEFI when running in an SEV guest. The SMBIOS area, under SEV, is
encrypted and resides in reserved memory that is marked as EFI runtime
services data.

As a result, when memremap() is attempted for the SMBIOS data, it
can't be mapped as regular RAM (through try_ram_remap()) and, since
the address isn't part of the iomem resources list, it isn't mapped
encrypted through the fallback ioremap().

Add a new __ioremap_check_other() to deal with memory types like
EFI_RUNTIME_SERVICES_DATA which are not covered by the resource ranges.

This allows any runtime services data which has been created encrypted
to be mapped encrypted too.

 [ bp: Move functionality to a separate function. ]
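
A sketch of that separate function, assuming the SEV and EFI checks described
above (the exact structure may differ):

        static void __ioremap_check_other(resource_size_t addr,
                                          struct ioremap_desc *desc)
        {
                if (!sev_active())
                        return;

                /* EFI runtime services data is created encrypted under SEV */
                if (efi_mem_type(addr) == EFI_RUNTIME_SERVICES_DATA)
                        desc->flags |= IORES_MAP_ENCRYPTED;
        }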

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: <stable@vger.kernel.org> # 5.3
Link: https://lkml.kernel.org/r/2d9e16eb5b53dc82665c95c6764b7407719df7a0.1582645327.git.thomas.lendacky@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-18 07:17:53 +01:00
Ard Biesheuvel e7e6b53fea x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
[ Upstream commit 75fbef0a8b ]

The following commit:

  15f003d207 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()")

modified kernel_map_pages_in_pgd() to manage writable permissions
of memory mappings in the EFI page table in a different way, but
in the process, it removed the ability to clear NX attributes from
read-only mappings, by clobbering the clear mask if _PAGE_RW is not
being requested.

Failure to remove the NX attribute from read-only mappings is
unlikely to be a security issue, but it does prevent us from
tightening the permissions in the EFI page tables going forward,
so let's fix it now.
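
The shape of the problem and of a fix, sketched (the local variable is
illustrative):

        unsigned long clr = 0;

        /* Accumulate the bits to clear instead of overwriting the mask: */
        if (!(page_flags & _PAGE_NX))
                clr |= _PAGE_NX;
        if (!(page_flags & _PAGE_RW))
                clr |= _PAGE_RW;

        cpa.mask_clr = __pgprot(clr);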

Fixes: 15f003d207 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200113172245.27925-5-ardb@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24 08:36:53 +01:00
David Hildenbrand e84c5b7617 mm/memory_hotplug: shrink zones when offlining memory
commit feee6b2989 upstream.

We currently try to shrink a single zone when removing memory.  We use
the zone of the first page of the memory we are removing.  If that
memmap was never initialized (e.g., memory was never onlined), we will
read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     __remove_pages+0x4b/0x640
     arch_remove_memory+0x63/0x8d
     try_remove_memory+0xdb/0x130
     __remove_memory+0xa/0x11
     acpi_memory_device_remove+0x70/0x100
     acpi_bus_trim+0x55/0x90
     acpi_device_hotplug+0x227/0x3a0
     acpi_hotplug_work_fn+0x1a/0x30
     process_one_work+0x221/0x550
     worker_thread+0x50/0x3b0
     kthread+0x105/0x140
     ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone() for that.  We now
properly shrink the zones, even if we have DIMMs whereby

 - Some memory blocks fall into no zone (never onlined)

 - Some memory blocks fall into multiple zones (offlined+re-onlined)

 - Multiple memory blocks that fall into different zones

Drop the zone parameter (with a potential dubious value) from
__remove_pages() and __remove_section().

Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: <stable@vger.kernel.org>	[5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09 10:19:56 +01:00
Sami Tolvanen a823d762a5 x86/mm: Use the correct function type for native_set_fixmap()
[ Upstream commit f53e2cd0b8 ]

We call native_set_fixmap indirectly through the function pointer
struct pv_mmu_ops::set_fixmap, which expects the first parameter to be
'unsigned' instead of 'enum fixed_addresses'. This patch changes the
function type for native_set_fixmap to match the pointer, which fixes
indirect call mismatches with Control-Flow Integrity (CFI) checking.
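
Sketch of the signature change (the enum remains documented inside the
declaration as a comment):

        /* Before: mismatches the pv_mmu_ops::set_fixmap pointer type under CFI */
        void native_set_fixmap(enum fixed_addresses idx,
                               phys_addr_t phys, pgprot_t flags);

        /* After: matches the function pointer it is assigned to */
        void native_set_fixmap(unsigned /* enum fixed_addresses */ idx,
                               phys_addr_t phys, pgprot_t flags);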

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190913211402.193018-1-samitolvanen@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31 16:44:04 +01:00
Joerg Roedel 89398a7fc4 x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all()
commit 9a62d20027 upstream.

The job of vmalloc_sync_all() is to help the lazy freeing of vmalloc()
ranges: before such vmap ranges are reused we make sure that they are
unmapped from every task's page tables.

This is really easy on pagetable setups where the kernel page tables
are shared between all tasks - this is the case on 32-bit kernels
with SHARED_KERNEL_PMD = 1.

But on !SHARED_KERNEL_PMD 32-bit kernels this involves iterating
over the pgd_list and clearing all pmd entries in the pgds that
are cleared in the init_mm.pgd, which is the reference pagetable
that the vmalloc() code uses.

In that context the current practice of vmalloc_sync_all() iterating
until FIXADDR_TOP is buggy:

        for (address = VMALLOC_START & PMD_MASK;
             address >= TASK_SIZE_MAX && address < FIXADDR_TOP;
             address += PMD_SIZE) {
                struct page *page;

Because iterating up to FIXADDR_TOP will involve a lot of non-vmalloc
address ranges:

	VMALLOC -> PKMAP -> LDT -> CPU_ENTRY_AREA -> FIX_ADDR

This is mostly harmless for the FIX_ADDR and CPU_ENTRY_AREA ranges
that don't clear their pmds, but it's lethal for the LDT range,
which relies on having different mappings in different processes,
and 'synchronizing' them in the vmalloc sense corrupts those
pagetable entries (clearing them).

This got particularly prominent with PTI, which turns SHARED_KERNEL_PMD
off and makes this the dominant mapping mode on 32-bit.

To make LDT working again vmalloc_sync_all() must only iterate over
the volatile parts of the kernel address range that are identical
between all processes.

So the correct check in vmalloc_sync_all() is "address < VMALLOC_END"
to make sure the VMALLOC areas are synchronized and the LDT
mapping is not falsely overwritten.
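
With that check the loop quoted above becomes (sketch):

        for (address = VMALLOC_START & PMD_MASK;
             address >= TASK_SIZE_MAX && address < VMALLOC_END;
             address += PMD_SIZE) {
                struct page *page;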

The CPU_ENTRY_AREA and the FIXMAP area are no longer synced either,
but this is not really a problem since their PMDs get established
during bootup and never change.

This change fixes the ldt_gdt selftest in my setup.

[ mingo: Fixed up the changelog to explain the logic and modified the
         copying to only happen up until VMALLOC_END. ]

Reported-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Fixes: 7757d607c6b3 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
Link: https://lkml.kernel.org/r/20191126111119.GA110513@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-13 08:42:46 +01:00
Ingo Molnar cd923d2b57 x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise
commit 05b042a194 upstream.

When two recent commits that increased the size of the 'struct cpu_entry_area'
were merged in -tip, the 32-bit defconfig build started failing on the following
build time assert:

  ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE
  arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’
  In function ‘setup_cpu_entry_area_ptes’,

Which corresponds to the following build time assert:

	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);

The purpose of this assert is to sanity check the fixed-value definition of
CPU_ENTRY_AREA_PAGES in arch/x86/include/asm/pgtable_32_types.h:

	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 41)

The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, which value
we didn't want to define in such a low level header, because it would cause
dependency hell.

Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES
accordingly - and this assert is checking that constraint.

But the assert is both imprecise and buggy, primarily because it doesn't
include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE
(which begins at a PMD boundary).

This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined
too large upstream (v5.4-rc8):

	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 40)

While 'struct cpu_entry_area' is 155648 bytes, or 38 pages. So we had two extra
pages, which hid the bug.

The following commit (not yet upstream) increased the size to 40 pages:

  x86/iopl: ("Restrict iopl() permission scope")

... but increased CPU_ENTRY_AREA_PAGES only to 41 - i.e. shortening the gap
to just 1 extra page.

Then another not-yet-upstream commit changed the size again:

  880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")

Which increased the cpu_entry_area size from 38 to 39 pages, but
didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked
fine, because we still had a page left from the accidental 'reserve'.

But when these two commits were merged into the same tree, the
combined size of cpu_entry_area grew from 38 to 40 pages, while
CPU_ENTRY_AREA_PAGES finally caught up to 40 as well.

Which is fine in terms of functionality, but the assert broke:

	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);

because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area,
which is 1 page larger due to the IDT page.

To fix all this, change the assert to two precise asserts:

	BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE);
	BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE);

This takes the IDT page into account, and also connects the size-based
define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based
define of CPU_ENTRY_AREA_MAP_SIZE.
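
In header form this is roughly (a sketch; the per-cpu size macro name may
differ):

	#define CPU_ENTRY_AREA_ARRAY_SIZE	(CPU_ENTRY_AREA_SIZE * NR_CPUS)

	/* Total mapping size: the per-cpu array plus the readonly IDT page */
	#define CPU_ENTRY_AREA_TOTAL_SIZE	(CPU_ENTRY_AREA_ARRAY_SIZE + PAGE_SIZE)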

Also clean up some of the names which made it rather confusing:

 - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of
   the cpu-entry-area, but the per-cpu array size, so rename this
   to CPU_ENTRY_AREA_ARRAY_SIZE.

 - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping
   size, with the IDT included.

 - Add comments where '+1' denotes the IDT mapping - it wasn't
   obvious and took me about 3 hours to decode...

Finally, because this particular commit is actually applied after
this patch:

  880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")

Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages.

All future commits that change cpu_entry_area will have to adjust
this value precisely.

As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES
and derive its value directly from the structure, without causing
header hell - but that is an adventure for another day! :-)

Fixes: 880a98c33996: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-29 10:10:00 +01:00
Linus Torvalds aefcf2f4b5 Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull kernel lockdown mode from James Morris:
 "This is the latest iteration of the kernel lockdown patchset, from
  Matthew Garrett, David Howells and others.

  From the original description:

    This patchset introduces an optional kernel lockdown feature,
    intended to strengthen the boundary between UID 0 and the kernel.
    When enabled, various pieces of kernel functionality are restricted.
    Applications that rely on low-level access to either hardware or the
    kernel may cease working as a result - therefore this should not be
    enabled without appropriate evaluation beforehand.

    The majority of mainstream distributions have been carrying variants
    of this patchset for many years now, so there's value in providing an
    upstream implementation. It doesn't meet every distribution
    requirement, but it gets us much closer to not requiring external
    patches.

  There are two major changes since this was last proposed for mainline:

   - Separating lockdown from EFI secure boot. Background discussion is
     covered here: https://lwn.net/Articles/751061/

   -  Implementation as an LSM, with a default stackable lockdown LSM
      module. This allows the lockdown feature to be policy-driven,
      rather than encoding an implicit policy within the mechanism.

  The new locked_down LSM hook is provided to allow LSMs to make a
  policy decision around whether kernel functionality that would allow
  tampering with or examining the runtime state of the kernel should be
  permitted.

  The included lockdown LSM provides an implementation with a simple
  policy intended for general purpose use. This policy provides a coarse
  level of granularity, controllable via the kernel command line:

    lockdown={integrity|confidentiality}

  Enable the kernel lockdown feature. If set to integrity, kernel features
  that allow userland to modify the running kernel are disabled. If set to
  confidentiality, kernel features that allow userland to extract
  confidential information from the kernel are also disabled.

  This may also be controlled via /sys/kernel/security/lockdown and
  overridden by kernel configuration.

  New or existing LSMs may implement finer-grained controls of the
  lockdown features. Refer to the lockdown_reason documentation in
  include/linux/security.h for details.

  The lockdown feature has had significant design feedback and review
  across many subsystems. This code has been in linux-next for some
  weeks, with a few fixes applied along the way.

  Stephen Rothwell noted that commit 9d1f8be5cf ("bpf: Restrict bpf
  when kernel lockdown is in confidentiality mode") is missing a
  Signed-off-by from its author. Matthew responded that he is providing
  this under category (c) of the DCO"

* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
  kexec: Fix file verification on S390
  security: constify some arrays in lockdown LSM
  lockdown: Print current->comm in restriction messages
  efi: Restrict efivar_ssdt_load when the kernel is locked down
  tracefs: Restrict tracefs when the kernel is locked down
  debugfs: Restrict debugfs when the kernel is locked down
  kexec: Allow kexec_file() with appropriate IMA policy when locked down
  lockdown: Lock down perf when in confidentiality mode
  bpf: Restrict bpf when kernel lockdown is in confidentiality mode
  lockdown: Lock down tracing and perf kprobes when in confidentiality mode
  lockdown: Lock down /proc/kcore
  x86/mmiotrace: Lock down the testmmiotrace module
  lockdown: Lock down module params that specify hardware parameters (eg. ioport)
  lockdown: Lock down TIOCSSERIAL
  lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
  acpi: Disable ACPI table override if the kernel is locked down
  acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
  ACPI: Limit access to custom_method when the kernel is locked down
  x86/msr: Restrict MSR access when the kernel is locked down
  x86: Lock down IO port access when the kernel is locked down
  ...
2019-09-28 08:14:15 -07:00
Mark Rutland b4ed71f557 mm: treewide: clarify pgtable_page_{ctor,dtor}() naming
The naming of pgtable_page_{ctor,dtor}() seems to have confused a few
people, and until recently arm64 used these erroneously/pointlessly for
other levels of page table.

To make it incredibly clear that these only apply to the PTE level, and to
align with the naming of pgtable_pmd_page_{ctor,dtor}(), let's rename them
to pgtable_pte_page_{ctor,dtor}().

These changes were generated with the following shell script:

----
git grep -lw 'pgtable_page_.tor' | while read FILE; do
    sed -i '{s/pgtable_page_ctor/pgtable_pte_page_ctor/}' $FILE;
    sed -i '{s/pgtable_page_dtor/pgtable_pte_page_dtor/}' $FILE;
done
----

... with the documentation re-flowed to remain under 80 columns, and
whitespace fixed up in macros to keep backslashes aligned.

There should be no functional change as a result of this patch.

Link: http://lkml.kernel.org/r/20190722141133.3116-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 10:10:44 -07:00
Michel Lespinasse 315cc066b8 augmented rbtree: add new RB_DECLARE_CALLBACKS_MAX macro
Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
for the case where the augmented value is a scalar whose definition
follows a max(f(node)) pattern.  This actually covers all present uses of
RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
various RBCOMPUTE function definitions.
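
A typical use looks like this sketch, modeled on the augmented-VMA gap case
(field and helper names follow that example):

        RB_DECLARE_CALLBACKS_MAX(static, vma_gap_callbacks,
                                 struct vm_area_struct, vm_rb,
                                 unsigned long, rb_subtree_gap, vma_compute_gap)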

[walken@google.com: fix mm/vmalloc.c]
  Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
[walken@google.com: re-add check to check_augmented()]
  Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:39 -07:00
Mike Rapoport 782de70c42 mm: consolidate pgtable_cache_init() and pgd_cache_init()
Both pgtable_cache_init() and pgd_cache_init() are used to initialize kmem
cache for page table allocations on several architectures that do not use
PAGE_SIZE tables for one or more levels of the page table hierarchy.

Most architectures do not implement these functions and use __weak default
NOP implementation of pgd_cache_init().  Since there is no such default
for pgtable_cache_init(), its empty stub is duplicated among most
architectures.

Rename the definitions of pgd_cache_init() to pgtable_cache_init() and
drop empty stubs of pgtable_cache_init().

Link: http://lkml.kernel.org/r/1566457046-22637-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Will Deacon <will@kernel.org>		[arm64]
Acked-by: Thomas Gleixner <tglx@linutronix.de>	[x86]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00
Linus Torvalds 45824fc0da powerpc updates for 5.4
- Initial support for running on a system with an Ultravisor, which is software
    that runs below the hypervisor and protects guests against some attacks by
    the hypervisor.
 
  - Support for building the kernel to run as a "Secure Virtual Machine", ie. as
    a guest capable of running on a system with an Ultravisor.
 
  - Some changes to our DMA code on bare metal, to allow devices with medium
    sized DMA masks (> 32 && < 59 bits) to use more than 2GB of DMA space.
 
  - Support for firmware assisted crash dumps on bare metal (powernv).
 
  - Two series fixing bugs in and refactoring our PCI EEH code.
 
  - A large series refactoring our exception entry code to use gas macros, both
    to make it more readable and also enable some future optimisations.
 
 As well as many cleanups and other minor features & fixups.
 
 Thanks to:
   Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh
   Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Balbir Singh, Benjamin
   Herrenschmidt, Cédric Le Goater, Christophe JAILLET, Christophe Leroy,
   Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens,
   David Gibson, David Hildenbrand, Desnes A. Nunes do Rosario, Ganesh Goudar,
   Gautham R. Shenoy, Greg Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari
   Bathini, Joakim Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras,
   Lianbo Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
   Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan Chancellor,
   Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Qian Cai, Ram
   Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm, Sam Bobroff, Santosh Sivaraj,
   Segher Boessenkool, Sukadev Bhattiprolu, Thiago Bauermann, Thiago Jung
   Bauermann, Thomas Gleixner, Tom Lendacky, Vasant Hegde.

Merge tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "This is a bit late, partly due to me travelling, and partly due to a
  power outage knocking out some of my test systems *while* I was
  travelling.

   - Initial support for running on a system with an Ultravisor, which
     is software that runs below the hypervisor and protects guests
     against some attacks by the hypervisor.

   - Support for building the kernel to run as a "Secure Virtual
     Machine", ie. as a guest capable of running on a system with an
     Ultravisor.

   - Some changes to our DMA code on bare metal, to allow devices with
     medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of
     DMA space.

   - Support for firmware assisted crash dumps on bare metal (powernv).

   - Two series fixing bugs in and refactoring our PCI EEH code.

   - A large series refactoring our exception entry code to use gas
     macros, both to make it more readable and also enable some future
     optimisations.

  As well as many cleanups and other minor features & fixups.

  Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew
  Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual,
  Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe
  JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig,
  Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand,
  Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg
  Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim
  Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo
  Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
  Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan
  Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm,
  Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu,
  Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom
  Lendacky, Vasant Hegde"

* tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits)
  powerpc/mm/mce: Keep irqs disabled during lockless page table walk
  powerpc: Use ftrace_graph_ret_addr() when unwinding
  powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
  ftrace: Look up the address of return_to_handler() using helpers
  powerpc: dump kernel log before carrying out fadump or kdump
  docs: powerpc: Add missing documentation reference
  powerpc/xmon: Fix output of XIVE IPI
  powerpc/xmon: Improve output of XIVE interrupts
  powerpc/mm/radix: remove useless kernel messages
  powerpc/fadump: support holes in kernel boot memory area
  powerpc/fadump: remove RMA_START and RMA_END macros
  powerpc/fadump: update documentation about option to release opalcore
  powerpc/fadump: consider f/w load area
  powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file
  powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes
  powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
  powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
  powerpc/fadump: improve how crashed kernel's memory is reserved
  powerpc/fadump: consider reserved ranges while releasing memory
  powerpc/fadump: make crash memory ranges array allocation generic
  ...
2019-09-20 11:48:06 -07:00
Linus Torvalds 3fb7f3a6ed Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti updates from Thomas Gleixner:
 "Two small PTI updates:

   - Handle unaligned addresses gracefully in pti_clone_pagetable(). Not
     an issue with current callers, but a correctness problem. Adds a
     warning so any caller which hands in an unaligned address gets
     pointed out clearly.

   - Prevent PTI functions from being invoked when PTI is disabled at
     boottime. While this does not cause any harm today, it's pointless
     code executed and prone to cause subtle issues if the PTI
     implementation changes internally over time"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/pti: Do not invoke PTI functions when PTI is disabled
  x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
2019-09-17 11:02:21 -07:00
Linus Torvalds ac51667b5b Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:

 - Make cpumask_of_node() more robust against invalid node IDs

 - Simplify and speed up load_mm_cr4()

 - Unexport and remove various unused set_memory_*() APIs

 - Misc cleanups

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix cpumask_of_node() error condition
  x86/mm: Remove the unused set_memory_wt() function
  x86/mm: Remove set_pages_x() and set_pages_nx()
  x86/mm: Remove the unused set_memory_array_*() functions
  x86/mm: Unexport set_memory_x() and set_memory_nx()
  x86/fixmap: Cleanup outdated comments
  x86/kconfig: Remove X86_DIRECT_GBPAGES dependency on !DEBUG_PAGEALLOC
  x86/mm: Avoid redundant interrupt disable in load_mm_cr4()
2019-09-16 19:21:34 -07:00
Linus Torvalds cc9b499a1f Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:

 - refactor the EFI config table handling across architectures

 - add support for the Dell EMC OEM config table

 - include AER diagnostic output to CPER handling of fatal PCIe errors

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: cper: print AER info of PCIe fatal error
  efi: Export Runtime Configuration Interface table to sysfs
  efi: ia64: move SAL systab handling out of generic EFI code
  efi/x86: move UV_SYSTAB handling into arch/x86
  efi: x86: move efi_is_table_address() into arch/x86
2019-09-16 16:47:38 -07:00
Peter Zijlstra bc04a049f0 x86/mm: Fix cpumask_of_node() error condition
When CONFIG_DEBUG_PER_CPU_MAPS=y we validate that the @node argument of
cpumask_of_node() is a valid node_id. It however forgets to check for
negative numbers. Fix this by explicitly casting to unsigned int.

  (unsigned)node >= nr_node_ids

verifies: 0 <= node < nr_node_ids

Also amend the error message to match the condition.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Link: https://lkml.kernel.org/r/20190903075352.GY2369@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-05 13:03:04 +02:00
Christoph Hellwig aeb415fbe9 x86/mm: Remove the unused set_memory_wt() function
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190826075558.8125-5-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:26:37 +02:00
Christoph Hellwig 185be15143 x86/mm: Remove set_pages_x() and set_pages_nx()
These wrappers don't provide a real benefit over just using
set_memory_x() and set_memory_nx().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190826075558.8125-4-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:26:37 +02:00
Christoph Hellwig a919198b97 x86/mm: Remove the unused set_memory_array_*() functions
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190826075558.8125-3-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:26:37 +02:00
Christoph Hellwig ec46133d3b x86/mm: Unexport set_memory_x() and set_memory_nx()
No module currently messes with clearing or setting the execute
permission of kernel memory, and none really should.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190826075558.8125-2-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:26:37 +02:00
Ingo Molnar ae1ad26388 Linux 5.3-rc7

Merge tag 'v5.3-rc7' into x86/mm, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:23:41 +02:00
Thomas Gleixner 990784b577 x86/mm/pti: Do not invoke PTI functions when PTI is disabled
When PTI is disabled at boot time either because the CPU is not affected or
PTI has been disabled on the command line, the boot code still calls into
pti_finalize() which then unconditionally invokes:

     pti_clone_entry_text()
     pti_clone_kernel_text()

pti_clone_kernel_text() was called unconditionally before the 32bit support
was added and 32bit added the call to pti_clone_entry_text().

The call has no side effects, as cloning the page tables into the available
second one, which was allocated for PTI, does not create damage. But it does
not make sense either, and if this functionality is extended later this might
actually lead to hard-to-diagnose issues.

Neither function should be called when PTI is runtime disabled. Make the
invocation conditional.
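
The conditional is essentially (sketch):

        /* In pti_finalize(): nothing to clone when PTI is runtime-disabled */
        if (!boot_cpu_has(X86_FEATURE_PTI))
                return;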

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190828143124.063353972@linutronix.de
2019-08-29 20:52:53 +02:00
Song Liu 825d0b73cd x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
pti_clone_pmds() assumes that the supplied address is either:

 - properly PUD/PMD aligned
or
 - the address is actually mapped which means that independently
   of the mapping level (PUD/PMD/PTE) the next higher mapping
   exists.

If that's not the case the unaligned address can be incremented by PUD or
PMD size incorrectly. All callers supply mapped and/or aligned addresses,
but for the sake of robustness it's better to handle that case properly and
to emit a warning.

[ tglx: Rewrote changelog and added WARN_ON_ONCE() ]

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908282352470.1938@nanos.tec.linutronix.de
2019-08-29 20:52:52 +02:00
Thomas Gleixner 7af0145067 x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text
ftrace does not use text_poke() for enabling trace functionality. It uses
its own mechanism and flips the whole kernel text to RW and back to RO.

The CPA rework removed a loop based check of 4k pages which tried to
preserve a large page by checking each 4k page whether the change would
actually cover all pages in the large page.

This resulted in endless loops for nothing, as in testing it turned out that
it actually never preserved anything. Of course testing failed to include
ftrace, which is the one and only case which benefited from the 4k loop.

As a consequence enabling function tracing or ftrace based kprobes results
in a full 4k split of the kernel text, which affects iTLB performance.

The kernel RO protection is the only valid case where this can actually
preserve large pages.

All other static protections (RO data, data NX, PCI, BIOS) are truly
static.  So a conflict with those protections which results in a split
should only ever happen when a change of memory next to a protected region
is attempted. But these conflicts are rightfully splitting the large page
to preserve the protected regions. In fact a change to the protected
regions itself is a bug and is warned about.

Add an exception to the static protection check for kernel text RO when
the region to be changed spans a full large page, which allows the large
mappings to be preserved. This also prevents the syslog from being spammed
with CPA violations when ftrace is used.

The exception needs to be removed once ftrace switched over to text_poke()
which avoids the whole issue.

Fixes: 585948f4f6 ("x86/mm/cpa: Avoid the 4k pages check completely")
Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908282355340.1938@nanos.tec.linutronix.de
2019-08-29 20:48:44 +02:00
David Howells 906357f77a x86/mmiotrace: Lock down the testmmiotrace module
The testmmiotrace module shouldn't be permitted when the kernel is locked
down as it can be used to arbitrarily read and write MMIO space. This is
a runtime check rather than buildtime in order to allow configurations
where the same kernel may be run in both locked down or permissive modes
depending on local policy.
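
The runtime check at module init is roughly (a sketch using the lockdown LSM
hook introduced by this series):

        ret = security_locked_down(LOCKDOWN_MMIOTRACE);
        if (ret)
                return ret;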

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: Thomas Gleixner <tglx@linutronix.de>
cc: Steven Rostedt <rostedt@goodmis.org>
cc: Ingo Molnar <mingo@kernel.org>
cc: "H. Peter Anvin" <hpa@zytor.com>
cc: x86@kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
Thiago Jung Bauermann ae7eb82a92 fs/core/vmcore: Move sev_active() reference to x86 arch code
Secure Encrypted Virtualization is an x86-specific feature, so it shouldn't
appear in generic kernel code because it forces non-x86 architectures to
define the sev_active() function, which doesn't make a lot of sense.

To solve this problem, add an x86 elfcorehdr_read() function to override
the generic weak implementation. To do that, it's necessary to make
read_from_oldmem() public so that it can be used outside of vmcore.c.

Also, remove the export for sev_active() since it's only used in files that
won't be built as modules.
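
A sketch of the x86 override, assuming read_from_oldmem() gains an
'encrypted' argument as part of being made public:

        ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
        {
                return read_from_oldmem(buf, count, ppos, 0, sev_active());
        }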

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190806044919.10622-6-bauerman@linux.ibm.com
2019-08-09 22:52:10 +10:00
Thiago Jung Bauermann 284e21fab2 x86, s390/mm: Move sme_active() and sme_me_mask to x86-specific header
Now that generic code doesn't reference them, move sme_active() and
sme_me_mask to x86's <asm/mem_encrypt.h>.

Also remove the export for sme_active() since it's only used in files that
won't be built as modules. sme_me_mask on the other hand is used in
arch/x86/kvm/svm.c (via __sme_set() and __psp_pa()) which can be built as a
module so its export needs to stay.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190806044919.10622-5-bauerman@linux.ibm.com
2019-08-09 22:52:08 +10:00
Ard Biesheuvel e55f31a599 efi: x86: move efi_is_table_address() into arch/x86
The function efi_is_table_address() and the associated array of table
pointers is specific to x86. Since we will be adding some more x86
specific tables, let's move this code out of the generic code first.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
2019-08-08 11:01:48 +03:00
Jan Kiszka 21e450d21c x86/mm: Avoid redundant interrupt disable in load_mm_cr4()
load_mm_cr4() is always called with interrupts disabled from:

 - switch_mm_irqs_off()
 - refresh_pce(), which is an on_each_cpu() callback

Thus, disabling interrupts in cr4_set/clear_bits() is redundant.

Implement cr4_set/clear_bits_irqsoff() helpers, rename load_mm_cr4() to
load_mm_cr4_irqsoff() and use the new helpers. The new helpers do not need
a lockdep assert as __cr4_set() has one already.

The renaming, in combination with the checks in __cr4_set(), ensures that any
changes in the boundary conditions at the call sites will be detected.
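
A sketch of the new irqsoff variants, assuming the existing per-CPU cr4
shadow in cpu_tlbstate; names and details are approximate:

  /* Set bits in CR4; the caller must have interrupts disabled. */
  static inline void cr4_set_bits_irqsoff(unsigned long mask)
  {
          unsigned long cr4 = this_cpu_read(cpu_tlbstate.cr4);

          if ((cr4 | mask) != cr4)
                  __cr4_set(cr4 | mask);  /* carries its own lockdep assert */
  }

  /* Clear bits in CR4; the caller must have interrupts disabled. */
  static inline void cr4_clear_bits_irqsoff(unsigned long mask)
  {
          unsigned long cr4 = this_cpu_read(cpu_tlbstate.cr4);

          if ((cr4 & ~mask) != cr4)
                  __cr4_set(cr4 & ~mask);
  }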

[ tglx: Massaged change log ]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/0fbbcb64-5f26-4ffb-1bb9-4f5f48426893@siemens.com
2019-07-24 14:43:37 +02:00
Joerg Roedel 8e998fc24d x86/mm: Sync also unmappings in vmalloc_sync_all()
With huge-page ioremap areas the unmappings also need to be synced between
all page-tables. Otherwise it can cause data corruption when a region is
unmapped and later re-used.

Make the vmalloc_sync_one() function ready to sync unmappings and make sure
vmalloc_sync_all() iterates over all page-tables even when an unmapped PMD
is found.
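
Roughly, the PMD-level sync now propagates a cleared present bit as well,
instead of only filling in missing entries; a simplified sketch:

  pmd = pmd_offset(pud, address);
  pmd_k = pmd_offset(pud_k, address);

  /* Propagate both mappings and unmappings of huge ioremap PMDs. */
  if (pmd_present(*pmd) != pmd_present(*pmd_k))
          set_pmd(pmd, *pmd_k);

  if (!pmd_present(*pmd_k))
          return NULL;    /* nothing mapped; the caller keeps iterating */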

Fixes: 5d72b4fba4 ('x86, mm: support huge I/O mapping capability I/F')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20190719184652.11391-3-joro@8bytes.org
2019-07-22 10:18:30 +02:00
Joerg Roedel 51b75b5b56 x86/mm: Check for pfn instead of page in vmalloc_sync_one()
Do not require a struct page for the mapped memory location because it
might not exist. This can happen when an ioremapped region is mapped with
2MB pages.
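
The consistency check therefore compares page frame numbers rather than
struct page pointers; a sketch of the changed check:

  /*
   * An ioremapped 2MB PMD may point at a pfn that has no struct page,
   * so compare pfns instead of pmd_page() results.
   */
  BUG_ON(pmd_pfn(*pmd) != pmd_pfn(*pmd_k));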

Fixes: 5d72b4fba4 ('x86, mm: support huge I/O mapping capability I/F')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20190719184652.11391-2-joro@8bytes.org
2019-07-22 10:18:30 +02:00
Linus Torvalds ac60602a6d dma-mapping fixes for 5.3-rc1
Fix various regressions:
 
  - force unencrypted dma-coherent buffers if encryption bit can't fit
    into the dma coherent mask (Tom Lendacky)
  - avoid limiting request size if swiotlb is not used (me)
  - fix swiotlb handling in dma_direct_sync_sg_for_cpu/device
    (Fugang Duan)

Merge tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Fix various regressions:

   - force unencrypted dma-coherent buffers if encryption bit can't fit
     into the dma coherent mask (Tom Lendacky)

   - avoid limiting request size if swiotlb is not used (me)

   - fix swiotlb handling in dma_direct_sync_sg_for_cpu/device (Fugang
     Duan)"

* tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: correct the physical addr in dma_direct_sync_sg_for_cpu/device
  dma-direct: only limit the mapping size if swiotlb could be used
  dma-mapping: add a dma_addressing_limited helper
  dma-direct: Force unencrypted DMA under SME for certain DMA masks
2019-07-20 12:09:52 -07:00
Linus Torvalds c6dd78fcb8 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A set of x86 specific fixes and updates:

   - The CR2 corruption fixes which store CR2 early in the entry code
     and hand the stored address to the fault handlers.

   - Revert a forgotten leftover of the dropped FSGSBASE series.

   - Plug a memory leak in the boot code.

   - Make the Hyper-V assist functionality robust by zeroing the shadow
     page.

   - Remove a useless check for dead processes with LDT

   - Update paravirt and VMware maintainers entries.

   - A few cleanup patches addressing various compiler warnings"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry/64: Prevent clobbering of saved CR2 value
  x86/hyper-v: Zero out the VP ASSIST PAGE on allocation
  x86, boot: Remove multiple copy of static function sanitize_boot_params()
  x86/boot/compressed/64: Remove unused variable
  x86/boot/efi: Remove unused variables
  x86/mm, tracing: Fix CR2 corruption
  x86/entry/64: Update comments and sanity tests for create_gap
  x86/entry/64: Simplify idtentry a little
  x86/entry/32: Simplify common_exception
  x86/paravirt: Make read_cr2() CALLEE_SAVE
  MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE
  x86/process: Delete useless check for dead process with LDT
  x86: math-emu: Hide clang warnings for 16-bit overflow
  x86/e820: Use proper booleans instead of 0/1
  x86/apic: Silence -Wtype-limits compiler warnings
  x86/mm: Free sme_early_buffer after init
  x86/boot: Fix memory leak in default_get_smp_config()
  Revert "x86/ptrace: Prevent ptrace from clearing the FS/GS selector" and fix the test
2019-07-20 11:24:49 -07:00
Dan Williams e9c0a3f054 mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
Allow sub-section sized ranges to be added to the memmap.

populate_section_memmap() takes an explicit pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmemmap_populate().  There should be no sub-section usage in
current deployments.  New warnings are added to clarify which memmap
allocation paths are sub-section capable.
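
The reworked interface takes the pfn range explicitly; a prototype sketched
from the description above, with the altmap argument assumed:

  struct page * __meminit populate_section_memmap(unsigned long pfn,
                  unsigned long nr_pages, int nid, struct vmem_altmap *altmap);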

Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
David Hildenbrand 80ec922dbd mm/memory_hotplug: allow arch_remove_memory() without CONFIG_MEMORY_HOTREMOVE
We want to improve error handling while adding memory by allowing the use of
arch_remove_memory() and __remove_pages() even if
CONFIG_MEMORY_HOTREMOVE is not set, to e.g. implement something like:

	arch_add_memory()
	rc = do_something();
	if (rc) {
		arch_remove_memory();
	}

We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
quite some dependencies for memory offlining.

Link: http://lkml.kernel.org/r/20190527111152.16324-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:06 -07:00
Peter Zijlstra a0d14b8909 x86/mm, tracing: Fix CR2 corruption
Despite the current efforts to read CR2 before tracing happens, there still
exist a number of possible holes:

  idtentry page_fault             do_page_fault           has_error_code=1
    call error_entry
      TRACE_IRQS_OFF
        call trace_hardirqs_off*
          #PF // modifies CR2

      CALL_enter_from_user_mode
        __context_tracking_exit()
          trace_user_exit(0)
            #PF // modifies CR2

    call do_page_fault
      address = read_cr2(); /* whoopsie */

And similar for i386.

Fix it by pulling the CR2 read into the entry code, before any of that
stuff gets a chance to run and ruin things.
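
Conceptually, the C handler now receives the CR2 value captured by the entry
stub instead of reading the register itself; a hedged sketch, with the real
plumbing living in the idtentry assembly:

  dotraplinkage void
  do_page_fault(struct pt_regs *regs, unsigned long error_code,
                unsigned long address)
  {
          /*
           * 'address' is the CR2 value read by the entry code before
           * TRACE_IRQS_OFF or context tracking had a chance to fault
           * and clobber the register.
           */
          __do_page_fault(regs, error_code, address);
  }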

Reported-by: He Zhe <zhe.he@windriver.com>
Reported-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: bp@alien8.de
Cc: rostedt@goodmis.org
Cc: torvalds@linux-foundation.org
Cc: hpa@zytor.com
Cc: dave.hansen@linux.intel.com
Cc: jgross@suse.com
Cc: joel@joelfernandes.org
Link: https://lkml.kernel.org/r/20190711114336.116812491@infradead.org

Debugged-by: Steven Rostedt <rostedt@goodmis.org>
2019-07-17 23:17:38 +02:00
Anshuman Khandual b98cca444d mm, kprobes: generalize and rename notify_page_fault() as kprobe_page_fault()
Architectures which support kprobes have very similar boilerplate around
calling kprobe_fault_handler().  Use a helper function in kprobes.h to
unify them, based on the x86 code.

This changes the behaviour for other architectures when preemption is
enabled.  Previously, they would have disabled preemption while calling
the kprobe handler.  However, preemption would already have been disabled if
this fault was due to a kprobe, so when we are preemptible we know the fault
was not due to a kprobe handler and can simply return failure.

This behaviour was introduced in commit a980c0ef9f ("x86/kprobes:
Refactor kprobes_fault() like kprobe_exceptions_notify()")
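
The shared helper boils down to something like the following sketch, based
on the x86 logic described above; exact names may differ:

  static nokprobe_inline bool kprobe_page_fault(struct pt_regs *regs,
                                                unsigned int trap)
  {
          if (!kprobes_built_in() || user_mode(regs))
                  return false;
          /*
           * A fault raised from a kprobe handler would have happened with
           * preemption disabled, so a preemptible context cannot be a
           * kprobe-related fault.
           */
          if (preemptible() || !kprobe_running())
                  return false;
          return kprobe_fault_handler(regs, trap);
  }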

[anshuman.khandual@arm.com: export kprobe_fault_handler()]
  Link: http://lkml.kernel.org/r/1561133358-8876-1-git-send-email-anshuman.khandual@arm.com
Link: http://lkml.kernel.org/r/1560420444-25737-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:22 -07:00
Anshuman Khandual 0f472d04f5 mm/ioremap: probe platform for p4d huge map support
Finish up what commit c2febafc67 ("mm: convert generic code to 5-level
paging") started by bringing P4D huge mapping support up to par with
PUD and PMD.  A new arch callback arch_ioremap_p4d_supported() is added
which just maintains status quo (P4D huge map not supported) on x86,
arm64 and powerpc.

When HAVE_ARCH_HUGE_VMAP is enabled, it's just a simple check from the
arch about the support, hence runtime effects are minimal.
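
On each of the named architectures the new callback simply reports "not
supported" for now; a sketch of the x86 variant:

  int __init arch_ioremap_p4d_supported(void)
  {
          return 0;       /* no P4D huge I/O mappings yet */
  }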

Link: http://lkml.kernel.org/r/1561699231-20991-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:22 -07:00
David Rientjes ffdb07f312 x86/mm: Free sme_early_buffer after init
The contents of sme_early_buffer should be cleared after
__sme_early_enc_dec() because it is used to move encrypted and decrypted
data, but since __sme_early_enc_dec() is __init, this buffer can simply be
freed after init.

This saves a page that is otherwise unreferenced after init.
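
In practice this amounts to marking the bounce buffer __initdata so its page
is handed back when init memory is released; a sketch:

  /* Buffer used for early in-place encryption by the BSP; freed after init. */
  static char sme_early_buffer[PAGE_SIZE] __initdata __aligned(PAGE_SIZE);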

Reported-by: Cfir Cohen <cfir@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907101318170.197432@chino.kir.corp.google.com
2019-07-16 23:13:48 +02:00
Tom Lendacky 9087c37584 dma-direct: Force unencrypted DMA under SME for certain DMA masks
If a device doesn't support DMA to a physical address that includes the
encryption bit (currently bit 47, so 48-bit DMA), then the DMA must
occur to unencrypted memory. SWIOTLB is used to satisfy that requirement
if an IOMMU is not active (enabled or configured in passthrough mode).

However, commit fafadcd165 ("swiotlb: don't dip into swiotlb pool for
coherent allocations") modified the coherent allocation support in
SWIOTLB to use the DMA direct coherent allocation support. When an IOMMU
is not active, this resulted in dma_alloc_coherent() failing for devices
that didn't support DMA addresses that included the encryption bit.

Addressing this requires changes to the force_dma_unencrypted() function
in kernel/dma/direct.c. Since the function is now non-trivial and
SME/SEV specific, update the DMA direct support to add an arch override
for the force_dma_unencrypted() function. The arch override is selected
when CONFIG_AMD_MEM_ENCRYPT is set. The arch override function resides in
the arch/x86/mm/mem_encrypt.c file and forces unencrypted DMA when either
SEV is active or SME is active and the device does not support DMA to
physical addresses that include the encryption bit.
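
A hedged sketch of the arch override; field names such as bus_dma_mask
reflect the 5.3-era struct device and are assumed here:

  bool force_dma_unencrypted(struct device *dev)
  {
          /* Under SEV, all DMA must target unencrypted memory. */
          if (sev_active())
                  return true;

          /*
           * Under SME, unencrypted DMA is only required when the device
           * cannot address physical memory with the encryption bit set.
           */
          if (sme_active()) {
                  u64 dma_enc_mask = DMA_BIT_MASK(__ffs64(sme_me_mask));
                  u64 dma_dev_mask = min_not_zero(dev->coherent_dma_mask,
                                                  dev->bus_dma_mask);

                  if (dma_dev_mask <= dma_enc_mask)
                          return true;
          }

          return false;
  }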

Fixes: fafadcd165 ("swiotlb: don't dip into swiotlb pool for coherent allocations")
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
[hch: moved the force_dma_unencrypted declaration to dma-mapping.h,
      fold the s390 fix from Halil Pasic]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-16 22:15:46 +02:00
Linus Torvalds fec88ab0af HMM patches for 5.3
Improvements and bug fixes for the hmm interface in the kernel:
 
 - Improve clarity, locking and APIs related to the 'hmm mirror' feature
   merged last cycle. In linux-next we now see AMDGPU and nouveau
   using this API.
 
 - Remove old or transitional hmm APIs. These are holdovers from the past
   with no users, or APIs that existed only to manage cross-tree conflicts.
   There are still a few more of these cleanups that didn't make the merge
   window cut off.
 
 - Improve some core mm APIs:
   * export alloc_pages_vma() for driver use
   * refactor into devm_request_free_mem_region() to manage
     DEVICE_PRIVATE resource reservations
   * refactor duplicative driver code into the core dev_pagemap
     struct
 
 - Remove hmm wrappers of improved core mm APIs, instead have drivers use
   the simplified API directly
 
 - Remove DEVICE_PUBLIC
 
 - Simplify the kconfig flow for the hmm users and core code

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull HMM updates from Jason Gunthorpe:
 "Improvements and bug fixes for the hmm interface in the kernel:

   - Improve clarity, locking and APIs related to the 'hmm mirror'
     feature merged last cycle. In linux-next we now see AMDGPU and
     nouveau using this API.

   - Remove old or transitional hmm APIs. These are holdovers from the
     past with no users, or APIs that existed only to manage cross-tree
     conflicts. There are still a few more of these cleanups that didn't
     make the merge window cut off.

   - Improve some core mm APIs:
       - export alloc_pages_vma() for driver use
       - refactor into devm_request_free_mem_region() to manage
         DEVICE_PRIVATE resource reservations
       - refactor duplicative driver code into the core dev_pagemap
         struct

   - Remove hmm wrappers of improved core mm APIs, instead have drivers
     use the simplified API directly

   - Remove DEVICE_PUBLIC

   - Simplify the kconfig flow for the hmm users and core code"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...
2019-07-14 19:42:11 -07:00