From 1f35c9c0ce3888405fc813afedaff79de433cf27 Mon Sep 17 00:00:00 2001 From: Chris Down Date: Fri, 21 Aug 2020 13:10:24 +0100 Subject: [PATCH 1/4] x86/msr: Prevent userspace MSR access from dominating the console Applications which manipulate MSRs from userspace often do so infrequently, and all at once. As such, the default printk ratelimit architecture supplied by pr_err_ratelimited() doesn't do enough to prevent kmsg becoming completely overwhelmed with their messages and pushing other salient information out of the circular buffer. In one case, I saw over 80% of kmsg being filled with these messages, and the default kmsg buffer being completely filled less than 5 minutes after boot(!). Make things much less aggressive, while still achieving the original goal of fiter_write(). Operators will still get warnings that MSRs are being manipulated from userspace, but they won't have other also potentially useful messages pushed out of the kmsg buffer. Of course, one can boot with `allow_writes=1` to avoid these messages at all, but that then has the downfall that one doesn't get _any_ notification at all about these problems in the first place, and so is much less likely to forget to fix it. One might rather it was less binary: it was still logged, just less often, so that application developers _do_ have the incentive to improve their current methods, without the kernel having to push other useful stuff out of the kmsg buffer. This one example isn't the point, of course: I'm sure there are plenty of other non-ideal-but-pragmatic cases where people are writing to MSRs from userspace right now, and it will take time for those people to find other solutions. Overall, keep the intent of the original patch, while mitigating its sometimes heavy effects on kmsg composition. [ bp: Massage a bit. ] Signed-off-by: Chris Down Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/563994ef132ce6cffd28fc659254ca37d032b5ef.1598011595.git.chris@chrisdown.name --- arch/x86/kernel/msr.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c index 49dcfb85e773..b03001dfc613 100644 --- a/arch/x86/kernel/msr.c +++ b/arch/x86/kernel/msr.c @@ -80,18 +80,30 @@ static ssize_t msr_read(struct file *file, char __user *buf, static int filter_write(u32 reg) { + /* + * MSRs writes usually happen all at once, and can easily saturate kmsg. + * Only allow one message every 30 seconds. + * + * It's possible to be smarter here and do it (for example) per-MSR, but + * it would certainly be more complex, and this is enough at least to + * avoid saturating the ring buffer. + */ + static DEFINE_RATELIMIT_STATE(fw_rs, 30 * HZ, 1); + switch (allow_writes) { case MSR_WRITES_ON: return 0; case MSR_WRITES_OFF: return -EPERM; default: break; } + if (!__ratelimit(&fw_rs)) + return 0; + if (reg == MSR_IA32_ENERGY_PERF_BIAS) return 0; - pr_err_ratelimited("Write to unrecognized MSR 0x%x by %s\n" - "Please report to x86@kernel.org\n", - reg, current->comm); + pr_err("Write to unrecognized MSR 0x%x by %s\n" + "Please report to x86@kernel.org\n", reg, current->comm); return 0; } From c31feed8461fb8648075ba9b53d9e527d530972f Mon Sep 17 00:00:00 2001 From: Chris Down Date: Fri, 21 Aug 2020 13:10:35 +0100 Subject: [PATCH 2/4] x86/msr: Make source of unrecognised MSR writes unambiguous In many cases, task_struct.comm isn't enough to distinguish the offender, since for interpreted languages it's likely just going to be "python3" or whatever. Add the pid to make it unambiguous. [ bp: Make the printk string a single line for easier grepping. ] Signed-off-by: Chris Down Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/6f6fbd0ee6c99bc5e47910db700a6642159db01b.1598011595.git.chris@chrisdown.name --- arch/x86/kernel/msr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c index b03001dfc613..c0d409810658 100644 --- a/arch/x86/kernel/msr.c +++ b/arch/x86/kernel/msr.c @@ -102,8 +102,8 @@ static int filter_write(u32 reg) if (reg == MSR_IA32_ENERGY_PERF_BIAS) return 0; - pr_err("Write to unrecognized MSR 0x%x by %s\n" - "Please report to x86@kernel.org\n", reg, current->comm); + pr_err("Write to unrecognized MSR 0x%x by %s (pid: %d). Please report to x86@kernel.org.\n", + reg, current->comm, current->pid); return 0; } From ea4e3bef4c94d5b5d8e790f37b48b7641172bcf5 Mon Sep 17 00:00:00 2001 From: Kyung Min Park Date: Mon, 31 Aug 2020 11:35:00 -0700 Subject: [PATCH 3/4] Documentation/x86: Add documentation for /proc/cpuinfo feature flags /proc/cpuinfo shows features which the kernel supports. Some of these flags are derived from CPUID, and others are derived from other sources, including some that are entirely software-based. Currently, there is not any documentation in the kernel about how /proc/cpuinfo flags are generated and what it means when they are missing. Add documentation for /proc/cpuinfo feature flags enumeration. Document how and when x86 feature flags are used. Also discuss what their presence or absence mean for the kernel and users. Suggested-by: Dave Hansen Co-developed-by: Ricardo Neri Signed-off-by: Ricardo Neri Signed-off-by: Kyung Min Park Signed-off-by: Borislav Petkov Reviewed-by: Tony Luck Link: https://lkml.kernel.org/r/20200831183500.15481-1-kyung.min.park@intel.com --- Documentation/x86/cpuinfo.rst | 155 ++++++++++++++++++++++++++++++++++ Documentation/x86/index.rst | 1 + 2 files changed, 156 insertions(+) create mode 100644 Documentation/x86/cpuinfo.rst diff --git a/Documentation/x86/cpuinfo.rst b/Documentation/x86/cpuinfo.rst new file mode 100644 index 000000000000..5d54c39a063f --- /dev/null +++ b/Documentation/x86/cpuinfo.rst @@ -0,0 +1,155 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +x86 Feature Flags +================= + +Introduction +============ + +On x86, flags appearing in /proc/cpuinfo have an X86_FEATURE definition +in arch/x86/include/asm/cpufeatures.h. If the kernel cares about a feature +or KVM want to expose the feature to a KVM guest, it can and should have +an X86_FEATURE_* defined. These flags represent hardware features as +well as software features. + +If users want to know if a feature is available on a given system, they +try to find the flag in /proc/cpuinfo. If a given flag is present, it +means that the kernel supports it and is currently making it available. +If such flag represents a hardware feature, it also means that the +hardware supports it. + +If the expected flag does not appear in /proc/cpuinfo, things are murkier. +Users need to find out the reason why the flag is missing and find the way +how to enable it, which is not always easy. There are several factors that +can explain missing flags: the expected feature failed to enable, the feature +is missing in hardware, platform firmware did not enable it, the feature is +disabled at build or run time, an old kernel is in use, or the kernel does +not support the feature and thus has not enabled it. In general, /proc/cpuinfo +shows features which the kernel supports. For a full list of CPUID flags +which the CPU supports, use tools/arch/x86/kcpuid. + +How are feature flags created? +============================== + +a: Feature flags can be derived from the contents of CPUID leaves. +------------------------------------------------------------------ +These feature definitions are organized mirroring the layout of CPUID +leaves and grouped in words with offsets as mapped in enum cpuid_leafs +in cpufeatures.h (see arch/x86/include/asm/cpufeatures.h for details). +If a feature is defined with a X86_FEATURE_ definition in +cpufeatures.h, and if it is detected at run time, the flags will be +displayed accordingly in /proc/cpuinfo. For example, the flag "avx2" +comes from X86_FEATURE_AVX2 in cpufeatures.h. + +b: Flags can be from scattered CPUID-based features. +---------------------------------------------------- +Hardware features enumerated in sparsely populated CPUID leaves get +software-defined values. Still, CPUID needs to be queried to determine +if a given feature is present. This is done in init_scattered_cpuid_features(). +For instance, X86_FEATURE_CQM_LLC is defined as 11*32 + 0 and its presence is +checked at runtime in the respective CPUID leaf [EAX=f, ECX=0] bit EDX[1]. + +The intent of scattering CPUID leaves is to not bloat struct +cpuinfo_x86.x86_capability[] unnecessarily. For instance, the CPUID leaf +[EAX=7, ECX=0] has 30 features and is dense, but the CPUID leaf [EAX=7, EAX=1] +has only one feature and would waste 31 bits of space in the x86_capability[] +array. Since there is a struct cpuinfo_x86 for each possible CPU, the wasted +memory is not trivial. + +c: Flags can be created synthetically under certain conditions for hardware features. +------------------------------------------------------------------------------------- +Examples of conditions include whether certain features are present in +MSR_IA32_CORE_CAPS or specific CPU models are identified. If the needed +conditions are met, the features are enabled by the set_cpu_cap or +setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS, +the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and +"split_lock_detect" will be displayed. The flag "ring3mwait" will be +displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors. + +d: Flags can represent purely software features. +------------------------------------------------ +These flags do not represent hardware features. Instead, they represent a +software feature implemented in the kernel. For example, Kernel Page Table +Isolation is purely software feature and its feature flag X86_FEATURE_PTI is +also defined in cpufeatures.h. + +Naming of Flags +=============== + +The script arch/x86/kernel/cpu/mkcapflags.sh processes the +#define X86_FEATURE_ from cpufeatures.h and generates the +x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the +resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming +of flags in the x86_cap/bug_flags[] are as follows: + +a: The name of the flag is from the string in X86_FEATURE_ by default. +---------------------------------------------------------------------------- +By default, the flag in /proc/cpuinfo is extracted from the respective +X86_FEATURE_ in cpufeatures.h. For example, the flag "avx2" is from +X86_FEATURE_AVX2. + +b: The naming can be overridden. +-------------------------------- +If the comment on the line for the #define X86_FEATURE_* starts with a +double-quote character (""), the string inside the double-quote characters +will be the name of the flags. For example, the flag "sse4_1" comes from +the comment "sse4_1" following the X86_FEATURE_XMM4_1 definition. + +There are situations in which overriding the displayed name of the flag is +needed. For instance, /proc/cpuinfo is a userspace interface and must remain +constant. If, for some reason, the naming of X86_FEATURE_ changes, one +shall override the new naming with the name already used in /proc/cpuinfo. + +c: The naming override can be "", which means it will not appear in /proc/cpuinfo. +---------------------------------------------------------------------------------- +The feature shall be omitted from /proc/cpuinfo if it does not make sense for +the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is +defined in cpufeatures.h but that flag is an internal kernel feature used +in the alternative runtime patching functionality. So, its name is overridden +with "". Its flag will not appear in /proc/cpuinfo. + +Flags are missing when one or more of these happen +================================================== + +a: The hardware does not enumerate support for it. +-------------------------------------------------- +For example, when a new kernel is running on old hardware or the feature is +not enabled by boot firmware. Even if the hardware is new, there might be a +problem enabling the feature at run time, the flag will not be displayed. + +b: The kernel does not know about the flag. +------------------------------------------- +For example, when an old kernel is running on new hardware. + +c: The kernel disabled support for it at compile-time. +------------------------------------------------------ +For example, if 5-level-paging is not enabled when building (i.e., +CONFIG_X86_5LEVEL is not selected) the flag "la57" will not show up [#f1]_. +Even though the feature will still be detected via CPUID, the kernel disables +it by clearing via setup_clear_cpu_cap(X86_FEATURE_LA57). + +d: The feature is disabled at boot-time. +---------------------------------------- +A feature can be disabled either using a command-line parameter or because +it failed to be enabled. The command-line parameter clearcpuid= can be used +to disable features using the feature number as defined in +/arch/x86/include/asm/cpufeatures.h. For instance, User Mode Instruction +Protection can be disabled using clearcpuid=514. The number 514 is calculated +from #define X86_FEATURE_UMIP (16*32 + 2). + +In addition, there exists a variety of custom command-line parameters that +disable specific features. The list of parameters includes, but is not limited +to, nofsgsbase, nosmap, and nosmep. 5-level paging can also be disabled using +"no5lvl". SMAP and SMEP are disabled with the aforementioned parameters, +respectively. + +e: The feature was known to be non-functional. +---------------------------------------------- +The feature was known to be non-functional because a dependency was +missing at runtime. For example, AVX flags will not show up if XSAVE feature +is disabled since they depend on XSAVE feature. Another example would be broken +CPUs and them missing microcode patches. Due to that, the kernel decides not to +enable a feature. + +.. [#f1] 5-level paging uses linear address of 57 bits. diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 265d9e9a093b..d5adb0ab8668 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -9,6 +9,7 @@ x86-specific Documentation :numbered: boot + cpuinfo topology exception-tables kernel-stacks From f94c91f7ba3ba7de2bc8aa31be28e1abb22f849e Mon Sep 17 00:00:00 2001 From: Libing Zhou Date: Thu, 20 Aug 2020 10:56:41 +0800 Subject: [PATCH 4/4] x86/nmi: Fix nmi_handle() duration miscalculation When nmi_check_duration() is checking the time an NMI handler took to execute, the whole_msecs value used should be read from the @duration argument, not from the ->max_duration, the latter being used to store the current maximal duration. [ bp: Rewrite commit message. ] Fixes: 248ed51048c4 ("x86/nmi: Remove irq_work from the long duration NMI handler") Suggested-by: Peter Zijlstra (Intel) Signed-off-by: Libing Zhou Signed-off-by: Borislav Petkov Cc: Changbin Du Link: https://lkml.kernel.org/r/20200820025641.44075-1-libing.zhou@nokia-sbell.com --- arch/x86/kernel/nmi.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 4fc9954a9560..47381666d6a5 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -102,7 +102,6 @@ fs_initcall(nmi_warning_debugfs); static void nmi_check_duration(struct nmiaction *action, u64 duration) { - u64 whole_msecs = READ_ONCE(action->max_duration); int remainder_ns, decimal_msecs; if (duration < nmi_longest_ns || duration < action->max_duration) @@ -110,12 +109,12 @@ static void nmi_check_duration(struct nmiaction *action, u64 duration) action->max_duration = duration; - remainder_ns = do_div(whole_msecs, (1000 * 1000)); + remainder_ns = do_div(duration, (1000 * 1000)); decimal_msecs = remainder_ns / 1000; printk_ratelimited(KERN_INFO "INFO: NMI handler (%ps) took too long to run: %lld.%03d msecs\n", - action->handler, whole_msecs, decimal_msecs); + action->handler, duration, decimal_msecs); } static int nmi_handle(unsigned int type, struct pt_regs *regs)