alistair23-linux

redonkable

Author	SHA1	Message	Date
Borislav Petkov	066d115fdd	EDAC/amd64: Fix PCI component registration commit `706657b1fe` upstream. In order to setup its PCI component, the driver needs any node private instance in order to get a reference to the PCI device and hand that into edac_pci_create_generic_ctl(). For convenience, it uses the 0th memory controller descriptor under the assumption that if any, the 0th will be always present. However, this assumption goes wrong when the 0th node doesn't have memory and the driver doesn't initialize an instance for it: EDAC amd64: F17h detected (node 0). ... EDAC amd64: Node 0: No DIMMs detected. But looking up node instances is not really needed - all one needs is the pointer to the proper device which gets discovered during instance init. So stash that pointer into a variable and use it when setting up the EDAC PCI component. Clear that variable when the driver needs to unwind due to some instances failing init to avoid any registration imbalance. Cc: <stable@vger.kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201122150815.13808-1-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-12-30 11:51:36 +01:00
Qiuxu Zhuo	f4ce4a53c4	EDAC/i10nm: Use readl() to access MMIO registers commit `83ff51c4e3` upstream. Instead of raw access, use readl() to access MMIO registers of memory controller to avoid possible compiler re-ordering. Fixes: `d4dc89d069` ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Cc: <stable@vger.kernel.org> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-12-30 11:51:36 +01:00
Yazen Ghannam	3e08a61b2f	EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId [ Upstream commit `8de0c9917c` ] The edac_mce_amd module calls decode_dram_ecc() on AMD Family17h and later systems. This function is used in amd64_edac_mod to do system-specific decoding for DRAM ECC errors. The function takes a "NodeId" as a parameter. In AMD documentation, NodeId is used to identify a physical die in a system. This can be used to identify a node in the AMD_NB code and also it is used with umc_normaddr_to_sysaddr(). However, the input used for decode_dram_ecc() is currently the NUMA node of a logical CPU. In the default configuration, the NUMA node and physical die will be equivalent, so this doesn't have an impact. But the NUMA node configuration can be adjusted with optional memory interleaving modes. This will cause the NUMA node enumeration to not match the physical die enumeration. The mismatch will cause the address translation function to fail or report incorrect results. Use struct cpuinfo_x86.cpu_die_id for the node_id parameter to ensure the physical ID is used. Fixes: `fbe63acf62` ("EDAC, mce_amd: Use cpu_to_node() to find the node ID") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201109210659.754018-4-Yazen.Ghannam@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-12-30 11:51:10 +01:00
Krzysztof Kozlowski	9aee821655	EDAC/ti: Fix handling of platform_get_irq() error [ Upstream commit `66077adb70` ] platform_get_irq() returns a negative error number on error. In such a case, comparison to 0 would pass the check therefore check the return value properly, whether it is negative. [ bp: Massage commit message. ] Fixes: `86a18ee21e` ("EDAC, ti: Add support for TI keystone and DRA7xx EDAC") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tero Kristo <t-kristo@ti.com> Link: https://lkml.kernel.org/r/20200827070743.26628-2-krzk@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-10-29 09:57:29 +01:00
Krzysztof Kozlowski	64a9f5a30f	EDAC/aspeed: Fix handling of platform_get_irq() error [ Upstream commit `afce699694` ] platform_get_irq() returns a negative error number on error. In such a case, comparison to 0 would pass the check therefore check the return value properly, whether it is negative. [ bp: Massage commit message. ] Fixes: `9b7e6242ee` ("EDAC, aspeed: Add an Aspeed AST2500 EDAC driver") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Stefan Schaeckeler <schaecsn@gmx.net> Link: https://lkml.kernel.org/r/20200827070743.26628-1-krzk@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-10-29 09:57:29 +01:00
Dinghao Liu	4d86328e42	EDAC/i5100: Fix error handling order in i5100_init_one() [ Upstream commit `857a3139bd` ] When pci_get_device_func() fails, the driver doesn't need to execute pci_dev_put(). mci should still be freed, though, to prevent a memory leak. When pci_enable_device() fails, the error injection PCI device "einj" doesn't need to be disabled either. [ bp: Massage commit message, rename label to "bail_mc_free". ] Fixes: `52608ba205` ("i5100_edac: probe for device 19 function 0") Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200826121437.31606-1-dinghao.liu@zju.edu.cn Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-10-29 09:57:29 +01:00
Borislav Petkov	b11f2d6b80	EDAC/ghes: Check whether the driver is on the safe list correctly [ Upstream commit `251c54ea26` ] With CONFIG_DEBUG_TEST_DRIVER_REMOVE=y, a system would try to probe, unregister and probe again a driver. When ghes_edac is attempted to be loaded on a system which is not on the safe platforms list, ghes_edac_register() would return early. The unregister counterpart ghes_edac_unregister() would still attempt to unregister and exit early at the refcount test, leading to the refcount underflow below. In order to not do anything on the unregister path too, reuse the force_load parameter and check it on that path too, before fumbling with the refcount. ghes_edac: ghes_edac_register: entry ghes_edac: ghes_edac_register: return -ENODEV ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 10 PID: 1 at lib/refcount.c:28 refcount_warn_saturate+0xb9/0x100 Modules linked in: CPU: 10 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #12 Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018 RIP: 0010:refcount_warn_saturate+0xb9/0x100 Code: 82 e8 fb 8f 4d 00 90 0f 0b 90 90 c3 80 3d 55 4c f5 00 00 75 88 c6 05 4c 4c f5 00 01 90 48 c7 c7 d0 8a 10 82 e8 d8 8f 4d 00 90 <0f> 0b 90 90 c3 80 3d 30 4c f5 00 00 0f 85 61 ff ff ff c6 05 23 4c RSP: 0018:ffffc90000037d58 EFLAGS: 00010292 RAX: 0000000000000026 RBX: ffff88840b8da000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff8216b24f RDI: 00000000ffffffff RBP: ffff88840c662e00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000046 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88840ee80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000800002211000 CR4: 00000000003506e0 Call Trace: ghes_edac_unregister ghes_remove platform_drv_remove really_probe driver_probe_device device_driver_attach __driver_attach ? device_driver_attach ? device_driver_attach bus_for_each_dev bus_add_driver driver_register ? bert_init ghes_init do_one_initcall ? rcu_read_lock_sched_held kernel_init_freeable ? rest_init kernel_init ret_from_fork ... ghes_edac: ghes_edac_unregister: FALSE, refcount: -1073741824 Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200911164950.GB19320@zn.tnic Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-10-01 13:18:14 +02:00
Tony Luck	6d11320bed	EDAC/{i7core,sb,pnd2,skx}: Fix error event severity [ Upstream commit `45bc6098a3` ] IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto the stack as part of machine check delivery is valid or not. Various drivers copied a code fragment that uses the RIPV bit to determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where RIPV is set as "FATAL"). Reverse the tests so that the error is marked fatal when RIPV is not set. Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-09-03 11:26:53 +02:00
Mauro Carvalho Chehab	87cc96bb11	EDAC: skx_common: get rid of unused type var [ Upstream commit `f05390d30e` ] drivers/edac/skx_common.c: In function ‘skx_mce_output_error’: drivers/edac/skx_common.c:478:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable] 478 \| char type, optype; \| ^~~~ Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-09-03 11:26:52 +02:00
Mauro Carvalho Chehab	3bf42b2e8d	EDAC: sb_edac: get rid of unused vars [ Upstream commit `323014d85d` ] There are several vars unused on this driver, probably because it was a modified copy of another driver. Get rid of them. drivers/edac/sb_edac.c: In function ‘knl_get_dimm_capacity’: drivers/edac/sb_edac.c:1343:16: warning: variable ‘sad_size’ set but not used [-Wunused-but-set-variable] 1343 \| u64 sad_base, sad_size, sad_limit = 0; \| ^~~~~~~~ drivers/edac/sb_edac.c: In function ‘sbridge_mce_output_error’: drivers/edac/sb_edac.c:2955:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable] 2955 \| char type, optype, msg[256]; \| ^~~~ drivers/edac/sb_edac.c: In function ‘sbridge_unregister_mci’: drivers/edac/sb_edac.c:3203:22: warning: variable ‘pvt’ set but not used [-Wunused-but-set-variable] 3203 \| struct sbridge_pvt *pvt; \| ^~~ At top level: drivers/edac/sb_edac.c:266:18: warning: ‘correrrthrsld’ defined but not used [-Wunused-const-variable=] 266 \| static const u32 correrrthrsld[] = { \| ^~~~~~~~~~~~~ drivers/edac/sb_edac.c:257:18: warning: ‘correrrcnt’ defined but not used [-Wunused-const-variable=] 257 \| static const u32 correrrcnt[] = { \| ^~~~~~~~~~ Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-09-03 11:26:52 +02:00
Jason Baron	c67c6e1f54	EDAC/ie31200: Fallback if host bridge device is already initialized [ Upstream commit `709ed1bcef` ] The Intel uncore driver may claim some of the pci ids from ie31200 which means that the ie31200 edac driver will not initialize them as part of pci_register_driver(). Let's add a fallback for this case to 'pci_get_device()' to get a reference on the device such that it can still be configured. This is similar in approach to other edac drivers. Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Borislav Petkov <bp@suse.de> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/1594923911-10885-1-git-send-email-jbaron@akamai.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-09-03 11:26:48 +02:00
Qiushi Wu	c73eec04e6	EDAC: Fix reference count leaks [ Upstream commit `17ed808ad2` ] When kobject_init_and_add() returns an error, it should be handled because kobject_init_and_add() takes a reference even when it fails. If this function returns an error, kobject_put() must be called to properly clean up the memory associated with the object. Therefore, replace calling kfree() and call kobject_put() and add a missing kobject_put() in the edac_device_register_sysfs_main_kobj() error path. [ bp: Massage and merge into a single patch. ] Fixes: `b2ed215a33` ("Kobject: change drivers/edac to use kobject_init_and_add") Signed-off-by: Qiushi Wu <wu000273@umn.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200528202238.18078-1-wu000273@umn.edu Link: https://lkml.kernel.org/r/20200528203526.20908-1-wu000273@umn.edu Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-08-19 08:15:54 +02:00
Borislav Petkov	58ab86e58b	EDAC/amd64: Read back the scrub rate PCI register on F15h [ Upstream commit `ee470bb25d` ] Commit: `da92110dfd` ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") added support for F15h, model 0x60 CPUs but in doing so, missed to read back SCRCTRL PCI config register on F15h CPUs which are not model 0x60. Add that read so that doing $ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate can show the previously set DRAM scrub rate. Fixes: `da92110dfd` ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h") Reported-by: Anders Andersson <pipatron@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> #v4.4.. Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-07-09 09:37:49 +02:00
Alexander Monakov	7c71b9aa18	EDAC/amd64: Add AMD family 17h model 60h PCI IDs commit `b6bea24d41` upstream. Add support for AMD Renoir (4000-series Ryzen CPUs). Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lkml.kernel.org/r/20200510204842.2603-4-amonakov@ispras.ru Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-06-22 09:31:20 +02:00
Qiuxu Zhuo	78e6964dce	EDAC/skx: Use the mcmtr register to retrieve close_pg/bank_xor_enable commit `1032095053` upstream. The skx_edac driver wrongly uses the mtr register to retrieve two fields close_pg and bank_xor_enable. Fix it by using the correct mcmtr register to get the two fields. Cc: <stable@vger.kernel.org> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reported-by: Matthew Riley <mattdr@google.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-06-17 16:40:33 +02:00
Sherry Sun	90335d6681	EDAC/synopsys: Do not print an error with back-to-back snprintf() calls commit `dfc6014e3b` upstream. handle_error() currently calls snprintf() a couple of times in succession to output the message for a CE/UE, therefore overwriting each part of the message which was formatted with the previous snprintf() call. As a result, only the part of the message from the last snprintf() call will be printed. The simplest and most effective way to fix this problem is to combine the whole string into one which to supply to a single snprintf() call. [ bp: Massage. ] Fixes: `b500b4a029` ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller") Signed-off-by: Sherry Sun <sherry.sun@nxp.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Cc: Manish Narani <manish.narani@xilinx.com> Link: https://lkml.kernel.org/r/1582792452-32575-1-git-send-email-sherry.sun@nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-03-12 13:00:31 +01:00
Aristeu Rozanski	728afb955b	EDAC: skx_common: downgrade message importance on missing PCI device [ Upstream commit `854bb48018` ] Both skx_edac and i10nm_edac drivers are loaded based on the matching CPU being available which leads the module to be automatically loaded in virtual machines as well. That will fail due the missing PCI devices. In both drivers the first function to make use of the PCI devices is skx_get_hi_lo() will simply print EDAC skx: Can't get tolm/tohm for each CPU core, which is noisy. This patch makes it a debug message. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20191204212325.c4k47p5hrnn3vpb5@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-03-05 16:43:31 +01:00
Wei Yongjun	d4870a4343	EDAC/sifive: Fix return value check in ecc_register() [ Upstream commit `6cd18453b6` ] In case of error, the function edac_device_alloc_ctl_info() returns a NULL pointer, not ERR_PTR(). Replace the IS_ERR() test in the return value check with a NULL test. Fixes: `91abaeaaff` ("EDAC/sifive: Add EDAC platform driver for SiFive SoCs") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200115150303.112627-1-weiyongjun1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-02-24 08:36:51 +01:00
Robert Richter	ce8b9b8032	EDAC/mc: Fix use-after-free and memleaks during device removal commit `216aa145aa` upstream. A test kernel with the options DEBUG_TEST_DRIVER_REMOVE, KASAN and DEBUG_KMEMLEAK set, revealed several issues when removing an mci device: 1) Use-after-free: On 27.11.19 17:07:33, John Garry wrote: > [ 22.104498] BUG: KASAN: use-after-free in > edac_remove_sysfs_mci_device+0x148/0x180 The use-after-free is caused by the mci_for_each_dimm() macro called in edac_remove_sysfs_mci_device(). The iterator was introduced with `c498afaf7d` ("EDAC: Introduce an mci_for_each_dimm() iterator"). The iterator loop calls device_unregister(&dimm->dev), which removes the sysfs entry of the device, but also frees the dimm struct in dimm_attr_release(). When incrementing the loop in mci_for_each_dimm(), the dimm struct is accessed again, after having been freed already. The fix is to free all the mci device's subsequent dimm and csrow objects at a later point, in _edac_mc_free(), when the mci device itself is being freed. This keeps the data structures intact and the mci device can be fully used until its removal. The change allows the safe usage of mci_for_each_dimm() to release dimm devices from sysfs. 2) Memory leaks: Following memory leaks have been detected: # grep edac /sys/kernel/debug/kmemleak \| sort \| uniq -c 1 [<000000003c0f58f9>] edac_mc_alloc+0x3bc/0x9d0 # mci->csrows 16 [<00000000bb932dc0>] edac_mc_alloc+0x49c/0x9d0 # csr->channels 16 [<00000000e2734dba>] edac_mc_alloc+0x518/0x9d0 # csr->channels[chn] 1 [<00000000eb040168>] edac_mc_alloc+0x5c8/0x9d0 # mci->dimms 34 [<00000000ef737c29>] ghes_edac_register+0x1c8/0x3f8 # see edac_mc_alloc() All leaks are from memory allocated by edac_mc_alloc(). Note: The test above shows that edac_mc_alloc() was called here from ghes_edac_register(), thus both functions show up in the stack trace but the module causing the leaks is edac_mc. The comments with the data structures involved were made manually by analyzing the objdump. The data structures listed above and created by edac_mc_alloc() are not properly removed during device removal, which is done in edac_mc_free(). There are two paths implemented to remove the device depending on device registration, _edac_mc_free() is called if the device is not registered and edac_unregister_sysfs() otherwise. The implemenations differ. For the sysfs case, the mci device removal lacks the removal of subsequent data structures (csrows, channels, dimms). This causes the memory leaks (see mci_attr_release()). [ bp: Massage commit message. ] Fixes: `c498afaf7d` ("EDAC: Introduce an mci_for_each_dimm() iterator") Fixes: `faa2ad09c0` ("edac_mc: edac_mc_free() cannot assume mem_ctl_info is registered in sysfs.") Fixes: `7a623c0390` ("edac: rewrite the sysfs code to use struct device") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: John Garry <john.garry@huawei.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200212120340.4764-3-rrichter@marvell.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-02-19 19:53:03 +01:00
Robert Richter	b2e977a973	EDAC/sysfs: Remove csrow objects on errors commit `4d59588c09` upstream. All created csrow objects must be removed in the error path of edac_create_csrow_objects(). The objects have been added as devices. They need to be removed by doing a device_del() and put_device() call to also free their memory. The missing put_device() leaves a memory leak. Use device_unregister() instead of device_del() which properly unregisters the device doing both. Fixes: `7adc05d2dc` ("EDAC/sysfs: Drop device references properly") Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: John Garry <john.garry@huawei.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200212120340.4764-4-rrichter@marvell.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-02-19 19:53:03 +01:00
Robert Richter	f90edcff1e	EDAC/ghes: Fix grain calculation [ Upstream commit `7088e29e04` ] The current code to convert a physical address mask to a grain (defined as granularity in bytes) is: e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK); This is broken in several ways: 1) It calculates to wrong grain values. E.g., a physical address mask of ~0xfff should give a grain of 0x1000. Without considering PAGE_MASK, there is an off-by-one. Things are worse when also filtering it with ~PAGE_MASK. This will calculate to a grain with the upper bits set. In the example it even calculates to ~0. 2) The grain does not depend on and is unrelated to the kernel's page-size. The page-size only matters when unmapping memory in memory_failure(). Smaller grains are wrongly rounded up to the page-size, on architectures with a configurable page-size (e.g. arm64) this could round up to the even bigger page-size of the hypervisor. Fix this with: e->grain = ~mem_err->physical_addr_mask + 1; The grain_bits are defined as: grain = 1 << grain_bits; Change also the grain_bits calculation accordingly, it is the same formula as in edac_mc.c now and the code can be unified. The value in ->physical_addr_mask coming from firmware is assumed to be contiguous, but this is not sanity-checked. However, in case the mask is non-contiguous, a conversion to grain_bits effectively converts the grain bit mask to a power of 2 by rounding it up. Suggested-by: James Morse <james.morse@arm.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-11-rrichter@marvell.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-12-31 16:45:16 +01:00
Yazen Ghannam	36b4080a3f	EDAC/amd64: Set grain per DIMM [ Upstream commit `466503d6b1` ] The following commit introduced a warning on error reports without a non-zero grain value. `3724ace582` ("EDAC/mc: Fix grain_bits calculation") The amd64_edac_mod module does not provide a value, so the warning will be given on the first reported memory error. Set the grain per DIMM to cacheline size (64 bytes). This is the current recommendation. Fixes: `3724ace582` ("EDAC/mc: Fix grain_bits calculation") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191022203448.13962-7-Yazen.Ghannam@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-12-31 16:44:20 +01:00
Robert Richter	e240c7d1f1	EDAC/ghes: Do not warn when incrementing refcount on 0 [ Upstream commit `16214bd9e4` ] The following warning from the refcount framework is seen during ghes initialization: EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT) ------------[ cut here ]------------ refcount_t: increment on 0; use-after-free. WARNING: CPU: 36 PID: 1 at lib/refcount.c:156 refcount_inc_checked [...] Call trace: refcount_inc_checked ghes_edac_register ghes_probe ... It warns if the refcount is incremented from zero. This warning is reasonable as a kernel object is typically created with a refcount of one and freed once the refcount is zero. Afterwards the object would be "used-after-free". For GHES, the refcount is initialized with zero, and that is why this message is seen when initializing the first instance. However, whenever the refcount is zero, the device will be allocated and registered. Since the ghes_reg_mutex protects the refcount and serializes allocation and freeing of ghes devices, a use-after-free cannot happen here. Instead of using refcount_inc() for the first instance, use refcount_set(). This can be used here because the refcount is zero at this point and can not change due to its protection by the mutex. Fixes: `23f61b9fc5` ("EDAC/ghes: Fix locking and memory barrier issues") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: John Garry <john.garry@huawei.com> Cc: <huangming23@huawei.com> Cc: James Morse <james.morse@arm.com> Cc: <linuxarm@huawei.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: <tanxiaofei@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <wanghuiqiang@huawei.com> Link: https://lkml.kernel.org/r/20191121213628.21244-1-rrichter@marvell.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-12-17 19:56:54 +01:00
Meng Li	dc69bd2393	EDAC/altera: Use fast register IO for S10 IRQs commit `56d9e7bd3f` upstream. When an IRQ occurs, regmap_{read,write,...}() is invoked in atomic context. Regmap must indicate register IO is fast so that a spinlock is used instead of a mutex to avoid sleeping in atomic context: lock_acquire __mutex_lock mutex_lock_nested regmap_lock_mutex regmap_write a10_eccmgr_irq_unmask unmask_irq.part.0 irq_enable __irq_startup irq_startup __setup_irq request_threaded_irq devm_request_threaded_irq altr_sdram_probe Mark it so. [ bp: Massage. ] Fixes: `3dab6bd526` ("EDAC, altera: Add support for Stratix10 SDRAM EDAC") Reported-by: Meng Li <Meng.Li@windriver.com> Signed-off-by: Meng Li <Meng.Li@windriver.com> Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: James Morse <james.morse@arm.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: stable <stable@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/1574361048-17572-2-git-send-email-thor.thayer@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-12-17 19:55:51 +01:00
Robert Richter	d2136afb51	EDAC/ghes: Fix locking and memory barrier issues [ Upstream commit `23f61b9fc5` ] The ghes registration and refcount is broken in several ways: * ghes_edac_register() returns with success for a 2nd instance even if a first instance's registration is still running. This is not correct as the first instance may fail later. A subsequent registration may not finish before the first. Parallel registrations must be avoided. * The refcount was increased even if a registration failed. This leads to stale counters preventing the device from being released. * The ghes refcount may not be decremented properly on unregistration. Always decrement the refcount once ghes_edac_unregister() is called to keep the refcount sane. * The ghes_pvt pointer is handed to the irq handler before registration finished. * The mci structure could be freed while the irq handler is running. Fix this by adding a mutex to ghes_edac_register(). This mutex serializes instances to register and unregister. The refcount is only increased if the registration succeeded. This makes sure the refcount is in a consistent state after registering or unregistering a device. Note: A spinlock cannot be used here as the code section may sleep. The ghes_pvt is protected by ghes_lock now. This ensures the pointer is not updated before registration was finished or while the irq handler is running. It is unset before unregistering the device including necessary (implicit) memory barriers making the changes visible to other CPUs. Thus, the device can not be used anymore by an interrupt. Also, rename ghes_init to ghes_refcount for better readability and switch to refcount API. A refcount is needed because there can be multiple GHES structures being defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error Source, "Some platforms may describe multiple Generic Hardware Error Source structures with different notification types, ..."). Another approach to use the mci's device refcount (get_device()) and have a release function does not work here. A release function will be called only for device_release() with the last put_device() call. The device must be deleted before that with device_del(). This is only possible by maintaining an own refcount. [ bp: touchups. ] Fixes: `0fe5f281f7` ("EDAC, ghes: Model a single, logical memory controller") Fixes: `1e72e673b9` ("EDAC/ghes: Fix Use after free in ghes_edac remove path") Co-developed-by: James Morse <james.morse@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Co-developed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191105200732.3053-1-rrichter@marvell.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-12-13 08:43:30 +01:00
James Morse	1e72e673b9	EDAC/ghes: Fix Use after free in ghes_edac remove path ghes_edac models a single logical memory controller, and uses a global ghes_init variable to ensure only the first ghes_edac_register() will do anything. ghes_edac is registered the first time a GHES entry in the HEST is probed. There may be multiple entries, so subsequent attempts to register ghes_edac are silently ignored as the work has already been done. When a GHES entry is unregistered, it calls ghes_edac_unregister(), which free()s the memory behind the global variables in ghes_edac. But there may be multiple GHES entries, the next call to ghes_edac_unregister() will dereference the free()d memory, and attempt to free it a second time. This may also be triggered on a platform with one GHES entry, if the driver is unbound/re-bound and unbound. The re-bind step will do nothing because of ghes_init, the second unbind will then do the same work as the first. Doing the unregister work on the first call is unsafe, as another CPU may be processing a notification in ghes_edac_report_mem_error(), using the memory we are about to free. ghes_init is already half of the reference counting. We only need to do the register work for the first call, and the unregister work for the last. Add the unregister check. This means we no longer free ghes_edac's memory while there are GHES entries that may receive a notification. This was detected by KASAN and DEBUG_TEST_DRIVER_REMOVE. [ bp: merge into a single patch. ] Fixes: `0fe5f281f7` ("EDAC, ghes: Model a single, logical memory controller") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191014171919.85044-2-james.morse@arm.com Link: https://lkml.kernel.org/r/304df85b-8b56-b77e-1a11-aa23769f2e7c@huawei.com	2019-10-17 11:27:05 +02:00
Linus Torvalds	8808cf8cbc	ARM updates for 5.4-rc1: - fix various clang build and cppcheck issues - switch ARM to use new common outgoing-CPU-notification code - add some additional explanation about the boot code - kbuild "make clean" fixes - get rid of another "(____ptrval____)", this time for the VDSO code - avoid treating cache maintenance faults as a write - add a frame pointer unwinder implementation for clang - add EDAC support for Aurora L2 cache - improve robustness of adjust_lowmem_bounds() finding the bounds of lowmem. - add reset control for AMBA primecell devices -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIVAwUAXYZCdvTnkBvkraxkAQK7vQ//UO0XJ1InSLnWzPYuNwJGcCmzHIg6p40A VxnvDTVxZH6UKDhBg8xx+gpPOhwZElGyc0H563p5jgmjzbIesESS5Xy3hUUMkQ9y A6Ta9Nk+NhL+j9O9VtcOk90oQJsLuVyYtHTfk6Wl9xaVLjM1OALWNzCSDqXIPTjF qEhTRahlv9Nc9aisFJAPduf/zQx9ULaZVvDzTo6clXSD7ieSy0MZRiRbcH3MJwiY Q5AbImF49NGcNtlknPh8Gnz/4P3q+bxQDmrzki9d4Fcy2brko845q9Ca5PC+iXro fZHvs8q2+8xz4PuOddBrYebqPIIv+3W6uPlJAPjO0MQrxPTUxRBxqAkYXxwTZBx/ A79AQsbnmUSyOV4EI2lk9USmN/GF2QwGOusRoiA/XMbSVfqnVZWH5mE98dr+2vn+ rUnTq9yvSz2y6QH7+UI+7Q7T8jg4QFBBmPDfCP+yTOWqPb8u070h+VgLBr28g1JL t6VtzOeI4wyl7E/WInmoM/d8SXnjv/1yNzLBcCdvgBV94fUQAV5EP+cDGJ0hv1SJ TGywm8adf3zAa7ZUAOhBoAK3gkNqjJB28ynsH4QmBUmsKkozxoKwwb4jjbGgcoUY rYII4VyoQB/0eX5/i8u69krA+3QNRhehLWC/zM4ZK5lKfFRCnNDvLgiIEM5b59JW nBywRtpyw2I= =Evmc -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm Pull ARM updates from Russell King: - fix various clang build and cppcheck issues - switch ARM to use new common outgoing-CPU-notification code - add some additional explanation about the boot code - kbuild "make clean" fixes - get rid of another "(____ptrval____)", this time for the VDSO code - avoid treating cache maintenance faults as a write - add a frame pointer unwinder implementation for clang - add EDAC support for Aurora L2 cache - improve robustness of adjust_lowmem_bounds() finding the bounds of lowmem. - add reset control for AMBA primecell devices * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: (24 commits) ARM: 8906/1: drivers/amba: add reset control to amba bus probe ARM: 8905/1: Emit __gnu_mcount_nc when using Clang 10.0.0 or newer ARM: 8904/1: skip nomap memblocks while finding the lowmem/highmem boundary ARM: 8903/1: ensure that usable memory in bank 0 starts from a PMD-aligned address ARM: 8891/1: EDAC: armada_xp: Add support for more SoCs ARM: 8888/1: EDAC: Add driver for the Marvell Armada XP SDRAM and L2 cache ECC ARM: 8892/1: EDAC: Add missing debugfs_create_x32 wrapper ARM: 8890/1: l2x0: add marvell,ecc-enable property for aurora ARM: 8889/1: dt-bindings: document marvell,ecc-enable binding ARM: 8886/1: l2x0: support parity-enable/disable on aurora ARM: 8885/1: aurora-l2: add defines for parity and ECC registers ARM: 8887/1: aurora-l2: add prefix to MAX_RANGE_SIZE ARM: 8902/1: l2c: move cache-aurora-l2.h to asm/hardware ARM: 8900/1: UNWINDER_FRAME_POINTER implementation for Clang ARM: 8898/1: mm: Don't treat faults reported from cache maintenance as writes ARM: 8896/1: VDSO: Don't leak kernel addresses ARM: 8895/1: visit mach-* and plat-* directories when cleaning ARM: 8894/1: boot: Replace open-coded nop with macro ARM: 8893/1: boot: Explain the 8 nops ARM: 8876/1: fix O= building with CONFIG_FPE_FASTFPE ...	2019-09-22 09:39:09 -07:00
Linus Torvalds	22331f8952	Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu-feature updates from Ingo Molnar: - Rework the Intel model names symbols/macros, which were decades of ad-hoc extensions and added random noise. It's now a coherent, easy to follow nomenclature. - Add new Intel CPU model IDs: - "Tiger Lake" desktop and mobile models - "Elkhart Lake" model ID - and the "Lightning Mountain" variant of Airmont, plus support code - Add the new AVX512_VP2INTERSECT instruction to cpufeatures - Remove Intel MPX user-visible APIs and the self-tests, because the toolchain (gcc) is not supporting it going forward. This is the first, lowest-risk phase of MPX removal. - Remove X86_FEATURE_MFENCE_RDTSC - Various smaller cleanups and fixes * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) x86/cpu: Update init data for new Airmont CPU model x86/cpu: Add new Airmont variant to Intel family x86/cpu: Add Elkhart Lake to Intel family x86/cpu: Add Tiger Lake to Intel family x86: Correct misc typos x86/intel: Add common OPTDIFFs x86/intel: Aggregate microserver naming x86/intel: Aggregate big core graphics naming x86/intel: Aggregate big core mobile naming x86/intel: Aggregate big core client naming x86/cpufeature: Explain the macro duplication x86/ftrace: Remove mcount() declaration x86/PCI: Remove superfluous returns from void functions x86/msr-index: Move AMD MSRs where they belong x86/cpu: Use constant definitions for CPU models lib: Remove redundant ftrace flag removal x86/crash: Remove unnecessary comparison x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE() x86: Remove X86_FEATURE_MFENCE_RDTSC x86/mpx: Remove MPX APIs ...	2019-09-16 18:47:53 -07:00
Isaac Vaughn	3e443eb353	EDAC/amd64: Add PCI device IDs for family 17h, model 70h Add the new Family 17h Model 70h PCI IDs (device 18h functions 0 and 6) to the AMD64 EDAC module. [ bp: s/f17_base_addr_to_cs_size/f17_addr_mask_to_cs_size/g ] Signed-off-by: Isaac Vaughn <isaac.vaughn@knights.ucf.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: James Morse <james.morse@arm.com> Cc: linux-edac@vger.kernel.org Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190906192131.8ced0ca112146f32d82b6cae@knights.ucf.edu	2019-09-07 07:29:27 +02:00
Robert Richter	e701f41203	EDAC/mc_sysfs: Make debug messages consistent Debug messages are inconsistently used in the error handlers. Some lack an error message, some are called regardless of the return status, messages for the same error are at different locations in the code depending on the error code. This happens esp. near put_device() calls. Make those debug messages more consistent. Additionally, unify the error messages to have the same terms for the same operations of the device. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190902123216.9809-5-rrichter@marvell.com	2019-09-04 11:39:19 +02:00
Robert Richter	644110e17d	EDAC/mc_sysfs: Remove pointless gotos Use direct returns instead of gotos. Error handling code becomes smaller and better readable. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190902123216.9809-4-rrichter@marvell.com	2019-09-03 19:24:10 +02:00
Robert Richter	d55c79ac86	EDAC: Prefer 'unsigned int' to bare use of 'unsigned' Use of 'unsigned int' instead of bare use of 'unsigned'. Fix this for edac_mc*, ghes and the i5100 driver as reported by checkpatch.pl. While at it, struct member dev_ch_attribute->channel is always used as unsigned int. Change type to unsigned int to avoid type casts. [ bp: Massage. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190902123216.9809-2-rrichter@marvell.com	2019-09-03 19:21:19 +02:00
Chris Packham	23d103ae3e	ARM: 8891/1: EDAC: armada_xp: Add support for more SoCs The Armada 38x and other integrated SoCs use a reduced pin count so the width of the SDRAM interface is smaller than the Armada XP SoCs. This means that the definition of "full" and "half" width is reduced from 64/32 to 32/16. Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>	2019-08-29 07:58:01 +01:00
Jan Luebbe	7f6998a412	ARM: 8888/1: EDAC: Add driver for the Marvell Armada XP SDRAM and L2 cache ECC Add support for the ECC functionality as found in the DDR RAM and L2 cache controllers on the MV78230/MV78x60 SoCs. This driver has been tested on the MV78460 (on a custom board with a DDR3 ECC DIMM). [cp use SPDX license] Signed-off-by: Jan Luebbe <jlu@pengutronix.de> Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>	2019-08-29 07:58:01 +01:00
Jan Luebbe	0ecace04a3	ARM: 8892/1: EDAC: Add missing debugfs_create_x32 wrapper We already have wrappers for x8 and x16, so add the missing x32 one. Signed-off-by: Jan Luebbe <jlu@pengutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>	2019-08-29 07:58:01 +01:00
Peter Zijlstra	5ebb34edbe	x86/intel: Aggregate microserver naming Currently big microservers have _XEON_D while small microservers have _X, Make it uniformly: _D. for i in `git grep -l "$INTEL_FAM6_\\|VULNWL_INTEL\\|INTEL_CPU_FAM6$._$X\\|XEON_D$"` do sed -i -e 's/$\(INTEL_FAM6_\\|VULNWL_INTEL\\|INTEL_CPU_FAM6$.ATOM.\)_X/\1_D/g' \ -e 's/$\(INTEL_FAM6_\\|VULNWL_INTEL\\|INTEL_CPU_FAM6$.\)_XEON_D/\1_D/g' ${i} done Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: x86@kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/20190827195122.677152989@infradead.org	2019-08-28 11:29:32 +02:00
Yazen Ghannam	81f5090db8	EDAC/amd64: Support asymmetric dual-rank DIMMs Future AMD systems will support asymmetric dual-rank DIMMs. These are DIMMs where the ranks are of different sizes. The even rank will use the Primary Even Chip Select registers and the odd rank will use the Secondary Odd Chip Select registers. Recognize if a Secondary Odd Chip Select is being used. Use the Secondary Odd Address Mask when calculating the chip select size. [ bp: move csrow_sec_enabled() to the header, fix CS_ODD define and tone-down the capitalized words spelling. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-8-Yazen.Ghannam@amd.com	2019-08-23 16:09:52 +02:00
Yazen Ghannam	7574729e91	EDAC/amd64: Cache secondary Chip Select registers AMD Family 17h systems have a set of secondary Chip Select Base Addresses and Address Masks. These do not represent unique Chip Selects, rather they are used in conjunction with the primary Chip Select registers in certain cases. Cache these secondary Chip Select registers for future use. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-7-Yazen.Ghannam@amd.com	2019-08-23 12:55:05 +02:00
Yazen Ghannam	8a2eaab7da	EDAC/amd64: Decode syndrome before translating address AMD Family 17h systems currently require address translation in order to report the system address of a DRAM ECC error. This is currently done before decoding the syndrome information. The syndrome information does not depend on the address translation, so the proper EDAC csrow/channel reporting can function without the address. However, the syndrome information will not be decoded if the address translation fails. Decode the syndrome information before doing the address translation. The syndrome information is architecturally defined in MCA_SYND and can be considered robust. The address translation is system-specific and may fail on newer systems without proper updates to the translation algorithm. Fixes: `713ad54675` ("EDAC, amd64: Define and register UMC error decode function") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-6-Yazen.Ghannam@amd.com	2019-08-23 07:50:21 +02:00
Yazen Ghannam	e53a3b267f	EDAC/amd64: Find Chip Select memory size using Address Mask Chip Select memory size reporting on AMD Family 17h was recently fixed in order to account for interleaving. However, the current method is not robust. The Chip Select Address Mask can be used to find the memory size. There are a couple of cases. 1) For single-rank and dual-rank non-interleaved, use the address mask plus 1 as the size. 2) For dual-rank interleaved, do #1 but "de-interleave" the address mask first. Always "de-interleave" the address mask in order to simplify the code flow. Bit mask manipulation is necessary to check for interleaving, so just go ahead and do the de-interleaving. In the non-interleaved case, the original and de-interleaved address masks will be the same. To de-interleave the mask, count the number of zero bits in the middle of the mask and swap them with the most significant bits. For example, Original=0xFFFF9FE, De-interleaved=0x3FFFFFE Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-5-Yazen.Ghannam@amd.com	2019-08-23 07:23:53 +02:00
Yazen Ghannam	353a1fcb8f	EDAC/amd64: Initialize DIMM info for systems with more than two channels Currently, the DIMM info for AMD Family 17h systems is initialized in init_csrows(). This function is shared with legacy systems, and it has a limit of two channel support. This prevents initialization of the DIMM info for a number of ranks, so there will be missing ranks in the EDAC sysfs. Create a new init_csrows_df() for Family17h+ and revert init_csrows() back to pre-Family17h support. Loop over all channels in the new function in order to support systems with more than two channels. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-4-Yazen.Ghannam@amd.com	2019-08-23 07:06:29 +02:00
Yazen Ghannam	f8be8e5680	EDAC/amd64: Recognize DRAM device type ECC capability AMD Family 17h systems support x4 and x16 DRAM devices. However, the device type is not checked when setting mci.edac_ctl_cap. Set the appropriate capability flag based on the device type. Default to x8 DRAM device when neither the x4 or x16 bits are set. [ bp: reverse cpk_en check to save an indentation level. ] Fixes: `2d09d8f301` ("EDAC, amd64: Determine EDAC MC capabilities on Fam17h") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-3-Yazen.Ghannam@amd.com	2019-08-23 06:58:31 +02:00
Yazen Ghannam	d971e28e2c	EDAC/amd64: Support more than two controllers for chip selects handling The struct chip_select array that's used for saving chip select bases and masks is fixed at length of two. There should be one struct chip_select for each controller, so this array should be increased to support systems that may have more than two controllers. Increase the size of the struct chip_select array to eight, which is the largest number of controllers per die currently supported on AMD systems. Fix number of DIMMs and Chip Select bases/masks on Family17h, because AMD Family 17h systems support 2 DIMMs, 4 CS bases, and 2 CS masks per channel. Also, carve out the Family 17h+ reading of the bases/masks into a separate function. This effectively reverts the original bases/masks reading code to before Family 17h support was added. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-2-Yazen.Ghannam@amd.com	2019-08-22 19:08:49 +02:00
Robert Richter	718d58514e	EDAC/mc: Cleanup _edac_mc_free() code Remove needless and boilerplate variable declarations. No functional changes. [ bp: Add newlines for better readability. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190624150758.6695-10-rrichter@marvell.com	2019-08-14 18:27:00 +02:00
Stephen Douthit	29a3388bfc	EDAC, pnd2: Fix ioremap() size in dnv_rd_reg() Depending on how BIOS has marked the reserved region containing the 32KB MCHBAR you can get warnings like: resource sanity check: requesting [mem 0xfed10000-0xfed1ffff], which spans more than reserved [mem 0xfed10000-0xfed17fff] caller dnv_rd_reg+0xc8/0x240 [pnd2_edac] mapping multiple BARs Not all of the mmio regions used in dnv_rd_reg() are the same size. The MCHBAR window is 32KB and the sideband ports are 64KB. Pass the correct size to ioremap() depending on which resource we're reading from. Signed-off-by: Stephen Douthit <stephend@silicom-usa.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2019-08-09 10:19:07 -07:00
Shravan Kumar Ramani	82413e562e	EDAC, mellanox: Add ECC support for BlueField DDR4 Add ECC support for Mellanox BlueField SoC DDR controller. This requires SMC to the running Arm Trusted Firmware to report what is the current memory configuration. Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Shravan Kumar Ramani <sramani@mellanox.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>	2019-08-08 12:57:01 -03:00
Dan Carpenter	8faa1cf6ed	EDAC/altera: Use the proper type for the IRQ status bits Smatch complains about the cast of a u32 pointer to unsigned long: drivers/edac/altera_edac.c:1878 altr_edac_a10_irq_handler() warn: passing casted pointer '&irq_status' to 'find_first_bit()' This code wouldn't work on a 64 bit big endian system because it would read past the end of &irq_status. [ bp: massage. ] Fixes: `13ab8448d2` ("EDAC, altera: Add ECC Manager IRQ controller support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thor Thayer <thor.thayer@linux.intel.com> Cc: James Morse <james.morse@arm.com> Cc: kernel-janitors@vger.kernel.org Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190624134717.GA1754@mwanda	2019-08-07 10:37:34 +02:00
Robert Richter	3724ace582	EDAC/mc: Fix grain_bits calculation The grain in EDAC is defined as "minimum granularity for an error report, in bytes". The following calculation of the grain_bits in edac_mc is wrong: grain_bits = fls_long(e->grain) + 1; Where grain_bits is defined as: grain = 1 << grain_bits Example: grain = 8 # 64 bit (8 bytes) grain_bits = fls_long(8) + 1 grain_bits = 4 + 1 = 5 grain = 1 << grain_bits grain = 1 << 5 = 32 Replace it with the correct calculation: grain_bits = fls_long(e->grain - 1); The example gives now: grain_bits = fls_long(8 - 1) grain_bits = fls_long(7) grain_bits = 3 grain = 1 << 3 = 8 Also, check if the hardware reports a reasonable grain != 0 and fallback with a warning to 1 byte granularity otherwise. [ bp: massage a bit. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190624150758.6695-2-rrichter@marvell.com	2019-08-03 12:05:51 +02:00
Thor Thayer	3123c5c4ca	edac: altera: Move Stratix10 SDRAM ECC to peripheral ARM32 SoCFPGAs had separate IRQs for SDRAM. ARM64 SoCFPGAs send all DBEs to SError so filtering by source is necessary. The Stratix10 SDRAM ECC is a better match with the generic Altera peripheral ECC framework because the linked list can be searched to find the ECC block offset and printout the DBE Address. Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com> Acked-by: James Morse <james.morse@arm.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>	2019-07-25 14:28:42 -04:00
Eiichi Tsukata	d8655e7630	EDAC: Fix global-out-of-bounds write when setting edac_mc_poll_msec Commit `9da21b1509` ("EDAC: Poll timeout cannot be zero, p2") assumes edac_mc_poll_msec to be unsigned long, but the type of the variable still remained as int. Setting edac_mc_poll_msec can trigger out-of-bounds write. Reproducer: # echo 1001 > /sys/module/edac_core/parameters/edac_mc_poll_msec KASAN report: BUG: KASAN: global-out-of-bounds in edac_set_poll_msec+0x140/0x150 Write of size 8 at addr ffffffffb91b2d00 by task bash/1996 CPU: 1 PID: 1996 Comm: bash Not tainted 5.2.0-rc6+ #23 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014 Call Trace: dump_stack+0xca/0x13e print_address_description.cold+0x5/0x246 __kasan_report.cold+0x75/0x9a ? edac_set_poll_msec+0x140/0x150 kasan_report+0xe/0x20 edac_set_poll_msec+0x140/0x150 ? dimmdev_location_show+0x30/0x30 ? vfs_lock_file+0xe0/0xe0 ? _raw_spin_lock+0x87/0xe0 param_attr_store+0x1b5/0x310 ? param_array_set+0x4f0/0x4f0 module_attr_store+0x58/0x80 ? module_attr_show+0x80/0x80 sysfs_kf_write+0x13d/0x1a0 kernfs_fop_write+0x2bc/0x460 ? sysfs_kf_bin_read+0x270/0x270 ? kernfs_notify+0x1f0/0x1f0 __vfs_write+0x81/0x100 vfs_write+0x1e1/0x560 ksys_write+0x126/0x250 ? __ia32_sys_read+0xb0/0xb0 ? do_syscall_64+0x1f/0x390 do_syscall_64+0xc1/0x390 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fa7caa5e970 Code: 73 01 c3 48 8b 0d 28 d5 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 99 2d 2c 00 00 75 10 b8 01 00 00 00 04 RSP: 002b:00007fff6acfdfe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fa7caa5e970 RDX: 0000000000000005 RSI: 0000000000e95c08 RDI: 0000000000000001 RBP: 0000000000e95c08 R08: 00007fa7cad1e760 R09: 00007fa7cb36a700 R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000005 R13: 0000000000000001 R14: 00007fa7cad1d600 R15: 0000000000000005 The buggy address belongs to the variable: edac_mc_poll_msec+0x0/0x40 Memory state around the buggy address: ffffffffb91b2c00: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa ffffffffb91b2c80: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa >ffffffffb91b2d00: 04 fa fa fa fa fa fa fa 04 fa fa fa fa fa fa fa ^ ffffffffb91b2d80: 04 fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00 ffffffffb91b2e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Fix it by changing the type of edac_mc_poll_msec to unsigned int. The reason why this patch adopts unsigned int rather than unsigned long is msecs_to_jiffies() assumes arg to be unsigned int. We can avoid integer conversion bugs and unsigned int will be large enough for edac_mc_poll_msec. Reviewed-by: James Morse <james.morse@arm.com> Fixes: `9da21b1509` ("EDAC: Poll timeout cannot be zero, p2") Signed-off-by: Eiichi Tsukata <devel@etsukata.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2019-06-27 10:24:47 -07:00

1 2 3 4 5 ...

1580 Commits (c27a2a1ecf699ed8d77eafa59ae28d81347eac20)