1
0
Fork 0
alistair23-linux/drivers
Zhao Heming 2fb550de75 md/cluster: fix deadlock when node is doing resync job
commit bca5b06580 upstream.

md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
During sending msg, node can concurrently receive msg from another node.
When node does resync job, grab token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) ||
                              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever
```
Explaination:
a. A send METADATA_UPDATED
   This will block another node to send msg

b. B does sync jobs, which will send RESYNCING at intervals.
   This will be block for holding token_lockres:EX lock.

c. B do "mdadm --remove", which will send REMOVE.
   This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

d. B recv METADATA_UPDATED msg, which send from A in step <a>.
   This will be blocked by step <c>: holding mddev lock, it makes
   wait_event can't hold mddev lock. (btw,
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)

There is a similar deadlock in commit 0ba959774e
("md-cluster: use sync way to handle METADATA_UPDATED msg")
In that commit, step c is "update sb". This patch step c is
"mdadm --remove".

For fixing this issue, we can refer the solution of function:
metadata_update_start. Which does the same grab lock_token action.
lock_comm can use the same steps to avoid deadlock. By moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
but it is safe & can break deadlock.

Repro steps (I only triggered 3 times with hundreds tests):

two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

test script will hung when executing "mdadm --remove".

```
 # dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

At last, thanks for Xiao's solution.

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:51:45 +01:00
..
accessibility
acpi ACPI: PNP: compare the string length in the matching_id() 2020-12-30 11:51:32 +01:00
amba
android binder: add flag to clear buffer on txn complete 2020-12-30 11:51:35 +01:00
ata ata: sata_nv: Fix retrieving of active qcs 2020-11-05 11:43:12 +01:00
atm atm: nicstar: Unmap DMA on send error 2020-11-24 13:28:55 +01:00
auxdisplay
base PM: runtime: Resume the device earlier in __device_release_driver() 2020-11-10 12:37:34 +01:00
bcma
block nbd: fix a block_device refcount leak in nbd_release 2020-11-18 19:20:27 +01:00
bluetooth Bluetooth: btmtksdio: Add the missed release_firmware() in mtk_setup_firmware() 2020-12-30 11:51:20 +01:00
bus bus: fsl-mc: fix error return code in fsl_mc_object_allocate() 2020-12-30 11:51:23 +01:00
cdrom
char virtio: virtio_console: fix DMA memory allocation for rproc serial 2020-11-18 19:20:29 +01:00
clk clk: sunxi-ng: Make sure divider tables have sentinel 2020-12-30 11:51:29 +01:00
clocksource clocksource/drivers/arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI 2020-12-30 11:51:19 +01:00
connector
counter
cpufreq cpufreq: scpi: Add missing MODULE_ALIAS 2020-12-30 11:51:20 +01:00
cpuidle cpuidle: Fixup IRQ state 2020-09-09 19:12:21 +02:00
crypto crypto: atmel-i2c - select CONFIG_BITREVERSE 2020-12-30 11:51:24 +01:00
dax dax: Fix alloc_dax_region() compile warning 2020-10-01 13:17:15 +02:00
dca
devfreq PM / devfreq: tegra30: Fix integer overflow on CPU's freq max out 2020-10-01 13:17:14 +02:00
dio
dma dmaengine: mv_xor_v2: Fix error return code in mv_xor_v2_probe() 2020-12-30 11:51:12 +01:00
dma-buf dmabuf: fix NULL pointer dereference in dma_buf_release() 2020-10-01 13:18:24 +02:00
edac EDAC/amd64: Fix PCI component registration 2020-12-30 11:51:36 +01:00
eisa
extcon extcon: max77693: Fix modalias string 2020-12-30 11:51:24 +01:00
firewire
firmware efi: EFI_EARLYCON should depend on EFI 2020-12-02 08:49:53 +01:00
fpga fpga: dfl: fix bug in port reset handshake 2020-07-29 10:18:31 +02:00
fsi
gnss gnss: sirf: fix error return code in sirf_probe() 2020-06-22 09:31:20 +02:00
gpio gpio: eic-sprd: break loop when getting NULL device resource 2020-12-30 11:50:55 +01:00
gpu drm/i915: Fix mismatch between misplaced vma check and vma insert 2020-12-30 11:51:41 +01:00
greybus
hid HID: i2c-hid: add Vero K147 to descriptor override 2020-12-30 11:51:00 +01:00
hsi HSI: omap_ssi: Don't jump to free ID in ssi_add_controller() 2020-12-30 11:51:13 +01:00
hv Drivers: hv: vmbus: Allow cleanup of VMBUS_CONNECT_CPU if disconnected 2020-11-24 13:29:23 +01:00
hwmon hwmon: (ina3221) Fix PM usage counter unbalance in ina3221_write_enable 2020-12-30 11:51:17 +01:00
hwspinlock
hwtracing coresight: etb10: Fix possible NULL ptr dereference in etb_enable_perf() 2020-12-30 11:50:59 +01:00
i2c Revert "i2c: i2c-qcom-geni: Fix DMA transfer race" 2020-12-30 11:51:01 +01:00
i3c i3c: master: Fix error return in cdns_i3c_master_probe() 2020-10-29 09:57:51 +01:00
ide
idle
iio iio:adc:ti-ads124s08: Fix alignment and data leak issues. 2020-12-30 11:51:45 +01:00
infiniband RDMA/cma: Don't overwrite sgid_attr after device is released 2020-12-30 11:51:26 +01:00
input Input: cyapa_gen6 - fix out-of-bounds stack access 2020-12-30 11:51:32 +01:00
interconnect interconnect: qcom: qcs404: Remove GPU and display RPM IDs 2020-12-16 10:56:56 +01:00
iommu iommu/amd: Set DTE[IntTabLen] to represent 512 IRTEs 2020-12-11 13:23:31 +01:00
ipack
irqchip irqchip/alpine-msi: Fix freeing of interrupts on allocation error path 2020-12-30 11:51:25 +01:00
isdn
leds leds: bcm6328, bcm6358: use devres LED registering function 2020-11-05 11:43:24 +01:00
lightnvm lightnvm: fix out-of-bounds write to array devices->info[] 2020-10-29 09:58:00 +01:00
macintosh macintosh/via-macii: Access autopoll_devs when inside lock 2020-08-19 08:16:15 +02:00
mailbox mailbox: avoid timer start from callback 2020-10-29 09:57:53 +01:00
mcb
md md/cluster: fix deadlock when node is doing resync job 2020-12-30 11:51:45 +01:00
media media: ipu3-cio2: Make the field on subdev format V4L2_FIELD_NONE 2020-12-30 11:51:31 +01:00
memory memory: emif: Remove bogus debugfs error handling 2020-11-05 11:43:21 +01:00
memstick memstick: r592: Fix error return in r592_probe() 2020-12-30 11:51:18 +01:00
message scsi: mptfusion: Fix null pointer dereferences in mptscsih_remove() 2020-11-05 11:43:25 +01:00
mfd mfd: sprd: Add wakeup capability for PMIC IRQ 2020-11-18 19:20:26 +01:00
misc habanalabs: put devices before driver removal 2020-12-30 11:50:56 +01:00
mmc mmc: pxamci: Fix error return code in pxamci_probe 2020-12-30 11:51:11 +01:00
mtd mtd: rawnand: meson: fix meson_nfc_dma_buffer_release() arguments 2020-12-30 11:51:43 +01:00
mux
net virtio_net: Fix error code in probe() 2020-12-30 11:51:29 +01:00
nfc nfc: s3fwrn5: Release the nfc firmware 2020-12-30 11:51:26 +01:00
ntb NTB: hw: amd: fix an issue about leak system resources 2020-10-29 09:58:00 +01:00
nubus
nvdimm libnvdimm/label: Return -ENXIO for no slot in __blk_label_update 2020-12-30 11:51:27 +01:00
nvme nvme: free sq/cq dbbuf pointers when dbbuf set fails 2020-12-02 08:49:48 +01:00
nvmem nvmem: core: fix possibly memleak when use nvmem_cell_info_to_nvmem_cell() 2020-10-29 09:57:42 +01:00
of of/address: Fix of_node memory leak in of_dma_is_coherent 2020-11-18 19:20:28 +01:00
opp opp: Reduce the size of critical section in _opp_table_kref_release() 2020-11-18 19:20:21 +01:00
oprofile
parisc parisc: mask out enable and reserved bits from sba imask 2020-08-19 08:16:26 +02:00
parport
pci PM: ACPI: PCI: Drop acpi_pm_set_bridge_wakeup() 2020-12-30 11:51:32 +01:00
pcmcia
perf drivers/perf: thunderx2_pmu: Fix memory resource error handling 2020-10-29 09:57:30 +01:00
phy phy: renesas: rcar-gen3-usb2: disable runtime pm in case of failure 2020-12-30 11:51:19 +01:00
pinctrl pinctrl: falcon: add missing put_device() call in pinctrl_falcon_probe() 2020-12-30 11:51:18 +01:00
platform platform/x86: mlx-platform: remove an unused variable 2020-12-30 11:51:40 +01:00
pnp
power power: supply: bq24190_charger: fix reference leak 2020-12-30 11:51:14 +01:00
powercap powercap: restrict energy meter to root access 2020-11-10 21:13:20 +01:00
pps
ps3 powerpc/ps3: use dma_mapping_error() 2020-12-30 11:51:26 +01:00
ptp
pwm pwm: lp3943: Dynamically allocate PWM chip base 2020-12-30 11:51:28 +01:00
rapidio rapidio: fix the missed put_device() for rio_mport_add_riodev 2020-10-29 09:57:53 +01:00
ras
regulator regulator: workaround self-referent regulators 2020-11-24 13:29:22 +01:00
remoteproc remoteproc: qcom: Fix potential NULL dereference in adsp_init_mmio() 2020-12-30 11:51:24 +01:00
reset
rpmsg rpmsg: glink: Use complete_all for open states 2020-11-05 11:43:20 +01:00
rtc rtc: pcf2127: fix pcf2127_nvmem_read/write() returns 2020-12-30 11:51:02 +01:00
s390 s390/dasd: fix list corruption of lcu list 2020-12-30 11:51:34 +01:00
sbus
scsi scsi: lpfc: Re-fix use after free in lpfc_rq_buf_free() 2020-12-30 11:51:44 +01:00
sfi
sh
siox
slimbus slimbus: qcom-ngd-ctrl: Avoid sending power requests without QMI 2020-12-30 11:51:13 +01:00
soc soc: qcom: smp2p: Safely acquire spinlock without IRQs 2020-12-30 11:51:43 +01:00
soundwire soundwire: bus: disable pm_runtime in sdw_slave_delete 2020-10-01 13:17:36 +02:00
spi spi: atmel-quadspi: Fix AHB memory accesses 2020-12-30 11:51:43 +01:00
spmi
ssb
staging staging: comedi: mf6x4: Fix AI end-of-conversion detection 2020-12-30 11:51:35 +01:00
target scsi: target: iscsi: Fix cmd abort fabric stop race 2020-12-02 08:49:49 +01:00
tc
tee optee: add writeback to valid memory type 2020-12-02 08:49:53 +01:00
thermal thermal: rcar_thermal: Handle probe error gracefully 2020-10-01 13:17:44 +02:00
thunderbolt thunderbolt: Fix use-after-free in remove_unplugged_switch() 2020-12-11 13:23:29 +01:00
tty serial_core: Check for port state when tty is in error state 2020-12-30 11:51:00 +01:00
uio uio: Fix use-after-free in uio_unregister_device() 2020-11-18 19:20:29 +01:00
usb USB: serial: keyspan_pda: fix write unthrottling 2020-12-30 11:51:37 +01:00
vfio vfio/pci/nvlink2: Do not attempt NPU2 setup on POWER8NVL NPU 2020-12-30 11:51:30 +01:00
vhost vhost scsi: fix cmd completion race 2020-12-02 08:49:48 +01:00
video video: fbdev: atmel_lcdfb: fix return error code in atmel_lcdfb_of_init() 2020-12-30 11:51:08 +01:00
virt drivers/virt/fsl_hypervisor: Fix error handling path 2020-10-29 09:57:38 +01:00
virtio virtio_ring: Fix two use after free bugs 2020-12-30 11:51:29 +01:00
visorbus
vlynq
vme
w1 w1: mxc_w1: Fix timeout resolution problem leading to bus error 2020-11-05 11:43:25 +01:00
watchdog watchdog: coh901327: add COMMON_CLK dependency 2020-12-30 11:51:28 +01:00
xen xen/events: block rogue events for some time 2020-11-05 11:43:12 +01:00
zorro
Kconfig
Makefile