1
0
Fork 0
alistair23-linux/drivers/md
Zhao Heming 2fb550de75 md/cluster: fix deadlock when node is doing resync job
commit bca5b06580 upstream.

md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
During sending msg, node can concurrently receive msg from another node.
When node does resync job, grab token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) ||
                              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever
```
Explaination:
a. A send METADATA_UPDATED
   This will block another node to send msg

b. B does sync jobs, which will send RESYNCING at intervals.
   This will be block for holding token_lockres:EX lock.

c. B do "mdadm --remove", which will send REMOVE.
   This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

d. B recv METADATA_UPDATED msg, which send from A in step <a>.
   This will be blocked by step <c>: holding mddev lock, it makes
   wait_event can't hold mddev lock. (btw,
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)

There is a similar deadlock in commit 0ba959774e
("md-cluster: use sync way to handle METADATA_UPDATED msg")
In that commit, step c is "update sb". This patch step c is
"mdadm --remove".

For fixing this issue, we can refer the solution of function:
metadata_update_start. Which does the same grab lock_token action.
lock_comm can use the same steps to avoid deadlock. By moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
but it is safe & can break deadlock.

Repro steps (I only triggered 3 times with hundreds tests):

two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

test script will hung when executing "mdadm --remove".

```
 # dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

At last, thanks for Xiao's solution.

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:51:45 +01:00
..
bcache bcache: fix a lost wake-up problem caused by mca_cannibalize_lock 2020-10-01 13:17:18 +02:00
persistent-data dm thin metadata: Fix use-after-free in dm_bm_set_read_only 2020-09-09 19:12:36 +02:00
Kconfig dm: add clone target 2019-09-12 09:32:31 -04:00
Makefile dm: add clone target 2019-09-12 09:32:31 -04:00
dm-bio-prison-v1.c dm: adjust structure members to improve alignment 2018-06-08 11:53:14 -04:00
dm-bio-prison-v1.h block: switch bios to blk_status_t 2017-06-09 09:27:32 -06:00
dm-bio-prison-v2.c dm: adjust structure members to improve alignment 2018-06-08 11:53:14 -04:00
dm-bio-prison-v2.h
dm-bio-record.h dm bio record: save/restore bi_end_io and bi_integrity 2020-03-25 08:25:48 +01:00
dm-bufio.c dm bufio: introduce a global cache replacement 2019-09-13 17:00:21 -04:00
dm-builtin.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
dm-cache-background-tracker.c dm cache background tracker: fix sparse warning 2018-04-30 15:40:40 -04:00
dm-cache-background-tracker.h
dm-cache-block-types.h
dm-cache-metadata.c dm cache metadata: Avoid returning cmd->bm wild pointer on error 2020-09-09 19:12:35 +02:00
dm-cache-metadata.h
dm-cache-policy-internal.h
dm-cache-policy-smq.c dm: remove unnecessary unlikely() around WARN_ON_ONCE() 2018-10-16 14:34:59 -04:00
dm-cache-policy.c
dm-cache-policy.h
dm-cache-target.c dm cache: fix a crash due to incorrect work item cancelling 2020-03-12 13:00:23 +01:00
dm-clone-metadata.c dm clone: Fix handling of partial region discards 2020-04-17 10:50:24 +02:00
dm-clone-metadata.h dm clone: replace spin_lock_irqsave with spin_lock_irq 2020-04-17 10:50:23 +02:00
dm-clone-target.c dm clone: Add missing casts to prevent overflows and data corruption 2020-04-17 10:50:24 +02:00
dm-core.h dm: disable DISCARD if the underlying storage no longer supports it 2019-04-04 15:33:59 -04:00
dm-crypt.c dm crypt: Initialize crypto wait structures 2020-09-09 19:12:35 +02:00
dm-delay.c dm delay: fix a crash when invalid device is specified 2019-04-26 11:29:32 -04:00
dm-dust.c dm dust: use dust block size for badblocklist index 2019-08-21 11:27:17 -04:00
dm-era-target.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
dm-exception-store.c
dm-exception-store.h - Improve DM snapshot target's scalability by using finer grained 2019-05-16 15:55:48 -07:00
dm-flakey.c block: Kill gfp_t argument of blkdev_report_zones() 2019-07-11 20:04:37 -06:00
dm-init.c docs: device-mapper: move it to the admin-guide 2019-07-15 11:03:01 -03:00
dm-integrity.c dm integrity: fix error reporting in bitmap mode after creation 2020-09-09 19:12:35 +02:00
dm-io.c dm: Use kzalloc for all structs with embedded biosets/mempools 2018-06-05 08:47:43 -06:00
dm-ioctl.c dm ioctl: fix error return code in target_message 2020-12-30 11:51:18 +01:00
dm-kcopyd.c dm kcopyd: always complete failed jobs 2019-08-15 15:57:39 -04:00
dm-linear.c block: Kill gfp_t argument of blkdev_report_zones() 2019-07-11 20:04:37 -06:00
dm-log-userspace-base.c dm: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
dm-log-userspace-transfer.c
dm-log-userspace-transfer.h
dm-log-writes.c dm log writes: fix incorrect comment about the logged sequence example 2019-07-09 14:13:33 -04:00
dm-log.c
dm-mpath.c dm mpath: fix racey management of PG initialization 2020-09-09 19:12:35 +02:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h
dm-queue-length.c dm mpath selector: more evenly distribute ties 2018-01-29 13:44:58 -05:00
dm-raid.c dm raid: fix updating of max_discard_sectors limit 2019-09-11 16:18:23 -04:00
dm-raid1.c dm raid1: use struct_size() with kzalloc() 2019-08-26 11:05:32 -04:00
dm-region-hash.c - Error path bug fix for overflow tests (Dan) 2018-06-12 18:28:00 -07:00
dm-round-robin.c
dm-rq.c dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue() 2020-08-21 13:05:33 +02:00
dm-rq.h dm: remove unused _rq_tio_cache and _rq_cache 2019-03-05 14:48:50 -05:00
dm-service-time.c dm mpath selector: more evenly distribute ties 2018-01-29 13:44:58 -05:00
dm-snap-persistent.c block: fix an integer overflow in logical block size 2020-01-23 08:22:32 +01:00
dm-snap-transient.c
dm-snap.c dm snapshot: rework COW throttling to fix deadlock 2019-10-10 09:46:05 -04:00
dm-stats.c dm stats: use struct_size() helper 2019-09-04 09:39:22 -04:00
dm-stats.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
dm-stripe.c dax: Introduce a ->copy_to_iter dax operation 2018-05-22 23:18:31 -07:00
dm-switch.c dm switch: use struct_size() in kzalloc() 2019-03-05 14:48:51 -05:00
dm-sysfs.c dm: remove legacy request-based IO path 2018-10-11 11:36:09 -04:00
dm-table.c dm table: Remove BUG_ON(in_interrupt()) 2020-12-30 11:50:57 +01:00
dm-target.c dm mpath: fix missing call of path selector type->end_io 2019-04-25 15:38:52 -04:00
dm-thin-metadata.c dm thin metadata: Fix use-after-free in dm_bm_set_read_only 2020-09-09 19:12:36 +02:00
dm-thin-metadata.h dm thin metadata: Add support for a pre-commit callback 2019-12-21 11:05:01 +01:00
dm-thin.c dm thin: don't allow changing data device during thin-pool reload 2020-02-24 08:36:49 +01:00
dm-uevent.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156 2019-05-30 11:26:35 -07:00
dm-uevent.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156 2019-05-30 11:26:35 -07:00
dm-unstripe.c dm: Check for device sector overflow if CONFIG_LBDAF is not set 2018-12-18 09:02:26 -05:00
dm-verity-fec.c dm verity fec: fix hash block number in verity_fec_decode 2020-05-06 08:15:10 +02:00
dm-verity-fec.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
dm-verity-target.c dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-verity-verify-sig.c dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-verity-verify-sig.h dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-verity.h dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-writecache.c dm writecache: remove BUG() and fail gracefully instead 2020-12-11 13:23:33 +01:00
dm-zero.c dm: don't return errnos from ->map 2017-06-09 09:27:32 -06:00
dm-zoned-metadata.c dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zone 2020-06-24 17:50:31 +02:00
dm-zoned-reclaim.c dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zone 2020-06-24 17:50:31 +02:00
dm-zoned-target.c dm zoned: assign max_io_len correctly 2020-07-09 09:37:57 +02:00
dm-zoned.h dm zoned: reduce overhead of backing device checks 2019-12-17 19:56:12 +01:00
dm.c dm: remove invalid sparse __acquires and __releases annotations 2020-12-11 13:23:30 +01:00
dm.h dm: make dm_table_find_target return NULL 2019-08-23 10:13:12 -04:00
md-bitmap.c md/bitmap: md_bitmap_get_counter returns wrong blocks 2020-11-05 11:43:20 +01:00
md-bitmap.h md: Avoid namespace collision with bitmap API 2018-08-01 15:49:39 -07:00
md-cluster.c md/cluster: fix deadlock when node is doing resync job 2020-12-30 11:51:45 +01:00
md-cluster.h md-cluster: introduce resync_info_get interface for sanity check 2018-10-18 09:36:35 -07:00
md-faulty.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47 2019-05-24 17:27:13 +02:00
md-linear.c md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
md-linear.h Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-11-14 16:07:26 -08:00
md-multipath.c md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
md-multipath.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
md.c md/cluster: fix deadlock when node is doing resync job 2020-12-30 11:51:45 +01:00
md.h md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
raid0.c block: fix an integer overflow in logical block size 2020-01-23 08:22:32 +01:00
raid0.h md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-13 13:10:05 -07:00
raid1-10.c md: raid1-10: Unify r{1,10}bio_pool_free 2019-06-15 01:37:35 -06:00
raid1.c md: raid1: check rdev before reference in raid1_sync_request func 2020-01-09 10:19:48 +01:00
raid1.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
raid5-cache.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288 2019-06-05 17:36:37 +02:00
raid5-log.h raid5: set write hint for PPL 2019-03-12 10:15:18 -07:00
raid5-ppl.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288 2019-06-05 17:36:37 +02:00
raid5.c md/raid5: fix oops during stripe resizing 2020-11-05 11:43:22 +01:00
raid5.h raid5: use bio_end_sector in r5_next_bio 2019-09-13 13:14:43 -07:00
raid10.c md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
raid10.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00