1
0
Fork 0
alistair23-linux/block
Ming Lei cecf5d87ff block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
required.

However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This way causes AB-BA lock because the kernfs built-in lock of
'kn-count' is required inside kobject_del() too, see the lockdep warning[1].

On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing REGISTERED flag prevents storing to 'queue/scheduler'
from being happened. Also sysfs write(store) is exclusive, so no
necessary to hold the lock for elv_unregister_queue() when it is
called in switching elevator path.

So split .sysfs_lock into two: one is still named as .sysfs_lock for
covering sync .store, the other one is named as .sysfs_dir_lock
for covering kobjects and related status change.

sysfs itself can handle the race between add/remove kobjects and
showing/storing attributes under kobjects. For switching scheduler
via storing to 'queue/scheduler', we use the queue flag of
QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, then
we can avoid to hold .sysfs_lock during removing/adding kobjects.

[1]  lockdep warning
    ======================================================
    WARNING: possible circular locking dependency detected
    5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
    ------------------------------------------------------
    rmmod/777 is trying to acquire lock:
    00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72

    but task is already holding lock:
    00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->sysfs_lock){+.+.}:
           __lock_acquire+0x95f/0xa2f
           lock_acquire+0x1b4/0x1e8
           __mutex_lock+0x14a/0xa9b
           blk_mq_hw_sysfs_show+0x63/0xb6
           sysfs_kf_seq_show+0x11f/0x196
           seq_read+0x2cd/0x5f2
           vfs_read+0xc7/0x18c
           ksys_read+0xc4/0x13e
           do_syscall_64+0xa7/0x295
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#202){++++}:
           check_prev_add+0x5d2/0xc45
           validate_chain+0xed3/0xf94
           __lock_acquire+0x95f/0xa2f
           lock_acquire+0x1b4/0x1e8
           __kernfs_remove+0x237/0x40b
           kernfs_remove_by_name_ns+0x59/0x72
           remove_files+0x61/0x96
           sysfs_remove_group+0x81/0xa4
           sysfs_remove_groups+0x3b/0x44
           kobject_del+0x44/0x94
           blk_mq_unregister_dev+0x83/0xdd
           blk_unregister_queue+0xa0/0x10b
           del_gendisk+0x259/0x3fa
           null_del_dev+0x8b/0x1c3 [null_blk]
           null_exit+0x5c/0x95 [null_blk]
           __se_sys_delete_module+0x204/0x337
           do_syscall_64+0xa7/0x295
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&q->sysfs_lock);
                                   lock(kn->count#202);
                                   lock(&q->sysfs_lock);
      lock(kn->count#202);

     *** DEADLOCK ***

    2 locks held by rmmod/777:
     #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
     #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    stack backtrace:
    CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
    Call Trace:
     dump_stack+0x9a/0xe6
     check_noncircular+0x207/0x251
     ? print_circular_bug+0x32a/0x32a
     ? find_usage_backwards+0x84/0xb0
     check_prev_add+0x5d2/0xc45
     validate_chain+0xed3/0xf94
     ? check_prev_add+0xc45/0xc45
     ? mark_lock+0x11b/0x804
     ? check_usage_forwards+0x1ca/0x1ca
     __lock_acquire+0x95f/0xa2f
     lock_acquire+0x1b4/0x1e8
     ? kernfs_remove_by_name_ns+0x59/0x72
     __kernfs_remove+0x237/0x40b
     ? kernfs_remove_by_name_ns+0x59/0x72
     ? kernfs_next_descendant_post+0x7d/0x7d
     ? strlen+0x10/0x23
     ? strcmp+0x22/0x44
     kernfs_remove_by_name_ns+0x59/0x72
     remove_files+0x61/0x96
     sysfs_remove_group+0x81/0xa4
     sysfs_remove_groups+0x3b/0x44
     kobject_del+0x44/0x94
     blk_mq_unregister_dev+0x83/0xdd
     blk_unregister_queue+0xa0/0x10b
     del_gendisk+0x259/0x3fa
     ? disk_events_poll_msecs_store+0x12b/0x12b
     ? check_flags+0x1ea/0x204
     ? mark_held_locks+0x1f/0x7a
     null_del_dev+0x8b/0x1c3 [null_blk]
     null_exit+0x5c/0x95 [null_blk]
     __se_sys_delete_module+0x204/0x337
     ? free_module+0x39f/0x39f
     ? blkcg_maybe_throttle_current+0x8a/0x718
     ? rwlock_bug+0x62/0x62
     ? __blkcg_punt_bio_submit+0xd0/0xd0
     ? trace_hardirqs_on_thunk+0x1a/0x20
     ? mark_held_locks+0x1f/0x7a
     ? do_syscall_64+0x4c/0x295
     do_syscall_64+0xa7/0x295
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fb696cdbe6b
    Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
    RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
    RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
    R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
    R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 10:40:20 -06:00
..
partitions docs: admin-guide: add a series of orphaned documents 2019-07-15 11:03:02 -03:00
Kconfig docs: cgroup-v1: add it to the admin-guide book 2019-07-15 11:03:02 -03:00
Kconfig.iosched docs: block: convert to ReST 2019-07-15 09:20:27 -03:00
Makefile block: remove legacy IO schedulers 2018-11-07 13:42:32 -07:00
badblocks.c block: switch all files cleared marked as GPLv2 to SPDX tags 2019-04-30 16:11:57 -06:00
bfq-cgroup.c block: rename CONFIG_DEBUG_BLK_CGROUP to CONFIG_BFQ_CGROUP_DEBUG 2019-06-20 10:32:35 -06:00
bfq-iosched.c for-linus-20190726 2019-07-26 10:32:12 -07:00
bfq-iosched.h block, bfq: detect wakers and unconditionally inject their I/O 2019-06-25 09:07:34 -06:00
bfq-wf2q.c block: switch all files cleared marked as GPLv2 or later to SPDX tags 2019-04-30 16:11:59 -06:00
bio-integrity.c block/bio-integrity: fix a memory leak bug 2019-07-11 20:01:21 -06:00
bio.c block: move same page handling from __bio_add_pc_page to the callers 2019-08-22 07:14:39 -06:00
blk-cgroup.c blkcg: allow blkcg_policy->pd_stat() to print non-debug info too 2019-07-16 10:06:39 -06:00
blk-core.c block: split .sysfs_lock into two locks 2019-08-27 10:40:20 -06:00
blk-exec.c block: add SPDX tags to block layer files missing licensing information 2019-04-30 16:12:03 -06:00
blk-flush.c block: switch all files cleared marked as GPLv2 to SPDX tags 2019-04-30 16:11:57 -06:00
blk-integrity.c docs: block: convert to ReST 2019-07-15 09:20:27 -03:00
blk-ioc.c block: remove the queue_lock indirection 2018-11-15 12:17:28 -07:00
blk-iolatency.c blkcg: allow blkcg_policy->pd_stat() to print non-debug info too 2019-07-16 10:06:39 -06:00
blk-lib.c block: fix 32 bit overflow in __blkdev_issue_discard() 2018-11-14 08:17:18 -07:00
blk-map.c block: remove the bi_phys_segments field in struct bio 2019-06-20 10:29:22 -06:00
blk-merge.c block: Improve physical block alignment of split bios 2019-08-04 21:41:29 -06:00
blk-mq-cpumap.c blk-mq: balance mapping between present CPUs and queues 2019-08-04 21:43:12 -06:00
blk-mq-debugfs-zoned.c block: Cleanup license notice 2019-01-17 21:21:40 -07:00
blk-mq-debugfs.c for-5.3/block-20190708 2019-07-09 10:45:06 -07:00
blk-mq-debugfs.h blk-mq: no need to check return value of debugfs_create functions 2019-06-13 03:00:30 -06:00
blk-mq-pci.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-rdma.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq-sched.c blk-mq: remove blk_mq_put_ctx() 2019-07-02 21:03:27 -06:00
blk-mq-sched.h block: blk-mq: Remove blk_mq_sched_started_request and started_request 2019-07-23 07:25:09 -06:00
blk-mq-sysfs.c block: split .sysfs_lock into two locks 2019-08-27 10:40:20 -06:00
blk-mq-tag.c blk-mq: introduce blk_mq_tagset_wait_completed_request() 2019-08-04 21:41:29 -06:00
blk-mq-tag.h Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block 2017-11-14 15:32:19 -08:00
blk-mq-virtio.c block: Fix blk_mq_*_map_queues() kernel-doc headers 2019-05-31 15:12:34 -06:00
blk-mq.c blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue 2019-08-27 10:40:20 -06:00
blk-mq.h block: Disable write plugging for zoned block devices 2019-07-10 14:18:01 -06:00
blk-pm.c block: remove the queue_lock indirection 2018-11-15 12:17:28 -07:00
blk-pm.h block: remove the queue_lock indirection 2018-11-15 12:17:28 -07:00
blk-rq-qos.c rq-qos: use a mb for got_token 2019-07-18 10:20:14 -06:00
blk-rq-qos.h block: add SPDX tags to block layer files missing licensing information 2019-04-30 16:12:03 -06:00
blk-settings.c block: fix max segment size handling in blk_queue_virt_boundary 2019-07-26 12:50:57 -06:00
blk-softirq.c block: remove a few unused exports 2018-11-15 12:13:25 -07:00
blk-stat.c block: add SPDX tags to block layer files missing licensing information 2019-04-30 16:12:03 -06:00
blk-stat.h block: deactivate blk_stat timer in wbt_disable_default() 2018-12-12 06:47:51 -07:00
blk-sysfs.c block: split .sysfs_lock into two locks 2019-08-27 10:40:20 -06:00
blk-throttle.c blk-throttle: fix zero wait time for iops throttled group 2019-07-10 09:00:57 -06:00
blk-timeout.c block: add SPDX tags to block layer files missing licensing information 2019-04-30 16:12:03 -06:00
blk-wbt.c block: add helper for checking if queue is registered 2019-08-27 10:40:20 -06:00
blk-wbt.h block: remove external dependency on wbt_flags 2018-07-09 09:07:54 -06:00
blk-zoned.c blk-zoned: implement REQ_OP_ZONE_RESET_ALL 2019-08-04 21:41:29 -06:00
blk.h block: split .sysfs_lock into two locks 2019-08-27 10:40:20 -06:00
bounce.c block: remove the i argument to bio_for_each_segment_all 2019-04-30 09:26:13 -06:00
bsg-lib.c block: Fix bsg_setup_queue() kernel-doc header 2019-05-31 15:12:34 -06:00
bsg.c block: switch all files cleared marked as GPLv2 to SPDX tags 2019-04-30 16:11:57 -06:00
cmdline-parser.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
compat_ioctl.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
elevator.c block: split .sysfs_lock into two locks 2019-08-27 10:40:20 -06:00
genhd.c block: fix sysfs module parameters directory path in comment 2019-07-16 10:16:33 -06:00
ioctl.c block: add SPDX tags to block layer files missing licensing information 2019-04-30 16:12:03 -06:00
ioprio.c docs: block: convert to ReST 2019-07-15 09:20:27 -03:00
kyber-iosched.c blk-mq: remove blk_mq_put_ctx() 2019-07-02 21:03:27 -06:00
mq-deadline.c docs: block: convert to ReST 2019-07-15 09:20:27 -03:00
opal_proto.h block: sed-opal: Removed duplicate OPAL_METHOD_LENGTH definition 2019-08-20 09:34:49 -06:00
partition-generic.c block: fix use-after-free on gendisk 2019-04-22 09:48:12 -06:00
scsi_ioctl.c block: switch all files cleared marked as GPLv2 to SPDX tags 2019-04-30 16:11:57 -06:00
sed-opal.c block: sed-opal: Remove always false conditional statement 2019-08-20 09:33:21 -06:00
t10-pi.c block: switch all files cleared marked as GPLv2 to SPDX tags 2019-04-30 16:11:57 -06:00