1
0
Fork 0
alistair23-linux/drivers/md
Coly Li 837c36e045 bcache: avoid unnecessary btree nodes flushing in btree_flush_write()
commit 2aa8c52938 upstream.

the commit 91be66e131 ("bcache: performance improvement for
btree_flush_write()") was an effort to flushing btree node with oldest
btree node faster in following methods,
- Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot
  of clean btree nodes.
- Take c->btree_cache as a LRU-like list, aggressively flushing all
  dirty nodes from tail of c->btree_cache util the btree node with
  oldest journal entry is flushed. This is to reduce the time of holding
  c->bucket_lock.

Guoju Fang and Shuang Li reported that they observe unexptected extra
write I/Os on cache device after applying the above patch. Guoju Fang
provideed more detailed diagnose information that the aggressive
btree nodes flushing may cause 10x more btree nodes to flush in his
workload. He points out when system memory is large enough to hold all
btree nodes in memory, c->btree_cache is not a LRU-like list any more.
Then the btree node with oldest journal entry is very probably not-
close to the tail of c->btree_cache list. In such situation much more
dirty btree nodes will be aggressively flushed before the target node
is flushed. When slow SATA SSD is used as cache device, such over-
aggressive flushing behavior will cause performance regression.

After spending a lot of time on debug and diagnose, I find the real
condition is more complicated, aggressive flushing dirty btree nodes
from tail of c->btree_cache list is not a good solution.
- When all btree nodes are cached in memory, c->btree_cache is not
  a LRU-like list, the btree nodes with oldest journal entry won't
  be close to the tail of the list.
- There can be hundreds dirty btree nodes reference the oldest journal
  entry, before flushing all the nodes the oldest journal entry cannot
  be reclaimed.
When the above two conditions mixed together, a simply flushing from
tail of c->btree_cache list is really NOT a good idea.

Fortunately there is still chance to make btree_flush_write() work
better. Here is how this patch avoids unnecessary btree nodes flushing,
- Only acquire c->journal.lock when getting oldest journal entry of
  fifo c->journal.pin. In rested locations check the journal entries
  locklessly, so their values can be changed on other cores
  in parallel.
- In loop list_for_each_entry_safe_reverse(), checking latest front
  point of fifo c->journal.pin. If it is different from the original
  point which we get with locking c->journal.lock, it means the oldest
  journal entry is reclaim on other cores. At this moment, all selected
  dirty nodes recorded in array btree_nodes[] are all flushed and clean
  on other CPU cores, it is unncessary to iterate c->btree_cache any
  longer. Just quit the list_for_each_entry_safe_reverse() loop and
  the following for-loop will skip all the selected clean nodes.
- Find a proper time to quit the list_for_each_entry_safe_reverse()
  loop. Check the refcount value of orignial fifo front point, if the
  value is larger than selected node number of btree_nodes[], it means
  more matching btree nodes should be scanned. Otherwise it means no
  more matching btee nodes in rest of c->btree_cache list, the loop
  can be quit. If the original oldest journal entry is reclaimed and
  fifo front point is updated, the refcount of original fifo front point
  will be 0, then the loop will be quit too.
- Not hold c->bucket_lock too long time. c->bucket_lock is also required
  for space allocation for cached data, hold it for too long time will
  block regular I/O requests. When iterating list c->btree_cache, even
  there are a lot of maching btree nodes, in order to not holding
  c->bucket_lock for too long time, only BTREE_FLUSH_NR nodes are
  selected and to flush in following for-loop.
With this patch, only btree nodes referencing oldest journal entry
are flushed to cache device, no aggressive flushing for  unnecessary
btree node any more. And in order to avoid blocking regluar I/O
requests, each time when btree_flush_write() called, at most only
BTREE_FLUSH_NR btree nodes are selected to flush, even there are more
maching btree nodes in list c->btree_cache.

At last, one more thing to explain: Why it is safe to read front point
of c->journal.pin without holding c->journal.lock inside the
list_for_each_entry_safe_reverse() loop ?

Here is my answer: When reading the front point of fifo c->journal.pin,
we don't need to know the exact value of front point, we just want to
check whether the value is different from the original front point
(which is accurate value because we get it while c->jouranl.lock is
held). For such purpose, it works as expected without holding
c->journal.lock. Even the front point is changed on other CPU core and
not updated to local core, and current iterating btree node has
identical journal entry local as original fetched fifo front point, it
is still safe. Because after holding mutex b->write_lock (with memory
barrier) this btree node can be found as clean and skipped, the loop
will quite latter when iterate on next node of list c->btree_cache.

Fixes: 91be66e131 ("bcache: performance improvement for btree_flush_write()")
Reported-by: Guoju Fang <fangguoju@gmail.com>
Reported-by: Shuang Li <psymon@bonuscloud.io>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14 16:34:19 -05:00
..
bcache bcache: avoid unnecessary btree nodes flushing in btree_flush_write() 2020-02-14 16:34:19 -05:00
persistent-data dm space map common: fix to ensure new block isn't already in use 2020-02-11 04:35:26 -08:00
Kconfig dm: add clone target 2019-09-12 09:32:31 -04:00
Makefile dm: add clone target 2019-09-12 09:32:31 -04:00
dm-bio-prison-v1.c dm: adjust structure members to improve alignment 2018-06-08 11:53:14 -04:00
dm-bio-prison-v1.h block: switch bios to blk_status_t 2017-06-09 09:27:32 -06:00
dm-bio-prison-v2.c dm: adjust structure members to improve alignment 2018-06-08 11:53:14 -04:00
dm-bio-prison-v2.h
dm-bio-record.h block: replace bi_bdev with a gendisk pointer and partitions index 2017-08-23 12:49:55 -06:00
dm-bufio.c dm bufio: introduce a global cache replacement 2019-09-13 17:00:21 -04:00
dm-builtin.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
dm-cache-background-tracker.c dm cache background tracker: fix sparse warning 2018-04-30 15:40:40 -04:00
dm-cache-background-tracker.h dm cache: significant rework to leverage dm-bio-prison-v2 2017-03-07 13:28:31 -05:00
dm-cache-block-types.h
dm-cache-metadata.c dm cache metadata: Fix loading discard bitset 2019-04-18 16:18:25 -04:00
dm-cache-metadata.h dm cache: significant rework to leverage dm-bio-prison-v2 2017-03-07 13:28:31 -05:00
dm-cache-policy-internal.h dm cache: significant rework to leverage dm-bio-prison-v2 2017-03-07 13:28:31 -05:00
dm-cache-policy-smq.c dm: remove unnecessary unlikely() around WARN_ON_ONCE() 2018-10-16 14:34:59 -04:00
dm-cache-policy.c
dm-cache-policy.h dm cache: significant rework to leverage dm-bio-prison-v2 2017-03-07 13:28:31 -05:00
dm-cache-target.c dm cache: fix bugs when a GFP_NOWAIT allocation fails 2019-10-17 11:13:50 -04:00
dm-clone-metadata.c dm clone metadata: Use a two phase commit 2019-12-21 11:05:00 +01:00
dm-clone-metadata.h dm clone metadata: Use a two phase commit 2019-12-21 11:05:00 +01:00
dm-clone-target.c dm clone: Flush destination device before committing metadata 2019-12-21 11:05:01 +01:00
dm-core.h dm: disable DISCARD if the underlying storage no longer supports it 2019-04-04 15:33:59 -04:00
dm-crypt.c dm crypt: fix benbi IV constructor crash if used in authenticated mode 2020-02-11 04:35:26 -08:00
dm-delay.c dm delay: fix a crash when invalid device is specified 2019-04-26 11:29:32 -04:00
dm-dust.c dm dust: use dust block size for badblocklist index 2019-08-21 11:27:17 -04:00
dm-era-target.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
dm-exception-store.c
dm-exception-store.h - Improve DM snapshot target's scalability by using finer grained 2019-05-16 15:55:48 -07:00
dm-flakey.c block: Kill gfp_t argument of blkdev_report_zones() 2019-07-11 20:04:37 -06:00
dm-init.c docs: device-mapper: move it to the admin-guide 2019-07-15 11:03:01 -03:00
dm-integrity.c block: centralize PI remapping logic to the block layer 2019-09-17 20:03:49 -06:00
dm-io.c dm: Use kzalloc for all structs with embedded biosets/mempools 2018-06-05 08:47:43 -06:00
dm-ioctl.c dm: introduce DM_GET_TARGET_VERSION 2019-09-16 10:18:01 -04:00
dm-kcopyd.c dm kcopyd: always complete failed jobs 2019-08-15 15:57:39 -04:00
dm-linear.c block: Kill gfp_t argument of blkdev_report_zones() 2019-07-11 20:04:37 -06:00
dm-log-userspace-base.c dm: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
dm-log-userspace-transfer.c
dm-log-userspace-transfer.h
dm-log-writes.c dm log writes: fix incorrect comment about the logged sequence example 2019-07-09 14:13:33 -04:00
dm-log.c
dm-mpath.c dm mpath: remove harmful bio-based optimization 2019-12-21 11:04:56 +01:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h
dm-queue-length.c dm mpath selector: more evenly distribute ties 2018-01-29 13:44:58 -05:00
dm-raid.c dm raid: fix updating of max_discard_sectors limit 2019-09-11 16:18:23 -04:00
dm-raid1.c dm raid1: use struct_size() with kzalloc() 2019-08-26 11:05:32 -04:00
dm-region-hash.c - Error path bug fix for overflow tests (Dan) 2018-06-12 18:28:00 -07:00
dm-round-robin.c
dm-rq.c block: Delay default elevator initialization 2019-09-05 19:52:34 -06:00
dm-rq.h dm: remove unused _rq_tio_cache and _rq_cache 2019-03-05 14:48:50 -05:00
dm-service-time.c dm mpath selector: more evenly distribute ties 2018-01-29 13:44:58 -05:00
dm-snap-persistent.c block: fix an integer overflow in logical block size 2020-01-23 08:22:32 +01:00
dm-snap-transient.c
dm-snap.c dm snapshot: rework COW throttling to fix deadlock 2019-10-10 09:46:05 -04:00
dm-stats.c dm stats: use struct_size() helper 2019-09-04 09:39:22 -04:00
dm-stats.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
dm-stripe.c dax: Introduce a ->copy_to_iter dax operation 2018-05-22 23:18:31 -07:00
dm-switch.c dm switch: use struct_size() in kzalloc() 2019-03-05 14:48:51 -05:00
dm-sysfs.c dm: remove legacy request-based IO path 2018-10-11 11:36:09 -04:00
dm-table.c dm: make dm_table_find_target return NULL 2019-08-23 10:13:12 -04:00
dm-target.c dm mpath: fix missing call of path selector type->end_io 2019-04-25 15:38:52 -04:00
dm-thin-metadata.c dm thin metadata: use pool locking at end of dm_pool_metadata_close 2020-02-11 04:35:26 -08:00
dm-thin-metadata.h dm thin metadata: Add support for a pre-commit callback 2019-12-21 11:05:01 +01:00
dm-thin.c dm thin: fix use-after-free in metadata_pre_commit_callback 2020-02-05 21:22:53 +00:00
dm-uevent.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156 2019-05-30 11:26:35 -07:00
dm-uevent.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156 2019-05-30 11:26:35 -07:00
dm-unstripe.c dm: Check for device sector overflow if CONFIG_LBDAF is not set 2018-12-18 09:02:26 -05:00
dm-verity-fec.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
dm-verity-fec.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
dm-verity-target.c dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-verity-verify-sig.c dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-verity-verify-sig.h dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-verity.h dm verity: add root hash pkcs#7 signature verification 2019-08-23 10:13:14 -04:00
dm-writecache.c dm writecache: fix incorrect flush sequence when doing SSD mode commit 2020-02-11 04:35:26 -08:00
dm-zero.c dm: don't return errnos from ->map 2017-06-09 09:27:32 -06:00
dm-zoned-metadata.c dm zoned: support zone sizes smaller than 128MiB 2020-02-11 04:35:26 -08:00
dm-zoned-reclaim.c dm zoned: reduce overhead of backing device checks 2019-12-17 19:56:12 +01:00
dm-zoned-target.c dm zoned: reduce overhead of backing device checks 2019-12-17 19:56:12 +01:00
dm-zoned.h dm zoned: reduce overhead of backing device checks 2019-12-17 19:56:12 +01:00
dm.c dm: fix potential for q->make_request_fn NULL pointer 2020-02-11 04:35:27 -08:00
dm.h dm: make dm_table_find_target return NULL 2019-08-23 10:13:12 -04:00
md-bitmap.c md/bitmap: avoid race window between md_bitmap_resize and bitmap_file_clear_bit 2019-12-31 16:44:20 +01:00
md-bitmap.h md: Avoid namespace collision with bitmap API 2018-08-01 15:49:39 -07:00
md-cluster.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 45 2019-05-24 17:27:12 +02:00
md-cluster.h md-cluster: introduce resync_info_get interface for sanity check 2018-10-18 09:36:35 -07:00
md-faulty.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47 2019-05-24 17:27:13 +02:00
md-linear.c md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
md-linear.h Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-11-14 16:07:26 -08:00
md-multipath.c md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
md-multipath.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
md.c md: make sure desc_nr less than MD_SB_DISKS 2020-01-04 19:18:34 +01:00
md.h md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
raid0.c block: fix an integer overflow in logical block size 2020-01-23 08:22:32 +01:00
raid0.h md/raid0: avoid RAID0 data corruption due to layout confusion. 2019-09-13 13:10:05 -07:00
raid1-10.c md: raid1-10: Unify r{1,10}bio_pool_free 2019-06-15 01:37:35 -06:00
raid1.c md: raid1: check rdev before reference in raid1_sync_request func 2020-01-09 10:19:48 +01:00
raid1.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00
raid5-cache.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288 2019-06-05 17:36:37 +02:00
raid5-log.h raid5: set write hint for PPL 2019-03-12 10:15:18 -07:00
raid5-ppl.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288 2019-06-05 17:36:37 +02:00
raid5.c raid5: need to set STRIPE_HANDLE for batch head 2020-01-09 10:19:48 +01:00
raid5.h raid5: use bio_end_sector in r5_next_bio 2019-09-13 13:14:43 -07:00
raid10.c md: improve handling of bio with REQ_PREFLUSH in md_flush_request() 2019-12-17 19:56:14 +01:00
raid10.h md: convert to bioset_init()/mempool_init() 2018-05-30 15:33:32 -06:00