Aaron Lu 4efaceb1c5 mm, swap: use rbtree for swap_extent
swap_extent is used to map a swap page offset to the backing device's
block offset.  One swap_extent covers a contiguous block range, and all
of a device's swap_extents are managed in a linked list.

These swap_extents are used by map_swap_entry() in swap's read and
write paths.  To find the backing device's block offset for a page
offset, the swap_extent list is traversed linearly, with
curr_swap_extent used as a cache to speed up the search.
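
A rough model of that cached linear lookup is sketched below (the names
are illustrative and the loop is simplified, not the exact kernel code):
the cached extent is checked first, and the list is scanned only on a
miss.

  #include <linux/list.h>
  #include <linux/types.h>

  /* Illustrative list-based extent record and per-device state. */
  struct swap_extent_list_sketch {
          struct list_head list;
          pgoff_t start_page;     /* first swap page offset covered */
          pgoff_t nr_pages;       /* number of contiguous pages */
          sector_t start_block;   /* backing block of start_page */
  };

  struct swap_dev_sketch {
          struct list_head extent_list;           /* all extents, sorted */
          struct swap_extent_list_sketch *curr;   /* last extent that hit */
  };

  static struct swap_extent_list_sketch *
  swap_extent_lookup_cached(struct swap_dev_sketch *dev, pgoff_t offset)
  {
          struct swap_extent_list_sketch *se = dev->curr;

          /* Fast path: a single task usually hits the cached extent. */
          if (se && offset >= se->start_page &&
              offset < se->start_page + se->nr_pages)
                  return se;

          /* Slow path: linear scan, O(number of extents). */
          list_for_each_entry(se, &dev->extent_list, list) {
                  if (offset >= se->start_page &&
                      offset < se->start_page + se->nr_pages) {
                          dev->curr = se;         /* remember for next time */
                          return se;
                  }
          }
          return NULL;
  }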

This works well as long as the number of swap_extents is small or only
a few processes access the swap device, but when the swap device has
many extents and a number of processes access it concurrently, it
becomes a problem.  On one of our servers, the disk's remaining size is
tight:

  $df -h
  Filesystem      Size  Used Avail Use% Mounted on
  ... ...
  /dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4

When creating an 80G swapfile there, as many as 84656 swap extents are
produced.  The end result is that the kernel spends about 30% of its
time in map_swap_entry() and swap throughput is only 70MB/s.

As a comparison, when I used a smaller swapfile, e.g. 4G, the number of
swap_extents dropped to 2000, swap throughput went back to 400-500MB/s
and map_swap_entry() dropped to about 3%.
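
With this patch, a device's swap_extents are instead kept in an rbtree
keyed by start page offset, so the extent containing a given offset is
found with an O(log n) tree walk.  A minimal sketch of such a lookup
follows; the struct and function names are illustrative, not necessarily
the ones used in the patch:

  #include <linux/rbtree.h>
  #include <linux/types.h>

  /* Illustrative extent record, linked into a per-device rbtree. */
  struct swap_extent_sketch {
          struct rb_node rb_node;
          pgoff_t start_page;     /* first swap page offset covered */
          pgoff_t nr_pages;       /* number of contiguous pages */
          sector_t start_block;   /* backing block of start_page */
  };

  /* Find the extent containing @offset: an O(log n) tree walk. */
  static struct swap_extent_sketch *
  swap_extent_lookup_rb(struct rb_root *root, pgoff_t offset)
  {
          struct rb_node *rb = root->rb_node;

          while (rb) {
                  struct swap_extent_sketch *se;

                  se = rb_entry(rb, struct swap_extent_sketch, rb_node);
                  if (offset < se->start_page)
                          rb = rb->rb_left;
                  else if (offset >= se->start_page + se->nr_pages)
                          rb = rb->rb_right;
                  else
                          return se;      /* offset inside this extent */
          }
          return NULL;
  }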

One downside of using an rbtree for the swap_extents is that 'struct
rb_node' takes 24 bytes while 'struct list_head' takes 16 bytes, i.e. 8
more bytes per swap_extent.  For a swapfile that has 80k swap_extents,
that means 625KiB more memory consumed.
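
The 8 bytes come from the difference between the two link types on
64-bit, whose kernel definitions are roughly:

  struct list_head {                      /* two pointers: 16 bytes */
          struct list_head *next, *prev;
  };

  struct rb_node {                        /* one word plus two pointers: 24 bytes */
          unsigned long  __rb_parent_color;       /* parent pointer + colour bit */
          struct rb_node *rb_right;
          struct rb_node *rb_left;
  };

  /* 80,000 extents * 8 extra bytes = 640,000 bytes, i.e. ~625 KiB. */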

Test:

Since it's not possible to reboot that server, I could not test this
patch directly there.  Instead, I tested it on another server with an
NVMe disk.

I created a 20G swapfile on an NVMe-backed XFS filesystem.  By default
the filesystem is quite clean and the created swapfile has only 2
extents.  Testing the vanilla and patched kernels shows no obvious
performance difference when the swapfile is not fragmented.

To see the patch's effect, I used some tweaks to manually fragment the
swapfile by breaking the extents at 1M boundaries.  This gave the
swapfile 20K extents.

  nr_task=4
  kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
  vanilla  165191           90.77%             171798          90.21%
  patched  858993 +420%      2.16%             715827 +317%     0.77%

  nr_task=8
  kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
  vanilla  306783           92.19%             318145          87.76%
  patched  954437 +211%      2.35%            1073741 +237%     1.57%

swapout: the throughput of swap out, in KB/s, higher is better
1st map_swap_entry: cpu cycles percent sampled by perf
swapin: the throughput of swap in, in KB/s, higher is better
2nd map_swap_entry: cpu cycles percent sampled by perf

nr_task=1 doesn't show any difference; this is because curr_swap_extent
can effectively cache the correct swap extent for a single-task
workload.

[akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:43 -07:00
acpi It's been a relatively busy cycle for docs: 2019-07-09 12:34:26 -07:00
asm-generic asm-generic, x86: add bitops instrumentation for KASAN 2019-07-12 11:05:42 -07:00
clocksource clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic 2019-07-03 11:00:59 +02:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2019-07-08 20:57:08 -07:00
drm treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
dt-bindings Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-07-11 10:55:49 -07:00
keys request_key improvements 2019-07-08 19:19:37 -07:00
kvm treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 234 2019-06-19 17:09:07 +02:00
linux mm, swap: use rbtree for swap_extent 2019-07-12 11:05:43 -07:00
math-emu math-emu: Use statement expressions to fix Wshift-count-overflow warning 2019-05-31 15:23:25 +08:00
media media updates for v5.3-rc1 2019-07-09 09:47:22 -07:00
memory treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
misc treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-07-11 10:55:49 -07:00
pcmcia It's been a relatively busy cycle for docs: 2019-07-09 12:34:26 -07:00
ras
rdma SPDX update for 5.2-rc4 2019-06-08 12:52:42 -07:00
scsi treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 335 2019-06-05 17:37:06 +02:00
soc treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
sound ASoC: Updates for v5.3 2019-07-08 14:45:34 +02:00
target
trace Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-07-11 10:55:49 -07:00
uapi nilfs2: do not use unexported cpu_to_le32()/le32_to_cpu() in uapi header 2019-07-12 11:05:40 -07:00
vdso vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h 2019-06-26 07:28:09 +02:00
video fbdev changes for v5.3: 2019-07-09 09:55:45 -07:00
xen