1
0
Fork 0
Commit Graph

5249 Commits (1eff9d322a444245c67515edb52bc0eb68374aa8)

Author SHA1 Message Date
Jens Axboe 1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Jens Axboe c11f0c0b5b block/mm: make bdev_ops->rw_page() take a bool for read/write
Commit abf545484d changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Mike Christie abf545484d mm/block: convert rw_page users to bio op use
The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a8 ("block, fs, drivers: remove REQ_OP compat defs and related code")

Modified by me to:

1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:25:33 -06:00
Christoph Hellwig c1c87c2ba9 loop: make do_req_filebacked more robust
Use a switch statement to iterate over the possible operations and
error out if it's an incorrect one.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Christoph Hellwig f0225cacfe loop: don't try to use AIO for discards
Fix a fat-fingered conversion to the req_op accessors, and also
use a switch statement to make it more obvious what is being checked.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: c2df40 ("drivers: use req op accessor");
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Vegard Nossum 97240963eb nbd: fix race in ioctl
Quentin ran into this bug:

WARNING: CPU: 64 PID: 10085 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x65/0x80
sysfs: cannot create duplicate filename '/devices/virtual/block/nbd3/pid'
Modules linked in: nbd
CPU: 64 PID: 10085 Comm: qemu-nbd Tainted: G      D         4.6.0+ #7
 0000000000000000 ffff8820330bba68 ffffffff814b8791 ffff8820330bbac8
 0000000000000000 ffff8820330bbab8 ffffffff810d04ab ffff8820330bbaa8
 0000001f00000296 0000000000017681 ffff8810380bf000 ffffffffa0001790
Call Trace:
 [<ffffffff814b8791>] dump_stack+0x4d/0x6c
 [<ffffffff810d04ab>] __warn+0xdb/0x100
 [<ffffffff810d0574>] warn_slowpath_fmt+0x44/0x50
 [<ffffffff81218c65>] sysfs_warn_dup+0x65/0x80
 [<ffffffff81218a02>] sysfs_add_file_mode_ns+0x172/0x180
 [<ffffffff81218a35>] sysfs_create_file_ns+0x25/0x30
 [<ffffffff81594a76>] device_create_file+0x36/0x90
 [<ffffffffa0000e8d>] __nbd_ioctl+0x32d/0x9b0 [nbd]
 [<ffffffff814cc8e8>] ? find_next_bit+0x18/0x20
 [<ffffffff810f7c29>] ? select_idle_sibling+0xe9/0x120
 [<ffffffff810f6cd7>] ? __enqueue_entity+0x67/0x70
 [<ffffffff810f9bf0>] ? enqueue_task_fair+0x630/0xe20
 [<ffffffff810efa76>] ? resched_curr+0x36/0x70
 [<ffffffff810f0078>] ? check_preempt_curr+0x78/0x90
 [<ffffffff810f00a2>] ? ttwu_do_wakeup+0x12/0x80
 [<ffffffff810f01b1>] ? ttwu_do_activate.constprop.86+0x61/0x70
 [<ffffffff810f0c15>] ? try_to_wake_up+0x185/0x2d0
 [<ffffffff810f0d6d>] ? default_wake_function+0xd/0x10
 [<ffffffff81105471>] ? autoremove_wake_function+0x11/0x40
 [<ffffffffa0001577>] nbd_ioctl+0x67/0x94 [nbd]
 [<ffffffff814ac0fd>] blkdev_ioctl+0x14d/0x940
 [<ffffffff811b0da2>] ? put_pipe_info+0x22/0x60
 [<ffffffff811d96cc>] block_ioctl+0x3c/0x40
 [<ffffffff811ba08d>] do_vfs_ioctl+0x8d/0x5e0
 [<ffffffff811aa329>] ? ____fput+0x9/0x10
 [<ffffffff810e9092>] ? task_work_run+0x72/0x90
 [<ffffffff811ba627>] SyS_ioctl+0x47/0x80
 [<ffffffff8185f5df>] entry_SYSCALL_64_fastpath+0x17/0x93
---[ end trace 7899b295e4f850c8 ]---

It seems fairly obvious that device_create_file() is not being protected
from being run concurrently on the same nbd.

Quentin found the following relevant commits:

1a2ad21 nbd: add locking to nbd_ioctl
90b8f28 [PATCH] end of methods switch: remove the old ones
d4430d6 [PATCH] beginning of methods conversion
08f8585 [PATCH] move block_device_operations to blkdev.h

It would seem that the race was introduced in the process of moving nbd
from BKL to unlocked ioctls.

By setting nbd->task_recv while the mutex is held, we can prevent other
processes from running concurrently (since nbd->task_recv is also checked
while the mutex is held).

Reported-and-tested-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Pavel Machek <pavel@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Jiri Kosina ff06db1efb floppy: fix open(O_ACCMODE) for ioctl-only open
Commit 09954bad4 ("floppy: refactor open() flags handling"), as a
side-effect, causes open(/dev/fdX, O_ACCMODE) to fail. It turns out that
this is being used setfdprm userspace for ioctl-only open().

Reintroduce back the original behavior wrt !(FMODE_READ|FMODE_WRITE)
modes, while still keeping the original O_NDELAY bug fixed.

Cc: stable@vger.kernel.org # v4.5+
Reported-by: Wim Osterholt <wim@djo.tudelft.nl>
Tested-by: Wim Osterholt <wim@djo.tudelft.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Linus Torvalds d52bd54db8 Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - the rest of ocfs2

 - various hotfixes, mainly MM

 - quite a bit of misc stuff - drivers, fork, exec, signals, etc.

 - printk updates

 - firmware

 - checkpatch

 - nilfs2

 - more kexec stuff than usual

 - rapidio updates

 - w1 things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
  ipc: delete "nr_ipc_ns"
  kcov: allow more fine-grained coverage instrumentation
  init/Kconfig: add clarification for out-of-tree modules
  config: add android config fragments
  init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
  relay: add global mode support for buffer-only channels
  init: allow blacklisting of module_init functions
  w1:omap_hdq: fix regression
  w1: add helper macro module_w1_family
  w1: remove need for ida and use PLATFORM_DEVID_AUTO
  rapidio/switches: add driver for IDT gen3 switches
  powerpc/fsl_rio: apply changes for RIO spec rev 3
  rapidio: modify for rev.3 specification changes
  rapidio: change inbound window size type to u64
  rapidio/idt_gen2: fix locking warning
  rapidio: fix error handling in mbox request/release functions
  rapidio/tsi721_dma: advance queue processing from transfer submit call
  rapidio/tsi721: add messaging mbox selector parameter
  rapidio/tsi721: add PCIe MRRS override parameter
  rapidio/tsi721_dma: add channel mask and queue size parameters
  ...
2016-08-02 21:08:07 -04:00
Linus Torvalds 72b5ac54d6 The highlights are:
* RADOS namespace support in libceph and CephFS (Zheng Yan and myself).
    The stopgaps added in 4.5 to deny access to inodes in namespaces are
    removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature bit is now fully
    supported.
 
  * A large rework of the MDS cap flushing code (Zheng Yan).
 
  * Handle some of ->d_revalidate() in RCU mode (Jeff Layton).  We were
    overly pessimistic before, bailing at the first sight of LOOKUP_RCU.
 
 On top of that we've got a few CephFS bug fixes, a couple of cleanups
 and Arnd's workaround for a weird genksyms issue.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXoKLJAAoJEEp/3jgCEfOLDTUIAIcctpKUiNBokc95mQaXYl34
 j7lPIaD0/Ur7JPt4nMdtlywYJYSVV2c+SglHztj/+fv0G4bWbLVEFRruh9SwKIci
 PzttcmycIAqSn1f5gBZwyQbGuffd/F0EnBj7fFjcukt01i3s1ZQ7t4XtLGtAV0Ts
 aIfFtx9SqWig57Z1OZqNgnhnOoh6IqNbic3FL5Hvdl5N5pFbBcQho6Vzoa5O1osH
 URG6RmCcO4nykfSoxiivE7UZ+CImsXHkRD7rupBuIjqjZ8wvmZqQF5qxnkb9Dw2F
 IkNhrHkTSIiv4EsNPLAETTnFSozrL1nEykKr2FBW+ti8nxNcav+8FgVapqLvFIw=
 =gQ0/
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client

Pull Ceph updates from Ilya Dryomov:
 "The highlights are:

   - RADOS namespace support in libceph and CephFS (Zheng Yan and
     myself).  The stopgaps added in 4.5 to deny access to inodes in
     namespaces are removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature
     bit is now fully supported

   - A large rework of the MDS cap flushing code (Zheng Yan)

   - Handle some of ->d_revalidate() in RCU mode (Jeff Layton).  We were
     overly pessimistic before, bailing at the first sight of LOOKUP_RCU

  On top of that we've got a few CephFS bug fixes, a couple of cleanups
  and Arnd's workaround for a weird genksyms issue"

* tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client: (34 commits)
  ceph: fix symbol versioning for ceph_monc_do_statfs
  ceph: Correctly return NXIO errors from ceph_llseek
  ceph: Mark the file cache as unreclaimable
  ceph: optimize cap flush waiting
  ceph: cleanup ceph_flush_snaps()
  ceph: kick cap flushes before sending other cap message
  ceph: introduce an inode flag to indicates if snapflush is needed
  ceph: avoid sending duplicated cap flush message
  ceph: unify cap flush and snapcap flush
  ceph: use list instead of rbtree to track cap flushes
  ceph: update types of some local varibles
  ceph: include 'follows' of pending snapflush in cap reconnect message
  ceph: update cap reconnect message to version 3
  ceph: mount non-default filesystem by name
  libceph: fsmap.user subscription support
  ceph: handle LOOKUP_RCU in ceph_d_revalidate
  ceph: allow dentry_lease_is_valid to work under RCU walk
  ceph: clear d_fsinfo pointer under d_lock
  ceph: remove ceph_mdsc_lease_release
  ceph: don't use ->d_time
  ...
2016-08-02 19:39:09 -04:00
Luis de Bethencourt 9d5059c959 dynamic_debug: only add header when used
kernel.h header doesn't directly use dynamic debug, instead we can
include it in module.c (which used it via kernel.h).  printk.h only uses
it if CONFIG_DYNAMIC_DEBUG is on, changing the inclusion to only happen
in that case.

Link: http://lkml.kernel.org/r/1468429793-16917-1-git-send-email-luisbg@osg.samsung.com
[luisbg@osg.samsung.com: include dynamic_debug.h in drb_int.h]
  Link: http://lkml.kernel.org/r/1468447828-18558-2-git-send-email-luisbg@osg.samsung.com
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:03 -04:00
Linus Torvalds f0c98ebc57 libnvdimm for 4.8
1/ Replace pcommit with ADR / directed-flushing:
    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement either
    ADR, or provide one or more flush addresses per nvdimm. ADR
    (Asynchronous DRAM Refresh) flushes data in posted write buffers to the
    memory controller on a power-fail event. Flush addresses are defined in
    ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
    "Flush Hint Address Structure". A flush hint is an mmio address that
    when written and fenced assures that all previous posted writes
    targeting a given dimm have been flushed to media.
 
 2/ On-demand ARS (address range scrub):
    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices.  When latent errors are detected we re-scrub the media
    to refresh the bad block list, userspace can also request a re-scrub at
    any time.
 
 3/ Support for the Microsoft DSM (device specific method) command format.
 
 4/ Support for EDK2/OVMF virtual disk device memory ranges.
 
 5/ Various fixes and cleanups across the subsystem.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
 NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
 ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
 trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
 KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
 qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
 WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
 41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
 DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
 b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
 6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
 cN3edFVIdxvZeMgM5Ubq
 =xCBG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:

 - Replace pcommit with ADR / directed-flushing.

   The pcommit instruction, which has not shipped on any product, is
   deprecated.  Instead, the requirement is that platforms implement
   either ADR, or provide one or more flush addresses per nvdimm.

   ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
   to the memory controller on a power-fail event.

   Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
   Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
   A flush hint is an mmio address that when written and fenced assures
   that all previous posted writes targeting a given dimm have been
   flushed to media.

 - On-demand ARS (address range scrub).

   Linux uses the results of the ACPI ARS commands to track bad blocks
   in pmem devices.  When latent errors are detected we re-scrub the
   media to refresh the bad block list, userspace can also request a
   re-scrub at any time.

 - Support for the Microsoft DSM (device specific method) command
   format.

 - Support for EDK2/OVMF virtual disk device memory ranges.

 - Various fixes and cleanups across the subsystem.

* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
  libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
  nfit: do an ARS scrub on hitting a latent media error
  nfit: move to nfit/ sub-directory
  nfit, libnvdimm: allow an ARS scrub to be triggered on demand
  libnvdimm: register nvdimm_bus devices with an nd_bus driver
  pmem: clarify a debug print in pmem_clear_poison
  x86/insn: remove pcommit
  Revert "KVM: x86: add pcommit support"
  nfit, tools/testing/nvdimm/: unify shutdown paths
  libnvdimm: move ->module to struct nvdimm_bus_descriptor
  nfit: cleanup acpi_nfit_init calling convention
  nfit: fix _FIT evaluation memory leak + use after free
  tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
  tools/testing/nvdimm: add virtual ramdisk range
  acpi, nfit: treat virtual ramdisk SPA as pmem region
  pmem: kill __pmem address space
  pmem: kill wmb_pmem()
  libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
  fs/dax: remove wmb_pmem()
  libnvdimm, pmem: flush posted-write queues on shutdown
  ...
2016-07-28 17:38:16 -07:00
Linus Torvalds 6784725ab0 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted cleanups and fixes.

  Probably the most interesting part long-term is ->d_init() - that will
  have a bunch of followups in (at least) ceph and lustre, but we'll
  need to sort the barrier-related rules before it can get used for
  really non-trivial stuff.

  Another fun thing is the merge of ->d_iput() callers (dentry_iput()
  and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
  except the one in __d_lookup_lru())"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  fs/dcache.c: avoid soft-lockup in dput()
  vfs: new d_init method
  vfs: Update lookup_dcache() comment
  bdev: get rid of ->bd_inodes
  Remove last traces of ->sync_page
  new helper: d_same_name()
  dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
  vfs: clean up documentation
  vfs: document ->d_real()
  vfs: merge .d_select_inode() into .d_real()
  unify dentry_iput() and dentry_unlink_inode()
  binfmt_misc: ->s_root is not going anywhere
  drop redundant ->owner initializations
  ufs: get rid of redundant checks
  orangefs: constify inode_operations
  missed comment updates from ->direct_IO() prototype change
  file_inode(f)->i_mapping is f->f_mapping
  trim fsnotify hooks a bit
  9p: new helper - v9fs_parent_fid()
  debugfs: ->d_parent is never NULL or negative
  ...
2016-07-28 12:59:05 -07:00
Yan, Zheng 30c156d995 libceph: rados pool namespace support
Add pool namesapce pointer to struct ceph_file_layout and struct
ceph_object_locator. Pool namespace is used by when mapping object
to PG, it's also used when composing OSD request.

The namespace pointer in struct ceph_file_layout is RCU protected.
So libceph can read namespace without taking lock.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-28 02:55:37 +02:00
Yan, Zheng 7627151ea3 libceph: define new ceph_file_layout structure
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28 02:55:36 +02:00
Linus Torvalds 08fd8c1768 xen: features and fixes for 4.8-rc0
- ACPI support for guests on ARM platforms.
 - Generic steal time support for arm and x86.
 - Support cases where kernel cpu is not Xen VCPU number (e.g., if
   in-guest kexec is used).
 - Use the system workqueue instead of a custom workqueue in various
   places.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXmLlrAAoJEFxbo/MsZsTRvRQH/1wOMF8BmlbZfR7H3qwDfjst
 ApNifCiZE08xDtWBlwUaBFAQxyflQS9BBiNZDVK0sysIdXeOdpWV7V0ZjRoLL+xr
 czsaaGXDcmXxJxApoMDVuT7FeP6rEk6LVAYRoHpVjJjMZGW3BbX1vZaMW4DXl2WM
 9YNaF2Lj+rpc1f8iG31nUxwkpmcXFog6ct4tu7HiyCFT3hDkHt/a4ghuBdQItCkd
 vqBa1pTpcGtQBhSmWzlylN/PV2+NKcRd+kGiwd09/O/rNzogTMCTTWeHKAtMpPYb
 Cu6oSqJtlK5o0vtr0qyLSWEGIoyjE2gE92s0wN3iCzFY1PldqdsxUO622nIj+6o=
 =G6q3
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from David Vrabel:
 "Features and fixes for 4.8-rc0:

   - ACPI support for guests on ARM platforms.
   - Generic steal time support for arm and x86.
   - Support cases where kernel cpu is not Xen VCPU number (e.g., if
     in-guest kexec is used).
   - Use the system workqueue instead of a custom workqueue in various
     places"

* tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits)
  xen: add static initialization of steal_clock op to xen_time_ops
  xen/pvhvm: run xen_vcpu_setup() for the boot CPU
  xen/evtchn: use xen_vcpu_id mapping
  xen/events: fifo: use xen_vcpu_id mapping
  xen/events: use xen_vcpu_id mapping in events_base
  x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info
  x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op
  xen: introduce xen_vcpu_id mapping
  x86/acpi: store ACPI ids from MADT for future usage
  x86/xen: update cpuid.h from Xen-4.7
  xen/evtchn: add IOCTL_EVTCHN_RESTRICT
  xen-blkback: really don't leak mode property
  xen-blkback: constify instance of "struct attribute_group"
  xen-blkfront: prefer xenbus_scanf() over xenbus_gather()
  xen-blkback: prefer xenbus_scanf() over xenbus_gather()
  xen: support runqueue steal time on xen
  arm/xen: add support for vm_assist hypercall
  xen: update xen headers
  xen-pciback: drop superfluous variables
  xen-pciback: short-circuit read path used for merging write values
  ...
2016-07-27 11:35:37 -07:00
Linus Torvalds 0e06f5c0de Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2

 - most(?) of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
  thp: fix comments of __pmd_trans_huge_lock()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
  mm: memcontrol: fix documentation for compound parameter
  mm: memcontrol: remove BUG_ON in uncharge_list
  mm: fix build warnings in <linux/compaction.h>
  mm, thp: convert from optimistic swapin collapsing to conservative
  mm, thp: fix comment inconsistency for swapin readahead functions
  thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
  shmem: split huge pages beyond i_size under memory pressure
  thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
  khugepaged: add support of collapse for tmpfs/shmem pages
  shmem: make shmem_inode_info::lock irq-safe
  khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
  thp: extract khugepaged from mm/huge_memory.c
  shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
  shmem: add huge pages support
  shmem: get_unmapped_area align huge page
  shmem: prepare huge= mount option and sysfs knob
  mm, rmap: account shmem thp pages
  ...
2016-07-26 19:55:54 -07:00
Minchan Kim 9bc482d346 zram: use __GFP_MOVABLE for memory allocation
Zsmalloc is ready for page migration so zram can use __GFP_MOVABLE from
now on.

I did test to see how it helps to make higher order pages.  Test
scenario is as follows.

KVM guest, 1G memory, ext4 formated zram block device,

  for i in `seq 1 8`;
  do
          dd if=/dev/vda1 of=mnt/test$i.txt bs=128M count=1 &
  done

  wait `pidof dd`

  for i in `seq 1 2 8`;
  do
          rm -rf mnt/test$i.txt
  done
  fstrim -v mnt

  echo "init"
  cat /proc/buddyinfo

  echo "compaction"
  echo 1 > /proc/sys/vm/compact_memory
  cat /proc/buddyinfo

old:

  init
  Node 0, zone      DMA    208    120     51     41     11      0      0      0      0      0      0
  Node 0, zone    DMA32  16380  13777   9184   3805    789     54      3      0      0      0      0
  compaction
  Node 0, zone      DMA    132     82     40     39     16      2      1      0      0      0      0
  Node 0, zone    DMA32   5219   5526   4969   3455   1831    677    139     15      0      0      0

new:

  init
  Node 0, zone      DMA    379    115     97     19      2      0      0      0      0      0      0
  Node 0, zone    DMA32  18891  16774  10862   3947    637     21      0      0      0      0      0
  compaction
  Node 0, zone      DMA    214     66     87     29     10      3      0      0      0      0      0
  Node 0, zone    DMA32   1612   3139   3154   2469   1745    990    384     94      7      0      0

As you can see, compaction made so many high-order pages. Yay!

Link: http://lkml.kernel.org/r/1464736881-24886-13-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Sergey Senozhatsky 16d37725a0 zram: drop gfp_t from zcomp_strm_alloc()
We now allocate streams from CPU_UP hot-plug path, there are no
context-dependent stream allocations anymore and we can schedule from
zcomp_strm_alloc().  Use GFP_KERNEL directly and drop a gfp_t parameter.

Link: http://lkml.kernel.org/r/20160531122017.2878-9-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Sergey Senozhatsky eb9f56d825 zram: add more compression algorithms
Add "deflate", "lz4hc", "842" algorithms to the list of known
compression backends.  The real availability of those algorithms,
however, depends on the corresponding CONFIG_CRYPTO_FOO config options.

[sergey.senozhatsky@gmail.com: zram-add-more-compression-algorithms-v3]
  Link: http://lkml.kernel.org/r/20160604024902.11778-7-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20160531122017.2878-8-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Sergey Senozhatsky ce1ed9f98e zram: delete custom lzo/lz4
Remove lzo/lz4 backends, we use crypto API now.

[sergey.senozhatsky@gmail.com: zram-delete-custom-lzo-lz4-v3]
  Link: http://lkml.kernel.org/r/20160604024902.11778-6-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20160531122017.2878-7-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Sergey Senozhatsky 415403be37 zram: use crypto api to check alg availability
There is no way to get a string with all the crypto comp algorithms
supported by the crypto comp engine, so we need to maintain our own
backends list.  At the same time we additionally need to use
crypto_has_comp() to make sure that the user has requested a compression
algorithm that is recognized by the crypto comp engine.  Relying on
/proc/crypto is not an options here, because it does not show
not-yet-inserted compression modules.

Example:

 modprobe zram
 cat /proc/crypto | grep -i lz4
 modprobe lz4
 cat /proc/crypto | grep -i lz4
name         : lz4
driver       : lz4-generic
module       : lz4

So the user can't tell exactly if the lz4 is really supported from
/proc/crypto output, unless someone or something has loaded it.

This patch also adds crypto_has_comp() to zcomp_available_show().  We
store all the compression algorithms names in zcomp's `backends' array,
regardless the CONFIG_CRYPTO_FOO configuration, but show only those that
are also supported by crypto engine.  This helps user to know the exact
list of compression algorithms that can be used.

Example:
  module lz4 is not loaded yet, but is supported by the crypto
  engine. /proc/crypto has no information on this module, while
  zram's `comp_algorithm' lists it:

 cat /proc/crypto | grep -i lz4

 cat /sys/block/zram0/comp_algorithm
[lzo] lz4 deflate lz4hc 842

We still use the `backends' array to determine if the requested
compression backend is known to crypto api.  This array, however, may not
contain some entries, therefore as the last step we call crypto_has_comp()
function which attempts to insmod the requested compression algorithm to
determine if crypto api supports it.  The advantage of this method is that
now we permit the usage of out-of-tree crypto compression modules
(implementing S/W or H/W compression).

[sergey.senozhatsky@gmail.com: zram-use-crypto-api-to-check-alg-availability-v3]
  Link: http://lkml.kernel.org/r/20160604024902.11778-4-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20160531122017.2878-5-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Sergey Senozhatsky ebaf9ab56d zram: switch to crypto compress API
We don't have an idle zstreams list anymore and our write path now works
absolutely differently, preventing preemption during compression.  This
removes possibilities of read paths preempting writes at wrong places
(which could badly affect the performance of both paths) and at the same
time opens the door for a move from custom LZO/LZ4 compression backends
implementation to a more generic one, using crypto compress API.

Joonsoo Kim [1] attempted to do this a while ago, but faced with the
need of introducing a new crypto API interface.  The root cause was the
fact that crypto API compression algorithms require a compression stream
structure (in zram terminology) for both compression and decompression
ops, while in reality only several of compression algorithms really need
it.  This resulted in a concept of context-less crypto API compression
backends [2].  Both write and read paths, though, would have been
executed with the preemption enabled, which in the worst case could have
resulted in a decreased worst-case performance, e.g.  consider the
following case:

	CPU0

	zram_write()
	  spin_lock()
	    take the last idle stream
	  spin_unlock()

	<< preempted >>

		zram_read()
		  spin_lock()
		   no idle streams
			  spin_unlock()
			  schedule()

	resuming zram_write compression()

but it took me some time to realize that, and it took even longer to
evolve zram and to make it ready for crypto API.  The key turned out to be
-- drop the idle streams list entirely.  Without the idle streams list we
are free to use compression algorithms that require compression stream for
decompression (read), because streams are now placed in per-cpu data and
each write path has to disable preemption for compression op, almost
completely eliminating the aforementioned case (technically, we still have
a small chance, because write path has a fast and a slow paths and the
slow path is executed with the preemption enabled; but the frequency of
failed fast path is too low).

TEST
====

- 4 CPUs, x86_64 system
- 3G zram, lzo
- fio tests: read, randread, write, randwrite, rw, randrw

test script [3] command:
 ZRAM_SIZE=3G LOG_SUFFIX=XXXX FIO_LOOPS=5 ./zram-fio-test.sh

                   BASE           PATCHED
jobs1
READ:           2527.2MB/s	 2482.7MB/s
READ:           2102.7MB/s	 2045.0MB/s
WRITE:          1284.3MB/s	 1324.3MB/s
WRITE:          1080.7MB/s	 1101.9MB/s
READ:           430125KB/s	 437498KB/s
WRITE:          430538KB/s	 437919KB/s
READ:           399593KB/s	 403987KB/s
WRITE:          399910KB/s	 404308KB/s
jobs2
READ:           8133.5MB/s	 7854.8MB/s
READ:           7086.6MB/s	 6912.8MB/s
WRITE:          3177.2MB/s	 3298.3MB/s
WRITE:          2810.2MB/s	 2871.4MB/s
READ:           1017.6MB/s	 1023.4MB/s
WRITE:          1018.2MB/s	 1023.1MB/s
READ:           977836KB/s	 984205KB/s
WRITE:          979435KB/s	 985814KB/s
jobs3
READ:           13557MB/s	 13391MB/s
READ:           11876MB/s	 11752MB/s
WRITE:          4641.5MB/s	 4682.1MB/s
WRITE:          4164.9MB/s	 4179.3MB/s
READ:           1453.8MB/s	 1455.1MB/s
WRITE:          1455.1MB/s	 1458.2MB/s
READ:           1387.7MB/s	 1395.7MB/s
WRITE:          1386.1MB/s	 1394.9MB/s
jobs4
READ:           20271MB/s	 20078MB/s
READ:           18033MB/s	 17928MB/s
WRITE:          6176.8MB/s	 6180.5MB/s
WRITE:          5686.3MB/s	 5705.3MB/s
READ:           2009.4MB/s	 2006.7MB/s
WRITE:          2007.5MB/s	 2004.9MB/s
READ:           1929.7MB/s	 1935.6MB/s
WRITE:          1926.8MB/s	 1932.6MB/s
jobs5
READ:           18823MB/s	 19024MB/s
READ:           18968MB/s	 19071MB/s
WRITE:          6191.6MB/s	 6372.1MB/s
WRITE:          5818.7MB/s	 5787.1MB/s
READ:           2011.7MB/s	 1981.3MB/s
WRITE:          2011.4MB/s	 1980.1MB/s
READ:           1949.3MB/s	 1935.7MB/s
WRITE:          1940.4MB/s	 1926.1MB/s
jobs6
READ:           21870MB/s	 21715MB/s
READ:           19957MB/s	 19879MB/s
WRITE:          6528.4MB/s	 6537.6MB/s
WRITE:          6098.9MB/s	 6073.6MB/s
READ:           2048.6MB/s	 2049.9MB/s
WRITE:          2041.7MB/s	 2042.9MB/s
READ:           2013.4MB/s	 1990.4MB/s
WRITE:          2009.4MB/s	 1986.5MB/s
jobs7
READ:           21359MB/s	 21124MB/s
READ:           19746MB/s	 19293MB/s
WRITE:          6660.4MB/s	 6518.8MB/s
WRITE:          6211.6MB/s	 6193.1MB/s
READ:           2089.7MB/s	 2080.6MB/s
WRITE:          2085.8MB/s	 2076.5MB/s
READ:           2041.2MB/s	 2052.5MB/s
WRITE:          2037.5MB/s	 2048.8MB/s
jobs8
READ:           20477MB/s	 19974MB/s
READ:           18922MB/s	 18576MB/s
WRITE:          6851.9MB/s	 6788.3MB/s
WRITE:          6407.7MB/s	 6347.5MB/s
READ:           2134.8MB/s	 2136.1MB/s
WRITE:          2132.8MB/s	 2134.4MB/s
READ:           2074.2MB/s	 2069.6MB/s
WRITE:          2087.3MB/s	 2082.4MB/s
jobs9
READ:           19797MB/s	 19994MB/s
READ:           18806MB/s	 18581MB/s
WRITE:          6878.7MB/s	 6822.7MB/s
WRITE:          6456.8MB/s	 6447.2MB/s
READ:           2141.1MB/s	 2154.7MB/s
WRITE:          2144.4MB/s	 2157.3MB/s
READ:           2084.1MB/s	 2085.1MB/s
WRITE:          2091.5MB/s	 2092.5MB/s
jobs10
READ:           19794MB/s	 19784MB/s
READ:           18794MB/s	 18745MB/s
WRITE:          6984.4MB/s	 6676.3MB/s
WRITE:          6532.3MB/s	 6342.7MB/s
READ:           2150.6MB/s	 2155.4MB/s
WRITE:          2156.8MB/s	 2161.5MB/s
READ:           2106.4MB/s	 2095.6MB/s
WRITE:          2109.7MB/s	 2098.4MB/s

                                    BASE                       PATCHED
jobs1                              perfstat
stalled-cycles-frontend     102,480,595,419 (  41.53%)	  114,508,864,804 (  46.92%)
stalled-cycles-backend       51,941,417,832 (  21.05%)	   46,836,112,388 (  19.19%)
instructions                283,612,054,215 (    1.15)	  283,918,134,959 (    1.16)
branches                     56,372,560,385 ( 724.923)	   56,449,814,753 ( 733.766)
branch-misses                   374,826,000 (   0.66%)	      326,935,859 (   0.58%)
jobs2                              perfstat
stalled-cycles-frontend     155,142,745,777 (  40.99%)	  164,170,979,198 (  43.82%)
stalled-cycles-backend       70,813,866,387 (  18.71%)	   66,456,858,165 (  17.74%)
instructions                463,436,648,173 (    1.22)	  464,221,890,191 (    1.24)
branches                     91,088,733,902 ( 760.088)	   91,278,144,546 ( 769.133)
branch-misses                   504,460,363 (   0.55%)	      394,033,842 (   0.43%)
jobs3                              perfstat
stalled-cycles-frontend     201,300,397,212 (  39.84%)	  223,969,902,257 (  44.44%)
stalled-cycles-backend       87,712,593,974 (  17.36%)	   81,618,888,712 (  16.19%)
instructions                642,869,545,023 (    1.27)	  644,677,354,132 (    1.28)
branches                    125,724,560,594 ( 690.682)	  126,133,159,521 ( 694.542)
branch-misses                   527,941,798 (   0.42%)	      444,782,220 (   0.35%)
jobs4                              perfstat
stalled-cycles-frontend     246,701,197,429 (  38.12%)	  280,076,030,886 (  43.29%)
stalled-cycles-backend      119,050,341,112 (  18.40%)	  110,955,641,671 (  17.15%)
instructions                822,716,962,127 (    1.27)	  825,536,969,320 (    1.28)
branches                    160,590,028,545 ( 688.614)	  161,152,996,915 ( 691.068)
branch-misses                   650,295,287 (   0.40%)	      550,229,113 (   0.34%)
jobs5                              perfstat
stalled-cycles-frontend     298,958,462,516 (  38.30%)	  344,852,200,358 (  44.16%)
stalled-cycles-backend      137,558,742,122 (  17.62%)	  129,465,067,102 (  16.58%)
instructions              1,005,714,688,752 (    1.29)	1,007,657,999,432 (    1.29)
branches                    195,988,773,962 ( 697.730)	  196,446,873,984 ( 700.319)
branch-misses                   695,818,940 (   0.36%)	      624,823,263 (   0.32%)
jobs6                              perfstat
stalled-cycles-frontend     334,497,602,856 (  36.71%)	  387,590,419,779 (  42.38%)
stalled-cycles-backend      163,539,365,335 (  17.95%)	  152,640,193,639 (  16.69%)
instructions              1,184,738,177,851 (    1.30)	1,187,396,281,677 (    1.30)
branches                    230,592,915,640 ( 702.902)	  231,253,802,882 ( 702.356)
branch-misses                   747,934,786 (   0.32%)	      643,902,424 (   0.28%)
jobs7                              perfstat
stalled-cycles-frontend     396,724,684,187 (  37.71%)	  460,705,858,952 (  43.84%)
stalled-cycles-backend      188,096,616,496 (  17.88%)	  175,785,787,036 (  16.73%)
instructions              1,364,041,136,608 (    1.30)	1,366,689,075,112 (    1.30)
branches                    265,253,096,936 ( 700.078)	  265,890,524,883 ( 702.839)
branch-misses                   784,991,589 (   0.30%)	      729,196,689 (   0.27%)
jobs8                              perfstat
stalled-cycles-frontend     440,248,299,870 (  36.92%)	  509,554,793,816 (  42.46%)
stalled-cycles-backend      222,575,930,616 (  18.67%)	  213,401,248,432 (  17.78%)
instructions              1,542,262,045,114 (    1.29)	1,545,233,932,257 (    1.29)
branches                    299,775,178,439 ( 697.666)	  300,528,458,505 ( 694.769)
branch-misses                   847,496,084 (   0.28%)	      748,794,308 (   0.25%)
jobs9                              perfstat
stalled-cycles-frontend     506,269,882,480 (  37.86%)	  592,798,032,820 (  44.43%)
stalled-cycles-backend      253,192,498,861 (  18.93%)	  233,727,666,185 (  17.52%)
instructions              1,721,985,080,913 (    1.29)	1,724,666,236,005 (    1.29)
branches                    334,517,360,255 ( 694.134)	  335,199,758,164 ( 697.131)
branch-misses                   873,496,730 (   0.26%)	      815,379,236 (   0.24%)
jobs10                             perfstat
stalled-cycles-frontend     549,063,363,749 (  37.18%)	  651,302,376,662 (  43.61%)
stalled-cycles-backend      281,680,986,810 (  19.07%)	  277,005,235,582 (  18.55%)
instructions              1,901,859,271,180 (    1.29)	1,906,311,064,230 (    1.28)
branches                    369,398,536,153 ( 694.004)	  370,527,696,358 ( 688.409)
branch-misses                   967,929,335 (   0.26%)	      890,125,056 (   0.24%)

                            BASE           PATCHED
seconds elapsed        79.421641008	78.735285546
seconds elapsed        61.471246133	60.869085949
seconds elapsed        62.317058173	62.224188495
seconds elapsed        60.030739363	60.081102518
seconds elapsed        74.070398362	74.317582865
seconds elapsed        84.985953007	85.414364176
seconds elapsed        97.724553255	98.173311344
seconds elapsed        109.488066758	110.268399318
seconds elapsed        122.768189405	122.967164498
seconds elapsed        135.130035105	136.934770801

On my other system (8 x86_64 CPUs, short version of test results):

                            BASE           PATCHED
seconds elapsed        19.518065994	19.806320662
seconds elapsed        15.172772749	15.594718291
seconds elapsed        13.820925970	13.821708564
seconds elapsed        13.293097816	14.585206405
seconds elapsed        16.207284118	16.064431606
seconds elapsed        17.958376158	17.771825767
seconds elapsed        19.478009164	19.602961508
seconds elapsed        21.347152811	21.352318709
seconds elapsed        24.478121126	24.171088735
seconds elapsed        26.865057442	26.767327618

So performance-wise the numbers are quite similar.

Also update zcomp interface to be more aligned with the crypto API.

[1] http://marc.info/?l=linux-kernel&m=144480832108927&w=2
[2] http://marc.info/?l=linux-kernel&m=145379613507518&w=2
[3] https://github.com/sergey-senozhatsky/zram-perf-test

Link: http://lkml.kernel.org/r/20160531122017.2878-3-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Sergey Senozhatsky 2aea8493d3 zram: rename zstrm find-release functions
This has started as a 'add zlib support' work, but after some thinking I
saw no blockers for a bigger change -- a switch to crypto API.

We don't have an idle zstreams list anymore and our write path now works
absolutely differently, preventing preemption during compression.  This
removes possibilities of read paths preempting writes at wrong places
and opens the door for a move from custom LZO/LZ4 compression backends
implementation to a more generic one, using crypto compress API.

This patch set also eliminates the need of a new context-less crypto API
interface, which was quite hard to sell, so we can move along faster.

benchmarks:

(x86_64, 4GB, zram-perf script)

perf reported run-time fio (max jobs=3).  I performed fio test with the
increasing number of parallel jobs (max to 3) on a 3G zram device, using
`static' data and the following crypto comp algorithms:

	842, deflate, lz4, lz4hc, lzo

the output was:

 - test running time (which can tell us what algorithms performs faster)

and

 - zram mm_stat (which tells the compressed memory size, max used memory, etc).

It's just for information.  for example, LZ4HC has twice the running
time of LZO, but the compressed memory size is: 23592960 vs 34603008
bytes.

  test-fio-zram-842
     197.907655282 seconds time elapsed
     201.623142884 seconds time elapsed
     226.854291345 seconds time elapsed
  test-fio-zram-DEFLATE
     253.259516155 seconds time elapsed
     258.148563401 seconds time elapsed
     290.251909365 seconds time elapsed
  test-fio-zram-LZ4
      27.022598717 seconds time elapsed
      29.580522717 seconds time elapsed
      33.293463430 seconds time elapsed
  test-fio-zram-LZ4HC
      56.393954615 seconds time elapsed
      74.904659747 seconds time elapsed
     101.940998564 seconds time elapsed
  test-fio-zram-LZO
      28.155948075 seconds time elapsed
      30.390036330 seconds time elapsed
      34.455773159 seconds time elapsed

zram mm_stat-s (max fio jobs=3)

  test-fio-zram-842
  mm_stat (jobs1): 3221225472 673185792 690266112        0 690266112        0        0
  mm_stat (jobs2): 3221225472 673185792 690266112        0 690266112        0        0
  mm_stat (jobs3): 3221225472 673185792 690266112        0 690266112        0        0
  test-fio-zram-DEFLATE
  mm_stat (jobs1): 3221225472  24379392  37761024        0  37761024        0        0
  mm_stat (jobs2): 3221225472  24379392  37761024        0  37761024        0        0
  mm_stat (jobs3): 3221225472  24379392  37761024        0  37761024        0        0
  test-fio-zram-LZ4
  mm_stat (jobs1): 3221225472  23592960  37761024        0  37761024        0        0
  mm_stat (jobs2): 3221225472  23592960  37761024        0  37761024        0        0
  mm_stat (jobs3): 3221225472  23592960  37761024        0  37761024        0        0
  test-fio-zram-LZ4HC
  mm_stat (jobs1): 3221225472  23592960  37761024        0  37761024        0        0
  mm_stat (jobs2): 3221225472  23592960  37761024        0  37761024        0        0
  mm_stat (jobs3): 3221225472  23592960  37761024        0  37761024        0        0
  test-fio-zram-LZO
  mm_stat (jobs1): 3221225472  34603008  50335744        0  50335744        0        0
  mm_stat (jobs2): 3221225472  34603008  50335744        0  50335744        0        0
  mm_stat (jobs3): 3221225472  34603008  50335744        0  50339840        0        0

This patch (of 8):

We don't perform any zstream idle list lookup anymore, so
zcomp_strm_find()/zcomp_strm_release() names are not representative.

Rename to zcomp_stream_get()/zcomp_stream_put().

Link: http://lkml.kernel.org/r/20160531122017.2878-2-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Linus Torvalds 3fc9d69093 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This branch also contains core changes.  I've come to the conclusion
  that from 4.9 and forward, I'll be doing just a single branch.  We
  often have dependencies between core and drivers, and it's hard to
  always split them up appropriately without pulling core into drivers
  when that happens.

  That said, this contains:

   - separate secure erase type for the core block layer, from
     Christoph.

   - set of discard fixes, from Christoph.

   - bio shrinking fixes from Christoph, as a followup up to the
     op/flags change in the core branch.

   - map and append request fixes from Christoph.

   - NVMeF (NVMe over Fabrics) code from Christoph.  This is pretty
     exciting!

   - nvme-loop fixes from Arnd.

   - removal of ->driverfs_dev from Dan, after providing a
     device_add_disk() helper.

   - bcache fixes from Bhaktipriya and Yijing.

   - cdrom subchannel read fix from Vchannaiah.

   - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

   - set of drbd updates and fixes from Fabian, Lars, and Philipp.

   - mg_disk error path fix from Bart.

   - user notification for failed device add for loop, from Minfei.

   - NVMe in general:
        + NVMe delay quirk from Guilherme.
        + SR-IOV support and command retry limits from Keith.
        + fix for memory-less NUMA node from Masayoshi.
        + use UINT_MAX for discard sectors, from Minfei.
        + cancel IO fixes from Ming.
        + don't allocate unused major, from Neil.
        + error code fixup from Dan.
        + use constants for PSDT/FUSE from James.
        + variable init fix from Jay.
        + fabrics fixes from Ming, Sagi, and Wei.
        + various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
  nvme/pci: Provide SR-IOV support
  nvme: initialize variable before logical OR'ing it
  block: unexport various bio mapping helpers
  scsi/osd: open code blk_make_request
  target: stop using blk_make_request
  block: simplify and export blk_rq_append_bio
  block: ensure bios return from blk_get_request are properly initialized
  virtio_blk: use blk_rq_map_kern
  memstick: don't allow REQ_TYPE_BLOCK_PC requests
  block: shrink bio size again
  block: simplify and cleanup bvec pool handling
  block: get rid of bio_rw and READA
  block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
  block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
  NVMe: don't allocate unused nvme_major
  nvme: avoid crashes when node 0 is memoryless node.
  nvme: Limit command retries
  loop: Make user notify for adding loop device failed
  nvme-loop: fix nvme-loop Kconfig dependencies
  nvmet: fix return value check in nvmet_subsys_alloc()
  ...
2016-07-26 15:37:51 -07:00
Linus Torvalds d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Dan Williams 0606263f24 Merge branch 'for-4.8/libnvdimm' into libnvdimm-for-next 2016-07-24 08:05:44 -07:00
Jan Beulich aea305e11f xen-blkback: really don't leak mode property
Commit 9d092603cc ("xen-blkback: do not leak mode property") left one
path unfixed; correct this.

Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-07-22 08:24:43 -04:00
Jan Beulich 530439484d xen-blkback: constify instance of "struct attribute_group"
The functions these get passed to have been taking pointers to const
since at least 2.6.16.

Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-07-22 08:23:52 -04:00
Jan Beulich ff595325ed xen-blkfront: prefer xenbus_scanf() over xenbus_gather()
... for single items being collected: It is more typesafe (as the
compiler can check format string and to-be-written-to variable match)
and requires one less parameter to be passed.

Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2016-07-22 08:23:45 -04:00
Jan Beulich 6694389af9 xen-blkback: prefer xenbus_scanf() over xenbus_gather()
... for single items being collected: It is more typesafe (as the
compiler can check format string and to-be-written-to variable match)
and requires one less parameter to be passed.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
2016-07-22 08:23:38 -04:00
Toshi Kani 163d4baaeb block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device.  Because
block_device_operations is instantiated with 'const', this DAX
capablity may not be enabled conditinally.

In preparation for supporting DAX to device-mapper devices, add
QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
support.  This will allow to set the DAX capability based on how
mapped device is composed.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:01:01 -06:00
Christoph Hellwig 0c4de0f33b block: ensure bios return from blk_get_request are properly initialized
blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API.  Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:38:30 -06:00
Christoph Hellwig f9596695be virtio_blk: use blk_rq_map_kern
Similar to how SCSI and NVMe prepare passthrough requests.  This avoids
poking into request internals too much.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:38:29 -06:00
Christoph Hellwig 70246286e9 block: get rid of bio_rw and READA
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces.  For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense.  Any check for READA is replaced with an
explicit check for REQ_RAHEAD.  Also remove the READA alias for
REQ_RAHEAD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:37:01 -06:00
Dan Williams 7a9eb20666 pmem: kill __pmem address space
The __pmem address space was meant to annotate codepaths that touch
persistent memory and need to coordinate a call to wmb_pmem().  Now that
wmb_pmem() is gone, there is little need to keep this annotation.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-12 19:25:38 -07:00
Minfei Huang 7a6497378a loop: Make user notify for adding loop device failed
There is no error number returned if loop driver fails in function
alloc_disk to add new loop device. Add a correct error number to make
user notify in this case.

Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 16:17:18 -07:00
Jens Axboe 41d512e51b Merge branch 'for-4.8/block' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm into for-4.8/drivers
Dan writes:

"The removal of ->driverfs_dev in favor of just passing the parent
device in as a parameter to add_disk().  See below, it has received a
"Reviewed-by" from Christoph, Bart, and Johannes.

It is also a pre-requisite for Fam Zheng's work to cleanup gendisk
uevents vs attribute visibility [1].  We would extend device_add_disk()
to take an attribute_group list.

This is based off a branch of block.git/for-4.8/drivers and has
received a positive build success notification from the kbuild robot
across several configs.

[1]: "gendisk: Generate uevent after attribute available"
http://marc.info/?l=linux-virtualization&m=146725201522201&w=2"
2016-07-08 16:04:11 -06:00
Linus Torvalds ac904ae6e6 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
 "Three small fixes that have been queued up and tested for this series:

   - A bug fix for xen-blkfront from Bob Liu, fixing an issue with
     incomplete requests during migration.

   - A fix for an ancient issue in retrieving the IO priority of a
     different PID than self, preventing that task from going away while
     we access it.  From Omar.

   - A writeback fix from Tahsin, fixing a case where we'd call ihold()
     with a zero ref count inode"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix use-after-free in sys_ioprio_get()
  writeback: inode cgroup wb switch should not call ihold()
  xen-blkfront: save uncompleted reqs in blkfront_resume()
2016-07-07 15:34:09 -07:00
Al Viro b223f4e215 Merge branch 'd_real' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into work.misc 2016-06-30 23:34:49 -04:00
Jens Axboe df253e8454 Merge branch 'stable/for-jens-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus 2016-06-29 13:15:19 -06:00
Bob Liu 7b427a5953 xen-blkfront: save uncompleted reqs in blkfront_resume()
Uncompleted reqs used to be 'saved and resubmitted' in blkfront_recover() during
migration, but that's too late after multi-queue was introduced.

After a migrate to another host (which may not have multiqueue support), the
number of rings (block hardware queues) may be changed and the ring and shadow
structure will also be reallocated.

The blkfront_recover() then can't 'save and resubmit' the real
uncompleted reqs because shadow structure have been reallocated.

This patch fixes this issue by moving the 'save' logic out of
blkfront_recover() to earlier place in blkfront_resume().

The 'resubmit' is not changed and still in blkfront_recover().

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: stable@vger.kernel.org
2016-06-29 12:32:39 -04:00
Bartlomiej Zolnierkiewicz 9e2d23f19e mg_disk: fix error path in mg_probe()
MG_DISK_MAJ is defined as 0 so dynamic block major number
allocation is used by the driver and the assigned major
number is stored in host->major.  This patch fixes error
path in mg_probe() to use host->major instead of using
MG_DISK_MAJ.

Cc: unsik Kim <donari75@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-28 11:01:27 -06:00
Dan Williams 0d52c756a6 block: convert to device_add_disk()
For block drivers that specify a parent device, convert them to use
device_add_disk().

This conversion was done with the following semantic patch:

    @@
    struct gendisk *disk;
    expression E;
    @@

    - disk->driverfs_dev = E;
    ...
    - add_disk(disk);
    + device_add_disk(E, disk);

    @@
    struct gendisk *disk;
    expression E1, E2;
    @@

    - disk->driverfs_dev = E1;
    ...
    E2 = disk;
    ...
    - add_disk(E2);
    + device_add_disk(E1, E2);

...plus some manual fixups for a few missed conversions.

Cc: Jens Axboe <axboe@fb.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-27 12:26:08 -07:00
Michal Hocko 32d6bd9059 tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
This is the third version of the patchset previously sent [1].  I have
basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
rid of superfluous gfp flags" which went through dm tree.  I am sending
it now because it is tree wide and chances for conflicts are reduced
considerably when we want to target rc2.  I plan to send the next step
and rename the flag and move to a better semantic later during this
release cycle so we will have a new semantic ready for 4.8 merge window
hopefully.

Motivation:

While working on something unrelated I've checked the current usage of
__GFP_REPEAT in the tree.  It seems that a majority of the usage is and
always has been bogus because __GFP_REPEAT has always been about costly
high order allocations while we are using it for order-0 or very small
orders very often.  It seems that a big pile of them is just a
copy&paste when a code has been adopted from one arch to another.

I think it makes some sense to get rid of them because they are just
making the semantic more unclear.  Please note that GFP_REPEAT is
documented as

* __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt

* _might_ fail.  This depends upon the particular VM implementation.
  while !costly requests have basically nofail semantic.  So one could
  reasonably expect that order-0 request with __GFP_REPEAT will not loop
  for ever.  This is not implemented right now though.

I would like to move on with __GFP_REPEAT and define a better semantic
for it.

  $ git grep __GFP_REPEAT origin/master | wc -l
  111
  $ git grep __GFP_REPEAT | wc -l
  36

So we are down to the third after this patch series.  The remaining
places really seem to be relying on __GFP_REPEAT due to large allocation
requests.  This still needs some double checking which I will do later
after all the simple ones are sorted out.

I am touching a lot of arch specific code here and I hope I got it right
but as a matter of fact I even didn't compile test for some archs as I
do not have cross compiler for them.  Patches should be quite trivial to
review for stupid compile mistakes though.  The tricky parts are usually
hidden by macro definitions and thats where I would appreciate help from
arch maintainers.

[1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org

This patch (of 19):

__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.  Yet we
have the full kernel tree with its usage for apparently order-0
allocations.  This is really confusing because __GFP_REPEAT is
explicitly documented to allow allocation failures which is a weaker
semantic than the current order-0 has (basically nofail).

Let's simply drop __GFP_REPEAT from those places.  This would allow to
identify place which really need allocator to retry harder and formulate
a more specific semantic for what the flag is supposed to do actually.

Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Crispin <blogic@openwrt.org>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Lars Ellenberg 1b57e66384 drbd: correctly handle failed crypto_alloc_hash
crypto_alloc_hash returns an ERR_PTR(), not NULL.

Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
on an errno in the cleanup path.

Reported-by: Insu Yun <wuninsu@gmail.com>

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:08 -06:00
Lars Ellenberg 27ea1d876e drbd: al_write_transaction: skip re-scanning of bitmap page pointer array
For larger devices, the array of bitmap page pointers can grow very
large (8000 pointers per TB of storage).

For each activity log transaction, we need to flush the associated
bitmap pages to stable storage. Currently, we just "mark" the respective
pages while setting up the transaction, then tell the bitmap code to
write out all marked pages, but skip unchanged pages.

But one such transaction can affect only a small number of bitmap pages,
there is no need to scan the full array of several (ten-)thousand
page pointers to find the few marked ones.

Instead, remember the index numbers of the few affected pages,
and later only re-check those to skip duplicates and unchanged ones.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:08 -06:00
Lars Ellenberg 13c2088d41 drbd: finally report ms, not jiffies, in log message
Also skip the message unless bitmap IO took longer than 5 ms.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:08 -06:00
Roland Kammerer 4e526a0046 drbd: get rid of empty statement in is_valid_state
This should silence a warning about an empty statement. Thanks to Fabian
Frederick <fabf@skynet.be> who sent a patch I modified to be smaller and
avoids an additional indent level.

Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Fabian Frederick 7e5fec3168 drbd: code cleanups without semantic changes
This contains various cosmetic fixes ranging from simple typos to
const-ifying, and using booleans properly.

Original commit messages from Fabian's patch set:
drbd: debugfs: constify drbd_version_fops
drbd: use seq_put instead of seq_print where possible
drbd: include linux/uaccess.h instead of asm/uaccess.h
drbd: use const char * const for drbd strings
drbd: kerneldoc warning fix in w_e_end_data_req()
drbd: use unsigned for one bit fields
drbd: use bool for peer is_ states
drbd: fix typo
drbd: use | for bitmask combination
drbd: use true/false for bool
drbd: fix drbd_bm_init() comments
drbd: introduce peer state union
drbd: fix maybe_pull_ahead() locking comments
drbd: use bool for growing
drbd: remove redundant declarations
drbd: replace if/BUG by BUG_ON

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg 20004e2435 drbd: bump current uuid when resuming IO with diskless peer
Scenario, starting with normal operation
 Connected Primary/Secondary UpToDate/UpToDate
 NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
 ... more failures happen, secondary loses it's disk,
 but eventually is able to re-establish the replication link ...
 Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)

We used to just resume/resent suspended requests,
without bumping the UUID.

Which will lead to problems later, when we want to re-attach the disk on
the peer, without first disconnecting, or if we experience additional
failures, because we now have diverging data without being able to
recognize it.

Make sure we also bump the current data generation UUID,
if we notice "peer disk unknown" -> "peer disk known bad".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00