1
0
Fork 0
Commit Graph

526 Commits (e73c23ff736e1ea371dfa419d7bf8e77ee53044a)

Author SHA1 Message Date
Shaun Tancheff 778889d841 block: apply blk_partition_remap to REQ_OP_ZONE_RESET
If a ZBC device is partitioned and operations are performed on the partition
the zone information is rebased to the partition, however the zone reset
is not mapped from the partition to device as are other operations.

This causes the API (report zones / reset zone) to be unbalanced in this
regard. Checking for the zone reset op code explicitly will balance the
API.

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-21 15:08:24 -07:00
Ming Lei 0a6219a95f block: deal with stale req count of plug list
In both legacy and mq path, req count of plug list is computed
before allocating request, so the number can be stale when falling
back to slept allocation, also the new introduced wbt can sleep
too.

This patch deals with the case by checking if plug list becomes
empty, and fixes the KASAN report of 'BUG: KASAN: stack-out-of-bounds'
which is introduced by Shaohua's patches of dispatching big request.

Fixes: 600271d900002(blk-mq: immediately dispatch big size request)
Fixes: 50d24c34403c6(block: immediately dispatch big size request)
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 08:09:51 -07:00
Jens Axboe bbd7bb7017 block: move poll code to blk-mq
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-11 13:40:25 -07:00
Jens Axboe 87760e5eef block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.

The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.

We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 13:53:40 -07:00
Jens Axboe cf43e6be86 block: add scalable completion tracking of requests
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

The feature is off by default, to avoid any extra overhead. In-kernel
users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files, that is something we could add as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 13:53:26 -07:00
Shaohua Li 50d24c3440 block: immediately dispatch big size request
Currently block plug holds up to 16 non-mergeable requests. This makes
sense if the request size is small, eg, reduce lock contention. But if
request size is big enough, we don't need to worry about lock
contention. Holding such request makes no sense and it lows the disk
utilization.

In practice, this improves 10% throughput for my raid5 sequential write
workload.

The size (128k) is arbitrary right now, but it makes sure lock
contention is small. This probably could be more intelligent, eg, check
average request size holded. Since this is mainly for sequential IO,
probably not worthy.

V2: check the last request instead of the first request, so as long as
there is one big size request we flush the plug.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-03 22:00:36 -06:00
Christoph Hellwig ef295ecf09 block: better op and flags encoding
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:16 -06:00
Christoph Hellwig e806402130 block: split out request-only flags into a new namespace
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.

This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests.  It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:45:17 -06:00
Shaun Tancheff 2d253440b5 block: Define zoned block device operations
Define REQ_OP_ZONE_REPORT and REQ_OP_ZONE_RESET for handling zones of
host-managed and host-aware zoned block devices. With with these two
new operations, the total number of operations defined reaches 8 and
still fits with the 3 bits definition of REQ_OP_BITS.

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-18 10:02:05 -06:00
Stephen Bates 6e219353af block: add poll_considered statistic
In order to help determine the effectiveness of polling in a running
system it is usful to determine the ratio of how often the poll
function is called vs how often the completion is checked. For this
reason we add a poll_considered variable and add it to the sysfs entry
for io_poll.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 08:41:21 -06:00
Jens Axboe 27489a3c82 blk-mq: turn hctx->run_work into a regular work struct
We don't need the larger delayed work struct, since we always run it
immediately.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-29 08:13:21 -06:00
Jens Axboe ee63cfa7fc block: add kblockd_schedule_work_on()
Add a helper to schedule a regular struct work on a particular CPU.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-29 08:13:21 -06:00
Bart Van Assche 1b85608681 block: Fix race triggered by blk_set_queue_dying()
blk_set_queue_dying() can be called while another thread is
submitting I/O or changing queue flags, e.g. through dm_stop_queue().
Hence protect the QUEUE_FLAG_DYING flag change with locking.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-16 19:36:14 -06:00
Jens Axboe 1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Linus Torvalds 3fc9d69093 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This branch also contains core changes.  I've come to the conclusion
  that from 4.9 and forward, I'll be doing just a single branch.  We
  often have dependencies between core and drivers, and it's hard to
  always split them up appropriately without pulling core into drivers
  when that happens.

  That said, this contains:

   - separate secure erase type for the core block layer, from
     Christoph.

   - set of discard fixes, from Christoph.

   - bio shrinking fixes from Christoph, as a followup up to the
     op/flags change in the core branch.

   - map and append request fixes from Christoph.

   - NVMeF (NVMe over Fabrics) code from Christoph.  This is pretty
     exciting!

   - nvme-loop fixes from Arnd.

   - removal of ->driverfs_dev from Dan, after providing a
     device_add_disk() helper.

   - bcache fixes from Bhaktipriya and Yijing.

   - cdrom subchannel read fix from Vchannaiah.

   - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

   - set of drbd updates and fixes from Fabian, Lars, and Philipp.

   - mg_disk error path fix from Bart.

   - user notification for failed device add for loop, from Minfei.

   - NVMe in general:
        + NVMe delay quirk from Guilherme.
        + SR-IOV support and command retry limits from Keith.
        + fix for memory-less NUMA node from Masayoshi.
        + use UINT_MAX for discard sectors, from Minfei.
        + cancel IO fixes from Ming.
        + don't allocate unused major, from Neil.
        + error code fixup from Dan.
        + use constants for PSDT/FUSE from James.
        + variable init fix from Jay.
        + fabrics fixes from Ming, Sagi, and Wei.
        + various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
  nvme/pci: Provide SR-IOV support
  nvme: initialize variable before logical OR'ing it
  block: unexport various bio mapping helpers
  scsi/osd: open code blk_make_request
  target: stop using blk_make_request
  block: simplify and export blk_rq_append_bio
  block: ensure bios return from blk_get_request are properly initialized
  virtio_blk: use blk_rq_map_kern
  memstick: don't allow REQ_TYPE_BLOCK_PC requests
  block: shrink bio size again
  block: simplify and cleanup bvec pool handling
  block: get rid of bio_rw and READA
  block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
  block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
  NVMe: don't allocate unused nvme_major
  nvme: avoid crashes when node 0 is memoryless node.
  nvme: Limit command retries
  loop: Make user notify for adding loop device failed
  nvme-loop: fix nvme-loop Kconfig dependencies
  nvmet: fix return value check in nvmet_subsys_alloc()
  ...
2016-07-26 15:37:51 -07:00
Christoph Hellwig 4613c5f1df scsi/osd: open code blk_make_request
I wish the OSD code could simply use blk_rq_map_* helpers like
everyone else, but the complex nature of deciding if we have
DATA IN and/or DATA OUT buffers might make this impossible
(at least for a mere human like me).

But using blk_rq_append_bio at least allows sharing the setup code
between request with or without dat a buffers, and given that this
is the last user of blk_make_request it allows getting rid of that
somewhat awkward interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:38:35 -06:00
Christoph Hellwig 98d61d5b1a block: simplify and export blk_rq_append_bio
The target SCSI passthrough backend is much better served with the low-level
blk_rq_append_bio construct then the helpers built on top of it, so export it.

Also use the opportunity to remove the pointless request_queue argument and
make the code flow a little more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:38:32 -06:00
Christoph Hellwig 0c4de0f33b block: ensure bios return from blk_get_request are properly initialized
blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API.  Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:38:30 -06:00
Sagi Grimberg 9645c1a233 block: Export blk_poll
The new NVMe over fabrics target will make use of this outside from a
module.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:30:31 -06:00
Jens Axboe b8269db456 cfq-iosched: temporarily boost queue priority for idle classes
If we're queuing REQ_PRIO IO and the task is running at an idle IO
class, then temporarily boost the priority. This prevents livelocks
due to priority inversion, when a low priority task is holding file
system resources while attempting to do IO.

An example of that is shown below. An ioniced idle task is holding
the directory mutex, while a normal priority task is trying to do
a directory lookup.

[478381.198925] ------------[ cut here ]------------
[478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
[478381.201324]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[478381.203462] ionice          D ffff8803692736a8     0 1168369      1 0x00000080
[478381.203466]  ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
[478381.204589]  ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
[478381.205752]  ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
[478381.206874] Call Trace:
[478381.207253]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
[478381.208175]  [<ffffffff8177cea7>] schedule+0x37/0x90
[478381.208932]  [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250
[478381.209805]  [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50
[478381.210706]  [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0
[478381.211489]  [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110
[478381.212402]  [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90
[478381.213280]  [<ffffffff8177d616>] bit_wait_io+0x36/0x50
[478381.214063]  [<ffffffff8177d325>] __wait_on_bit+0x65/0x90
[478381.214961]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
[478381.215872]  [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90
[478381.216806]  [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40
[478381.217773]  [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30
[478381.218641]  [<ffffffff8123c557>] ext4_bread+0x57/0x70
[478381.219425]  [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380
[478381.220467]  [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170
[478381.221357]  [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0
[478381.222208]  [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510
[478381.223090]  [<ffffffff812471a2>] ext4_lookup+0x52/0x160
[478381.223882]  [<ffffffff811c401d>] lookup_real+0x1d/0x60
[478381.224675]  [<ffffffff811c4698>] __lookup_hash+0x38/0x50
[478381.225697]  [<ffffffff817745bd>] lookup_slow+0x45/0xab
[478381.226941]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
[478381.227880]  [<ffffffff811c6a42>] path_init+0xc2/0x430
[478381.228677]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
[478381.229776]  [<ffffffff811c8c57>] path_openat+0x77/0x620
[478381.230767]  [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70
[478381.232019]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
[478381.233016]  [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70
[478381.234072]  [<ffffffff811c0cb0>] do_open_execat+0x70/0x170
[478381.235039]  [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0
[478381.236051]  [<ffffffff811c214c>] do_execve+0x2c/0x30
[478381.236809]  [<ffffffff811ca392>] ? getname+0x12/0x20
[478381.237564]  [<ffffffff811c23be>] SyS_execve+0x2e/0x40
[478381.238338]  [<ffffffff81780a1d>] stub_execve+0x6d/0xa0
[478381.239126] ------------[ cut here ]------------
[478381.239915] ------------[ cut here ]------------
[478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
[478381.242673]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[478381.244902] python2.7       D ffff88005cf8fb98     0 1168375 1168248 0x00000080
[478381.244904]  ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
[478381.246023]  ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
[478381.247138]  ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
[478381.248252] Call Trace:
[478381.248630]  [<ffffffff8177cea7>] schedule+0x37/0x90
[478381.249382]  [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10
[478381.250465]  [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100
[478381.251409]  [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f
[478381.252199]  [<ffffffff817745ae>] lookup_slow+0x36/0xab
[478381.253023]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
[478381.253877]  [<ffffffff811aeb41>] ? try_charge+0xc1/0x700
[478381.254690]  [<ffffffff811c6a42>] path_init+0xc2/0x430
[478381.255525]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
[478381.256450]  [<ffffffff811c8c57>] path_openat+0x77/0x620
[478381.257256]  [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
[478381.258390]  [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720
[478381.259309]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
[478381.260139]  [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120
[478381.260962]  [<ffffffff811b95ac>] do_sys_open+0x13c/0x230
[478381.261779]  [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170
[478381.262851]  [<ffffffff811b96c2>] SyS_open+0x22/0x30
[478381.263598]  [<ffffffff81780532>] system_call_fastpath+0x12/0x17
[478381.264551] ------------[ cut here ]------------
[478381.265377] ------------[ cut here ]------------

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
2016-06-09 16:15:01 -06:00
Christoph Hellwig 288dab8a35 block: add a separate operation type for secure erase
Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09 09:52:25 -06:00
Mike Christie 28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 6296b9604f block, drivers, fs: shrink bi_rw from long to int
We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie d9d8c5c489 block: convert is_sync helpers to use REQ_OPs.
This patch converts the is_sync helpers to use separate variables
for the operation and flags.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 8fe0d473f5 block: convert merge/insert code to check for REQ_OPs.
This patch converts the block layer merging code to use separate variables
for the operation and flags, and to check req_op for the REQ_OP.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie ba568ea0a2 block: prepare elevator to use REQ_OPs.
This patch converts the elevator code to use separate variables
for the operation and flags, and to check req_op for the REQ_OP.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie e6a40b096e block: prepare request creation/destruction code to use REQ_OPs
This patch prepares *_get_request/*_put_request and freed_request,
to use separate variables for the operation and flags. In the
next patches the struct request users will be converted like
was done for bios where the op and flags are set separately.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 4993b77d3f block: copy bio op to request op
The bio users should now always be setting up the bio op. This patch
has the block layer copy that to the request.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 95fe6c1a20 block, fs, mm, drivers: use bio set/get op accessors
This patch converts the simple bi_rw use cases in the block,
drivers, mm and fs code to set/get the bio operation using
bio_set_op_attrs/bio_op

These should be simple one or two liner cases, so I just did them
in one patch. The next patches handle the more complicated
cases in a module per patch.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie a8ebb056a8 block, drivers, cgroup: use op_is_write helper instead of checking for REQ_WRITE
We currently set REQ_WRITE/WRITE for all non READ IOs
like discard, flush, writesame, etc. In the next patches where we
no longer set up the op as a bitmap, we will not be able to
detect a operation direction like writesame by testing if REQ_WRITE is
set.

This patch converts the drivers and cgroup to use the
op_is_write helper. This should just cover the simple
cases. I did dm, md and bcache in their own patches
because they were more involved.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 4e49ea4a3d block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Jens Axboe c888a8f95a block: kill off q->flush_flags
Now that we converted everything to the newer block write cache
interface, kill off the queue flush_flags and queueable flush
entries.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-13 13:33:19 -06:00
Ming Lin 37e58237a1 block: add offset in blk_add_request_payload()
We could kmalloc() the payload, so need the offset in page.

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 13:13:23 -06:00
Kirill A. Shutemov 09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Linus Torvalds fcab86add7 Merge branch 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:

 - ahci grew runtime power management support so that the controller can
   be turned off if no devices are attached.

 - sata_via isn't dead yet.  It got hotplug support and more refined
   workaround for certain WD drives.

 - Misc cleanups.  There's a merge from for-4.5-fixes to avoid confusing
   conflicts in ahci PCI ID table.

* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  ata: ahci_xgene: dereferencing uninitialized pointer in probe
  AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs
  ata: sata_rcar: Use ARCH_RENESAS
  sata_via: Implement hotplug for VT6421
  sata_via: Apply WD workaround only when needed on VT6421
  ahci: Add runtime PM support for the host controller
  ahci: Add functions to manage runtime PM of AHCI ports
  ahci: Convert driver to use modern PM hooks
  ahci: Cache host controller version
  scsi: Drop runtime PM usage count after host is added
  scsi: Set request queue runtime PM status back to active on resume
  block: Add blk_set_runtime_active()
  ata: ahci_mvebu: add support for Armada 3700 variant
  libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show()
  libata: support AHCI on OCTEON platform
2016-03-18 20:06:46 -07:00
Mike Snitzer 6acfe68bac dm: fix excessive dm-mq context switching
Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
than if an underlying null_blk device were used directly.  One of the
reasons for this drop in performance is that blk_insert_clone_request()
was calling blk_mq_insert_request() with @async=true.  This forced the
use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
which ushered in ping-ponging between process context (fio in this case)
and kblockd's kworker to submit the cloned request.  The ftrace
function_graph tracer showed:

  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...
  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...

Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
_not_ use kblockd to submit the cloned requests isn't enough to
eliminate the observed context switches.

In addition to this dm-mq specific blk-core fix, there are 2 DM core
fixes to dm-mq that (when paired with the blk-core fix) completely
eliminate the observed context switching:

1)  don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by desire to reduce overhead of dm-mq, punting to kblockd
    just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this).  So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e !

2)  use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request).  Using blk_mq_complete_request() for
    blk-mq requests is important for performance.  It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

dm-mq fix #2 is _much_ more important than #1 for eliminating the
context switches.
Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

With these changes multithreaded async read IOPs improved from ~950K
to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
IOPs of the underlying null_blk device for the same workload is ~1950K.

Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
Cc: stable@vger.kernel.org # 4.1+
Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
2016-02-22 11:04:40 -05:00
Mika Westerberg d07ab6d114 block: Add blk_set_runtime_active()
If block device is left runtime suspended during system suspend, resume
hook of the driver typically corrects runtime PM status of the device back
to "active" after it is resumed. However, this is not enough as queue's
runtime PM status is still "suspended". As long as it is in this state
blk_pm_peek_request() returns NULL and thus prevents new requests to be
processed.

Add new function blk_set_runtime_active() that can be used to force the
queue status back to "active" as needed.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-02-19 10:52:45 -05:00
James Bottomley 12ffbbe94d Merge remote-tracking branch 'mkp-scsi/4.5/scsi-fixes' into fixes 2016-02-04 21:37:52 -08:00
Martin K. Petersen 0fb5b1fb30 block/sd: Return -EREMOTEIO when WRITE SAME and DISCARD are disabled
When a storage device rejects a WRITE SAME command we will disable write
same functionality for the device and return -EREMOTEIO to the block
layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or
failing the path.

Yiwen Jiang discovered a small race where WRITE SAME requests issued
simultaneously would cause -EIO to be returned. This happened because
any requests being prepared after WRITE SAME had been disabled for the
device caused us to return BLKPREP_KILL. The latter caused the block
layer to return -EIO upon completion.

To overcome this we introduce BLKPREP_INVALID which indicates that this
is an invalid request for the device. blk_peek_request() is modified to
return -EREMOTEIO in that case.

Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Suggested-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Reviewed-by: Ewan Milne <emilne@redhat.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-02-04 22:42:58 -05:00
Linus Torvalds 3e1e21c7bf Merge branch 'for-4.5/nvme' of git://git.kernel.dk/linux-block
Pull NVMe updates from Jens Axboe:
 "Last branch for this series is the nvme changes.  It's in a separate
  branch to avoid splitting too much between core and NVMe changes,
  since NVMe is still helping drive some blk-mq changes.  That said, not
  a huge amount of core changes in here.  The grunt of the work is the
  continued split of the code"

* 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits)
  uapi: update install list after nvme.h rename
  NVMe: Export NVMe attributes to sysfs group
  NVMe: Shutdown controller only for power-off
  NVMe: IO queue deletion re-write
  NVMe: Remove queue freezing on resets
  NVMe: Use a retryable error code on reset
  NVMe: Fix admin queue ring wrap
  nvme: make SG_IO support optional
  nvme: fixes for NVME_IOCTL_IO_CMD on the char device
  nvme: synchronize access to ctrl->namespaces
  nvme: Move nvme_freeze/unfreeze_queues to nvme core
  PCI/AER: include header file
  NVMe: Export namespace attributes to sysfs
  NVMe: Add pci error handlers
  block: remove REQ_NO_TIMEOUT flag
  nvme: merge iod and cmd_info
  nvme: meta_sg doesn't have to be an array
  nvme: properly free resources for cancelled command
  nvme: simplify completion handling
  nvme: special case AEN requests
  ...
2016-01-21 19:58:02 -08:00
Linus Torvalds 7c24d9f3b2 Merge branch 'for-4.5/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "We don't have a lot of core changes this time around, it's mostly in
  drivers, which will come in a subsequent pull.

  The cores changes include:

   - blk-mq
        - Prep patch from Christoph, changing blk_mq_alloc_request() to
          take flags instead of just using gfp_t for sleep/nosleep.
        - Doc patch from me, clarifying the difference between legacy
          and blk-mq for timer usage.
        - Fixes from Raghavendra for memory-less numa nodes, and a reuse
          of CPU masks.

   - Cleanup from Geliang Tang, using offset_in_page() instead of open
     coding it.

   - From Ilya, rename request_queue slab to it reflects what it holds,
     and a fix for proper use of bdgrab/put.

   - A real fix for the split across stripe boundaries from Keith.  We
     yanked a broken version of this from 4.4-rc final, this one works.

   - From Mike Krinkin, emit a trace message when we split.

   - From Wei Tang, two small cleanups, not explicitly clearing memory
     that is already cleared"

* 'for-4.5/core' of git://git.kernel.dk/linux-block:
  block: use bd{grab,put}() instead of open-coding
  block: split bios to max possible length
  block: add call to split trace point
  blk-mq: Avoid memoryless numa node encoded in hctx numa_node
  blk-mq: Reuse hardware context cpumask for tags
  blk-mq: add a flags parameter to blk_mq_alloc_request
  Revert "blk-flush: Queue through IO scheduler when flush not required"
  block: clarify blk_add_timer() use case for blk-mq
  bio: use offset_in_page macro
  block: do not initialise statics to 0 or NULL
  block: do not initialise globals to 0 or NULL
  block: rename request_queue slab cache
2016-01-19 15:03:34 -08:00
Jens Axboe 21491412f2 block: add blk_start_queue_async()
We currently only have an inline/sync helper to restart a stopped
queue. If drivers need an async version, they have to roll their
own. Add a generic helper instead.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-28 13:07:07 -07:00
Junichi Nomura 23688bf4f8 block: ensure to split after potentially bouncing a bio
blk_queue_bio() does split then bounce, which makes the segment
counting based on pages before bouncing and could go wrong. Move
the split to after bouncing, like we do for blk-mq, and the we
fix the issue of having the bio count for segments be wrong.

Fixes: 54efd50bfd ("block: make generic_make_request handle arbitrarily sized bios")
Cc: stable@vger.kernel.org
Tested-by: Artem S. Tashkinov <t.artem@lycos.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22 10:26:53 -07:00
Christoph Hellwig 287922eb0b block: defer timeouts to a workqueue
Timer context is not very useful for drivers to perform any meaningful abort
action from.  So instead of calling the driver from this useless context
defer it to a workqueue as soon as possible.

Note that while a delayed_work item would seem the right thing here I didn't
dare to use it due to the magic in blk_add_timer that pokes deep into timer
internals.  But maybe this encourages Tejun to add a sensible API for that to
the workqueue API and we'll all be fine in the end :)

Contains a major update from Keith Bush:

"This patch removes synchronizing the timeout work so that the timer can
 start a freeze on its own queue. The timer enters the queue, so timer
 context can only start a freeze, but not wait for frozen."

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22 09:38:16 -07:00
Ken Xue 4fd41a8552 SCSI: Fix NULL pointer dereference in runtime PM
The routines in scsi_pm.c assume that if a runtime-PM callback is
invoked for a SCSI device, it can only mean that the device's driver
has asked the block layer to handle the runtime power management (by
calling blk_pm_runtime_init(), which among other things sets q->dev).

However, this assumption turns out to be wrong for things like the ses
driver.  Normally ses devices are not allowed to do runtime PM, but
userspace can override this setting.  If this happens, the kernel gets
a NULL pointer dereference when blk_post_runtime_resume() tries to use
the uninitialized q->dev pointer.

This patch fixes the problem by checking q->dev in block layer before
handle runtime PM. Since ses doesn't define any PM callbacks and call
blk_pm_runtime_init(), the crash won't occur.

This fixes Bugzilla #101371.
https://bugzilla.kernel.org/show_bug.cgi?id=101371

More discussion can be found from below link.
http://marc.info/?l=linux-scsi&m=144163730531875&w=2

Signed-off-by: Ken Xue <Ken.Xue@amd.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Cc: Xiangliang Yu <Xiangliang.Yu@amd.com>
Cc: James E.J. Bottomley <JBottomley@odin.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Terry <Michael.terry@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-03 20:35:02 -07:00
Christoph Hellwig 6f3b0e8bcf blk-mq: add a flags parameter to blk_mq_alloc_request
We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t.  Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-01 10:53:59 -07:00
Hannes Reinecke bf4e6b4e75 block: Always check queue limits for cloned requests
When a cloned request is retried on other queues it always needs
to be checked against the queue limits of that queue.
Otherwise the calculations for nr_phys_segments might be wrong,
leading to a crash in scsi_init_sgtable().

To clarify this the patch renames blk_rq_check_limits()
to blk_cloned_rq_check_limits() and removes the symbol
export, as the new function should only be used for
cloned requests and never exported.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Fixes: e2a60da74 ("block: Clean up special command handling logic")
Cc: stable@vger.kernel.org # 3.7+
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-29 14:37:27 -07:00
Wei Tang d674d4145e block: do not initialise globals to 0 or NULL
This patch fixes the checkpatch.pl error to blk-exec.c:

ERROR: do not initialise globals to 0 or NULL

Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-24 15:24:25 -07:00
Ilya Dryomov c2789bd403 block: rename request_queue slab cache
Name the cache after the actual name of the struct.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-24 15:24:25 -07:00
Randy Dunlap ccc2600b8a block: fix blk-core.c kernel-doc warning
Fix kernel-doc warning in blk-core.c:

Warning(..//block/blk-core.c:1549): No description found for parameter 'same_queue_rq'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-11 09:36:57 -07:00