1
0
Fork 0
Commit Graph

826651 Commits (47ca23c1176a9e499d4652b33e398a7ddb0bef47)

Author SHA1 Message Date
Sagi Grimberg 6f53e73b9e nvmet: protect discovery change log event list iteration
When we iterate on the discovery subsystem controllers
we need to protect against concurrent mutations to it.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:18:47 -04:00
Christoph Hellwig 893a74b7a7 nvme: mark nvme_core_init and nvme_core_exit static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-01 09:18:47 -04:00
Christoph Hellwig 811015409f nvme: move command size checks to the core
Most command aren't PCIe specific, so move the size checking for them
to core.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-01 09:18:36 -04:00
Minwoo Im a2faf94e57 nvme-fabrics: check more command sizes
struct common_command provides a common structure for NVMe-oF command
format.  It also needs to be checked for unintended size growth.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:27 -04:00
Minwoo Im a97234e1ff nvme-pci: check more command sizes
All the NVMe command has 64bytes fixed size so that it has been assured
with BUILD_BUG_ON().  The remaining command structures in linux/nvme.h
also need to be checked here.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:27 -04:00
Minwoo Im 665648673e nvme-pci: remove an unneeded variable initialization
Variable "n" will be assigned once kstrtoint() succeeds, otherwise it
will not be referred because kstrtoint() will return an error which
means go out from this function.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:27 -04:00
Keith Busch c8e9e9b764 nvme-pci: unquiesce admin queue on shutdown
Just like IO queues, the admin queue also will not be restarted after a
controller shutdown. Unquiesce this queue so that we do not block
request dispatch on a permanently disabled controller.

Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:22 -04:00
Keith Busch 9dc1a38ef1 nvme-pci: shutdown on timeout during deletion
We do not restart a controller in a deleting state for timeout errors.
When in this state, unblock potential request dispatchers with failed
completions by shutting down the controller on timeout detection.

Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:18 -04:00
Klaus Birkelund Jensen 049bf37262 nvme-pci: fix psdt field for single segment sgls
The shortcut for single segment SGL requests did not set the PSDT field
to mark the request as using SGLs.

Fixes: 297910571f ("nvme-pci: optimize mapping single segment requests using SGLs")
Signed-off-by: Klaus Birkelund Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:16 -04:00
Hannes Reinecke 592b6e7b02 nvme-multipath: don't print ANA group state by default
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:16 -04:00
Hannes Reinecke 525aa5a705 nvme-multipath: split bios with the ns_head bio_set before submitting
If the bio is moved to a different queue via blk_steal_bios() and
the original queue is destroyed in nvme_remove_ns() we'll be ending
with a crash in bio_endio() as the mempool for the split bio bvecs
had already been destroyed.
So split the bio using the original queue (which will remain during the
lifetime of the bio) before sending it down to the underlying device.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:15 -04:00
Sagi Grimberg f34e25898a nvme-tcp: fix possible null deref on a timed out io queue connect
If I/O queue connect times out, we might have freed the queue socket
already, so check for that on the error path in nvme_tcp_start_queue.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-01 09:17:15 -04:00
Jens Axboe 2d5abb9a1e bcache: make is_discard_enabled() static
It's not used outside this file.

Fixes: 631207314d ("bcache: fix failure in journal relplay")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-01 06:34:09 -06:00
Christoph Hellwig 12adb7a013 block: remove the unused blk_queue_dma_pad function
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 16:12:36 -06:00
Christoph Hellwig 3dcf60bcb6 block: add SPDX tags to block layer files missing licensing information
Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 16:12:03 -06:00
Christoph Hellwig 6353599813 block: add a SPDX tag to blk-mq-rdma.h
This file has no copyright notice, but was added as part of a commit
adding another file using the default kernel GPLv2 license.  Add
a matching SPDX tag.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 16:12:02 -06:00
Christoph Hellwig 9fcd030baa sed-opal.h: remove redundant licence boilerplate
The file already has the correct SPDX header.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 16:12:00 -06:00
Christoph Hellwig a497ee34a4 block: switch all files cleared marked as GPLv2 or later to SPDX tags
All these files have some form of the usual GPLv2 or later boilerplate.
Switch them to use SPDX tags instead.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 16:11:59 -06:00
Christoph Hellwig 8c16567d86 block: switch all files cleared marked as GPLv2 to SPDX tags
All these files have some form of the usual GPLv2 boilerplate.  Switch
them to use SPDX tags instead.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 16:11:57 -06:00
Christoph Hellwig dcdca753c1 block: clean up __bio_add_pc_page a bit
Share the bi_size update by moving the done label up, and duplicate
the bv_len update in the two callers to get rid of the bvec_merge
label.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:44 -06:00
Christoph Hellwig 6601e44efd block: remove bogus comments in __bio_add_pc_page
We are never called with file system pages by defintions for the
passthrough interface, and we also never undo any addition later
these days.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:42 -06:00
Christoph Hellwig 4713839dfe block: remove the __bio_add_pc_page export
The same page optimization is a rather odd corner case, which is not
used outside bio.c and which really should not be used outside of bio.c
either - we have better highlevel helpers like the rq/bio mapping
helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:41 -06:00
Christoph Hellwig 2b070cfe58 block: remove the i argument to bio_for_each_segment_all
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:13 -06:00
Christoph Hellwig f936b06ae5 bcache: clean up do_btree_node_write a bit
Use a variable containing the buffer address instead of the to be
removed integer iterator from bio_for_each_segment_all.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:11 -06:00
Coly Li cdca22bcbc bcache: remove redundant LIST_HEAD(journal) from run_cache_set()
Commit 95f18c9d13 ("bcache: avoid potential memleak of list of
journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgets
to remove the original define of LIST_HEAD(journal), which makes
the change no take effect. This patch removes redundant variable
LIST_HEAD(journal) from run_cache_set(), to make Shenghui's fix
working.

Fixes: 95f18c9d13 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set")
Reported-by: Juha Aatrokoski <juha.aatrokoski@aalto.fi>
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 08:20:46 -06:00
Jens Axboe 41d7f2ed84 Merge branch 'nvme-5.2' of git://git.infradead.org/nvme into for-5.2/block
Pull NVMe changes from Christoph.

* 'nvme-5.2' of git://git.infradead.org/nvme:
  nvme: set 0 capacity if namespace block size exceeds PAGE_SIZE
  nvme-rdma: fix typo in struct comment
  nvme-loop: kill timeout handler
  nvme-tcp: rename function to have nvme_tcp prefix
  nvme-rdma: fix a NULL deref when an admin connect times out
  nvme-tcp: fix a NULL deref when an admin connect times out
  nvmet-tcp: don't fail maxr2t greater than 1
  nvmet-file: clamp-down file namespace lba_shift
  nvmet: include <linux/scatterlist.h>
  nvmet: return a specified error it subsys_alloc fails
  nvmet: rename nvme_completion instances from rsp to cqe
  nvmet-rdma: remove p2p_client initialization from fast-path
2019-04-26 10:25:19 -06:00
Christoph Hellwig cc6be13159 mtip32xx: remove trim support
The trim support in mtip32xx has been "temporarily" disabled for 6
years, which is 3/4 of the time the driver even exists in the tree.

Remove it as it obviously is dead code now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-25 09:11:53 -06:00
Sagi Grimberg 01fa017484 nvme: set 0 capacity if namespace block size exceeds PAGE_SIZE
If our target exposed a namespace with a block size that is greater
than PAGE_SIZE, set 0 capacity on the namespace as we do not support it.

This issue encountered when the nvmet namespace was backed by a tempfile.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:42 +02:00
Minwoo Im 82bebbde02 nvme-rdma: fix typo in struct comment
struct nvme_rdma_cm_rej has two different attributes: recfmt and sts.
And sts will have value what this comment wanted to show.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:42 +02:00
Ming Lei 663d6fee66 nvme-loop: kill timeout handler
Firstly it doesn't make sense to handle timeout for loop: 1) for admin
queue, the request is always completed in code path of queuing IO. 2)
for normal IO request, the timeout on these IOs have been handled by
underlying queue already.

Secondly nvme-loop's timeout handler is simply broken, and easy to
cause issue: 1) no any sync/protection between timeout and normal
completion, and now it is driver's responsibility to deal with
that; 2) bad reset implementation, blk_mq_update_nr_hw_queues()
is called after all NSs's queue is stopped(quiesced), and easy
to trigger deadlock.

So kill the timeout handler.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewd-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:41 +02:00
Sagi Grimberg efb973b19b nvme-tcp: rename function to have nvme_tcp prefix
usually nvme_ prefix is for core functions.
While we're cleaning up, remove redundant empty lines

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:41 +02:00
Sagi Grimberg 1007709d7d nvme-rdma: fix a NULL deref when an admin connect times out
If we timeout the admin startup sequence we might not yet have
an I/O tagset allocated which causes the teardown sequence to crash.
Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags
if the tagset wasn't allocated.

Fixes: 4c174e6366 ("nvme-rdma: fix timeout handler")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:39 +02:00
Sagi Grimberg 7a42589654 nvme-tcp: fix a NULL deref when an admin connect times out
If we timeout the admin startup sequence we might not yet have
an I/O tagset allocated which causes the teardown sequence to crash.
Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags
if the tagset wasn't allocated.

Fixes: 39d5775746 ("nvme-tcp: fix timeout handler")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:37 +02:00
Sagi Grimberg 569b3d3db1 nvmet-tcp: don't fail maxr2t greater than 1
The host may support it, but nothing prevents us from
sending a single r2t at a time like we do anyways.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:19 +02:00
Sagi Grimberg 525ec495e0 nvmet-file: clamp-down file namespace lba_shift
When the backing file is a tempfile for example, the inode i_blkbits
can be 1M in size which causes problems for hosts to support as the
disk block size. Instead, expose the minimum between i_blkbits and
12 (4K sector size).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by:- Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:19 +02:00
Enrico Weigelt, metux IT consult a5dffbb66d nvmet: include <linux/scatterlist.h>
Build breaks:

    drivers/nvme/target/core.c: In function 'nvmet_req_alloc_sgl':
    drivers/nvme/target/core.c:939:12: error: implicit declaration of \
function 'sgl_alloc'; did you mean 'bio_alloc'? \
[-Werror=implicit-function-declaration]
      req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
                ^~~~~~~~~
                bio_alloc
    drivers/nvme/target/core.c:939:10: warning: assignment makes pointer \
from integer without a cast [-Wint-conversion]
      req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
              ^
    drivers/nvme/target/core.c: In function 'nvmet_req_free_sgl':
    drivers/nvme/target/core.c:952:3: error: implicit declaration of \
function 'sgl_free'; did you mean 'ida_free'? [-Werror=implicit-function-declaration]
       sgl_free(req->sg);
       ^~~~~~~~
       ida_free

Cause:

    1. missing include to <linux/scatterlist.h>
    2. SGL_ALLOC needs to be enabled

Therefore adding the missing include, as well as Kconfig dependency.

Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:47:03 +02:00
Minwoo Im 6b7e631b92 nvmet: return a specified error it subsys_alloc fails
nvmet_subsys_alloc() returns its pointer or NULL if it fails.  We can
see three different steps in this function:
  1. memory allocation
  2. argument check
  3. memory allocation for string

But now the callers of this function do not seem to handle case 2 by
returning -ENOMEM only even if it fails with an invalid parameter.

This patch specifies error codes so that caller can pass it to its own
caller.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:41:26 +02:00
Max Gurtovoy fc6c973072 nvmet: rename nvme_completion instances from rsp to cqe
Use NVMe namings for improving code readability.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by : Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:41:26 +02:00
Max Gurtovoy 8dc2ed3f3e nvmet-rdma: remove p2p_client initialization from fast-path
Initialize it during command allocation.

Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:41:26 +02:00
Shenghui Wang 95f18c9d13 bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set
In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
to collect journal_replay(s) and filled by bch_journal_read().

If all goes well, bch_journal_replay() will release the list of
jounal_replay(s) at the end of the branch.

If something goes wrong, code flow will jump to the label "err:" and leave
the list unreleased.

This patch will release the list of journal_replay(s) in the case of
error detected.

v1 -> v2:
* Move the release code to the location after label 'err:' to
  simply the change.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:29 -06:00
Shenghui Wang f16277ca20 bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce
Elements of keylist should be accessed before the list is freed.
Move bch_keylist_free() calling after the while loop to avoid wrong
content accessed.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:29 -06:00
Tang Junhui 631207314d bcache: fix failure in journal relplay
journal replay failed with messages:
Sep 10 19:10:43 ceph kernel: bcache: error on
bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries
2057493-2057567 missing! (replaying 2057493-2076601), disabling
caching

The reason is in journal_reclaim(), when discard is enabled, we send
discard command and reclaim those journal buckets whose seq is old
than the last_seq_now, but before we write a journal with last_seq_now,
the machine is restarted, so the journal with the last_seq_now is not
written to the journal bucket, and the last_seq_wrote in the newest
journal is old than last_seq_now which we expect to be, so when we doing
replay, journals from last_seq_wrote to last_seq_now are missing.

It's hard to write a journal immediately after journal_reclaim(),
and it harmless if those missed journal are caused by discarding
since those journals are already wrote to btree node. So, if miss
seqs are started from the beginning journal, we treat it as normal,
and only print a message to show the miss journal, and point out
it maybe caused by discarding.

Patch v2 add a judgement condition to ignore the missed journal
only when discard enabled as Coly suggested.

(Coly Li: rebase the patch with other changes in bch_journal_replay())

Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Tested-by: Dennis Schridde <devurandom@gmx.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li eb8cbb6df3 bcache: improve bcache_reboot()
This patch tries to release mutex bch_register_lock early, to give
chance to stop cache set and bcache device early.

This patch also expends time out of stopping all bcache device from
2 seconds to 10 seconds, because stopping writeback rate update worker
may delay for 5 seconds, 2 seconds is not enough.

After this patch applied, stopping bcache devices during system reboot
or shutdown is very hard to be observed any more.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li 63d63b51d7 bcache: add comments for closure_fn to be called in closure_queue()
Add code comments to explain which call back function might be called
for the closure_queue(). This is an effort to make code to be more
understandable for readers.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li bb6d355c2a bcache: Add comments for blkdev_put() in registration code path
Add comments to explain why in register_bcache() blkdev_put() won't
be called in two location. Add comments to explain why blkdev_put()
must be called in register_cache() when cache_alloc() failed.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li 88c12d42d2 bcache: add error check for calling register_bdev()
This patch adds return value to register_bdev(). Then if failure happens
inside register_bdev(), its caller register_bcache() may detect and
handle the failure more properly.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li 68d10e6979 bcache: return error immediately in bch_journal_replay()
When failure happens inside bch_journal_replay(), calling
cache_set_err_on() and handling the failure in async way is not a good
idea. Because after bch_journal_replay() returns, registering code will
continue to execute following steps, and unregistering code triggered
by cache_set_err_on() is running in same time. First it is unnecessary
to handle failure and unregister cache set in an async way, second there
might be potential race condition to run register and unregister code
for same cache set.

So in this patch, if failure happens in bch_journal_replay(), we don't
call cache_set_err_on(), and just print out the same error message to
kernel message buffer, then return -EIO immediately caller. Then caller
can detect such failure and handle it in synchrnozied way.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li 2d17456eb1 bcache: add comments for kobj release callback routine
Bcache has several routines to release resources in implicit way, they
are called when the associated kobj released. This patch adds code
comments to notice when and which release callback will be called,
- When dc->disk.kobj released:
  void bch_cached_dev_release(struct kobject *kobj)
- When d->kobj released:
  void bch_flash_dev_release(struct kobject *kobj)
- When c->kobj released:
  void bch_cache_set_release(struct kobject *kobj)
- When ca->kobj released
  void bch_cache_release(struct kobject *kobj)

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li ce3e4cfb59 bcache: add failure check to run_cache_set() for journal replay
Currently run_cache_set() has no return value, if there is failure in
bch_journal_replay(), the caller of run_cache_set() has no idea about
such failure and just continue to execute following code after
run_cache_set().  The internal failure is triggered inside
bch_journal_replay() and being handled in async way. This behavior is
inefficient, while failure handling inside bch_journal_replay(), cache
register code is still running to start the cache set. Registering and
unregistering code running as same time may introduce some rare race
condition, and make the code to be more hard to be understood.

This patch adds return value to run_cache_set(), and returns -EIO if
bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
such failure and stop registering code flow immedidately inside
register_cache_set().

If journal replay fails, run_cache_set() can report error immediately
to register_cache_set(). This patch makes the failure handling for
bch_journal_replay() be in synchronized way, easier to understand and
debug, and avoid poetential race condition for register-and-unregister
in same time.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:28 -06:00
Coly Li 1bee2addc0 bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
In journal_reclaim() ja->cur_idx of each cache will be update to
reclaim available journal buckets. Variable 'int n' is used to count how
many cache is successfully reclaimed, then n is set to c->journal.key
by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
loop will write the jset data onto each cache.

The problem is, if all jouranl buckets on each cache is full, the
following code in journal_reclaim(),

529 for_each_cache(ca, c, iter) {
530       struct journal_device *ja = &ca->journal;
531       unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533       /* No space available on this device */
534       if (next == ja->discard_idx)
535               continue;
536
537       ja->cur_idx = next;
538       k->ptr[n++] = MAKE_PTR(0,
539                         bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540                         ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);

If there is no available bucket to reclaim, the if() condition at line
534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
will set KEY_PTRS field of c->journal.key to 0.

Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
journal_write_unlocked() the journal data is written in following loop,

649	for (i = 0; i < KEY_PTRS(k); i++) {
650-671		submit journal data to cache device
672	}

If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data
won't be written to cache device here. If system crahed or rebooted
before bkeys of the lost journal entries written into btree nodes, data
corruption will be reported during bcache reload after rebooting the
system.

Indeed there is only one cache in a cache set, there is no need to set
KEY_PTRS field in journal_reclaim() at all. But in order to keep the
for_each_cache() logic consistent for now, this patch fixes the above
problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
available to reclaim.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-24 10:56:27 -06:00