From 778c02a236a8728bb992de10ed1f12c0be5b7b0e Mon Sep 17 00:00:00 2001 From: Paolo Valente Date: Tue, 12 Mar 2019 09:59:27 +0100 Subject: [PATCH 001/164] block, bfq: increase idling for weight-raised queues MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit If a sync bfq_queue has a higher weight than some other queue, and remains temporarily empty while in service, then, to preserve the bandwidth share of the queue, it is necessary to plug I/O dispatching until a new request arrives for the queue. In addition, a timeout needs to be set, to avoid waiting for ever if the process associated with the queue has actually finished its I/O. Even with the above timeout, the device is however not fed with new I/O for a while, if the process has finished its I/O. If this happens often, then throughput drops and latencies grow. For this reason, the timeout is kept rather low: 8 ms is the current default. Unfortunately, such a low value may cause, on the opposite end, a violation of bandwidth guarantees for a process that happens to issue new I/O too late. The higher the system load, the higher the probability that this happens to some process. This is a problem in scenarios where service guarantees matter more than throughput. One important case are weight-raised queues, which need to be granted a very high fraction of the bandwidth. To address this issue, this commit lower-bounds the plugging timeout for weight-raised queues to 20 ms. This simple change provides relevant benefits. For example, on a PLEXTOR PX-256M5S, with which gnome-terminal starts in 0.6 seconds if there is no other I/O in progress, the same applications starts in - 0.8 seconds, instead of 1.2 seconds, if ten files are being read sequentially in parallel - 1 second, instead of 2 seconds, if, in parallel, five files are being read sequentially, and five more files are being written sequentially Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- block/bfq-iosched.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index fac188dd78fa..f30d1cb887d4 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -2545,6 +2545,8 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd) if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 && bfq_symmetric_scenario(bfqd)) sl = min_t(u64, sl, BFQ_MIN_TT); + else if (bfqq->wr_coeff > 1) + sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC); bfqd->last_idling_start = ktime_get(); hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl), From fb53ac6cd0269987b1b77f957db453b3ec7bf7e4 Mon Sep 17 00:00:00 2001 From: Paolo Valente Date: Tue, 12 Mar 2019 09:59:28 +0100 Subject: [PATCH 002/164] block, bfq: do not idle for lowest-weight queues MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In most cases, it is detrimental for throughput to plug I/O dispatch when the in-service bfq_queue becomes temporarily empty (plugging is performed to wait for the possible arrival, soon, of new I/O from the in-service queue). There is however a case where plugging is needed for service guarantees. If a bfq_queue, say Q, has a higher weight than some other active bfq_queue, and is sync, i.e., contains sync I/O, then, to guarantee that Q does receive a higher share of the throughput than other lower-weight queues, it is necessary to plug I/O dispatch when Q remains temporarily empty while being served. 
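As a side note on how long such a plug lasts: the previous patch adjusted the
idling duration chosen by bfq_arm_slice_timer(). The following simplified,
self-contained sketch summarizes that choice; helper and parameter names, as
well as the BFQ_MIN_TT value used here, are illustrative only, and the real
code operates on struct bfq_data and struct bfq_queue.

#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC   1000000ULL
#define BFQ_MIN_TT      (2ULL * NSEC_PER_MSEC)  /* short idle for seeky queues */

/* simplified model of the duration picked by bfq_arm_slice_timer() */
static uint64_t idling_duration_ns(uint64_t slice_idle_ns, bool seeky,
                                   bool symmetric, unsigned int wr_coeff)
{
        uint64_t sl = slice_idle_ns;

        if (seeky && wr_coeff == 1 && symmetric) {
                /* no service guarantee at stake: idle as little as possible */
                if (sl > BFQ_MIN_TT)
                        sl = BFQ_MIN_TT;
        } else if (wr_coeff > 1) {
                /* weight-raised queue: never plug for less than 20 ms */
                if (sl < 20ULL * NSEC_PER_MSEC)
                        sl = 20ULL * NSEC_PER_MSEC;
        }
        return sl;
}

With the default 8 ms slice_idle, a weight-raised queue thus gets a 20 ms
plugging window, while the behavior in the other cases is unchanged.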
For this reason, BFQ performs I/O plugging when some active bfq_queue has a higher weight than some other active bfq_queue. But this is unnecessarily overkill. In fact, if the in-service bfq_queue actually has a weight lower than or equal to the other queues, then the queue *must not* be guaranteed a higher share of the throughput than the other queues. So, not plugging I/O cannot cause any harm to the queue. And can boost throughput. Taking advantage of this fact, this commit does not plug I/O for sync bfq_queues with a weight lower than or equal to the weights of the other queues. Here is an example of the resulting throughput boost with the dbench workload, which is particularly nasty for BFQ. With the dbench test in the Phoronix suite, BFQ reaches its lowest total throughput with 6 clients on a filesystem with journaling, in case the journaling daemon has a higher weight than normal processes. Before this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5, after this commit it is ~100 MB/sec. Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- block/bfq-iosched.c | 204 +++++++++++++++++++++++++------------------- block/bfq-iosched.h | 6 +- block/bfq-wf2q.c | 2 +- 3 files changed, 118 insertions(+), 94 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index f30d1cb887d4..2eb587fe7c1a 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -629,12 +629,19 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq) } /* - * The following function returns true if every queue must receive the - * same share of the throughput (this condition is used when deciding - * whether idling may be disabled, see the comments in the function - * bfq_better_to_idle()). + * The following function returns false either if every active queue + * must receive the same share of the throughput (symmetric scenario), + * or, as a special case, if bfqq must receive a share of the + * throughput lower than or equal to the share that every other active + * queue must receive. If bfqq does sync I/O, then these are the only + * two cases where bfqq happens to be guaranteed its share of the + * throughput even if I/O dispatching is not plugged when bfqq remains + * temporarily empty (for more details, see the comments in the + * function bfq_better_to_idle()). For this reason, the return value + * of this function is used to check whether I/O-dispatch plugging can + * be avoided. * - * Such a scenario occurs when: + * The above first case (symmetric scenario) occurs when: * 1) all active queues have the same weight, * 2) all active queues belong to the same I/O-priority class, * 3) all active groups at the same level in the groups tree have the same @@ -654,30 +661,36 @@ void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq) * support or the cgroups interface are not enabled, thus no state * needs to be maintained in this case. */ -static bool bfq_symmetric_scenario(struct bfq_data *bfqd) +static bool bfq_asymmetric_scenario(struct bfq_data *bfqd, + struct bfq_queue *bfqq) { + bool smallest_weight = bfqq && + bfqq->weight_counter && + bfqq->weight_counter == + container_of( + rb_first_cached(&bfqd->queue_weights_tree), + struct bfq_weight_counter, + weights_node); + /* * For queue weights to differ, queue_weights_tree must contain * at least two nodes. 
*/ - bool varied_queue_weights = !RB_EMPTY_ROOT(&bfqd->queue_weights_tree) && - (bfqd->queue_weights_tree.rb_node->rb_left || - bfqd->queue_weights_tree.rb_node->rb_right); + bool varied_queue_weights = !smallest_weight && + !RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) && + (bfqd->queue_weights_tree.rb_root.rb_node->rb_left || + bfqd->queue_weights_tree.rb_root.rb_node->rb_right); bool multiple_classes_busy = (bfqd->busy_queues[0] && bfqd->busy_queues[1]) || (bfqd->busy_queues[0] && bfqd->busy_queues[2]) || (bfqd->busy_queues[1] && bfqd->busy_queues[2]); - /* - * For queue weights to differ, queue_weights_tree must contain - * at least two nodes. - */ - return !(varied_queue_weights || multiple_classes_busy + return varied_queue_weights || multiple_classes_busy #ifdef CONFIG_BFQ_GROUP_IOSCHED || bfqd->num_groups_with_pending_reqs > 0 #endif - ); + ; } /* @@ -694,10 +707,11 @@ static bool bfq_symmetric_scenario(struct bfq_data *bfqd) * should be low too. */ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq, - struct rb_root *root) + struct rb_root_cached *root) { struct bfq_entity *entity = &bfqq->entity; - struct rb_node **new = &(root->rb_node), *parent = NULL; + struct rb_node **new = &(root->rb_root.rb_node), *parent = NULL; + bool leftmost = true; /* * Do not insert if the queue is already associated with a @@ -726,8 +740,10 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq, } if (entity->weight < __counter->weight) new = &((*new)->rb_left); - else + else { new = &((*new)->rb_right); + leftmost = false; + } } bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter), @@ -736,7 +752,7 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq, /* * In the unlucky event of an allocation failure, we just * exit. This will cause the weight of queue to not be - * considered in bfq_symmetric_scenario, which, in its turn, + * considered in bfq_asymmetric_scenario, which, in its turn, * causes the scenario to be deemed wrongly symmetric in case * bfqq's weight would have been the only weight making the * scenario asymmetric. On the bright side, no unbalance will @@ -750,7 +766,8 @@ void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq, bfqq->weight_counter->weight = entity->weight; rb_link_node(&bfqq->weight_counter->weights_node, parent, new); - rb_insert_color(&bfqq->weight_counter->weights_node, root); + rb_insert_color_cached(&bfqq->weight_counter->weights_node, root, + leftmost); inc_counter: bfqq->weight_counter->num_active++; @@ -765,7 +782,7 @@ inc_counter: */ void __bfq_weights_tree_remove(struct bfq_data *bfqd, struct bfq_queue *bfqq, - struct rb_root *root) + struct rb_root_cached *root) { if (!bfqq->weight_counter) return; @@ -774,7 +791,7 @@ void __bfq_weights_tree_remove(struct bfq_data *bfqd, if (bfqq->weight_counter->num_active > 0) goto reset_entity_pointer; - rb_erase(&bfqq->weight_counter->weights_node, root); + rb_erase_cached(&bfqq->weight_counter->weights_node, root); kfree(bfqq->weight_counter); reset_entity_pointer: @@ -889,7 +906,7 @@ static unsigned long bfq_serv_to_charge(struct request *rq, struct bfq_queue *bfqq) { if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1 || - !bfq_symmetric_scenario(bfqq->bfqd)) + bfq_asymmetric_scenario(bfqq->bfqd, bfqq)) return blk_rq_sectors(rq); return blk_rq_sectors(rq) * bfq_async_charge_factor; @@ -2543,7 +2560,7 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd) * queue). 
*/ if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 && - bfq_symmetric_scenario(bfqd)) + !bfq_asymmetric_scenario(bfqd, bfqq)) sl = min_t(u64, sl, BFQ_MIN_TT); else if (bfqq->wr_coeff > 1) sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC); @@ -3500,8 +3517,9 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd, } /* - * There is a case where idling must be performed not for - * throughput concerns, but to preserve service guarantees. + * There is a case where idling does not have to be performed for + * throughput concerns, but to preserve the throughput share of + * the process associated with bfqq. * * To introduce this case, we can note that allowing the drive * to enqueue more than one request at a time, and hence @@ -3517,77 +3535,83 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd, * concern about per-process throughput distribution, and * makes its decisions only on a per-request basis. Therefore, * the service distribution enforced by the drive's internal - * scheduler is likely to coincide with the desired - * device-throughput distribution only in a completely - * symmetric scenario where: - * (i) each of these processes must get the same throughput as - * the others; - * (ii) the I/O of each process has the same properties, in - * terms of locality (sequential or random), direction - * (reads or writes), request sizes, greediness - * (from I/O-bound to sporadic), and so on. - * In fact, in such a scenario, the drive tends to treat - * the requests of each of these processes in about the same - * way as the requests of the others, and thus to provide - * each of these processes with about the same throughput - * (which is exactly the desired throughput distribution). In - * contrast, in any asymmetric scenario, device idling is - * certainly needed to guarantee that bfqq receives its - * assigned fraction of the device throughput (see [1] for - * details). - * The problem is that idling may significantly reduce - * throughput with certain combinations of types of I/O and - * devices. An important example is sync random I/O, on flash - * storage with command queueing. So, unless bfqq falls in the - * above cases where idling also boosts throughput, it would - * be important to check conditions (i) and (ii) accurately, - * so as to avoid idling when not strictly needed for service - * guarantees. + * scheduler is likely to coincide with the desired throughput + * distribution only in a completely symmetric, or favorably + * skewed scenario where: + * (i-a) each of these processes must get the same throughput as + * the others, + * (i-b) in case (i-a) does not hold, it holds that the process + * associated with bfqq must receive a lower or equal + * throughput than any of the other processes; + * (ii) the I/O of each process has the same properties, in + * terms of locality (sequential or random), direction + * (reads or writes), request sizes, greediness + * (from I/O-bound to sporadic), and so on; + + * In fact, in such a scenario, the drive tends to treat the requests + * of each process in about the same way as the requests of the + * others, and thus to provide each of these processes with about the + * same throughput. This is exactly the desired throughput + * distribution if (i-a) holds, or, if (i-b) holds instead, this is an + * even more convenient distribution for (the process associated with) + * bfqq. * - * Unfortunately, it is extremely difficult to thoroughly - * check condition (ii). 
And, in case there are active groups, - * it becomes very difficult to check condition (i) too. In - * fact, if there are active groups, then, for condition (i) - * to become false, it is enough that an active group contains - * more active processes or sub-groups than some other active - * group. More precisely, for condition (i) to hold because of - * such a group, it is not even necessary that the group is - * (still) active: it is sufficient that, even if the group - * has become inactive, some of its descendant processes still - * have some request already dispatched but still waiting for - * completion. In fact, requests have still to be guaranteed - * their share of the throughput even after being - * dispatched. In this respect, it is easy to show that, if a - * group frequently becomes inactive while still having - * in-flight requests, and if, when this happens, the group is - * not considered in the calculation of whether the scenario - * is asymmetric, then the group may fail to be guaranteed its - * fair share of the throughput (basically because idling may - * not be performed for the descendant processes of the group, - * but it had to be). We address this issue with the - * following bi-modal behavior, implemented in the function - * bfq_symmetric_scenario(). + * In contrast, in any asymmetric or unfavorable scenario, device + * idling (I/O-dispatch plugging) is certainly needed to guarantee + * that bfqq receives its assigned fraction of the device throughput + * (see [1] for details). + * + * The problem is that idling may significantly reduce throughput with + * certain combinations of types of I/O and devices. An important + * example is sync random I/O on flash storage with command + * queueing. So, unless bfqq falls in cases where idling also boosts + * throughput, it is important to check conditions (i-a), i(-b) and + * (ii) accurately, so as to avoid idling when not strictly needed for + * service guarantees. + * + * Unfortunately, it is extremely difficult to thoroughly check + * condition (ii). And, in case there are active groups, it becomes + * very difficult to check conditions (i-a) and (i-b) too. In fact, + * if there are active groups, then, for conditions (i-a) or (i-b) to + * become false 'indirectly', it is enough that an active group + * contains more active processes or sub-groups than some other active + * group. More precisely, for conditions (i-a) or (i-b) to become + * false because of such a group, it is not even necessary that the + * group is (still) active: it is sufficient that, even if the group + * has become inactive, some of its descendant processes still have + * some request already dispatched but still waiting for + * completion. In fact, requests have still to be guaranteed their + * share of the throughput even after being dispatched. In this + * respect, it is easy to show that, if a group frequently becomes + * inactive while still having in-flight requests, and if, when this + * happens, the group is not considered in the calculation of whether + * the scenario is asymmetric, then the group may fail to be + * guaranteed its fair share of the throughput (basically because + * idling may not be performed for the descendant processes of the + * group, but it had to be). We address this issue with the following + * bi-modal behavior, implemented in the function + * bfq_asymmetric_scenario(). 
* * If there are groups with requests waiting for completion * (as commented above, some of these groups may even be * already inactive), then the scenario is tagged as * asymmetric, conservatively, without checking any of the - * conditions (i) and (ii). So the device is idled for bfqq. + * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq. * This behavior matches also the fact that groups are created * exactly if controlling I/O is a primary concern (to * preserve bandwidth and latency guarantees). * - * On the opposite end, if there are no groups with requests - * waiting for completion, then only condition (i) is actually - * controlled, i.e., provided that condition (i) holds, idling - * is not performed, regardless of whether condition (ii) - * holds. In other words, only if condition (i) does not hold, - * then idling is allowed, and the device tends to be - * prevented from queueing many requests, possibly of several - * processes. Since there are no groups with requests waiting - * for completion, then, to control condition (i) it is enough - * to check just whether all the queues with requests waiting - * for completion also have the same weight. + * On the opposite end, if there are no groups with requests waiting + * for completion, then only conditions (i-a) and (i-b) are actually + * controlled, i.e., provided that conditions (i-a) or (i-b) holds, + * idling is not performed, regardless of whether condition (ii) + * holds. In other words, only if conditions (i-a) and (i-b) do not + * hold, then idling is allowed, and the device tends to be prevented + * from queueing many requests, possibly of several processes. Since + * there are no groups with requests waiting for completion, then, to + * control conditions (i-a) and (i-b) it is enough to check just + * whether all the queues with requests waiting for completion also + * have the same weight. * * Not checking condition (ii) evidently exposes bfqq to the * risk of getting less throughput than its fair share. @@ -3639,7 +3663,7 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd, * compound condition that is checked below for deciding * whether the scenario is asymmetric. To explain this * compound condition, we need to add that the function - * bfq_symmetric_scenario checks the weights of only + * bfq_asymmetric_scenario checks the weights of only * non-weight-raised queues, for efficiency reasons (see * comments on bfq_weights_tree_add()). Then the fact that * bfqq is weight-raised is checked explicitly here. More @@ -3667,7 +3691,7 @@ static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd, return (bfqq->wr_coeff > 1 && bfqd->wr_busy_queues < bfq_tot_busy_queues(bfqd)) || - !bfq_symmetric_scenario(bfqd); + bfq_asymmetric_scenario(bfqd, bfqq); } /* @@ -5505,7 +5529,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e) HRTIMER_MODE_REL); bfqd->idle_slice_timer.function = bfq_idle_slice_timer; - bfqd->queue_weights_tree = RB_ROOT; + bfqd->queue_weights_tree = RB_ROOT_CACHED; bfqd->num_groups_with_pending_reqs = 0; INIT_LIST_HEAD(&bfqd->active_list); diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index 062e1c4787f4..81cabf51a87e 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -450,7 +450,7 @@ struct bfq_data { * weight-raised @bfq_queue (see the comments to the functions * bfq_weights_tree_[add|remove] for further details). 
*/ - struct rb_root queue_weights_tree; + struct rb_root_cached queue_weights_tree; /* * Number of groups with at least one descendant process that @@ -898,10 +898,10 @@ void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync); struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic); void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq); void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq, - struct rb_root *root); + struct rb_root_cached *root); void __bfq_weights_tree_remove(struct bfq_data *bfqd, struct bfq_queue *bfqq, - struct rb_root *root); + struct rb_root_cached *root); void bfq_weights_tree_remove(struct bfq_data *bfqd, struct bfq_queue *bfqq); void bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq, diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c index a11bef75483d..51ef1f00df80 100644 --- a/block/bfq-wf2q.c +++ b/block/bfq-wf2q.c @@ -737,7 +737,7 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st, struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); unsigned int prev_weight, new_weight; struct bfq_data *bfqd = NULL; - struct rb_root *root; + struct rb_root_cached *root; #ifdef CONFIG_BFQ_GROUP_IOSCHED struct bfq_sched_data *sd; struct bfq_group *bfqg; From 2341d662e9a2a5751ff8ac4ffa640fb493b0ee84 Mon Sep 17 00:00:00 2001 From: Paolo Valente Date: Tue, 12 Mar 2019 09:59:29 +0100 Subject: [PATCH 003/164] block, bfq: tune service injection basing on request service times MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The processes associated with a bfq_queue, say Q, may happen to generate their cumulative I/O at a lower rate than the rate at which the device could serve the same I/O. This is rather probable, e.g., if only one process is associated with Q and the device is an SSD. It results in Q becoming often empty while in service. If BFQ is not allowed to switch to another queue when Q becomes empty, then, during the service of Q, there will be frequent "service holes", i.e., time intervals during which Q gets empty and the device can only consume the I/O already queued in its hardware queues. This easily causes considerable losses of throughput. To counter this problem, BFQ implements a request injection mechanism, which tries to fill the above service holes with I/O requests taken from other bfq_queues. The hard part in this mechanism is finding the right amount of I/O to inject, so as to both boost throughput and not break Q's bandwidth and latency guarantees. To this goal, the current version of this mechanism measures the bandwidth enjoyed by Q while it is being served, and tries to inject the maximum possible amount of extra service that does not cause Q's bandwidth to decrease too much. This solution has an important shortcoming. For bandwidth measurements to be stable and reliable, Q must remain in service for a much longer time than that needed to serve a single I/O request. Unfortunately, this does not hold with many workloads. This commit addresses this issue by changing the way the amount of injection allowed is dynamically computed. It tunes injection as a function of the service times of single I/O requests of Q, instead of Q's bandwidth. Single-request service times are evidently meaningful even if Q gets very few I/O requests completed while it is in service. 
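Here is a stripped-down sketch of the heart of the new rule, as implemented in
bfq_update_inject_limit() in the diff below: each freshly measured total
service time of a "first request" is compared against 1.5 times the current
baseline, and the per-queue injection limit is decreased or increased
accordingly, while the baseline itself is computed, or lowered, whenever a
smaller sample becomes available. Structure and parameter names are simplified
for illustration.

#include <stdint.h>

struct inject_state {
        uint64_t     baseline_ns;  /* plays the role of last_serv_time_ns */
        unsigned int limit;        /* plays the role of inject_limit */
};

static void update_inject_limit(struct inject_state *s, uint64_t tot_time_ns,
                                unsigned int max_rq_in_driver,
                                unsigned int rq_in_driver)
{
        unsigned int old_limit = s->limit;

        if (s->baseline_ns > 0) {
                /* tolerate up to 50% inflation of the baseline service time */
                uint64_t threshold = (s->baseline_ns * 3) >> 1;

                if (tot_time_ns >= threshold && old_limit > 0)
                        s->limit--;     /* injection is hurting: back off */
                else if (tot_time_ns < threshold &&
                         old_limit < (max_rq_in_driver << 1))
                        s->limit++;     /* room for more injection */
        }

        /*
         * Either the baseline is still missing and there is no other
         * request in flight, or a smaller sample arrived: (re)compute it.
         */
        if ((s->baseline_ns == 0 && rq_in_driver == 0) ||
            tot_time_ns < s->baseline_ns) {
                s->baseline_ns = tot_time_ns;
                /* a baseline now exists: make sure injection gets a chance */
                s->limit = old_limit > 1 ? old_limit : 1;
        }
}

The 100 ms minimum sampling interval and the periodic reset of the limit (once
per second), both visible in bfq_add_request() below, are omitted from the
sketch.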
As a testbed for this new solution, we measured the throughput reached by BFQ for one of the nastiest workloads and configurations for this scheduler: the workload generated by the dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on a PLEXTOR PX-256M5. Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Tested-by: Francesco Pollicino Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- block/bfq-iosched.c | 417 ++++++++++++++++++++++++++++++++++++++++---- block/bfq-iosched.h | 51 +++--- 2 files changed, 409 insertions(+), 59 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 2eb587fe7c1a..f59efee7a601 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -1721,6 +1721,123 @@ static void bfq_add_request(struct request *rq) bfqq->queued[rq_is_sync(rq)]++; bfqd->queued++; + if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) { + /* + * Periodically reset inject limit, to make sure that + * the latter eventually drops in case workload + * changes, see step (3) in the comments on + * bfq_update_inject_limit(). + */ + if (time_is_before_eq_jiffies(bfqq->decrease_time_jif + + msecs_to_jiffies(1000))) { + /* invalidate baseline total service time */ + bfqq->last_serv_time_ns = 0; + + /* + * Reset pointer in case we are waiting for + * some request completion. + */ + bfqd->waited_rq = NULL; + + /* + * If bfqq has a short think time, then start + * by setting the inject limit to 0 + * prudentially, because the service time of + * an injected I/O request may be higher than + * the think time of bfqq, and therefore, if + * one request was injected when bfqq remains + * empty, this injected request might delay + * the service of the next I/O request for + * bfqq significantly. In case bfqq can + * actually tolerate some injection, then the + * adaptive update will however raise the + * limit soon. This lucky circumstance holds + * exactly because bfqq has a short think + * time, and thus, after remaining empty, is + * likely to get new I/O enqueued---and then + * completed---before being expired. This is + * the very pattern that gives the + * limit-update algorithm the chance to + * measure the effect of injection on request + * service times, and then to update the limit + * accordingly. + * + * On the opposite end, if bfqq has a long + * think time, then start directly by 1, + * because: + * a) on the bright side, keeping at most one + * request in service in the drive is unlikely + * to cause any harm to the latency of bfqq's + * requests, as the service time of a single + * request is likely to be lower than the + * think time of bfqq; + * b) on the downside, after becoming empty, + * bfqq is likely to expire before getting its + * next request. With this request arrival + * pattern, it is very hard to sample total + * service times and update the inject limit + * accordingly (see comments on + * bfq_update_inject_limit()). So the limit is + * likely to be never, or at least seldom, + * updated. As a consequence, by setting the + * limit to 1, we avoid that no injection ever + * occurs with bfqq. On the downside, this + * proactive step further reduces chances to + * actually compute the baseline total service + * time. Thus it reduces chances to execute the + * limit-update algorithm and possibly raise the + * limit to more than 1. 
+ */ + if (bfq_bfqq_has_short_ttime(bfqq)) + bfqq->inject_limit = 0; + else + bfqq->inject_limit = 1; + bfqq->decrease_time_jif = jiffies; + } + + /* + * The following conditions must hold to setup a new + * sampling of total service time, and then a new + * update of the inject limit: + * - bfqq is in service, because the total service + * time is evaluated only for the I/O requests of + * the queues in service; + * - this is the right occasion to compute or to + * lower the baseline total service time, because + * there are actually no requests in the drive, + * or + * the baseline total service time is available, and + * this is the right occasion to compute the other + * quantity needed to update the inject limit, i.e., + * the total service time caused by the amount of + * injection allowed by the current value of the + * limit. It is the right occasion because injection + * has actually been performed during the service + * hole, and there are still in-flight requests, + * which are very likely to be exactly the injected + * requests, or part of them; + * - the minimum interval for sampling the total + * service time and updating the inject limit has + * elapsed. + */ + if (bfqq == bfqd->in_service_queue && + (bfqd->rq_in_driver == 0 || + (bfqq->last_serv_time_ns > 0 && + bfqd->rqs_injected && bfqd->rq_in_driver > 0)) && + time_is_before_eq_jiffies(bfqq->decrease_time_jif + + msecs_to_jiffies(100))) { + bfqd->last_empty_occupied_ns = ktime_get_ns(); + /* + * Start the state machine for measuring the + * total service time of rq: setting + * wait_dispatch will cause bfqd->waited_rq to + * be set when rq will be dispatched. + */ + bfqd->wait_dispatch = true; + bfqd->rqs_injected = false; + } + } + elv_rb_add(&bfqq->sort_list, rq); /* @@ -2566,6 +2683,8 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd) sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC); bfqd->last_idling_start = ktime_get(); + bfqd->last_idling_start_jiffies = jiffies; + hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl), HRTIMER_MODE_REL); bfqg_stats_set_start_idle_time(bfqq_group(bfqq)); @@ -3240,13 +3359,6 @@ static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd, jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4); } -static bool bfq_bfqq_injectable(struct bfq_queue *bfqq) -{ - return BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 && - blk_queue_nonrot(bfqq->bfqd->queue) && - bfqq->bfqd->hw_tag; -} - /** * bfq_bfqq_expire - expire a queue. * @bfqd: device owning the queue. @@ -3361,6 +3473,14 @@ void bfq_bfqq_expire(struct bfq_data *bfqd, "expire (%d, slow %d, num_disp %d, short_ttime %d)", reason, slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq)); + /* + * bfqq expired, so no total service time needs to be computed + * any longer: reset state machine for measuring total service + * times. + */ + bfqd->rqs_injected = bfqd->wait_dispatch = false; + bfqd->waited_rq = NULL; + /* * Increase, decrease or leave budget unchanged according to * reason. 
@@ -3372,8 +3492,6 @@ void bfq_bfqq_expire(struct bfq_data *bfqd, if (ref == 1) /* bfqq is gone, no more actions on it */ return; - bfqq->injected_service = 0; - /* mark bfqq as waiting a request only if a bic still points to it */ if (!bfq_bfqq_busy(bfqq) && reason != BFQQE_BUDGET_TIMEOUT && @@ -3767,26 +3885,98 @@ static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq) return RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_better_to_idle(bfqq); } -static struct bfq_queue *bfq_choose_bfqq_for_injection(struct bfq_data *bfqd) +/* + * This function chooses the queue from which to pick the next extra + * I/O request to inject, if it finds a compatible queue. See the + * comments on bfq_update_inject_limit() for details on the injection + * mechanism, and for the definitions of the quantities mentioned + * below. + */ +static struct bfq_queue * +bfq_choose_bfqq_for_injection(struct bfq_data *bfqd) { - struct bfq_queue *bfqq; + struct bfq_queue *bfqq, *in_serv_bfqq = bfqd->in_service_queue; + unsigned int limit = in_serv_bfqq->inject_limit; + /* + * If + * - bfqq is not weight-raised and therefore does not carry + * time-critical I/O, + * or + * - regardless of whether bfqq is weight-raised, bfqq has + * however a long think time, during which it can absorb the + * effect of an appropriate number of extra I/O requests + * from other queues (see bfq_update_inject_limit for + * details on the computation of this number); + * then injection can be performed without restrictions. + */ + bool in_serv_always_inject = in_serv_bfqq->wr_coeff == 1 || + !bfq_bfqq_has_short_ttime(in_serv_bfqq); /* - * A linear search; but, with a high probability, very few - * steps are needed to find a candidate queue, i.e., a queue - * with enough budget left for its next request. In fact: + * If + * - the baseline total service time could not be sampled yet, + * so the inject limit happens to be still 0, and + * - a lot of time has elapsed since the plugging of I/O + * dispatching started, so drive speed is being wasted + * significantly; + * then temporarily raise inject limit to one request. + */ + if (limit == 0 && in_serv_bfqq->last_serv_time_ns == 0 && + bfq_bfqq_wait_request(in_serv_bfqq) && + time_is_before_eq_jiffies(bfqd->last_idling_start_jiffies + + bfqd->bfq_slice_idle) + ) + limit = 1; + + if (bfqd->rq_in_driver >= limit) + return NULL; + + /* + * Linear search of the source queue for injection; but, with + * a high probability, very few steps are needed to find a + * candidate queue, i.e., a queue with enough budget left for + * its next request. In fact: * - BFQ dynamically updates the budget of every queue so as * to accommodate the expected backlog of the queue; * - if a queue gets all its requests dispatched as injected * service, then the queue is removed from the active list - * (and re-added only if it gets new requests, but with - * enough budget for its new backlog). + * (and re-added only if it gets new requests, but then it + * is assigned again enough budget for its new backlog). */ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) if (!RB_EMPTY_ROOT(&bfqq->sort_list) && + (in_serv_always_inject || bfqq->wr_coeff > 1) && bfq_serv_to_charge(bfqq->next_rq, bfqq) <= - bfq_bfqq_budget_left(bfqq)) - return bfqq; + bfq_bfqq_budget_left(bfqq)) { + /* + * Allow for only one large in-flight request + * on non-rotational devices, for the + * following reason. On non-rotationl drives, + * large requests take much longer than + * smaller requests to be served. 
In addition, + * the drive prefers to serve large requests + * w.r.t. to small ones, if it can choose. So, + * having more than one large requests queued + * in the drive may easily make the next first + * request of the in-service queue wait for so + * long to break bfqq's service guarantees. On + * the bright side, large requests let the + * drive reach a very high throughput, even if + * there is only one in-flight large request + * at a time. + */ + if (blk_queue_nonrot(bfqd->queue) && + blk_rq_sectors(bfqq->next_rq) >= + BFQQ_SECT_THR_NONROT) + limit = min_t(unsigned int, 1, limit); + else + limit = in_serv_bfqq->inject_limit; + + if (bfqd->rq_in_driver < limit) { + bfqd->rqs_injected = true; + return bfqq; + } + } return NULL; } @@ -3873,14 +4063,32 @@ check_queue: * for a new request, or has requests waiting for a completion and * may idle after their completion, then keep it anyway. * - * Yet, to boost throughput, inject service from other queues if - * possible. + * Yet, inject service from other queues if it boosts + * throughput and is possible. */ if (bfq_bfqq_wait_request(bfqq) || (bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) { - if (bfq_bfqq_injectable(bfqq) && - bfqq->injected_service * bfqq->inject_coeff < - bfqq->entity.service * 10) + struct bfq_queue *async_bfqq = + bfqq->bic && bfqq->bic->bfqq[0] && + bfq_bfqq_busy(bfqq->bic->bfqq[0]) ? + bfqq->bic->bfqq[0] : NULL; + + /* + * If the process associated with bfqq has also async + * I/O pending, then inject it + * unconditionally. Injecting I/O from the same + * process can cause no harm to the process. On the + * contrary, it can only increase bandwidth and reduce + * latency for the process. + */ + if (async_bfqq && + icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic && + bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <= + bfq_bfqq_budget_left(async_bfqq)) + bfqq = bfqq->bic->bfqq[0]; + else if (!idling_boosts_thr_without_issues(bfqd, bfqq) && + (bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 || + !bfq_bfqq_has_short_ttime(bfqq))) bfqq = bfq_choose_bfqq_for_injection(bfqd); else bfqq = NULL; @@ -3972,15 +4180,15 @@ static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd, bfq_bfqq_served(bfqq, service_to_charge); + if (bfqq == bfqd->in_service_queue && bfqd->wait_dispatch) { + bfqd->wait_dispatch = false; + bfqd->waited_rq = rq; + } + bfq_dispatch_remove(bfqd->queue, rq); - if (bfqq != bfqd->in_service_queue) { - if (likely(bfqd->in_service_queue)) - bfqd->in_service_queue->injected_service += - bfq_serv_to_charge(rq, bfqq); - + if (bfqq != bfqd->in_service_queue) goto return_rq; - } /* * If weight raising has to terminate for bfqq, then next @@ -4411,13 +4619,6 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, bfq_mark_bfqq_has_short_ttime(bfqq); bfq_mark_bfqq_sync(bfqq); bfq_mark_bfqq_just_created(bfqq); - /* - * Aggressively inject a lot of service: up to 90%. - * This coefficient remains constant during bfqq life, - * but this behavior might be changed, after enough - * testing and tuning. - */ - bfqq->inject_coeff = 1; } else bfq_clear_bfqq_sync(bfqq); @@ -4976,6 +5177,147 @@ static void bfq_finish_requeue_request_body(struct bfq_queue *bfqq) bfq_put_queue(bfqq); } +/* + * The processes associated with bfqq may happen to generate their + * cumulative I/O at a lower rate than the rate at which the device + * could serve the same I/O. This is rather probable, e.g., if only + * one process is associated with bfqq and the device is an SSD. 
It + * results in bfqq becoming often empty while in service. In this + * respect, if BFQ is allowed to switch to another queue when bfqq + * remains empty, then the device goes on being fed with I/O requests, + * and the throughput is not affected. In contrast, if BFQ is not + * allowed to switch to another queue---because bfqq is sync and + * I/O-dispatch needs to be plugged while bfqq is temporarily + * empty---then, during the service of bfqq, there will be frequent + * "service holes", i.e., time intervals during which bfqq gets empty + * and the device can only consume the I/O already queued in its + * hardware queues. During service holes, the device may even get to + * remaining idle. In the end, during the service of bfqq, the device + * is driven at a lower speed than the one it can reach with the kind + * of I/O flowing through bfqq. + * + * To counter this loss of throughput, BFQ implements a "request + * injection mechanism", which tries to fill the above service holes + * with I/O requests taken from other queues. The hard part in this + * mechanism is finding the right amount of I/O to inject, so as to + * both boost throughput and not break bfqq's bandwidth and latency + * guarantees. In this respect, the mechanism maintains a per-queue + * inject limit, computed as below. While bfqq is empty, the injection + * mechanism dispatches extra I/O requests only until the total number + * of I/O requests in flight---i.e., already dispatched but not yet + * completed---remains lower than this limit. + * + * A first definition comes in handy to introduce the algorithm by + * which the inject limit is computed. We define as first request for + * bfqq, an I/O request for bfqq that arrives while bfqq is in + * service, and causes bfqq to switch from empty to non-empty. The + * algorithm updates the limit as a function of the effect of + * injection on the service times of only the first requests of + * bfqq. The reason for this restriction is that these are the + * requests whose service time is affected most, because they are the + * first to arrive after injection possibly occurred. + * + * To evaluate the effect of injection, the algorithm measures the + * "total service time" of first requests. We define as total service + * time of an I/O request, the time that elapses since when the + * request is enqueued into bfqq, to when it is completed. This + * quantity allows the whole effect of injection to be measured. It is + * easy to see why. Suppose that some requests of other queues are + * actually injected while bfqq is empty, and that a new request R + * then arrives for bfqq. If the device does start to serve all or + * part of the injected requests during the service hole, then, + * because of this extra service, it may delay the next invocation of + * the dispatch hook of BFQ. Then, even after R gets eventually + * dispatched, the device may delay the actual service of R if it is + * still busy serving the extra requests, or if it decides to serve, + * before R, some extra request still present in its queues. As a + * conclusion, the cumulative extra delay caused by injection can be + * easily evaluated by just comparing the total service time of first + * requests with and without injection. + * + * The limit-update algorithm works as follows. 
On the arrival of a + * first request of bfqq, the algorithm measures the total time of the + * request only if one of the three cases below holds, and, for each + * case, it updates the limit as described below: + * + * (1) If there is no in-flight request. This gives a baseline for the + * total service time of the requests of bfqq. If the baseline has + * not been computed yet, then, after computing it, the limit is + * set to 1, to start boosting throughput, and to prepare the + * ground for the next case. If the baseline has already been + * computed, then it is updated, in case it results to be lower + * than the previous value. + * + * (2) If the limit is higher than 0 and there are in-flight + * requests. By comparing the total service time in this case with + * the above baseline, it is possible to know at which extent the + * current value of the limit is inflating the total service + * time. If the inflation is below a certain threshold, then bfqq + * is assumed to be suffering from no perceivable loss of its + * service guarantees, and the limit is even tentatively + * increased. If the inflation is above the threshold, then the + * limit is decreased. Due to the lack of any hysteresis, this + * logic makes the limit oscillate even in steady workload + * conditions. Yet we opted for it, because it is fast in reaching + * the best value for the limit, as a function of the current I/O + * workload. To reduce oscillations, this step is disabled for a + * short time interval after the limit happens to be decreased. + * + * (3) Periodically, after resetting the limit, to make sure that the + * limit eventually drops in case the workload changes. This is + * needed because, after the limit has gone safely up for a + * certain workload, it is impossible to guess whether the + * baseline total service time may have changed, without measuring + * it again without injection. A more effective version of this + * step might be to just sample the baseline, by interrupting + * injection only once, and then to reset/lower the limit only if + * the total service time with the current limit does happen to be + * too large. + * + * More details on each step are provided in the comments on the + * pieces of code that implement these steps: the branch handling the + * transition from empty to non empty in bfq_add_request(), the branch + * handling injection in bfq_select_queue(), and the function + * bfq_choose_bfqq_for_injection(). These comments also explain some + * exceptions, made by the injection mechanism in some special cases. + */ +static void bfq_update_inject_limit(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + u64 tot_time_ns = ktime_get_ns() - bfqd->last_empty_occupied_ns; + unsigned int old_limit = bfqq->inject_limit; + + if (bfqq->last_serv_time_ns > 0) { + u64 threshold = (bfqq->last_serv_time_ns * 3)>>1; + + if (tot_time_ns >= threshold && old_limit > 0) { + bfqq->inject_limit--; + bfqq->decrease_time_jif = jiffies; + } else if (tot_time_ns < threshold && + old_limit < bfqd->max_rq_in_driver<<1) + bfqq->inject_limit++; + } + + /* + * Either we still have to compute the base value for the + * total service time, and there seem to be the right + * conditions to do it, or we can lower the last base value + * computed. + */ + if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 0) || + tot_time_ns < bfqq->last_serv_time_ns) { + bfqq->last_serv_time_ns = tot_time_ns; + /* + * Now we certainly have a base value: make sure we + * start trying injection. 
+ */ + bfqq->inject_limit = max_t(unsigned int, 1, old_limit); + } + + /* update complete, not waiting for any request completion any longer */ + bfqd->waited_rq = NULL; +} + /* * Handle either a requeue or a finish for rq. The things to do are * the same in both cases: all references to rq are to be dropped. In @@ -5020,6 +5362,9 @@ static void bfq_finish_requeue_request(struct request *rq) spin_lock_irqsave(&bfqd->lock, flags); + if (rq == bfqd->waited_rq) + bfq_update_inject_limit(bfqd, bfqq); + bfq_completed_request(bfqq, bfqd); bfq_finish_requeue_request_body(bfqq); diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index 81cabf51a87e..26869cfbbfa9 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -240,6 +240,13 @@ struct bfq_queue { /* next ioprio and ioprio class if a change is in progress */ unsigned short new_ioprio, new_ioprio_class; + /* last total-service-time sample, see bfq_update_inject_limit() */ + u64 last_serv_time_ns; + /* limit for request injection */ + unsigned int inject_limit; + /* last time the inject limit has been decreased, in jiffies */ + unsigned long decrease_time_jif; + /* * Shared bfq_queue if queue is cooperating with one or more * other queues. @@ -357,29 +364,6 @@ struct bfq_queue { /* max service rate measured so far */ u32 max_service_rate; - /* - * Ratio between the service received by bfqq while it is in - * service, and the cumulative service (of requests of other - * queues) that may be injected while bfqq is empty but still - * in service. To increase precision, the coefficient is - * measured in tenths of unit. Here are some example of (1) - * ratios, (2) resulting percentages of service injected - * w.r.t. to the total service dispatched while bfqq is in - * service, and (3) corresponding values of the coefficient: - * 1 (50%) -> 10 - * 2 (33%) -> 20 - * 10 (9%) -> 100 - * 9.9 (9%) -> 99 - * 1.5 (40%) -> 15 - * 0.5 (66%) -> 5 - * 0.1 (90%) -> 1 - * - * So, if the coefficient is lower than 10, then - * injected service is more than bfqq service. - */ - unsigned int inject_coeff; - /* amount of service injected in current service slot */ - unsigned int injected_service; }; /** @@ -544,6 +528,26 @@ struct bfq_data { /* time of last request completion (ns) */ u64 last_completion; + /* time of last transition from empty to non-empty (ns) */ + u64 last_empty_occupied_ns; + + /* + * Flag set to activate the sampling of the total service time + * of a just-arrived first I/O request (see + * bfq_update_inject_limit()). This will cause the setting of + * waited_rq when the request is finally dispatched. + */ + bool wait_dispatch; + /* + * If set, then bfq_update_inject_limit() is invoked when + * waited_rq is eventually completed. + */ + struct request *waited_rq; + /* + * True if some request has been injected during the last service hole. 
+ */ + bool rqs_injected; + /* time of first rq dispatch in current observation interval (ns) */ u64 first_dispatch; /* time of last rq dispatch in current observation interval (ns) */ @@ -553,6 +557,7 @@ struct bfq_data { ktime_t last_budget_start; /* beginning of the last idle slice */ ktime_t last_idling_start; + unsigned long last_idling_start_jiffies; /* number of samples in current observation interval */ int peak_rate_samples; From 8cacc5ab3eacf5284bc9b0d7d5b85b748a338104 Mon Sep 17 00:00:00 2001 From: Paolo Valente Date: Tue, 12 Mar 2019 09:59:30 +0100 Subject: [PATCH 004/164] block, bfq: do not merge queues on flash storage with queueing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To boost throughput with a set of processes doing interleaved I/O (i.e., a set of processes whose individual I/O is random, but whose merged cumulative I/O is sequential), BFQ merges the queues associated with these processes, i.e., redirects the I/O of these processes into a common, shared queue. In the shared queue, I/O requests are ordered by their position on the medium, thus sequential I/O gets dispatched to the device when the shared queue is served. Queue merging costs execution time, because, to detect which queues to merge, BFQ must maintain a list of the head I/O requests of active queues, ordered by request positions. Measurements showed that this costs about 10% of BFQ's total per-request processing time. Request processing time becomes more and more critical as the speed of the underlying storage device grows. Yet, fortunately, queue merging is basically useless on the very devices that are so fast to make request processing time critical. To reach a high throughput, these devices must have many requests queued at the same time. But, in this configuration, the internal scheduling algorithms of these devices do also the job of queue merging: they reorder requests so as to obtain as much as possible a sequential I/O pattern. As a consequence, with processes doing interleaved I/O, the throughput reached by one such device is likely to be the same, with and without queue merging. In view of this fact, this commit disables queue merging, and all related housekeeping, for non-rotational devices with internal queueing. The total, single-lock-protected, per-request processing time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz (time measured with simple code instrumentation, and using the throughput-sync.sh script of the S suite [1], in performance-profiling mode). To put this result into context, the total, single-lock-protected, per-request execution time of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ). Disabling merging provides a further, remarkable benefit in terms of throughput. Merging tends to make many workloads artificially more uneven, mainly because of shared queues remaining non empty for incomparably more time than normal queues. So, if, e.g., one of the queues in a set of merged queues has a higher weight than a normal queue, then the shared queue may inherit such a high weight and, by staying almost always active, may force BFQ to perform I/O plugging most of the time. This evidently makes it harder for BFQ to let the device reach a high throughput. 
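In terms of code, the gate introduced by this commit is small: a flag,
bfqd->nonrot_with_queueing, is re-evaluated together with hw_tag, and every
merging-related hot path is skipped when it is set. A minimal sketch, with
illustrative names standing in for the BFQ structures:

#include <stdbool.h>

#define unlikely(x)     __builtin_expect(!!(x), 0)

struct dev_state {
        bool nonrot;                 /* blk_queue_nonrot(queue) */
        bool hw_tag;                 /* internal queueing detected */
        bool nonrot_with_queueing;   /* gate for all merging work */
};

/* re-evaluated whenever hw_tag is re-estimated (see bfq_update_hw_tag()) */
static void update_merge_gate(struct dev_state *d)
{
        d->nonrot_with_queueing = d->nonrot && d->hw_tag;
}

/* pattern used in the hot paths touched by this commit */
static bool must_do_merge_housekeeping(const struct dev_state *d)
{
        /*
         * A fast non-rotational device with queueing reorders
         * interleaved I/O by itself, so the position-tree bookkeeping
         * and queue merging can be skipped entirely.
         */
        return unlikely(!d->nonrot_with_queueing);
}

Correspondingly, bfq_setup_cooperator() below returns NULL immediately when the
flag is set, so no merge is ever attempted on such devices.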
As a practical example of this problem, and of the benefits of this commit, we measured again the throughput in the nasty scenario considered in previous commit messages: dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any of the other I/O schedulers. As such, this is also likely to be the maximum possible throughput reachable with this workload on this device, because I/O is mostly random, and the other schedulers basically just pass I/O requests to the drive as fast as possible. [1] https://github.com/Algodev-github/S Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Tested-by: Francesco Pollicino Signed-off-by: Alessio Masola Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- block/bfq-cgroup.c | 3 +- block/bfq-iosched.c | 73 +++++++++++++++++++++++++++++++++++++++++---- block/bfq-iosched.h | 3 ++ 3 files changed, 73 insertions(+), 6 deletions(-) diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index c6113af31960..2a74a3f2a8f7 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -578,7 +578,8 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, bfqg_and_blkg_get(bfqg); if (bfq_bfqq_busy(bfqq)) { - bfq_pos_tree_add_move(bfqd, bfqq); + if (unlikely(!bfqd->nonrot_with_queueing)) + bfq_pos_tree_add_move(bfqd, bfqq); bfq_activate_bfqq(bfqd, bfqq); } diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index f59efee7a601..b957e9db87d8 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -595,7 +595,16 @@ static bool bfq_too_late_for_merging(struct bfq_queue *bfqq) bfq_merge_time_limit); } -void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq) +/* + * The following function is not marked as __cold because it is + * actually cold, but for the same performance goal described in the + * comments on the likely() at the beginning of + * bfq_setup_cooperator(). Unexpectedly, to reach an even lower + * execution time for the case where this function is not invoked, we + * had to add an unlikely() in each involved if(). + */ +void __cold +bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq) { struct rb_node **p, *parent; struct bfq_queue *__bfqq; @@ -1849,8 +1858,9 @@ static void bfq_add_request(struct request *rq) /* * Adjust priority tree position, if next_rq changes. + * See comments on bfq_pos_tree_add_move() for the unlikely(). */ - if (prev != bfqq->next_rq) + if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq)) bfq_pos_tree_add_move(bfqd, bfqq); if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */ @@ -1990,7 +2000,9 @@ static void bfq_remove_request(struct request_queue *q, bfqq->pos_root = NULL; } } else { - bfq_pos_tree_add_move(bfqd, bfqq); + /* see comments on bfq_pos_tree_add_move() for the unlikely() */ + if (unlikely(!bfqd->nonrot_with_queueing)) + bfq_pos_tree_add_move(bfqd, bfqq); } if (rq->cmd_flags & REQ_META) @@ -2075,7 +2087,12 @@ static void bfq_request_merged(struct request_queue *q, struct request *req, */ if (prev != bfqq->next_rq) { bfq_updated_next_req(bfqd, bfqq); - bfq_pos_tree_add_move(bfqd, bfqq); + /* + * See comments on bfq_pos_tree_add_move() for + * the unlikely(). 
+ */ + if (unlikely(!bfqd->nonrot_with_queueing)) + bfq_pos_tree_add_move(bfqd, bfqq); } } } @@ -2357,6 +2374,46 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq, { struct bfq_queue *in_service_bfqq, *new_bfqq; + /* + * Do not perform queue merging if the device is non + * rotational and performs internal queueing. In fact, such a + * device reaches a high speed through internal parallelism + * and pipelining. This means that, to reach a high + * throughput, it must have many requests enqueued at the same + * time. But, in this configuration, the internal scheduling + * algorithm of the device does exactly the job of queue + * merging: it reorders requests so as to obtain as much as + * possible a sequential I/O pattern. As a consequence, with + * the workload generated by processes doing interleaved I/O, + * the throughput reached by the device is likely to be the + * same, with and without queue merging. + * + * Disabling merging also provides a remarkable benefit in + * terms of throughput. Merging tends to make many workloads + * artificially more uneven, because of shared queues + * remaining non empty for incomparably more time than + * non-merged queues. This may accentuate workload + * asymmetries. For example, if one of the queues in a set of + * merged queues has a higher weight than a normal queue, then + * the shared queue may inherit such a high weight and, by + * staying almost always active, may force BFQ to perform I/O + * plugging most of the time. This evidently makes it harder + * for BFQ to let the device reach a high throughput. + * + * Finally, the likely() macro below is not used because one + * of the two branches is more likely than the other, but to + * have the code path after the following if() executed as + * fast as possible for the case of a non rotational device + * with queueing. We want it because this is the fastest kind + * of device. On the opposite end, the likely() may lengthen + * the execution time of BFQ for the case of slower devices + * (rotational or at least without queueing). But in this case + * the execution time of BFQ matters very little, if not at + * all. + */ + if (likely(bfqd->nonrot_with_queueing)) + return NULL; + /* * Prevent bfqq from being merged if it has been created too * long ago. The idea is that true cooperating processes, and @@ -2986,8 +3043,10 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq) bfq_requeue_bfqq(bfqd, bfqq, true); /* * Resort priority tree of potential close cooperators. + * See comments on bfq_pos_tree_add_move() for the unlikely(). 
*/ - bfq_pos_tree_add_move(bfqd, bfqq); + if (unlikely(!bfqd->nonrot_with_queueing)) + bfq_pos_tree_add_move(bfqd, bfqq); } /* @@ -5051,6 +5110,9 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd) bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD; bfqd->max_rq_in_driver = 0; bfqd->hw_tag_samples = 0; + + bfqd->nonrot_with_queueing = + blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag; } static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd) @@ -5882,6 +5944,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e) INIT_HLIST_HEAD(&bfqd->burst_list); bfqd->hw_tag = -1; + bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue); bfqd->bfq_max_budget = bfq_default_max_budget; diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index 26869cfbbfa9..829730b96fb2 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -497,6 +497,9 @@ struct bfq_data { /* number of requests dispatched and waiting for completion */ int rq_in_driver; + /* true if the device is non rotational and performs queueing */ + bool nonrot_with_queueing; + /* * Maximum number of requests in driver in the last * @hw_tag_samples completed requests. From 7074f076ff153021f408229b0ce63063dde9a400 Mon Sep 17 00:00:00 2001 From: Paolo Valente Date: Tue, 12 Mar 2019 09:59:31 +0100 Subject: [PATCH 005/164] block, bfq: do not tag totally seeky queues as soft rt MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sync random I/O is likely to be confused with soft real-time I/O, because it is characterized by limited throughput and apparently isochronous arrival pattern. To avoid false positives, this commits prevents bfq_queues containing only random (seeky) I/O from being tagged as soft real-time. Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- block/bfq-iosched.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index b957e9db87d8..7044da0b1c52 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -242,6 +242,14 @@ static struct kmem_cache *bfq_pool; blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT)) #define BFQQ_CLOSE_THR (sector_t)(8 * 1024) #define BFQQ_SEEKY(bfqq) (hweight32(bfqq->seek_history) > 19) +/* + * Sync random I/O is likely to be confused with soft real-time I/O, + * because it is characterized by limited throughput and apparently + * isochronous arrival pattern. To avoid false positives, queues + * containing only random (seeky) I/O are prevented from being tagged + * as soft real-time. 
+ */ +#define BFQQ_TOTALLY_SEEKY(bfqq) (bfqq->seek_history == -1) /* Min number of samples required to perform peak-rate update */ #define BFQ_RATE_MIN_SAMPLES 32 @@ -1622,6 +1630,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd, */ in_burst = bfq_bfqq_in_large_burst(bfqq); soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 && + !BFQQ_TOTALLY_SEEKY(bfqq) && !in_burst && time_is_before_jiffies(bfqq->soft_rt_next_start) && bfqq->dispatched == 0; @@ -4816,6 +4825,11 @@ bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq, { bfqq->seek_history <<= 1; bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq); + + if (bfqq->wr_coeff > 1 && + bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time && + BFQQ_TOTALLY_SEEKY(bfqq)) + bfq_bfqq_end_wr(bfqq); } static void bfq_update_has_short_ttime(struct bfq_data *bfqd, From 84a746891e1d8364485c0a37533fe6c1380270d4 Mon Sep 17 00:00:00 2001 From: Paolo Valente Date: Tue, 12 Mar 2019 09:59:32 +0100 Subject: [PATCH 006/164] block, bfq: always protect newly-created queues from existing active queues MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit If many bfq_queues belonging to the same group happen to be created shortly after each other, then the processes associated with these queues typically have a common goal. In particular, bursts of queue creations are usually caused by services or applications that spawn many parallel threads/processes. Examples are systemd during boot, or git grep. If there are no other active queues, then, to help these processes get their job done as soon as possible, the best thing to do is to reach a high throughput. To this goal, it is usually better to not grant either weight-raising or device idling to the queues associated with these processes. And this is exactly what BFQ currently does. There is however a drawback: if, in contrast, some other queues are already active, then the newly created queues must be protected from the I/O flowing through the already existing queues. In this case, the best thing to do is the opposite of what is done in the other case: it is much better to grant weight-raising and device idling to the newly-created queues, if they deserve it. This commit addresses this issue by doing so if there are already other active queues. This change also helps eliminate false positives, which occur when the newly-created queues do not belong to an actual large burst of creations, but some background task (e.g., a service) happens to trigger the creation of new queues in the middle, i.e., very close to when the victim queues are created. These false positives may cause total loss of control on process latencies. Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- block/bfq-iosched.c | 64 ++++++++++++++++++++++++++++++++++++--------- 1 file changed, 51 insertions(+), 13 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 7044da0b1c52..49bde428f7f2 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -1075,8 +1075,18 @@ static void bfq_reset_burst_list(struct bfq_data *bfqd, struct bfq_queue *bfqq) hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node) hlist_del_init(&item->burst_list_node); - hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list); - bfqd->burst_size = 1; + + /* + * Start the creation of a new burst list only if there is no + * active queue. See comments on the conditional invocation of + * bfq_handle_burst().
+ */ + if (bfq_tot_busy_queues(bfqd) == 0) { + hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list); + bfqd->burst_size = 1; + } else + bfqd->burst_size = 0; + bfqd->burst_parent_entity = bfqq->entity.parent; } @@ -1132,7 +1142,8 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq) * many parallel threads/processes. Examples are systemd during boot, * or git grep. To help these processes get their job done as soon as * possible, it is usually better to not grant either weight-raising - * or device idling to their queues. + * or device idling to their queues, unless these queues must be + * protected from the I/O flowing through other active queues. * * In this comment we describe, firstly, the reasons why this fact * holds, and, secondly, the next function, which implements the main @@ -1144,7 +1155,10 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq) * cumulatively served, the sooner the target job of these queues gets * completed. As a consequence, weight-raising any of these queues, * which also implies idling the device for it, is almost always - * counterproductive. In most cases it just lowers throughput. + * counterproductive, unless there are other active queues to isolate + * these new queues from. If there no other active queues, then + * weight-raising these new queues just lowers throughput in most + * cases. * * On the other hand, a burst of queue creations may be caused also by * the start of an application that does not consist of a lot of @@ -1178,14 +1192,16 @@ static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq) * are very rare. They typically occur if some service happens to * start doing I/O exactly when the interactive task starts. * - * Turning back to the next function, it implements all the steps - * needed to detect the occurrence of a large burst and to properly - * mark all the queues belonging to it (so that they can then be - * treated in a different way). This goal is achieved by maintaining a - * "burst list" that holds, temporarily, the queues that belong to the - * burst in progress. The list is then used to mark these queues as - * belonging to a large burst if the burst does become large. The main - * steps are the following. + * Turning back to the next function, it is invoked only if there are + * no active queues (apart from active queues that would belong to the + * same, possible burst bfqq would belong to), and it implements all + * the steps needed to detect the occurrence of a large burst and to + * properly mark all the queues belonging to it (so that they can then + * be treated in a different way). This goal is achieved by + * maintaining a "burst list" that holds, temporarily, the queues that + * belong to the burst in progress. The list is then used to mark + * these queues as belonging to a large burst if the burst does become + * large. The main steps are the following. * * . when the very first queue is created, the queue is inserted into the * list (as it could be the first queue in a possible burst) @@ -5695,7 +5711,29 @@ static struct bfq_queue *bfq_init_rq(struct request *rq) } } - if (unlikely(bfq_bfqq_just_created(bfqq))) + /* + * Consider bfqq as possibly belonging to a burst of newly + * created queues only if: + * 1) A burst is actually happening (bfqd->burst_size > 0) + * or + * 2) There is no other active queue. 
In fact, if, in + * contrast, there are active queues not belonging to the + * possible burst bfqq may belong to, then there is no gain + * in considering bfqq as belonging to a burst, and + * therefore in not weight-raising bfqq. See comments on + * bfq_handle_burst(). + * + * This filtering also helps eliminating false positives, + * occurring when bfqq does not belong to an actual large + * burst, but some background task (e.g., a service) happens + * to trigger the creation of new queues very close to when + * bfqq and its possible companion queues are created. See + * comments on bfq_handle_burst() for further details also on + * this issue. + */ + if (unlikely(bfq_bfqq_just_created(bfqq) && + (bfqd->burst_size > 0 || + bfq_tot_busy_queues(bfqd) == 0))) bfq_handle_burst(bfqd, bfqq); return bfqq; From 1e66413c4f68e2a61a210e4f5ff5df7a2ab86a5b Mon Sep 17 00:00:00 2001 From: Francesco Pollicino Date: Tue, 12 Mar 2019 09:59:33 +0100 Subject: [PATCH 007/164] block, bfq: print SHARED instead of pid for shared queues in logs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The function "bfq_log_bfqq" prints the pid of the process associated with the queue passed as input. Unfortunately, if the queue is shared, then more than one process is associated with the queue. The pid that gets printed in this case is the pid of one of the associated processes. Which process gets printed depends on the exact sequence of merge events the queue underwent. So printing such a pid is rather useless and above all is often rather confusing because it reports a random pid between those of the associated processes. This commit addresses this issue by printing SHARED instead of a pid if the queue is shared. Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Signed-off-by: Francesco Pollicino Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- block/bfq-iosched.c | 10 ++++++++++ block/bfq-iosched.h | 23 +++++++++++++++++++---- 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 49bde428f7f2..37cc8f127cf6 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -2590,6 +2590,16 @@ bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic, * assignment causes no harm). */ new_bfqq->bic = NULL; + /* + * If the queue is shared, the pid is the pid of one of the associated + * processes. Which pid depends on the exact sequence of merge events + * the queue underwent. So printing such a pid is useless and confusing + * because it reports a random pid between those of the associated + * processes. + * We mark such a queue with a pid -1, and then print SHARED instead of + * a pid in logging messages. + */ + new_bfqq->pid = -1; bfqq->bic = NULL; /* release process reference to bfqq */ bfq_put_queue(bfqq); diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index 829730b96fb2..67e63c276c7a 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -32,6 +32,8 @@ #define BFQ_DEFAULT_GRP_IOPRIO 0 #define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE +#define MAX_PID_STR_LENGTH 12 + /* * Soft real-time applications are extremely more latency sensitive * than interactive ones. Over-raise the weight of the former to @@ -1016,13 +1018,23 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq); /* --------------- end of interface of B-WF2Q+ ---------------- */ /* Logging facilities. 
*/ +static inline void bfq_pid_to_str(int pid, char *str, int len) +{ + if (pid != -1) + snprintf(str, len, "%d", pid); + else + snprintf(str, len, "SHARED-"); +} + #ifdef CONFIG_BFQ_GROUP_IOSCHED struct bfq_group *bfqq_group(struct bfq_queue *bfqq); #define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \ + char pid_str[MAX_PID_STR_LENGTH]; \ + bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH); \ blk_add_cgroup_trace_msg((bfqd)->queue, \ bfqg_to_blkg(bfqq_group(bfqq))->blkcg, \ - "bfq%d%c " fmt, (bfqq)->pid, \ + "bfq%s%c " fmt, pid_str, \ bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \ } while (0) @@ -1033,10 +1045,13 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq); #else /* CONFIG_BFQ_GROUP_IOSCHED */ -#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \ - blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid, \ +#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \ + char pid_str[MAX_PID_STR_LENGTH]; \ + bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH); \ + blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str, \ bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \ - ##args) + ##args); \ +} while (0) #define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {} while (0) #endif /* CONFIG_BFQ_GROUP_IOSCHED */ From fffca087d587b03d0d0dca2e86bf8e688fbf2c18 Mon Sep 17 00:00:00 2001 From: Francesco Pollicino Date: Tue, 12 Mar 2019 09:59:34 +0100 Subject: [PATCH 008/164] block, bfq: save & resume weight on a queue merge/split MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit bfq saves the state of a queue each time a merge occurs, to be able to resume such a state when the queue is associated again with its original process, on a split. Unfortunately bfq does not save & restore also the weight of the queue. If the weight is not correctly resumed when the queue is recycled, then the weight of the recycled queue could differ from the weight of the original queue. This commit adds the missing save & resume of the weight. Tested-by: Holger Hoffstätte Tested-by: Oleksandr Natalenko Signed-off-by: Francesco Pollicino Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- block/bfq-iosched.c | 2 ++ block/bfq-iosched.h | 9 +++++++++ 2 files changed, 11 insertions(+) diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 37cc8f127cf6..ceb06abd73df 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -1028,6 +1028,7 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd, else bfq_clear_bfqq_IO_bound(bfqq); + bfqq->entity.new_weight = bic->saved_weight; bfqq->ttime = bic->saved_ttime; bfqq->wr_coeff = bic->saved_wr_coeff; bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt; @@ -2502,6 +2503,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq) if (!bic) return; + bic->saved_weight = bfqq->entity.orig_weight; bic->saved_ttime = bfqq->ttime; bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq); bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq); diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index 67e63c276c7a..60c148728cc5 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -404,6 +404,15 @@ struct bfq_io_cq { */ bool was_in_burst_list; + /* + * Save the weight when a merge occurs, to be able + * to restore it in case of split. If the weight is not + * correctly resumed when the queue is recycled, + * then the weight of the recycled queue could differ + * from the weight of the original queue. + */ + unsigned int saved_weight; + /* * Similar to previous fields: save wr information. 
*/ From 4438cf50e7b315ff4bc4cfff8520b906428c3024 Mon Sep 17 00:00:00 2001 From: Paolo Valente Date: Tue, 12 Mar 2019 09:59:35 +0100 Subject: [PATCH 009/164] doc, block, bfq: add information on bfq execution time The execution time of BFQ has been slightly lowered. Report the new execution time in BFQ documentation. Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- Documentation/block/bfq-iosched.txt | 29 ++++++++++++++++++++++------- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt index 98a8dd5ee385..1a0f2ac02eb6 100644 --- a/Documentation/block/bfq-iosched.txt +++ b/Documentation/block/bfq-iosched.txt @@ -20,13 +20,26 @@ for that device, by setting low_latency to 0. See Section 3 for details on how to configure BFQ for the desired tradeoff between latency and throughput, or on how to maximize throughput. -BFQ has a non-null overhead, which limits the maximum IOPS that a CPU -can process for a device scheduled with BFQ. To give an idea of the -limits on slow or average CPUs, here are, first, the limits of BFQ for -three different CPUs, on, respectively, an average laptop, an old -desktop, and a cheap embedded system, in case full hierarchical -support is enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set), but -CONFIG_DEBUG_BLK_CGROUP is not set (Section 4-2): +As every I/O scheduler, BFQ adds some overhead to per-I/O-request +processing. To give an idea of this overhead, the total, +single-lock-protected, per-request processing time of BFQ---i.e., the +sum of the execution times of the request insertion, dispatch and +completion hooks---is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz +(dated CPU for notebooks; time measured with simple code +instrumentation, and using the throughput-sync.sh script of the S +suite [1], in performance-profiling mode). To put this result into +context, the total, single-lock-protected, per-request execution time +of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7 +us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ). + +Scheduling overhead further limits the maximum IOPS that a CPU can +process (already limited by the execution of the rest of the I/O +stack). To give an idea of the limits with BFQ, on slow or average +CPUs, here are, first, the limits of BFQ for three different CPUs, on, +respectively, an average laptop, an old desktop, and a cheap embedded +system, in case full hierarchical support is enabled (i.e., +CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_DEBUG_BLK_CGROUP is not +set (Section 4-2): - Intel i7-4850HQ: 400 KIOPS - AMD A8-3850: 250 KIOPS - ARM CortexTM-A53 Octa-core: 80 KIOPS @@ -566,3 +579,5 @@ applications. Unset this tunable if you need/want to control weights. Slightly extended version: http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite- results.pdf + +[3] https://github.com/Algodev-github/S From 56a85fd8376ef32458efb6ea97a820754e12f6bb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Holger=20Hoffst=C3=A4tte?= Date: Tue, 12 Feb 2019 15:54:24 -0700 Subject: [PATCH 010/164] loop: properly observe rotational flag of underlying device MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The loop driver always declares the rotational flag of its device as rotational, even when the device of the mapped file is nonrotational, as is the case with SSDs or on tmpfs. 
This can confuse filesystem tools which are SSD-aware; in my case I frequently forget to tell mkfs.btrfs that my loop device on tmpfs is nonrotational, and that I really don't need any automatic metadata redundancy. The attached patch fixes this by introspecting the rotational flag of the mapped file's underlying block device, if it exists. If the mapped file's filesystem has no associated block device - as is the case on e.g. tmpfs - we assume nonrotational storage. If there is a better way to identify such non-devices I'd love to hear them. Cc: Jens Axboe Cc: linux-block@vger.kernel.org Cc: holger@applied-asynchrony.com Signed-off-by: Holger Hoffstätte Signed-off-by: Gwendal Grignou Signed-off-by: Benjamin Gordon Reviewed-by: Guenter Roeck Signed-off-by: Jens Axboe --- drivers/block/loop.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index bf1c61cab8eb..c20710e617c2 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -900,6 +900,24 @@ static int loop_prepare_queue(struct loop_device *lo) return 0; } +static void loop_update_rotational(struct loop_device *lo) +{ + struct file *file = lo->lo_backing_file; + struct inode *file_inode = file->f_mapping->host; + struct block_device *file_bdev = file_inode->i_sb->s_bdev; + struct request_queue *q = lo->lo_queue; + bool nonrot = true; + + /* not all filesystems (e.g. tmpfs) have a sb->s_bdev */ + if (file_bdev) + nonrot = blk_queue_nonrot(bdev_get_queue(file_bdev)); + + if (nonrot) + blk_queue_flag_set(QUEUE_FLAG_NONROT, q); + else + blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); +} + static int loop_set_fd(struct loop_device *lo, fmode_t mode, struct block_device *bdev, unsigned int arg) { @@ -963,6 +981,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync) blk_queue_write_cache(lo->lo_queue, true, false); + loop_update_rotational(lo); loop_update_dio(lo); set_capacity(lo->lo_disk, size); bd_set_size(bdev, size << 9); From 0383ad4374f7ad7edd925a2ee4753035c3f5508a Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Fri, 29 Mar 2019 15:07:54 +0800 Subject: [PATCH 011/164] block: pass page to xen_biovec_phys_mergeable xen_biovec_phys_mergeable() only needs .bv_page of the 2nd bio bvec for checking if the two bvecs can be merged, so pass page to xen_biovec_phys_mergeable() directly. No function change. 
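To illustrate the rule this interface encodes (a user-space sketch, not kernel code and not part of the patch): under Xen, two ranges may share a bvec only when their bus (machine) frames are contiguous, which is why only the page of the second range is needed. In the model below, struct vec, xen_mergeable() and the p2m lookup table are invented stand-ins for struct bio_vec, xen_biovec_phys_mergeable() and the real p2m mapping, and the frame numbers are made up.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* fake p2m table: pseudo-physical frame -> bus (machine) frame */
static const unsigned long p2m[] = { 7, 8, 20, 30 };

static unsigned long pfn_to_bfn(unsigned long pfn)
{
	return p2m[pfn];
}

struct vec {                 /* stand-in for struct bio_vec */
	unsigned long pfn;   /* page frame of bv_page */
	unsigned int offset;
	unsigned int len;
};

/* mirrors the shape of the real check: only the second page is needed */
static bool xen_mergeable(const struct vec *vec1, unsigned long page_pfn)
{
	unsigned long bfn1 = pfn_to_bfn(vec1->pfn);
	unsigned long bfn2 = pfn_to_bfn(page_pfn);

	return bfn1 + (vec1->offset + vec1->len) / PAGE_SIZE == bfn2;
}

int main(void)
{
	struct vec a = { .pfn = 0, .offset = 0, .len = PAGE_SIZE };
	struct vec b = { .pfn = 2, .offset = 0, .len = PAGE_SIZE };

	/* pfn 0 -> bfn 7, pfn 1 -> bfn 8: bus-contiguous, may merge */
	printf("a + pfn 1: %s\n", xen_mergeable(&a, 1) ? "merge" : "no merge");
	/* pfn 2 -> bfn 20, pfn 3 -> bfn 30: adjacent pfns, but not on the bus */
	printf("b + pfn 3: %s\n", xen_mergeable(&b, 3) ? "merge" : "no merge");
	return 0;
}

The second case shows why a pseudo-physical contiguity check alone is not enough on Xen: adjacent guest frames need not be adjacent on the bus.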
Cc: Boris Ostrovsky Cc: Juergen Gross Cc: xen-devel@lists.xenproject.org Cc: Omar Sandoval Cc: Christoph Hellwig Reviewed-by: Christoph Hellwig Reviewed-by: Boris Ostrovsky Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk.h | 2 +- drivers/xen/biomerge.c | 5 +++-- include/xen/xen.h | 4 +++- 3 files changed, 7 insertions(+), 4 deletions(-) diff --git a/block/blk.h b/block/blk.h index 5d636ee41663..e27fd1512e4b 100644 --- a/block/blk.h +++ b/block/blk.h @@ -75,7 +75,7 @@ static inline bool biovec_phys_mergeable(struct request_queue *q, if (addr1 + vec1->bv_len != addr2) return false; - if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2)) + if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2->bv_page)) return false; if ((addr1 | mask) != ((addr2 + vec2->bv_len - 1) | mask)) return false; diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c index f3fbb700f569..05a286d24f14 100644 --- a/drivers/xen/biomerge.c +++ b/drivers/xen/biomerge.c @@ -4,12 +4,13 @@ #include #include +/* check if @page can be merged with 'vec1' */ bool xen_biovec_phys_mergeable(const struct bio_vec *vec1, - const struct bio_vec *vec2) + const struct page *page) { #if XEN_PAGE_SIZE == PAGE_SIZE unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page)); - unsigned long bfn2 = pfn_to_bfn(page_to_pfn(vec2->bv_page)); + unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page)); return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2; #else diff --git a/include/xen/xen.h b/include/xen/xen.h index 19d032373de5..19a72f591e2b 100644 --- a/include/xen/xen.h +++ b/include/xen/xen.h @@ -43,8 +43,10 @@ extern struct hvm_start_info pvh_start_info; #endif /* CONFIG_XEN_DOM0 */ struct bio_vec; +struct page; + bool xen_biovec_phys_mergeable(const struct bio_vec *vec1, - const struct bio_vec *vec2); + const struct page *page); #if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_XEN_BALLOON) extern u64 xen_saved_max_mem_size; From db5ebd6edd2627d7e81a031643cf43587f63e66c Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sun, 17 Mar 2019 18:01:04 +0800 Subject: [PATCH 012/164] block: avoid to break XEN by multi-page bvec XEN has a special page merge requirement, see xen_biovec_phys_mergeable(). We can't simply merge pages into one bvec for XEN. So move XEN's specific check on page merge into __bio_try_merge_page(), so that multi-page bvecs do not break XEN. Cc: Boris Ostrovsky Cc: xen-devel@lists.xenproject.org Cc: Omar Sandoval Cc: Christoph Hellwig Reviewed-by: Juergen Gross Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/bio.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/block/bio.c b/block/bio.c index b64cedc7f87c..b2423e7aae08 100644 --- a/block/bio.c +++ b/block/bio.c @@ -776,6 +776,8 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, if (vec_end_addr + 1 != page_addr + off) return false; + if (xen_domain() && !xen_biovec_phys_mergeable(bv, page)) + return false; if (same_page && (vec_end_addr & PAGE_MASK) != page_addr) return false; From fd7d8d4232f08b0df623d9ea7e941f0350a26e14 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sun, 17 Mar 2019 18:01:05 +0800 Subject: [PATCH 013/164] block: don't merge adjacent bvecs to one segment in bio blk_queue_split For normal filesystem IO, each page is added via bio_add_page(), in which bvec (page) merging has already been handled, so it is basically not possible to merge two adjacent bvecs in one bio. So do not try to merge two adjacent bvecs in blk_queue_split().
Cc: Omar Sandoval Cc: Christoph Hellwig Reviewed-by: Boris Ostrovsky Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-merge.c | 17 ----------------- 1 file changed, 17 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index 1c9d4f0f96ea..aa9164eb7187 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -267,23 +267,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q, goto split; } - if (bvprvp) { - if (seg_size + bv.bv_len > queue_max_segment_size(q)) - goto new_segment; - if (!biovec_phys_mergeable(q, bvprvp, &bv)) - goto new_segment; - - seg_size += bv.bv_len; - bvprv = bv; - bvprvp = &bvprv; - sectors += bv.bv_len >> 9; - - if (nsegs == 1 && seg_size > front_seg_size) - front_seg_size = seg_size; - - continue; - } -new_segment: if (nsegs == max_segs) goto split; From 5a8ce240d4d302d27a58fd34499b2404b3a8df4f Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sun, 17 Mar 2019 18:01:06 +0800 Subject: [PATCH 014/164] block: cleanup bio_add_pc_page REQ_PC is out of date, so replace it with passthrough IO. Also remove the local variable of 'prev' since we can reuse the top local variable of 'bvec'. No function change. Cc: Omar Sandoval Cc: Christoph Hellwig Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/bio.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/block/bio.c b/block/bio.c index b2423e7aae08..cbd202250a32 100644 --- a/block/bio.c +++ b/block/bio.c @@ -648,7 +648,7 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs) EXPORT_SYMBOL(bio_clone_fast); /** - * bio_add_pc_page - attempt to add page to bio + * bio_add_pc_page - attempt to add page to passthrough bio * @q: the target queue * @bio: destination bio * @page: page to add @@ -660,7 +660,7 @@ EXPORT_SYMBOL(bio_clone_fast); * limitations. The target block device must allow bio's up to PAGE_SIZE, * so it is always possible to add a single page to an empty bio. * - * This should only be used by REQ_PC bios. + * This should only be used by passthrough bios. */ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page, unsigned int len, unsigned int offset) @@ -683,11 +683,11 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page * a consecutive offset. Optimize this special case. */ if (bio->bi_vcnt > 0) { - struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1]; + bvec = &bio->bi_io_vec[bio->bi_vcnt - 1]; - if (page == prev->bv_page && - offset == prev->bv_offset + prev->bv_len) { - prev->bv_len += len; + if (page == bvec->bv_page && + offset == bvec->bv_offset + bvec->bv_len) { + bvec->bv_len += len; bio->bi_iter.bi_size += len; goto done; } @@ -696,7 +696,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page * If the queue doesn't support SG gaps and adding this * offset would create a gap, disallow it. */ - if (bvec_gap_to_prev(q, prev, offset)) + if (bvec_gap_to_prev(q, bvec, offset)) return 0; } From 5919482e222908d40279a616b1fe6400549e32b4 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sun, 17 Mar 2019 18:01:07 +0800 Subject: [PATCH 015/164] block: check if page is mergeable in one helper Now the check for deciding if one page is mergeable to current bvec becomes a bit complicated, and we need to reuse the code before adding pc page. So move the check in one dedicated helper. No function change. 
Cc: Omar Sandoval Cc: Christoph Hellwig Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/bio.c | 36 +++++++++++++++++++++++------------- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/block/bio.c b/block/bio.c index cbd202250a32..7ab7060a0e6c 100644 --- a/block/bio.c +++ b/block/bio.c @@ -647,6 +647,24 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs) } EXPORT_SYMBOL(bio_clone_fast); +static inline bool page_is_mergeable(const struct bio_vec *bv, + struct page *page, unsigned int len, unsigned int off, + bool same_page) +{ + phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + + bv->bv_offset + bv->bv_len - 1; + phys_addr_t page_addr = page_to_phys(page); + + if (vec_end_addr + 1 != page_addr + off) + return false; + if (xen_domain() && !xen_biovec_phys_mergeable(bv, page)) + return false; + if (same_page && (vec_end_addr & PAGE_MASK) != page_addr) + return false; + + return true; +} + /** * bio_add_pc_page - attempt to add page to passthrough bio * @q: the target queue @@ -770,20 +788,12 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, if (bio->bi_vcnt > 0) { struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; - phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + - bv->bv_offset + bv->bv_len - 1; - phys_addr_t page_addr = page_to_phys(page); - if (vec_end_addr + 1 != page_addr + off) - return false; - if (xen_domain() && !xen_biovec_phys_mergeable(bv, page)) - return false; - if (same_page && (vec_end_addr & PAGE_MASK) != page_addr) - return false; - - bv->bv_len += len; - bio->bi_iter.bi_size += len; - return true; + if (page_is_mergeable(bv, page, len, off, same_page)) { + bv->bv_len += len; + bio->bi_iter.bi_size += len; + return true; + } } return false; } From 190470871ae28da7bdb3909f6124385c8472fc97 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sun, 17 Mar 2019 18:01:08 +0800 Subject: [PATCH 016/164] block: put the same page when adding it to bio When the added page is merged to last same page in bio_add_pc_page(), the user may need to put this page for avoiding page leak. bio_map_user_iov() needs this kind of handling, and now it deals with it by itself in hack style. Moves the handling of put page into __bio_add_pc_page(), so bio_map_user_iov() may be simplified a bit, and maybe more users can benefit from this change. Cc: Omar Sandoval Cc: Christoph Hellwig Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/bio.c | 28 ++++++++++++++++------------ include/linux/bio.h | 3 +++ 2 files changed, 19 insertions(+), 12 deletions(-) diff --git a/block/bio.c b/block/bio.c index 7ab7060a0e6c..26853e072cd7 100644 --- a/block/bio.c +++ b/block/bio.c @@ -666,12 +666,13 @@ static inline bool page_is_mergeable(const struct bio_vec *bv, } /** - * bio_add_pc_page - attempt to add page to passthrough bio + * __bio_add_pc_page - attempt to add page to passthrough bio * @q: the target queue * @bio: destination bio * @page: page to add * @len: vec entry length * @offset: vec entry offset + * @put_same_page: put the page if it is same with last added page * * Attempt to add a page to the bio_vec maplist. This can fail for a * number of reasons, such as the bio being full or target block device @@ -680,8 +681,9 @@ static inline bool page_is_mergeable(const struct bio_vec *bv, * * This should only be used by passthrough bios. 
*/ -int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page - *page, unsigned int len, unsigned int offset) +int __bio_add_pc_page(struct request_queue *q, struct bio *bio, + struct page *page, unsigned int len, unsigned int offset, + bool put_same_page) { int retried_segments = 0; struct bio_vec *bvec; /* @@ -705,6 +707,8 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page if (page == bvec->bv_page && offset == bvec->bv_offset + bvec->bv_len) { + if (put_same_page) + put_page(page); bvec->bv_len += len; bio->bi_iter.bi_size += len; goto done; @@ -763,6 +767,13 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page blk_recount_segments(q, bio); return 0; } +EXPORT_SYMBOL(__bio_add_pc_page); + +int bio_add_pc_page(struct request_queue *q, struct bio *bio, + struct page *page, unsigned int len, unsigned int offset) +{ + return __bio_add_pc_page(q, bio, page, len, offset, false); +} EXPORT_SYMBOL(bio_add_pc_page); /** @@ -1397,21 +1408,14 @@ struct bio *bio_map_user_iov(struct request_queue *q, for (j = 0; j < npages; j++) { struct page *page = pages[j]; unsigned int n = PAGE_SIZE - offs; - unsigned short prev_bi_vcnt = bio->bi_vcnt; if (n > bytes) n = bytes; - if (!bio_add_pc_page(q, bio, page, n, offs)) + if (!__bio_add_pc_page(q, bio, page, n, offs, + true)) break; - /* - * check if vector was merged with previous - * drop page reference if needed - */ - if (bio->bi_vcnt == prev_bi_vcnt) - put_page(page); - added += n; bytes -= n; offs = 0; diff --git a/include/linux/bio.h b/include/linux/bio.h index bb6090aa165d..bb915591557b 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -432,6 +432,9 @@ void bio_chain(struct bio *, struct bio *); extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int); extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *, unsigned int, unsigned int); +extern int __bio_add_pc_page(struct request_queue *, struct bio *, + struct page *, unsigned int, unsigned int, + bool); bool __bio_try_merge_page(struct bio *bio, struct page *page, unsigned int len, unsigned int off, bool same_page); void __bio_add_page(struct bio *bio, struct page *page, From 489fbbcb51d0249569d863f9220de69cb31f1922 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Fri, 29 Mar 2019 15:08:00 +0800 Subject: [PATCH 017/164] block: enable multi-page bvec for passthrough IO Now the block IO stack is basically ready to support multi-page bvecs; however, they aren't enabled for passthrough IO. One reason is that passthrough IO is dispatched to the LLD directly and bio split is bypassed, so the bio has to be built correctly for dispatch to the LLD from the beginning. Implement multi-page support for passthrough IO by limiting each bvec to a single device segment and applying the relevant queue limits in bio_add_pc_page(). Then we no longer need to calculate segments for passthrough IO, and the code turns out to be much simpler.
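As an aside, the two per-segment constraints that the passthrough path now enforces while growing a bvec can be sketched in user space as follows. This is only a model: the boundary mask and maximum segment size are example numbers standing in for queue_segment_boundary() and queue_max_segment_size(), and can_extend_segment() is a made-up helper rather than a kernel function.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SEG_BOUNDARY_MASK 0xffffUL   /* segments may not cross 64 KB */
#define MAX_SEGMENT_SIZE  (64 * 1024)

struct seg {
	uint64_t phys;        /* physical address of the segment start */
	unsigned int len;
};

static bool can_extend_segment(const struct seg *s,
			       uint64_t page_phys, unsigned int add_len)
{
	uint64_t start = s->phys;
	uint64_t end = page_phys + add_len - 1;

	/* crossing the boundary: start and end fall in different windows */
	if ((start | SEG_BOUNDARY_MASK) != (end | SEG_BOUNDARY_MASK))
		return false;

	/* resulting segment would exceed the maximum segment size */
	if (s->len + add_len > MAX_SEGMENT_SIZE)
		return false;

	return true;
}

int main(void)
{
	struct seg s = { .phys = 0x10000, .len = 4096 };

	/* next page keeps the segment inside one 64 KB window */
	printf("%d\n", can_extend_segment(&s, 0x11000, 4096));   /* 1 */
	/* this page would make the segment straddle a 64 KB boundary */
	printf("%d\n", can_extend_segment(&s, 0x1f800, 4096));   /* 0 */
	return 0;
}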
Cc: Omar Sandoval Cc: Christoph Hellwig Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/bio.c | 60 +++++++++++++++++++++++++++-------------------------- 1 file changed, 31 insertions(+), 29 deletions(-) diff --git a/block/bio.c b/block/bio.c index 26853e072cd7..8d516d508ae3 100644 --- a/block/bio.c +++ b/block/bio.c @@ -665,6 +665,27 @@ static inline bool page_is_mergeable(const struct bio_vec *bv, return true; } +/* + * Check if the @page can be added to the current segment(@bv), and make + * sure to call it only if page_is_mergeable(@bv, @page) is true + */ +static bool can_add_page_to_seg(struct request_queue *q, + struct bio_vec *bv, struct page *page, unsigned len, + unsigned offset) +{ + unsigned long mask = queue_segment_boundary(q); + phys_addr_t addr1 = page_to_phys(bv->bv_page) + bv->bv_offset; + phys_addr_t addr2 = page_to_phys(page) + offset + len - 1; + + if ((addr1 | mask) != (addr2 | mask)) + return false; + + if (bv->bv_len + len > queue_max_segment_size(q)) + return false; + + return true; +} + /** * __bio_add_pc_page - attempt to add page to passthrough bio * @q: the target queue @@ -685,7 +706,6 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page, unsigned int len, unsigned int offset, bool put_same_page) { - int retried_segments = 0; struct bio_vec *bvec; /* @@ -709,6 +729,7 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio, offset == bvec->bv_offset + bvec->bv_len) { if (put_same_page) put_page(page); + bvec_merge: bvec->bv_len += len; bio->bi_iter.bi_size += len; goto done; @@ -720,11 +741,18 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio, */ if (bvec_gap_to_prev(q, bvec, offset)) return 0; + + if (page_is_mergeable(bvec, page, len, offset, false) && + can_add_page_to_seg(q, bvec, page, len, offset)) + goto bvec_merge; } if (bio_full(bio)) return 0; + if (bio->bi_phys_segments >= queue_max_segments(q)) + return 0; + /* * setup the new entry, we might clear it again later if we * cannot add the page @@ -734,38 +762,12 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio, bvec->bv_len = len; bvec->bv_offset = offset; bio->bi_vcnt++; - bio->bi_phys_segments++; bio->bi_iter.bi_size += len; - /* - * Perform a recount if the number of segments is greater - * than queue_max_segments(q). - */ - - while (bio->bi_phys_segments > queue_max_segments(q)) { - - if (retried_segments) - goto failed; - - retried_segments = 1; - blk_recount_segments(q, bio); - } - - /* If we may be able to merge these biovecs, force a recount */ - if (bio->bi_vcnt > 1 && biovec_phys_mergeable(q, bvec - 1, bvec)) - bio_clear_flag(bio, BIO_SEG_VALID); - done: + bio->bi_phys_segments = bio->bi_vcnt; + bio_set_flag(bio, BIO_SEG_VALID); return len; - - failed: - bvec->bv_page = NULL; - bvec->bv_len = 0; - bvec->bv_offset = 0; - bio->bi_vcnt--; - bio->bi_iter.bi_size -= len; - blk_recount_segments(q, bio); - return 0; } EXPORT_SYMBOL(__bio_add_pc_page); From cae6c2e54cc10514fec26e333f63c5cded9d2383 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sun, 17 Mar 2019 18:01:10 +0800 Subject: [PATCH 018/164] block: remove argument of 'request_queue' from __blk_bvec_map_sg The argument of 'request_queue' isn't used by __blk_bvec_map_sg(), so remove it. 
Cc: Omar Sandoval Cc: Christoph Hellwig Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-merge.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index aa9164eb7187..9ec704bb58ec 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -520,7 +520,7 @@ new_segment: *bvprv = *bvec; } -static inline int __blk_bvec_map_sg(struct request_queue *q, struct bio_vec bv, +static inline int __blk_bvec_map_sg(struct bio_vec bv, struct scatterlist *sglist, struct scatterlist **sg) { *sg = sglist; @@ -555,9 +555,9 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq, int nsegs = 0; if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) - nsegs = __blk_bvec_map_sg(q, rq->special_vec, sglist, &sg); + nsegs = __blk_bvec_map_sg(rq->special_vec, sglist, &sg); else if (rq->bio && bio_op(rq->bio) == REQ_OP_WRITE_SAME) - nsegs = __blk_bvec_map_sg(q, bio_iovec(rq->bio), sglist, &sg); + nsegs = __blk_bvec_map_sg(bio_iovec(rq->bio), sglist, &sg); else if (rq->bio) nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg); From 16e3e4187758d8936d358b26149de785b7d5a9b7 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sun, 17 Mar 2019 18:01:11 +0800 Subject: [PATCH 019/164] block: reuse __blk_bvec_map_sg() for mapping page sized bvec Inside __blk_segment_map_sg(), page sized bvec mapping is optimized a bit with one standalone branch. So reuse __blk_bvec_map_sg() to do that. Cc: Omar Sandoval Cc: Christoph Hellwig Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-merge.c | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index 9ec704bb58ec..3e934ee9a907 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -493,6 +493,14 @@ static unsigned blk_bvec_map_sg(struct request_queue *q, return nsegs; } +static inline int __blk_bvec_map_sg(struct bio_vec bv, + struct scatterlist *sglist, struct scatterlist **sg) +{ + *sg = blk_next_sg(sg, sglist); + sg_set_page(*sg, bv.bv_page, bv.bv_len, bv.bv_offset); + return 1; +} + static inline void __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec, struct scatterlist *sglist, struct bio_vec *bvprv, @@ -511,23 +519,13 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec, } else { new_segment: if (bvec->bv_offset + bvec->bv_len <= PAGE_SIZE) { - *sg = blk_next_sg(sg, sglist); - sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset); - (*nsegs) += 1; + (*nsegs) += __blk_bvec_map_sg(*bvec, sglist, sg); } else (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg); } *bvprv = *bvec; } -static inline int __blk_bvec_map_sg(struct bio_vec bv, - struct scatterlist *sglist, struct scatterlist **sg) -{ - *sg = sglist; - sg_set_page(*sg, bv.bv_page, bv.bv_len, bv.bv_offset); - return 1; -} - static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio, struct scatterlist *sglist, struct scatterlist **sg) From f6970f83ef79503cb24ca8324e5cfa1188674f85 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Sun, 17 Mar 2019 18:01:12 +0800 Subject: [PATCH 020/164] block: don't check if adjacent bvecs in one bio can be mergeable Now both passthrough and FS IO have supported multi-page bvec, and bvec merging has been handled actually when adding page to bio, then adjacent bvecs won't be mergeable any more if they belong to same bio. So only try to merge bvecs if they are from different bios. 
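For illustration only, a user-space model of the cross-bio merge decision is sketched below. struct range, merge_across_bios() and the 64 KB limit are invented for the example; the real code additionally applies the Xen and segment-boundary checks through biovec_phys_mergeable().

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_SEGMENT_SIZE (64 * 1024)

struct range {              /* stand-in for a bvec / scatterlist entry */
	uint64_t phys;
	unsigned int len;
};

/* fold the first bvec of the next bio into the last sg entry if possible */
static bool merge_across_bios(struct range *sg_end, const struct range *bv)
{
	if (sg_end->phys + sg_end->len != bv->phys)      /* not contiguous */
		return false;
	if (sg_end->len + bv->len > MAX_SEGMENT_SIZE)    /* segment too big */
		return false;

	sg_end->len += bv->len;                          /* fold it in */
	return true;
}

int main(void)
{
	struct range sg  = { .phys = 0x100000, .len = 8192 };
	struct range bv1 = { .phys = 0x102000, .len = 4096 };  /* contiguous */
	struct range bv2 = { .phys = 0x200000, .len = 4096 };  /* a gap      */

	printf("bv1: %s (sg len now %u)\n",
	       merge_across_bios(&sg, &bv1) ? "merged" : "new segment", sg.len);
	printf("bv2: %s (sg len now %u)\n",
	       merge_across_bios(&sg, &bv2) ? "merged" : "new segment", sg.len);
	return 0;
}

Within one bio this test is pointless after the earlier patches, because any two pages that could pass it would already have been folded into a single bvec when they were added.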
Cc: Omar Sandoval Cc: Christoph Hellwig Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-merge.c | 69 ++++++++++++++++++++++++++++------------------- 1 file changed, 42 insertions(+), 27 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index 3e934ee9a907..8f96d683b577 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -354,11 +354,11 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q, struct bio *bio) { struct bio_vec bv, bvprv = { NULL }; - int prev = 0; unsigned int seg_size, nr_phys_segs; unsigned front_seg_size; struct bio *fbio, *bbio; struct bvec_iter iter; + bool new_bio = false; if (!bio) return 0; @@ -379,7 +379,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q, nr_phys_segs = 0; for_each_bio(bio) { bio_for_each_bvec(bv, bio, iter) { - if (prev) { + if (new_bio) { if (seg_size + bv.bv_len > queue_max_segment_size(q)) goto new_segment; @@ -387,7 +387,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q, goto new_segment; seg_size += bv.bv_len; - bvprv = bv; if (nr_phys_segs == 1 && seg_size > front_seg_size) @@ -396,12 +395,13 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q, continue; } new_segment: - bvprv = bv; - prev = 1; bvec_split_segs(q, &bv, &nr_phys_segs, &seg_size, &front_seg_size, NULL, UINT_MAX); + new_bio = false; } bbio = bio; + bvprv = bv; + new_bio = true; } fbio->bi_seg_front_size = front_seg_size; @@ -501,29 +501,26 @@ static inline int __blk_bvec_map_sg(struct bio_vec bv, return 1; } -static inline void -__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec, - struct scatterlist *sglist, struct bio_vec *bvprv, - struct scatterlist **sg, int *nsegs) +/* only try to merge bvecs into one sg if they are from two bios */ +static inline bool +__blk_segment_map_sg_merge(struct request_queue *q, struct bio_vec *bvec, + struct bio_vec *bvprv, struct scatterlist **sg) { int nbytes = bvec->bv_len; - if (*sg) { - if ((*sg)->length + nbytes > queue_max_segment_size(q)) - goto new_segment; - if (!biovec_phys_mergeable(q, bvprv, bvec)) - goto new_segment; + if (!*sg) + return false; - (*sg)->length += nbytes; - } else { -new_segment: - if (bvec->bv_offset + bvec->bv_len <= PAGE_SIZE) { - (*nsegs) += __blk_bvec_map_sg(*bvec, sglist, sg); - } else - (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg); - } - *bvprv = *bvec; + if ((*sg)->length + nbytes > queue_max_segment_size(q)) + return false; + + if (!biovec_phys_mergeable(q, bvprv, bvec)) + return false; + + (*sg)->length += nbytes; + + return true; } static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio, @@ -533,11 +530,29 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio, struct bio_vec bvec, bvprv = { NULL }; struct bvec_iter iter; int nsegs = 0; + bool new_bio = false; - for_each_bio(bio) - bio_for_each_bvec(bvec, bio, iter) - __blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg, - &nsegs); + for_each_bio(bio) { + bio_for_each_bvec(bvec, bio, iter) { + /* + * Only try to merge bvecs from two bios given we + * have done bio internal merge when adding pages + * to bio + */ + if (new_bio && + __blk_segment_map_sg_merge(q, &bvec, &bvprv, sg)) + goto next_bvec; + + if (bvec.bv_offset + bvec.bv_len <= PAGE_SIZE) + nsegs += __blk_bvec_map_sg(bvec, sglist, sg); + else + nsegs += blk_bvec_map_sg(q, &bvec, sglist, sg); + next_bvec: + new_bio = false; + } + bvprv = bvec; + new_bio = true; + } return nsegs; } From 81ba6abd2bcd2812974bd3a4c43d1d032acfa751 Mon Sep 17 
00:00:00 2001 From: Ming Lei Date: Thu, 28 Mar 2019 11:05:31 +0800 Subject: [PATCH 021/164] block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF loop is one block device, for any bio submitted to this device, the upper layer does guarantee that pages added to loop's bio won't go away when the bio is in-flight. So mark loop's bvec as ITER_BVEC_FLAG_NO_REF then get_page/put_page can be saved for serving loop's IO. Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- drivers/block/loop.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index c20710e617c2..102d79575895 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -264,12 +264,20 @@ lo_do_transfer(struct loop_device *lo, int cmd, return ret; } +static inline void loop_iov_iter_bvec(struct iov_iter *i, + unsigned int direction, const struct bio_vec *bvec, + unsigned long nr_segs, size_t count) +{ + iov_iter_bvec(i, direction, bvec, nr_segs, count); + i->type |= ITER_BVEC_FLAG_NO_REF; +} + static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos) { struct iov_iter i; ssize_t bw; - iov_iter_bvec(&i, WRITE, bvec, 1, bvec->bv_len); + loop_iov_iter_bvec(&i, WRITE, bvec, 1, bvec->bv_len); file_start_write(file); bw = vfs_iter_write(file, &i, ppos, 0); @@ -347,7 +355,7 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq, ssize_t len; rq_for_each_segment(bvec, rq, iter) { - iov_iter_bvec(&i, READ, &bvec, 1, bvec.bv_len); + loop_iov_iter_bvec(&i, READ, &bvec, 1, bvec.bv_len); len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0); if (len < 0) return len; @@ -388,7 +396,7 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq, b.bv_offset = 0; b.bv_len = bvec.bv_len; - iov_iter_bvec(&i, READ, &b, 1, b.bv_len); + loop_iov_iter_bvec(&i, READ, &b, 1, b.bv_len); len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0); if (len < 0) { ret = len; @@ -555,7 +563,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, } atomic_set(&cmd->ref, 2); - iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq)); + loop_iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq)); iter.iov_offset = offset; cmd->iocb.ki_pos = pos; From 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef Mon Sep 17 00:00:00 2001 From: Nigel Croxon Date: Fri, 29 Mar 2019 10:46:15 -0700 Subject: [PATCH 022/164] Don't jump to compute_result state from check_result state Changing state from check_state_check_result to check_state_compute_result not only is unsafe but also doesn't appear to serve a valid purpose. A raid6 check should only be pushing out extra writes if doing repair and a mis-match occurs. The stripe dev management will already try and do repair writes for failing sectors. This patch makes the raid6 check_state_check_result handling work more like raid5's. If somehow too many failures for a check, just quit the check operation for the stripe. When any checks pass, don't try and use check_state_compute_result for a purpose it isn't needed for and is unsafe for. Just mark the stripe as in sync for passing its parity checks and let the stripe dev read/write code and the bad blocks list do their job handling I/O errors. Repro steps from Xiao: These are the steps to reproduce this problem: 1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c 2. insmod scsi_debug.ko dev_size_mb=11000 max_luns=1 num_tgts=1 3. 
mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6 sde is the disk created by scsi_debug 4. echo "2" >/sys/module/scsi_debug/parameters/opts 5. raid-check It panic: [ 4854.730899] md: data-check of RAID array md127 [ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current] [ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error [ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00 [ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0 [ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current] [ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error [ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00 [ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000 [ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current] [ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error [ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00 [ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000 [ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current] [ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error [ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00 [ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0 [ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1). [ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1). [ 4854.906201] ------------[ cut here ]------------ [ 4854.907341] kernel BUG at drivers/md/raid5.c:4190! raid5.c:4190 above is this BUG_ON: handle_parity_checks6() ... BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */ Cc: # v3.16+ OriginalAuthor: David Jeffery Cc: Xiao Ni Tested-by: David Jeffery Signed-off-by: David Jeffy Signed-off-by: Nigel Croxon Signed-off-by: Song Liu Signed-off-by: Jens Axboe --- drivers/md/raid5.c | 19 ++++--------------- 1 file changed, 4 insertions(+), 15 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index c033bfcb209e..364dd2f6fa1b 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, case check_state_check_result: sh->check_state = check_state_idle; + if (s->failed > 1) + break; /* handle a successful check operation, if parity is correct * we are done. 
Otherwise update the mismatch count and repair * parity if !MD_RECOVERY_CHECK */ if (sh->ops.zero_sum_result == 0) { - /* both parities are correct */ - if (!s->failed) - set_bit(STRIPE_INSYNC, &sh->state); - else { - /* in contrast to the raid5 case we can validate - * parity, but still have a failure to write - * back - */ - sh->check_state = check_state_compute_result; - /* Returning at this point means that we may go - * off and bring p and/or q uptodate again so - * we make sure to check zero_sum_result again - * to verify if p or q need writeback - */ - } + /* Any parity checked was correct */ + set_bit(STRIPE_INSYNC, &sh->state); } else { atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { From 4bc034d35377196c854236133b07730a777c4aba Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 29 Mar 2019 10:46:16 -0700 Subject: [PATCH 023/164] Revert "MD: fix lock contention for flush bios" This reverts commit 5a409b4f56d50b212334f338cb8465d65550cd85. This patch has two problems. 1/ it makes multiple calls to submit_bio() from inside a make_request_fn. The bios thus submitted will be queued on current->bio_list and not submitted immediately. As the bios are allocated from a mempool, this can theoretically result in a deadlock - the whole pool of requests could be sitting on various ->bio_list queues and a subsequent mempool_alloc could block waiting for one of them to be released. 2/ It aims to handle a case when there are many concurrent flush requests. It handles this by submitting many requests in parallel - all of which are identical and so most of which do nothing useful. It would be more efficient to just send one lower-level request, but allow that to satisfy multiple upper-level requests.
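The batching patch later in this series follows this reasoning by timestamping flushes. A loose single-threaded sketch of the idea (a model only, not the md code, which does this under mddev->lock with a workqueue) might look as follows; the logical clock and both helper functions are invented for the example.

#include <stdbool.h>
#include <stdio.h>

static unsigned long now;          /* logical clock */
static unsigned long start_flush;  /* when the in-flight flush started */
static unsigned long last_flush;   /* start time of last completed flush */

static void run_device_flush(void)
{
	start_flush = ++now;       /* flush begins */
	++now;                     /* ... device performs the flush ... */
	last_flush = start_flush;  /* flush completed */
}

/* returns true if a new device flush had to be issued */
static bool handle_flush_request(unsigned long arrived)
{
	if (last_flush > arrived) {
		/* a flush started after this request arrived and has
		 * already finished, so nothing more needs to be done */
		return false;
	}
	run_device_flush();
	return true;
}

int main(void)
{
	unsigned long t1 = ++now;   /* three requests arrive back to back */
	unsigned long t2 = ++now;
	unsigned long t3 = ++now;

	printf("req1 issued flush: %d\n", handle_flush_request(t1)); /* 1 */
	printf("req2 issued flush: %d\n", handle_flush_request(t2)); /* 0 */
	printf("req3 issued flush: %d\n", handle_flush_request(t3)); /* 0 */
	return 0;
}

In the model, requests 2 and 3 complete without a second device flush because a flush that started after they arrived has already finished on their behalf.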
Fixes: 5a409b4f56d5 ("MD: fix lock contention for flush bios") Cc: # v4.19+ Tested-by: Xiao Ni Signed-off-by: NeilBrown Signed-off-by: Song Liu Signed-off-by: Jens Axboe --- drivers/md/md.c | 161 +++++++++++++++++------------------------------- drivers/md/md.h | 22 +++---- 2 files changed, 63 insertions(+), 120 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 05ffffb8b769..021e522001c1 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -132,24 +132,6 @@ static inline int speed_max(struct mddev *mddev) mddev->sync_speed_max : sysctl_speed_limit_max; } -static void * flush_info_alloc(gfp_t gfp_flags, void *data) -{ - return kzalloc(sizeof(struct flush_info), gfp_flags); -} -static void flush_info_free(void *flush_info, void *data) -{ - kfree(flush_info); -} - -static void * flush_bio_alloc(gfp_t gfp_flags, void *data) -{ - return kzalloc(sizeof(struct flush_bio), gfp_flags); -} -static void flush_bio_free(void *flush_bio, void *data) -{ - kfree(flush_bio); -} - static struct ctl_table_header *raid_table_header; static struct ctl_table raid_table[] = { @@ -423,54 +405,30 @@ static int md_congested(void *data, int bits) /* * Generic flush handling for md */ -static void submit_flushes(struct work_struct *ws) + +static void md_end_flush(struct bio *bio) { - struct flush_info *fi = container_of(ws, struct flush_info, flush_work); - struct mddev *mddev = fi->mddev; - struct bio *bio = fi->bio; - - bio->bi_opf &= ~REQ_PREFLUSH; - md_handle_request(mddev, bio); - - mempool_free(fi, mddev->flush_pool); -} - -static void md_end_flush(struct bio *fbio) -{ - struct flush_bio *fb = fbio->bi_private; - struct md_rdev *rdev = fb->rdev; - struct flush_info *fi = fb->fi; - struct bio *bio = fi->bio; - struct mddev *mddev = fi->mddev; + struct md_rdev *rdev = bio->bi_private; + struct mddev *mddev = rdev->mddev; rdev_dec_pending(rdev, mddev); - if (atomic_dec_and_test(&fi->flush_pending)) { - if (bio->bi_iter.bi_size == 0) { - /* an empty barrier - all done */ - bio_endio(bio); - mempool_free(fi, mddev->flush_pool); - } else { - INIT_WORK(&fi->flush_work, submit_flushes); - queue_work(md_wq, &fi->flush_work); - } + if (atomic_dec_and_test(&mddev->flush_pending)) { + /* The pre-request flush has finished */ + queue_work(md_wq, &mddev->flush_work); } - - mempool_free(fb, mddev->flush_bio_pool); - bio_put(fbio); + bio_put(bio); } -void md_flush_request(struct mddev *mddev, struct bio *bio) +static void md_submit_flush_data(struct work_struct *ws); + +static void submit_flushes(struct work_struct *ws) { + struct mddev *mddev = container_of(ws, struct mddev, flush_work); struct md_rdev *rdev; - struct flush_info *fi; - - fi = mempool_alloc(mddev->flush_pool, GFP_NOIO); - - fi->bio = bio; - fi->mddev = mddev; - atomic_set(&fi->flush_pending, 1); + INIT_WORK(&mddev->flush_work, md_submit_flush_data); + atomic_set(&mddev->flush_pending, 1); rcu_read_lock(); rdev_for_each_rcu(rdev, mddev) if (rdev->raid_disk >= 0 && @@ -480,40 +438,59 @@ void md_flush_request(struct mddev *mddev, struct bio *bio) * we reclaim rcu_read_lock */ struct bio *bi; - struct flush_bio *fb; atomic_inc(&rdev->nr_pending); atomic_inc(&rdev->nr_pending); rcu_read_unlock(); - - fb = mempool_alloc(mddev->flush_bio_pool, GFP_NOIO); - fb->fi = fi; - fb->rdev = rdev; - bi = bio_alloc_mddev(GFP_NOIO, 0, mddev); - bio_set_dev(bi, rdev->bdev); bi->bi_end_io = md_end_flush; - bi->bi_private = fb; + bi->bi_private = rdev; + bio_set_dev(bi, rdev->bdev); bi->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; - - atomic_inc(&fi->flush_pending); + 
atomic_inc(&mddev->flush_pending); submit_bio(bi); - rcu_read_lock(); rdev_dec_pending(rdev, mddev); } rcu_read_unlock(); + if (atomic_dec_and_test(&mddev->flush_pending)) + queue_work(md_wq, &mddev->flush_work); +} - if (atomic_dec_and_test(&fi->flush_pending)) { - if (bio->bi_iter.bi_size == 0) { - /* an empty barrier - all done */ - bio_endio(bio); - mempool_free(fi, mddev->flush_pool); - } else { - INIT_WORK(&fi->flush_work, submit_flushes); - queue_work(md_wq, &fi->flush_work); - } +static void md_submit_flush_data(struct work_struct *ws) +{ + struct mddev *mddev = container_of(ws, struct mddev, flush_work); + struct bio *bio = mddev->flush_bio; + + /* + * must reset flush_bio before calling into md_handle_request to avoid a + * deadlock, because other bios passed md_handle_request suspend check + * could wait for this and below md_handle_request could wait for those + * bios because of suspend check + */ + mddev->flush_bio = NULL; + wake_up(&mddev->sb_wait); + + if (bio->bi_iter.bi_size == 0) { + /* an empty barrier - all done */ + bio_endio(bio); + } else { + bio->bi_opf &= ~REQ_PREFLUSH; + md_handle_request(mddev, bio); } } + +void md_flush_request(struct mddev *mddev, struct bio *bio) +{ + spin_lock_irq(&mddev->lock); + wait_event_lock_irq(mddev->sb_wait, + !mddev->flush_bio, + mddev->lock); + mddev->flush_bio = bio; + spin_unlock_irq(&mddev->lock); + + INIT_WORK(&mddev->flush_work, submit_flushes); + queue_work(md_wq, &mddev->flush_work); +} EXPORT_SYMBOL(md_flush_request); static inline struct mddev *mddev_get(struct mddev *mddev) @@ -560,6 +537,7 @@ void mddev_init(struct mddev *mddev) atomic_set(&mddev->openers, 0); atomic_set(&mddev->active_io, 0); spin_lock_init(&mddev->lock); + atomic_set(&mddev->flush_pending, 0); init_waitqueue_head(&mddev->sb_wait); init_waitqueue_head(&mddev->recovery_wait); mddev->reshape_position = MaxSector; @@ -5511,22 +5489,6 @@ int md_run(struct mddev *mddev) if (err) return err; } - if (mddev->flush_pool == NULL) { - mddev->flush_pool = mempool_create(NR_FLUSH_INFOS, flush_info_alloc, - flush_info_free, mddev); - if (!mddev->flush_pool) { - err = -ENOMEM; - goto abort; - } - } - if (mddev->flush_bio_pool == NULL) { - mddev->flush_bio_pool = mempool_create(NR_FLUSH_BIOS, flush_bio_alloc, - flush_bio_free, mddev); - if (!mddev->flush_bio_pool) { - err = -ENOMEM; - goto abort; - } - } spin_lock(&pers_lock); pers = find_pers(mddev->level, mddev->clevel); @@ -5686,11 +5648,8 @@ int md_run(struct mddev *mddev) return 0; abort: - mempool_destroy(mddev->flush_bio_pool); - mddev->flush_bio_pool = NULL; - mempool_destroy(mddev->flush_pool); - mddev->flush_pool = NULL; - + bioset_exit(&mddev->bio_set); + bioset_exit(&mddev->sync_set); return err; } EXPORT_SYMBOL_GPL(md_run); @@ -5894,14 +5853,6 @@ static void __md_stop(struct mddev *mddev) mddev->to_remove = &md_redundancy_group; module_put(pers->owner); clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); - if (mddev->flush_bio_pool) { - mempool_destroy(mddev->flush_bio_pool); - mddev->flush_bio_pool = NULL; - } - if (mddev->flush_pool) { - mempool_destroy(mddev->flush_pool); - mddev->flush_pool = NULL; - } } void md_stop(struct mddev *mddev) diff --git a/drivers/md/md.h b/drivers/md/md.h index c52afb52c776..2deb84fa93f9 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -252,19 +252,6 @@ enum mddev_sb_flags { MD_SB_NEED_REWRITE, /* metadata write needs to be repeated */ }; -#define NR_FLUSH_INFOS 8 -#define NR_FLUSH_BIOS 64 -struct flush_info { - struct bio *bio; - struct mddev *mddev; - struct work_struct 
flush_work; - atomic_t flush_pending; -}; -struct flush_bio { - struct flush_info *fi; - struct md_rdev *rdev; -}; - struct mddev { void *private; struct md_personality *pers; @@ -470,8 +457,13 @@ struct mddev { * metadata and bitmap writes */ - mempool_t *flush_pool; - mempool_t *flush_bio_pool; + /* Generic flush handling. + * The last to finish preflush schedules a worker to submit + * the rest of the request (without the REQ_PREFLUSH flag). + */ + struct bio *flush_bio; + atomic_t flush_pending; + struct work_struct flush_work; struct work_struct event_work; /* used by dm to report failure event */ void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); struct md_cluster_info *cluster_info; From 2bc13b83e6298486371761de503faeffd15b7534 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 29 Mar 2019 10:46:17 -0700 Subject: [PATCH 024/164] md: batch flush requests. Currently if many flush requests are submitted to an md device is quick succession, they are serialized and can take a long to process them all. We don't really need to call flush all those times - a single flush call can satisfy all requests submitted before it started. So keep track of when the current flush started and when it finished, allow any pending flush that was requested before the flush started to complete without waiting any more. Test results from Xiao: Test is done on a raid10 device which is created by 4 SSDs. The tool is dbench. 1. The latest linux stable kernel Operation Count AvgLat MaxLat -------------------------------------------------- Deltree 768 10.509 78.305 Flush 2078376 0.013 10.094 Close 21787697 0.019 18.821 LockX 96580 0.007 3.184 Mkdir 384 0.008 0.062 Rename 1255883 0.191 23.534 ReadX 46495589 0.020 14.230 WriteX 14790591 7.123 60.706 Unlink 5989118 0.440 54.551 UnlockX 96580 0.005 2.736 FIND_FIRST 10393845 0.042 12.079 SET_FILE_INFORMATION 2415558 0.129 10.088 QUERY_FILE_INFORMATION 4711725 0.005 8.462 QUERY_PATH_INFORMATION 26883327 0.032 21.715 QUERY_FS_INFORMATION 4929409 0.010 8.238 NTCreateX 29660080 0.100 53.268 Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs max_latency=60.712 ms 2. With patch1 "Revert "MD: fix lock contention for flush bios"" Operation Count AvgLat MaxLat -------------------------------------------------- Deltree 256 8.326 36.761 Flush 693291 3.974 180.269 Close 7266404 0.009 36.929 LockX 32160 0.006 0.840 Mkdir 128 0.008 0.021 Rename 418755 0.063 29.945 ReadX 15498708 0.007 7.216 WriteX 4932310 22.482 267.928 Unlink 1997557 0.109 47.553 UnlockX 32160 0.004 1.110 FIND_FIRST 3465791 0.036 7.320 SET_FILE_INFORMATION 805825 0.015 1.561 QUERY_FILE_INFORMATION 1570950 0.005 2.403 QUERY_PATH_INFORMATION 8965483 0.013 14.277 QUERY_FS_INFORMATION 1643626 0.009 3.314 NTCreateX 9892174 0.061 41.278 Throughput 345.009 MB/sec (sync open) 128 clients 128 procs max_latency=267.939 m 3. 
With patch1 and patch2 Operation Count AvgLat MaxLat -------------------------------------------------- Deltree 768 9.570 54.588 Flush 2061354 0.666 15.102 Close 21604811 0.012 25.697 LockX 95770 0.007 1.424 Mkdir 384 0.008 0.053 Rename 1245411 0.096 12.263 ReadX 46103198 0.011 12.116 WriteX 14667988 7.375 60.069 Unlink 5938936 0.173 30.905 UnlockX 95770 0.005 4.147 FIND_FIRST 10306407 0.041 11.715 SET_FILE_INFORMATION 2395987 0.048 7.640 QUERY_FILE_INFORMATION 4672371 0.005 9.291 QUERY_PATH_INFORMATION 26656735 0.018 19.719 QUERY_FS_INFORMATION 4887940 0.010 7.654 NTCreateX 29410811 0.059 28.551 Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs max_latency=60.075 ms Cc: # v4.19+ Tested-by: Xiao Ni Signed-off-by: NeilBrown Signed-off-by: Song Liu Signed-off-by: Jens Axboe --- drivers/md/md.c | 27 +++++++++++++++++++++++---- drivers/md/md.h | 3 +++ 2 files changed, 26 insertions(+), 4 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 021e522001c1..d0f688399a56 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -427,6 +427,7 @@ static void submit_flushes(struct work_struct *ws) struct mddev *mddev = container_of(ws, struct mddev, flush_work); struct md_rdev *rdev; + mddev->start_flush = ktime_get_boottime(); INIT_WORK(&mddev->flush_work, md_submit_flush_data); atomic_set(&mddev->flush_pending, 1); rcu_read_lock(); @@ -467,6 +468,7 @@ static void md_submit_flush_data(struct work_struct *ws) * could wait for this and below md_handle_request could wait for those * bios because of suspend check */ + mddev->last_flush = mddev->start_flush; mddev->flush_bio = NULL; wake_up(&mddev->sb_wait); @@ -481,15 +483,32 @@ static void md_submit_flush_data(struct work_struct *ws) void md_flush_request(struct mddev *mddev, struct bio *bio) { + ktime_t start = ktime_get_boottime(); spin_lock_irq(&mddev->lock); wait_event_lock_irq(mddev->sb_wait, - !mddev->flush_bio, + !mddev->flush_bio || + ktime_after(mddev->last_flush, start), mddev->lock); - mddev->flush_bio = bio; + if (!ktime_after(mddev->last_flush, start)) { + WARN_ON(mddev->flush_bio); + mddev->flush_bio = bio; + bio = NULL; + } spin_unlock_irq(&mddev->lock); - INIT_WORK(&mddev->flush_work, submit_flushes); - queue_work(md_wq, &mddev->flush_work); + if (!bio) { + INIT_WORK(&mddev->flush_work, submit_flushes); + queue_work(md_wq, &mddev->flush_work); + } else { + /* flush was performed for some other bio while we waited. */ + if (bio->bi_iter.bi_size == 0) + /* an empty barrier - all done */ + bio_endio(bio); + else { + bio->bi_opf &= ~REQ_PREFLUSH; + mddev->pers->make_request(mddev, bio); + } + } } EXPORT_SYMBOL(md_flush_request); diff --git a/drivers/md/md.h b/drivers/md/md.h index 2deb84fa93f9..257cb4c9e22b 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -463,6 +463,9 @@ struct mddev { */ struct bio *flush_bio; atomic_t flush_pending; + ktime_t start_flush, last_flush; /* last_flush is when the last completed + * flush was started. + */ struct work_struct flush_work; struct work_struct event_work; /* used by dm to report failure event */ void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); From 2b24e6f63ac9e817630424c6d8f008256348dfc4 Mon Sep 17 00:00:00 2001 From: Johannes Thumshirn Date: Wed, 3 Apr 2019 11:15:19 +0200 Subject: [PATCH 025/164] block: bio: ensure newly added bio flags don't override BVEC_POOL_IDX With the introduction of BIO_NO_PAGE_REF we've used up all available bits in bio::bi_flags. 
Convert the defines of the flags to an enum and add a BUILD_BUG_ON() call to make sure no-one adds a new one and thus overrides the BVEC_POOL_IDX causing crashes. Reviewed-by: Ming Lei Reviewed-by: Hannes Reinecke Reviewed-by: Bart Van Assche Reviewed-by: Christoph Hellwig Signed-off-by: Johannes Thumshirn Signed-off-by: Jens Axboe --- block/bio.c | 3 +++ include/linux/blk_types.h | 29 ++++++++++++++++------------- 2 files changed, 19 insertions(+), 13 deletions(-) diff --git a/block/bio.c b/block/bio.c index 8d516d508ae3..c2592c5d70b9 100644 --- a/block/bio.c +++ b/block/bio.c @@ -2218,6 +2218,9 @@ static int __init init_bio(void) bio_slab_nr = 0; bio_slabs = kcalloc(bio_slab_max, sizeof(struct bio_slab), GFP_KERNEL); + + BUILD_BUG_ON(BIO_FLAG_LAST > BVEC_POOL_OFFSET); + if (!bio_slabs) panic("bio: can't allocate bios\n"); diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 791fee35df88..be418275763c 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -215,21 +215,24 @@ struct bio { /* * bio flags */ -#define BIO_NO_PAGE_REF 0 /* don't put release vec pages */ -#define BIO_SEG_VALID 1 /* bi_phys_segments valid */ -#define BIO_CLONED 2 /* doesn't own data */ -#define BIO_BOUNCED 3 /* bio is a bounce bio */ -#define BIO_USER_MAPPED 4 /* contains user pages */ -#define BIO_NULL_MAPPED 5 /* contains invalid user pages */ -#define BIO_QUIET 6 /* Make BIO Quiet */ -#define BIO_CHAIN 7 /* chained bio, ->bi_remaining in effect */ -#define BIO_REFFED 8 /* bio has elevated ->bi_cnt */ -#define BIO_THROTTLED 9 /* This bio has already been subjected to +enum { + BIO_NO_PAGE_REF, /* don't put release vec pages */ + BIO_SEG_VALID, /* bi_phys_segments valid */ + BIO_CLONED, /* doesn't own data */ + BIO_BOUNCED, /* bio is a bounce bio */ + BIO_USER_MAPPED, /* contains user pages */ + BIO_NULL_MAPPED, /* contains invalid user pages */ + BIO_QUIET, /* Make BIO Quiet */ + BIO_CHAIN, /* chained bio, ->bi_remaining in effect */ + BIO_REFFED, /* bio has elevated ->bi_cnt */ + BIO_THROTTLED, /* This bio has already been subjected to * throttling rules. Don't do it again. */ -#define BIO_TRACE_COMPLETION 10 /* bio_endio() should trace the final completion + BIO_TRACE_COMPLETION, /* bio_endio() should trace the final completion * of this bio. */ -#define BIO_QUEUE_ENTERED 11 /* can use blk_queue_enter_live() */ -#define BIO_TRACKED 12 /* set if bio goes through the rq_qos path */ + BIO_QUEUE_ENTERED, /* can use blk_queue_enter_live() */ + BIO_TRACKED, /* set if bio goes through the rq_qos path */ + BIO_FLAG_LAST +}; /* See BVEC_POOL_OFFSET below before adding new flags */ From 43e2d08d0790a09ee1d2c838e33343ee1bcaf610 Mon Sep 17 00:00:00 2001 From: Max Gurtovoy Date: Mon, 25 Feb 2019 13:00:04 +0200 Subject: [PATCH 026/164] nvme: avoid double dereference to convert le to cpu Use le16_to_cpu instead of le16_to_cpup and le64_to_cpu instead of le64_to_cpup. This will also align the code to nvme-core driver convention. 
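For reference, the change is purely mechanical: the *_to_cpup() variants take a pointer, so passing the address of a field only to dereference it again buys nothing. A condensed before/after, using a field from the hunk below:

  /* before: take the address of the field, then dereference it again */
  ctrl->oncs = le16_to_cpup(&id->oncs);

  /* after: convert the field value directly */
  ctrl->oncs = le16_to_cpu(id->oncs);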
Signed-off-by: Max Gurtovoy Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 470601980794..b5939112b9b6 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1588,7 +1588,7 @@ static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b) static void nvme_update_disk_info(struct gendisk *disk, struct nvme_ns *ns, struct nvme_id_ns *id) { - sector_t capacity = le64_to_cpup(&id->nsze) << (ns->lba_shift - 9); + sector_t capacity = le64_to_cpu(id->nsze) << (ns->lba_shift - 9); unsigned short bs = 1 << ns->lba_shift; blk_mq_freeze_queue(disk->queue); @@ -2549,7 +2549,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl) ctrl->crdt[2] = le16_to_cpu(id->crdt3); ctrl->oacs = le16_to_cpu(id->oacs); - ctrl->oncs = le16_to_cpup(&id->oncs); + ctrl->oncs = le16_to_cpu(id->oncs); ctrl->oaes = le32_to_cpu(id->oaes); atomic_set(&ctrl->abort_limit, id->acl + 1); ctrl->vwc = id->vwc; From cfe03c2ec46200dfefa5167e6a2bb50c0426c5f4 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 12 Mar 2019 18:05:10 +0100 Subject: [PATCH 027/164] nvmet: avoid double errno conversions Use errno_to_nvme_status to convert from a negative errno to a nvme status field instead of going through a blk_status_t. Also remove the pointless status variable in nvmet_bdev_execute_write_zeroes. Signed-off-by: Christoph Hellwig Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/target/io-cmd-bdev.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index a065dbfc43b1..3efc52f9c309 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -196,7 +196,7 @@ static u16 nvmet_bdev_discard_range(struct nvmet_req *req, GFP_KERNEL, 0, bio); if (ret && ret != -EOPNOTSUPP) { req->error_slba = le64_to_cpu(range->slba); - return blk_to_nvme_status(req, errno_to_blk_status(ret)); + return errno_to_nvme_status(req, ret); } return NVME_SC_SUCCESS; } @@ -252,7 +252,6 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req) { struct nvme_write_zeroes_cmd *write_zeroes = &req->cmd->write_zeroes; struct bio *bio = NULL; - u16 status = NVME_SC_SUCCESS; sector_t sector; sector_t nr_sector; int ret; @@ -264,13 +263,12 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req) ret = __blkdev_issue_zeroout(req->ns->bdev, sector, nr_sector, GFP_KERNEL, &bio, 0); - status = blk_to_nvme_status(req, errno_to_blk_status(ret)); if (bio) { bio->bi_private = req; bio->bi_end_io = nvmet_bio_done; submit_bio(bio); } else { - nvmet_req_complete(req, status); + nvmet_req_complete(req, errno_to_nvme_status(req, ret)); } } From 6b80f1d2cc5a76f0121a43a612515bc8a8976e66 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Sat, 23 Feb 2019 12:51:08 -0600 Subject: [PATCH 028/164] nvmet-fc: use zero-sized array and struct_size() in kzalloc() Update the code to use a zero-sized array instead of a pointer in structure nvmet_fc_tgt_queue and use struct_size() in kzalloc(). Notice that one of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. 
For example: struct foo { int stuff; struct boo entry[]; }; instance = kzalloc(sizeof(struct foo) + sizeof(struct boo) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva Reviewed-by: James Smart Signed-off-by: Christoph Hellwig --- drivers/nvme/target/fc.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c index 98b7b1f4ee96..9369a11fe7a9 100644 --- a/drivers/nvme/target/fc.c +++ b/drivers/nvme/target/fc.c @@ -128,12 +128,12 @@ struct nvmet_fc_tgt_queue { struct nvmet_cq nvme_cq; struct nvmet_sq nvme_sq; struct nvmet_fc_tgt_assoc *assoc; - struct nvmet_fc_fcp_iod *fod; /* array of fcp_iods */ struct list_head fod_list; struct list_head pending_cmd_list; struct list_head avail_defer_list; struct workqueue_struct *work_q; struct kref ref; + struct nvmet_fc_fcp_iod fod[]; /* array of fcp_iods */ } __aligned(sizeof(unsigned long long)); struct nvmet_fc_tgt_assoc { @@ -588,9 +588,7 @@ nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc, if (qid > NVMET_NR_QUEUES) return NULL; - queue = kzalloc((sizeof(*queue) + - (sizeof(struct nvmet_fc_fcp_iod) * sqsize)), - GFP_KERNEL); + queue = kzalloc(struct_size(queue, fod, sqsize), GFP_KERNEL); if (!queue) return NULL; @@ -603,7 +601,6 @@ nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc, if (!queue->work_q) goto out_a_put; - queue->fod = (struct nvmet_fc_fcp_iod *)&queue[1]; queue->qid = qid; queue->sqsize = sqsize; queue->assoc = assoc; From 70583295388a3ceabc22bddeb16e788f58225d70 Mon Sep 17 00:00:00 2001 From: Sagi Grimberg Date: Fri, 8 Mar 2019 15:41:21 -0800 Subject: [PATCH 029/164] nvmet-tcp: implement C2HData SUCCESS optimization TP 8000 says that the use of the SUCCESS flag depends on whether the controller supports disabling sq_head pointer updates. Given that we support it by default, it makes sense to go the extra mile and actually use the SUCCESS flag. When we create the C2HData PDU header, we check if sqhd_disabled is set on our queue; if so, we set the SUCCESS flag in the PDU header and skip sending a completion response capsule. Signed-off-by: Sagi Grimberg Reviewed-by: Oliver Smith-Denny Tested-by: Oliver Smith-Denny Signed-off-by: Christoph Hellwig --- drivers/nvme/target/tcp.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c index ad0df786fe93..0a941abf56ec 100644 --- a/drivers/nvme/target/tcp.c +++ b/drivers/nvme/target/tcp.c @@ -371,7 +371,8 @@ static void nvmet_setup_c2h_data_pdu(struct nvmet_tcp_cmd *cmd) cmd->state = NVMET_TCP_SEND_DATA_PDU; pdu->hdr.type = nvme_tcp_c2h_data; - pdu->hdr.flags = NVME_TCP_F_DATA_LAST; + pdu->hdr.flags = NVME_TCP_F_DATA_LAST | (queue->nvme_sq.sqhd_disabled ?
+ NVME_TCP_F_DATA_SUCCESS : 0); pdu->hdr.hlen = sizeof(*pdu); pdu->hdr.pdo = pdu->hdr.hlen + hdgst; pdu->hdr.plen = @@ -542,8 +543,19 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd) cmd->state = NVMET_TCP_SEND_DDGST; cmd->offset = 0; } else { - nvmet_setup_response_pdu(cmd); + if (queue->nvme_sq.sqhd_disabled) { + cmd->queue->snd_cmd = NULL; + nvmet_tcp_put_cmd(cmd); + } else { + nvmet_setup_response_pdu(cmd); + } } + + if (queue->nvme_sq.sqhd_disabled) { + kfree(cmd->iov); + sgl_free(cmd->req.sg); + } + return 1; } @@ -619,7 +631,13 @@ static int nvmet_try_send_ddgst(struct nvmet_tcp_cmd *cmd) return ret; cmd->offset += ret; - nvmet_setup_response_pdu(cmd); + + if (queue->nvme_sq.sqhd_disabled) { + cmd->queue->snd_cmd = NULL; + nvmet_tcp_put_cmd(cmd); + } else { + nvmet_setup_response_pdu(cmd); + } return 1; } From 7c349dde26b75db3fa1863e36984ac2271cd797a Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Fri, 8 Mar 2019 10:43:06 -0700 Subject: [PATCH 030/164] nvme-pci: use a flag for polled queues A negative value for the cq_vector used to mean the queue is either disabled or a polled queue. However, we have a queue enabled flag, so the cq_vector had been serving double duty. Don't overload the meaning of cq_vector. Use a flag specific to the polled queues instead. Signed-off-by: Keith Busch Reviewed-by: Sagi Grimberg Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 33 +++++++++++++-------------------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index a90cf5d63aac..4c0461bd6cfc 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -189,7 +189,7 @@ struct nvme_queue { dma_addr_t cq_dma_addr; u32 __iomem *q_db; u16 q_depth; - s16 cq_vector; + u16 cq_vector; u16 sq_tail; u16 last_sq_tail; u16 cq_head; @@ -200,6 +200,7 @@ struct nvme_queue { #define NVMEQ_ENABLED 0 #define NVMEQ_SQ_CMB 1 #define NVMEQ_DELETE_ERROR 2 +#define NVMEQ_POLLED 3 u32 *dbbuf_sq_db; u32 *dbbuf_cq_db; u32 *dbbuf_sq_ei; @@ -1088,7 +1089,7 @@ static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag) * using the CQ lock. For normal interrupt driven threads we have * to disable the interrupt to avoid racing with it. 
*/ - if (nvmeq->cq_vector == -1) { + if (test_bit(NVMEQ_POLLED, &nvmeq->flags)) { spin_lock(&nvmeq->cq_poll_lock); found = nvme_process_cq(nvmeq, &start, &end, tag); spin_unlock(&nvmeq->cq_poll_lock); @@ -1148,7 +1149,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid, struct nvme_command c; int flags = NVME_QUEUE_PHYS_CONTIG; - if (vector != -1) + if (!test_bit(NVMEQ_POLLED, &nvmeq->flags)) flags |= NVME_CQ_IRQ_ENABLED; /* @@ -1161,10 +1162,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid, c.create_cq.cqid = cpu_to_le16(qid); c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1); c.create_cq.cq_flags = cpu_to_le16(flags); - if (vector != -1) - c.create_cq.irq_vector = cpu_to_le16(vector); - else - c.create_cq.irq_vector = 0; + c.create_cq.irq_vector = cpu_to_le16(vector); return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0); } @@ -1410,10 +1408,8 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq) nvmeq->dev->online_queues--; if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q) blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q); - if (nvmeq->cq_vector == -1) - return 0; - pci_free_irq(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector, nvmeq); - nvmeq->cq_vector = -1; + if (!test_and_clear_bit(NVMEQ_POLLED, &nvmeq->flags)) + pci_free_irq(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector, nvmeq); return 0; } @@ -1507,7 +1503,6 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth) nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride]; nvmeq->q_depth = depth; nvmeq->qid = qid; - nvmeq->cq_vector = -1; dev->ctrl.queue_count++; return 0; @@ -1552,7 +1547,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled) { struct nvme_dev *dev = nvmeq->dev; int result; - s16 vector; + u16 vector = 0; clear_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags); @@ -1563,7 +1558,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled) if (!polled) vector = dev->num_vecs == 1 ? 0 : qid; else - vector = -1; + set_bit(NVMEQ_POLLED, &nvmeq->flags); result = adapter_alloc_cq(dev, qid, nvmeq, vector); if (result) @@ -1578,7 +1573,8 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled) nvmeq->cq_vector = vector; nvme_init_queue(nvmeq, qid); - if (vector != -1) { + if (!polled) { + nvmeq->cq_vector = vector; result = queue_request_irq(nvmeq); if (result < 0) goto release_sq; @@ -1588,7 +1584,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled) return result; release_sq: - nvmeq->cq_vector = -1; dev->online_queues--; adapter_delete_sq(dev, qid); release_cq: @@ -1730,7 +1725,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev) nvme_init_queue(nvmeq, 0); result = queue_request_irq(nvmeq); if (result) { - nvmeq->cq_vector = -1; + dev->online_queues--; return result; } @@ -2171,10 +2166,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev) * number of interrupts. */ result = queue_request_irq(adminq); - if (result) { - adminq->cq_vector = -1; + if (result) return result; - } set_bit(NVMEQ_ENABLED, &adminq->flags); result = nvme_create_io_queues(dev); From 88a041f4c1f6a21284c70b491929ed35336a0ea9 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Fri, 8 Mar 2019 10:43:11 -0700 Subject: [PATCH 031/164] nvme-pci: remove q_dmadev from nvme_queue We don't need to save the dma device as it's not used in the hot path and hasn't in a long time. Shrink the struct nvme_queue removing this unnecessary member. 
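The freeing paths can reach the DMA device through the owning controller instead; a condensed example mirroring the hunk below:

  /* nvmeq->dev->dev replaces the cached nvmeq->q_dmadev pointer */
  dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq->q_depth),
                    (void *)nvmeq->cqes, nvmeq->cq_dma_addr);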
Signed-off-by: Keith Busch Reviewed-by: Sagi Grimberg Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 4c0461bd6cfc..8af2b10b4507 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -177,7 +177,6 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl) * commands and one for I/O commands). */ struct nvme_queue { - struct device *q_dmadev; struct nvme_dev *dev; spinlock_t sq_lock; struct nvme_command *sq_cmds; @@ -1369,16 +1368,16 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) static void nvme_free_queue(struct nvme_queue *nvmeq) { - dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth), + dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq->q_depth), (void *)nvmeq->cqes, nvmeq->cq_dma_addr); if (!nvmeq->sq_cmds) return; if (test_and_clear_bit(NVMEQ_SQ_CMB, &nvmeq->flags)) { - pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev), + pci_free_p2pmem(to_pci_dev(nvmeq->dev->dev), nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth)); } else { - dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth), + dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq->q_depth), nvmeq->sq_cmds, nvmeq->sq_dma_addr); } } @@ -1494,7 +1493,6 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth) if (nvme_alloc_sq_cmds(dev, nvmeq, qid, depth)) goto free_cqdma; - nvmeq->q_dmadev = dev->dev; nvmeq->dev = dev; spin_lock_init(&nvmeq->sq_lock); spin_lock_init(&nvmeq->cq_poll_lock); From 39f8e36401142d73e33a954ac4bdf844fb5de9ae Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Fri, 8 Mar 2019 10:43:13 -0700 Subject: [PATCH 032/164] nvme-pci: remove unused nvme_iod member Signed-off-by: Keith Busch Reviewed-by: Sagi Grimberg Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 8af2b10b4507..0cba927224b4 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -220,7 +220,6 @@ struct nvme_iod { int aborted; int npages; /* In the PRP list. 0 means small pool in use */ int nents; /* Used in scatterlist */ - int length; /* Of data, in bytes */ dma_addr_t first_dma; struct scatterlist meta_sg; /* metadata requires single contiguous buffer */ struct scatterlist *sg; @@ -603,7 +602,6 @@ static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev) iod->aborted = 0; iod->npages = -1; iod->nents = 0; - iod->length = size; return BLK_STS_OK; } From 3aef3cae4342c1d8137a1c0782cbb66f1be3943c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Mar 2019 09:14:01 -0700 Subject: [PATCH 033/164] block: add a req_bvec helper Return the currently active bvec segment, potentially spanning multiple pages. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- include/linux/blkdev.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 5c58a3b2bf00..84ce76f92d83 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -932,6 +932,17 @@ static inline unsigned int blk_rq_payload_bytes(struct request *rq) return blk_rq_bytes(rq); } +/* + * Return the first full biovec in the request. The caller needs to check that + * there are any bvecs before calling this helper. 
+ */ +static inline struct bio_vec req_bvec(struct request *rq) +{ + if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) + return rq->special_vec; + return mp_bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter); +} + static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q, int op) { From 2a876f5e25e8ec9fa5777d36e5695ee33dd63f6f Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Mar 2019 08:38:29 -0700 Subject: [PATCH 034/164] block: add a rq_integrity_vec helper This provides a nice little shortcut to get the integrity data for drivers like NVMe that only support a single integrity segment. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- include/linux/blkdev.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 84ce76f92d83..3a13fbe13e08 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1559,6 +1559,17 @@ static inline unsigned int bio_integrity_bytes(struct blk_integrity *bi, return bio_integrity_intervals(bi, sectors) * bi->tuple_size; } +/* + * Return the first bvec that contains integrity data. Only drivers that are + * limited to a single integrity segment should use this helper. + */ +static inline struct bio_vec *rq_integrity_vec(struct request *rq) +{ + if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1)) + return NULL; + return rq->bio->bi_integrity->bip_vec; +} + #else /* CONFIG_BLK_DEV_INTEGRITY */ struct bio; @@ -1633,6 +1644,11 @@ static inline unsigned int bio_integrity_bytes(struct blk_integrity *bi, return 0; } +static inline struct bio_vec *rq_integrity_vec(struct request *rq) +{ + return NULL; +} + #endif /* CONFIG_BLK_DEV_INTEGRITY */ struct block_device_operations { From 9d9de535f385a8b3ba0e88ca0abf386c5704bbfc Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Mar 2019 08:18:30 -0700 Subject: [PATCH 035/164] block: add a rq_dma_dir helper In a lot of places we want to know the DMA direction for a given struct request. Add a little helper to make it a little easier. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- include/linux/blkdev.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 3a13fbe13e08..74469a4dc0a1 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -641,6 +641,9 @@ static inline bool blk_account_rq(struct request *rq) #define rq_data_dir(rq) (op_is_write(req_op(rq)) ? WRITE : READ) +#define rq_dma_dir(rq) \ + (op_is_write(req_op(rq)) ?
DMA_TO_DEVICE : DMA_FROM_DEVICE) +#define dma_map_bvec(dev, bv, dir, attrs) \ + dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \ + (dir), (attrs)) + static inline bool queue_is_mq(struct request_queue *q) { return q->mq_ops; From 9b048119a153590b934ef49aae309b723587f527 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Mar 2019 08:04:01 -0700 Subject: [PATCH 037/164] nvme-pci: remove nvme_init_iod nvme_init_iod should really be split into two parts: initialize a few general iod fields, which can easily be done at the beginning of nvme_queue_rq, and allocating the scatterlist if needed, which logically belongs into nvme_map_data with the code making use of it. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 56 ++++++++++++++++------------------------- 1 file changed, 22 insertions(+), 34 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 0cba927224b4..2102a107e09b 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -208,10 +208,10 @@ struct nvme_queue { }; /* - * The nvme_iod describes the data in an I/O, including the list of PRP - * entries. You can't see it in this data structure because C doesn't let - * me express that. Use nvme_init_iod to ensure there's enough space - * allocated to store the PRP list. + * The nvme_iod describes the data in an I/O. + * + * The sg pointer contains the list of PRP/SGL chunk allocations in addition + * to the actual struct scatterlist. */ struct nvme_iod { struct nvme_request req; @@ -583,29 +583,6 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req) return true; } -static blk_status_t nvme_init_iod(struct request *rq, struct nvme_dev *dev) -{ - struct nvme_iod *iod = blk_mq_rq_to_pdu(rq); - int nseg = blk_rq_nr_phys_segments(rq); - unsigned int size = blk_rq_payload_bytes(rq); - - iod->use_sgl = nvme_pci_use_sgls(dev, rq); - - if (nseg > NVME_INT_PAGES || size > NVME_INT_BYTES(dev)) { - iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); - if (!iod->sg) - return BLK_STS_RESOURCE; - } else { - iod->sg = iod->inline_sg; - } - - iod->aborted = 0; - iod->npages = -1; - iod->nents = 0; - - return BLK_STS_OK; -} - static void nvme_free_iod(struct nvme_dev *dev, struct request *req) { struct nvme_iod *iod = blk_mq_rq_to_pdu(req); @@ -837,6 +814,17 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, blk_status_t ret = BLK_STS_IOERR; int nr_mapped; + if (blk_rq_payload_bytes(req) > NVME_INT_BYTES(dev) || + blk_rq_nr_phys_segments(req) > NVME_INT_PAGES) { + iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); + if (!iod->sg) + return BLK_STS_RESOURCE; + } else { + iod->sg = iod->inline_sg; + } + + iod->use_sgl = nvme_pci_use_sgls(dev, req); + sg_init_table(iod->sg, blk_rq_nr_phys_segments(req)); iod->nents = blk_rq_map_sg(q, req, iod->sg); if (!iod->nents) @@ -881,6 +869,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, out_unmap: dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); out: + nvme_free_iod(dev, req); return ret; } @@ -913,9 +902,14 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct nvme_queue *nvmeq = hctx->driver_data; struct nvme_dev *dev = nvmeq->dev; struct request *req = bd->rq; + struct nvme_iod *iod = blk_mq_rq_to_pdu(req); struct nvme_command cmnd; blk_status_t ret; + iod->aborted = 0; + iod->npages = -1; + iod->nents = 0; + /* * We should not 
need to do this, but we're still using this to * ensure we can drain requests on a dying queue. @@ -927,21 +921,15 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx, if (ret) return ret; - ret = nvme_init_iod(req, dev); - if (ret) - goto out_free_cmd; - if (blk_rq_nr_phys_segments(req)) { ret = nvme_map_data(dev, req, &cmnd); if (ret) - goto out_cleanup_iod; + goto out_free_cmd; } blk_mq_start_request(req); nvme_submit_cmd(nvmeq, &cmnd, bd->last); return BLK_STS_OK; -out_cleanup_iod: - nvme_free_iod(dev, req); out_free_cmd: nvme_cleanup_cmd(req); return ret; From 915f04c93db4e3a7388c8ad8ddfc28830e4cbce3 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Mar 2019 08:13:03 -0700 Subject: [PATCH 038/164] nvme-pci: move the call to nvme_cleanup_cmd out of nvme_unmap_data Cleaning up the command setup isn't related to unmapping data, and disentangling them will simplify error handling a bit down the road. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 2102a107e09b..2af6cfbd77ec 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -888,7 +888,6 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) dma_unmap_sg(dev->dev, &iod->meta_sg, 1, dma_dir); } - nvme_cleanup_cmd(req); nvme_free_iod(dev, req); } @@ -939,6 +938,7 @@ static void nvme_pci_complete_rq(struct request *req) { struct nvme_iod *iod = blk_mq_rq_to_pdu(req); + nvme_cleanup_cmd(req); nvme_unmap_data(iod->nvmeq->dev, req); nvme_complete_rq(req); } From 7fe07d14f71fabef642a478626248a9121e95b7b Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Mar 2019 08:15:19 -0700 Subject: [PATCH 039/164] nvme-pci: merge nvme_free_iod into nvme_unmap_data This means we now have a function that undoes everything nvme_map_data does and we can simplify the error handling a bit. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 44 ++++++++++++++++------------------------- 1 file changed, 17 insertions(+), 27 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 2af6cfbd77ec..de199aff8d05 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -583,14 +583,24 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req) return true; } -static void nvme_free_iod(struct nvme_dev *dev, struct request *req) +static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) { struct nvme_iod *iod = blk_mq_rq_to_pdu(req); + enum dma_data_direction dma_dir = rq_data_dir(req) ? 
+ DMA_TO_DEVICE : DMA_FROM_DEVICE; const int last_prp = dev->ctrl.page_size / sizeof(__le64) - 1; dma_addr_t dma_addr = iod->first_dma, next_dma_addr; - int i; + if (iod->nents) { + /* P2PDMA requests do not need to be unmapped */ + if (!is_pci_p2pdma_page(sg_page(iod->sg))) + dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); + + if (blk_integrity_rq(req)) + dma_unmap_sg(dev->dev, &iod->meta_sg, 1, dma_dir); + } + if (iod->npages == 0) dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0], dma_addr); @@ -847,50 +857,30 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, ret = nvme_pci_setup_prps(dev, req, &cmnd->rw); if (ret != BLK_STS_OK) - goto out_unmap; + goto out; ret = BLK_STS_IOERR; if (blk_integrity_rq(req)) { if (blk_rq_count_integrity_sg(q, req->bio) != 1) - goto out_unmap; + goto out; sg_init_table(&iod->meta_sg, 1); if (blk_rq_map_integrity_sg(q, req->bio, &iod->meta_sg) != 1) - goto out_unmap; + goto out; if (!dma_map_sg(dev->dev, &iod->meta_sg, 1, dma_dir)) - goto out_unmap; + goto out; cmnd->rw.metadata = cpu_to_le64(sg_dma_address(&iod->meta_sg)); } return BLK_STS_OK; -out_unmap: - dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); out: - nvme_free_iod(dev, req); + nvme_unmap_data(dev, req); return ret; } -static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) -{ - struct nvme_iod *iod = blk_mq_rq_to_pdu(req); - enum dma_data_direction dma_dir = rq_data_dir(req) ? - DMA_TO_DEVICE : DMA_FROM_DEVICE; - - if (iod->nents) { - /* P2PDMA requests do not need to be unmapped */ - if (!is_pci_p2pdma_page(sg_page(iod->sg))) - dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); - - if (blk_integrity_rq(req)) - dma_unmap_sg(dev->dev, &iod->meta_sg, 1, dma_dir); - } - - nvme_free_iod(dev, req); -} - /* * NOTE: ns is NULL when called on the admin queue. */ From b15c592de37ed9d71499a3b8a750d1b235fcba3d Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Mar 2019 08:52:21 -0700 Subject: [PATCH 040/164] nvme-pci: only call nvme_unmap_data for requests transferring data This mirrors how nvme_map_pci is called and will allow simplifying some checks in nvme_unmap_pci later on. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index de199aff8d05..030ee94452dd 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -929,7 +929,8 @@ static void nvme_pci_complete_rq(struct request *req) struct nvme_iod *iod = blk_mq_rq_to_pdu(req); nvme_cleanup_cmd(req); - nvme_unmap_data(iod->nvmeq->dev, req); + if (blk_rq_nr_phys_segments(req)) + nvme_unmap_data(iod->nvmeq->dev, req); nvme_complete_rq(req); } From 783b94bd9250478154904fa782d2cfc46336cdf6 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Mar 2019 08:19:18 -0700 Subject: [PATCH 041/164] nvme-pci: do not build a scatterlist to map metadata We always have exactly one segment, so we can simply call dma_map_bvec. 
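A condensed sketch of the resulting mapping path (error handling trimmed, mirroring the hunk below):

  /* metadata is always a single segment, so map the bvec directly */
  iod->meta_dma = dma_map_bvec(dev->dev, rq_integrity_vec(req),
                               dma_dir, 0);
  if (dma_mapping_error(dev->dev, iod->meta_dma))
      goto out;
  cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);

This replaces the sg_init_table()/blk_rq_map_integrity_sg()/dma_map_sg() sequence with a single page mapping under the hood.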
Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 23 ++++++++++------------- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 030ee94452dd..0679ac7fed19 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -221,7 +221,7 @@ struct nvme_iod { int npages; /* In the PRP list. 0 means small pool in use */ int nents; /* Used in scatterlist */ dma_addr_t first_dma; - struct scatterlist meta_sg; /* metadata requires single contiguous buffer */ + dma_addr_t meta_dma; struct scatterlist *sg; struct scatterlist inline_sg[0]; }; @@ -592,13 +592,16 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) dma_addr_t dma_addr = iod->first_dma, next_dma_addr; int i; + if (blk_integrity_rq(req)) { + dma_unmap_page(dev->dev, iod->meta_dma, + rq_integrity_vec(req)->bv_len, dma_dir); + } + if (iod->nents) { /* P2PDMA requests do not need to be unmapped */ if (!is_pci_p2pdma_page(sg_page(iod->sg))) dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); - if (blk_integrity_rq(req)) - dma_unmap_sg(dev->dev, &iod->meta_sg, 1, dma_dir); } if (iod->npages == 0) @@ -861,17 +864,11 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, ret = BLK_STS_IOERR; if (blk_integrity_rq(req)) { - if (blk_rq_count_integrity_sg(q, req->bio) != 1) + iod->meta_dma = dma_map_bvec(dev->dev, rq_integrity_vec(req), + dma_dir, 0); + if (dma_mapping_error(dev->dev, iod->meta_dma)) goto out; - - sg_init_table(&iod->meta_sg, 1); - if (blk_rq_map_integrity_sg(q, req->bio, &iod->meta_sg) != 1) - goto out; - - if (!dma_map_sg(dev->dev, &iod->meta_sg, 1, dma_dir)) - goto out; - - cmnd->rw.metadata = cpu_to_le64(sg_dma_address(&iod->meta_sg)); + cmnd->rw.metadata = cpu_to_le64(iod->meta_dma); } return BLK_STS_OK; From 4aedb705437f6f98b45f45c394e6803ca67abd33 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Sun, 3 Mar 2019 09:46:28 -0700 Subject: [PATCH 042/164] nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data This prepares for some bigger changes to the data mapping helpers. 
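The submission side then maps data and metadata as two separate steps, unwinding the data mapping if the metadata mapping fails; condensed from the hunk below:

  if (blk_rq_nr_phys_segments(req)) {
      ret = nvme_map_data(dev, req, &cmnd);
      if (ret)
          goto out_free_cmd;
  }

  if (blk_integrity_rq(req)) {
      ret = nvme_map_metadata(dev, req, &cmnd);
      if (ret)
          goto out_unmap_data;
  }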
Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 50 +++++++++++++++++++++++------------------ 1 file changed, 28 insertions(+), 22 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 0679ac7fed19..10e6b5d055e9 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -592,11 +592,6 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) dma_addr_t dma_addr = iod->first_dma, next_dma_addr; int i; - if (blk_integrity_rq(req)) { - dma_unmap_page(dev->dev, iod->meta_dma, - rq_integrity_vec(req)->bv_len, dma_dir); - } - if (iod->nents) { /* P2PDMA requests do not need to be unmapped */ if (!is_pci_p2pdma_page(sg_page(iod->sg))) @@ -858,26 +853,25 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw, nr_mapped); else ret = nvme_pci_setup_prps(dev, req, &cmnd->rw); - - if (ret != BLK_STS_OK) - goto out; - - ret = BLK_STS_IOERR; - if (blk_integrity_rq(req)) { - iod->meta_dma = dma_map_bvec(dev->dev, rq_integrity_vec(req), - dma_dir, 0); - if (dma_mapping_error(dev->dev, iod->meta_dma)) - goto out; - cmnd->rw.metadata = cpu_to_le64(iod->meta_dma); - } - - return BLK_STS_OK; - out: - nvme_unmap_data(dev, req); + if (ret != BLK_STS_OK) + nvme_unmap_data(dev, req); return ret; } +static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req, + struct nvme_command *cmnd) +{ + struct nvme_iod *iod = blk_mq_rq_to_pdu(req); + + iod->meta_dma = dma_map_bvec(dev->dev, rq_integrity_vec(req), + rq_dma_dir(req), 0); + if (dma_mapping_error(dev->dev, iod->meta_dma)) + return BLK_STS_IOERR; + cmnd->rw.metadata = cpu_to_le64(iod->meta_dma); + return 0; +} + /* * NOTE: ns is NULL when called on the admin queue. */ @@ -913,9 +907,17 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx, goto out_free_cmd; } + if (blk_integrity_rq(req)) { + ret = nvme_map_metadata(dev, req, &cmnd); + if (ret) + goto out_unmap_data; + } + blk_mq_start_request(req); nvme_submit_cmd(nvmeq, &cmnd, bd->last); return BLK_STS_OK; +out_unmap_data: + nvme_unmap_data(dev, req); out_free_cmd: nvme_cleanup_cmd(req); return ret; @@ -924,10 +926,14 @@ out_free_cmd: static void nvme_pci_complete_rq(struct request *req) { struct nvme_iod *iod = blk_mq_rq_to_pdu(req); + struct nvme_dev *dev = iod->nvmeq->dev; nvme_cleanup_cmd(req); + if (blk_integrity_rq(req)) + dma_unmap_page(dev->dev, iod->meta_dma, + rq_integrity_vec(req)->bv_len, rq_data_dir(req)); if (blk_rq_nr_phys_segments(req)) - nvme_unmap_data(iod->nvmeq->dev, req); + nvme_unmap_data(dev, req); nvme_complete_rq(req); } From d43f1ccfad053dbefba1d15443cdc36ca60958f0 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 5 Mar 2019 05:46:58 -0700 Subject: [PATCH 043/164] nvme-pci: remove the inline scatterlist optimization We'll have a better way to optimize for small I/O that doesn't require it soon, so remove the existing inline_sg case to make that optimization easier to implement. 
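For context, the fast path being removed pointed iod->sg at a scatterlist embedded in the iod for small requests instead of taking one from the mempool; roughly (condensed from the removed code, error handling trimmed):

  if (blk_rq_payload_bytes(req) > NVME_INT_BYTES(dev) ||
      blk_rq_nr_phys_segments(req) > NVME_INT_PAGES)
      iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
  else
      iod->sg = iod->inline_sg;

After this patch every request that needs a scatterlist allocates it from the mempool.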
Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 38 ++++++-------------------------------- 1 file changed, 6 insertions(+), 32 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 10e6b5d055e9..bd7e4209ab36 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -223,7 +223,6 @@ struct nvme_iod { dma_addr_t first_dma; dma_addr_t meta_dma; struct scatterlist *sg; - struct scatterlist inline_sg[0]; }; /* @@ -370,12 +369,6 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db, return true; } -/* - * Max size of iod being embedded in the request payload - */ -#define NVME_INT_PAGES 2 -#define NVME_INT_BYTES(dev) (NVME_INT_PAGES * (dev)->ctrl.page_size) - /* * Will slightly overestimate the number of pages needed. This is OK * as it only leads to a small amount of wasted memory for the lifetime of @@ -410,15 +403,6 @@ static unsigned int nvme_pci_iod_alloc_size(struct nvme_dev *dev, return alloc_size + sizeof(struct scatterlist) * nseg; } -static unsigned int nvme_pci_cmd_size(struct nvme_dev *dev, bool use_sgl) -{ - unsigned int alloc_size = nvme_pci_iod_alloc_size(dev, - NVME_INT_BYTES(dev), NVME_INT_PAGES, - use_sgl); - - return sizeof(struct nvme_iod) + alloc_size; -} - static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data, unsigned int hctx_idx) { @@ -621,8 +605,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) dma_addr = next_dma_addr; } - if (iod->sg != iod->inline_sg) - mempool_free(iod->sg, dev->iod_mempool); + mempool_free(iod->sg, dev->iod_mempool); } static void nvme_print_sgl(struct scatterlist *sgl, int nents) @@ -822,14 +805,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, blk_status_t ret = BLK_STS_IOERR; int nr_mapped; - if (blk_rq_payload_bytes(req) > NVME_INT_BYTES(dev) || - blk_rq_nr_phys_segments(req) > NVME_INT_PAGES) { - iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); - if (!iod->sg) - return BLK_STS_RESOURCE; - } else { - iod->sg = iod->inline_sg; - } + iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); + if (!iod->sg) + return BLK_STS_RESOURCE; iod->use_sgl = nvme_pci_use_sgls(dev, req); @@ -1612,7 +1590,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev) dev->admin_tagset.queue_depth = NVME_AQ_MQ_TAG_DEPTH; dev->admin_tagset.timeout = ADMIN_TIMEOUT; dev->admin_tagset.numa_node = dev_to_node(dev->dev); - dev->admin_tagset.cmd_size = nvme_pci_cmd_size(dev, false); + dev->admin_tagset.cmd_size = sizeof(struct nvme_iod); dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED; dev->admin_tagset.driver_data = dev; @@ -2257,11 +2235,7 @@ static int nvme_dev_add(struct nvme_dev *dev) dev->tagset.numa_node = dev_to_node(dev->dev); dev->tagset.queue_depth = min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1; - dev->tagset.cmd_size = nvme_pci_cmd_size(dev, false); - if ((dev->ctrl.sgls & ((1 << 0) | (1 << 1))) && sgl_threshold) { - dev->tagset.cmd_size = max(dev->tagset.cmd_size, - nvme_pci_cmd_size(dev, true)); - } + dev->tagset.cmd_size = sizeof(struct nvme_iod); dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE; dev->tagset.driver_data = dev; From dff824b2aadb7808f50ceb0927acaec5ad750ce7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 5 Mar 2019 05:49:34 -0700 Subject: [PATCH 044/164] nvme-pci: optimize mapping of small single segment requests If a request is single segment and fits into one or two PRP entries we do not have to create a 
scatterlist for it, but can just map the bio_vec directly. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 45 ++++++++++++++++++++++++++++++++++++----- 1 file changed, 40 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index bd7e4209ab36..59731264b052 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -221,6 +221,7 @@ struct nvme_iod { int npages; /* In the PRP list. 0 means small pool in use */ int nents; /* Used in scatterlist */ dma_addr_t first_dma; + unsigned int dma_len; /* length of single DMA segment mapping */ dma_addr_t meta_dma; struct scatterlist *sg; }; @@ -576,13 +577,18 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) dma_addr_t dma_addr = iod->first_dma, next_dma_addr; int i; - if (iod->nents) { - /* P2PDMA requests do not need to be unmapped */ - if (!is_pci_p2pdma_page(sg_page(iod->sg))) - dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); - + if (iod->dma_len) { + dma_unmap_page(dev->dev, dma_addr, iod->dma_len, dma_dir); + return; } + WARN_ON_ONCE(!iod->nents); + + /* P2PDMA requests do not need to be unmapped */ + if (!is_pci_p2pdma_page(sg_page(iod->sg))) + dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req)); + + if (iod->npages == 0) dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0], dma_addr); @@ -795,6 +801,24 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev, return BLK_STS_OK; } +static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev, + struct request *req, struct nvme_rw_command *cmnd, + struct bio_vec *bv) +{ + struct nvme_iod *iod = blk_mq_rq_to_pdu(req); + unsigned int first_prp_len = dev->ctrl.page_size - bv->bv_offset; + + iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0); + if (dma_mapping_error(dev->dev, iod->first_dma)) + return BLK_STS_RESOURCE; + iod->dma_len = bv->bv_len; + + cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma); + if (bv->bv_len > first_prp_len) + cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma + first_prp_len); + return 0; +} + static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, struct nvme_command *cmnd) { @@ -805,6 +829,17 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, blk_status_t ret = BLK_STS_IOERR; int nr_mapped; + if (blk_rq_nr_phys_segments(req) == 1) { + struct bio_vec bv = req_bvec(req); + + if (!is_pci_p2pdma_page(bv.bv_page)) { + if (bv.bv_offset + bv.bv_len <= dev->ctrl.page_size * 2) + return nvme_setup_prp_simple(dev, req, + &cmnd->rw, &bv); + } + } + + iod->dma_len = 0; iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); if (!iod->sg) return BLK_STS_RESOURCE; From 297910571f08f1d7e398793df6e606ebb375a3f1 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 5 Mar 2019 05:54:18 -0700 Subject: [PATCH 045/164] nvme-pci: optimize mapping single segment requests using SGLs If the controller supports SGLs we can take another short cut for single segment request, given that we can always map those without another indirection structure, and thus don't need to create a scatterlist structure. 
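A condensed sketch of the new single-descriptor path (mirroring the helper added below):

  iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
  if (dma_mapping_error(dev->dev, iod->first_dma))
      return BLK_STS_RESOURCE;
  iod->dma_len = bv->bv_len;

  cmnd->dptr.sgl.addr = cpu_to_le64(iod->first_dma);
  cmnd->dptr.sgl.length = cpu_to_le32(iod->dma_len);
  cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;

One data block descriptor in the command itself describes the whole transfer, so neither a scatterlist nor a separate SGL segment allocation is needed.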
Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 59731264b052..82aa5cb21828 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -819,6 +819,23 @@ static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev, return 0; } +static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev, + struct request *req, struct nvme_rw_command *cmnd, + struct bio_vec *bv) +{ + struct nvme_iod *iod = blk_mq_rq_to_pdu(req); + + iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0); + if (dma_mapping_error(dev->dev, iod->first_dma)) + return BLK_STS_RESOURCE; + iod->dma_len = bv->bv_len; + + cmnd->dptr.sgl.addr = cpu_to_le64(iod->first_dma); + cmnd->dptr.sgl.length = cpu_to_le32(iod->dma_len); + cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4; + return 0; +} + static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, struct nvme_command *cmnd) { @@ -836,6 +853,11 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, if (bv.bv_offset + bv.bv_len <= dev->ctrl.page_size * 2) return nvme_setup_prp_simple(dev, req, &cmnd->rw, &bv); + + if (iod->nvmeq->qid && + dev->ctrl.sgls & ((1 << 0) | (1 << 1))) + return nvme_setup_sgl_simple(dev, req, + &cmnd->rw, &bv); } } From 70479b71bc80ae6f63c8d6644cc76dff99f79686 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 5 Mar 2019 05:59:02 -0700 Subject: [PATCH 046/164] nvme-pci: tidy up nvme_map_data Remove two pointless local variables, remove ret assignment that is never used, move the use_sgl initialization closer to where it is used. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/pci.c | 17 +++++------------ 1 file changed, 5 insertions(+), 12 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 82aa5cb21828..c1eecde6b853 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -840,10 +840,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, struct nvme_command *cmnd) { struct nvme_iod *iod = blk_mq_rq_to_pdu(req); - struct request_queue *q = req->q; - enum dma_data_direction dma_dir = rq_data_dir(req) ? 
- DMA_TO_DEVICE : DMA_FROM_DEVICE; - blk_status_t ret = BLK_STS_IOERR; + blk_status_t ret = BLK_STS_RESOURCE; int nr_mapped; if (blk_rq_nr_phys_segments(req) == 1) { @@ -865,25 +862,21 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC); if (!iod->sg) return BLK_STS_RESOURCE; - - iod->use_sgl = nvme_pci_use_sgls(dev, req); - sg_init_table(iod->sg, blk_rq_nr_phys_segments(req)); - iod->nents = blk_rq_map_sg(q, req, iod->sg); + iod->nents = blk_rq_map_sg(req->q, req, iod->sg); if (!iod->nents) goto out; - ret = BLK_STS_RESOURCE; - if (is_pci_p2pdma_page(sg_page(iod->sg))) nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents, - dma_dir); + rq_dma_dir(req)); else nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, - dma_dir, DMA_ATTR_NO_WARN); + rq_dma_dir(req), DMA_ATTR_NO_WARN); if (!nr_mapped) goto out; + iod->use_sgl = nvme_pci_use_sgls(dev, req); if (iod->use_sgl) ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw, nr_mapped); else From e84c2091a45228b62867ec0565898ef5404706a2 Mon Sep 17 00:00:00 2001 From: Max Gurtovoy Date: Tue, 2 Apr 2019 14:52:47 +0300 Subject: [PATCH 047/164] nvmet: never fail double namespace enablement In case we create N namespaces while N < NVMET_MAX_NAMESPACES, we can perform "echo 1 > /enable" as much as we want. In case N == NVMET_MAX_NAMESPACES we fail. Make sure we have the same flow for any N. Signed-off-by: Max Gurtovoy Reviewed-by: Johannes Thumshirn Signed-off-by: Christoph Hellwig --- drivers/nvme/target/core.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index b3e765a95af8..4dc388a2ecb0 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -494,13 +494,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns) int ret; mutex_lock(&subsys->lock); - ret = -EMFILE; - if (subsys->nr_namespaces == NVMET_MAX_NAMESPACES) - goto out_unlock; ret = 0; if (ns->enabled) goto out_unlock; + ret = -EMFILE; + if (subsys->nr_namespaces == NVMET_MAX_NAMESPACES) + goto out_unlock; + ret = nvmet_bdev_ns_enable(ns); if (ret == -ENOTBLK) ret = nvmet_file_ns_enable(ns); From 013a63ef4edcd2366299225c3b081102171e8fa9 Mon Sep 17 00:00:00 2001 From: Max Gurtovoy Date: Tue, 2 Apr 2019 14:51:54 +0300 Subject: [PATCH 048/164] nvmet: add safety check for subsystem lock during nvmet_ns_changed we need to make sure that subsystem lock is taken during ctrl's list traversing. nvmet_ns_changed function is not static and can be used from various callers simultaneously. 
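A minimal sketch of the intended pattern (trimmed from the hunk below):

  void nvmet_ns_changed(struct nvmet_subsys *subsys, u32 nsid)
  {
      struct nvmet_ctrl *ctrl;

      lockdep_assert_held(&subsys->lock);

      list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
          nvmet_add_to_changed_ns_log(ctrl, cpu_to_le32(nsid));
      ...
  }

With lockdep enabled, any caller that reaches the list traversal without holding subsys->lock now triggers a warning instead of silently racing.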
Signed-off-by: Max Gurtovoy Reviewed-by: Johannes Thumshirn Signed-off-by: Christoph Hellwig --- drivers/nvme/target/core.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 4dc388a2ecb0..4d8dd29479c0 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -214,6 +214,8 @@ void nvmet_ns_changed(struct nvmet_subsys *subsys, u32 nsid) { struct nvmet_ctrl *ctrl; + lockdep_assert_held(&subsys->lock); + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { nvmet_add_to_changed_ns_log(ctrl, cpu_to_le32(nsid)); if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_NS_ATTR)) From d0de579c043c3a2ab60ce75eb6cf4d414becc676 Mon Sep 17 00:00:00 2001 From: Kenneth Heitke Date: Thu, 4 Apr 2019 12:57:45 -0600 Subject: [PATCH 049/164] nvme: log the error status on Identify Namespace failure Identify Namespace failures are logged as a warning but there is no indication of the cause of the failure. Update the log message to include the error status. Signed-off-by: Kenneth Heitke Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index b5939112b9b6..ddb943395118 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1105,7 +1105,7 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl, error = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id)); if (error) { - dev_warn(ctrl->device, "Identify namespace failed\n"); + dev_warn(ctrl->device, "Identify namespace failed (%d)\n", error); kfree(id); return NULL; } From 72deb455b5ec619ff043c30bc90025aa3de3cdda Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 5 Apr 2019 18:08:59 +0200 Subject: [PATCH 050/164] block: remove CONFIG_LBDAF Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many subsystems already use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has caused various bugs or compiler warnings when actually used.
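For reference, the option essentially controlled the width of the core block layer types on 32-bit builds; roughly what include/linux/types.h carried before this change (paraphrased, not the exact removed hunk):

  #ifdef CONFIG_LBDAF
  typedef u64 sector_t;
  typedef u64 blkcnt_t;
  #else
  typedef unsigned long sector_t;
  typedef unsigned long blkcnt_t;
  #endif

After the patch both types are unconditionally 64 bits wide, and the CONFIG_LBDAF checks scattered through drivers and filesystems go away.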
Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- Documentation/process/submit-checklist.rst | 27 +++++++--------- .../translations/ja_JP/SubmitChecklist | 22 ++++++------- arch/arc/configs/haps_hs_defconfig | 1 - arch/arc/configs/haps_hs_smp_defconfig | 1 - arch/arc/configs/nsim_700_defconfig | 1 - arch/arc/configs/nsim_hs_defconfig | 1 - arch/arc/configs/nsim_hs_smp_defconfig | 1 - arch/arc/configs/nsimosci_defconfig | 1 - arch/arc/configs/nsimosci_hs_defconfig | 1 - arch/arc/configs/nsimosci_hs_smp_defconfig | 1 - arch/arm/configs/aspeed_g4_defconfig | 1 - arch/arm/configs/aspeed_g5_defconfig | 1 - arch/arm/configs/at91_dt_defconfig | 1 - arch/arm/configs/clps711x_defconfig | 1 - arch/arm/configs/efm32_defconfig | 1 - arch/arm/configs/ezx_defconfig | 1 - arch/arm/configs/h3600_defconfig | 1 - arch/arm/configs/imote2_defconfig | 1 - arch/arm/configs/moxart_defconfig | 1 - arch/arm/configs/multi_v4t_defconfig | 1 - arch/arm/configs/omap1_defconfig | 1 - arch/arm/configs/stm32_defconfig | 1 - arch/arm/configs/u300_defconfig | 1 - arch/arm/configs/vexpress_defconfig | 1 - arch/m68k/configs/amcore_defconfig | 1 - arch/m68k/configs/m5475evb_defconfig | 1 - arch/m68k/configs/stmark2_defconfig | 1 - arch/mips/configs/ar7_defconfig | 1 - arch/mips/configs/decstation_defconfig | 1 - arch/mips/configs/decstation_r4k_defconfig | 1 - arch/mips/configs/loongson1b_defconfig | 1 - arch/mips/configs/loongson1c_defconfig | 1 - arch/mips/configs/rb532_defconfig | 1 - arch/mips/configs/rbtx49xx_defconfig | 1 - arch/parisc/configs/generic-32bit_defconfig | 1 - arch/sh/configs/apsh4ad0a_defconfig | 1 - arch/sh/configs/ecovec24-romimage_defconfig | 1 - arch/sh/configs/rsk7264_defconfig | 1 - arch/sh/configs/rsk7269_defconfig | 1 - arch/sh/configs/sh7785lcr_32bit_defconfig | 1 - block/Kconfig | 24 -------------- drivers/block/drbd/drbd_int.h | 5 --- drivers/block/ps3disk.c | 4 +-- drivers/md/dm-exception-store.h | 28 ++-------------- drivers/md/dm-integrity.c | 8 ++--- drivers/md/md.c | 6 ++-- drivers/nvdimm/pfn_devs.c | 4 +-- drivers/scsi/sd.c | 32 ------------------- fs/ext4/resize.c | 2 -- fs/ext4/super.c | 32 ++++--------------- fs/gfs2/Kconfig | 1 - fs/nfs/Kconfig | 1 - fs/ocfs2/super.c | 10 ------ fs/stack.c | 15 ++++----- fs/xfs/Kconfig | 1 - fs/xfs/xfs_super.c | 10 +----- include/linux/genhd.h | 8 ++--- include/linux/kernel.h | 14 ++------ include/linux/types.h | 5 --- lib/Kconfig.debug | 1 - .../formal/srcu-cbmc/include/linux/types.h | 4 --- 61 files changed, 52 insertions(+), 250 deletions(-) diff --git a/Documentation/process/submit-checklist.rst b/Documentation/process/submit-checklist.rst index 367353c54949..c88867b173d9 100644 --- a/Documentation/process/submit-checklist.rst +++ b/Documentation/process/submit-checklist.rst @@ -72,47 +72,44 @@ and elsewhere regarding submitting Linux kernel patches. 13) Has been build- and runtime tested with and without ``CONFIG_SMP`` and ``CONFIG_PREEMPT.`` -14) If the patch affects IO/Disk, etc: has been tested with and without - ``CONFIG_LBDAF.`` +16) All codepaths have been exercised with all lockdep features enabled. -15) All codepaths have been exercised with all lockdep features enabled. +17) All new ``/proc`` entries are documented under ``Documentation/`` -16) All new ``/proc`` entries are documented under ``Documentation/`` - -17) All new kernel boot parameters are documented in +18) All new kernel boot parameters are documented in ``Documentation/admin-guide/kernel-parameters.rst``. 
-18) All new module parameters are documented with ``MODULE_PARM_DESC()`` +19) All new module parameters are documented with ``MODULE_PARM_DESC()`` -19) All new userspace interfaces are documented in ``Documentation/ABI/``. +20) All new userspace interfaces are documented in ``Documentation/ABI/``. See ``Documentation/ABI/README`` for more information. Patches that change userspace interfaces should be CCed to linux-api@vger.kernel.org. -20) Check that it all passes ``make headers_check``. +21) Check that it all passes ``make headers_check``. -21) Has been checked with injection of at least slab and page-allocation +22) Has been checked with injection of at least slab and page-allocation failures. See ``Documentation/fault-injection/``. If the new code is substantial, addition of subsystem-specific fault injection might be appropriate. -22) Newly-added code has been compiled with ``gcc -W`` (use +23) Newly-added code has been compiled with ``gcc -W`` (use ``make EXTRA_CFLAGS=-W``). This will generate lots of noise, but is good for finding bugs like "warning: comparison between signed and unsigned". -23) Tested after it has been merged into the -mm patchset to make sure +24) Tested after it has been merged into the -mm patchset to make sure that it still works with all of the other queued patches and various changes in the VM, VFS, and other subsystems. -24) All memory barriers {e.g., ``barrier()``, ``rmb()``, ``wmb()``} need a +25) All memory barriers {e.g., ``barrier()``, ``rmb()``, ``wmb()``} need a comment in the source code that explains the logic of what they are doing and why. -25) If any ioctl's are added by the patch, then also update +26) If any ioctl's are added by the patch, then also update ``Documentation/ioctl/ioctl-number.txt``. -26) If your modified source code depends on or uses any of the kernel +27) If your modified source code depends on or uses any of the kernel APIs or features that are related to the following ``Kconfig`` symbols, then test multiple builds with the related ``Kconfig`` symbols disabled and/or ``=m`` (if that option is available) [not all of these at the diff --git a/Documentation/translations/ja_JP/SubmitChecklist b/Documentation/translations/ja_JP/SubmitChecklist index 60c7c35ac517..b42220d3d46c 100644 --- a/Documentation/translations/ja_JP/SubmitChecklist +++ b/Documentation/translations/ja_JP/SubmitChecklist @@ -74,38 +74,34 @@ Linux カーネルパッチ投稿者向けチェックリスト 13: CONFIG_SMP, CONFIG_PREEMPT を有効にした場合と無効にした場合の両方で ビルドした上、動作確認を行ってください。 -14: もしパッチがディスクのI/O性能などに影響を与えるようであれば、 - 'CONFIG_LBDAF'オプションを有効にした場合と無効にした場合の両方で - テストを実施してみてください。 +14: lockdepの機能を全て有効にした上で、全てのコードパスを評価してください。 -15: lockdepの機能を全て有効にした上で、全てのコードパスを評価してください。 - -16: /proc に新しいエントリを追加した場合には、Documentation/ 配下に +15: /proc に新しいエントリを追加した場合には、Documentation/ 配下に 必ずドキュメントを追加してください。 -17: 新しいブートパラメータを追加した場合には、 +16: 新しいブートパラメータを追加した場合には、 必ずDocumentation/admin-guide/kernel-parameters.rst に説明を追加してください。 -18: 新しくmoduleにパラメータを追加した場合には、MODULE_PARM_DESC()を +17: 新しくmoduleにパラメータを追加した場合には、MODULE_PARM_DESC()を 利用して必ずその説明を記述してください。 -19: 新しいuserspaceインタフェースを作成した場合には、Documentation/ABI/ に +18: 新しいuserspaceインタフェースを作成した場合には、Documentation/ABI/ に Documentation/ABI/README を参考にして必ずドキュメントを追加してください。 -20: 'make headers_check'を実行して全く問題がないことを確認してください。 +19: 'make headers_check'を実行して全く問題がないことを確認してください。 -21: 少なくともslabアロケーションとpageアロケーションに失敗した場合の +20: 少なくともslabアロケーションとpageアロケーションに失敗した場合の 挙動について、fault-injectionを利用して確認してください。 Documentation/fault-injection/ を参照してください。 追加したコードがかなりの量であったならば、サブシステム特有の fault-injectionを追加したほうが良いかもしれません。 -22: 
新たに追加したコードは、`gcc -W'でコンパイルしてください。 +21: 新たに追加したコードは、`gcc -W'でコンパイルしてください。 このオプションは大量の不要なメッセージを出力しますが、 "warning: comparison between signed and unsigned" のようなメッセージは、 バグを見つけるのに役に立ちます。 -23: 投稿したパッチが -mm パッチセットにマージされた後、全ての既存のパッチや +22: 投稿したパッチが -mm パッチセットにマージされた後、全ての既存のパッチや VM, VFS およびその他のサブシステムに関する様々な変更と、現時点でも共存 できることを確認するテストを行ってください。 diff --git a/arch/arc/configs/haps_hs_defconfig b/arch/arc/configs/haps_hs_defconfig index f56cc2070c11..b117e6c16d41 100644 --- a/arch/arc/configs/haps_hs_defconfig +++ b/arch/arc/configs/haps_hs_defconfig @@ -15,7 +15,6 @@ CONFIG_PERF_EVENTS=y # CONFIG_COMPAT_BRK is not set CONFIG_SLAB=y CONFIG_MODULES=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arc/configs/haps_hs_smp_defconfig b/arch/arc/configs/haps_hs_smp_defconfig index b6f2482c7e74..33a787c375e2 100644 --- a/arch/arc/configs/haps_hs_smp_defconfig +++ b/arch/arc/configs/haps_hs_smp_defconfig @@ -17,7 +17,6 @@ CONFIG_PERF_EVENTS=y CONFIG_SLAB=y CONFIG_KPROBES=y CONFIG_MODULES=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arc/configs/nsim_700_defconfig b/arch/arc/configs/nsim_700_defconfig index 318e4cd29629..de398c7b10b3 100644 --- a/arch/arc/configs/nsim_700_defconfig +++ b/arch/arc/configs/nsim_700_defconfig @@ -18,7 +18,6 @@ CONFIG_PERF_EVENTS=y CONFIG_ISA_ARCOMPACT=y CONFIG_KPROBES=y CONFIG_MODULES=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arc/configs/nsim_hs_defconfig b/arch/arc/configs/nsim_hs_defconfig index c15807b0e0c1..2dbd34a9ff07 100644 --- a/arch/arc/configs/nsim_hs_defconfig +++ b/arch/arc/configs/nsim_hs_defconfig @@ -20,7 +20,6 @@ CONFIG_MODULES=y CONFIG_MODULE_FORCE_LOAD=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arc/configs/nsim_hs_smp_defconfig b/arch/arc/configs/nsim_hs_smp_defconfig index 65e983fd942b..c7135f1e2583 100644 --- a/arch/arc/configs/nsim_hs_smp_defconfig +++ b/arch/arc/configs/nsim_hs_smp_defconfig @@ -18,7 +18,6 @@ CONFIG_MODULES=y CONFIG_MODULE_FORCE_LOAD=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arc/configs/nsimosci_defconfig b/arch/arc/configs/nsimosci_defconfig index 08c5b99ac341..385a71d3c478 100644 --- a/arch/arc/configs/nsimosci_defconfig +++ b/arch/arc/configs/nsimosci_defconfig @@ -18,7 +18,6 @@ CONFIG_PERF_EVENTS=y CONFIG_ISA_ARCOMPACT=y CONFIG_KPROBES=y CONFIG_MODULES=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arc/configs/nsimosci_hs_defconfig b/arch/arc/configs/nsimosci_hs_defconfig index 5b5e26d67955..248a2c3bdc12 100644 --- a/arch/arc/configs/nsimosci_hs_defconfig +++ b/arch/arc/configs/nsimosci_hs_defconfig @@ -17,7 +17,6 @@ CONFIG_PERF_EVENTS=y # CONFIG_COMPAT_BRK is not set CONFIG_KPROBES=y CONFIG_MODULES=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arc/configs/nsimosci_hs_smp_defconfig b/arch/arc/configs/nsimosci_hs_smp_defconfig index 
26af9b2f7fcb..1a4bc7b660fb 100644 --- a/arch/arc/configs/nsimosci_hs_smp_defconfig +++ b/arch/arc/configs/nsimosci_hs_smp_defconfig @@ -12,7 +12,6 @@ CONFIG_PERF_EVENTS=y # CONFIG_COMPAT_BRK is not set CONFIG_KPROBES=y CONFIG_MODULES=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arm/configs/aspeed_g4_defconfig b/arch/arm/configs/aspeed_g4_defconfig index 1446262921b4..bdbade6af9c7 100644 --- a/arch/arm/configs/aspeed_g4_defconfig +++ b/arch/arm/configs/aspeed_g4_defconfig @@ -23,7 +23,6 @@ CONFIG_SLAB_FREELIST_RANDOM=y CONFIG_JUMP_LABEL=y CONFIG_STRICT_KERNEL_RWX=y CONFIG_GCC_PLUGINS=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_BLK_DEBUG_FS is not set # CONFIG_IOSCHED_DEADLINE is not set diff --git a/arch/arm/configs/aspeed_g5_defconfig b/arch/arm/configs/aspeed_g5_defconfig index 02fa3a41add5..4bde84eae4eb 100644 --- a/arch/arm/configs/aspeed_g5_defconfig +++ b/arch/arm/configs/aspeed_g5_defconfig @@ -23,7 +23,6 @@ CONFIG_SLAB_FREELIST_RANDOM=y CONFIG_JUMP_LABEL=y CONFIG_STRICT_KERNEL_RWX=y CONFIG_GCC_PLUGINS=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_BLK_DEBUG_FS is not set # CONFIG_IOSCHED_DEADLINE is not set diff --git a/arch/arm/configs/at91_dt_defconfig b/arch/arm/configs/at91_dt_defconfig index e4b1be66b3f5..b7752929975c 100644 --- a/arch/arm/configs/at91_dt_defconfig +++ b/arch/arm/configs/at91_dt_defconfig @@ -9,7 +9,6 @@ CONFIG_EMBEDDED=y CONFIG_SLAB=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arm/configs/clps711x_defconfig b/arch/arm/configs/clps711x_defconfig index fc105c9178cc..09ae750164e0 100644 --- a/arch/arm/configs/clps711x_defconfig +++ b/arch/arm/configs/clps711x_defconfig @@ -6,7 +6,6 @@ CONFIG_RD_LZMA=y CONFIG_EMBEDDED=y CONFIG_SLOB=y CONFIG_JUMP_LABEL=y -# CONFIG_LBDAF is not set CONFIG_PARTITION_ADVANCED=y # CONFIG_IOSCHED_CFQ is not set CONFIG_ARCH_CLPS711X=y diff --git a/arch/arm/configs/efm32_defconfig b/arch/arm/configs/efm32_defconfig index ee42158f41ec..10ea92513a69 100644 --- a/arch/arm/configs/efm32_defconfig +++ b/arch/arm/configs/efm32_defconfig @@ -11,7 +11,6 @@ CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_EMBEDDED=y # CONFIG_VM_EVENT_COUNTERS is not set # CONFIG_SLUB_DEBUG is not set -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arm/configs/ezx_defconfig b/arch/arm/configs/ezx_defconfig index 484e51fbd4a6..e3afca5bd9d6 100644 --- a/arch/arm/configs/ezx_defconfig +++ b/arch/arm/configs/ezx_defconfig @@ -13,7 +13,6 @@ CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_MODVERSIONS=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_CFQ is not set CONFIG_ARCH_PXA=y diff --git a/arch/arm/configs/h3600_defconfig b/arch/arm/configs/h3600_defconfig index ebeca11faa48..175881b7da7c 100644 --- a/arch/arm/configs/h3600_defconfig +++ b/arch/arm/configs/h3600_defconfig @@ -4,7 +4,6 @@ CONFIG_HIGH_RES_TIMERS=y CONFIG_LOG_BUF_SHIFT=14 CONFIG_BLK_DEV_INITRD=y CONFIG_MODULES=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arm/configs/imote2_defconfig b/arch/arm/configs/imote2_defconfig index f204017c26b9..9b779e13e05d 100644 --- 
a/arch/arm/configs/imote2_defconfig +++ b/arch/arm/configs/imote2_defconfig @@ -12,7 +12,6 @@ CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_MODVERSIONS=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_CFQ is not set CONFIG_ARCH_PXA=y diff --git a/arch/arm/configs/moxart_defconfig b/arch/arm/configs/moxart_defconfig index 078228a19339..6a11669fa536 100644 --- a/arch/arm/configs/moxart_defconfig +++ b/arch/arm/configs/moxart_defconfig @@ -15,7 +15,6 @@ CONFIG_EMBEDDED=y # CONFIG_VM_EVENT_COUNTERS is not set # CONFIG_SLUB_DEBUG is not set # CONFIG_COMPAT_BRK is not set -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set CONFIG_ARCH_MULTI_V4=y diff --git a/arch/arm/configs/multi_v4t_defconfig b/arch/arm/configs/multi_v4t_defconfig index 9a6390c172d6..eeea0c41138b 100644 --- a/arch/arm/configs/multi_v4t_defconfig +++ b/arch/arm/configs/multi_v4t_defconfig @@ -5,7 +5,6 @@ CONFIG_BLK_DEV_INITRD=y CONFIG_EMBEDDED=y CONFIG_SLOB=y CONFIG_JUMP_LABEL=y -# CONFIG_LBDAF is not set CONFIG_PARTITION_ADVANCED=y # CONFIG_IOSCHED_CFQ is not set CONFIG_ARCH_MULTI_V4T=y diff --git a/arch/arm/configs/omap1_defconfig b/arch/arm/configs/omap1_defconfig index cfc00b0961ec..8448a7f407a4 100644 --- a/arch/arm/configs/omap1_defconfig +++ b/arch/arm/configs/omap1_defconfig @@ -17,7 +17,6 @@ CONFIG_OPROFILE=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arm/configs/stm32_defconfig b/arch/arm/configs/stm32_defconfig index 0258ba891376..152321d2893e 100644 --- a/arch/arm/configs/stm32_defconfig +++ b/arch/arm/configs/stm32_defconfig @@ -13,7 +13,6 @@ CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_EMBEDDED=y # CONFIG_VM_EVENT_COUNTERS is not set # CONFIG_SLUB_DEBUG is not set -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arm/configs/u300_defconfig b/arch/arm/configs/u300_defconfig index 36d77406e31b..831ba6a9ee8b 100644 --- a/arch/arm/configs/u300_defconfig +++ b/arch/arm/configs/u300_defconfig @@ -9,7 +9,6 @@ CONFIG_EXPERT=y # CONFIG_VM_EVENT_COUNTERS is not set CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_PARTITION_ADVANCED=y # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/arm/configs/vexpress_defconfig b/arch/arm/configs/vexpress_defconfig index 392ed3b3613c..484d77a7f589 100644 --- a/arch/arm/configs/vexpress_defconfig +++ b/arch/arm/configs/vexpress_defconfig @@ -14,7 +14,6 @@ CONFIG_PROFILING=y CONFIG_OPROFILE=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/m68k/configs/amcore_defconfig b/arch/m68k/configs/amcore_defconfig index 0857cdbfde0c..d5e683dd885d 100644 --- a/arch/m68k/configs/amcore_defconfig +++ b/arch/m68k/configs/amcore_defconfig @@ -12,7 +12,6 @@ CONFIG_EMBEDDED=y # CONFIG_VM_EVENT_COUNTERS is not set # CONFIG_SLUB_DEBUG is not set # CONFIG_COMPAT_BRK is not set -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_CFQ is not set # CONFIG_MMU is not set diff --git a/arch/m68k/configs/m5475evb_defconfig b/arch/m68k/configs/m5475evb_defconfig index 4f4ccd13c11b..434bd3750966 100644 --- a/arch/m68k/configs/m5475evb_defconfig +++ 
b/arch/m68k/configs/m5475evb_defconfig @@ -11,7 +11,6 @@ CONFIG_SYSCTL_SYSCALL=y # CONFIG_AIO is not set CONFIG_EMBEDDED=y CONFIG_MODULES=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/m68k/configs/stmark2_defconfig b/arch/m68k/configs/stmark2_defconfig index 69f23c7b0497..27fa9465d19d 100644 --- a/arch/m68k/configs/stmark2_defconfig +++ b/arch/m68k/configs/stmark2_defconfig @@ -17,7 +17,6 @@ CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_EMBEDDED=y # CONFIG_VM_EVENT_COUNTERS is not set # CONFIG_COMPAT_BRK is not set -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_BLK_CMDLINE_PARSER=y # CONFIG_MMU is not set diff --git a/arch/mips/configs/ar7_defconfig b/arch/mips/configs/ar7_defconfig index 9fbfb6e5c7d2..c83fdf649327 100644 --- a/arch/mips/configs/ar7_defconfig +++ b/arch/mips/configs/ar7_defconfig @@ -18,7 +18,6 @@ CONFIG_KEXEC=y # CONFIG_SECCOMP is not set CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_PARTITION_ADVANCED=y CONFIG_BSD_DISKLABEL=y diff --git a/arch/mips/configs/decstation_defconfig b/arch/mips/configs/decstation_defconfig index 0c86ed86266a..30a6eafdb1d0 100644 --- a/arch/mips/configs/decstation_defconfig +++ b/arch/mips/configs/decstation_defconfig @@ -17,7 +17,6 @@ CONFIG_TC=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_SRCVERSION_ALL=y -# CONFIG_LBDAF is not set CONFIG_PARTITION_ADVANCED=y CONFIG_OSF_PARTITION=y # CONFIG_EFI_PARTITION is not set diff --git a/arch/mips/configs/decstation_r4k_defconfig b/arch/mips/configs/decstation_r4k_defconfig index 0e54ab2680ce..e2b58dbf4aa9 100644 --- a/arch/mips/configs/decstation_r4k_defconfig +++ b/arch/mips/configs/decstation_r4k_defconfig @@ -16,7 +16,6 @@ CONFIG_TC=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_SRCVERSION_ALL=y -# CONFIG_LBDAF is not set CONFIG_PARTITION_ADVANCED=y CONFIG_OSF_PARTITION=y # CONFIG_EFI_PARTITION is not set diff --git a/arch/mips/configs/loongson1b_defconfig b/arch/mips/configs/loongson1b_defconfig index b064d68a5424..aa7e98c5f5fc 100644 --- a/arch/mips/configs/loongson1b_defconfig +++ b/arch/mips/configs/loongson1b_defconfig @@ -19,7 +19,6 @@ CONFIG_MACH_LOONGSON32=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODVERSIONS=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set CONFIG_NET=y diff --git a/arch/mips/configs/loongson1c_defconfig b/arch/mips/configs/loongson1c_defconfig index 5d76559b56cd..520e7ef35383 100644 --- a/arch/mips/configs/loongson1c_defconfig +++ b/arch/mips/configs/loongson1c_defconfig @@ -20,7 +20,6 @@ CONFIG_LOONGSON1_LS1C=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODVERSIONS=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set CONFIG_NET=y diff --git a/arch/mips/configs/rb532_defconfig b/arch/mips/configs/rb532_defconfig index 7befe05fd813..ed1038f62a2c 100644 --- a/arch/mips/configs/rb532_defconfig +++ b/arch/mips/configs/rb532_defconfig @@ -19,7 +19,6 @@ CONFIG_PCI=y # CONFIG_PCI_QUIRKS is not set CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_PARTITION_ADVANCED=y CONFIG_MAC_PARTITION=y diff --git a/arch/mips/configs/rbtx49xx_defconfig b/arch/mips/configs/rbtx49xx_defconfig index 50a2c9ad583f..b0f0c5f9ad9d 100644 --- a/arch/mips/configs/rbtx49xx_defconfig +++ b/arch/mips/configs/rbtx49xx_defconfig @@ -17,7 
+17,6 @@ CONFIG_TOSHIBA_RBTX4938_MPLEX_KEEP=y CONFIG_PCI=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_NET=y CONFIG_PACKET=y diff --git a/arch/parisc/configs/generic-32bit_defconfig b/arch/parisc/configs/generic-32bit_defconfig index 37ae4b57c001..a8f9bbef0975 100644 --- a/arch/parisc/configs/generic-32bit_defconfig +++ b/arch/parisc/configs/generic-32bit_defconfig @@ -14,7 +14,6 @@ CONFIG_SLAB=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_PA7100LC=y CONFIG_SMP=y diff --git a/arch/sh/configs/apsh4ad0a_defconfig b/arch/sh/configs/apsh4ad0a_defconfig index 825c641726c4..d0d9ebc7165b 100644 --- a/arch/sh/configs/apsh4ad0a_defconfig +++ b/arch/sh/configs/apsh4ad0a_defconfig @@ -19,7 +19,6 @@ CONFIG_SLAB=y CONFIG_PROFILING=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_CFQ_GROUP_IOSCHED=y CONFIG_CPU_SUBTYPE_SH7786=y diff --git a/arch/sh/configs/ecovec24-romimage_defconfig b/arch/sh/configs/ecovec24-romimage_defconfig index 0c5dfccbfe37..bdb61d1d0127 100644 --- a/arch/sh/configs/ecovec24-romimage_defconfig +++ b/arch/sh/configs/ecovec24-romimage_defconfig @@ -7,7 +7,6 @@ CONFIG_LOG_BUF_SHIFT=14 CONFIG_BLK_DEV_INITRD=y # CONFIG_KALLSYMS is not set CONFIG_SLAB=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_CPU_SUBTYPE_SH7724=y CONFIG_MEMORY_SIZE=0x10000000 diff --git a/arch/sh/configs/rsk7264_defconfig b/arch/sh/configs/rsk7264_defconfig index 2b9b731fc86b..ad003ee469ea 100644 --- a/arch/sh/configs/rsk7264_defconfig +++ b/arch/sh/configs/rsk7264_defconfig @@ -16,7 +16,6 @@ CONFIG_PERF_COUNTERS=y CONFIG_SLAB=y CONFIG_MMAP_ALLOW_UNINITIALIZED=y CONFIG_PROFILING=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_PARTITION_ADVANCED=y # CONFIG_IOSCHED_DEADLINE is not set diff --git a/arch/sh/configs/rsk7269_defconfig b/arch/sh/configs/rsk7269_defconfig index d041f7bcb84c..27fc01d58cf8 100644 --- a/arch/sh/configs/rsk7269_defconfig +++ b/arch/sh/configs/rsk7269_defconfig @@ -3,7 +3,6 @@ CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_EMBEDDED=y # CONFIG_VM_EVENT_COUNTERS is not set CONFIG_SLAB=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set # CONFIG_IOSCHED_DEADLINE is not set # CONFIG_IOSCHED_CFQ is not set diff --git a/arch/sh/configs/sh7785lcr_32bit_defconfig b/arch/sh/configs/sh7785lcr_32bit_defconfig index 2ddf5ca7094e..a89ccc15af23 100644 --- a/arch/sh/configs/sh7785lcr_32bit_defconfig +++ b/arch/sh/configs/sh7785lcr_32bit_defconfig @@ -11,7 +11,6 @@ CONFIG_PROFILING=y CONFIG_GCOV_KERNEL=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y -# CONFIG_LBDAF is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_CPU_SUBTYPE_SH7785=y CONFIG_MEMORY_START=0x40000000 diff --git a/block/Kconfig b/block/Kconfig index 028bc085dac8..1b220101a9cb 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -26,30 +26,6 @@ menuconfig BLOCK if BLOCK -config LBDAF - bool "Support for large (2TB+) block devices and files" - depends on !64BIT - default y - help - Enable block devices or files of size 2TB and larger. - - This option is required to support the full capacity of large - (2TB+) block devices, including RAID, disk, Network Block Device, - Logical Volume Manager (LVM) and loopback. - - This option also enables support for single files larger than - 2TB. 
- - The ext4 filesystem requires that this feature be enabled in - order to support filesystems that have the huge_file feature - enabled. Otherwise, it will refuse to mount in the read-write - mode any filesystems that use the huge_file feature, which is - enabled by default by mke2fs.ext4. - - The GFS2 filesystem also requires this feature. - - If unsure, say Y. - config BLK_SCSI_REQUEST bool diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h index 000a2f4c0e92..acd7af3630e9 100644 --- a/drivers/block/drbd/drbd_int.h +++ b/drivers/block/drbd/drbd_int.h @@ -1317,10 +1317,6 @@ struct bm_extent { #define DRBD_MAX_SECTORS_FIXED_BM \ ((MD_128MB_SECT - MD_32kB_SECT - MD_4kB_SECT) * (1LL<<(BM_EXT_SHIFT-9))) -#if !defined(CONFIG_LBDAF) && BITS_PER_LONG == 32 -#define DRBD_MAX_SECTORS DRBD_MAX_SECTORS_32 -#define DRBD_MAX_SECTORS_FLEX DRBD_MAX_SECTORS_32 -#else #define DRBD_MAX_SECTORS DRBD_MAX_SECTORS_FIXED_BM /* 16 TB in units of sectors */ #if BITS_PER_LONG == 32 @@ -1333,7 +1329,6 @@ struct bm_extent { #define DRBD_MAX_SECTORS_FLEX (1UL << 51) /* corresponds to (1UL << 38) bits right now. */ #endif -#endif /* Estimate max bio size as 256 * PAGE_SIZE, * so for typical PAGE_SIZE of 4k, that is (1<<20) Byte. diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c index 4e1d9b31f60c..cc61c5ce3ad5 100644 --- a/drivers/block/ps3disk.c +++ b/drivers/block/ps3disk.c @@ -102,7 +102,7 @@ static void ps3disk_scatter_gather(struct ps3_storage_device *dev, rq_for_each_segment(bvec, req, iter) { unsigned long flags; - dev_dbg(&dev->sbd.core, "%s:%u: bio %u: %u sectors from %lu\n", + dev_dbg(&dev->sbd.core, "%s:%u: bio %u: %u sectors from %llu\n", __func__, __LINE__, i, bio_sectors(iter.bio), iter.bio->bi_iter.bi_sector); @@ -496,7 +496,7 @@ static int ps3disk_probe(struct ps3_system_bus_device *_dev) dev->regions[dev->region_idx].size*priv->blocking_factor); dev_info(&dev->sbd.core, - "%s is a %s (%llu MiB total, %lu MiB for OtherOS)\n", + "%s is a %s (%llu MiB total, %llu MiB for OtherOS)\n", gendisk->disk_name, priv->model, priv->raw_capacity >> 11, get_capacity(gendisk) >> 11); diff --git a/drivers/md/dm-exception-store.h b/drivers/md/dm-exception-store.h index 12b5216c2cfe..721efc493942 100644 --- a/drivers/md/dm-exception-store.h +++ b/drivers/md/dm-exception-store.h @@ -135,9 +135,8 @@ struct dm_dev *dm_snap_cow(struct dm_snapshot *snap); /* * Funtions to manipulate consecutive chunks */ -# if defined(CONFIG_LBDAF) || (BITS_PER_LONG == 64) -# define DM_CHUNK_CONSECUTIVE_BITS 8 -# define DM_CHUNK_NUMBER_BITS 56 +#define DM_CHUNK_CONSECUTIVE_BITS 8 +#define DM_CHUNK_NUMBER_BITS 56 static inline chunk_t dm_chunk_number(chunk_t chunk) { @@ -163,29 +162,6 @@ static inline void dm_consecutive_chunk_count_dec(struct dm_exception *e) e->new_chunk -= (1ULL << DM_CHUNK_NUMBER_BITS); } -# else -# define DM_CHUNK_CONSECUTIVE_BITS 0 - -static inline chunk_t dm_chunk_number(chunk_t chunk) -{ - return chunk; -} - -static inline unsigned dm_consecutive_chunk_count(struct dm_exception *e) -{ - return 0; -} - -static inline void dm_consecutive_chunk_count_inc(struct dm_exception *e) -{ -} - -static inline void dm_consecutive_chunk_count_dec(struct dm_exception *e) -{ -} - -# endif - /* * Return the number of sectors in the device. 
*/ diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c index d57d997a52c8..0eb56ba89a7f 100644 --- a/drivers/md/dm-integrity.c +++ b/drivers/md/dm-integrity.c @@ -88,14 +88,10 @@ struct journal_entry { #if BITS_PER_LONG == 64 #define journal_entry_set_sector(je, x) do { smp_wmb(); WRITE_ONCE((je)->u.sector, cpu_to_le64(x)); } while (0) -#define journal_entry_get_sector(je) le64_to_cpu((je)->u.sector) -#elif defined(CONFIG_LBDAF) -#define journal_entry_set_sector(je, x) do { (je)->u.s.sector_lo = cpu_to_le32(x); smp_wmb(); WRITE_ONCE((je)->u.s.sector_hi, cpu_to_le32((x) >> 32)); } while (0) -#define journal_entry_get_sector(je) le64_to_cpu((je)->u.sector) #else -#define journal_entry_set_sector(je, x) do { (je)->u.s.sector_lo = cpu_to_le32(x); smp_wmb(); WRITE_ONCE((je)->u.s.sector_hi, cpu_to_le32(0)); } while (0) -#define journal_entry_get_sector(je) le32_to_cpu((je)->u.s.sector_lo) +#define journal_entry_set_sector(je, x) do { (je)->u.s.sector_lo = cpu_to_le32(x); smp_wmb(); WRITE_ONCE((je)->u.s.sector_hi, cpu_to_le32((x) >> 32)); } while (0) #endif +#define journal_entry_get_sector(je) le64_to_cpu((je)->u.sector) #define journal_entry_is_unused(je) ((je)->u.s.sector_hi == cpu_to_le32(-1)) #define journal_entry_set_unused(je) do { ((je)->u.s.sector_hi = cpu_to_le32(-1)); } while (0) #define journal_entry_is_inprogress(je) ((je)->u.s.sector_hi == cpu_to_le32(-2)) diff --git a/drivers/md/md.c b/drivers/md/md.c index d0f688399a56..1fa2682951f1 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1106,8 +1106,7 @@ static int super_90_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor * (not needed for Linear and RAID0 as metadata doesn't * record this size) */ - if (IS_ENABLED(CONFIG_LBDAF) && (u64)rdev->sectors >= (2ULL << 32) && - sb->level >= 1) + if ((u64)rdev->sectors >= (2ULL << 32) && sb->level >= 1) rdev->sectors = (sector_t)(2ULL << 32) - 2; if (rdev->sectors < ((sector_t)sb->size) * 2 && sb->level >= 1) @@ -1405,8 +1404,7 @@ super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors) /* Limit to 4TB as metadata cannot record more than that. * 4TB == 2^32 KB, or 2*2^32 sectors. 
*/ - if (IS_ENABLED(CONFIG_LBDAF) && (u64)num_sectors >= (2ULL << 32) && - rdev->mddev->level >= 1) + if ((u64)num_sectors >= (2ULL << 32) && rdev->mddev->level >= 1) num_sectors = (sector_t)(2ULL << 32) - 2; do { md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size, diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index d271bd731af7..01f40672507f 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -391,7 +391,7 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn) bb_present = badblocks_check(&nd_region->bb, meta_start, meta_num, &first_bad, &num_bad); if (bb_present) { - dev_dbg(&nd_pfn->dev, "meta: %x badblocks at %lx\n", + dev_dbg(&nd_pfn->dev, "meta: %x badblocks at %llx\n", num_bad, first_bad); nsoff = ALIGN_DOWN((nd_region->ndr_start + (first_bad << 9)) - nsio->res.start, @@ -410,7 +410,7 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn) } if (rc) { dev_err(&nd_pfn->dev, - "error clearing %x badblocks at %lx\n", + "error clearing %x badblocks at %llx\n", num_bad, first_bad); return rc; } diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 2b2bc4b49d78..92c34d93e051 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -2256,22 +2256,6 @@ static void read_capacity_error(struct scsi_disk *sdkp, struct scsi_device *sdp, #define READ_CAPACITY_RETRIES_ON_RESET 10 -/* - * Ensure that we don't overflow sector_t when CONFIG_LBDAF is not set - * and the reported logical block size is bigger than 512 bytes. Note - * that last_sector is a u64 and therefore logical_to_sectors() is not - * applicable. - */ -static bool sd_addressable_capacity(u64 lba, unsigned int sector_size) -{ - u64 last_sector = (lba + 1ULL) << (ilog2(sector_size) - 9); - - if (sizeof(sector_t) == 4 && last_sector > U32_MAX) - return false; - - return true; -} - static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, unsigned char *buffer) { @@ -2337,14 +2321,6 @@ static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, return -ENODEV; } - if (!sd_addressable_capacity(lba, sector_size)) { - sd_printk(KERN_ERR, sdkp, "Too big for this kernel. Use a " - "kernel compiled with support for large block " - "devices.\n"); - sdkp->capacity = 0; - return -EOVERFLOW; - } - /* Logical blocks per physical block exponent */ sdkp->physical_block_size = (1 << (buffer[13] & 0xf)) * sector_size; @@ -2426,14 +2402,6 @@ static int read_capacity_10(struct scsi_disk *sdkp, struct scsi_device *sdp, return sector_size; } - if (!sd_addressable_capacity(lba, sector_size)) { - sd_printk(KERN_ERR, sdkp, "Too big for this kernel. 
Use a " - "kernel compiled with support for large block " - "devices.\n"); - sdkp->capacity = 0; - return -EOVERFLOW; - } - sdkp->capacity = lba + 1; sdkp->physical_block_size = sector_size; return sector_size; diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index e7ae26e36c9c..38faf661e237 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -1760,8 +1760,6 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es, ext4_msg(sb, KERN_ERR, "filesystem too large to resize to %llu blocks safely", n_blocks_count); - if (sizeof(sector_t) < 8) - ext4_warning(sb, "CONFIG_LBDAF not enabled"); return -EINVAL; } diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 6ed4eb81e674..d10e9e724bdd 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -2706,13 +2706,9 @@ static loff_t ext4_max_size(int blkbits, int has_huge_files) loff_t res; loff_t upper_limit = MAX_LFS_FILESIZE; - /* small i_blocks in vfs inode? */ - if (!has_huge_files || sizeof(blkcnt_t) < sizeof(u64)) { - /* - * CONFIG_LBDAF is not enabled implies the inode - * i_block represent total blocks in 512 bytes - * 32 == size of vfs inode i_blocks * 8 - */ + BUILD_BUG_ON(sizeof(blkcnt_t) < sizeof(u64)); + + if (!has_huge_files) { upper_limit = (1LL << 32) - 1; /* total blocks in file system block size */ @@ -2753,11 +2749,11 @@ static loff_t ext4_max_bitmap_size(int bits, int has_huge_files) * number of 512-byte sectors of the file. */ - if (!has_huge_files || sizeof(blkcnt_t) < sizeof(u64)) { + if (!has_huge_files) { /* - * !has_huge_files or CONFIG_LBDAF not enabled implies that - * the inode i_block field represents total file blocks in - * 2^32 512-byte sectors == size of vfs inode i_blocks * 8 + * !has_huge_files or implies that the inode i_block field + * represents total file blocks in 2^32 512-byte sectors == + * size of vfs inode i_blocks * 8 */ upper_limit = (1LL << 32) - 1; @@ -2897,18 +2893,6 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly) ~EXT4_FEATURE_RO_COMPAT_SUPP)); return 0; } - /* - * Large file size enabled file system can only be mounted - * read-write on 32-bit systems if kernel is built with CONFIG_LBDAF - */ - if (ext4_has_feature_huge_file(sb)) { - if (sizeof(blkcnt_t) < sizeof(u64)) { - ext4_msg(sb, KERN_ERR, "Filesystem with huge files " - "cannot be mounted RDWR without " - "CONFIG_LBDAF"); - return 0; - } - } if (ext4_has_feature_bigalloc(sb) && !ext4_has_feature_extents(sb)) { ext4_msg(sb, KERN_ERR, "Can't support bigalloc feature without " @@ -4057,8 +4041,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) if (err) { ext4_msg(sb, KERN_ERR, "filesystem" " too large to mount safely on this system"); - if (sizeof(sector_t) < 8) - ext4_msg(sb, KERN_WARNING, "CONFIG_LBDAF not enabled"); goto failed_mount; } diff --git a/fs/gfs2/Kconfig b/fs/gfs2/Kconfig index 3ed2b088dcfd..6a1e499543f5 100644 --- a/fs/gfs2/Kconfig +++ b/fs/gfs2/Kconfig @@ -1,6 +1,5 @@ config GFS2_FS tristate "GFS2 file system support" - depends on (64BIT || LBDAF) select FS_POSIX_ACL select CRC32 select LIBCRC32C diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig index 5f93cfacb3d1..69d02cf8cf37 100644 --- a/fs/nfs/Kconfig +++ b/fs/nfs/Kconfig @@ -121,7 +121,6 @@ config PNFS_FILE_LAYOUT config PNFS_BLOCK tristate depends on NFS_V4_1 && BLK_DEV_DM - depends on 64BIT || LBDAF default NFS_V4 config PNFS_FLEXFILE_LAYOUT diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 96ae7cedd487..fc3d29eceb2f 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -600,7 +600,6 @@ static 
unsigned long long ocfs2_max_file_offset(unsigned int bbits, */ #if BITS_PER_LONG == 32 -# if defined(CONFIG_LBDAF) BUILD_BUG_ON(sizeof(sector_t) != 8); /* * We might be limited by page cache size. @@ -614,15 +613,6 @@ static unsigned long long ocfs2_max_file_offset(unsigned int bbits, */ bitshift = 31; } -# else - /* - * We are limited by the size of sector_t. Use block size, as - * that's what we expose to the VFS. - */ - bytes = 1 << bbits; - trim = 1; - bitshift = 31; -# endif #endif /* diff --git a/fs/stack.c b/fs/stack.c index a54e33ed10f1..664ed35558bd 100644 --- a/fs/stack.c +++ b/fs/stack.c @@ -21,11 +21,10 @@ void fsstack_copy_inode_size(struct inode *dst, struct inode *src) i_size = i_size_read(src); /* - * But if CONFIG_LBDAF (on 32-bit), we ought to make an effort to - * keep the two halves of i_blocks in sync despite SMP or PREEMPT - - * though stat's generic_fillattr() doesn't bother, and we won't be - * applying quotas (where i_blocks does become important) at the - * upper level. + * But on 32-bit, we ought to make an effort to keep the two halves of + * i_blocks in sync despite SMP or PREEMPT - though stat's + * generic_fillattr() doesn't bother, and we won't be applying quotas + * (where i_blocks does become important) at the upper level. * * We don't actually know what locking is used at the lower level; * but if it's a filesystem that supports quotas, it will be using @@ -44,9 +43,9 @@ void fsstack_copy_inode_size(struct inode *dst, struct inode *src) * include/linux/fs.h). We don't necessarily hold i_mutex when this * is called, so take i_lock for that case. * - * And if CONFIG_LBDAF (on 32-bit), continue our effort to keep the - * two halves of i_blocks in sync despite SMP or PREEMPT: use i_lock - * for that case too, and do both at once by combining the tests. + * And if on 32-bit, continue our effort to keep the two halves of + * i_blocks in sync despite SMP or PREEMPT: use i_lock for that case + * too, and do both at once by combining the tests. * * There is none of this locking overhead in the 64-bit case. */ diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index 457ac9f97377..99af5e5bda9f 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -1,7 +1,6 @@ config XFS_FS tristate "XFS filesystem support" depends on BLOCK - depends on (64BIT || LBDAF) select EXPORTFS select LIBCRC32C select FS_IOMAP diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index f093ea244849..703b6be063ef 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -539,26 +539,18 @@ xfs_max_file_offset( /* Figure out maximum filesize, on Linux this can depend on * the filesystem blocksize (on 32 bit platforms). - * __block_write_begin does this in an [unsigned] long... + * __block_write_begin does this in an [unsigned] long long... * page->index << (PAGE_SHIFT - bbits) * So, for page sized blocks (4K on 32 bit platforms), * this wraps at around 8Tb (hence MAX_LFS_FILESIZE which is * (((u64)PAGE_SIZE << (BITS_PER_LONG-1))-1) * but for smaller blocksizes it is less (bbits = log2 bsize). - * Note1: get_block_t takes a long (implicit cast from above) - * Note2: The Large Block Device (LBD and HAVE_SECTOR_T) patch - * can optionally convert the [unsigned] long from above into - * an [unsigned] long long. 
*/ #if BITS_PER_LONG == 32 -# if defined(CONFIG_LBDAF) ASSERT(sizeof(sector_t) == 8); pagefactor = PAGE_SIZE; bitshift = BITS_PER_LONG; -# else - pagefactor = PAGE_SIZE >> (PAGE_SHIFT - blockshift); -# endif #endif return (((uint64_t)pagefactor) << bitshift) - 1; diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 06c0fd594097..98076b1b5e48 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -714,7 +714,7 @@ static inline void hd_free_part(struct hd_struct *part) */ static inline sector_t part_nr_sects_read(struct hd_struct *part) { -#if BITS_PER_LONG==32 && defined(CONFIG_LBDAF) && defined(CONFIG_SMP) +#if BITS_PER_LONG==32 && defined(CONFIG_SMP) sector_t nr_sects; unsigned seq; do { @@ -722,7 +722,7 @@ static inline sector_t part_nr_sects_read(struct hd_struct *part) nr_sects = part->nr_sects; } while (read_seqcount_retry(&part->nr_sects_seq, seq)); return nr_sects; -#elif BITS_PER_LONG==32 && defined(CONFIG_LBDAF) && defined(CONFIG_PREEMPT) +#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT) sector_t nr_sects; preempt_disable(); @@ -741,11 +741,11 @@ static inline sector_t part_nr_sects_read(struct hd_struct *part) */ static inline void part_nr_sects_write(struct hd_struct *part, sector_t size) { -#if BITS_PER_LONG==32 && defined(CONFIG_LBDAF) && defined(CONFIG_SMP) +#if BITS_PER_LONG==32 && defined(CONFIG_SMP) write_seqcount_begin(&part->nr_sects_seq); part->nr_sects = size; write_seqcount_end(&part->nr_sects_seq); -#elif BITS_PER_LONG==32 && defined(CONFIG_LBDAF) && defined(CONFIG_PREEMPT) +#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT) preempt_disable(); part->nr_sects = size; preempt_enable(); diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 34a5036debd3..24ef5a018a5e 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -17,6 +17,7 @@ #include #include #include +#include #define STACK_MAGIC 0xdeadbeef @@ -175,18 +176,7 @@ #define _RET_IP_ (unsigned long)__builtin_return_address(0) #define _THIS_IP_ ({ __label__ __here; __here: (unsigned long)&&__here; }) -#ifdef CONFIG_LBDAF -# define sector_div(a, b) do_div(a, b) -#else -# define sector_div(n, b)( \ -{ \ - int _res; \ - _res = (n) % (b); \ - (n) /= (b); \ - _res; \ -} \ -) -#endif +#define sector_div(a, b) do_div(a, b) /** * upper_32_bits - return bits 32-63 of a number diff --git a/include/linux/types.h b/include/linux/types.h index cc0dbbe551d5..231114ae38f4 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -127,13 +127,8 @@ typedef s64 int64_t; * * blkcnt_t is the type of the inode's block count. */ -#ifdef CONFIG_LBDAF typedef u64 sector_t; typedef u64 blkcnt_t; -#else -typedef unsigned long sector_t; -typedef unsigned long blkcnt_t; -#endif /* * The type of an index into the pagecache. 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 0d9e81779e37..d8781786cf63 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1927,7 +1927,6 @@ config TEST_STATIC_KEYS config TEST_KMOD tristate "kmod stress tester" depends on m - depends on BLOCK && (64BIT || LBDAF) # for XFS, BTRFS depends on NETDEVICES && NET_CORE && INET # for TUN select TEST_LKM select XFS_FS diff --git a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/include/linux/types.h b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/include/linux/types.h index d27285f8ee82..8bc960e5e713 100644 --- a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/include/linux/types.h +++ b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/include/linux/types.h @@ -59,11 +59,7 @@ typedef __u32 uint32_t; * * blkcnt_t is the type of the inode's block count. */ -#ifdef CONFIG_LBDAF typedef u64 sector_t; -#else -typedef unsigned long sector_t; -#endif /* * The type of an index into the pagecache. From 78bf47353b0041865564deeed257a54f047c2fdc Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:15:53 +0100 Subject: [PATCH 051/164] block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value opal_mbr_data.enable_disable incorrectly: enable_disable is expected to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable was passed directly to set_mbr_done and set_mbr_enable_disable where is was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE actually disabled the shadow MBR and vice versa. This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to OPAL_FALSE/TRUE. The change affects existing programs using IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when setting up an Opal drive. Acked-by: Jon Derrick Reviewed-by: Christoph Hellwig Reviewed-by: Scott Bauer Signed-off-by: David Kozub Signed-off-by: Jens Axboe --- block/sed-opal.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index e0de4dd448b3..119640897293 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -2095,13 +2095,16 @@ static int opal_erase_locking_range(struct opal_dev *dev, static int opal_enable_disable_shadow_mbr(struct opal_dev *dev, struct opal_mbr_data *opal_mbr) { + u8 enable_disable = opal_mbr->enable_disable == OPAL_MBR_ENABLE ? + OPAL_TRUE : OPAL_FALSE; + const struct opal_step mbr_steps[] = { { opal_discovery0, }, { start_admin1LSP_opal_session, &opal_mbr->key }, - { set_mbr_done, &opal_mbr->enable_disable }, + { set_mbr_done, &enable_disable }, { end_opal_session, }, { start_admin1LSP_opal_session, &opal_mbr->key }, - { set_mbr_enable_disable, &opal_mbr->enable_disable }, + { set_mbr_enable_disable, &enable_disable }, { end_opal_session, }, { NULL, } }; @@ -2221,7 +2224,7 @@ static int __opal_lock_unlock(struct opal_dev *dev, static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key) { - u8 mbr_done_tf = 1; + u8 mbr_done_tf = OPAL_TRUE; const struct opal_step mbrdone_step [] = { { opal_discovery0, }, { start_admin1LSP_opal_session, key }, From 1e815b33c5ccd3936b71292b5ffb84e97e1df9e0 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:15:54 +0100 Subject: [PATCH 052/164] block: sed-opal: fix typos and formatting This should make no change in functionality. The formatting changes were triggered by checkpatch.pl. 
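(For context, this class of warnings can be reproduced by running checkpatch directly against the file, for example scripts/checkpatch.pl -f block/sed-opal.c from the top of the kernel tree; the exact invocation is only illustrative.)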
Reviewed-by: Scott Bauer Reviewed-by: Jon Derrick Reviewed-by: Christoph Hellwig Signed-off-by: David Kozub Signed-off-by: Jens Axboe --- block/sed-opal.c | 18 ++++++++++-------- include/uapi/linux/sed-opal.h | 2 +- 2 files changed, 11 insertions(+), 9 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index 119640897293..d12a910e06cb 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -157,7 +157,7 @@ static const u8 opaluid[][OPAL_UID_LENGTH] = { /* C_PIN_TABLE object ID's */ - [OPAL_C_PIN_MSID] = + [OPAL_C_PIN_MSID] = { 0x00, 0x00, 0x00, 0x0B, 0x00, 0x00, 0x84, 0x02}, [OPAL_C_PIN_SID] = { 0x00, 0x00, 0x00, 0x0B, 0x00, 0x00, 0x00, 0x01}, @@ -551,7 +551,6 @@ static void add_medium_atom_header(struct opal_dev *cmd, bool bytestring, static void add_token_u64(int *err, struct opal_dev *cmd, u64 number) { - size_t len; int msb; @@ -623,7 +622,7 @@ static int build_locking_range(u8 *buffer, size_t length, u8 lr) static int build_locking_user(u8 *buffer, size_t length, u8 lr) { if (length > OPAL_UID_LENGTH) { - pr_debug("Can't build locking range user, Length OOB\n"); + pr_debug("Can't build locking range user. Length OOB\n"); return -ERANGE; } @@ -1324,6 +1323,7 @@ static int start_SIDASP_opal_session(struct opal_dev *dev, void *data) if (!key) { const struct opal_key *okey = data; + ret = start_generic_opal_session(dev, OPAL_SID_UID, OPAL_ADMINSP_UID, okey->key, @@ -1341,6 +1341,7 @@ static int start_SIDASP_opal_session(struct opal_dev *dev, void *data) static int start_admin1LSP_opal_session(struct opal_dev *dev, void *data) { struct opal_key *key = data; + return start_generic_opal_session(dev, OPAL_ADMIN1_UID, OPAL_LOCKINGSP_UID, key->key, key->key_len); @@ -1714,7 +1715,7 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data) write_locked = 0; break; case OPAL_LK: - /* vars are initalized to locked */ + /* vars are initialized to locked */ break; default: pr_debug("Tried to set an invalid locking state... 
returning to uland\n"); @@ -1775,7 +1776,7 @@ static int lock_unlock_locking_range_sum(struct opal_dev *dev, void *data) write_locked = 0; break; case OPAL_LK: - /* vars are initalized to locked */ + /* vars are initialized to locked */ break; default: pr_debug("Tried to set an invalid locking state.\n"); @@ -1854,7 +1855,7 @@ static int get_lsp_lifecycle_cont(struct opal_dev *dev) return error; lc_status = response_get_u64(&dev->parsed, 4); - /* 0x08 is Manufacured Inactive */ + /* 0x08 is Manufactured Inactive */ /* 0x09 is Manufactured */ if (lc_status != OPAL_MANUFACTURED_INACTIVE) { pr_debug("Couldn't determine the status of the Lifecycle state\n"); @@ -2225,7 +2226,7 @@ static int __opal_lock_unlock(struct opal_dev *dev, static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key) { u8 mbr_done_tf = OPAL_TRUE; - const struct opal_step mbrdone_step [] = { + const struct opal_step mbrdone_step[] = { { opal_discovery0, }, { start_admin1LSP_opal_session, key }, { set_mbr_done, &mbr_done_tf }, @@ -2276,7 +2277,8 @@ static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal) return ret; } -static int opal_activate_lsp(struct opal_dev *dev, struct opal_lr_act *opal_lr_act) +static int opal_activate_lsp(struct opal_dev *dev, + struct opal_lr_act *opal_lr_act) { const struct opal_step active_steps[] = { { opal_discovery0, }, diff --git a/include/uapi/linux/sed-opal.h b/include/uapi/linux/sed-opal.h index 627624d35030..e092e124dd16 100644 --- a/include/uapi/linux/sed-opal.h +++ b/include/uapi/linux/sed-opal.h @@ -58,7 +58,7 @@ struct opal_key { struct opal_lr_act { struct opal_key key; __u32 sum; - __u8 num_lrs; + __u8 num_lrs; __u8 lr[OPAL_MAX_LRS]; __u8 align[2]; /* Align to 8 byte boundary */ }; From 1b6b75b0137fd2b5af533eceba5f5db62b1c45b0 Mon Sep 17 00:00:00 2001 From: Jonas Rabenstein Date: Thu, 14 Feb 2019 01:15:55 +0100 Subject: [PATCH 053/164] block: sed-opal: use correct macro for method length Also the values of OPAL_UID_LENGTH and OPAL_METHOD_LENGTH are the same, it is weird to use OPAL_UID_LENGTH for the definition of the methods. Signed-off-by: Jonas Rabenstein Signed-off-by: David Kozub Reviewed-by: Scott Bauer Reviewed-by: Christoph Hellwig Reviewed-by: Jon Derrick Signed-off-by: Jens Axboe --- block/sed-opal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index d12a910e06cb..e59ae364f1ef 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -181,7 +181,7 @@ static const u8 opaluid[][OPAL_UID_LENGTH] = { * Derived from: TCG_Storage_Architecture_Core_Spec_v2.01_r1.00 * Section: 6.3 Assigned UIDs */ -static const u8 opalmethod[][OPAL_UID_LENGTH] = { +static const u8 opalmethod[][OPAL_METHOD_LENGTH] = { [OPAL_PROPERTIES] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x01 }, [OPAL_STARTSESSION] = From e2821a50b17c1b760e7d597777de61241f22fd55 Mon Sep 17 00:00:00 2001 From: Jonas Rabenstein Date: Thu, 14 Feb 2019 01:15:56 +0100 Subject: [PATCH 054/164] block: sed-opal: unify space check in add_token_* All add_token_* functions have a common set of conditions that have to be checked. Use a common function for those checks in order to avoid different behaviour as well as code duplication. 
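To make the shared condition concrete: the check that can_add() centralizes (see the diff below) is equivalent to "cmd->pos + len <= IO_BUFFER_LENGTH", but it is phrased without computing the sum, so the test cannot wrap around. A minimal standalone sketch of the same pattern, with an illustrative buffer size:

#include <linux/types.h>

#define EXAMPLE_BUF_LEN 2048

static bool example_fits(size_t pos, size_t len)
{
	/* "pos + len <= EXAMPLE_BUF_LEN" rewritten without the addition */
	return len <= EXAMPLE_BUF_LEN && pos <= EXAMPLE_BUF_LEN - len;
}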
Acked-by: Jon Derrick Reviewed-by: Christoph Hellwig Reviewed-by: Scott Bauer Co-authored-by: David Kozub Signed-off-by: Jonas Rabenstein Signed-off-by: David Kozub Signed-off-by: Jens Axboe --- block/sed-opal.c | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index e59ae364f1ef..d285bd4b2b9b 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -510,15 +510,24 @@ static int opal_discovery0(struct opal_dev *dev, void *data) return opal_discovery0_end(dev); } -static void add_token_u8(int *err, struct opal_dev *cmd, u8 tok) +static bool can_add(int *err, struct opal_dev *cmd, size_t len) { if (*err) - return; - if (cmd->pos >= IO_BUFFER_LENGTH - 1) { - pr_debug("Error adding u8: end of buffer.\n"); + return false; + + if (len > IO_BUFFER_LENGTH || cmd->pos > IO_BUFFER_LENGTH - len) { + pr_debug("Error adding %zu bytes: end of buffer.\n", len); *err = -ERANGE; - return; + return false; } + + return true; +} + +static void add_token_u8(int *err, struct opal_dev *cmd, u8 tok) +{ + if (!can_add(err, cmd, 1)) + return; cmd->cmd[cmd->pos++] = tok; } @@ -562,9 +571,8 @@ static void add_token_u64(int *err, struct opal_dev *cmd, u64 number) msb = fls64(number); len = DIV_ROUND_UP(msb, 8); - if (cmd->pos >= IO_BUFFER_LENGTH - len - 1) { + if (!can_add(err, cmd, len + 1)) { pr_debug("Error adding u64: end of buffer.\n"); - *err = -ERANGE; return; } add_short_atom_header(cmd, false, false, len); @@ -586,9 +594,8 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd, is_short_atom = false; } - if (len >= IO_BUFFER_LENGTH - cmd->pos - header_len) { + if (!can_add(err, cmd, header_len + len)) { pr_debug("Error adding bytestring: end of buffer.\n"); - *err = -ERANGE; return; } From 78d584ca31efbadfd8db105dec09c362d75b97b9 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:15:57 +0100 Subject: [PATCH 055/164] block: sed-opal: close parameter list in cmd_finalize Every step ends by calling cmd_finalize (via finalize_and_send) yet every step adds the token OPAL_ENDLIST on its own. Moving this into cmd_finalize decreases code duplication. 
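To make the convention concrete, a step now ends without its own closing token. The following is an illustrative sketch, not code from the patch; it is modeled on revert_tper() as it stands at this point in the series, and the continuation function is assumed to be parse_and_check_status() as used elsewhere in sed-opal.c.

static int example_step(struct opal_dev *dev, void *data)
{
	int err = 0;

	clear_opal_cmd(dev);
	set_comid(dev, dev->comid);

	add_token_u8(&err, dev, OPAL_CALL);
	add_token_bytestring(&err, dev, opaluid[OPAL_ADMINSP_UID], OPAL_UID_LENGTH);
	add_token_bytestring(&err, dev, opalmethod[OPAL_REVERT], OPAL_UID_LENGTH);
	add_token_u8(&err, dev, OPAL_STARTLIST);
	/* method parameters would go here; the closing OPAL_ENDLIST is now
	 * emitted by cmd_finalize(), so it is no longer added per step */
	if (err)
		return err;

	return finalize_and_send(dev, parse_and_check_status);
}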
Co-authored-by: Jonas Rabenstein Signed-off-by: David Kozub Signed-off-by: Jonas Rabenstein Reviewed-by: Scott Bauer Reviewed-by: Christoph Hellwig Acked-by: Jon Derrick Signed-off-by: Jens Axboe --- block/sed-opal.c | 25 +++---------------------- 1 file changed, 3 insertions(+), 22 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index d285bd4b2b9b..c5dff6199bd6 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -655,6 +655,9 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn) struct opal_header *hdr; int err = 0; + /* close parameter list */ + add_token_u8(&err, cmd, OPAL_ENDLIST); + add_token_u8(&err, cmd, OPAL_ENDOFDATA); add_token_u8(&err, cmd, OPAL_STARTLIST); add_token_u8(&err, cmd, 0); @@ -1073,7 +1076,6 @@ static int gen_key(struct opal_dev *dev, void *data) add_token_bytestring(&err, dev, opalmethod[OPAL_GENKEY], OPAL_UID_LENGTH); add_token_u8(&err, dev, OPAL_STARTLIST); - add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { pr_debug("Error building gen key command\n"); @@ -1136,7 +1138,6 @@ static int get_active_key(struct opal_dev *dev, void *data) add_token_u8(&err, dev, 10); /* ActiveKey */ add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); - add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { pr_debug("Error building get active key command\n"); return err; @@ -1182,7 +1183,6 @@ static int generic_lr_enable_disable(struct opal_dev *dev, add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); return err; } @@ -1248,8 +1248,6 @@ static int setup_locking_range(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); - } if (err) { pr_debug("Error building Setup Locking range command.\n"); @@ -1289,7 +1287,6 @@ static int start_generic_opal_session(struct opal_dev *dev, switch (auth) { case OPAL_ANYBODY_UID: - add_token_u8(&err, dev, OPAL_ENDLIST); break; case OPAL_ADMIN1_UID: case OPAL_SID_UID: @@ -1302,7 +1299,6 @@ static int start_generic_opal_session(struct opal_dev *dev, add_token_bytestring(&err, dev, opaluid[auth], OPAL_UID_LENGTH); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); break; default: pr_debug("Cannot start Admin SP session with auth %d\n", auth); @@ -1400,7 +1396,6 @@ static int start_auth_opal_session(struct opal_dev *dev, void *data) add_token_u8(&err, dev, 3); add_token_bytestring(&err, dev, lk_ul_user, OPAL_UID_LENGTH); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { pr_debug("Error building STARTSESSION command.\n"); @@ -1423,7 +1418,6 @@ static int revert_tper(struct opal_dev *dev, void *data) add_token_bytestring(&err, dev, opalmethod[OPAL_REVERT], OPAL_UID_LENGTH); add_token_u8(&err, dev, OPAL_STARTLIST); - add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { pr_debug("Error building REVERT TPER command.\n"); return err; @@ -1457,7 +1451,6 @@ static int internal_activate_user(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { pr_debug("Error building Activate UserN command.\n"); @@ -1484,7 +1477,6 @@ static int erase_locking_range(struct opal_dev *dev, void *data) add_token_bytestring(&err, dev, opalmethod[OPAL_ERASE], OPAL_UID_LENGTH); add_token_u8(&err, dev, OPAL_STARTLIST); - add_token_u8(&err, dev, OPAL_ENDLIST); if 
(err) { pr_debug("Error building Erase Locking Range Command.\n"); @@ -1515,7 +1507,6 @@ static int set_mbr_done(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { pr_debug("Error Building set MBR Done command\n"); @@ -1547,7 +1538,6 @@ static int set_mbr_enable_disable(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { pr_debug("Error Building set MBR done command\n"); @@ -1579,7 +1569,6 @@ static int generic_pw_cmd(u8 *key, size_t key_len, u8 *cpin_uid, add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); return err; } @@ -1688,7 +1677,6 @@ static int add_user_to_lr(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { pr_debug("Error building add user to locking range command.\n"); @@ -1749,7 +1737,6 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { pr_debug("Error building SET command.\n"); @@ -1837,11 +1824,8 @@ static int activate_lsp(struct opal_dev *dev, void *data) } add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); - } else { add_token_u8(&err, dev, OPAL_STARTLIST); - add_token_u8(&err, dev, OPAL_ENDLIST); } if (err) { @@ -1898,7 +1882,6 @@ static int get_lsp_lifecycle(struct opal_dev *dev, void *data) add_token_u8(&err, dev, 6); /* Lifecycle Column */ add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { @@ -1958,8 +1941,6 @@ static int get_msid_cpin_pin(struct opal_dev *dev, void *data) add_token_u8(&err, dev, 4); /* End Column */ add_token_u8(&err, dev, 3); /* Lifecycle Column */ add_token_u8(&err, dev, OPAL_ENDNAME); - - add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { From e8b2922459cf15140ab8cc1f92b6861674fff1a3 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:15:58 +0100 Subject: [PATCH 056/164] block: sed-opal: unify cmd start Every step starts with resetting the cmd buffer as well as the comid and constructs the appropriate OPAL_CALL command. Consequently, those actions may be combined into one generic function. On should take care that the opening and closing tokens for the argument list are already emitted by cmd_start and cmd_finalize respectively and thus must not be additionally added. 
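With cmd_start() in place, the illustrative step from the previous note collapses to the shape below. This is a sketch only; the real conversions are in the diff that follows, and the continuation function is again assumed to be parse_and_check_status().

static int example_step(struct opal_dev *dev, void *data)
{
	int err;

	/* cmd_start() emits OPAL_CALL, the two UIDs and the opening
	 * OPAL_STARTLIST; cmd_finalize() emits the matching OPAL_ENDLIST */
	err = cmd_start(dev, opaluid[OPAL_ADMINSP_UID],
			opalmethod[OPAL_REVERT]);
	/* method parameters, if any, would be added here */
	if (err)
		return err;

	return finalize_and_send(dev, parse_and_check_status);
}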
Co-authored-by: Jonas Rabenstein Signed-off-by: David Kozub Signed-off-by: Jonas Rabenstein Reviewed-by: Scott Bauer Reviewed-by: Christoph Hellwig Acked-by: Jon Derrick Signed-off-by: Jens Axboe --- block/sed-opal.c | 228 ++++++++++++++--------------------------------- 1 file changed, 69 insertions(+), 159 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index c5dff6199bd6..0348fb896a5d 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -655,7 +655,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn) struct opal_header *hdr; int err = 0; - /* close parameter list */ + /* close the parameter list opened from cmd_start */ add_token_u8(&err, cmd, OPAL_ENDLIST); add_token_u8(&err, cmd, OPAL_ENDOFDATA); @@ -1000,6 +1000,27 @@ static void clear_opal_cmd(struct opal_dev *dev) memset(dev->cmd, 0, IO_BUFFER_LENGTH); } +static int cmd_start(struct opal_dev *dev, const u8 *uid, const u8 *method) +{ + int err = 0; + + clear_opal_cmd(dev); + set_comid(dev, dev->comid); + + add_token_u8(&err, dev, OPAL_CALL); + add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH); + add_token_bytestring(&err, dev, method, OPAL_METHOD_LENGTH); + + /* + * Every method call is followed by its parameters enclosed within + * OPAL_STARTLIST and OPAL_ENDLIST tokens. We automatically open the + * parameter list here and close it later in cmd_finalize. + */ + add_token_u8(&err, dev, OPAL_STARTLIST); + + return err; +} + static int start_opal_session_cont(struct opal_dev *dev) { u32 hsn, tsn; @@ -1062,20 +1083,13 @@ static int finalize_and_send(struct opal_dev *dev, cont_fn cont) static int gen_key(struct opal_dev *dev, void *data) { u8 uid[OPAL_UID_LENGTH]; - int err = 0; - - clear_opal_cmd(dev); - set_comid(dev, dev->comid); + int err; memcpy(uid, dev->prev_data, min(sizeof(uid), dev->prev_d_len)); kfree(dev->prev_data); dev->prev_data = NULL; - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_GENKEY], - OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); + err = cmd_start(dev, uid, opalmethod[OPAL_GENKEY]); if (err) { pr_debug("Error building gen key command\n"); @@ -1113,21 +1127,14 @@ static int get_active_key_cont(struct opal_dev *dev) static int get_active_key(struct opal_dev *dev, void *data) { u8 uid[OPAL_UID_LENGTH]; - int err = 0; + int err; u8 *lr = data; - clear_opal_cmd(dev); - set_comid(dev, dev->comid); - err = build_locking_range(uid, sizeof(uid), *lr); if (err) return err; - err = 0; - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_GET], OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); + err = cmd_start(dev, uid, opalmethod[OPAL_GET]); add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, 3); /* startCloumn */ @@ -1150,13 +1157,10 @@ static int generic_lr_enable_disable(struct opal_dev *dev, u8 *uid, bool rle, bool wle, bool rl, bool wl) { - int err = 0; + int err; - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH); + err = cmd_start(dev, uid, opalmethod[OPAL_SET]); - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, OPAL_VALUES); add_token_u8(&err, dev, OPAL_STARTLIST); @@ -1203,10 +1207,7 @@ static int setup_locking_range(struct opal_dev *dev, 
void *data) u8 uid[OPAL_UID_LENGTH]; struct opal_user_lr_setup *setup = data; u8 lr; - int err = 0; - - clear_opal_cmd(dev); - set_comid(dev, dev->comid); + int err; lr = setup->session.opal_key.lr; err = build_locking_range(uid, sizeof(uid), lr); @@ -1216,12 +1217,8 @@ static int setup_locking_range(struct opal_dev *dev, void *data) if (lr == 0) err = enable_global_lr(dev, uid, setup); else { - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_SET], - OPAL_UID_LENGTH); + err = cmd_start(dev, uid, opalmethod[OPAL_SET]); - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, OPAL_VALUES); add_token_u8(&err, dev, OPAL_STARTLIST); @@ -1265,22 +1262,15 @@ static int start_generic_opal_session(struct opal_dev *dev, u8 key_len) { u32 hsn; - int err = 0; + int err; if (key == NULL && auth != OPAL_ANYBODY_UID) return OPAL_INVAL_PARAM; - clear_opal_cmd(dev); - - set_comid(dev, dev->comid); hsn = GENERIC_HOST_SESSION_NUM; + err = cmd_start(dev, opaluid[OPAL_SMUID_UID], + opalmethod[OPAL_STARTSESSION]); - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, opaluid[OPAL_SMUID_UID], - OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_STARTSESSION], - OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u64(&err, dev, hsn); add_token_bytestring(&err, dev, opaluid[sp_type], OPAL_UID_LENGTH); add_token_u8(&err, dev, 1); @@ -1360,30 +1350,21 @@ static int start_auth_opal_session(struct opal_dev *dev, void *data) u8 *key = session->opal_key.key; u32 hsn = GENERIC_HOST_SESSION_NUM; - clear_opal_cmd(dev); - set_comid(dev, dev->comid); - - if (session->sum) { + if (session->sum) err = build_locking_user(lk_ul_user, sizeof(lk_ul_user), session->opal_key.lr); - if (err) - return err; - - } else if (session->who != OPAL_ADMIN1 && !session->sum) { + else if (session->who != OPAL_ADMIN1 && !session->sum) err = build_locking_user(lk_ul_user, sizeof(lk_ul_user), session->who - 1); - if (err) - return err; - } else + else memcpy(lk_ul_user, opaluid[OPAL_ADMIN1_UID], OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, opaluid[OPAL_SMUID_UID], - OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_STARTSESSION], - OPAL_UID_LENGTH); + if (err) + return err; + + err = cmd_start(dev, opaluid[OPAL_SMUID_UID], + opalmethod[OPAL_STARTSESSION]); - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u64(&err, dev, hsn); add_token_bytestring(&err, dev, opaluid[OPAL_LOCKINGSP_UID], OPAL_UID_LENGTH); @@ -1407,17 +1388,10 @@ static int start_auth_opal_session(struct opal_dev *dev, void *data) static int revert_tper(struct opal_dev *dev, void *data) { - int err = 0; + int err; - clear_opal_cmd(dev); - set_comid(dev, dev->comid); - - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, opaluid[OPAL_ADMINSP_UID], - OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_REVERT], - OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); + err = cmd_start(dev, opaluid[OPAL_ADMINSP_UID], + opalmethod[OPAL_REVERT]); if (err) { pr_debug("Error building REVERT TPER command.\n"); return err; @@ -1430,18 +1404,12 @@ static int internal_activate_user(struct opal_dev *dev, void *data) { struct opal_session_info *session = data; u8 uid[OPAL_UID_LENGTH]; - int err = 0; - - clear_opal_cmd(dev); - set_comid(dev, dev->comid); + int err; memcpy(uid, 
opaluid[OPAL_USER1_UID], OPAL_UID_LENGTH); uid[7] = session->who; - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); + err = cmd_start(dev, uid, opalmethod[OPAL_SET]); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, OPAL_VALUES); add_token_u8(&err, dev, OPAL_STARTLIST); @@ -1464,19 +1432,12 @@ static int erase_locking_range(struct opal_dev *dev, void *data) { struct opal_session_info *session = data; u8 uid[OPAL_UID_LENGTH]; - int err = 0; - - clear_opal_cmd(dev); - set_comid(dev, dev->comid); + int err; if (build_locking_range(uid, sizeof(uid), session->opal_key.lr) < 0) return -ERANGE; - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, uid, OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_ERASE], - OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); + err = cmd_start(dev, uid, opalmethod[OPAL_ERASE]); if (err) { pr_debug("Error building Erase Locking Range Command.\n"); @@ -1488,16 +1449,11 @@ static int erase_locking_range(struct opal_dev *dev, void *data) static int set_mbr_done(struct opal_dev *dev, void *data) { u8 *mbr_done_tf = data; - int err = 0; + int err; - clear_opal_cmd(dev); - set_comid(dev, dev->comid); + err = cmd_start(dev, opaluid[OPAL_MBRCONTROL], + opalmethod[OPAL_SET]); - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, opaluid[OPAL_MBRCONTROL], - OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, OPAL_VALUES); add_token_u8(&err, dev, OPAL_STARTLIST); @@ -1519,16 +1475,11 @@ static int set_mbr_done(struct opal_dev *dev, void *data) static int set_mbr_enable_disable(struct opal_dev *dev, void *data) { u8 *mbr_en_dis = data; - int err = 0; + int err; - clear_opal_cmd(dev); - set_comid(dev, dev->comid); + err = cmd_start(dev, opaluid[OPAL_MBRCONTROL], + opalmethod[OPAL_SET]); - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, opaluid[OPAL_MBRCONTROL], - OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, OPAL_VALUES); add_token_u8(&err, dev, OPAL_STARTLIST); @@ -1550,16 +1501,10 @@ static int set_mbr_enable_disable(struct opal_dev *dev, void *data) static int generic_pw_cmd(u8 *key, size_t key_len, u8 *cpin_uid, struct opal_dev *dev) { - int err = 0; + int err; - clear_opal_cmd(dev); - set_comid(dev, dev->comid); + err = cmd_start(dev, cpin_uid, opalmethod[OPAL_SET]); - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, cpin_uid, OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_SET], - OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, OPAL_VALUES); add_token_u8(&err, dev, OPAL_STARTLIST); @@ -1616,10 +1561,7 @@ static int add_user_to_lr(struct opal_dev *dev, void *data) u8 lr_buffer[OPAL_UID_LENGTH]; u8 user_uid[OPAL_UID_LENGTH]; struct opal_lock_unlock *lkul = data; - int err = 0; - - clear_opal_cmd(dev); - set_comid(dev, dev->comid); + int err; memcpy(lr_buffer, opaluid[OPAL_LOCKINGRANGE_ACE_RDLOCKED], OPAL_UID_LENGTH); @@ -1634,12 +1576,8 @@ static int add_user_to_lr(struct opal_dev *dev, void *data) 
user_uid[7] = lkul->session.who; - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, lr_buffer, OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_SET], - OPAL_UID_LENGTH); + err = cmd_start(dev, lr_buffer, opalmethod[OPAL_SET]); - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, OPAL_VALUES); @@ -1693,9 +1631,6 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data) u8 read_locked = 1, write_locked = 1; int err = 0; - clear_opal_cmd(dev); - set_comid(dev, dev->comid); - if (build_locking_range(lr_buffer, sizeof(lr_buffer), lkul->session.opal_key.lr) < 0) return -ERANGE; @@ -1717,10 +1652,8 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data) return OPAL_INVAL_PARAM; } - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, lr_buffer, OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_SET], OPAL_UID_LENGTH); - add_token_u8(&err, dev, OPAL_STARTLIST); + err = cmd_start(dev, lr_buffer, opalmethod[OPAL_SET]); + add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, OPAL_VALUES); add_token_u8(&err, dev, OPAL_STARTLIST); @@ -1791,17 +1724,10 @@ static int activate_lsp(struct opal_dev *dev, void *data) struct opal_lr_act *opal_act = data; u8 user_lr[OPAL_UID_LENGTH]; u8 uint_3 = 0x83; - int err = 0, i; - - clear_opal_cmd(dev); - set_comid(dev, dev->comid); - - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, opaluid[OPAL_LOCKINGSP_UID], - OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_ACTIVATE], - OPAL_UID_LENGTH); + int err, i; + err = cmd_start(dev, opaluid[OPAL_LOCKINGSP_UID], + opalmethod[OPAL_ACTIVATE]); if (opal_act->sum) { err = build_locking_range(user_lr, sizeof(user_lr), @@ -1809,7 +1735,6 @@ static int activate_lsp(struct opal_dev *dev, void *data) if (err) return err; - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, uint_3); add_token_u8(&err, dev, 6); @@ -1824,8 +1749,6 @@ static int activate_lsp(struct opal_dev *dev, void *data) } add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDNAME); - } else { - add_token_u8(&err, dev, OPAL_STARTLIST); } if (err) { @@ -1859,17 +1782,11 @@ static int get_lsp_lifecycle_cont(struct opal_dev *dev) /* Determine if we're in the Manufactured Inactive or Active state */ static int get_lsp_lifecycle(struct opal_dev *dev, void *data) { - int err = 0; + int err; - clear_opal_cmd(dev); - set_comid(dev, dev->comid); + err = cmd_start(dev, opaluid[OPAL_LOCKINGSP_UID], + opalmethod[OPAL_GET]); - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, opaluid[OPAL_LOCKINGSP_UID], - OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_GET], OPAL_UID_LENGTH); - - add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); @@ -1919,19 +1836,12 @@ static int get_msid_cpin_pin_cont(struct opal_dev *dev) static int get_msid_cpin_pin(struct opal_dev *dev, void *data) { - int err = 0; + int err; - clear_opal_cmd(dev); - set_comid(dev, dev->comid); - - add_token_u8(&err, dev, OPAL_CALL); - add_token_bytestring(&err, dev, opaluid[OPAL_C_PIN_MSID], - OPAL_UID_LENGTH); - add_token_bytestring(&err, dev, opalmethod[OPAL_GET], OPAL_UID_LENGTH); + err = cmd_start(dev, opaluid[OPAL_C_PIN_MSID], + opalmethod[OPAL_GET]); add_token_u8(&err, dev, OPAL_STARTLIST); - add_token_u8(&err, dev, 
OPAL_STARTLIST); - add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, 3); /* Start Column */ add_token_u8(&err, dev, 3); /* PIN */ From 7d9b62ae2a7db6dfa218999b7dd65517a6f9cfb7 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:15:59 +0100 Subject: [PATCH 057/164] block: sed-opal: unify error handling of responses response_get_{string,u64} include error handling for argument resp being NULL but response_get_token does not handle this. Make all three of response_get_{string,u64,token} handle NULL resp in the same way. Co-authored-by: Jonas Rabenstein Signed-off-by: David Kozub Signed-off-by: Jonas Rabenstein Reviewed-by: Scott Bauer Reviewed-by: Christoph Hellwig Reviewed-by: Jon Derrick Signed-off-by: Jens Axboe --- block/sed-opal.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/block/sed-opal.c b/block/sed-opal.c index 0348fb896a5d..3f368b14efd9 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -696,6 +696,11 @@ static const struct opal_resp_tok *response_get_token( { const struct opal_resp_tok *tok; + if (!resp) { + pr_debug("Response is NULL\n"); + return ERR_PTR(-EINVAL); + } + if (n >= resp->num) { pr_debug("Token number doesn't exist: %d, resp: %d\n", n, resp->num); From b68f09ecdeaa7d59397429cb95aa83f02b9bd107 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:16:00 +0100 Subject: [PATCH 058/164] block: sed-opal: reuse response_get_token to decrease code duplication response_get_token had already been in place, its functionality had been duplicated within response_get_{u64,bytestring} with the same error handling. Unify the handling by reusing response_get_token within the other functions. Co-authored-by: Jonas Rabenstein Signed-off-by: David Kozub Signed-off-by: Jonas Rabenstein Reviewed-by: Scott Bauer Reviewed-by: Christoph Hellwig Reviewed-by: Jon Derrick Signed-off-by: Jens Axboe --- block/sed-opal.c | 52 +++++++++++++++++------------------------------- 1 file changed, 18 insertions(+), 34 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index 3f368b14efd9..5cb8034b53c8 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -883,27 +883,19 @@ static size_t response_get_string(const struct parsed_resp *resp, int n, const char **store) { u8 skip; - const struct opal_resp_tok *token; + const struct opal_resp_tok *tok; *store = NULL; - if (!resp) { - pr_debug("Response is NULL\n"); + tok = response_get_token(resp, n); + if (IS_ERR(tok)) return 0; - } - if (n >= resp->num) { - pr_debug("Response has %d tokens. Can't access %d\n", - resp->num, n); - return 0; - } - - token = &resp->toks[n]; - if (token->type != OPAL_DTA_TOKENID_BYTESTRING) { + if (tok->type != OPAL_DTA_TOKENID_BYTESTRING) { pr_debug("Token is not a byte string!\n"); return 0; } - switch (token->width) { + switch (tok->width) { case OPAL_WIDTH_TINY: case OPAL_WIDTH_SHORT: skip = 1; @@ -919,37 +911,29 @@ static size_t response_get_string(const struct parsed_resp *resp, int n, return 0; } - *store = token->pos + skip; - return token->len - skip; + *store = tok->pos + skip; + return tok->len - skip; } static u64 response_get_u64(const struct parsed_resp *resp, int n) { - if (!resp) { - pr_debug("Response is NULL\n"); + const struct opal_resp_tok *tok; + + tok = response_get_token(resp, n); + if (IS_ERR(tok)) + return 0; + + if (tok->type != OPAL_DTA_TOKENID_UINT) { + pr_debug("Token is not unsigned int: %d\n", tok->type); return 0; } - if (n >= resp->num) { - pr_debug("Response has %d tokens. 
Can't access %d\n", - resp->num, n); + if (tok->width != OPAL_WIDTH_TINY && tok->width != OPAL_WIDTH_SHORT) { + pr_debug("Atom is not short or tiny: %d\n", tok->width); return 0; } - if (resp->toks[n].type != OPAL_DTA_TOKENID_UINT) { - pr_debug("Token is not unsigned it: %d\n", - resp->toks[n].type); - return 0; - } - - if (!(resp->toks[n].width == OPAL_WIDTH_TINY || - resp->toks[n].width == OPAL_WIDTH_SHORT)) { - pr_debug("Atom is not short or tiny: %d\n", - resp->toks[n].width); - return 0; - } - - return resp->toks[n].stored.u; + return tok->stored.u; } static bool response_token_matches(const struct opal_resp_tok *token, u8 match) From b2f9c6eb3f5f44d2fded05856f69050d7170eeff Mon Sep 17 00:00:00 2001 From: Jonas Rabenstein Date: Thu, 14 Feb 2019 01:16:01 +0100 Subject: [PATCH 059/164] block: sed-opal: print failed function address Add function address (and if available its symbol) to the message if a step function fails. Signed-off-by: Jonas Rabenstein Signed-off-by: David Kozub Reviewed-by: Scott Bauer Reviewed-by: Christoph Hellwig Reviewed-by: Jon Derrick --- block/sed-opal.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index 5cb8034b53c8..1f246200b574 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -394,8 +394,8 @@ static int next(struct opal_dev *dev) error = step->fn(dev, step->data); if (error) { - pr_debug("Error on step function: %d with error %d: %s\n", - state, error, + pr_debug("Step %d (%pS) failed with error %d: %s\n", + state, step->fn, error, opal_error_to_human(error)); /* For each OPAL command we do a discovery0 then we From 285599590e2e1f067d56a5855244e95f6303b28f Mon Sep 17 00:00:00 2001 From: Jonas Rabenstein Date: Thu, 14 Feb 2019 01:16:02 +0100 Subject: [PATCH 060/164] block: sed-opal: split generation of bytestring header and content Split the header generation from the (normal) memcpy part if a bytestring is copied into the command buffer. This allows in-place generation of the bytestring content. For example, copy_from_user may be used without an intermediate buffer. 
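As a rough usage sketch (illustrative only; the helper name and its arguments are made up, while add_bytestring_header(), cmd->cmd and cmd->pos come from this patch and copy_from_user() is the usual kernel uaccess helper), a caller could reserve the bytestring and then fill it in place:

static int example_copy_user_bytestring(struct opal_dev *cmd,
					const void __user *src, size_t len)
{
	int err = 0;
	u8 *dst;

	/*
	 * Emits the short or medium atom header and returns a pointer to
	 * the payload area inside cmd->cmd, or NULL (with err set) if the
	 * bytestring does not fit into the command buffer.
	 */
	dst = add_bytestring_header(&err, cmd, len);
	if (!dst)
		return err;

	if (copy_from_user(dst, src, len))
		return -EFAULT;

	/* the caller is responsible for advancing the write position */
	cmd->pos += len;
	return 0;
}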
Signed-off-by: Jonas Rabenstein Signed-off-by: David Kozub Reviewed-by: Scott Bauer Reviewed-by: Christoph Hellwig Reviewed-by: Jon Derrick Signed-off-by: Jens Axboe --- block/sed-opal.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index 1f246200b574..ad66d1dc725a 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -580,15 +580,11 @@ static void add_token_u64(int *err, struct opal_dev *cmd, u64 number) add_token_u8(err, cmd, number >> (len * 8)); } -static void add_token_bytestring(int *err, struct opal_dev *cmd, - const u8 *bytestring, size_t len) +static u8 *add_bytestring_header(int *err, struct opal_dev *cmd, size_t len) { size_t header_len = 1; bool is_short_atom = true; - if (*err) - return; - if (len & ~SHORT_ATOM_LEN_MASK) { header_len = 2; is_short_atom = false; @@ -596,7 +592,7 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd, if (!can_add(err, cmd, header_len + len)) { pr_debug("Error adding bytestring: end of buffer.\n"); - return; + return NULL; } if (is_short_atom) @@ -604,9 +600,19 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd, else add_medium_atom_header(cmd, true, false, len); - memcpy(&cmd->cmd[cmd->pos], bytestring, len); - cmd->pos += len; + return &cmd->cmd[cmd->pos]; +} +static void add_token_bytestring(int *err, struct opal_dev *cmd, + const u8 *bytestring, size_t len) +{ + u8 *start; + + start = add_bytestring_header(err, cmd, len); + if (!start) + return; + memcpy(start, bytestring, len); + cmd->pos += len; } static int build_locking_range(u8 *buffer, size_t length, u8 lr) From a4ddbd1b7b2cf5d18f34fdf8ddbb539f4c307564 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:16:03 +0100 Subject: [PATCH 061/164] block: sed-opal: add token for OPAL_LIFECYCLE Define OPAL_LIFECYCLE token and use it instead of literals in get_lsp_lifecycle. 
Acked-by: Jon Derrick Reviewed-by: Christoph Hellwig Reviewed-by: Scott Bauer Signed-off-by: David Kozub Signed-off-by: Jens Axboe --- block/opal_proto.h | 2 ++ block/sed-opal.c | 4 ++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/block/opal_proto.h b/block/opal_proto.h index e20be8258854..b6e352cfe982 100644 --- a/block/opal_proto.h +++ b/block/opal_proto.h @@ -170,6 +170,8 @@ enum opal_token { OPAL_READLOCKED = 0x07, OPAL_WRITELOCKED = 0x08, OPAL_ACTIVEKEY = 0x0A, + /* lockingsp table */ + OPAL_LIFECYCLE = 0x06, /* locking info table */ OPAL_MAXRANGES = 0x04, /* mbr control */ diff --git a/block/sed-opal.c b/block/sed-opal.c index ad66d1dc725a..7f02e50e2bce 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -1786,12 +1786,12 @@ static int get_lsp_lifecycle(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, 3); /* Start Column */ - add_token_u8(&err, dev, 6); /* Lifecycle Column */ + add_token_u8(&err, dev, OPAL_LIFECYCLE); add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_STARTNAME); add_token_u8(&err, dev, 4); /* End Column */ - add_token_u8(&err, dev, 6); /* Lifecycle Column */ + add_token_u8(&err, dev, OPAL_LIFECYCLE); add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); From 3fff234b851cf7cd638efea658e9cbcf33c3a691 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:16:04 +0100 Subject: [PATCH 062/164] block: sed-opal: unify retrieval of table columns Instead of having multiple places defining the same argument list to get a specific column of a sed-opal table, provide a generic version and call it from those functions. Co-authored-by: David Kozub Signed-off-by: Jonas Rabenstein Signed-off-by: David Kozub Reviewed-by: Scott Bauer Reviewed-by: Christoph Hellwig Reviewed-by: Jon Derrick Signed-off-by: Jens Axboe --- block/sed-opal.c | 130 +++++++++++++++++------------------------------ 1 file changed, 47 insertions(+), 83 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index 7f02e50e2bce..5395ab1c5248 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -1075,6 +1075,37 @@ static int finalize_and_send(struct opal_dev *dev, cont_fn cont) return opal_send_recv(dev, cont); } +/* + * request @column from table @table on device @dev. 
On success, the column + * data will be available in dev->resp->tok[4] + */ +static int generic_get_column(struct opal_dev *dev, const u8 *table, + u64 column) +{ + int err; + + err = cmd_start(dev, table, opalmethod[OPAL_GET]); + + add_token_u8(&err, dev, OPAL_STARTLIST); + + add_token_u8(&err, dev, OPAL_STARTNAME); + add_token_u8(&err, dev, OPAL_STARTCOLUMN); + add_token_u64(&err, dev, column); + add_token_u8(&err, dev, OPAL_ENDNAME); + + add_token_u8(&err, dev, OPAL_STARTNAME); + add_token_u8(&err, dev, OPAL_ENDCOLUMN); + add_token_u64(&err, dev, column); + add_token_u8(&err, dev, OPAL_ENDNAME); + + add_token_u8(&err, dev, OPAL_ENDLIST); + + if (err) + return err; + + return finalize_and_send(dev, parse_and_check_status); +} + static int gen_key(struct opal_dev *dev, void *data) { u8 uid[OPAL_UID_LENGTH]; @@ -1129,23 +1160,11 @@ static int get_active_key(struct opal_dev *dev, void *data) if (err) return err; - err = cmd_start(dev, uid, opalmethod[OPAL_GET]); - add_token_u8(&err, dev, OPAL_STARTLIST); - add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 3); /* startCloumn */ - add_token_u8(&err, dev, 10); /* ActiveKey */ - add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 4); /* endColumn */ - add_token_u8(&err, dev, 10); /* ActiveKey */ - add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); - if (err) { - pr_debug("Error building get active key command\n"); + err = generic_get_column(dev, uid, OPAL_ACTIVEKEY); + if (err) return err; - } - return finalize_and_send(dev, get_active_key_cont); + return get_active_key_cont(dev); } static int generic_lr_enable_disable(struct opal_dev *dev, @@ -1754,14 +1773,16 @@ static int activate_lsp(struct opal_dev *dev, void *data) return finalize_and_send(dev, parse_and_check_status); } -static int get_lsp_lifecycle_cont(struct opal_dev *dev) +/* Determine if we're in the Manufactured Inactive or Active state */ +static int get_lsp_lifecycle(struct opal_dev *dev, void *data) { u8 lc_status; - int error = 0; + int err; - error = parse_and_check_status(dev); - if (error) - return error; + err = generic_get_column(dev, opaluid[OPAL_LOCKINGSP_UID], + OPAL_LIFECYCLE); + if (err) + return err; lc_status = response_get_u64(&dev->parsed, 4); /* 0x08 is Manufactured Inactive */ @@ -1774,49 +1795,19 @@ static int get_lsp_lifecycle_cont(struct opal_dev *dev) return 0; } -/* Determine if we're in the Manufactured Inactive or Active state */ -static int get_lsp_lifecycle(struct opal_dev *dev, void *data) -{ - int err; - - err = cmd_start(dev, opaluid[OPAL_LOCKINGSP_UID], - opalmethod[OPAL_GET]); - - add_token_u8(&err, dev, OPAL_STARTLIST); - - add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 3); /* Start Column */ - add_token_u8(&err, dev, OPAL_LIFECYCLE); - add_token_u8(&err, dev, OPAL_ENDNAME); - - add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 4); /* End Column */ - add_token_u8(&err, dev, OPAL_LIFECYCLE); - add_token_u8(&err, dev, OPAL_ENDNAME); - - add_token_u8(&err, dev, OPAL_ENDLIST); - - if (err) { - pr_debug("Error Building GET Lifecycle Status command\n"); - return err; - } - - return finalize_and_send(dev, get_lsp_lifecycle_cont); -} - -static int get_msid_cpin_pin_cont(struct opal_dev *dev) +static int get_msid_cpin_pin(struct opal_dev *dev, void *data) { const char *msid_pin; size_t strlen; - int error = 0; + int err; - error = parse_and_check_status(dev); - if (error) - return error; + err = generic_get_column(dev, 
opaluid[OPAL_C_PIN_MSID], OPAL_PIN); + if (err) + return err; strlen = response_get_string(&dev->parsed, 4, &msid_pin); if (!msid_pin) { - pr_debug("%s: Couldn't extract PIN from response\n", __func__); + pr_debug("Couldn't extract MSID_CPIN from response\n"); return OPAL_INVAL_PARAM; } @@ -1829,33 +1820,6 @@ static int get_msid_cpin_pin_cont(struct opal_dev *dev) return 0; } -static int get_msid_cpin_pin(struct opal_dev *dev, void *data) -{ - int err; - - err = cmd_start(dev, opaluid[OPAL_C_PIN_MSID], - opalmethod[OPAL_GET]); - - add_token_u8(&err, dev, OPAL_STARTLIST); - add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 3); /* Start Column */ - add_token_u8(&err, dev, 3); /* PIN */ - add_token_u8(&err, dev, OPAL_ENDNAME); - - add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 4); /* End Column */ - add_token_u8(&err, dev, 3); /* Lifecycle Column */ - add_token_u8(&err, dev, OPAL_ENDNAME); - add_token_u8(&err, dev, OPAL_ENDLIST); - - if (err) { - pr_debug("Error building Get MSID CPIN PIN command.\n"); - return err; - } - - return finalize_and_send(dev, get_msid_cpin_pin_cont); -} - static int end_opal_session(struct opal_dev *dev, void *data) { int err = 0; From 372be408447506c43cc1bede4261324ef66d8fb4 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:16:05 +0100 Subject: [PATCH 063/164] block: sed-opal: use named Opal tokens instead of integer literals Replace integer literals by Opal tokens defined in opal_proto.h where possible. Reviewed-by: Christoph Hellwig Acked-by: Jon Derrick Reviewed-by: Scott Bauer Signed-off-by: David Kozub Signed-off-by: Jens Axboe --- block/sed-opal.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index 5395ab1c5248..be0d633783c8 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -1180,12 +1180,12 @@ static int generic_lr_enable_disable(struct opal_dev *dev, add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 5); /* ReadLockEnabled */ + add_token_u8(&err, dev, OPAL_READLOCKENABLED); add_token_u8(&err, dev, rle); add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 6); /* WriteLockEnabled */ + add_token_u8(&err, dev, OPAL_WRITELOCKENABLED); add_token_u8(&err, dev, wle); add_token_u8(&err, dev, OPAL_ENDNAME); @@ -1238,22 +1238,22 @@ static int setup_locking_range(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 3); /* Ranges Start */ + add_token_u8(&err, dev, OPAL_RANGESTART); add_token_u64(&err, dev, setup->range_start); add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 4); /* Ranges length */ + add_token_u8(&err, dev, OPAL_RANGELENGTH); add_token_u64(&err, dev, setup->range_length); add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 5); /*ReadLockEnabled */ + add_token_u8(&err, dev, OPAL_READLOCKENABLED); add_token_u64(&err, dev, !!setup->RLE); add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 6); /*WriteLockEnabled*/ + add_token_u8(&err, dev, OPAL_WRITELOCKENABLED); add_token_u64(&err, dev, !!setup->WLE); add_token_u8(&err, dev, OPAL_ENDNAME); @@ -1472,7 +1472,7 @@ static int set_mbr_done(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_VALUES); 
add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 2); /* Done */ + add_token_u8(&err, dev, OPAL_MBRDONE); add_token_u8(&err, dev, *mbr_done_tf); /* Done T or F */ add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); @@ -1498,7 +1498,7 @@ static int set_mbr_enable_disable(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_VALUES); add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 1); + add_token_u8(&err, dev, OPAL_MBRENABLE); add_token_u8(&err, dev, *mbr_en_dis); add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); @@ -1523,7 +1523,7 @@ static int generic_pw_cmd(u8 *key, size_t key_len, u8 *cpin_uid, add_token_u8(&err, dev, OPAL_VALUES); add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_STARTNAME); - add_token_u8(&err, dev, 3); /* PIN */ + add_token_u8(&err, dev, OPAL_PIN); add_token_bytestring(&err, dev, key, key_len); add_token_u8(&err, dev, OPAL_ENDNAME); add_token_u8(&err, dev, OPAL_ENDLIST); From 3db87236cfb29f143028b91293a8aee01cf932e7 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:16:06 +0100 Subject: [PATCH 064/164] block: sed-opal: pass steps via argument rather than via opal_dev The steps argument is only read by the next function, so it can be passed directly as an argument rather than via opal_dev. Normally, steps is an array on the stack, so the pointer stops being valid when the function that set opal_dev.steps returns. If opal_dev.steps was not set to NULL before return it would become a dangling pointer. When the steps are passed as argument this becomes easier to see and more difficult to misuse. Acked-by: Jon Derrick Reviewed-by: Christoph Hellwig Reviewed-by: Scott Bauer Signed-off-by: David Kozub Signed-off-by: Jens Axboe --- block/sed-opal.c | 156 +++++++++++++++++++++-------------------------- 1 file changed, 68 insertions(+), 88 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index be0d633783c8..f027c0cb682e 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -85,7 +85,6 @@ struct opal_dev { void *data; sec_send_recv *send_recv; - const struct opal_step *steps; struct mutex dev_lock; u16 comid; u32 hsn; @@ -382,37 +381,34 @@ static void check_geometry(struct opal_dev *dev, const void *data) dev->lowest_lba = geo->lowest_aligned_lba; } -static int next(struct opal_dev *dev) +static int next(struct opal_dev *dev, const struct opal_step *steps, + size_t n_steps) { const struct opal_step *step; - int state = 0, error = 0; + size_t state; + int error = 0; - do { - step = &dev->steps[state]; - if (!step->fn) - break; + for (state = 0; state < n_steps; state++) { + step = &steps[state]; error = step->fn(dev, step->data); - if (error) { - pr_debug("Step %d (%pS) failed with error %d: %s\n", - state, step->fn, error, - opal_error_to_human(error)); + if (error) + goto out_error; + } - /* For each OPAL command we do a discovery0 then we - * start some sort of session. - * If we haven't passed state 1 then there was an error - * on discovery0 or during the attempt to start a - * session. Therefore we shouldn't attempt to terminate - * a session, as one has not yet been created. - */ - if (state > 1) { - end_opal_session_error(dev); - return error; - } + return 0; - } - state++; - } while (!error); +out_error: + /* + * For each OPAL command the first step in steps does a discovery0 + * and the second step starts some sort of session.
If an error occurred + * in the first two steps (and thus stopping the loop with state <= 1) + * then there was an error before or during the attempt to start a + * session. Therefore we shouldn't attempt to terminate a session, as + * one has not yet been created. + */ + if (state > 1) + end_opal_session_error(dev); return error; } @@ -1836,17 +1832,13 @@ static int end_opal_session(struct opal_dev *dev, void *data) static int end_opal_session_error(struct opal_dev *dev) { const struct opal_step error_end_session[] = { - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; - dev->steps = error_end_session; - return next(dev); + return next(dev, error_end_session, ARRAY_SIZE(error_end_session)); } -static inline void setup_opal_dev(struct opal_dev *dev, - const struct opal_step *steps) +static inline void setup_opal_dev(struct opal_dev *dev) { - dev->steps = steps; dev->tsn = 0; dev->hsn = 0; dev->prev_data = NULL; @@ -1855,14 +1847,13 @@ static inline void setup_opal_dev(struct opal_dev *dev, static int check_opal_support(struct opal_dev *dev) { const struct opal_step steps[] = { - { opal_discovery0, }, - { NULL, } + { opal_discovery0, } }; int ret; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, steps, ARRAY_SIZE(steps)); dev->supported = !ret; mutex_unlock(&dev->dev_lock); return ret; @@ -1919,14 +1910,13 @@ static int opal_secure_erase_locking_range(struct opal_dev *dev, { start_auth_opal_session, opal_session }, { get_active_key, &opal_session->opal_key.lr }, { gen_key, }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; int ret; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, erase_steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, erase_steps, ARRAY_SIZE(erase_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -1938,14 +1928,13 @@ static int opal_erase_locking_range(struct opal_dev *dev, { opal_discovery0, }, { start_auth_opal_session, opal_session }, { erase_locking_range, opal_session }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; int ret; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, erase_steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, erase_steps, ARRAY_SIZE(erase_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -1963,8 +1952,7 @@ static int opal_enable_disable_shadow_mbr(struct opal_dev *dev, { end_opal_session, }, { start_admin1LSP_opal_session, &opal_mbr->key }, { set_mbr_enable_disable, &enable_disable }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; int ret; @@ -1973,8 +1961,8 @@ static int opal_enable_disable_shadow_mbr(struct opal_dev *dev, return -EINVAL; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, mbr_steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -1991,7 +1979,7 @@ static int opal_save(struct opal_dev *dev, struct opal_lock_unlock *lk_unlk) suspend->lr = lk_unlk->session.opal_key.lr; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, NULL); + setup_opal_dev(dev); add_suspend_info(dev, suspend); mutex_unlock(&dev->dev_lock); return 0; @@ -2004,8 +1992,7 @@ static int opal_add_user_to_lr(struct opal_dev *dev, { opal_discovery0, }, { start_admin1LSP_opal_session, &lk_unlk->session.opal_key }, { add_user_to_lr, lk_unlk }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; int ret; @@ -2027,8 +2014,8 @@ static int opal_add_user_to_lr(struct opal_dev *dev, } 
mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, steps, ARRAY_SIZE(steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2038,14 +2025,13 @@ static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal) const struct opal_step revert_steps[] = { { opal_discovery0, }, { start_SIDASP_opal_session, opal }, - { revert_tper, }, /* controller will terminate session */ - { NULL, } + { revert_tper, } /* controller will terminate session */ }; int ret; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, revert_steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, revert_steps, ARRAY_SIZE(revert_steps)); mutex_unlock(&dev->dev_lock); /* @@ -2065,19 +2051,20 @@ static int __opal_lock_unlock(struct opal_dev *dev, { opal_discovery0, }, { start_auth_opal_session, &lk_unlk->session }, { lock_unlock_locking_range, lk_unlk }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; const struct opal_step unlock_sum_steps[] = { { opal_discovery0, }, { start_auth_opal_session, &lk_unlk->session }, { lock_unlock_locking_range_sum, lk_unlk }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; - dev->steps = lk_unlk->session.sum ? unlock_sum_steps : unlock_steps; - return next(dev); + if (lk_unlk->session.sum) + return next(dev, unlock_sum_steps, + ARRAY_SIZE(unlock_sum_steps)); + else + return next(dev, unlock_steps, ARRAY_SIZE(unlock_steps)); } static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key) @@ -2087,12 +2074,10 @@ static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key) { opal_discovery0, }, { start_admin1LSP_opal_session, key }, { set_mbr_done, &mbr_done_tf }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; - dev->steps = mbrdone_step; - return next(dev); + return next(dev, mbrdone_step, ARRAY_SIZE(mbrdone_step)); } static int opal_lock_unlock(struct opal_dev *dev, @@ -2119,8 +2104,7 @@ static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal) { end_opal_session, }, { start_SIDASP_opal_session, opal }, { set_sid_cpin_pin, opal }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; int ret; @@ -2128,8 +2112,8 @@ static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal) return -ENODEV; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, owner_steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, owner_steps, ARRAY_SIZE(owner_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2142,8 +2126,7 @@ static int opal_activate_lsp(struct opal_dev *dev, { start_SIDASP_opal_session, &opal_lr_act->key }, { get_lsp_lifecycle, }, { activate_lsp, opal_lr_act }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; int ret; @@ -2151,8 +2134,8 @@ static int opal_activate_lsp(struct opal_dev *dev, return -EINVAL; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, active_steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, active_steps, ARRAY_SIZE(active_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2164,14 +2147,13 @@ static int opal_setup_locking_range(struct opal_dev *dev, { opal_discovery0, }, { start_auth_opal_session, &opal_lrs->session }, { setup_locking_range, opal_lrs }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; int ret; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, lr_steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, lr_steps, ARRAY_SIZE(lr_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2182,8 
+2164,7 @@ static int opal_set_new_pw(struct opal_dev *dev, struct opal_new_pw *opal_pw) { opal_discovery0, }, { start_auth_opal_session, &opal_pw->session }, { set_new_pw, &opal_pw->new_user_pw }, - { end_opal_session, }, - { NULL } + { end_opal_session, } }; int ret; @@ -2194,8 +2175,8 @@ static int opal_set_new_pw(struct opal_dev *dev, struct opal_new_pw *opal_pw) return -EINVAL; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, pw_steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, pw_steps, ARRAY_SIZE(pw_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2207,8 +2188,7 @@ static int opal_activate_user(struct opal_dev *dev, { opal_discovery0, }, { start_admin1LSP_opal_session, &opal_session->opal_key }, { internal_activate_user, opal_session }, - { end_opal_session, }, - { NULL, } + { end_opal_session, } }; int ret; @@ -2220,8 +2200,8 @@ static int opal_activate_user(struct opal_dev *dev, } mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, act_steps); - ret = next(dev); + setup_opal_dev(dev); + ret = next(dev, act_steps, ARRAY_SIZE(act_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2238,7 +2218,7 @@ bool opal_unlock_from_suspend(struct opal_dev *dev) return false; mutex_lock(&dev->dev_lock); - setup_opal_dev(dev, NULL); + setup_opal_dev(dev); list_for_each_entry(suspend, &dev->unlk_lst, node) { dev->tsn = 0; From 0af2648ec30cf811b835f01a8501b4747f3a9022 Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:16:07 +0100 Subject: [PATCH 065/164] block: sed-opal: don't repeat opal_discovery0 in each steps array Originally each of the opal functions that call next include opal_discovery0 in the array of steps. This is superfluous and can be done always inside next. Acked-by: Jon Derrick Reviewed-by: Christoph Hellwig Reviewed-by: Scott Bauer Signed-off-by: David Kozub Signed-off-by: Jens Axboe --- block/sed-opal.c | 75 +++++++++++++++++++++++++++--------------------- 1 file changed, 42 insertions(+), 33 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index f027c0cb682e..b947efd6d4d9 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -216,6 +216,7 @@ static const u8 opalmethod[][OPAL_METHOD_LENGTH] = { }; static int end_opal_session_error(struct opal_dev *dev); +static int opal_discovery0_step(struct opal_dev *dev); struct opal_suspend_data { struct opal_lock_unlock unlk; @@ -381,17 +382,33 @@ static void check_geometry(struct opal_dev *dev, const void *data) dev->lowest_lba = geo->lowest_aligned_lba; } +static int execute_step(struct opal_dev *dev, + const struct opal_step *step, size_t stepIndex) +{ + int error = step->fn(dev, step->data); + + if (error) { + pr_debug("Step %zu (%pS) failed with error %d: %s\n", + stepIndex, step->fn, error, + opal_error_to_human(error)); + } + + return error; +} + static int next(struct opal_dev *dev, const struct opal_step *steps, size_t n_steps) { - const struct opal_step *step; - size_t state; - int error = 0; + size_t state = 0; + int error; + + /* first do a discovery0 */ + error = opal_discovery0_step(dev); + if (error) + return error; for (state = 0; state < n_steps; state++) { - step = &steps[state]; - - error = step->fn(dev, step->data); + error = execute_step(dev, &steps[state], state); if (error) goto out_error; } @@ -400,14 +417,14 @@ static int next(struct opal_dev *dev, const struct opal_step *steps, out_error: /* - * For each OPAL command the first step in steps does a discovery0 - * and the second step starts some sort of session. 
If an error occurred - * in the first two steps (and thus stopping the loop with state <= 1) - * then there was an error before or during the attempt to start a - * session. Therefore we shouldn't attempt to terminate a session, as - * one has not yet been created. + * For each OPAL command the first step in steps starts some sort of + * session. If an error occurred in the initial discovery0 or if an + * error occurred in the first step (and thus stopping the loop with + * state == 0) then there was an error before or during the attempt to + * start a session. Therefore we shouldn't attempt to terminate a + * session, as one has not yet been created. */ - if (state > 1) + if (state > 0) end_opal_session_error(dev); return error; @@ -506,6 +523,14 @@ static int opal_discovery0(struct opal_dev *dev, void *data) return opal_discovery0_end(dev); } +static int opal_discovery0_step(struct opal_dev *dev) +{ + const struct opal_step discovery0_step = { + opal_discovery0, + }; + return execute_step(dev, &discovery0_step, 0); +} + static bool can_add(int *err, struct opal_dev *cmd, size_t len) { if (*err) @@ -1831,10 +1856,10 @@ static int end_opal_session(struct opal_dev *dev, void *data) static int end_opal_session_error(struct opal_dev *dev) { - const struct opal_step error_end_session[] = { - { end_opal_session, } + const struct opal_step error_end_session = { + end_opal_session, }; - return next(dev, error_end_session, ARRAY_SIZE(error_end_session)); + return execute_step(dev, &error_end_session, 0); } static inline void setup_opal_dev(struct opal_dev *dev) @@ -1846,14 +1871,11 @@ static inline void setup_opal_dev(struct opal_dev *dev) static int check_opal_support(struct opal_dev *dev) { - const struct opal_step steps[] = { - { opal_discovery0, } - }; int ret; mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, steps, ARRAY_SIZE(steps)); + ret = opal_discovery0_step(dev); dev->supported = !ret; mutex_unlock(&dev->dev_lock); return ret; @@ -1906,7 +1928,6 @@ static int opal_secure_erase_locking_range(struct opal_dev *dev, struct opal_session_info *opal_session) { const struct opal_step erase_steps[] = { - { opal_discovery0, }, { start_auth_opal_session, opal_session }, { get_active_key, &opal_session->opal_key.lr }, { gen_key, }, @@ -1925,7 +1946,6 @@ static int opal_erase_locking_range(struct opal_dev *dev, struct opal_session_info *opal_session) { const struct opal_step erase_steps[] = { - { opal_discovery0, }, { start_auth_opal_session, opal_session }, { erase_locking_range, opal_session }, { end_opal_session, } @@ -1946,7 +1966,6 @@ static int opal_enable_disable_shadow_mbr(struct opal_dev *dev, OPAL_TRUE : OPAL_FALSE; const struct opal_step mbr_steps[] = { - { opal_discovery0, }, { start_admin1LSP_opal_session, &opal_mbr->key }, { set_mbr_done, &enable_disable }, { end_opal_session, }, @@ -1989,7 +2008,6 @@ static int opal_add_user_to_lr(struct opal_dev *dev, struct opal_lock_unlock *lk_unlk) { const struct opal_step steps[] = { - { opal_discovery0, }, { start_admin1LSP_opal_session, &lk_unlk->session.opal_key }, { add_user_to_lr, lk_unlk }, { end_opal_session, } @@ -2023,7 +2041,6 @@ static int opal_add_user_to_lr(struct opal_dev *dev, static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal) { const struct opal_step revert_steps[] = { - { opal_discovery0, }, { start_SIDASP_opal_session, opal }, { revert_tper, } /* controller will terminate session */ }; @@ -2048,13 +2065,11 @@ static int __opal_lock_unlock(struct opal_dev *dev, struct opal_lock_unlock 
*lk_unlk) { const struct opal_step unlock_steps[] = { - { opal_discovery0, }, { start_auth_opal_session, &lk_unlk->session }, { lock_unlock_locking_range, lk_unlk }, { end_opal_session, } }; const struct opal_step unlock_sum_steps[] = { - { opal_discovery0, }, { start_auth_opal_session, &lk_unlk->session }, { lock_unlock_locking_range_sum, lk_unlk }, { end_opal_session, } @@ -2071,7 +2086,6 @@ static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key) { u8 mbr_done_tf = OPAL_TRUE; const struct opal_step mbrdone_step[] = { - { opal_discovery0, }, { start_admin1LSP_opal_session, key }, { set_mbr_done, &mbr_done_tf }, { end_opal_session, } @@ -2098,7 +2112,6 @@ static int opal_lock_unlock(struct opal_dev *dev, static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal) { const struct opal_step owner_steps[] = { - { opal_discovery0, }, { start_anybodyASP_opal_session, }, { get_msid_cpin_pin, }, { end_opal_session, }, @@ -2122,7 +2135,6 @@ static int opal_activate_lsp(struct opal_dev *dev, struct opal_lr_act *opal_lr_act) { const struct opal_step active_steps[] = { - { opal_discovery0, }, { start_SIDASP_opal_session, &opal_lr_act->key }, { get_lsp_lifecycle, }, { activate_lsp, opal_lr_act }, @@ -2144,7 +2156,6 @@ static int opal_setup_locking_range(struct opal_dev *dev, struct opal_user_lr_setup *opal_lrs) { const struct opal_step lr_steps[] = { - { opal_discovery0, }, { start_auth_opal_session, &opal_lrs->session }, { setup_locking_range, opal_lrs }, { end_opal_session, } @@ -2161,7 +2172,6 @@ static int opal_setup_locking_range(struct opal_dev *dev, static int opal_set_new_pw(struct opal_dev *dev, struct opal_new_pw *opal_pw) { const struct opal_step pw_steps[] = { - { opal_discovery0, }, { start_auth_opal_session, &opal_pw->session }, { set_new_pw, &opal_pw->new_user_pw }, { end_opal_session, } @@ -2185,7 +2195,6 @@ static int opal_activate_user(struct opal_dev *dev, struct opal_session_info *opal_session) { const struct opal_step act_steps[] = { - { opal_discovery0, }, { start_admin1LSP_opal_session, &opal_session->opal_key }, { internal_activate_user, opal_session }, { end_opal_session, } From a80f36cc64f09956686f3729bf3eb65b7abfc32e Mon Sep 17 00:00:00 2001 From: David Kozub Date: Thu, 14 Feb 2019 01:16:08 +0100 Subject: [PATCH 066/164] block: sed-opal: rename next to execute_steps As the function is responsible for executing the individual steps supplied in the steps argument, execute_steps is a more descriptive name than the rather generic next. 
Signed-off-by: David Kozub Reviewed-by: Scott Bauer Reviewed-by: Christoph Hellwig Reviewed-by: Jon Derrick Signed-off-by: Jens Axboe --- block/sed-opal.c | 33 +++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index b947efd6d4d9..b1aa0cc25803 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -396,8 +396,8 @@ static int execute_step(struct opal_dev *dev, return error; } -static int next(struct opal_dev *dev, const struct opal_step *steps, - size_t n_steps) +static int execute_steps(struct opal_dev *dev, + const struct opal_step *steps, size_t n_steps) { size_t state = 0; int error; @@ -1937,7 +1937,7 @@ static int opal_secure_erase_locking_range(struct opal_dev *dev, mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, erase_steps, ARRAY_SIZE(erase_steps)); + ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -1954,7 +1954,7 @@ static int opal_erase_locking_range(struct opal_dev *dev, mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, erase_steps, ARRAY_SIZE(erase_steps)); + ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -1981,7 +1981,7 @@ static int opal_enable_disable_shadow_mbr(struct opal_dev *dev, mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); + ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2033,7 +2033,7 @@ static int opal_add_user_to_lr(struct opal_dev *dev, mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, steps, ARRAY_SIZE(steps)); + ret = execute_steps(dev, steps, ARRAY_SIZE(steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2048,7 +2048,7 @@ static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal) mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, revert_steps, ARRAY_SIZE(revert_steps)); + ret = execute_steps(dev, revert_steps, ARRAY_SIZE(revert_steps)); mutex_unlock(&dev->dev_lock); /* @@ -2076,10 +2076,11 @@ static int __opal_lock_unlock(struct opal_dev *dev, }; if (lk_unlk->session.sum) - return next(dev, unlock_sum_steps, - ARRAY_SIZE(unlock_sum_steps)); + return execute_steps(dev, unlock_sum_steps, + ARRAY_SIZE(unlock_sum_steps)); else - return next(dev, unlock_steps, ARRAY_SIZE(unlock_steps)); + return execute_steps(dev, unlock_steps, + ARRAY_SIZE(unlock_steps)); } static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key) @@ -2091,7 +2092,7 @@ static int __opal_set_mbr_done(struct opal_dev *dev, struct opal_key *key) { end_opal_session, } }; - return next(dev, mbrdone_step, ARRAY_SIZE(mbrdone_step)); + return execute_steps(dev, mbrdone_step, ARRAY_SIZE(mbrdone_step)); } static int opal_lock_unlock(struct opal_dev *dev, @@ -2126,7 +2127,7 @@ static int opal_take_ownership(struct opal_dev *dev, struct opal_key *opal) mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, owner_steps, ARRAY_SIZE(owner_steps)); + ret = execute_steps(dev, owner_steps, ARRAY_SIZE(owner_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2147,7 +2148,7 @@ static int opal_activate_lsp(struct opal_dev *dev, mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, active_steps, ARRAY_SIZE(active_steps)); + ret = execute_steps(dev, active_steps, ARRAY_SIZE(active_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2164,7 +2165,7 @@ static int 
opal_setup_locking_range(struct opal_dev *dev, mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, lr_steps, ARRAY_SIZE(lr_steps)); + ret = execute_steps(dev, lr_steps, ARRAY_SIZE(lr_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2186,7 +2187,7 @@ static int opal_set_new_pw(struct opal_dev *dev, struct opal_new_pw *opal_pw) mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, pw_steps, ARRAY_SIZE(pw_steps)); + ret = execute_steps(dev, pw_steps, ARRAY_SIZE(pw_steps)); mutex_unlock(&dev->dev_lock); return ret; } @@ -2210,7 +2211,7 @@ static int opal_activate_user(struct opal_dev *dev, mutex_lock(&dev->dev_lock); setup_opal_dev(dev); - ret = next(dev, act_steps, ARRAY_SIZE(act_steps)); + ret = execute_steps(dev, act_steps, ARRAY_SIZE(act_steps)); mutex_unlock(&dev->dev_lock); return ret; } From 9bc00750f5b6a33764918be4f80745d936c95f1a Mon Sep 17 00:00:00 2001 From: Dongli Zhang Date: Tue, 12 Mar 2019 09:31:56 +0800 Subject: [PATCH 067/164] virtio_blk: replace 0 by HCTX_TYPE_DEFAULT to index blk_mq_tag_set->map Use HCTX_TYPE_DEFAULT instead of 0 to avoid hardcoding. Signed-off-by: Dongli Zhang Signed-off-by: Jens Axboe --- drivers/block/virtio_blk.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 4bc083b7c9b5..bed6035be4cc 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -691,7 +691,8 @@ static int virtblk_map_queues(struct blk_mq_tag_set *set) { struct virtio_blk *vblk = set->driver_data; - return blk_mq_virtio_map_queues(&set->map[0], vblk->vdev, 0); + return blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT], + vblk->vdev, 0); } #ifdef CONFIG_VIRTIO_BLK_SCSI From d0b0a81acbd809228b57fb27a89028ecd0fc542a Mon Sep 17 00:00:00 2001 From: Hisao Tanabe Date: Mon, 8 Apr 2019 00:27:42 +0900 Subject: [PATCH 068/164] block: remove unused variable 'def' The 'def' local variable became unused after commit f382fb0bcef4 ("block: remove legacy IO schedulers"), let's remove it. Reviewed-by: Christoph Hellwig Signed-off-by: Hisao Tanabe Signed-off-by: Jens Axboe --- block/elevator.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/block/elevator.c b/block/elevator.c index d6d835a08de6..2e5399d9f40f 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -509,8 +509,6 @@ void elv_unregister_queue(struct request_queue *q) int elv_register(struct elevator_type *e) { - char *def = ""; - /* create icq_cache if requested */ if (e->icq_size) { if (WARN_ON(e->icq_size < sizeof(struct io_cq)) || @@ -535,8 +533,8 @@ int elv_register(struct elevator_type *e) list_add_tail(&e->list, &elv_list); spin_unlock(&elv_list_lock); - printk(KERN_INFO "io scheduler %s registered%s\n", e->elevator_name, - def); + printk(KERN_INFO "io scheduler %s registered\n", e->elevator_name); + return 0; } EXPORT_SYMBOL_GPL(elv_register); From 636b8fe86bede8c9f797365986b8406ff2183f13 Mon Sep 17 00:00:00 2001 From: Angelo Ruocco Date: Mon, 8 Apr 2019 17:35:34 +0200 Subject: [PATCH 069/164] block, bfq: fix some typos in comments Some of the comments in the bfq files had typos. This patch fixes them. 
Signed-off-by: Angelo Ruocco Signed-off-by: Paolo Valente Signed-off-by: Jens Axboe --- block/bfq-cgroup.c | 2 +- block/bfq-iosched.c | 16 ++++++++-------- block/bfq-iosched.h | 4 ++-- block/bfq-wf2q.c | 10 +++++----- 4 files changed, 16 insertions(+), 16 deletions(-) diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 2a74a3f2a8f7..793c027ca60e 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -1103,7 +1103,7 @@ struct cftype bfq_blkcg_legacy_files[] = { }, #endif /* CONFIG_DEBUG_BLK_CGROUP */ - /* the same statictics which cover the bfqg and its descendants */ + /* the same statistics which cover the bfqg and its descendants */ { .name = "bfq.io_service_bytes_recursive", .private = (unsigned long)&blkcg_policy_bfq, diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index ceb06abd73df..bcaa4bde3961 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -189,7 +189,7 @@ static const int bfq_default_max_budget = 16 * 1024; /* * When a sync request is dispatched, the queue that contains that * request, and all the ancestor entities of that queue, are charged - * with the number of sectors of the request. In constrast, if the + * with the number of sectors of the request. In contrast, if the * request is async, then the queue and its ancestor entities are * charged with the number of sectors of the request, multiplied by * the factor below. This throttles the bandwidth for async I/O, @@ -217,7 +217,7 @@ const int bfq_timeout = HZ / 8; * queue merging. * * As can be deduced from the low time limit below, queue merging, if - * successful, happens at the very beggining of the I/O of the involved + * successful, happens at the very beginning of the I/O of the involved * cooperating processes, as a consequence of the arrival of the very * first requests from each cooperator. After that, there is very * little chance to find cooperators. @@ -441,7 +441,7 @@ void bfq_schedule_dispatch(struct bfq_data *bfqd) /* * Lifted from AS - choose which of rq1 and rq2 that is best served now. - * We choose the request that is closesr to the head right now. Distance + * We choose the request that is closer to the head right now. Distance * behind the head is penalized and only allowed to a certain extent. */ static struct request *bfq_choose_req(struct bfq_data *bfqd, @@ -989,7 +989,7 @@ static unsigned int bfq_wr_duration(struct bfq_data *bfqd) * of several files * mplayer took 23 seconds to start, if constantly weight-raised. * - * As for higher values than that accomodating the above bad + * As for higher values than that accommodating the above bad * scenario, tests show that higher values would often yield * the opposite of the desired result, i.e., would worsen * responsiveness by allowing non-interactive applications to @@ -2636,8 +2636,8 @@ static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq, /* * bic still points to bfqq, then it has not yet been * redirected to some other bfq_queue, and a queue - * merge beween bfqq and new_bfqq can be safely - * fulfillled, i.e., bic can be redirected to new_bfqq + * merge between bfqq and new_bfqq can be safely + * fulfilled, i.e., bic can be redirected to new_bfqq * and bfqq can be put. */ bfq_merge_bfqqs(bfqd, bfqd->bio_bic, bfqq, @@ -3089,7 +3089,7 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq) /* * All in-service entities must have been properly deactivated * or requeued before executing the next function, which - * resets all in-service entites as no more in service. 
+ * resets all in-service entities as no more in service. */ __bfq_bfqd_reset_in_service(bfqd); } @@ -5632,7 +5632,7 @@ static void bfq_prepare_request(struct request *rq, struct bio *bio) * preparation is that, after the prepare_request hook is invoked for * rq, rq may still be transformed into a request with no icq, i.e., a * request not associated with any queue. No bfq hook is invoked to - * signal this tranformation. As a consequence, should these + * signal this transformation. As a consequence, should these * preparation operations be performed when the prepare_request hook * is invoked, and should rq be transformed one moment later, bfq * would end up in an inconsistent state, because it would have diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index 60c148728cc5..e7dc07cf9a57 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -91,7 +91,7 @@ struct bfq_service_tree { * expiration. This peculiar definition allows for the following * optimization, not yet exploited: while a given entity is still in * service, we already know which is the best candidate for next - * service among the other active entitities in the same parent + * service among the other active entities in the same parent * entity. We can then quickly compare the timestamps of the * in-service entity with those of such best candidate. * @@ -142,7 +142,7 @@ struct bfq_weight_counter { * * Unless cgroups are used, the weight value is calculated from the * ioprio to export the same interface as CFQ. When dealing with - * ``well-behaved'' queues (i.e., queues that do not spend too much + * "well-behaved" queues (i.e., queues that do not spend too much * time to consume their budget and have true sequential behavior, and * when there are no external factors breaking anticipation) the * relative weights at each level of the cgroups hierarchy should be diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c index 51ef1f00df80..d2ea98ef26a3 100644 --- a/block/bfq-wf2q.c +++ b/block/bfq-wf2q.c @@ -59,7 +59,7 @@ static bool bfq_update_parent_budget(struct bfq_entity *next_in_service); * bfq_update_next_in_service - update sd->next_in_service * @sd: sched_data for which to perform the update. * @new_entity: if not NULL, pointer to the entity whose activation, - * requeueing or repositionig triggered the invocation of + * requeueing or repositioning triggered the invocation of * this function. * @expiration: id true, this function is being invoked after the * expiration of the in-service entity @@ -90,7 +90,7 @@ static bool bfq_update_next_in_service(struct bfq_sched_data *sd, /* * If this update is triggered by the activation, requeueing - * or repositiong of an entity that does not coincide with + * or repositioning of an entity that does not coincide with * sd->next_in_service, then a full lookup in the active tree * can be avoided. In fact, it is enough to check whether the * just-modified entity has the same priority as @@ -1396,7 +1396,7 @@ left: * In this first case, update the virtual time in @st too (see the * comments on this update inside the function). 
* - * In constrast, if there is an in-service entity, then return the + * In contrast, if there is an in-service entity, then return the * entity that would be set in service if not only the above * conditions, but also the next one held true: the currently * in-service entity, on expiration, @@ -1479,12 +1479,12 @@ static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd, * is being invoked as a part of the expiration path * of the in-service queue. In this case, even if * sd->in_service_entity is not NULL, - * sd->in_service_entiy at this point is actually not + * sd->in_service_entity at this point is actually not * in service any more, and, if needed, has already * been properly queued or requeued into the right * tree. The reason why sd->in_service_entity is still * not NULL here, even if expiration is true, is that - * sd->in_service_entiy is reset as a last step in the + * sd->in_service_entity is reset as a last step in the * expiration path. So, if expiration is true, tell * __bfq_lookup_next_entity that there is no * sd->in_service_entity. From b21e11c5c8311b8bf6923ff29d57f2a5f997e939 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 2 Apr 2019 10:26:44 +0800 Subject: [PATCH 070/164] block: fix build warning in merging bvecs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable") changes bvec merge by only considering two bvecs from different bios. However, if the former bio doesn't include any io bvec, then the following warning may be triggered: warning: ‘bvec.bv_offset’ may be used uninitialized in this function [-Wmaybe-uninitialized] In practice, it shouldn't be triggered. Fix it by adding a check on the former bio; the check shouldn't add any cost given 'bio->bi_iter' can be hit in cache.
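To see why the compiler complains, here is a minimal, self-contained userspace sketch (plain C, not kernel code) of the same pattern: a variable that is only assigned under a condition inside a loop and read back on a later iteration. The logic is safe, but GCC cannot prove the first read is preceded by a write, so -Wmaybe-uninitialized may fire:

#include <stdio.h>

/* 'prev' mirrors bvprv: assigned only when the element carries payload,
 * read only once 'have_prev' says a previous payload element exists.
 */
static int sum_adjacent(const int *vals, int n)
{
	int prev;		/* may be flagged as maybe-uninitialized */
	int have_prev = 0, sum = 0;

	for (int i = 0; i < n; i++) {
		if (have_prev)
			sum += prev + vals[i];
		if (vals[i] != 0) {	/* analogous to the bi_size check */
			prev = vals[i];
			have_prev = 1;
		}
	}
	return sum;
}

int main(void)
{
	int v[] = { 3, 0, 2, 5 };
	printf("%d\n", sum_adjacent(v, 4));
	return 0;
}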
Reported-by: Jens Axboe Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable") Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-merge.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index 8f96d683b577..895795cdb145 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -353,7 +353,7 @@ EXPORT_SYMBOL(blk_queue_split); static unsigned int __blk_recalc_rq_segments(struct request_queue *q, struct bio *bio) { - struct bio_vec bv, bvprv = { NULL }; + struct bio_vec uninitialized_var(bv), bvprv = { NULL }; unsigned int seg_size, nr_phys_segs; unsigned front_seg_size; struct bio *fbio, *bbio; @@ -400,8 +400,10 @@ new_segment: new_bio = false; } bbio = bio; - bvprv = bv; - new_bio = true; + if (likely(bio->bi_iter.bi_size)) { + bvprv = bv; + new_bio = true; + } } fbio->bi_seg_front_size = front_seg_size; @@ -527,7 +529,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio, struct scatterlist *sglist, struct scatterlist **sg) { - struct bio_vec bvec, bvprv = { NULL }; + struct bio_vec uninitialized_var(bvec), bvprv = { NULL }; struct bvec_iter iter; int nsegs = 0; bool new_bio = false; @@ -550,8 +552,10 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio, next_bvec: new_bio = false; } - bvprv = bvec; - new_bio = true; + if (likely(bio->bi_iter.bi_size)) { + bvprv = bvec; + new_bio = true; + } } return nsegs; From 0d413829bd20d5563c2c3287479a7348810cb13f Mon Sep 17 00:00:00 2001 From: Minwoo Im Date: Sun, 7 Apr 2019 17:19:38 +0900 Subject: [PATCH 071/164] block: null: Add documentation for "zone_nr_conv" param The zone_nr_conv module parameter was introduced by commit ea2c18e1 ("null_blk: Add conventional zone configuration for zoned support"). This patch adds a description of the "zone_nr_conv" module parameter to the document. Reviewed-by: Chaitanya Kulkarni Signed-off-by: Minwoo Im Signed-off-by: Jens Axboe --- Documentation/block/null_blk.txt | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt index 4cad1024fff7..41f0a3d33bbd 100644 --- a/Documentation/block/null_blk.txt +++ b/Documentation/block/null_blk.txt @@ -93,3 +93,7 @@ zoned=[0/1]: Default: 0 zone_size=[MB]: Default: 256 Per zone size when exposed as a zoned block device. Must be a power of two. + +zone_nr_conv=[nr_conv]: Default: 0 + The number of conventional zones to create when block device is zoned. If + zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1. From ee37e62191a59d253fc916b9fc763deb777211e2 Mon Sep 17 00:00:00 2001 From: Yufen Yu Date: Tue, 2 Apr 2019 14:22:14 +0800 Subject: [PATCH 072/164] md: add mddev->pers to avoid potential NULL pointer dereference When doing re-add, we need to ensure rdev->mddev->pers is not NULL, which avoids a potential NULL pointer dereference in the following add_bound_rdev().
Fixes: a6da4ef85cef ("md: re-add a failed disk") Cc: Xiao Ni Cc: NeilBrown Cc: # 4.4+ Reviewed-by: NeilBrown Signed-off-by: Yufen Yu Signed-off-by: Song Liu --- drivers/md/md.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 1fa2682951f1..664b77ceaf2d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2850,8 +2850,10 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) err = 0; } } else if (cmd_match(buf, "re-add")) { - if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1) && - rdev->saved_raid_disk >= 0) { + if (!rdev->mddev->pers) + err = -EINVAL; + else if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1) && + rdev->saved_raid_disk >= 0) { /* clear_bit is performed _after_ all the devices * have their local Faulty bit cleared. If any writes * happen in the meantime in the local node, they From ed4d0a4ea11e19863952ac6a7cea3bbb27ccd452 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:10 +0200 Subject: [PATCH 073/164] md: add a missing endianness conversion in check_sb_changes The on-disk value is little endian and we need to convert it to native endian before storing the value in the in-core structure. Fixes: 7564beda19b36 ("md-cluster/raid10: support add disk under grow mode") Cc: # 4.20+ Acked-by: Guoqing Jiang Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 664b77ceaf2d..a15eb7e37c6d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9227,7 +9227,7 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev) * reshape is happening in the remote node, we need to * update reshape_position and call start_reshape. */ - mddev->reshape_position = sb->reshape_position; + mddev->reshape_position = le64_to_cpu(sb->reshape_position); if (mddev->pers->update_reshape_pos) mddev->pers->update_reshape_pos(mddev); if (mddev->pers->start_reshape) From c35403f82ced283807f31eeafc2a7aebf1b20331 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:11 +0200 Subject: [PATCH 074/164] md: use correct types in md_bitmap_print_sb If we want to convert from a little endian format we need to cast to a little endian type, otherwise sparse will be unhappy. 
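As background for this group of sparse fixes: on-disk little-endian fields are declared with the __le32/__le64 "bitwise" types, and sparse (run via `make C=1`) warns whenever such a value is used without going through le32_to_cpu()/le64_to_cpu(). A short kernel-style sketch, illustrative only and assuming the usual kernel headers:

struct example_sb {
	__le32 magic;			/* stored little endian on disk */
	__le64 events;			/* stored little endian on disk */
};

static void example_print(const struct example_sb *sb)
{
	/* Correct: convert to native endianness before use */
	pr_debug("magic:  %08x\n", le32_to_cpu(sb->magic));
	pr_debug("events: %llu\n",
		 (unsigned long long)le64_to_cpu(sb->events));

	/* Using sb->magic directly here would draw a sparse warning and
	 * would be wrong on big-endian machines.
	 */
}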
Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md-bitmap.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index 1cd4f991792c..3a62a46b75c7 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -490,10 +490,10 @@ void md_bitmap_print_sb(struct bitmap *bitmap) pr_debug(" magic: %08x\n", le32_to_cpu(sb->magic)); pr_debug(" version: %d\n", le32_to_cpu(sb->version)); pr_debug(" uuid: %08x.%08x.%08x.%08x\n", - le32_to_cpu(*(__u32 *)(sb->uuid+0)), - le32_to_cpu(*(__u32 *)(sb->uuid+4)), - le32_to_cpu(*(__u32 *)(sb->uuid+8)), - le32_to_cpu(*(__u32 *)(sb->uuid+12))); + le32_to_cpu(*(__le32 *)(sb->uuid+0)), + le32_to_cpu(*(__le32 *)(sb->uuid+4)), + le32_to_cpu(*(__le32 *)(sb->uuid+8)), + le32_to_cpu(*(__le32 *)(sb->uuid+12))); pr_debug(" events: %llu\n", (unsigned long long) le64_to_cpu(sb->events)); pr_debug("events cleared: %llu\n", From 00485d094244db47ae2bf4f738e8d0f8d9f5890f Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:12 +0200 Subject: [PATCH 075/164] md: use correct type in super_1_load If we want to convert from a little endian format we need to cast to a little endian type, otherwise sparse will be unhappy. Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index a15eb7e37c6d..7a74e2de40b5 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1548,7 +1548,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ */ s32 offset; sector_t bb_sector; - u64 *bbp; + __le64 *bbp; int i; int sectors = le16_to_cpu(sb->bblog_size); if (sectors > (PAGE_SIZE / 512)) @@ -1560,7 +1560,7 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_ if (!sync_page_io(rdev, bb_sector, sectors << 9, rdev->bb_page, REQ_OP_READ, 0, true)) return -EIO; - bbp = (u64 *)page_address(rdev->bb_page); + bbp = (__le64 *)page_address(rdev->bb_page); rdev->badblocks.shift = sb->bblog_shift; for (i = 0 ; i < (sectors << (9-3)) ; i++, bbp++) { u64 bb = le64_to_cpu(*bbp); From ae50640bebc48f1fc0092f16ea004c7c4d12c985 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:13 +0200 Subject: [PATCH 076/164] md: use correct type in super_1_sync If we want to convert from a little endian format we need to cast to a little endian type, otherwise sparse will be unhappy. Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 7a74e2de40b5..4a31380b0eff 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -1872,7 +1872,7 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev) md_error(mddev, rdev); else { struct badblocks *bb = &rdev->badblocks; - u64 *bbp = (u64 *)page_address(rdev->bb_page); + __le64 *bbp = (__le64 *)page_address(rdev->bb_page); u64 *p = bb->page; sb->feature_map |= cpu_to_le32(MD_FEATURE_BAD_BLOCKS); if (bb->changed) { From 2b598ee54a1e50323143a613aa29eb40377d8792 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:14 +0200 Subject: [PATCH 077/164] md: mark md_cluster_mod static Sparse complains that it has no external declaration, and it turns out that it is never even used outside of md.c. So just mark it static and drop the export. 
Acked-by: Guoqing Jiang Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/md.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 4a31380b0eff..541015373f6a 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -88,8 +88,7 @@ static struct kobj_type md_ktype; struct md_cluster_operations *md_cluster_ops; EXPORT_SYMBOL(md_cluster_ops); -struct module *md_cluster_mod; -EXPORT_SYMBOL(md_cluster_mod); +static struct module *md_cluster_mod; static DECLARE_WAIT_QUEUE_HEAD(resync_wait); static struct workqueue_struct *md_wq; From 368ecade0532982c8916a49e66b8105413d8db59 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:15 +0200 Subject: [PATCH 078/164] md: add __acquires/__releases annotations to (un)lock_two_stripes This tells sparse that we acquire/release the two stripe locks and avoids a warning. Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/raid5.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 364dd2f6fa1b..d794d8745144 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -711,6 +711,8 @@ static bool is_full_stripe_write(struct stripe_head *sh) } static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) + __acquires(&sh1->stripe_lock) + __acquires(&sh2->stripe_lock) { if (sh1 > sh2) { spin_lock_irq(&sh2->stripe_lock); @@ -722,6 +724,8 @@ static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) } static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) + __releases(&sh1->stripe_lock) + __releases(&sh2->stripe_lock) { spin_unlock(&sh1->stripe_lock); spin_unlock_irq(&sh2->stripe_lock); From efcd487c69b9d968552a6bf80e7839c4f28b419d Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 4 Apr 2019 18:56:16 +0200 Subject: [PATCH 079/164] md: add __acquires/__releases annotations to handle_active_stripes This tells sparse that we release and reacquire the device_lock and avoids a warning. Signed-off-by: Christoph Hellwig Signed-off-by: Song Liu --- drivers/md/raid5.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index d794d8745144..2b0a715e70c9 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -6159,6 +6159,8 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio, static int handle_active_stripes(struct r5conf *conf, int group, struct r5worker *worker, struct list_head *temp_inactive_list) + __releases(&conf->device_lock) + __acquires(&conf->device_lock) { struct stripe_head *batch[MAX_STRIPE_BATCH], *sh; int i, batch_size = 0, hash; From 8a96a0e408102fb7aa73d8aa0b5e2219cfd51e55 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 11 Apr 2019 08:23:27 +0200 Subject: [PATCH 080/164] block: rewrite blk_bvec_map_sg to avoid a nth_page call The offset in scatterlists is allowed to be larger than the page size, so don't go to great length to avoid that case and simplify the arithmetics. 
Signed-off-by: Christoph Hellwig Reviewed-by: Bart Van Assche Reviewed-by: Ming Lei Reviewed-by: Johannes Thumshirn Signed-off-by: Jens Axboe --- block/blk-merge.c | 21 ++++++--------------- 1 file changed, 6 insertions(+), 15 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index 895795cdb145..247b17f2a0f6 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -469,26 +469,17 @@ static unsigned blk_bvec_map_sg(struct request_queue *q, struct scatterlist **sg) { unsigned nbytes = bvec->bv_len; - unsigned nsegs = 0, total = 0, offset = 0; + unsigned nsegs = 0, total = 0; while (nbytes > 0) { - unsigned seg_size; - struct page *pg; - unsigned idx; + unsigned offset = bvec->bv_offset + total; + unsigned len = min(get_max_segment_size(q, offset), nbytes); *sg = blk_next_sg(sg, sglist); + sg_set_page(*sg, bvec->bv_page, len, offset); - seg_size = get_max_segment_size(q, bvec->bv_offset + total); - seg_size = min(nbytes, seg_size); - - offset = (total + bvec->bv_offset) % PAGE_SIZE; - idx = (total + bvec->bv_offset) / PAGE_SIZE; - pg = bvec_nth_page(bvec->bv_page, idx); - - sg_set_page(*sg, pg, seg_size, offset); - - total += seg_size; - nbytes -= seg_size; + total += len; + nbytes -= len; nsegs++; } From a10584c3cda9cbb1a1ccd072783bfd625f99e40d Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 11 Apr 2019 08:23:28 +0200 Subject: [PATCH 081/164] block: refactor __bio_iov_bvec_add_pages Return early on error, and add an unlikely annotation for that case. Reviewed-by: Ming Lei Signed-off-by: Christoph Hellwig Reviewed-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Signed-off-by: Jens Axboe --- block/bio.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/block/bio.c b/block/bio.c index c2592c5d70b9..ad346082a971 100644 --- a/block/bio.c +++ b/block/bio.c @@ -873,20 +873,19 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count); size = bio_add_page(bio, bv->bv_page, len, bv->bv_offset + iter->iov_offset); - if (size == len) { - if (!bio_flagged(bio, BIO_NO_PAGE_REF)) { - struct page *page; - int i; + if (unlikely(size != len)) + return -EINVAL; - mp_bvec_for_each_page(page, bv, i) - get_page(page); - } + if (!bio_flagged(bio, BIO_NO_PAGE_REF)) { + struct page *page; + int i; - iov_iter_advance(iter, size); - return 0; + mp_bvec_for_each_page(page, bv, i) + get_page(page); } - return -EINVAL; + iov_iter_advance(iter, size); + return 0; } #define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *)) From 14eacf12dbc75352fa746dfd9e24de3170ba5ff5 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 11 Apr 2019 08:23:29 +0200 Subject: [PATCH 082/164] block: don't allow multiple bio_iov_iter_get_pages calls per bio No caller uses bio_iov_iter_get_pages multiple times on a given bio, and that funtionality isn't all that useful. Removing it will make some future changes a little easier and also simplifies the function a bit. 
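For context, the call pattern this change assumes looks roughly like the hedged sketch below (not taken verbatim from any caller; field setup, completion handling and most error paths are omitted): each bio gets at most one bio_iov_iter_get_pages() call, and the caller allocates fresh bios while the iterator still holds data.

while (iov_iter_count(iter)) {
	struct bio *bio = bio_alloc(GFP_KERNEL, BIO_MAX_PAGES);
	int ret;

	ret = bio_iov_iter_get_pages(bio, iter);	/* at most once per bio */
	if (ret) {
		bio_put(bio);
		break;
	}
	/* set bi_disk, bi_opf, end_io, etc., then... */
	submit_bio(bio);
}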
Reviewed-by: Ming Lei Reviewed-by: Bart Van Assche Signed-off-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Signed-off-by: Jens Axboe --- block/bio.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/block/bio.c b/block/bio.c index ad346082a971..c2a389b1509a 100644 --- a/block/bio.c +++ b/block/bio.c @@ -958,7 +958,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) { const bool is_bvec = iov_iter_is_bvec(iter); - unsigned short orig_vcnt = bio->bi_vcnt; + int ret; + + if (WARN_ON_ONCE(bio->bi_vcnt)) + return -EINVAL; /* * If this is a BVEC iter, then the pages are kernel pages. Don't @@ -968,19 +971,13 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) bio_set_flag(bio, BIO_NO_PAGE_REF); do { - int ret; - if (is_bvec) ret = __bio_iov_bvec_add_pages(bio, iter); else ret = __bio_iov_iter_get_pages(bio, iter); + } while (!ret && iov_iter_count(iter) && !bio_full(bio)); - if (unlikely(ret)) - return bio->bi_vcnt > orig_vcnt ? 0 : ret; - - } while (iov_iter_count(iter) && !bio_full(bio)); - - return 0; + return bio->bi_vcnt ? 0 : ret; } static void submit_bio_wait_endio(struct bio *bio) From 7321ecbfc7cf85211460a1dc6bb0ccfc3dcf9df0 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 11 Apr 2019 08:23:30 +0200 Subject: [PATCH 083/164] block: change how we get page references in bio_iov_iter_get_pages Instead of needing a special macro to iterate over all pages in a bvec just do a second passs over the whole bio. This also matches what we do on the release side. The release side helper is moved up to where we need the get helper to clearly express the symmetry. Reviewed-by: Ming Lei Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/bio.c | 51 ++++++++++++++++++++++---------------------- include/linux/bvec.h | 5 ----- 2 files changed, 25 insertions(+), 31 deletions(-) diff --git a/block/bio.c b/block/bio.c index c2a389b1509a..d3490aeb1a7e 100644 --- a/block/bio.c +++ b/block/bio.c @@ -861,6 +861,26 @@ int bio_add_page(struct bio *bio, struct page *page, } EXPORT_SYMBOL(bio_add_page); +static void bio_get_pages(struct bio *bio) +{ + struct bvec_iter_all iter_all; + struct bio_vec *bvec; + int i; + + bio_for_each_segment_all(bvec, bio, i, iter_all) + get_page(bvec->bv_page); +} + +static void bio_release_pages(struct bio *bio) +{ + struct bvec_iter_all iter_all; + struct bio_vec *bvec; + int i; + + bio_for_each_segment_all(bvec, bio, i, iter_all) + put_page(bvec->bv_page); +} + static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) { const struct bio_vec *bv = iter->bvec; @@ -875,15 +895,6 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) bv->bv_offset + iter->iov_offset); if (unlikely(size != len)) return -EINVAL; - - if (!bio_flagged(bio, BIO_NO_PAGE_REF)) { - struct page *page; - int i; - - mp_bvec_for_each_page(page, bv, i) - get_page(page); - } - iov_iter_advance(iter, size); return 0; } @@ -963,13 +974,6 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) if (WARN_ON_ONCE(bio->bi_vcnt)) return -EINVAL; - /* - * If this is a BVEC iter, then the pages are kernel pages. Don't - * release them on IO completion, if the caller asked us to. 
- */ - if (is_bvec && iov_iter_bvec_no_ref(iter)) - bio_set_flag(bio, BIO_NO_PAGE_REF); - do { if (is_bvec) ret = __bio_iov_bvec_add_pages(bio, iter); @@ -977,6 +981,11 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) ret = __bio_iov_iter_get_pages(bio, iter); } while (!ret && iov_iter_count(iter) && !bio_full(bio)); + if (iov_iter_bvec_no_ref(iter)) + bio_set_flag(bio, BIO_NO_PAGE_REF); + else + bio_get_pages(bio); + return bio->bi_vcnt ? 0 : ret; } @@ -1670,16 +1679,6 @@ void bio_set_pages_dirty(struct bio *bio) } } -static void bio_release_pages(struct bio *bio) -{ - struct bio_vec *bvec; - int i; - struct bvec_iter_all iter_all; - - bio_for_each_segment_all(bvec, bio, i, iter_all) - put_page(bvec->bv_page); -} - /* * bio_check_pages_dirty() will check that all the BIO's pages are still dirty. * If they are, then fine. If, however, some pages are clean then they must diff --git a/include/linux/bvec.h b/include/linux/bvec.h index f6275c4da13a..307bbda62b7b 100644 --- a/include/linux/bvec.h +++ b/include/linux/bvec.h @@ -189,9 +189,4 @@ static inline void mp_bvec_last_segment(const struct bio_vec *bvec, } } -#define mp_bvec_for_each_page(pg, bv, i) \ - for (i = (bv)->bv_offset / PAGE_SIZE; \ - (i <= (((bv)->bv_offset + (bv)->bv_len - 1) / PAGE_SIZE)) && \ - (pg = bvec_nth_page((bv)->bv_page, i)); i += 1) - #endif /* __LINUX_BVEC_ITER_H */ From 52d52d1c98a90cfe860b83498e4b6074aad95c15 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 11 Apr 2019 08:23:31 +0200 Subject: [PATCH 084/164] block: only allow contiguous page structs in a bio_vec We currently have to call nth_page when iterating over pages inside a bio_vec. Jens complained a while ago that this is fairly expensive. To mitigate this we can check that that the actual page structures are contiguous when adding them to the bio, and just do check pointer arithmetics later on. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/bio.c | 9 +++++++-- include/linux/bvec.h | 13 ++++--------- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/block/bio.c b/block/bio.c index d3490aeb1a7e..8adc2a20d57d 100644 --- a/block/bio.c +++ b/block/bio.c @@ -659,8 +659,13 @@ static inline bool page_is_mergeable(const struct bio_vec *bv, return false; if (xen_domain() && !xen_biovec_phys_mergeable(bv, page)) return false; - if (same_page && (vec_end_addr & PAGE_MASK) != page_addr) - return false; + + if ((vec_end_addr & PAGE_MASK) != page_addr) { + if (same_page) + return false; + if (pfn_to_page(PFN_DOWN(vec_end_addr)) + 1 != page) + return false; + } return true; } diff --git a/include/linux/bvec.h b/include/linux/bvec.h index 307bbda62b7b..44b0f4684190 100644 --- a/include/linux/bvec.h +++ b/include/linux/bvec.h @@ -51,11 +51,6 @@ struct bvec_iter_all { unsigned done; }; -static inline struct page *bvec_nth_page(struct page *page, int idx) -{ - return idx == 0 ? 
page : nth_page(page, idx); -} - /* * various member access, note that bio_data should of course not be used * on highmem page vectors @@ -92,8 +87,8 @@ static inline struct page *bvec_nth_page(struct page *page, int idx) PAGE_SIZE - bvec_iter_offset((bvec), (iter))) #define bvec_iter_page(bvec, iter) \ - bvec_nth_page(mp_bvec_iter_page((bvec), (iter)), \ - mp_bvec_iter_page_idx((bvec), (iter))) + (mp_bvec_iter_page((bvec), (iter)) + \ + mp_bvec_iter_page_idx((bvec), (iter))) #define bvec_iter_bvec(bvec, iter) \ ((struct bio_vec) { \ @@ -157,7 +152,7 @@ static inline void mp_bvec_next_segment(const struct bio_vec *bvec, struct bio_vec *bv = &iter_all->bv; if (bv->bv_page) { - bv->bv_page = nth_page(bv->bv_page, 1); + bv->bv_page++; bv->bv_offset = 0; } else { bv->bv_page = bvec->bv_page; @@ -177,7 +172,7 @@ static inline void mp_bvec_last_segment(const struct bio_vec *bvec, unsigned total = bvec->bv_offset + bvec->bv_len; unsigned last_page = (total - 1) / PAGE_SIZE; - seg->bv_page = bvec_nth_page(bvec->bv_page, last_page); + seg->bv_page = bvec->bv_page + last_page; /* the whole segment is inside the last page */ if (bvec->bv_offset >= last_page * PAGE_SIZE) { From 673387a930059fc4ad8060847a1d46f94e702281 Mon Sep 17 00:00:00 2001 From: Martin Wilck Date: Wed, 27 Mar 2019 14:51:01 +0100 Subject: [PATCH 085/164] block: genhd: remove async_events field The async_events field, intended to be used for drivers that support asynchronous notifications about disk events (aka media change events), isn't currently used by any driver, and apparently that has been that way for a long time (if not forever). Remove it. Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Signed-off-by: Martin Wilck Signed-off-by: Jens Axboe --- block/genhd.c | 10 ++++------ drivers/block/pktcdvd.c | 1 - include/linux/genhd.h | 1 - 3 files changed, 4 insertions(+), 8 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 703267865f14..ee76de0fb4cc 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1628,12 +1628,11 @@ static unsigned long disk_events_poll_jiffies(struct gendisk *disk) /* * If device-specific poll interval is set, always use it. If - * the default is being used, poll iff there are events which - * can't be monitored asynchronously. + * the default is being used, poll if the POLL flag is set. 
*/ if (ev->poll_msecs >= 0) intv_msecs = ev->poll_msecs; - else if (disk->events & ~disk->async_events) + else if (disk->events) intv_msecs = disk_events_dfl_poll_msecs; return msecs_to_jiffies(intv_msecs); @@ -1860,6 +1859,7 @@ static void disk_check_events(struct disk_events *ev, * * events : list of all supported events * events_async : list of events which can be detected w/o polling + * (always empty, only for backwards compatibility) * events_poll_msecs : polling interval, 0: disable, -1: system default */ static ssize_t __disk_events_show(unsigned int events, char *buf) @@ -1890,9 +1890,7 @@ static ssize_t disk_events_show(struct device *dev, static ssize_t disk_events_async_show(struct device *dev, struct device_attribute *attr, char *buf) { - struct gendisk *disk = dev_to_disk(dev); - - return __disk_events_show(disk->async_events, buf); + return 0; } static ssize_t disk_events_poll_msecs_show(struct device *dev, diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c index f5a71023f76c..024060165afa 100644 --- a/drivers/block/pktcdvd.c +++ b/drivers/block/pktcdvd.c @@ -2761,7 +2761,6 @@ static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev) /* inherit events of the host device */ disk->events = pd->bdev->bd_disk->events; - disk->async_events = pd->bdev->bd_disk->async_events; add_disk(disk); diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 98076b1b5e48..a65d9fc17369 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -185,7 +185,6 @@ struct gendisk { char *(*devnode)(struct gendisk *gd, umode_t *mode); unsigned int events; /* supported events */ - unsigned int async_events; /* async events, subset of all */ /* Array of pointers to partitions indexed by partno. * Protected with matching bdev lock but stat and other From c92e2f04b35938da23eb9a7f7101cbdd5ac7cdc4 Mon Sep 17 00:00:00 2001 From: Martin Wilck Date: Wed, 27 Mar 2019 14:51:02 +0100 Subject: [PATCH 086/164] block: disk_events: introduce event flags Currently, an empty disk->events field tells the block layer not to forward media change events to user space. This was done in commit 7c88a168da80 ("block: don't propagate unlisted DISK_EVENTs to userland") in order to avoid events from "fringe" drivers to be forwarded to user space. By doing so, the block layer lost the information which events were supported by a particular block device, and most importantly, whether or not a given device supports media change events at all. Prepare for not interpreting the "events" field this way in the future any more. This is done by adding an additional field "event_flags" to struct gendisk, and two flag bits that can be set to have the device treated like one that had the "events" field set to a non-zero value before. This applies only to the sd and sr drivers, which are changed to set the new flags. The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the blocklayer to generate udev events from kernel events. In order to add the event_flags field to struct gendisk, the events field is converted to an "unsigned short"; it doesn't need to hold values bigger than 2 anyway. This patch doesn't change behavior. 
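Condensed into a hedged sketch (illustrative, not the literal upstream code), the new contract between a driver and the block layer looks like this:

/* Driver side: advertise which events the device can report and how the
 * block layer should handle them.
 */
gd->events      = DISK_EVENT_MEDIA_CHANGE;
gd->event_flags = DISK_EVENT_FLAG_POLL | DISK_EVENT_FLAG_UEVENT;

/* Block layer side: behaviour is keyed off the flags rather than off
 * whether ->events happens to be non-zero.
 *   DISK_EVENT_FLAG_POLL   -> poll the device even when events_poll_msecs
 *                             is left at the system default
 *   DISK_EVENT_FLAG_UEVENT -> forward detected events to udev
 */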
Reviewed-by: Christoph Hellwig Signed-off-by: Martin Wilck Signed-off-by: Jens Axboe --- block/genhd.c | 13 +++++++++---- drivers/scsi/sd.c | 1 + drivers/scsi/sr.c | 1 + include/linux/genhd.h | 10 +++++++++- 4 files changed, 20 insertions(+), 5 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index ee76de0fb4cc..5375be39e8a5 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1632,7 +1632,7 @@ static unsigned long disk_events_poll_jiffies(struct gendisk *disk) */ if (ev->poll_msecs >= 0) intv_msecs = ev->poll_msecs; - else if (disk->events) + else if (disk->event_flags & DISK_EVENT_FLAG_POLL) intv_msecs = disk_events_dfl_poll_msecs; return msecs_to_jiffies(intv_msecs); @@ -1842,11 +1842,13 @@ static void disk_check_events(struct disk_events *ev, /* * Tell userland about new events. Only the events listed in - * @disk->events are reported. Unlisted events are processed the - * same internally but never get reported to userland. + * @disk->events are reported, and only if DISK_EVENT_FLAG_UEVENT + * is set. Otherwise, events are processed internally but never + * get reported to userland. */ for (i = 0; i < ARRAY_SIZE(disk_uevents); i++) - if (events & disk->events & (1 << i)) + if ((events & disk->events & (1 << i)) && + (disk->event_flags & DISK_EVENT_FLAG_UEVENT)) envp[nr_events++] = disk_uevents[i]; if (nr_events) @@ -1884,6 +1886,9 @@ static ssize_t disk_events_show(struct device *dev, { struct gendisk *disk = dev_to_disk(dev); + if (!(disk->event_flags & DISK_EVENT_FLAG_UEVENT)) + return 0; + return __disk_events_show(disk->events, buf); } diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 92c34d93e051..ebc80354714c 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3293,6 +3293,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie) if (sdp->removable) { gd->flags |= GENHD_FL_REMOVABLE; gd->events |= DISK_EVENT_MEDIA_CHANGE; + gd->event_flags = DISK_EVENT_FLAG_POLL | DISK_EVENT_FLAG_UEVENT; } blk_pm_runtime_init(sdp->request_queue, dev); diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c index 039c27c2d7b3..c3f443d5aea8 100644 --- a/drivers/scsi/sr.c +++ b/drivers/scsi/sr.c @@ -716,6 +716,7 @@ static int sr_probe(struct device *dev) disk->fops = &sr_bdops; disk->flags = GENHD_FL_CD | GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; disk->events = DISK_EVENT_MEDIA_CHANGE | DISK_EVENT_EJECT_REQUEST; + disk->event_flags = DISK_EVENT_FLAG_POLL | DISK_EVENT_FLAG_UEVENT; blk_queue_rq_timeout(sdev->request_queue, SR_TIMEOUT); diff --git a/include/linux/genhd.h b/include/linux/genhd.h index a65d9fc17369..6547c9256d5c 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -150,6 +150,13 @@ enum { DISK_EVENT_EJECT_REQUEST = 1 << 1, /* eject requested */ }; +enum { + /* Poll even if events_poll_msecs is unset */ + DISK_EVENT_FLAG_POLL = 1 << 0, + /* Forward events to udev */ + DISK_EVENT_FLAG_UEVENT = 1 << 1, +}; + struct disk_part_tbl { struct rcu_head rcu_head; int len; @@ -184,7 +191,8 @@ struct gendisk { char disk_name[DISK_NAME_LEN]; /* name of major driver */ char *(*devnode)(struct gendisk *gd, umode_t *mode); - unsigned int events; /* supported events */ + unsigned short events; /* supported events */ + unsigned short event_flags; /* flags related to event processing */ /* Array of pointers to partitions indexed by partno. 
* Protected with matching bdev lock but stat and other From 3c12c8e94ca04d668ad0cded7857fea2637834b3 Mon Sep 17 00:00:00 2001 From: Martin Wilck Date: Wed, 27 Mar 2019 14:51:03 +0100 Subject: [PATCH 087/164] Revert "ide: unexport DISK_EVENT_MEDIA_CHANGE for ide-gd and ide-cd" This reverts commit 7eec77a1816a7042591a6cbdb4820e9e7ebffe0e. Instead of leaving disk->events completely empty, we now export the supported events again, and tell the block layer not to forward events to user space by not setting DISK_EVENT_FLAG_UEVENT. This allows the block layer to distinguish between devices that for which events should be handled in kernel only, and devices which don't support any meda change events at all. Cc: Borislav Petkov Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Signed-off-by: Martin Wilck Signed-off-by: Jens Axboe --- drivers/ide/ide-cd.c | 1 + drivers/ide/ide-cd_ioctl.c | 5 +++-- drivers/ide/ide-gd.c | 6 ++++-- 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c index 1f03884a6808..3b15adc6ce98 100644 --- a/drivers/ide/ide-cd.c +++ b/drivers/ide/ide-cd.c @@ -1797,6 +1797,7 @@ static int ide_cd_probe(ide_drive_t *drive) ide_cd_read_toc(drive); g->fops = &idecd_ops; g->flags |= GENHD_FL_REMOVABLE | GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; + g->events = DISK_EVENT_MEDIA_CHANGE; device_add_disk(&drive->gendev, g, NULL); return 0; diff --git a/drivers/ide/ide-cd_ioctl.c b/drivers/ide/ide-cd_ioctl.c index 4a6e1a413ead..46f2df288c6a 100644 --- a/drivers/ide/ide-cd_ioctl.c +++ b/drivers/ide/ide-cd_ioctl.c @@ -82,8 +82,9 @@ int ide_cdrom_drive_status(struct cdrom_device_info *cdi, int slot_nr) /* * ide-cd always generates media changed event if media is missing, which - * makes it impossible to use for proper event reporting, so disk->events - * is cleared to 0 and the following function is used only to trigger + * makes it impossible to use for proper event reporting, so + * DISK_EVENT_FLAG_UEVENT is cleared in disk->event_flags + * and the following function is used only to trigger * revalidation and never propagated to userland. */ unsigned int ide_cdrom_check_events_real(struct cdrom_device_info *cdi, diff --git a/drivers/ide/ide-gd.c b/drivers/ide/ide-gd.c index 04e008e8f6f9..f233b34ea0c0 100644 --- a/drivers/ide/ide-gd.c +++ b/drivers/ide/ide-gd.c @@ -299,8 +299,9 @@ static unsigned int ide_gd_check_events(struct gendisk *disk, /* * The following is used to force revalidation on the first open on * removeable devices, and never gets reported to userland as - * genhd->events is 0. This is intended as removeable ide disk - * can't really detect MEDIA_CHANGE events. + * DISK_EVENT_FLAG_UEVENT isn't set in genhd->event_flags. + * This is intended as removable ide disk can't really detect + * MEDIA_CHANGE events. */ ret = drive->dev_flags & IDE_DFLAG_MEDIA_CHANGED; drive->dev_flags &= ~IDE_DFLAG_MEDIA_CHANGED; @@ -416,6 +417,7 @@ static int ide_gd_probe(ide_drive_t *drive) if (drive->dev_flags & IDE_DFLAG_REMOVABLE) g->flags = GENHD_FL_REMOVABLE; g->fops = &ide_gd_ops; + g->events = DISK_EVENT_MEDIA_CHANGE; device_add_disk(&drive->gendev, g, NULL); return 0; From 773008f6fe0544aa28140ced0504cefba17381aa Mon Sep 17 00:00:00 2001 From: Martin Wilck Date: Wed, 27 Mar 2019 14:51:04 +0100 Subject: [PATCH 088/164] Revert "block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers" This reverts commit 9fd097b14918875bd6f125ed699d7bbbba5893ee. 
Instead of leaving disk->events completely empty, we now export the supported events again, and tell the block layer not to forward events to user space by not setting DISK_EVENT_FLAG_UEVENT. This allows the block layer to distinguish between devices that for which events should be handled in kernel only, and devices which don't support any meda change events at all. Cc: Jiri Kosina Cc: Tim Waugh Cc: Michal Simek Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Signed-off-by: Martin Wilck Signed-off-by: Jens Axboe --- drivers/block/amiflop.c | 1 + drivers/block/ataflop.c | 1 + drivers/block/floppy.c | 1 + drivers/block/paride/pcd.c | 1 + drivers/block/paride/pd.c | 1 + drivers/block/paride/pf.c | 1 + drivers/block/swim.c | 1 + drivers/block/swim3.c | 1 + drivers/block/xsysace.c | 1 + drivers/cdrom/gdrom.c | 1 + 10 files changed, 10 insertions(+) diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c index 0903e0803ec8..92b930cb3b72 100644 --- a/drivers/block/amiflop.c +++ b/drivers/block/amiflop.c @@ -1829,6 +1829,7 @@ static int __init fd_probe_drives(void) disk->major = FLOPPY_MAJOR; disk->first_minor = drive; disk->fops = &floppy_fops; + disk->events = DISK_EVENT_MEDIA_CHANGE; sprintf(disk->disk_name, "fd%d", drive); disk->private_data = &unit[drive]; set_capacity(disk, 880*2); diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c index b0dbbdfeb33e..c7b5c4671f05 100644 --- a/drivers/block/ataflop.c +++ b/drivers/block/ataflop.c @@ -2028,6 +2028,7 @@ static int __init atari_floppy_init (void) unit[i].disk->first_minor = i; sprintf(unit[i].disk->disk_name, "fd%d", i); unit[i].disk->fops = &floppy_fops; + unit[i].disk->events = DISK_EVENT_MEDIA_CHANGE; unit[i].disk->private_data = &unit[i]; set_capacity(unit[i].disk, MAX_DISK_SIZE * 2); add_disk(unit[i].disk); diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index 95f608d1a098..8072bd9881e6 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -4540,6 +4540,7 @@ static int __init do_floppy_init(void) disks[drive]->major = FLOPPY_MAJOR; disks[drive]->first_minor = TOMINOR(drive); disks[drive]->fops = &floppy_fops; + disks[drive]->events = DISK_EVENT_MEDIA_CHANGE; sprintf(disks[drive]->disk_name, "fd%d", drive); timer_setup(&motor_off_timer[drive], motor_off_callback, 0); diff --git a/drivers/block/paride/pcd.c b/drivers/block/paride/pcd.c index 377a694dc228..5436d856e656 100644 --- a/drivers/block/paride/pcd.c +++ b/drivers/block/paride/pcd.c @@ -342,6 +342,7 @@ static void pcd_init_units(void) strcpy(disk->disk_name, cd->name); /* umm... 
*/ disk->fops = &pcd_bdops; disk->flags = GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; + disk->events = DISK_EVENT_MEDIA_CHANGE; } } diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c index 0ff9b12d0e35..6f9ad3fc716f 100644 --- a/drivers/block/paride/pd.c +++ b/drivers/block/paride/pd.c @@ -897,6 +897,7 @@ static void pd_probe_drive(struct pd_unit *disk) p->fops = &pd_fops; p->major = major; p->first_minor = (disk - pd) << PD_BITS; + p->events = DISK_EVENT_MEDIA_CHANGE; disk->gd = p; p->private_data = disk; diff --git a/drivers/block/paride/pf.c b/drivers/block/paride/pf.c index 103b617cdc31..1aca4a8acb55 100644 --- a/drivers/block/paride/pf.c +++ b/drivers/block/paride/pf.c @@ -319,6 +319,7 @@ static void __init pf_init_units(void) disk->first_minor = unit; strcpy(disk->disk_name, pf->name); disk->fops = &pf_fops; + disk->events = DISK_EVENT_MEDIA_CHANGE; if (!(*drives[unit])[D_PRT]) pf_drive_count++; } diff --git a/drivers/block/swim.c b/drivers/block/swim.c index 3fa6fcc34790..67b5ec281c6d 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -862,6 +862,7 @@ static int swim_floppy_init(struct swim_priv *swd) swd->unit[drive].disk->first_minor = drive; sprintf(swd->unit[drive].disk->disk_name, "fd%d", drive); swd->unit[drive].disk->fops = &floppy_fops; + swd->unit[drive].disk->events = DISK_EVENT_MEDIA_CHANGE; swd->unit[drive].disk->private_data = &swd->unit[drive]; set_capacity(swd->unit[drive].disk, 2880); add_disk(swd->unit[drive].disk); diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c index 1e2ae90d7715..cf42729c788e 100644 --- a/drivers/block/swim3.c +++ b/drivers/block/swim3.c @@ -1216,6 +1216,7 @@ static int swim3_attach(struct macio_dev *mdev, disk->first_minor = floppy_count; disk->fops = &floppy_fops; disk->private_data = fs; + disk->events = DISK_EVENT_MEDIA_CHANGE; disk->flags |= GENHD_FL_REMOVABLE; sprintf(disk->disk_name, "fd%d", floppy_count); set_capacity(disk, 2880); diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c index 87ccef4bd69e..8d299507efe7 100644 --- a/drivers/block/xsysace.c +++ b/drivers/block/xsysace.c @@ -1032,6 +1032,7 @@ static int ace_setup(struct ace_device *ace) ace->gd->major = ace_major; ace->gd->first_minor = ace->id * ACE_NUM_MINORS; ace->gd->fops = &ace_fops; + ace->gd->events = DISK_EVENT_MEDIA_CHANGE; ace->gd->queue = ace->queue; ace->gd->private_data = ace; snprintf(ace->gd->disk_name, 32, "xs%c", ace->id + 'a'); diff --git a/drivers/cdrom/gdrom.c b/drivers/cdrom/gdrom.c index f8b7345fe1cb..5cf3bade0d57 100644 --- a/drivers/cdrom/gdrom.c +++ b/drivers/cdrom/gdrom.c @@ -786,6 +786,7 @@ static int probe_gdrom(struct platform_device *devptr) goto probe_fail_cdrom_register; } gd.disk->fops = &gdrom_bdops; + gd.disk->events = DISK_EVENT_MEDIA_CHANGE; /* latch on to the interrupt */ err = gdrom_set_interrupt_handlers(); if (err) From cdf3e3deb747d5e193dee617ed37c83060eb576f Mon Sep 17 00:00:00 2001 From: Martin Wilck Date: Wed, 27 Mar 2019 14:51:05 +0100 Subject: [PATCH 089/164] block: check_events: don't bother with events if unsupported Drivers now report to the block layer if they support media change events. If this is not the case, there's no need to allocate the event structure, and all event handling code can effectively be skipped. This simplifies code flow in particular for non-removable sd devices. This effectively reverts commit 75e3f3ee3c64 ("block: always allocate genhd->ev if check_events is implemented"). 
The sysfs files for the events are kept in place even if no events are supported, as user space may rely on them being present. The only difference is that an error code is now returned if the user tries to set poll_msecs. Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Signed-off-by: Martin Wilck Signed-off-by: Jens Axboe --- block/genhd.c | 27 ++++++++++++++++----------- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 5375be39e8a5..1d0d25f7b0fe 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1904,6 +1904,9 @@ static ssize_t disk_events_poll_msecs_show(struct device *dev, { struct gendisk *disk = dev_to_disk(dev); + if (!disk->ev) + return sprintf(buf, "-1\n"); + return sprintf(buf, "%ld\n", disk->ev->poll_msecs); } @@ -1920,6 +1923,9 @@ static ssize_t disk_events_poll_msecs_store(struct device *dev, if (intv < 0 && intv != -1) return -EINVAL; + if (!disk->ev) + return -ENODEV; + disk_block_events(disk); disk->ev->poll_msecs = intv; __disk_unblock_events(disk, true); @@ -1984,7 +1990,7 @@ static void disk_alloc_events(struct gendisk *disk) { struct disk_events *ev; - if (!disk->fops->check_events) + if (!disk->fops->check_events || !disk->events) return; ev = kzalloc(sizeof(*ev), GFP_KERNEL); @@ -2006,14 +2012,14 @@ static void disk_alloc_events(struct gendisk *disk) static void disk_add_events(struct gendisk *disk) { - if (!disk->ev) - return; - /* FIXME: error handling */ if (sysfs_create_files(&disk_to_dev(disk)->kobj, disk_events_attrs) < 0) pr_warn("%s: failed to create sysfs files for events\n", disk->disk_name); + if (!disk->ev) + return; + mutex_lock(&disk_events_mutex); list_add_tail(&disk->ev->node, &disk_events); mutex_unlock(&disk_events_mutex); @@ -2027,14 +2033,13 @@ static void disk_add_events(struct gendisk *disk) static void disk_del_events(struct gendisk *disk) { - if (!disk->ev) - return; + if (disk->ev) { + disk_block_events(disk); - disk_block_events(disk); - - mutex_lock(&disk_events_mutex); - list_del_init(&disk->ev->node); - mutex_unlock(&disk_events_mutex); + mutex_lock(&disk_events_mutex); + list_del_init(&disk->ev->node); + mutex_unlock(&disk_events_mutex); + } sysfs_remove_files(&disk_to_dev(disk)->kobj, disk_events_attrs); } From 2c88e3c7ec32d7a40cc7c9b4a487cf90e4671bdd Mon Sep 17 00:00:00 2001 From: Yufen Yu Date: Tue, 2 Apr 2019 20:06:34 +0800 Subject: [PATCH 090/164] block: fix use-after-free on gendisk Commit 2da78092dda1 ("block: Fix dev_t minor allocation lifetime") specifically moved the blk_free_devt(dev->devt) call to part_release() to avoid reallocating the device number before the device is fully shut down. However, it can cause a use-after-free on the gendisk in get_gendisk(). We use an md device as an example to show the race: Process1 Worker Process2 md_free blkdev_open del_gendisk add delete_partition_work_fn() to wq __blkdev_get get_gendisk put_disk disk_release kfree(disk) find part from ext_devt_idr get_disk_and_module(disk) cause use after free delete_partition_work_fn put_device(part) part_release remove part from ext_devt_idr Before the part is removed from ext_devt_idr by delete_partition_work_fn(), we can find the devt and then access the gendisk through the hd_struct pointer. But if we access the gendisk after it has been freed, we get a use-after-free on the gendisk in get_gendisk(). We fix this by adding a new helper, blk_invalidate_devt(), called from delete_partition() and del_gendisk(). It replaces the hd_struct pointer in the idr with NULL; the entry itself is still deleted from the idr in part_release(), as before.
Thanks to Jan Kara for providing the solution and more clear comments for the code. Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime") Cc: Al Viro Reviewed-by: Bart Van Assche Reviewed-by: Keith Busch Reviewed-by: Jan Kara Suggested-by: Jan Kara Signed-off-by: Yufen Yu Signed-off-by: Jens Axboe --- block/genhd.c | 19 +++++++++++++++++++ block/partition-generic.c | 7 +++++++ include/linux/genhd.h | 1 + 3 files changed, 27 insertions(+) diff --git a/block/genhd.c b/block/genhd.c index 1d0d25f7b0fe..83f5c33d1e80 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -531,6 +531,18 @@ void blk_free_devt(dev_t devt) } } +/** + * We invalidate devt by assigning NULL pointer for devt in idr. + */ +void blk_invalidate_devt(dev_t devt) +{ + if (MAJOR(devt) == BLOCK_EXT_MAJOR) { + spin_lock_bh(&ext_devt_lock); + idr_replace(&ext_devt_idr, NULL, blk_mangle_minor(MINOR(devt))); + spin_unlock_bh(&ext_devt_lock); + } +} + static char *bdevt_str(dev_t devt, char *buf) { if (MAJOR(devt) <= 0xff && MINOR(devt) <= 0xff) { @@ -793,6 +805,13 @@ void del_gendisk(struct gendisk *disk) if (!(disk->flags & GENHD_FL_HIDDEN)) blk_unregister_region(disk_devt(disk), disk->minors); + /* + * Remove gendisk pointer from idr so that it cannot be looked up + * while RCU period before freeing gendisk is running to prevent + * use-after-free issues. Note that the device number stays + * "in-use" until we really free the gendisk. + */ + blk_invalidate_devt(disk_devt(disk)); kobject_put(disk->part0.holder_dir); kobject_put(disk->slave_dir); diff --git a/block/partition-generic.c b/block/partition-generic.c index 8e596a8dff32..aee643ce13d1 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -285,6 +285,13 @@ void delete_partition(struct gendisk *disk, int partno) kobject_put(part->holder_dir); device_del(part_to_dev(part)); + /* + * Remove gendisk pointer from idr so that it cannot be looked up + * while RCU period before freeing gendisk is running to prevent + * use-after-free issues. Note that the device number stays + * "in-use" until we really free the gendisk. + */ + blk_invalidate_devt(part_devt(part)); hd_struct_kill(part); } diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 6547c9256d5c..8b5330dd5ac0 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -617,6 +617,7 @@ struct unixware_disklabel { extern int blk_alloc_devt(struct hd_struct *part, dev_t *devt); extern void blk_free_devt(dev_t devt); +extern void blk_invalidate_devt(dev_t devt); extern dev_t blk_lookup_devt(const char *name, int partno); extern char *disk_name (struct gendisk *hd, int partno, char *buf); From c42d3240990814eec1e4b2b93fa0487fc4873aed Mon Sep 17 00:00:00 2001 From: Pawel Baldysiak Date: Wed, 27 Mar 2019 13:48:21 +0100 Subject: [PATCH 091/164] md: return -ENODEV if rdev has no mddev assigned Mdadm expects that setting drive as faulty will fail with -EBUSY only if this operation will cause RAID to be failed. If this happens, it will try to stop the array. Currently -EBUSY might also be returned if rdev is in the middle of the removal process - for example there is a race with mdmon that already requested the drive to be failed/removed. If rdev does not contain mddev, return -ENODEV instead, so the caller can distinguish between those two cases and behave accordingly. 
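A hypothetical userspace sketch (not mdadm source) of why the distinction matters to the caller that writes "faulty" into the rdev state attribute:

#include <errno.h>
#include <unistd.h>

static int set_faulty(int state_fd)
{
	if (write(state_fd, "faulty", 6) >= 0)
		return 0;
	if (errno == EBUSY)
		return 1;	/* failing this member would fail the array: stop it */
	if (errno == ENODEV)
		return 2;	/* rdev is already being torn down: nothing to do */
	return -1;		/* any other error */
}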
Reviewed-by: NeilBrown Signed-off-by: Pawel Baldysiak Signed-off-by: Song Liu --- drivers/md/md.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 541015373f6a..45ffa23fa85d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -3380,10 +3380,10 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr, return -EIO; if (!capable(CAP_SYS_ADMIN)) return -EACCES; - rv = mddev ? mddev_lock(mddev): -EBUSY; + rv = mddev ? mddev_lock(mddev) : -ENODEV; if (!rv) { if (rdev->mddev == NULL) - rv = -EBUSY; + rv = -ENODEV; else rv = entry->store(rdev, page, length); mddev_unlock(mddev); From a25d8c327bb41742dbd59f8c545f59f3b9c39983 Mon Sep 17 00:00:00 2001 From: Song Liu Date: Tue, 16 Apr 2019 09:34:21 -0700 Subject: [PATCH 092/164] Revert "Don't jump to compute_result state from check_result state" This reverts commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef. Cc: Dan Williams Cc: Nigel Croxon Cc: Xiao Ni Signed-off-by: Song Liu --- drivers/md/raid5.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 2b0a715e70c9..b5742d07662d 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -4227,15 +4227,26 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, case check_state_check_result: sh->check_state = check_state_idle; - if (s->failed > 1) - break; /* handle a successful check operation, if parity is correct * we are done. Otherwise update the mismatch count and repair * parity if !MD_RECOVERY_CHECK */ if (sh->ops.zero_sum_result == 0) { - /* Any parity checked was correct */ - set_bit(STRIPE_INSYNC, &sh->state); + /* both parities are correct */ + if (!s->failed) + set_bit(STRIPE_INSYNC, &sh->state); + else { + /* in contrast to the raid5 case we can validate + * parity, but still have a failure to write + * back + */ + sh->check_state = check_state_compute_result; + /* Returning at this point means that we may go + * off and bring p and/or q uptodate again so + * we make sure to check zero_sum_result again + * to verify if p or q need writeback + */ + } } else { atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches); if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) { From b2176a1dfb518d870ee073445d27055fea64dfb8 Mon Sep 17 00:00:00 2001 From: Nigel Croxon Date: Tue, 16 Apr 2019 09:50:09 -0700 Subject: [PATCH 093/164] md/raid: raid5 preserve the writeback action after the parity check The problem is that any 'uptodate' vs 'disks' check is not precise in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the device that might try to kick off writes and then skip the action. Better to prevent the raid driver from taking unexpected action *and* keep the system alive vs killing the machine with BUG_ON. 
Note: fixed warning reported by kbuild test robot Signed-off-by: Dan Williams Signed-off-by: Nigel Croxon Signed-off-by: Song Liu --- drivers/md/raid5.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index b5742d07662d..7fde645d2e90 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -4191,7 +4191,7 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, /* now write out any block on a failed drive, * or P or Q if they were recomputed */ - BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */ + dev = NULL; if (s->failed == 2) { dev = &sh->dev[s->failed_num[1]]; s->locked++; @@ -4216,6 +4216,14 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, set_bit(R5_LOCKED, &dev->flags); set_bit(R5_Wantwrite, &dev->flags); } + if (WARN_ONCE(dev && !test_bit(R5_UPTODATE, &dev->flags), + "%s: disk%td not up to date\n", + mdname(conf->mddev), + dev - (struct r5dev *) &sh->dev)) { + clear_bit(R5_LOCKED, &dev->flags); + clear_bit(R5_Wantwrite, &dev->flags); + s->locked--; + } clear_bit(STRIPE_DEGRADED, &sh->state); set_bit(STRIPE_INSYNC, &sh->state); From 6fcc44d1d77fea3c7230e4d109b37f6977aa675a Mon Sep 17 00:00:00 2001 From: Yufen Yu Date: Tue, 2 Apr 2019 20:06:34 +0800 Subject: [PATCH 094/164] block: fix use-after-free on gendisk commit 2da78092dda "block: Fix dev_t minor allocation lifetime" specifically moved blk_free_devt(dev->devt) call to part_release() to avoid reallocating device number before the device is fully shutdown. However, it can cause use-after-free on gendisk in get_gendisk(). We use md device as example to show the race scenes: Process1 Worker Process2 md_free blkdev_open del_gendisk add delete_partition_work_fn() to wq __blkdev_get get_gendisk put_disk disk_release kfree(disk) find part from ext_devt_idr get_disk_and_module(disk) cause use after free delete_partition_work_fn put_device(part) part_release remove part from ext_devt_idr Before is removed from ext_devt_idr by delete_partition_work_fn(), we can find the devt and then access gendisk by hd_struct pointer. But, if we access the gendisk after it have been freed, it can cause in use-after-freeon gendisk in get_gendisk(). We fix this by adding a new helper blk_invalidate_devt() in delete_partition() and del_gendisk(). It replaces hd_struct pointer in idr with value 'NULL', and deletes the entry from idr in part_release() as we do now. Thanks to Jan Kara for providing the solution and more clear comments for the code. Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime") Cc: Al Viro Reviewed-by: Bart Van Assche Reviewed-by: Keith Busch Reviewed-by: Jan Kara Suggested-by: Jan Kara Signed-off-by: Yufen Yu Signed-off-by: Jens Axboe --- block/genhd.c | 19 +++++++++++++++++++ block/partition-generic.c | 7 +++++++ include/linux/genhd.h | 1 + 3 files changed, 27 insertions(+) diff --git a/block/genhd.c b/block/genhd.c index 1d0d25f7b0fe..83f5c33d1e80 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -531,6 +531,18 @@ void blk_free_devt(dev_t devt) } } +/** + * We invalidate devt by assigning NULL pointer for devt in idr. 
+ */ +void blk_invalidate_devt(dev_t devt) +{ + if (MAJOR(devt) == BLOCK_EXT_MAJOR) { + spin_lock_bh(&ext_devt_lock); + idr_replace(&ext_devt_idr, NULL, blk_mangle_minor(MINOR(devt))); + spin_unlock_bh(&ext_devt_lock); + } +} + static char *bdevt_str(dev_t devt, char *buf) { if (MAJOR(devt) <= 0xff && MINOR(devt) <= 0xff) { @@ -793,6 +805,13 @@ void del_gendisk(struct gendisk *disk) if (!(disk->flags & GENHD_FL_HIDDEN)) blk_unregister_region(disk_devt(disk), disk->minors); + /* + * Remove gendisk pointer from idr so that it cannot be looked up + * while RCU period before freeing gendisk is running to prevent + * use-after-free issues. Note that the device number stays + * "in-use" until we really free the gendisk. + */ + blk_invalidate_devt(disk_devt(disk)); kobject_put(disk->part0.holder_dir); kobject_put(disk->slave_dir); diff --git a/block/partition-generic.c b/block/partition-generic.c index 8e596a8dff32..aee643ce13d1 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -285,6 +285,13 @@ void delete_partition(struct gendisk *disk, int partno) kobject_put(part->holder_dir); device_del(part_to_dev(part)); + /* + * Remove gendisk pointer from idr so that it cannot be looked up + * while RCU period before freeing gendisk is running to prevent + * use-after-free issues. Note that the device number stays + * "in-use" until we really free the gendisk. + */ + blk_invalidate_devt(part_devt(part)); hd_struct_kill(part); } diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 6547c9256d5c..8b5330dd5ac0 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -617,6 +617,7 @@ struct unixware_disklabel { extern int blk_alloc_devt(struct hd_struct *part, dev_t *devt); extern void blk_free_devt(dev_t devt); +extern void blk_invalidate_devt(dev_t devt); extern dev_t blk_lookup_devt(const char *name, int partno); extern char *disk_name (struct gendisk *hd, int partno, char *buf); From f6b50160a06d4a0d6a3999ab0c5aec4f52dba248 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Mon, 22 Apr 2019 21:23:21 +0800 Subject: [PATCH 095/164] brd: re-enable __GFP_HIGHMEM in brd_insert_page() __GFP_HIGHMEM is disabled if dax is enabled on brd, however dax support for brd has been removed since commit (7a862fbbdec6 "brd: remove dax support"), so restore __GFP_HIGHMEM in brd_insert_page(). Also remove the no longer applicable comments about DAX and highmem. Cc: stable@vger.kernel.org Fixes: 7a862fbbdec6 ("brd: remove dax support") Signed-off-by: Hou Tao Signed-off-by: Jens Axboe --- drivers/block/brd.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/block/brd.c b/drivers/block/brd.c index c18586fccb6f..17defbf4f332 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -96,13 +96,8 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector) /* * Must use NOIO because we don't want to recurse back into the * block or filesystem layers from page reclaim. - * - * Cannot support DAX and highmem, because our ->direct_access - * routine for DAX must return memory that is always addressable. - * If DAX was reworked to use pfns and kmap throughout, this - * restriction might be able to be lifted. 
*/ - gfp_flags = GFP_NOIO | __GFP_ZERO; + gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; page = alloc_page(gfp_flags); if (!page) return NULL; From f9f76879bc4521019697970bad3bc1dd0bec211f Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 19 Apr 2019 08:56:24 +0200 Subject: [PATCH 096/164] block: avoid scatterlist offsets > PAGE_SIZE While we generally allow scatterlists to have offsets larger than page size for an entry, and other subsystems like the crypto code make use of that, the block layer isn't quite ready for that. Flip the switch back to avoid them for now, and revisit that decision early in a merge window once the known offenders are fixed. Fixes: 8a96a0e40810 ("block: rewrite blk_bvec_map_sg to avoid a nth_page call") Reviewed-by: Ming Lei Tested-by: Guenter Roeck Reported-by: Guenter Roeck Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/blk-merge.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index 247b17f2a0f6..21e87a714a73 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -474,9 +474,21 @@ static unsigned blk_bvec_map_sg(struct request_queue *q, while (nbytes > 0) { unsigned offset = bvec->bv_offset + total; unsigned len = min(get_max_segment_size(q, offset), nbytes); + struct page *page = bvec->bv_page; + + /* + * Unfortunately a fair number of drivers barf on scatterlists + * that have an offset larger than PAGE_SIZE, despite other + * subsystems dealing with that invariant just fine. For now + * stick to the legacy format where we never present those from + * the block layer, but the code below should be removed once + * these offenders (mostly MMC/SD drivers) are fixed. + */ + page += (offset >> PAGE_SHIFT); + offset &= ~PAGE_MASK; *sg = blk_next_sg(sg, sglist); - sg_set_page(*sg, bvec->bv_page, len, offset); + sg_set_page(*sg, page, len, offset); total += len; nbytes -= len; From 4d25339e32a1b6e1f490bb78b1e5b0fa9eb3e073 Mon Sep 17 00:00:00 2001 From: Weiping Zhang Date: Tue, 2 Apr 2019 21:14:30 +0800 Subject: [PATCH 097/164] block: don't show io_timeout if driver has no timeout handler If the low level driver has no timeout handler, the /sys/block//queue/io_timeout will not be displayed. 
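A small user-space sketch of the same visibility-callback idea, with invented names standing in for the sysfs attribute_group machinery:

#include <stdio.h>
#include <string.h>

struct queue { int has_timeout_handler; };
struct attr  { const char *name; };

static struct attr attrs[] = {
    { "nr_requests" }, { "read_ahead_kb" }, { "io_timeout" },
};

/* return the mode to expose the attribute with, or 0 to hide it */
static int attr_visible(const struct queue *q, const struct attr *a)
{
    if (!strcmp(a->name, "io_timeout") && !q->has_timeout_handler)
        return 0;
    return 0644;
}

int main(void)
{
    struct queue q = { .has_timeout_handler = 0 };

    for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++)
        if (attr_visible(&q, &attrs[i]))
            printf("visible: %s\n", attrs[i].name);
    return 0;
}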
Reviewed-by: Bart Van Assche Signed-off-by: Weiping Zhang Signed-off-by: Jens Axboe --- block/blk-sysfs.c | 30 ++++++++++++++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 422327089e0f..a16a02c52a85 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -728,7 +728,7 @@ static struct queue_sysfs_entry throtl_sample_time_entry = { }; #endif -static struct attribute *default_attrs[] = { +static struct attribute *queue_attrs[] = { &queue_requests_entry.attr, &queue_ra_entry.attr, &queue_max_hw_sectors_entry.attr, @@ -770,6 +770,25 @@ static struct attribute *default_attrs[] = { NULL, }; +static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr, + int n) +{ + struct request_queue *q = + container_of(kobj, struct request_queue, kobj); + + if (attr == &queue_io_timeout_entry.attr && + (!q->mq_ops || !q->mq_ops->timeout)) + return 0; + + return attr->mode; +} + +static struct attribute_group queue_attr_group = { + .attrs = queue_attrs, + .is_visible = queue_attr_visible, +}; + + #define to_queue(atr) container_of((atr), struct queue_sysfs_entry, attr) static ssize_t @@ -890,7 +909,6 @@ static const struct sysfs_ops queue_sysfs_ops = { struct kobj_type blk_queue_ktype = { .sysfs_ops = &queue_sysfs_ops, - .default_attrs = default_attrs, .release = blk_release_queue, }; @@ -939,6 +957,14 @@ int blk_register_queue(struct gendisk *disk) goto unlock; } + ret = sysfs_create_group(&q->kobj, &queue_attr_group); + if (ret) { + blk_trace_remove_sysfs(dev); + kobject_del(&q->kobj); + kobject_put(&dev->kobj); + goto unlock; + } + if (queue_is_mq(q)) { __blk_mq_register_dev(dev, q); blk_mq_debugfs_register(q); From 551879a48f01826fd86568d7bd1e774cb0de3295 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 23 Apr 2019 10:51:04 +0800 Subject: [PATCH 098/164] block: clarify that bio_add_page() and related helpers can add multi pages bio_add_page() and __bio_add_page() are capable of adding pages into bio, and now we have at least two such usages alreay: - __bio_iov_bvec_add_pages() - nvmet_bdev_execute_rw(). So update comments on these two helpers. The thing is a bit special for __bio_try_merge_page(), given the caller needs to know if the new added page is same with the last added page, then it isn't safe to pass multi-page in case that 'same_page' is true, so adds warning on potential misuse, and updates comment on __bio_try_merge_page(). Cc: linux-xfs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/bio.c | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/block/bio.c b/block/bio.c index 5959141d4e46..c81ed2dfee53 100644 --- a/block/bio.c +++ b/block/bio.c @@ -667,6 +667,8 @@ static inline bool page_is_mergeable(const struct bio_vec *bv, return false; } + WARN_ON_ONCE(same_page && (len + off) > PAGE_SIZE); + return true; } @@ -786,9 +788,9 @@ EXPORT_SYMBOL(bio_add_pc_page); /** * __bio_try_merge_page - try appending data to an existing bvec. * @bio: destination bio - * @page: page to add + * @page: start page to add * @len: length of the data to add - * @off: offset of the data in @page + * @off: offset of the data relative to @page * @same_page: if %true only merge if the new data is in the same physical * page as the last segment of the bio. 
* @@ -796,6 +798,8 @@ EXPORT_SYMBOL(bio_add_pc_page); * a useful optimisation for file systems with a block size smaller than the * page size. * + * Warn if (@len, @off) crosses pages in case that @same_page is true. + * * Return %true on success or %false on failure. */ bool __bio_try_merge_page(struct bio *bio, struct page *page, @@ -818,11 +822,11 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, EXPORT_SYMBOL_GPL(__bio_try_merge_page); /** - * __bio_add_page - add page to a bio in a new segment + * __bio_add_page - add page(s) to a bio in a new segment * @bio: destination bio - * @page: page to add - * @len: length of the data to add - * @off: offset of the data in @page + * @page: start page to add + * @len: length of the data to add, may cross pages + * @off: offset of the data relative to @page, may cross pages * * Add the data at @page + @off to @bio as a new bvec. The caller must ensure * that @bio has space for another bvec. @@ -845,13 +849,13 @@ void __bio_add_page(struct bio *bio, struct page *page, EXPORT_SYMBOL_GPL(__bio_add_page); /** - * bio_add_page - attempt to add page to bio + * bio_add_page - attempt to add page(s) to bio * @bio: destination bio - * @page: page to add - * @len: vec entry length - * @offset: vec entry offset + * @page: start page to add + * @len: vec entry length, may cross pages + * @offset: vec entry offset relative to @page, may cross pages * - * Attempt to add a page to the bio_vec maplist. This will only fail + * Attempt to add page(s) to the bio_vec maplist. This will only fail * if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio. */ int bio_add_page(struct bio *bio, struct page *page, From 0257c0ed5ea3de3e32cb322852c4c40bc09d1b97 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Wed, 24 Apr 2019 19:01:46 +0800 Subject: [PATCH 099/164] block: don't run get_page() on pages from non-bvec iov iter The refcount has been increased for pages retrieved from non-bvec iov iter via __bio_iov_iter_get_pages(), so don't need to do that again. Otherwise, IO pages are leaked easily. Cc: Christoph Hellwig Reviewed-by: Chaitanya Kulkarni Fixes: 7321ecbfc7cf ("block: change how we get page references in bio_iov_iter_get_pages") Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/bio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/bio.c b/block/bio.c index c81ed2dfee53..662d45752ec5 100644 --- a/block/bio.c +++ b/block/bio.c @@ -992,7 +992,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) if (iov_iter_bvec_no_ref(iter)) bio_set_flag(bio, BIO_NO_PAGE_REF); - else + else if (is_bvec) bio_get_pages(bio); return bio->bi_vcnt ? 0 : ret; From 1568ee7e3c6305d9fbb2414bfd4b56e52853d42d Mon Sep 17 00:00:00 2001 From: Guoju Fang Date: Thu, 25 Apr 2019 00:48:26 +0800 Subject: [PATCH 100/164] bcache: fix crashes stopping bcache device before read miss done The bio from upper layer is considered completed when bio_complete() returns. In most scenarios bio_complete() is called in search_free(), but when read miss happens, the bio_compete() is called when backing device reading completed, while the struct search is still in use until cache inserting finished. If someone stops the bcache device just then, the device may be closed and released, but after cache inserting finished the struct search will access a freed struct cached_dev. This patch add the reference of bcache device before bio_complete() when read miss happens, and put it after the search is not used. 
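The idea can be sketched in plain user-space C (illustrative names only): pin the device with an extra reference before completing the bio, and drop it only after the asynchronous cache insert is done with it:

#include <stdio.h>

struct bcache_dev { int refs; int alive; };

static void get_dev(struct bcache_dev *d) { d->refs++; }

static void put_dev(struct bcache_dev *d)
{
    if (--d->refs == 0) {
        d->alive = 0;
        printf("device really freed\n");
    }
}

static void read_miss(struct bcache_dev *d)
{
    get_dev(d);                 /* pin before signalling completion upward */
    printf("bio completed to upper layer\n");

    /* the asynchronous cache insert still dereferences d here; the extra
     * reference keeps it alive even if the user stops the device now */
    printf("cache insert done, device alive=%d\n", d->alive);

    put_dev(d);                 /* only now may the device actually go away */
}

int main(void)
{
    struct bcache_dev dev = { .refs = 1, .alive = 1 };

    read_miss(&dev);
    put_dev(&dev);              /* the user stopping the bcache device */
    return 0;
}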
Signed-off-by: Guoju Fang Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/request.c | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index f101bfe8657a..f11123079fe0 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -706,14 +706,14 @@ static void search_free(struct closure *cl) { struct search *s = container_of(cl, struct search, cl); - atomic_dec(&s->d->c->search_inflight); + atomic_dec(&s->iop.c->search_inflight); if (s->iop.bio) bio_put(s->iop.bio); bio_complete(s); closure_debug_destroy(cl); - mempool_free(s, &s->d->c->search); + mempool_free(s, &s->iop.c->search); } static inline struct search *search_alloc(struct bio *bio, @@ -756,13 +756,13 @@ static void cached_dev_bio_complete(struct closure *cl) struct search *s = container_of(cl, struct search, cl); struct cached_dev *dc = container_of(s->d, struct cached_dev, disk); - search_free(cl); cached_dev_put(dc); + search_free(cl); } /* Process reads */ -static void cached_dev_cache_miss_done(struct closure *cl) +static void cached_dev_read_error_done(struct closure *cl) { struct search *s = container_of(cl, struct search, cl); @@ -800,7 +800,22 @@ static void cached_dev_read_error(struct closure *cl) closure_bio_submit(s->iop.c, bio, cl); } - continue_at(cl, cached_dev_cache_miss_done, NULL); + continue_at(cl, cached_dev_read_error_done, NULL); +} + +static void cached_dev_cache_miss_done(struct closure *cl) +{ + struct search *s = container_of(cl, struct search, cl); + struct bcache_device *d = s->d; + + if (s->iop.replace_collision) + bch_mark_cache_miss_collision(s->iop.c, s->d); + + if (s->iop.bio) + bio_free_pages(s->iop.bio); + + cached_dev_bio_complete(cl); + closure_put(&d->cl); } static void cached_dev_read_done(struct closure *cl) @@ -833,6 +848,7 @@ static void cached_dev_read_done(struct closure *cl) if (verify(dc) && s->recoverable && !s->read_dirty_data) bch_data_verify(dc, s->orig_bio); + closure_get(&dc->disk.cl); bio_complete(s); if (s->iop.bio && From 4e0c04ec3a304490a83d5c0355e64176acc9b4ba Mon Sep 17 00:00:00 2001 From: Guoju Fang Date: Thu, 25 Apr 2019 00:48:27 +0800 Subject: [PATCH 101/164] bcache: fix inaccurate result of unused buckets To get the amount of unused buckets in sysfs_priority_stats, the code count the buckets which GC_SECTORS_USED is zero. It's correct and should not be overwritten by the count of buckets which prio is zero. 
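A toy illustration, with made-up numbers, of why the two counts must not be mixed up: buckets with zero GC_SECTORS_USED are not the same set as buckets with zero prio:

#include <stdio.h>

struct bucket { unsigned prio, sectors_used; };

int main(void)
{
    struct bucket b[] = { {0, 12}, {0, 34}, {7, 0}, {3, 40} };
    unsigned unused = 0, zero_prio = 0;

    for (size_t i = 0; i < sizeof(b) / sizeof(b[0]); i++) {
        if (b[i].sectors_used == 0)
            unused++;           /* what sysfs_priority_stats should report */
        if (b[i].prio == 0)
            zero_prio++;        /* a different statistic entirely */
    }
    printf("unused=%u zero_prio=%u\n", unused, zero_prio);  /* 1 vs 2 */
    return 0;
}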
Signed-off-by: Guoju Fang Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/sysfs.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 17bae9c14ca0..6cd44d3cf906 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -996,8 +996,6 @@ SHOW(__bch_cache) !cached[n - 1]) --n; - unused = ca->sb.nbuckets - n; - while (cached < p + n && *cached == BTREE_PRIO) cached++, n--; From 78d4eb8ad9e1d413449d1b7a060f50b6efa81ebd Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Thu, 25 Apr 2019 00:48:28 +0800 Subject: [PATCH 102/164] bcache: avoid clang -Wunintialized warning clang has identified a code path in which it thinks a variable may be unused: drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] fifo_pop(&ca->free_inc, bucket); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop' #define fifo_pop(fifo, i) fifo_pop_front(fifo, (i)) ^~~~~~~~~~~~~~~~~~~~~~~~~ drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front' if (_r) { \ ^~ drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here allocator_wait(ca, bch_allocator_push(ca, bucket)); ^~~~~~ drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait' if (cond) \ ^~~~ drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true fifo_pop(&ca->free_inc, bucket); ^ drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop' #define fifo_pop(fifo, i) fifo_pop_front(fifo, (i)) ^ drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front' if (_r) { \ ^ drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning long bucket; ^ This cannot happen in practice because we only enter the loop if there is at least one element in the list. Slightly rearranging the code makes this clearer to both the reader and the compiler, which avoids the warning. Signed-off-by: Arnd Bergmann Reviewed-by: Nathan Chancellor Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/alloc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c index 5002838ea476..f8986effcb50 100644 --- a/drivers/md/bcache/alloc.c +++ b/drivers/md/bcache/alloc.c @@ -327,10 +327,11 @@ static int bch_allocator_thread(void *arg) * possibly issue discards to them, then we add the bucket to * the free list: */ - while (!fifo_empty(&ca->free_inc)) { + while (1) { long bucket; - fifo_pop(&ca->free_inc, bucket); + if (!fifo_pop(&ca->free_inc, bucket)) + break; if (ca->discard) { mutex_unlock(&ca->set->bucket_lock); From 792732d9852c0e4505aceff4631ea2168fd02480 Mon Sep 17 00:00:00 2001 From: Geliang Tang Date: Thu, 25 Apr 2019 00:48:29 +0800 Subject: [PATCH 103/164] bcache: use kmemdup_nul for CACHED_LABEL buffer This patch uses kmemdup_nul to create a NUL-terminated string from dc->sb.label. This is better than open coding it. With this, we can move env[2] initialization into env[] array to make code more elegant. 
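In user space the same cleanup would look roughly like this, with strndup() playing the role of kmemdup_nul() (sizes and label contents are invented):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SB_LABEL_SIZE 32

int main(void)
{
    char label[SB_LABEL_SIZE] = "fast-cache";   /* may lack a terminator if full */
    char *buf = strndup(label, SB_LABEL_SIZE);  /* bounded copy, always NUL-terminated */
    char env_entry[SB_LABEL_SIZE + 16];

    snprintf(env_entry, sizeof(env_entry), "CACHED_LABEL=%s", buf ? buf : "");
    printf("%s\n", env_entry);
    free(buf);
    return 0;
}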
Signed-off-by: Geliang Tang Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index a697a3a923cd..6e618cb6126c 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -906,21 +906,18 @@ static int cached_dev_status_update(void *arg) void bch_cached_dev_run(struct cached_dev *dc) { struct bcache_device *d = &dc->disk; - char buf[SB_LABEL_SIZE + 1]; + char *buf = kmemdup_nul(dc->sb.label, SB_LABEL_SIZE, GFP_KERNEL); char *env[] = { "DRIVER=bcache", kasprintf(GFP_KERNEL, "CACHED_UUID=%pU", dc->sb.uuid), - NULL, + kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf ? : ""), NULL, }; - memcpy(buf, dc->sb.label, SB_LABEL_SIZE); - buf[SB_LABEL_SIZE] = '\0'; - env[2] = kasprintf(GFP_KERNEL, "CACHED_LABEL=%s", buf); - if (atomic_xchg(&dc->running, 1)) { kfree(env[1]); kfree(env[2]); + kfree(buf); return; } @@ -944,6 +941,7 @@ void bch_cached_dev_run(struct cached_dev *dc) kobject_uevent_env(&disk_to_dev(d->disk)->kobj, KOBJ_CHANGE, env); kfree(env[1]); kfree(env[2]); + kfree(buf); if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") || sysfs_create_link(&disk_to_dev(d->disk)->kobj, &d->kobj, "bcache")) From 3a3947271cd6ce05c5b30c1250fa99de57410500 Mon Sep 17 00:00:00 2001 From: George Spelvin Date: Thu, 25 Apr 2019 00:48:30 +0800 Subject: [PATCH 104/164] bcache: Clean up bch_get_congested() There are a few nits in this function. They could in theory all be separate patches, but that's probably taking small commits too far. 1) I added a brief comment saying what it does. 2) I like to declare pointer parameters "const" where possible for documentation reasons. 3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming weight of a 32-bit random number (giving a random integer with mean 16 and variance 8). Passing by reference in a 64-bit variable is silly; just use hweight32(). 4) Its helper function fract_exp_two is unnecessarily tangled. Gcc can optimize the multiply by (1 << x) to a shift, but it can be written in a much more straightforward way at the cost of one more bit of internal precision. Some analysis reveals that this bit is always available. This shrinks the object code for fract_exp_two(x, 6) from 23 bytes: 0000000000000000 : 0: 89 f9 mov %edi,%ecx 2: c1 e9 06 shr $0x6,%ecx 5: b8 01 00 00 00 mov $0x1,%eax a: d3 e0 shl %cl,%eax c: 83 e7 3f and $0x3f,%edi f: d3 e7 shl %cl,%edi 11: c1 ef 06 shr $0x6,%edi 14: 01 f8 add %edi,%eax 16: c3 retq To 19: 0000000000000017 : 17: 89 f8 mov %edi,%eax 19: 83 e0 3f and $0x3f,%eax 1c: 83 c0 40 add $0x40,%eax 1f: 89 f9 mov %edi,%ecx 21: c1 e9 06 shr $0x6,%ecx 24: d3 e0 shl %cl,%eax 26: c1 e8 06 shr $0x6,%eax 29: c3 retq (Verified with 0 <= frac_bits <= 8, 0 <= x < 16< Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/request.c | 15 ++++++++------- drivers/md/bcache/request.h | 2 +- drivers/md/bcache/util.h | 26 +++++++++++++++++++------- 3 files changed, 28 insertions(+), 15 deletions(-) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index f11123079fe0..41adcd1546f1 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -329,12 +329,13 @@ void bch_data_insert(struct closure *cl) bch_data_insert_start(cl); } -/* Congested? */ - -unsigned int bch_get_congested(struct cache_set *c) +/* + * Congested? 
Return 0 (not congested) or the limit (in sectors) + * beyond which we should bypass the cache due to congestion. + */ +unsigned int bch_get_congested(const struct cache_set *c) { int i; - long rand; if (!c->congested_read_threshold_us && !c->congested_write_threshold_us) @@ -353,8 +354,7 @@ unsigned int bch_get_congested(struct cache_set *c) if (i > 0) i = fract_exp_two(i, 6); - rand = get_random_int(); - i -= bitmap_weight(&rand, BITS_PER_LONG); + i -= hweight32(get_random_u32()); return i > 0 ? i : 1; } @@ -376,7 +376,7 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio) { struct cache_set *c = dc->disk.c; unsigned int mode = cache_mode(dc); - unsigned int sectors, congested = bch_get_congested(c); + unsigned int sectors, congested; struct task_struct *task = current; struct io *i; @@ -412,6 +412,7 @@ static bool check_should_bypass(struct cached_dev *dc, struct bio *bio) goto rescale; } + congested = bch_get_congested(c); if (!congested && !dc->sequential_cutoff) goto rescale; diff --git a/drivers/md/bcache/request.h b/drivers/md/bcache/request.h index 721bf336ed1a..c64dbd7a91aa 100644 --- a/drivers/md/bcache/request.h +++ b/drivers/md/bcache/request.h @@ -33,7 +33,7 @@ struct data_insert_op { BKEY_PADDED(replace_key); }; -unsigned int bch_get_congested(struct cache_set *c); +unsigned int bch_get_congested(const struct cache_set *c); void bch_data_insert(struct closure *cl); void bch_cached_dev_request_init(struct cached_dev *dc); diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index 00aab6abcfe4..1fbced94e4cc 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -560,17 +560,29 @@ static inline uint64_t bch_crc64_update(uint64_t crc, return crc; } -/* Does linear interpolation between powers of two */ +/* + * A stepwise-linear pseudo-exponential. This returns 1 << (x >> + * frac_bits), with the less-significant bits filled in by linear + * interpolation. + * + * This can also be interpreted as a floating-point number format, + * where the low frac_bits are the mantissa (with implicit leading + * 1 bit), and the more significant bits are the exponent. + * The return value is 1.mantissa * 2^exponent. + * + * The way this is used, fract_bits is 6 and the largest possible + * input is CONGESTED_MAX-1 = 1023 (exponent 16, mantissa 0x1.fc), + * so the maximum output is 0x1fc00. + */ static inline unsigned int fract_exp_two(unsigned int x, unsigned int fract_bits) { - unsigned int fract = x & ~(~0 << fract_bits); + unsigned int mantissa = 1 << fract_bits; /* Implicit bit */ - x >>= fract_bits; - x = 1 << x; - x += (x * fract) >> fract_bits; - - return x; + mantissa += x & (mantissa - 1); + x >>= fract_bits; /* The exponent */ + /* Largest intermediate value 0x7f0000 */ + return mantissa << x >> fract_bits; } void bch_bio_map(struct bio *bio, void *base); From a4b732a248d12cbdb46999daf0bf288c011335eb Mon Sep 17 00:00:00 2001 From: Liang Chen Date: Thu, 25 Apr 2019 00:48:31 +0800 Subject: [PATCH 105/164] bcache: fix a race between cache register and cacheset unregister There is a race between cache device register and cache set unregister. For an already registered cache device, register_bcache will call bch_is_open to iterate through all cachesets and check every cache there. The race occurs if cache_set_free executes at the same time and clears the caches right before ca is dereferenced in bch_is_open_cache. To close the race, let's make sure the clean up work is protected by the bch_register_lock as well. 
This issue can be reproduced as follows, while true; do echo /dev/XXX> /sys/fs/bcache/register ; done& while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done & and results in the following oops, [ +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998 [ +0.000457] #PF error: [normal kernel read fault] [ +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0 [ +0.000388] Oops: 0000 [#1] SMP PTI [ +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6 [ +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 [ +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache] [ +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d [ +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202 [ +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000 [ +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001 [ +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a [ +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0 [ +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620 [ +0.000384] FS: 00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000 [ +0.000420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0 [ +0.000837] Call Trace: [ +0.000682] ? _cond_resched+0x10/0x20 [ +0.000691] ? __kmalloc+0x131/0x1b0 [ +0.000710] kernfs_fop_write+0xfa/0x170 [ +0.000733] __vfs_write+0x2e/0x190 [ +0.000688] ? inode_security+0x10/0x30 [ +0.000698] ? selinux_file_permission+0xd2/0x120 [ +0.000752] ? security_file_permission+0x2b/0x100 [ +0.000753] vfs_write+0xa8/0x1a0 [ +0.000676] ksys_write+0x4d/0xb0 [ +0.000699] do_syscall_64+0x3a/0xf0 [ +0.000692] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Liang Chen Cc: stable@vger.kernel.org Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 6e618cb6126c..53c5e3e0ac22 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1514,6 +1514,7 @@ static void cache_set_free(struct closure *cl) bch_btree_cache_free(c); bch_journal_free(c); + mutex_lock(&bch_register_lock); for_each_cache(ca, c, i) if (ca) { ca->set = NULL; @@ -1532,7 +1533,6 @@ static void cache_set_free(struct closure *cl) mempool_exit(&c->search); kfree(c->devices); - mutex_lock(&bch_register_lock); list_del(&c->list); mutex_unlock(&bch_register_lock); From 14215ee01f6377c81c25c2cecda729e8811d2826 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:32 +0800 Subject: [PATCH 106/164] bcache: move definition of 'int ret' out of macro read_bucket() 'int ret' is defined as a local variable inside macro read_bucket(). Since this macro is called multiple times, and following patches will use a 'int ret' variable in bch_journal_read(), this patch moves definition of 'int ret' from macro read_bucket() to range of function bch_journal_read(). 
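The shadowing pitfall this avoids can be shown with a small stand-alone example (GNU C statement expressions, as in the original macro; names are illustrative):

#include <stdio.h>

#define READ_ONE_BAD(v)  ({ int ret = (v); ret; })   /* declares its own ret */
#define READ_ONE_GOOD(v) ({ ret = (v); ret; })       /* uses the function's ret */

static int read_all_bad(void)
{
    int ret = 0;
    (void)READ_ONE_BAD(-5);    /* inner ret gets -5, outer ret stays 0 */
    return ret;
}

static int read_all_good(void)
{
    int ret = 0;
    (void)READ_ONE_GOOD(-5);   /* outer ret now carries the error */
    return ret;
}

int main(void)
{
    printf("bad=%d good=%d\n", read_all_bad(), read_all_good());
    return 0;
}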
Signed-off-by: Coly Li Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- drivers/md/bcache/journal.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index b2fd412715b1..6e18057d1d82 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -147,7 +147,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list) { #define read_bucket(b) \ ({ \ - int ret = journal_read_bucket(ca, list, b); \ + ret = journal_read_bucket(ca, list, b); \ __set_bit(b, bitmap); \ if (ret < 0) \ return ret; \ @@ -156,6 +156,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list) struct cache *ca; unsigned int iter; + int ret = 0; for_each_cache(ca, c, iter) { struct journal_device *ja = &ca->journal; @@ -267,7 +268,7 @@ bsearch: struct journal_replay, list)->j.seq; - return 0; + return ret; #undef read_bucket } From 1bee2addc0c8470c8aaa65ef0599eeae96dd88bc Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:33 +0800 Subject: [PATCH 107/164] bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() In journal_reclaim() ja->cur_idx of each cache will be update to reclaim available journal buckets. Variable 'int n' is used to count how many cache is successfully reclaimed, then n is set to c->journal.key by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache() loop will write the jset data onto each cache. The problem is, if all jouranl buckets on each cache is full, the following code in journal_reclaim(), 529 for_each_cache(ca, c, iter) { 530 struct journal_device *ja = &ca->journal; 531 unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets; 532 533 /* No space available on this device */ 534 if (next == ja->discard_idx) 535 continue; 536 537 ja->cur_idx = next; 538 k->ptr[n++] = MAKE_PTR(0, 539 bucket_to_sector(c, ca->sb.d[ja->cur_idx]), 540 ca->sb.nr_this_dev); 541 } 542 543 bkey_init(k); 544 SET_KEY_PTRS(k, n); If there is no available bucket to reclaim, the if() condition at line 534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS() will set KEY_PTRS field of c->journal.key to 0. Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in journal_write_unlocked() the journal data is written in following loop, 649 for (i = 0; i < KEY_PTRS(k); i++) { 650-671 submit journal data to cache device 672 } If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data won't be written to cache device here. If system crahed or rebooted before bkeys of the lost journal entries written into btree nodes, data corruption will be reported during bcache reload after rebooting the system. Indeed there is only one cache in a cache set, there is no need to set KEY_PTRS field in journal_reclaim() at all. But in order to keep the for_each_cache() logic consistent for now, this patch fixes the above problem by not setting 0 KEY_PTRS of journal key, if there is no bucket available to reclaim. 
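A user-space sketch of the guard (invented structures, not bcache's): the key is only re-initialised when at least one pointer was reclaimed, and the write path asserts it never sees zero pointers:

#include <assert.h>
#include <stdio.h>

struct key { unsigned nptrs; unsigned long ptr[8]; };

static unsigned reclaim(struct key *k, const int *free_bucket, unsigned ncaches)
{
    unsigned n = 0;

    for (unsigned i = 0; i < ncaches; i++)
        if (free_bucket[i] >= 0)          /* space available on this cache */
            k->ptr[n++] = free_bucket[i];

    if (n)                                /* the fix: never publish n == 0 */
        k->nptrs = n;
    return n;
}

static void write_journal(const struct key *k)
{
    assert(k->nptrs > 0);                 /* mirrors the new BUG_ON() */
    for (unsigned i = 0; i < k->nptrs; i++)
        printf("write jset copy to bucket %lu\n", k->ptr[i]);
}

int main(void)
{
    struct key k = { .nptrs = 1, .ptr = { 42 } };
    int no_space[2] = { -1, -1 };

    if (!reclaim(&k, no_space, 2))
        printf("nothing reclaimed, key left untouched (nptrs=%u)\n", k.nptrs);
    write_journal(&k);                    /* still writes the old, valid key */
    return 0;
}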
Signed-off-by: Coly Li Reviewed-by: Hannes Reinecke Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe --- drivers/md/bcache/journal.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index 6e18057d1d82..5180bed911ef 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -541,11 +541,11 @@ static void journal_reclaim(struct cache_set *c) ca->sb.nr_this_dev); } - bkey_init(k); - SET_KEY_PTRS(k, n); - - if (n) + if (n) { + bkey_init(k); + SET_KEY_PTRS(k, n); c->journal.blocks_free = c->sb.bucket_size >> c->block_bits; + } out: if (!journal_full(&c->journal)) __closure_wake_up(&c->journal.wait); @@ -672,6 +672,9 @@ static void journal_write_unlocked(struct closure *cl) ca->journal.seq[ca->journal.cur_idx] = w->data->seq; } + /* If KEY_PTRS(k) == 0, this jset gets lost in air */ + BUG_ON(i == 0); + atomic_dec_bug(&fifo_back(&c->journal.pin)); bch_journal_next(&c->journal); journal_reclaim(c); From ce3e4cfb59cb382f8e5ce359238aa580d4ae7778 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:34 +0800 Subject: [PATCH 108/164] bcache: add failure check to run_cache_set() for journal replay Currently run_cache_set() has no return value, if there is failure in bch_journal_replay(), the caller of run_cache_set() has no idea about such failure and just continue to execute following code after run_cache_set(). The internal failure is triggered inside bch_journal_replay() and being handled in async way. This behavior is inefficient, while failure handling inside bch_journal_replay(), cache register code is still running to start the cache set. Registering and unregistering code running as same time may introduce some rare race condition, and make the code to be more hard to be understood. This patch adds return value to run_cache_set(), and returns -EIO if bch_journal_rreplay() fails. Then caller of run_cache_set() may detect such failure and stop registering code flow immedidately inside register_cache_set(). If journal replay fails, run_cache_set() can report error immediately to register_cache_set(). This patch makes the failure handling for bch_journal_replay() be in synchronized way, easier to understand and debug, and avoid poetential race condition for register-and-unregister in same time. 
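Schematically, the control-flow change looks like the following user-space sketch; the function names mirror the kernel ones but everything else is a stand-in:

#include <errno.h>
#include <stdio.h>

static int journal_replay_ok;            /* pretend input for the demo */

static int journal_replay(void) { return journal_replay_ok ? 0 : -EIO; }

static int run_cache_set(void)
{
    if (journal_replay()) {
        fprintf(stderr, "replay journal failed\n");
        return -EIO;                      /* previously handled asynchronously */
    }
    return 0;
}

static int register_cache_set(void)
{
    if (run_cache_set() < 0) {
        fprintf(stderr, "failed to run cache set\n");
        return -EIO;                      /* caller sees the failure right away */
    }
    printf("cache set running\n");
    return 0;
}

int main(void)
{
    journal_replay_ok = 0;
    register_cache_set();
    journal_replay_ok = 1;
    register_cache_set();
    return 0;
}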
Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 53c5e3e0ac22..8c7fdada0acf 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1773,7 +1773,7 @@ err: return NULL; } -static void run_cache_set(struct cache_set *c) +static int run_cache_set(struct cache_set *c) { const char *err = "cannot allocate memory"; struct cached_dev *dc, *t; @@ -1867,7 +1867,9 @@ static void run_cache_set(struct cache_set *c) if (j->version < BCACHE_JSET_VERSION_UUID) __uuid_write(c); - bch_journal_replay(c, &journal); + err = "bcache: replay journal failed"; + if (bch_journal_replay(c, &journal)) + goto err; } else { pr_notice("invalidating existing data"); @@ -1935,11 +1937,13 @@ static void run_cache_set(struct cache_set *c) flash_devs_run(c); set_bit(CACHE_SET_RUNNING, &c->flags); - return; + return 0; err: closure_sync(&cl); /* XXX: test this, it's broken */ bch_cache_set_error(c, "%s", err); + + return -EIO; } static bool can_attach_cache(struct cache *ca, struct cache_set *c) @@ -2003,8 +2007,11 @@ found: ca->set->cache[ca->sb.nr_this_dev] = ca; c->cache_by_alloc[c->caches_loaded++] = ca; - if (c->caches_loaded == c->sb.nr_in_set) - run_cache_set(c); + if (c->caches_loaded == c->sb.nr_in_set) { + err = "failed to run cache set"; + if (run_cache_set(c) < 0) + goto err; + } return NULL; err: From 2d17456eb1cc78803b999fdd503c2dbd42a7d3da Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:35 +0800 Subject: [PATCH 109/164] bcache: add comments for kobj release callback routine Bcache has several routines to release resources in implicit way, they are called when the associated kobj released. This patch adds code comments to notice when and which release callback will be called, - When dc->disk.kobj released: void bch_cached_dev_release(struct kobject *kobj) - When d->kobj released: void bch_flash_dev_release(struct kobject *kobj) - When c->kobj released: void bch_cache_set_release(struct kobject *kobj) - When ca->kobj released void bch_cache_release(struct kobject *kobj) Signed-off-by: Coly Li Reviewed-by: Chaitanya Kulkarni Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 8c7fdada0acf..f8d80adcafec 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1172,6 +1172,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, return 0; } +/* when dc->disk.kobj released */ void bch_cached_dev_release(struct kobject *kobj) { struct cached_dev *dc = container_of(kobj, struct cached_dev, @@ -1324,6 +1325,7 @@ err: /* Flash only volumes */ +/* When d->kobj released */ void bch_flash_dev_release(struct kobject *kobj) { struct bcache_device *d = container_of(kobj, struct bcache_device, @@ -1494,6 +1496,7 @@ bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...) 
return true; } +/* When c->kobj released */ void bch_cache_set_release(struct kobject *kobj) { struct cache_set *c = container_of(kobj, struct cache_set, kobj); @@ -2021,6 +2024,7 @@ err: /* Cache device */ +/* When ca->kobj released */ void bch_cache_release(struct kobject *kobj) { struct cache *ca = container_of(kobj, struct cache, kobj); From 68d10e6979a3b59e3cd2e90bfcafed79c4cf180a Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:36 +0800 Subject: [PATCH 110/164] bcache: return error immediately in bch_journal_replay() When failure happens inside bch_journal_replay(), calling cache_set_err_on() and handling the failure in async way is not a good idea. Because after bch_journal_replay() returns, registering code will continue to execute following steps, and unregistering code triggered by cache_set_err_on() is running in same time. First it is unnecessary to handle failure and unregister cache set in an async way, second there might be potential race condition to run register and unregister code for same cache set. So in this patch, if failure happens in bch_journal_replay(), we don't call cache_set_err_on(), and just print out the same error message to kernel message buffer, then return -EIO immediately caller. Then caller can detect such failure and handle it in synchrnozied way. Signed-off-by: Coly Li Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- drivers/md/bcache/journal.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index 5180bed911ef..828ab474696a 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -331,9 +331,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list) list_for_each_entry(i, list, list) { BUG_ON(i->pin && atomic_read(i->pin) != 1); - cache_set_err_on(n != i->j.seq, s, -"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)", - n, i->j.seq - 1, start, end); + if (n != i->j.seq) { + pr_err("bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)", + n, i->j.seq - 1, start, end); + ret = -EIO; + goto err; + } for (k = i->j.start; k < bset_bkey_last(&i->j); From 88c12d42d2bb6e05deb3cfd24d12f6fe80544575 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:37 +0800 Subject: [PATCH 111/164] bcache: add error check for calling register_bdev() This patch adds return value to register_bdev(). Then if failure happens inside register_bdev(), its caller register_bcache() may detect and handle the failure more properly. 
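A rough user-space sketch of the resulting ownership rule (all names invented): the caller acts on the returned error code, while the release path, not the caller, closes the underlying device handle:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct bdev { const char *name; };
struct cached_dev { struct bdev *bdev; };

static void bdev_put(struct bdev *b) { printf("closing %s\n", b->name); }

/* release path: the only place that closes the handle once it is attached */
static void cached_dev_free(struct cached_dev *dc)
{
    if (dc->bdev)
        bdev_put(dc->bdev);
    free(dc);
}

static int register_bdev(struct cached_dev *dc, struct bdev *b, int simulate_error)
{
    dc->bdev = b;                    /* handle now owned by the cached_dev */
    if (simulate_error)
        return -EIO;                 /* caller sees the failure immediately */
    printf("registered %s\n", b->name);
    return 0;
}

int main(void)
{
    struct bdev disk = { "sdb" };
    struct cached_dev *dc = calloc(1, sizeof(*dc));

    if (!dc)
        return 1;
    if (register_bdev(dc, &disk, 1) < 0)
        cached_dev_free(dc);         /* teardown goes through the release path */
    return 0;
}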
Signed-off-by: Coly Li Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index f8d80adcafec..fde334939545 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1279,7 +1279,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size) /* Cached device - bcache superblock */ -static void register_bdev(struct cache_sb *sb, struct page *sb_page, +static int register_bdev(struct cache_sb *sb, struct page *sb_page, struct block_device *bdev, struct cached_dev *dc) { @@ -1317,10 +1317,11 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page, BDEV_STATE(&dc->sb) == BDEV_STATE_STALE) bch_cached_dev_run(dc); - return; + return 0; err: pr_notice("error %s: %s", dc->backing_dev_name, err); bcache_device_stop(&dc->disk); + return -EIO; } /* Flash only volumes */ @@ -2271,7 +2272,7 @@ static bool bch_is_open(struct block_device *bdev) static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, const char *buffer, size_t size) { - ssize_t ret = size; + ssize_t ret = -EINVAL; const char *err = "cannot allocate memory"; char *path = NULL; struct cache_sb *sb = NULL; @@ -2305,7 +2306,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, if (!IS_ERR(bdev)) bdput(bdev); if (attr == &ksysfs_register_quiet) - goto out; + goto quiet_out; } goto err; } @@ -2326,8 +2327,10 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, goto err_close; mutex_lock(&bch_register_lock); - register_bdev(sb, sb_page, bdev, dc); + ret = register_bdev(sb, sb_page, bdev, dc); mutex_unlock(&bch_register_lock); + if (ret < 0) + goto err; } else { struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL); @@ -2337,6 +2340,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, if (register_cache(sb, sb_page, bdev, ca) != 0) goto err; } +quiet_out: + ret = size; out: if (sb_page) put_page(sb_page); @@ -2349,7 +2354,6 @@ err_close: blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL); err: pr_info("error %s: %s", path, err); - ret = -EINVAL; goto out; } From bb6d355c2aff42d4075a8e7428dd72cb009d6143 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:38 +0800 Subject: [PATCH 112/164] bcache: Add comments for blkdev_put() in registration code path Add comments to explain why in register_bcache() blkdev_put() won't be called in two location. Add comments to explain why blkdev_put() must be called in register_cache() when cache_alloc() failed. Signed-off-by: Coly Li Reviewed-by: Chaitanya Kulkarni Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index fde334939545..fa856b2ca7af 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2189,6 +2189,12 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page, ret = cache_alloc(ca); if (ret != 0) { + /* + * If we failed here, it means ca->kobj is not initialized yet, + * kobject_put() won't be called and there is no chance to + * call blkdev_put() to bdev in bch_cache_release(). So we + * explicitly call blkdev_put() here. 
+ */ blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL); if (ret == -ENOMEM) err = "cache_alloc(): -ENOMEM"; @@ -2329,6 +2335,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, mutex_lock(&bch_register_lock); ret = register_bdev(sb, sb_page, bdev, dc); mutex_unlock(&bch_register_lock); + /* blkdev_put() will be called in cached_dev_free() */ if (ret < 0) goto err; } else { @@ -2337,6 +2344,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, if (!ca) goto err_close; + /* blkdev_put() will be called in bch_cache_release() */ if (register_cache(sb, sb_page, bdev, ca) != 0) goto err; } From 63d63b51d70fb5155754dcf0baa2c1700bcafcb0 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:39 +0800 Subject: [PATCH 113/164] bcache: add comments for closure_fn to be called in closure_queue() Add code comments to explain which call back function might be called for the closure_queue(). This is an effort to make code to be more understandable for readers. Signed-off-by: Coly Li Reviewed-by: Chaitanya Kulkarni Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index fa856b2ca7af..0363ab534c8e 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -662,6 +662,11 @@ static const struct block_device_operations bcache_ops = { void bcache_device_stop(struct bcache_device *d) { if (!test_and_set_bit(BCACHE_DEV_CLOSING, &d->flags)) + /* + * closure_fn set to + * - cached device: cached_dev_flush() + * - flash dev: flash_dev_flush() + */ closure_queue(&d->cl); } @@ -1675,6 +1680,7 @@ static void __cache_set_unregister(struct closure *cl) void bch_cache_set_stop(struct cache_set *c) { if (!test_and_set_bit(CACHE_SET_STOPPING, &c->flags)) + /* closure_fn set to __cache_set_unregister() */ closure_queue(&c->caching); } From eb8cbb6df38f6e5124a3d5f1f8a3dbf519537c60 Mon Sep 17 00:00:00 2001 From: Coly Li Date: Thu, 25 Apr 2019 00:48:40 +0800 Subject: [PATCH 114/164] bcache: improve bcache_reboot() This patch tries to release mutex bch_register_lock early, to give chance to stop cache set and bcache device early. This patch also expends time out of stopping all bcache device from 2 seconds to 10 seconds, because stopping writeback rate update worker may delay for 5 seconds, 2 seconds is not enough. After this patch applied, stopping bcache devices during system reboot or shutdown is very hard to be observed any more. Signed-off-by: Coly Li Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 0363ab534c8e..3f34b96ebbc3 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -2397,10 +2397,19 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x) list_for_each_entry_safe(dc, tdc, &uncached_devices, list) bcache_device_stop(&dc->disk); + mutex_unlock(&bch_register_lock); + + /* + * Give an early chance for other kthreads and + * kworkers to stop themselves + */ + schedule(); + /* What's a condition variable? 
*/ while (1) { - long timeout = start + 2 * HZ - jiffies; + long timeout = start + 10 * HZ - jiffies; + mutex_lock(&bch_register_lock); stopped = list_empty(&bch_cache_sets) && list_empty(&uncached_devices); @@ -2412,7 +2421,6 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x) mutex_unlock(&bch_register_lock); schedule_timeout(timeout); - mutex_lock(&bch_register_lock); } finish_wait(&unregister_wait, &wait); From 631207314d88e9091be02fbdd1fdadb1ae2ed79a Mon Sep 17 00:00:00 2001 From: Tang Junhui Date: Thu, 25 Apr 2019 00:48:41 +0800 Subject: [PATCH 115/164] bcache: fix failure in journal relplay journal replay failed with messages: Sep 10 19:10:43 ceph kernel: bcache: error on bb379a64-e44e-4812-b91d-a5599871a3b1: bcache: journal entries 2057493-2057567 missing! (replaying 2057493-2076601), disabling caching The reason is in journal_reclaim(), when discard is enabled, we send discard command and reclaim those journal buckets whose seq is old than the last_seq_now, but before we write a journal with last_seq_now, the machine is restarted, so the journal with the last_seq_now is not written to the journal bucket, and the last_seq_wrote in the newest journal is old than last_seq_now which we expect to be, so when we doing replay, journals from last_seq_wrote to last_seq_now are missing. It's hard to write a journal immediately after journal_reclaim(), and it harmless if those missed journal are caused by discarding since those journals are already wrote to btree node. So, if miss seqs are started from the beginning journal, we treat it as normal, and only print a message to show the miss journal, and point out it maybe caused by discarding. Patch v2 add a judgement condition to ignore the missed journal only when discard enabled as Coly suggested. (Coly Li: rebase the patch with other changes in bch_journal_replay()) Signed-off-by: Tang Junhui Tested-by: Dennis Schridde Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/journal.c | 25 +++++++++++++++++++++---- 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index 828ab474696a..f9afb164b887 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -318,6 +318,18 @@ void bch_journal_mark(struct cache_set *c, struct list_head *list) } } +bool is_discard_enabled(struct cache_set *s) +{ + struct cache *ca; + unsigned int i; + + for_each_cache(ca, s, i) + if (ca->discard) + return true; + + return false; +} + int bch_journal_replay(struct cache_set *s, struct list_head *list) { int ret = 0, keys = 0, entries = 0; @@ -332,10 +344,15 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list) BUG_ON(i->pin && atomic_read(i->pin) != 1); if (n != i->j.seq) { - pr_err("bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)", - n, i->j.seq - 1, start, end); - ret = -EIO; - goto err; + if (n == start && is_discard_enabled(s)) + pr_info("bcache: journal entries %llu-%llu may be discarded! (replaying %llu-%llu)", + n, i->j.seq - 1, start, end); + else { + pr_err("bcache: journal entries %llu-%llu missing! 
(replaying %llu-%llu)", + n, i->j.seq - 1, start, end); + ret = -EIO; + goto err; + } } for (k = i->j.start; From f16277ca20acf2c213fcd4b645f4c1cffcadf533 Mon Sep 17 00:00:00 2001 From: Shenghui Wang Date: Thu, 25 Apr 2019 00:48:42 +0800 Subject: [PATCH 116/164] bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce Elements of keylist should be accessed before the list is freed. Move bch_keylist_free() calling after the while loop to avoid wrong content accessed. Signed-off-by: Shenghui Wang Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/btree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 64def336f053..b139858b0802 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -1476,11 +1476,11 @@ static int btree_gc_coalesce(struct btree *b, struct btree_op *op, out_nocoalesce: closure_sync(&cl); - bch_keylist_free(&keylist); while ((k = bch_keylist_pop(&keylist))) if (!bkey_cmp(k, &ZERO_KEY)) atomic_dec(&b->c->prio_blocked); + bch_keylist_free(&keylist); for (i = 0; i < nodes; i++) if (!IS_ERR_OR_NULL(new_nodes[i])) { From 95f18c9d1310730d075499a75aaf13bcd60405a7 Mon Sep 17 00:00:00 2001 From: Shenghui Wang Date: Thu, 25 Apr 2019 00:48:43 +0800 Subject: [PATCH 117/164] bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used to collect journal_replay(s) and filled by bch_journal_read(). If all goes well, bch_journal_replay() will release the list of jounal_replay(s) at the end of the branch. If something goes wrong, code flow will jump to the label "err:" and leave the list unreleased. This patch will release the list of journal_replay(s) in the case of error detected. v1 -> v2: * Move the release code to the location after label 'err:' to simply the change. Signed-off-by: Shenghui Wang Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 3f34b96ebbc3..0ffe9acee9d8 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1790,6 +1790,8 @@ static int run_cache_set(struct cache_set *c) struct cache *ca; struct closure cl; unsigned int i; + LIST_HEAD(journal); + struct journal_replay *l; closure_init_stack(&cl); @@ -1949,6 +1951,12 @@ static int run_cache_set(struct cache_set *c) set_bit(CACHE_SET_RUNNING, &c->flags); return 0; err: + while (!list_empty(&journal)) { + l = list_first_entry(&journal, struct journal_replay, list); + list_del(&l->list); + kfree(l); + } + closure_sync(&cl); /* XXX: test this, it's broken */ bch_cache_set_error(c, "%s", err); From 8dc2ed3f3e5ba245828ad89968f6818be8996e9d Mon Sep 17 00:00:00 2001 From: Max Gurtovoy Date: Mon, 8 Apr 2019 18:39:58 +0300 Subject: [PATCH 118/164] nvmet-rdma: remove p2p_client initialization from fast-path Initialize it during command allocation. 
Cc: Logan Gunthorpe Cc: Stephen Bates Signed-off-by: Max Gurtovoy Reviewed-by: Logan Gunthorpe Signed-off-by: Christoph Hellwig --- drivers/nvme/target/rdma.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c index ef893addf341..b7275218dfa5 100644 --- a/drivers/nvme/target/rdma.c +++ b/drivers/nvme/target/rdma.c @@ -373,6 +373,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev, if (ib_dma_mapping_error(ndev->device, r->send_sge.addr)) goto out_free_rsp; + r->req.p2p_client = &ndev->device->dev; r->send_sge.length = sizeof(*r->req.rsp); r->send_sge.lkey = ndev->pd->local_dma_lkey; @@ -763,8 +764,6 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue, cmd->send_sge.addr, cmd->send_sge.length, DMA_TO_DEVICE); - cmd->req.p2p_client = &queue->dev->device->dev; - if (!nvmet_req_init(&cmd->req, &queue->nvme_cq, &queue->nvme_sq, &nvmet_rdma_ops)) return; From fc6c9730725d5cc57c851d0e261a5682bba913a7 Mon Sep 17 00:00:00 2001 From: Max Gurtovoy Date: Mon, 8 Apr 2019 18:39:59 +0300 Subject: [PATCH 119/164] nvmet: rename nvme_completion instances from rsp to cqe Use NVMe namings for improving code readability. Signed-off-by: Max Gurtovoy Reviewed-by : Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/core.c | 22 +++++++++++----------- drivers/nvme/target/fabrics-cmd.c | 16 ++++++++-------- drivers/nvme/target/fc.c | 2 +- drivers/nvme/target/loop.c | 6 +++--- drivers/nvme/target/nvmet.h | 4 ++-- drivers/nvme/target/rdma.c | 18 +++++++++--------- drivers/nvme/target/tcp.c | 8 ++++---- 7 files changed, 38 insertions(+), 38 deletions(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 4d8dd29479c0..1c1776c3e316 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -647,7 +647,7 @@ static void nvmet_update_sq_head(struct nvmet_req *req) } while (cmpxchg(&req->sq->sqhd, old_sqhd, new_sqhd) != old_sqhd); } - req->rsp->sq_head = cpu_to_le16(req->sq->sqhd & 0x0000FFFF); + req->cqe->sq_head = cpu_to_le16(req->sq->sqhd & 0x0000FFFF); } static void nvmet_set_error(struct nvmet_req *req, u16 status) @@ -656,7 +656,7 @@ static void nvmet_set_error(struct nvmet_req *req, u16 status) struct nvme_error_slot *new_error_slot; unsigned long flags; - req->rsp->status = cpu_to_le16(status << 1); + req->cqe->status = cpu_to_le16(status << 1); if (!ctrl || req->error_loc == NVMET_NO_ERROR_LOC) return; @@ -676,15 +676,15 @@ static void nvmet_set_error(struct nvmet_req *req, u16 status) spin_unlock_irqrestore(&ctrl->error_lock, flags); /* set the more bit for this request */ - req->rsp->status |= cpu_to_le16(1 << 14); + req->cqe->status |= cpu_to_le16(1 << 14); } static void __nvmet_req_complete(struct nvmet_req *req, u16 status) { if (!req->sq->sqhd_disabled) nvmet_update_sq_head(req); - req->rsp->sq_id = cpu_to_le16(req->sq->qid); - req->rsp->command_id = req->cmd->common.command_id; + req->cqe->sq_id = cpu_to_le16(req->sq->qid); + req->cqe->command_id = req->cmd->common.command_id; if (unlikely(status)) nvmet_set_error(req, status); @@ -841,8 +841,8 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq, req->sg = NULL; req->sg_cnt = 0; req->transfer_len = 0; - req->rsp->status = 0; - req->rsp->sq_head = 0; + req->cqe->status = 0; + req->cqe->sq_head = 0; req->ns = NULL; req->error_loc = NVMET_NO_ERROR_LOC; req->error_slba = 0; @@ -1069,7 +1069,7 @@ u16 nvmet_ctrl_find_get(const char *subsysnqn, const char *hostnqn, u16 cntlid, if 
(!subsys) { pr_warn("connect request for invalid subsystem %s!\n", subsysnqn); - req->rsp->result.u32 = IPO_IATTR_CONNECT_DATA(subsysnqn); + req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(subsysnqn); return NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; } @@ -1090,7 +1090,7 @@ u16 nvmet_ctrl_find_get(const char *subsysnqn, const char *hostnqn, u16 cntlid, pr_warn("could not find controller %d for subsys %s / host %s\n", cntlid, subsysnqn, hostnqn); - req->rsp->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid); + req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid); status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; out: @@ -1188,7 +1188,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, if (!subsys) { pr_warn("connect request for invalid subsystem %s!\n", subsysnqn); - req->rsp->result.u32 = IPO_IATTR_CONNECT_DATA(subsysnqn); + req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(subsysnqn); goto out; } @@ -1197,7 +1197,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, if (!nvmet_host_allowed(subsys, hostnqn)) { pr_info("connect by host %s for subsystem %s not allowed\n", hostnqn, subsysnqn); - req->rsp->result.u32 = IPO_IATTR_CONNECT_DATA(hostnqn); + req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(hostnqn); up_read(&nvmet_config_sem); status = NVME_SC_CONNECT_INVALID_HOST | NVME_SC_DNR; goto out_put_subsystem; diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c index 3a76ebc3d155..3b9f79aba98f 100644 --- a/drivers/nvme/target/fabrics-cmd.c +++ b/drivers/nvme/target/fabrics-cmd.c @@ -72,7 +72,7 @@ static void nvmet_execute_prop_get(struct nvmet_req *req) offsetof(struct nvmf_property_get_command, attrib); } - req->rsp->result.u64 = cpu_to_le64(val); + req->cqe->result.u64 = cpu_to_le64(val); nvmet_req_complete(req, status); } @@ -124,7 +124,7 @@ static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req) if (c->cattr & NVME_CONNECT_DISABLE_SQFLOW) { req->sq->sqhd_disabled = true; - req->rsp->sq_head = cpu_to_le16(0xffff); + req->cqe->sq_head = cpu_to_le16(0xffff); } if (ctrl->ops->install_queue) { @@ -158,7 +158,7 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req) goto out; /* zero out initial completion result, assign values as needed */ - req->rsp->result.u32 = 0; + req->cqe->result.u32 = 0; if (c->recfmt != 0) { pr_warn("invalid connect version (%d).\n", @@ -172,7 +172,7 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req) pr_warn("connect attempt for invalid controller ID %#x\n", d->cntlid); status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; - req->rsp->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid); + req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid); goto out; } @@ -195,7 +195,7 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req) pr_info("creating controller %d for subsystem %s for NQN %s.\n", ctrl->cntlid, ctrl->subsys->subsysnqn, ctrl->hostnqn); - req->rsp->result.u16 = cpu_to_le16(ctrl->cntlid); + req->cqe->result.u16 = cpu_to_le16(ctrl->cntlid); out: kfree(d); @@ -222,7 +222,7 @@ static void nvmet_execute_io_connect(struct nvmet_req *req) goto out; /* zero out initial completion result, assign values as needed */ - req->rsp->result.u32 = 0; + req->cqe->result.u32 = 0; if (c->recfmt != 0) { pr_warn("invalid connect version (%d).\n", @@ -240,14 +240,14 @@ static void nvmet_execute_io_connect(struct nvmet_req *req) if (unlikely(qid > ctrl->subsys->max_qid)) { pr_warn("invalid queue id (%d)\n", qid); status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; - 
req->rsp->result.u32 = IPO_IATTR_CONNECT_SQE(qid); + req->cqe->result.u32 = IPO_IATTR_CONNECT_SQE(qid); goto out_ctrl_put; } status = nvmet_install_queue(ctrl, req); if (status) { /* pass back cntlid that had the issue of installing queue */ - req->rsp->result.u16 = cpu_to_le16(ctrl->cntlid); + req->cqe->result.u16 = cpu_to_le16(ctrl->cntlid); goto out_ctrl_put; } diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c index 9369a11fe7a9..508661af0f50 100644 --- a/drivers/nvme/target/fc.c +++ b/drivers/nvme/target/fc.c @@ -2184,7 +2184,7 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport, } fod->req.cmd = &fod->cmdiubuf.sqe; - fod->req.rsp = &fod->rspiubuf.cqe; + fod->req.cqe = &fod->rspiubuf.cqe; fod->req.port = tgtport->pe->port; /* clear any response payload */ diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c index b9f623ab01f3..a3ae491fa20e 100644 --- a/drivers/nvme/target/loop.c +++ b/drivers/nvme/target/loop.c @@ -18,7 +18,7 @@ struct nvme_loop_iod { struct nvme_request nvme_req; struct nvme_command cmd; - struct nvme_completion rsp; + struct nvme_completion cqe; struct nvmet_req req; struct nvme_loop_queue *queue; struct work_struct work; @@ -94,7 +94,7 @@ static void nvme_loop_queue_response(struct nvmet_req *req) { struct nvme_loop_queue *queue = container_of(req->sq, struct nvme_loop_queue, nvme_sq); - struct nvme_completion *cqe = req->rsp; + struct nvme_completion *cqe = req->cqe; /* * AEN requests are special as they don't time out and can @@ -207,7 +207,7 @@ static int nvme_loop_init_iod(struct nvme_loop_ctrl *ctrl, struct nvme_loop_iod *iod, unsigned int queue_idx) { iod->req.cmd = &iod->cmd; - iod->req.rsp = &iod->rsp; + iod->req.cqe = &iod->cqe; iod->queue = &ctrl->queues[queue_idx]; INIT_WORK(&iod->work, nvme_loop_execute_work); return 0; diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 1653d19b187f..c25d88fc9dec 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -284,7 +284,7 @@ struct nvmet_fabrics_ops { struct nvmet_req { struct nvme_command *cmd; - struct nvme_completion *rsp; + struct nvme_completion *cqe; struct nvmet_sq *sq; struct nvmet_cq *cq; struct nvmet_ns *ns; @@ -322,7 +322,7 @@ extern struct workqueue_struct *buffered_io_wq; static inline void nvmet_set_result(struct nvmet_req *req, u32 result) { - req->rsp->result.u32 = cpu_to_le32(result); + req->cqe->result.u32 = cpu_to_le32(result); } /* diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c index b7275218dfa5..36d906a7f70d 100644 --- a/drivers/nvme/target/rdma.c +++ b/drivers/nvme/target/rdma.c @@ -160,7 +160,7 @@ static inline bool nvmet_rdma_need_data_out(struct nvmet_rdma_rsp *rsp) { return !nvme_is_write(rsp->req.cmd) && rsp->req.transfer_len && - !rsp->req.rsp->status && + !rsp->req.cqe->status && !(rsp->flags & NVMET_RDMA_REQ_INLINE_DATA); } @@ -364,17 +364,17 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev, struct nvmet_rdma_rsp *r) { /* NVMe CQE / RDMA SEND */ - r->req.rsp = kmalloc(sizeof(*r->req.rsp), GFP_KERNEL); - if (!r->req.rsp) + r->req.cqe = kmalloc(sizeof(*r->req.cqe), GFP_KERNEL); + if (!r->req.cqe) goto out; - r->send_sge.addr = ib_dma_map_single(ndev->device, r->req.rsp, - sizeof(*r->req.rsp), DMA_TO_DEVICE); + r->send_sge.addr = ib_dma_map_single(ndev->device, r->req.cqe, + sizeof(*r->req.cqe), DMA_TO_DEVICE); if (ib_dma_mapping_error(ndev->device, r->send_sge.addr)) goto out_free_rsp; r->req.p2p_client = &ndev->device->dev; - r->send_sge.length = 
sizeof(*r->req.rsp); + r->send_sge.length = sizeof(*r->req.cqe); r->send_sge.lkey = ndev->pd->local_dma_lkey; r->send_cqe.done = nvmet_rdma_send_done; @@ -389,7 +389,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev, return 0; out_free_rsp: - kfree(r->req.rsp); + kfree(r->req.cqe); out: return -ENOMEM; } @@ -398,8 +398,8 @@ static void nvmet_rdma_free_rsp(struct nvmet_rdma_device *ndev, struct nvmet_rdma_rsp *r) { ib_dma_unmap_single(ndev->device, r->send_sge.addr, - sizeof(*r->req.rsp), DMA_TO_DEVICE); - kfree(r->req.rsp); + sizeof(*r->req.cqe), DMA_TO_DEVICE); + kfree(r->req.cqe); } static int diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c index 0a941abf56ec..17cf137dc88c 100644 --- a/drivers/nvme/target/tcp.c +++ b/drivers/nvme/target/tcp.c @@ -161,14 +161,14 @@ static inline bool nvmet_tcp_has_data_in(struct nvmet_tcp_cmd *cmd) static inline bool nvmet_tcp_need_data_in(struct nvmet_tcp_cmd *cmd) { - return nvmet_tcp_has_data_in(cmd) && !cmd->req.rsp->status; + return nvmet_tcp_has_data_in(cmd) && !cmd->req.cqe->status; } static inline bool nvmet_tcp_need_data_out(struct nvmet_tcp_cmd *cmd) { return !nvme_is_write(cmd->req.cmd) && cmd->req.transfer_len > 0 && - !cmd->req.rsp->status; + !cmd->req.cqe->status; } static inline bool nvmet_tcp_has_inline_data(struct nvmet_tcp_cmd *cmd) @@ -378,7 +378,7 @@ static void nvmet_setup_c2h_data_pdu(struct nvmet_tcp_cmd *cmd) pdu->hdr.plen = cpu_to_le32(pdu->hdr.hlen + hdgst + cmd->req.transfer_len + ddgst); - pdu->command_id = cmd->req.rsp->command_id; + pdu->command_id = cmd->req.cqe->command_id; pdu->data_length = cpu_to_le32(cmd->req.transfer_len); pdu->data_offset = cpu_to_le32(cmd->wbytes_done); @@ -1224,7 +1224,7 @@ static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue, sizeof(*c->rsp_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO); if (!c->rsp_pdu) goto out_free_cmd; - c->req.rsp = &c->rsp_pdu->cqe; + c->req.cqe = &c->rsp_pdu->cqe; c->data_pdu = page_frag_alloc(&queue->pf_cache, sizeof(*c->data_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO); From 6b7e631b927ca1266b2695307ab71ed7764af75e Mon Sep 17 00:00:00 2001 From: Minwoo Im Date: Sun, 7 Apr 2019 15:28:06 +0900 Subject: [PATCH 120/164] nvmet: return a specified error it subsys_alloc fails nvmet_subsys_alloc() returns its pointer or NULL if it fails. We can see three different steps in this function: 1. memory allocation 2. argument check 3. memory allocation for string But now the callers of this function do not seem to handle case 2 by returning -ENOMEM only even if it fails with an invalid parameter. This patch specifies error codes so that caller can pass it to its own caller. Signed-off-by: Minwoo Im Reviewed-by: Chaitanya Kulkarni . 
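To make the new calling convention concrete, here is a minimal sketch of how callers are expected to consume it (illustrative only; the real conversions are in the diff below):

	/* Allocator side: return a distinct errno wrapped in an error pointer. */
	if (!subsys)
		return ERR_PTR(-ENOMEM);

	/* A caller that itself returns a pointer propagates the error pointer: */
	subsys = nvmet_subsys_alloc(name, NVME_NQN_NVME);
	if (IS_ERR(subsys))
		return ERR_CAST(subsys);

	/* A caller that returns an int unwraps the errno instead of
	 * hard-coding -ENOMEM: */
	nvmet_disc_subsys = nvmet_subsys_alloc(NVME_DISC_SUBSYS_NAME, NVME_NQN_DISC);
	if (IS_ERR(nvmet_disc_subsys))
		return PTR_ERR(nvmet_disc_subsys);
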
Signed-off-by: Christoph Hellwig --- drivers/nvme/target/configfs.c | 4 ++-- drivers/nvme/target/core.c | 6 +++--- drivers/nvme/target/discovery.c | 4 ++-- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index adb79545cdd7..08dd5af357f7 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -898,8 +898,8 @@ static struct config_group *nvmet_subsys_make(struct config_group *group, } subsys = nvmet_subsys_alloc(name, NVME_NQN_NVME); - if (!subsys) - return ERR_PTR(-ENOMEM); + if (IS_ERR(subsys)) + return ERR_CAST(subsys); config_group_init_type_name(&subsys->group, name, &nvmet_subsys_type); diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 1c1776c3e316..24e0a36392d9 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -1367,7 +1367,7 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, subsys = kzalloc(sizeof(*subsys), GFP_KERNEL); if (!subsys) - return NULL; + return ERR_PTR(-ENOMEM); subsys->ver = NVME_VS(1, 3, 0); /* NVMe 1.3.0 */ /* generate a random serial number as our controllers are ephemeral: */ @@ -1383,14 +1383,14 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, default: pr_err("%s: Unknown Subsystem type - %d\n", __func__, type); kfree(subsys); - return NULL; + return ERR_PTR(-EINVAL); } subsys->type = type; subsys->subsysnqn = kstrndup(subsysnqn, NVMF_NQN_SIZE, GFP_KERNEL); if (!subsys->subsysnqn) { kfree(subsys); - return NULL; + return ERR_PTR(-ENOMEM); } kref_init(&subsys->ref); diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c index 33ed95e72d6b..e8e09266bfa5 100644 --- a/drivers/nvme/target/discovery.c +++ b/drivers/nvme/target/discovery.c @@ -372,8 +372,8 @@ int __init nvmet_init_discovery(void) { nvmet_disc_subsys = nvmet_subsys_alloc(NVME_DISC_SUBSYS_NAME, NVME_NQN_DISC); - if (!nvmet_disc_subsys) - return -ENOMEM; + if (IS_ERR(nvmet_disc_subsys)) + return PTR_ERR(nvmet_disc_subsys); return 0; } From a5dffbb66d250a7ef07e27a2e75b8d9d7af2ab41 Mon Sep 17 00:00:00 2001 From: "Enrico Weigelt, metux IT consult" Date: Wed, 24 Apr 2019 12:34:39 +0200 Subject: [PATCH 121/164] nvmet: include Build breaks: drivers/nvme/target/core.c: In function 'nvmet_req_alloc_sgl': drivers/nvme/target/core.c:939:12: error: implicit declaration of \ function 'sgl_alloc'; did you mean 'bio_alloc'? \ [-Werror=implicit-function-declaration] req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt); ^~~~~~~~~ bio_alloc drivers/nvme/target/core.c:939:10: warning: assignment makes pointer \ from integer without a cast [-Wint-conversion] req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt); ^ drivers/nvme/target/core.c: In function 'nvmet_req_free_sgl': drivers/nvme/target/core.c:952:3: error: implicit declaration of \ function 'sgl_free'; did you mean 'ida_free'? [-Werror=implicit-function-declaration] sgl_free(req->sg); ^~~~~~~~ ida_free Cause: 1. missing include to 2. SGL_ALLOC needs to be enabled Therefore adding the missing include, as well as Kconfig dependency. 
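For context, the helpers the build errors above refer to are declared in <linux/scatterlist.h> and are only compiled in when SGL_ALLOC is selected. A minimal usage sketch (illustrative only, error handling simplified):

	#include <linux/scatterlist.h>

	/* Allocate a scatterlist covering 'len' bytes; 'nent' receives the
	 * number of entries.  Requires CONFIG_SGL_ALLOC=y. */
	struct scatterlist *sg;
	unsigned int nent;

	sg = sgl_alloc(len, GFP_KERNEL, &nent);
	if (!sg)
		return -ENOMEM;
	/* ... map and use the scatterlist ... */
	sgl_free(sg);
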
Signed-off-by: Enrico Weigelt, metux IT consult Reviewed-by: Sagi Grimberg Reviewed-by: Minwoo Im Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/Kconfig | 1 + drivers/nvme/target/core.c | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig index d94f25cde019..3ef0a4e5eed6 100644 --- a/drivers/nvme/target/Kconfig +++ b/drivers/nvme/target/Kconfig @@ -3,6 +3,7 @@ config NVME_TARGET tristate "NVMe Target support" depends on BLOCK depends on CONFIGFS_FS + select SGL_ALLOC help This enabled target side support for the NVMe protocol, that is it allows the Linux kernel to implement NVMe subsystems and diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 24e0a36392d9..7734a6acff85 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -8,6 +8,7 @@ #include #include #include +#include #include "nvmet.h" From 525ec495e021068aa8635a0e18ff60695f5b1f4f Mon Sep 17 00:00:00 2001 From: Sagi Grimberg Date: Wed, 24 Apr 2019 11:43:23 -0700 Subject: [PATCH 122/164] nvmet-file: clamp-down file namespace lba_shift When the backing file is a tempfile for example, the inode i_blkbits can be 1M in size which causes problems for hosts to support as the disk block size. Instead, expose the minimum between i_blkbits and 12 (4K sector size). Signed-off-by: Sagi Grimberg Reviewed-by:- Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/io-cmd-file.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c index bc6ebb51b0bf..05453f5d1448 100644 --- a/drivers/nvme/target/io-cmd-file.c +++ b/drivers/nvme/target/io-cmd-file.c @@ -49,7 +49,12 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns) goto err; ns->size = stat.size; - ns->blksize_shift = file_inode(ns->file)->i_blkbits; + /* + * i_blkbits can be greater than the universally accepted upper bound, + * so make sure we export a sane namespace lba_shift. + */ + ns->blksize_shift = min_t(u8, + file_inode(ns->file)->i_blkbits, 12); ns->bvec_cache = kmem_cache_create("nvmet-bvec", NVMET_MAX_MPOOL_BVEC * sizeof(struct bio_vec), From 569b3d3db1aac8586a16df1745c9e5a99ff47253 Mon Sep 17 00:00:00 2001 From: Sagi Grimberg Date: Wed, 24 Apr 2019 11:53:16 -0700 Subject: [PATCH 123/164] nvmet-tcp: don't fail maxr2t greater than 1 The host may support it, but nothing prevents us from sending a single r2t at a time like we do anyways. 
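For reference, the field is zero's based in the ICReq PDU, so the check removed below rejected any host advertising more than one outstanding R2T, even though the target is free to use fewer. A sketch of the interpretation (illustrative only):

	/* maxr2t is 0's based: the host supports this many outstanding R2Ts. */
	u32 host_maxr2t = le32_to_cpu(icreq->maxr2t) + 1;

	/* The target decides how many R2Ts it actually issues per command,
	 * so a larger host limit is harmless and need not be rejected. */
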
Signed-off-by: Sagi Grimberg Signed-off-by: Christoph Hellwig --- drivers/nvme/target/tcp.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c index 17cf137dc88c..69b83fa0c76c 100644 --- a/drivers/nvme/target/tcp.c +++ b/drivers/nvme/target/tcp.c @@ -774,12 +774,6 @@ static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue) return -EPROTO; } - if (icreq->maxr2t != 0) { - pr_err("queue %d: unsupported maxr2t %d\n", queue->idx, - le32_to_cpu(icreq->maxr2t) + 1); - return -EPROTO; - } - queue->hdr_digest = !!(icreq->digest & NVME_TCP_HDR_DIGEST_ENABLE); queue->data_digest = !!(icreq->digest & NVME_TCP_DATA_DIGEST_ENABLE); if (queue->hdr_digest || queue->data_digest) { From 7a42589654ae79e1177f0d74306a02d6cef7bddf Mon Sep 17 00:00:00 2001 From: Sagi Grimberg Date: Wed, 24 Apr 2019 11:53:17 -0700 Subject: [PATCH 124/164] nvme-tcp: fix a NULL deref when an admin connect times out If we timeout the admin startup sequence we might not yet have an I/O tagset allocated which causes the teardown sequence to crash. Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags if the tagset wasn't allocated. Fixes: 39d57757467b ("nvme-tcp: fix timeout handler") Signed-off-by: Sagi Grimberg Signed-off-by: Christoph Hellwig --- drivers/nvme/host/tcp.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 68c49dd67210..aae5374d2b93 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -1710,7 +1710,9 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl, { blk_mq_quiesce_queue(ctrl->admin_q); nvme_tcp_stop_queue(ctrl, 0); - blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_cancel_request, ctrl); + if (ctrl->admin_tagset) + blk_mq_tagset_busy_iter(ctrl->admin_tagset, + nvme_cancel_request, ctrl); blk_mq_unquiesce_queue(ctrl->admin_q); nvme_tcp_destroy_admin_queue(ctrl, remove); } @@ -1722,7 +1724,9 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl, return; nvme_stop_queues(ctrl); nvme_tcp_stop_io_queues(ctrl); - blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl); + if (ctrl->tagset) + blk_mq_tagset_busy_iter(ctrl->tagset, + nvme_cancel_request, ctrl); if (remove) nvme_start_queues(ctrl); nvme_tcp_destroy_io_queues(ctrl, remove); From 1007709d7d06fab09bf2d007657575958676282b Mon Sep 17 00:00:00 2001 From: Sagi Grimberg Date: Wed, 24 Apr 2019 11:53:18 -0700 Subject: [PATCH 125/164] nvme-rdma: fix a NULL deref when an admin connect times out If we timeout the admin startup sequence we might not yet have an I/O tagset allocated which causes the teardown sequence to crash. Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags if the tagset wasn't allocated. 
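The guard is the same in both teardown paths; a minimal sketch (simplified, identifiers as in the TCP variant):

	/* If the admin connect timed out before the I/O tag set was allocated,
	 * ctrl->tagset is still NULL; only walk in-flight requests when the
	 * tag set actually exists. */
	if (ctrl->tagset)
		blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
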
Fixes: 4c174e636674 ("nvme-rdma: fix timeout handler") Signed-off-by: Sagi Grimberg Signed-off-by: Christoph Hellwig --- drivers/nvme/host/rdma.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 11a5ecae78c8..e1824c2e0a1c 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -914,8 +914,9 @@ static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl, { blk_mq_quiesce_queue(ctrl->ctrl.admin_q); nvme_rdma_stop_queue(&ctrl->queues[0]); - blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, nvme_cancel_request, - &ctrl->ctrl); + if (ctrl->ctrl.admin_tagset) + blk_mq_tagset_busy_iter(ctrl->ctrl.admin_tagset, + nvme_cancel_request, &ctrl->ctrl); blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); nvme_rdma_destroy_admin_queue(ctrl, remove); } @@ -926,8 +927,9 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl, if (ctrl->ctrl.queue_count > 1) { nvme_stop_queues(&ctrl->ctrl); nvme_rdma_stop_io_queues(ctrl); - blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request, - &ctrl->ctrl); + if (ctrl->ctrl.tagset) + blk_mq_tagset_busy_iter(ctrl->ctrl.tagset, + nvme_cancel_request, &ctrl->ctrl); if (remove) nvme_start_queues(&ctrl->ctrl); nvme_rdma_destroy_io_queues(ctrl, remove); From efb973b19b88642bb7e08b8ce8e03b0bbd2a7e2a Mon Sep 17 00:00:00 2001 From: Sagi Grimberg Date: Wed, 24 Apr 2019 11:53:19 -0700 Subject: [PATCH 126/164] nvme-tcp: rename function to have nvme_tcp prefix usually nvme_ prefix is for core functions. While we're cleaning up, remove redundant empty lines Signed-off-by: Sagi Grimberg Reviewed-by: Minwoo Im Signed-off-by: Christoph Hellwig --- drivers/nvme/host/tcp.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index aae5374d2b93..2405bb9c63cc 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -473,7 +473,6 @@ static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue, } return 0; - } static int nvme_tcp_handle_comp(struct nvme_tcp_queue *queue, @@ -634,7 +633,6 @@ static inline void nvme_tcp_end_request(struct request *rq, u16 status) nvme_end_request(rq, cpu_to_le16(status << 1), res); } - static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, unsigned int *offset, size_t *len) { @@ -1535,7 +1533,7 @@ out_free_queue: return ret; } -static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl) +static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl) { int i, ret; @@ -1565,7 +1563,7 @@ static unsigned int nvme_tcp_nr_io_queues(struct nvme_ctrl *ctrl) return nr_io_queues; } -static int nvme_alloc_io_queues(struct nvme_ctrl *ctrl) +static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl) { unsigned int nr_io_queues; int ret; @@ -1582,7 +1580,7 @@ static int nvme_alloc_io_queues(struct nvme_ctrl *ctrl) dev_info(ctrl->device, "creating %d I/O queues.\n", nr_io_queues); - return nvme_tcp_alloc_io_queues(ctrl); + return __nvme_tcp_alloc_io_queues(ctrl); } static void nvme_tcp_destroy_io_queues(struct nvme_ctrl *ctrl, bool remove) @@ -1599,7 +1597,7 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new) { int ret; - ret = nvme_alloc_io_queues(ctrl); + ret = nvme_tcp_alloc_io_queues(ctrl); if (ret) return ret; From 663d6fee66b555f6a080104751be0b54e0bca78a Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Mon, 15 Apr 2019 09:51:46 +0800 Subject: [PATCH 127/164] nvme-loop: kill timeout handler Firstly it doesn't make 
sense to handle timeout for loop: 1) for admin queue, the request is always completed in code path of queuing IO. 2) for normal IO request, the timeout on these IOs have been handled by underlying queue already. Secondly nvme-loop's timeout handler is simply broken, and easy to cause issue: 1) no any sync/protection between timeout and normal completion, and now it is driver's responsibility to deal with that; 2) bad reset implementation, blk_mq_update_nr_hw_queues() is called after all NSs's queue is stopped(quiesced), and easy to trigger deadlock. So kill the timeout handler. Signed-off-by: Ming Lei Reviewd-by: Keith Busch Reviewed-by: Sagi Grimberg Signed-off-by: Christoph Hellwig --- drivers/nvme/target/loop.c | 16 ---------------- 1 file changed, 16 deletions(-) diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c index a3ae491fa20e..9e211ad6bdd3 100644 --- a/drivers/nvme/target/loop.c +++ b/drivers/nvme/target/loop.c @@ -129,20 +129,6 @@ static void nvme_loop_execute_work(struct work_struct *work) nvmet_req_execute(&iod->req); } -static enum blk_eh_timer_return -nvme_loop_timeout(struct request *rq, bool reserved) -{ - struct nvme_loop_iod *iod = blk_mq_rq_to_pdu(rq); - - /* queue error recovery */ - nvme_reset_ctrl(&iod->queue->ctrl->ctrl); - - /* fail with DNR on admin cmd timeout */ - nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR; - - return BLK_EH_DONE; -} - static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd) { @@ -253,7 +239,6 @@ static const struct blk_mq_ops nvme_loop_mq_ops = { .complete = nvme_loop_complete_rq, .init_request = nvme_loop_init_request, .init_hctx = nvme_loop_init_hctx, - .timeout = nvme_loop_timeout, }; static const struct blk_mq_ops nvme_loop_admin_mq_ops = { @@ -261,7 +246,6 @@ static const struct blk_mq_ops nvme_loop_admin_mq_ops = { .complete = nvme_loop_complete_rq, .init_request = nvme_loop_init_request, .init_hctx = nvme_loop_init_admin_hctx, - .timeout = nvme_loop_timeout, }; static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl) From 82bebbde02e24ad7b641eca25e632f32579ed52f Mon Sep 17 00:00:00 2001 From: Minwoo Im Date: Wed, 10 Apr 2019 23:48:59 +0900 Subject: [PATCH 128/164] nvme-rdma: fix typo in struct comment struct nvme_rdma_cm_rej has two different attributes: recfmt and sts. And sts will have value what this comment wanted to show. Signed-off-by: Minwoo Im Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- include/linux/nvme-rdma.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/nvme-rdma.h b/include/linux/nvme-rdma.h index 3aa97b98dc89..3ec8e50efa16 100644 --- a/include/linux/nvme-rdma.h +++ b/include/linux/nvme-rdma.h @@ -77,7 +77,7 @@ struct nvme_rdma_cm_rep { * struct nvme_rdma_cm_rej - rdma connect reject * * @recfmt: format of the RDMA Private Data - * @fsts: error status for the associated connect request + * @sts: error status for the associated connect request */ struct nvme_rdma_cm_rej { __le16 recfmt; From 01fa017484ad98fccdeaab32db0077c574b6bd6f Mon Sep 17 00:00:00 2001 From: Sagi Grimberg Date: Mon, 11 Mar 2019 15:02:25 -0700 Subject: [PATCH 129/164] nvme: set 0 capacity if namespace block size exceeds PAGE_SIZE If our target exposed a namespace with a block size that is greater than PAGE_SIZE, set 0 capacity on the namespace as we do not support it. This issue encountered when the nvmet namespace was backed by a tempfile. 
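As a worked example of the new check (the numbers are illustrative, not taken from the patch):

	/* A tmpfs-backed namespace may report i_blkbits = 20 (1 MiB blocks).
	 * With 4 KiB pages, PAGE_SHIFT = 12, so lba_shift (20) > PAGE_SHIFT:
	 * fall back to a 512-byte block size for the queue limits and report
	 * zero capacity so the unusable namespace is not exposed.  (In the
	 * actual patch the capacity is zeroed further down via a combined
	 * condition; this sketch folds the two steps together.) */
	if (ns->lba_shift > PAGE_SHIFT) {
		bs = 1 << 9;
		capacity = 0;
	}
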
Signed-off-by: Sagi Grimberg Reviewed-by: Keith Busch Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 248ff3b48041..3dd043aa6d1f 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1591,6 +1591,10 @@ static void nvme_update_disk_info(struct gendisk *disk, sector_t capacity = le64_to_cpu(id->nsze) << (ns->lba_shift - 9); unsigned short bs = 1 << ns->lba_shift; + if (ns->lba_shift > PAGE_SHIFT) { + /* unsupported block size, set capacity to 0 later */ + bs = (1 << 9); + } blk_mq_freeze_queue(disk->queue); blk_integrity_unregister(disk); @@ -1601,7 +1605,8 @@ static void nvme_update_disk_info(struct gendisk *disk, if (ns->ms && !ns->ext && (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)) nvme_init_integrity(disk, ns->ms, ns->pi_type); - if (ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) + if ((ns->ms && !nvme_ns_has_pi(ns) && !blk_get_integrity(disk)) || + ns->lba_shift > PAGE_SHIFT) capacity = 0; set_capacity(disk, capacity); From cc6be13159316e8bdcd8bbb5209315256e151337 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 25 Apr 2019 08:12:11 +0200 Subject: [PATCH 130/164] mtip32xx: remove trim support The trim support in mtip32xx has been "temporarily" disabled for 6 years, which is 3/4 of the time the driver even exists in the tree. Remove it as it obviously is dead code now. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- drivers/block/mtip32xx/mtip32xx.c | 89 ------------------------------- drivers/block/mtip32xx/mtip32xx.h | 17 ------ 2 files changed, 106 deletions(-) diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c index 83302ecdc8db..f0105d118056 100644 --- a/drivers/block/mtip32xx/mtip32xx.c +++ b/drivers/block/mtip32xx/mtip32xx.c @@ -1192,14 +1192,6 @@ static int mtip_get_identify(struct mtip_port *port, void __user *user_buffer) else clear_bit(MTIP_DDF_SEC_LOCK_BIT, &port->dd->dd_flag); -#ifdef MTIP_TRIM /* Disabling TRIM support temporarily */ - /* Demux ID.DRAT & ID.RZAT to determine trim support */ - if (port->identify[69] & (1 << 14) && port->identify[69] & (1 << 5)) - port->dd->trim_supp = true; - else -#endif - port->dd->trim_supp = false; - /* Set the identify buffer as valid. 
*/ port->identify_valid = 1; @@ -1386,77 +1378,6 @@ static int mtip_get_smart_attr(struct mtip_port *port, unsigned int id, return rv; } -/* - * Trim unused sectors - * - * @dd pointer to driver_data structure - * @lba starting lba - * @len # of 512b sectors to trim - */ -static blk_status_t mtip_send_trim(struct driver_data *dd, unsigned int lba, - unsigned int len) -{ - u64 tlba, tlen, sect_left; - struct mtip_trim_entry *buf; - dma_addr_t dma_addr; - struct host_to_dev_fis fis; - blk_status_t ret = BLK_STS_OK; - int i; - - if (!len || dd->trim_supp == false) - return BLK_STS_IOERR; - - /* Trim request too big */ - WARN_ON(len > (MTIP_MAX_TRIM_ENTRY_LEN * MTIP_MAX_TRIM_ENTRIES)); - - /* Trim request not aligned on 4k boundary */ - WARN_ON(len % 8 != 0); - - /* Warn if vu_trim structure is too big */ - WARN_ON(sizeof(struct mtip_trim) > ATA_SECT_SIZE); - - /* Allocate a DMA buffer for the trim structure */ - buf = dma_alloc_coherent(&dd->pdev->dev, ATA_SECT_SIZE, &dma_addr, - GFP_KERNEL); - if (!buf) - return BLK_STS_RESOURCE; - memset(buf, 0, ATA_SECT_SIZE); - - for (i = 0, sect_left = len, tlba = lba; - i < MTIP_MAX_TRIM_ENTRIES && sect_left; - i++) { - tlen = (sect_left >= MTIP_MAX_TRIM_ENTRY_LEN ? - MTIP_MAX_TRIM_ENTRY_LEN : - sect_left); - buf[i].lba = cpu_to_le32(tlba); - buf[i].range = cpu_to_le16(tlen); - tlba += tlen; - sect_left -= tlen; - } - WARN_ON(sect_left != 0); - - /* Build the fis */ - memset(&fis, 0, sizeof(struct host_to_dev_fis)); - fis.type = 0x27; - fis.opts = 1 << 7; - fis.command = 0xfb; - fis.features = 0x60; - fis.sect_count = 1; - fis.device = ATA_DEVICE_OBS; - - if (mtip_exec_internal_command(dd->port, - &fis, - 5, - dma_addr, - ATA_SECT_SIZE, - 0, - MTIP_TRIM_TIMEOUT_MS) < 0) - ret = BLK_STS_IOERR; - - dma_free_coherent(&dd->pdev->dev, ATA_SECT_SIZE, buf, dma_addr); - return ret; -} - /* * Get the drive capacity. * @@ -3590,8 +3511,6 @@ static blk_status_t mtip_queue_rq(struct blk_mq_hw_ctx *hctx, blk_mq_start_request(rq); - if (req_op(rq) == REQ_OP_DISCARD) - return mtip_send_trim(dd, blk_rq_pos(rq), blk_rq_sectors(rq)); mtip_hw_submit_io(dd, rq, cmd, hctx); return BLK_STS_OK; } @@ -3769,14 +3688,6 @@ skip_create_disk: blk_queue_max_segment_size(dd->queue, 0x400000); blk_queue_io_min(dd->queue, 4096); - /* Signal trim support */ - if (dd->trim_supp == true) { - blk_queue_flag_set(QUEUE_FLAG_DISCARD, dd->queue); - dd->queue->limits.discard_granularity = 4096; - blk_queue_max_discard_sectors(dd->queue, - MTIP_MAX_TRIM_ENTRY_LEN * MTIP_MAX_TRIM_ENTRIES); - } - /* Set the capacity of the device in 512 byte sectors. */ if (!(mtip_hw_get_capacity(dd, &capacity))) { dev_warn(&dd->pdev->dev, diff --git a/drivers/block/mtip32xx/mtip32xx.h b/drivers/block/mtip32xx/mtip32xx.h index abce25f27f57..91c1cb5b1532 100644 --- a/drivers/block/mtip32xx/mtip32xx.h +++ b/drivers/block/mtip32xx/mtip32xx.h @@ -193,21 +193,6 @@ struct mtip_work { mtip_workq_sdbfx(w->port, group, w->completed); \ } -#define MTIP_TRIM_TIMEOUT_MS 240000 -#define MTIP_MAX_TRIM_ENTRIES 8 -#define MTIP_MAX_TRIM_ENTRY_LEN 0xfff8 - -struct mtip_trim_entry { - __le32 lba; /* starting lba of region */ - __le16 rsvd; /* unused */ - __le16 range; /* # of 512b blocks to trim */ -} __packed; - -struct mtip_trim { - /* Array of regions to trim */ - struct mtip_trim_entry entry[MTIP_MAX_TRIM_ENTRIES]; -} __packed; - /* Register Frame Information Structure (FIS), host to device. 
*/ struct host_to_dev_fis { /* @@ -474,8 +459,6 @@ struct driver_data { struct dentry *dfs_node; - bool trim_supp; /* flag indicating trim support */ - bool sr; int numa_node; /* NUMA support */ From cdca22bcbc64fc83dadb8d927df400a8d86ddabb Mon Sep 17 00:00:00 2001 From: Coly Li Date: Tue, 30 Apr 2019 22:02:25 +0800 Subject: [PATCH 131/164] bcache: remove redundant LIST_HEAD(journal) from run_cache_set() Commit 95f18c9d1310 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgets to remove the original define of LIST_HEAD(journal), which makes the change no take effect. This patch removes redundant variable LIST_HEAD(journal) from run_cache_set(), to make Shenghui's fix working. Fixes: 95f18c9d1310 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set") Reported-by: Juha Aatrokoski Cc: Shenghui Wang Signed-off-by: Coly Li Signed-off-by: Jens Axboe --- drivers/md/bcache/super.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 0ffe9acee9d8..1b63ac876169 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1800,7 +1800,6 @@ static int run_cache_set(struct cache_set *c) set_gc_sectors(c); if (CACHE_SYNC(&c->sb)) { - LIST_HEAD(journal); struct bkey *k; struct jset *j; From f936b06ae53815a7633b30ffd8cf5661ac826b3a Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 25 Apr 2019 09:02:59 +0200 Subject: [PATCH 132/164] bcache: clean up do_btree_node_write a bit Use a variable containing the buffer address instead of the to be removed integer iterator from bio_for_each_segment_all. Suggested-by: Matthew Wilcox Reviewed-by: Hannes Reinecke Acked-by: Coly Li Reviewed-by: Matthew Wilcox Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- drivers/md/bcache/btree.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index b139858b0802..3a9f8ed437de 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -431,12 +431,13 @@ static void do_btree_node_write(struct btree *b) if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) { int j; struct bio_vec *bv; - void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1)); + void *addr = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1)); struct bvec_iter_all iter_all; - bio_for_each_segment_all(bv, b->bio, j, iter_all) - memcpy(page_address(bv->bv_page), - base + j * PAGE_SIZE, PAGE_SIZE); + bio_for_each_segment_all(bv, b->bio, j, iter_all) { + memcpy(page_address(bv->bv_page), addr, PAGE_SIZE); + addr += PAGE_SIZE; + } bch_submit_bbio(b->bio, b->c, &k.key, 0); From 2b070cfe582b8e99fec6ada57d2e59e194aae202 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 25 Apr 2019 09:03:00 +0200 Subject: [PATCH 133/164] block: remove the i argument to bio_for_each_segment_all We only have two callers that need the integer loop iterator, and they can easily maintain it themselves. 
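A minimal before/after usage sketch (do_something() is a placeholder, not a real kernel function):

	struct bio_vec *bvec;
	struct bvec_iter_all iter_all;
	int i = 0;

	/* old: bio_for_each_segment_all(bvec, bio, i, iter_all) */
	bio_for_each_segment_all(bvec, bio, iter_all) {
		do_something(bvec->bv_page);
		i++;	/* the rare caller that needs an index keeps its own */
	}
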
Suggested-by: Matthew Wilcox Reviewed-by: Johannes Thumshirn Acked-by: David Sterba Reviewed-by: Hannes Reinecke Acked-by: Coly Li Reviewed-by: Matthew Wilcox Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/bio.c | 29 ++++++++++------------------- block/bounce.c | 3 +-- drivers/md/bcache/btree.c | 3 +-- drivers/md/dm-crypt.c | 3 +-- drivers/md/raid1.c | 6 +++--- drivers/staging/erofs/data.c | 3 +-- drivers/staging/erofs/unzip_vle.c | 3 +-- fs/block_dev.c | 6 ++---- fs/btrfs/compression.c | 3 +-- fs/btrfs/disk-io.c | 4 ++-- fs/btrfs/extent_io.c | 10 ++++------ fs/btrfs/inode.c | 8 ++++---- fs/btrfs/raid56.c | 3 +-- fs/crypto/bio.c | 3 +-- fs/direct-io.c | 3 +-- fs/ext4/page-io.c | 3 +-- fs/ext4/readpage.c | 3 +-- fs/f2fs/data.c | 9 +++------ fs/gfs2/lops.c | 3 +-- fs/gfs2/meta_io.c | 3 +-- fs/iomap.c | 6 ++---- fs/mpage.c | 3 +-- fs/xfs/xfs_aops.c | 3 +-- include/linux/bio.h | 5 ++--- 24 files changed, 47 insertions(+), 81 deletions(-) diff --git a/block/bio.c b/block/bio.c index 662d45752ec5..9ad0d00cdc9b 100644 --- a/block/bio.c +++ b/block/bio.c @@ -874,9 +874,8 @@ static void bio_get_pages(struct bio *bio) { struct bvec_iter_all iter_all; struct bio_vec *bvec; - int i; - bio_for_each_segment_all(bvec, bio, i, iter_all) + bio_for_each_segment_all(bvec, bio, iter_all) get_page(bvec->bv_page); } @@ -884,9 +883,8 @@ static void bio_release_pages(struct bio *bio) { struct bvec_iter_all iter_all; struct bio_vec *bvec; - int i; - bio_for_each_segment_all(bvec, bio, i, iter_all) + bio_for_each_segment_all(bvec, bio, iter_all) put_page(bvec->bv_page); } @@ -1166,11 +1164,10 @@ static struct bio_map_data *bio_alloc_map_data(struct iov_iter *data, */ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter) { - int i; struct bio_vec *bvec; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { ssize_t ret; ret = copy_page_from_iter(bvec->bv_page, @@ -1198,11 +1195,10 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter) */ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter) { - int i; struct bio_vec *bvec; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { ssize_t ret; ret = copy_page_to_iter(bvec->bv_page, @@ -1223,10 +1219,9 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter) void bio_free_pages(struct bio *bio) { struct bio_vec *bvec; - int i; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) + bio_for_each_segment_all(bvec, bio, iter_all) __free_page(bvec->bv_page); } EXPORT_SYMBOL(bio_free_pages); @@ -1464,7 +1459,7 @@ struct bio *bio_map_user_iov(struct request_queue *q, return bio; out_unmap: - bio_for_each_segment_all(bvec, bio, j, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { put_page(bvec->bv_page); } bio_put(bio); @@ -1474,13 +1469,12 @@ struct bio *bio_map_user_iov(struct request_queue *q, static void __bio_unmap_user(struct bio *bio) { struct bio_vec *bvec; - int i; struct bvec_iter_all iter_all; /* * make sure we dirty pages we wrote to */ - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { if (bio_data_dir(bio) == READ) set_page_dirty_lock(bvec->bv_page); @@ -1571,10 +1565,9 @@ static void bio_copy_kern_endio_read(struct bio *bio) { char *p = bio->bi_private; struct bio_vec *bvec; - int i; struct bvec_iter_all iter_all; - 
bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { memcpy(p, page_address(bvec->bv_page), bvec->bv_len); p += bvec->bv_len; } @@ -1682,10 +1675,9 @@ cleanup: void bio_set_pages_dirty(struct bio *bio) { struct bio_vec *bvec; - int i; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { if (!PageCompound(bvec->bv_page)) set_page_dirty_lock(bvec->bv_page); } @@ -1734,10 +1726,9 @@ void bio_check_pages_dirty(struct bio *bio) { struct bio_vec *bvec; unsigned long flags; - int i; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page)) goto defer; } diff --git a/block/bounce.c b/block/bounce.c index 47eb7e936e22..f8ed677a1bf7 100644 --- a/block/bounce.c +++ b/block/bounce.c @@ -163,14 +163,13 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool) { struct bio *bio_orig = bio->bi_private; struct bio_vec *bvec, orig_vec; - int i; struct bvec_iter orig_iter = bio_orig->bi_iter; struct bvec_iter_all iter_all; /* * free up bounce indirect pages used */ - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { orig_vec = bio_iter_iovec(bio_orig, orig_iter); if (bvec->bv_page != orig_vec.bv_page) { dec_zone_page_state(bvec->bv_page, NR_BOUNCE); diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 3a9f8ed437de..773f5fdad25f 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -429,12 +429,11 @@ static void do_btree_node_write(struct btree *b) bset_sector_offset(&b->keys, i)); if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) { - int j; struct bio_vec *bv; void *addr = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1)); struct bvec_iter_all iter_all; - bio_for_each_segment_all(bv, b->bio, j, iter_all) { + bio_for_each_segment_all(bv, b->bio, iter_all) { memcpy(page_address(bv->bv_page), addr, PAGE_SIZE); addr += PAGE_SIZE; } diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index dd6565798778..cca68546945b 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -1445,11 +1445,10 @@ out: static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone) { - unsigned int i; struct bio_vec *bv; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bv, clone, i, iter_all) { + bio_for_each_segment_all(bv, clone, iter_all) { BUG_ON(!bv->bv_page); mempool_free(bv->bv_page, &cc->page_pool); } diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index fdf451aac369..0c8a098d220e 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -2110,7 +2110,7 @@ static void process_checks(struct r1bio *r1_bio) } r1_bio->read_disk = primary; for (i = 0; i < conf->raid_disks * 2; i++) { - int j; + int j = 0; struct bio *pbio = r1_bio->bios[primary]; struct bio *sbio = r1_bio->bios[i]; blk_status_t status = sbio->bi_status; @@ -2125,8 +2125,8 @@ static void process_checks(struct r1bio *r1_bio) /* Now we can 'fixup' the error value */ sbio->bi_status = 0; - bio_for_each_segment_all(bi, sbio, j, iter_all) - page_len[j] = bi->bv_len; + bio_for_each_segment_all(bi, sbio, iter_all) + page_len[j++] = bi->bv_len; if (!status) { for (j = vcnt; j-- ; ) { diff --git a/drivers/staging/erofs/data.c b/drivers/staging/erofs/data.c index 81af768e7248..9f04d7466c55 100644 --- a/drivers/staging/erofs/data.c +++ 
b/drivers/staging/erofs/data.c @@ -17,12 +17,11 @@ static inline void read_endio(struct bio *bio) { - int i; struct bio_vec *bvec; const blk_status_t err = bio->bi_status; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; /* page is already locked */ diff --git a/drivers/staging/erofs/unzip_vle.c b/drivers/staging/erofs/unzip_vle.c index 31eef8395774..59b9f37d5c00 100644 --- a/drivers/staging/erofs/unzip_vle.c +++ b/drivers/staging/erofs/unzip_vle.c @@ -844,14 +844,13 @@ static void z_erofs_vle_unzip_kickoff(void *ptr, int bios) static inline void z_erofs_vle_read_endio(struct bio *bio) { const blk_status_t err = bio->bi_status; - unsigned int i; struct bio_vec *bvec; #ifdef EROFS_FS_HAS_MANAGED_CACHE struct address_space *mc = NULL; #endif struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; bool cachemngd = false; diff --git a/fs/block_dev.c b/fs/block_dev.c index 24615c76c1d0..8abc6570d29f 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -210,7 +210,6 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, struct bio bio; ssize_t ret; blk_qc_t qc; - int i; struct bvec_iter_all iter_all; if ((pos | iov_iter_alignment(iter)) & @@ -261,7 +260,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, } __set_current_state(TASK_RUNNING); - bio_for_each_segment_all(bvec, &bio, i, iter_all) { + bio_for_each_segment_all(bvec, &bio, iter_all) { if (should_dirty && !PageCompound(bvec->bv_page)) set_page_dirty_lock(bvec->bv_page); put_page(bvec->bv_page); @@ -339,9 +338,8 @@ static void blkdev_bio_end_io(struct bio *bio) if (!bio_flagged(bio, BIO_NO_PAGE_REF)) { struct bvec_iter_all iter_all; struct bio_vec *bvec; - int i; - bio_for_each_segment_all(bvec, bio, i, iter_all) + bio_for_each_segment_all(bvec, bio, iter_all) put_page(bvec->bv_page); } bio_put(bio); diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 4f2a8ae0aa42..6313dc65209e 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -160,7 +160,6 @@ csum_failed: if (cb->errors) { bio_io_error(cb->orig_bio); } else { - int i; struct bio_vec *bvec; struct bvec_iter_all iter_all; @@ -169,7 +168,7 @@ csum_failed: * checked so the end_io handlers know about it */ ASSERT(!bio_flagged(bio, BIO_CLONED)); - bio_for_each_segment_all(bvec, cb->orig_bio, i, iter_all) + bio_for_each_segment_all(bvec, cb->orig_bio, iter_all) SetPageChecked(bvec->bv_page); bio_endio(cb->orig_bio); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 6fe9197f6ee4..c333e79408ff 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -832,11 +832,11 @@ static blk_status_t btree_csum_one_bio(struct bio *bio) { struct bio_vec *bvec; struct btrfs_root *root; - int i, ret = 0; + int ret = 0; struct bvec_iter_all iter_all; ASSERT(!bio_flagged(bio, BIO_CLONED)); - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { root = BTRFS_I(bvec->bv_page->mapping->host)->root; ret = csum_dirty_buffer(root->fs_info, bvec->bv_page); if (ret) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index ca8b8e785cf3..c85505c36fa6 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2451,11 +2451,10 @@ static void end_bio_extent_writepage(struct bio *bio) struct bio_vec *bvec; u64 start; u64 end; - int i; struct bvec_iter_all 
iter_all; ASSERT(!bio_flagged(bio, BIO_CLONED)); - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; struct inode *inode = page->mapping->host; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); @@ -2523,11 +2522,10 @@ static void end_bio_extent_readpage(struct bio *bio) u64 extent_len = 0; int mirror; int ret; - int i; struct bvec_iter_all iter_all; ASSERT(!bio_flagged(bio, BIO_CLONED)); - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; struct inode *inode = page->mapping->host; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); @@ -3643,11 +3641,11 @@ static void end_bio_extent_buffer_writepage(struct bio *bio) { struct bio_vec *bvec; struct extent_buffer *eb; - int i, done; + int done; struct bvec_iter_all iter_all; ASSERT(!bio_flagged(bio, BIO_CLONED)); - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; eb = (struct extent_buffer *)page->private; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 82fdda8ff5ab..10a8d08d3d29 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7828,7 +7828,6 @@ static void btrfs_retry_endio_nocsum(struct bio *bio) struct inode *inode = done->inode; struct bio_vec *bvec; struct extent_io_tree *io_tree, *failure_tree; - int i; struct bvec_iter_all iter_all; if (bio->bi_status) @@ -7841,7 +7840,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio) done->uptodate = 1; ASSERT(!bio_flagged(bio, BIO_CLONED)); - bio_for_each_segment_all(bvec, bio, i, iter_all) + bio_for_each_segment_all(bvec, bio, iter_all) clean_io_failure(BTRFS_I(inode)->root->fs_info, failure_tree, io_tree, done->start, bvec->bv_page, btrfs_ino(BTRFS_I(inode)), 0); @@ -7919,7 +7918,7 @@ static void btrfs_retry_endio(struct bio *bio) struct bio_vec *bvec; int uptodate; int ret; - int i; + int i = 0; struct bvec_iter_all iter_all; if (bio->bi_status) @@ -7934,7 +7933,7 @@ static void btrfs_retry_endio(struct bio *bio) failure_tree = &BTRFS_I(inode)->io_failure_tree; ASSERT(!bio_flagged(bio, BIO_CLONED)); - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page, bvec->bv_offset, done->start, bvec->bv_len); @@ -7946,6 +7945,7 @@ static void btrfs_retry_endio(struct bio *bio) bvec->bv_offset); else uptodate = 0; + i++; } done->uptodate = uptodate; diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 67a6f7d47402..f3d0576dd327 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -1442,12 +1442,11 @@ static int fail_bio_stripe(struct btrfs_raid_bio *rbio, static void set_bio_pages_uptodate(struct bio *bio) { struct bio_vec *bvec; - int i; struct bvec_iter_all iter_all; ASSERT(!bio_flagged(bio, BIO_CLONED)); - bio_for_each_segment_all(bvec, bio, i, iter_all) + bio_for_each_segment_all(bvec, bio, iter_all) SetPageUptodate(bvec->bv_page); } diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c index 5759bcd018cd..8f3a8bc15d98 100644 --- a/fs/crypto/bio.c +++ b/fs/crypto/bio.c @@ -29,10 +29,9 @@ static void __fscrypt_decrypt_bio(struct bio *bio, bool done) { struct bio_vec *bv; - int i; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bv, bio, i, iter_all) { + bio_for_each_segment_all(bv, bio, iter_all) { struct page *page = bv->bv_page; int ret = fscrypt_decrypt_page(page->mapping->host, page, PAGE_SIZE, 0, 
page->index); diff --git a/fs/direct-io.c b/fs/direct-io.c index 9bb015bc4a83..fbe885d68035 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -538,7 +538,6 @@ static struct bio *dio_await_one(struct dio *dio) static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio) { struct bio_vec *bvec; - unsigned i; blk_status_t err = bio->bi_status; if (err) { @@ -553,7 +552,7 @@ static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio) } else { struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; if (dio->op == REQ_OP_READ && !PageCompound(page) && diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 3e9298e6a705..4690618a92e9 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -61,11 +61,10 @@ static void buffer_io_error(struct buffer_head *bh) static void ext4_finish_bio(struct bio *bio) { - int i; struct bio_vec *bvec; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; #ifdef CONFIG_FS_ENCRYPTION struct page *data_page = NULL; diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c index 3adadf461825..3629a74b7f94 100644 --- a/fs/ext4/readpage.c +++ b/fs/ext4/readpage.c @@ -71,7 +71,6 @@ static inline bool ext4_bio_encrypted(struct bio *bio) static void mpage_end_io(struct bio *bio) { struct bio_vec *bv; - int i; struct bvec_iter_all iter_all; if (ext4_bio_encrypted(bio)) { @@ -82,7 +81,7 @@ static void mpage_end_io(struct bio *bio) return; } } - bio_for_each_segment_all(bv, bio, i, iter_all) { + bio_for_each_segment_all(bv, bio, iter_all) { struct page *page = bv->bv_page; if (!bio->bi_status) { diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index 9727944139f2..64040e998439 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -86,10 +86,9 @@ static void __read_end_io(struct bio *bio) { struct page *page; struct bio_vec *bv; - int i; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bv, bio, i, iter_all) { + bio_for_each_segment_all(bv, bio, iter_all) { page = bv->bv_page; /* PG_error was set if any post_read step failed */ @@ -164,7 +163,6 @@ static void f2fs_write_end_io(struct bio *bio) { struct f2fs_sb_info *sbi = bio->bi_private; struct bio_vec *bvec; - int i; struct bvec_iter_all iter_all; if (time_to_inject(sbi, FAULT_WRITE_IO)) { @@ -172,7 +170,7 @@ static void f2fs_write_end_io(struct bio *bio) bio->bi_status = BLK_STS_IOERR; } - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; enum count_type type = WB_DATA_TYPE(page); @@ -349,7 +347,6 @@ static bool __has_merged_page(struct f2fs_bio_info *io, struct inode *inode, { struct bio_vec *bvec; struct page *target; - int i; struct bvec_iter_all iter_all; if (!io->bio) @@ -358,7 +355,7 @@ static bool __has_merged_page(struct f2fs_bio_info *io, struct inode *inode, if (!inode && !page && !ino) return true; - bio_for_each_segment_all(bvec, io->bio, i, iter_all) { + bio_for_each_segment_all(bvec, io->bio, iter_all) { if (bvec->bv_page->mapping) target = bvec->bv_page; diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c index 8722c60b11fe..6f09b5e3dd6e 100644 --- a/fs/gfs2/lops.c +++ b/fs/gfs2/lops.c @@ -207,7 +207,6 @@ static void gfs2_end_log_write(struct bio *bio) struct gfs2_sbd *sdp = bio->bi_private; struct bio_vec *bvec; struct page *page; - int i; struct bvec_iter_all iter_all; if 
(bio->bi_status) { @@ -216,7 +215,7 @@ static void gfs2_end_log_write(struct bio *bio) wake_up(&sdp->sd_logd_waitq); } - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { page = bvec->bv_page; if (page_has_buffers(page)) gfs2_end_log_write_bh(sdp, bvec, bio->bi_status); diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c index 3201342404a7..ff86e1d4f8ff 100644 --- a/fs/gfs2/meta_io.c +++ b/fs/gfs2/meta_io.c @@ -189,10 +189,9 @@ struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno) static void gfs2_meta_read_endio(struct bio *bio) { struct bio_vec *bvec; - int i; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) { + bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; struct buffer_head *bh = page_buffers(page); unsigned int len = bvec->bv_len; diff --git a/fs/iomap.c b/fs/iomap.c index abdd18e404f8..12a656271076 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -273,10 +273,9 @@ iomap_read_end_io(struct bio *bio) { int error = blk_status_to_errno(bio->bi_status); struct bio_vec *bvec; - int i; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bvec, bio, i, iter_all) + bio_for_each_segment_all(bvec, bio, iter_all) iomap_read_page_end_io(bvec, error); bio_put(bio); } @@ -1592,9 +1591,8 @@ static void iomap_dio_bio_end_io(struct bio *bio) if (!bio_flagged(bio, BIO_NO_PAGE_REF)) { struct bvec_iter_all iter_all; struct bio_vec *bvec; - int i; - bio_for_each_segment_all(bvec, bio, i, iter_all) + bio_for_each_segment_all(bvec, bio, iter_all) put_page(bvec->bv_page); } bio_put(bio); diff --git a/fs/mpage.c b/fs/mpage.c index 3f19da75178b..436a85260394 100644 --- a/fs/mpage.c +++ b/fs/mpage.c @@ -47,10 +47,9 @@ static void mpage_end_io(struct bio *bio) { struct bio_vec *bv; - int i; struct bvec_iter_all iter_all; - bio_for_each_segment_all(bv, bio, i, iter_all) { + bio_for_each_segment_all(bv, bio, iter_all) { struct page *page = bv->bv_page; page_endio(page, bio_op(bio), blk_status_to_errno(bio->bi_status)); diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 3619e9e8d359..47941bbaac02 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -98,7 +98,6 @@ xfs_destroy_ioend( for (bio = &ioend->io_inline_bio; bio; bio = next) { struct bio_vec *bvec; - int i; struct bvec_iter_all iter_all; /* @@ -111,7 +110,7 @@ xfs_destroy_ioend( next = bio->bi_private; /* walk each page on bio, ending page IO on them */ - bio_for_each_segment_all(bvec, bio, i, iter_all) + bio_for_each_segment_all(bvec, bio, iter_all) xfs_finish_page_writeback(inode, bvec, error); bio_put(bio); } diff --git a/include/linux/bio.h b/include/linux/bio.h index 9577ad8f6e28..186b2723c61b 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -134,9 +134,8 @@ static inline bool bio_next_segment(const struct bio *bio, * drivers should _never_ use the all version - the bio may have been split * before it got to the driver and the driver won't own all of it */ -#define bio_for_each_segment_all(bvl, bio, i, iter) \ - for (i = 0, bvl = bvec_init_iter_all(&iter); \ - bio_next_segment((bio), &iter); i++) +#define bio_for_each_segment_all(bvl, bio, iter) \ + for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); ) static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter, unsigned bytes) From 4713839dfe8269d27d83a33d1e39f9c2970eb31a Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 25 Apr 2019 09:04:33 +0200 Subject: [PATCH 134/164] block: remove the 
__bio_add_pc_page export The same page optimization is a rather odd corner case, which is not used outside bio.c and which really should not be used outside of bio.c either - we have better highlevel helpers like the rq/bio mapping helpers. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/bio.c | 3 +-- include/linux/bio.h | 3 --- 2 files changed, 1 insertion(+), 5 deletions(-) diff --git a/block/bio.c b/block/bio.c index 9ad0d00cdc9b..e717b303e1fb 100644 --- a/block/bio.c +++ b/block/bio.c @@ -709,7 +709,7 @@ static bool can_add_page_to_seg(struct request_queue *q, * * This should only be used by passthrough bios. */ -int __bio_add_pc_page(struct request_queue *q, struct bio *bio, +static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page, unsigned int len, unsigned int offset, bool put_same_page) { @@ -776,7 +776,6 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio, bio_set_flag(bio, BIO_SEG_VALID); return len; } -EXPORT_SYMBOL(__bio_add_pc_page); int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page, unsigned int len, unsigned int offset) diff --git a/include/linux/bio.h b/include/linux/bio.h index 186b2723c61b..077cecdf9437 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -435,9 +435,6 @@ void bio_chain(struct bio *, struct bio *); extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int); extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *, unsigned int, unsigned int); -extern int __bio_add_pc_page(struct request_queue *, struct bio *, - struct page *, unsigned int, unsigned int, - bool); bool __bio_try_merge_page(struct bio *bio, struct page *page, unsigned int len, unsigned int off, bool same_page); void __bio_add_page(struct bio *bio, struct page *page, From 6601e44efd20efddc183c85131216200e90c5728 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 25 Apr 2019 09:04:34 +0200 Subject: [PATCH 135/164] block: remove bogus comments in __bio_add_pc_page We are never called with file system pages by defintions for the passthrough interface, and we also never undo any addition later these days. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/bio.c | 9 --------- 1 file changed, 9 deletions(-) diff --git a/block/bio.c b/block/bio.c index e717b303e1fb..de26dc18bceb 100644 --- a/block/bio.c +++ b/block/bio.c @@ -724,11 +724,6 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q)) return 0; - /* - * For filesystems with a blocksize smaller than the pagesize - * we will often be called with the same page as last time and - * a consecutive offset. Optimize this special case. - */ if (bio->bi_vcnt > 0) { bvec = &bio->bi_io_vec[bio->bi_vcnt - 1]; @@ -760,10 +755,6 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, if (bio->bi_phys_segments >= queue_max_segments(q)) return 0; - /* - * setup the new entry, we might clear it again later if we - * cannot add the page - */ bvec = &bio->bi_io_vec[bio->bi_vcnt]; bvec->bv_page = page; bvec->bv_len = len; From dcdca753c152efe8d86ec7a15423307807a516a7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 25 Apr 2019 09:04:35 +0200 Subject: [PATCH 136/164] block: clean up __bio_add_pc_page a bit Share the bi_size update by moving the done label up, and duplicate the bv_len update in the two callers to get rid of the bvec_merge label. 
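The resulting shape of the function, sketched with the two merge conditions abbreviated (illustrative only; merged_with_last_bvec stands in for the real checks shown in the diff below):

	if (merged_with_last_bvec) {		/* either merge case */
		bvec->bv_len += len;		/* update duplicated at each call site */
		goto done;
	}
	/* otherwise append a new bvec */
	bvec = &bio->bi_io_vec[bio->bi_vcnt];
	bvec->bv_page = page;
	bvec->bv_len = len;
	bvec->bv_offset = offset;
	bio->bi_vcnt++;
done:
	bio->bi_iter.bi_size += len;		/* shared by both paths */
	return len;
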
Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/bio.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/block/bio.c b/block/bio.c index de26dc18bceb..029afb121a48 100644 --- a/block/bio.c +++ b/block/bio.c @@ -731,9 +731,7 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, offset == bvec->bv_offset + bvec->bv_len) { if (put_same_page) put_page(page); - bvec_merge: bvec->bv_len += len; - bio->bi_iter.bi_size += len; goto done; } @@ -745,8 +743,10 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, return 0; if (page_is_mergeable(bvec, page, len, offset, false) && - can_add_page_to_seg(q, bvec, page, len, offset)) - goto bvec_merge; + can_add_page_to_seg(q, bvec, page, len, offset)) { + bvec->bv_len += len; + goto done; + } } if (bio_full(bio)) @@ -760,9 +760,8 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, bvec->bv_len = len; bvec->bv_offset = offset; bio->bi_vcnt++; - bio->bi_iter.bi_size += len; - done: + bio->bi_iter.bi_size += len; bio->bi_phys_segments = bio->bi_vcnt; bio_set_flag(bio, BIO_SEG_VALID); return len; From 8c16567d867ed3185a67d8560e051090486d3ff1 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 30 Apr 2019 14:42:39 -0400 Subject: [PATCH 137/164] block: switch all files cleared marked as GPLv2 to SPDX tags All these files have some form of the usual GPLv2 boilerplate. Switch them to use SPDX tags instead. Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/badblocks.c | 10 +--------- block/bio-integrity.c | 16 +--------------- block/bio.c | 15 +-------------- block/blk-flush.c | 3 +-- block/blk-integrity.c | 16 +--------------- block/blk-mq-debugfs.c | 13 +------------ block/blk-mq-pci.c | 10 +--------- block/blk-mq-rdma.c | 10 +--------- block/blk-mq-virtio.c | 10 +--------- block/bsg.c | 9 +-------- block/kyber-iosched.c | 13 +------------ block/opal_proto.h | 10 +--------- block/partitions/acorn.c | 7 +------ block/scsi_ioctl.c | 16 +--------------- block/sed-opal.c | 10 +--------- block/t10-pi.c | 19 +------------------ include/linux/bio.h | 15 +-------------- include/linux/bvec.h | 15 +-------------- include/linux/sed-opal.h | 10 +--------- 19 files changed, 19 insertions(+), 208 deletions(-) diff --git a/block/badblocks.c b/block/badblocks.c index 91f7bcf979d3..2e5f5697db35 100644 --- a/block/badblocks.c +++ b/block/badblocks.c @@ -1,18 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Bad block management * * - Heavily based on MD badblocks code from Neil Brown * * Copyright (c) 2015, Intel Corporation. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include diff --git a/block/bio-integrity.c b/block/bio-integrity.c index 1b633a3526d4..42536674020a 100644 --- a/block/bio-integrity.c +++ b/block/bio-integrity.c @@ -1,23 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0 /* * bio-integrity.c - bio data integrity extensions * * Copyright (C) 2007, 2008, 2009 Oracle Corporation * Written by: Martin K. 
Petersen - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License version - * 2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; see the file COPYING. If not, write to - * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, - * USA. - * */ #include diff --git a/block/bio.c b/block/bio.c index 029afb121a48..683cbb40f051 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1,19 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2001 Jens Axboe - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public Licens - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- - * */ #include #include diff --git a/block/blk-flush.c b/block/blk-flush.c index d95f94892015..aedd9320e605 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -1,11 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Functions to sequence PREFLUSH and FUA writes. * * Copyright (C) 2011 Max Planck Institute for Gravitational Physics * Copyright (C) 2011 Tejun Heo * - * This file is released under the GPLv2. - * * REQ_{PREFLUSH|FUA} requests are decomposed to sequences consisted of three * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request * properties and hardware capability. diff --git a/block/blk-integrity.c b/block/blk-integrity.c index d1ab089e0919..7f302f7b9d84 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -1,23 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0 /* * blk-integrity.c - Block layer data integrity extensions * * Copyright (C) 2007, 2008 Oracle Corporation * Written by: Martin K. Petersen - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License version - * 2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; see the file COPYING. If not, write to - * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, - * USA. 
- * */ #include diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index ec1d18cb643c..6aea0ebc3a73 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -1,17 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2017 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public - * License v2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . */ #include diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c index 1dce18553984..ad4545a2a98b 100644 --- a/block/blk-mq-pci.c +++ b/block/blk-mq-pci.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (c) 2016 Christoph Hellwig. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include #include diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c index 45030a81a1ed..cc921e6ba709 100644 --- a/block/blk-mq-rdma.c +++ b/block/blk-mq-rdma.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (c) 2017 Sagi Grimberg. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include #include diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c index 370827163835..75a52c18a8f6 100644 --- a/block/blk-mq-virtio.c +++ b/block/blk-mq-virtio.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (c) 2016 Christoph Hellwig. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include #include diff --git a/block/bsg.c b/block/bsg.c index f306853c6b08..833c44b3d458 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -1,13 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 /* * bsg.c - block layer implementation of the sg v4 interface - * - * Copyright (C) 2004 Jens Axboe SUSE Labs - * Copyright (C) 2004 Peter M. Jones - * - * This file is subject to the terms and conditions of the GNU General Public - * License version 2. See the file "COPYING" in the main directory of this - * archive for more details. 
- * */ #include #include diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c index ec6a04e01bc1..c3b05119cebd 100644 --- a/block/kyber-iosched.c +++ b/block/kyber-iosched.c @@ -1,20 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0 /* * The Kyber I/O scheduler. Controls latency by throttling queue depths using * scalable techniques. * * Copyright (C) 2017 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public - * License v2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program. If not, see . */ #include diff --git a/block/opal_proto.h b/block/opal_proto.h index b6e352cfe982..d9a05ad02eb5 100644 --- a/block/opal_proto.h +++ b/block/opal_proto.h @@ -1,18 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright © 2016 Intel Corporation * * Authors: * Rafael Antognolli * Scott Bauer - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include diff --git a/block/partitions/acorn.c b/block/partitions/acorn.c index fbeb697374d5..7587700fad4a 100644 --- a/block/partitions/acorn.c +++ b/block/partitions/acorn.c @@ -1,12 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0 /* - * linux/fs/partitions/acorn.c - * * Copyright (c) 1996-2000 Russell King. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * * Scan ADFS partitions on hard disk drives. Unfortunately, there * isn't a standard for partitioning drives on Acorn machines, so * every single manufacturer of SCSI and IDE cards created their own diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 533f4aee8567..f5e0ad65e86a 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -1,20 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 2001 Jens Axboe - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. 
- * - * You should have received a copy of the GNU General Public Licens - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- - * */ #include #include diff --git a/block/sed-opal.c b/block/sed-opal.c index b1aa0cc25803..a46e8d13e16d 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -1,18 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright © 2016 Intel Corporation * * Authors: * Scott Bauer * Rafael Antognolli - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #define pr_fmt(fmt) KBUILD_MODNAME ":OPAL: " fmt diff --git a/block/t10-pi.c b/block/t10-pi.c index 62aed77d0bb9..0c0094609dd6 100644 --- a/block/t10-pi.c +++ b/block/t10-pi.c @@ -1,24 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0 /* * t10_pi.c - Functions for generating and verifying T10 Protection * Information. - * - * Copyright (C) 2007, 2008, 2014 Oracle Corporation - * Written by: Martin K. Petersen - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License version - * 2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; see the file COPYING. If not, write to - * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, - * USA. - * */ #include diff --git a/include/linux/bio.h b/include/linux/bio.h index 077cecdf9437..ea73df36529a 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -1,19 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright (C) 2001 Jens Axboe - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public Licens - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- */ #ifndef __LINUX_BIO_H #define __LINUX_BIO_H diff --git a/include/linux/bvec.h b/include/linux/bvec.h index a4811410e4fc..545a480528e0 100644 --- a/include/linux/bvec.h +++ b/include/linux/bvec.h @@ -1,21 +1,8 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * bvec iterator * * Copyright (C) 2001 Ming Lei - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. 
- * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public Licens - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- */ #ifndef __LINUX_BVEC_ITER_H #define __LINUX_BVEC_ITER_H diff --git a/include/linux/sed-opal.h b/include/linux/sed-opal.h index 04b124fca51e..3e76b6d7d97f 100644 --- a/include/linux/sed-opal.h +++ b/include/linux/sed-opal.h @@ -1,18 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * Copyright © 2016 Intel Corporation * * Authors: * Rafael Antognolli * Scott Bauer - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #ifndef LINUX_OPAL_H From a497ee34a45d58e9b978d0fa5c4b25d4813eb350 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 30 Apr 2019 14:42:40 -0400 Subject: [PATCH 138/164] block: switch all files cleared marked as GPLv2 or later to SPDX tags All these files have some form of the usual GPLv2 or later boilerplate. Switch them to use SPDX tags instead. Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/bfq-cgroup.c | 11 +---------- block/bfq-iosched.c | 11 +---------- block/bfq-iosched.h | 11 +---------- block/bfq-wf2q.c | 11 +---------- block/bsg-lib.c | 16 +--------------- block/partitions/efi.c | 16 +--------------- block/partitions/efi.h | 16 +--------------- block/partitions/ldm.c | 16 +--------------- block/partitions/ldm.h | 16 +--------------- include/linux/bsg-lib.h | 16 +--------------- 10 files changed, 10 insertions(+), 130 deletions(-) diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 793c027ca60e..b3796a40a61a 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -1,15 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-or-later /* * cgroups support for the BFQ I/O scheduler. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License as - * published by the Free Software Foundation; either version 2 of the - * License, or (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index b85a4ab8b9db..f8d430f88d25 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Budget Fair Queueing (BFQ) I/O scheduler. 
* @@ -12,16 +13,6 @@ * * Copyright (C) 2017 Paolo Valente * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License as - * published by the Free Software Foundation; either version 2 of the - * License, or (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * * BFQ is a proportional-share I/O scheduler, with some extra * low-latency capabilities. BFQ also supports full hierarchical * scheduling through cgroups. Next paragraphs provide an introduction diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index eba7cd449ab4..c2faa77824f8 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -1,16 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ /* * Header file for the BFQ I/O scheduler: data structures and * prototypes of interface functions among BFQ components. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License as - * published by the Free Software Foundation; either version 2 of the - * License, or (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef _BFQ_H #define _BFQ_H diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c index 48d899cfbe03..c9ba225081ce 100644 --- a/block/bfq-wf2q.c +++ b/block/bfq-wf2q.c @@ -1,19 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Hierarchical Budget Worst-case Fair Weighted Fair Queueing * (B-WF2Q+): hierarchical scheduling algorithm by which the BFQ I/O * scheduler schedules generic entities. The latter can represent * either single bfq queues (associated with processes) or groups of * bfq queues (associated with cgroups). - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License as - * published by the Free Software Foundation; either version 2 of the - * License, or (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include "bfq-iosched.h" diff --git a/block/bsg-lib.c b/block/bsg-lib.c index 005e2b75d775..b898a1cdf872 100644 --- a/block/bsg-lib.c +++ b/block/bsg-lib.c @@ -1,24 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-or-later /* * BSG helper library * * Copyright (C) 2008 James Smart, Emulex Corporation * Copyright (C) 2011 Red Hat, Inc. All rights reserved. * Copyright (C) 2011 Mike Christie - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - * */ #include #include diff --git a/block/partitions/efi.c b/block/partitions/efi.c index 39f70d968754..db2fef7dfc47 100644 --- a/block/partitions/efi.c +++ b/block/partitions/efi.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-or-later /************************************************************ * EFI GUID Partition Table handling * @@ -7,21 +8,6 @@ * efi.[ch] by Matt Domsch * Copyright 2000,2001,2002,2004 Dell Inc. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - * - * * TODO: * * Changelog: diff --git a/block/partitions/efi.h b/block/partitions/efi.h index abd0b19288a6..3e8576157575 100644 --- a/block/partitions/efi.h +++ b/block/partitions/efi.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ /************************************************************ * EFI GUID Partition Table * Per Intel EFI Specification v1.02 @@ -5,21 +6,6 @@ * * By Matt Domsch Fri Sep 22 22:15:56 CDT 2000 * Copyright 2000,2001 Dell Inc. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - * ************************************************************/ #ifndef FS_PART_EFI_H_INCLUDED diff --git a/block/partitions/ldm.c b/block/partitions/ldm.c index 16766f267559..6db573f33219 100644 --- a/block/partitions/ldm.c +++ b/block/partitions/ldm.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-or-later /** * ldm - Support for Windows Logical Disk Manager (Dynamic Disks) * @@ -6,21 +7,6 @@ * Copyright (C) 2001,2002 Jakob Kemi * * Documentation is available at http://www.linux-ntfs.org/doku.php?id=downloads - * - * This program is free software; you can redistribute it and/or modify it under - * the terms of the GNU General Public License as published by the Free Software - * Foundation; either version 2 of the License, or (at your option) any later - * version. 
- * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS - * FOR A PARTICULAR PURPOSE. See the GNU General Public License for more - * details. - * - * You should have received a copy of the GNU General Public License along with - * this program (in the main directory of the source in the file COPYING); if - * not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, - * Boston, MA 02111-1307 USA */ #include diff --git a/block/partitions/ldm.h b/block/partitions/ldm.h index f4c6055df956..1ca63e97bccc 100644 --- a/block/partitions/ldm.h +++ b/block/partitions/ldm.h @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-or-later /** * ldm - Part of the Linux-NTFS project. * @@ -6,21 +7,6 @@ * Copyright (C) 2001,2002 Jakob Kemi * * Documentation is available at http://www.linux-ntfs.org/doku.php?id=downloads - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License as published by the Free - * Software Foundation; either version 2 of the License, or (at your option) - * any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program (in the main directory of the Linux-NTFS source - * in the file COPYING); if not, write to the Free Software Foundation, - * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ #ifndef _FS_PT_LDM_H_ diff --git a/include/linux/bsg-lib.h b/include/linux/bsg-lib.h index 7f14517a559b..960988d42f77 100644 --- a/include/linux/bsg-lib.h +++ b/include/linux/bsg-lib.h @@ -1,24 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ /* * BSG helper library * * Copyright (C) 2008 James Smart, Emulex Corporation * Copyright (C) 2011 Red Hat, Inc. All rights reserved. * Copyright (C) 2011 Mike Christie - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - * */ #ifndef _BLK_BSG_ #define _BLK_BSG_ From 9fcd030baa36f75721ddc70f7d932983243fff25 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 30 Apr 2019 14:42:41 -0400 Subject: [PATCH 139/164] sed-opal.h: remove redundant licence boilerplate The file already has the correct SPDX header. 
Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- include/uapi/linux/sed-opal.h | 9 --------- 1 file changed, 9 deletions(-) diff --git a/include/uapi/linux/sed-opal.h b/include/uapi/linux/sed-opal.h index e092e124dd16..33e53b80cd1f 100644 --- a/include/uapi/linux/sed-opal.h +++ b/include/uapi/linux/sed-opal.h @@ -5,15 +5,6 @@ * Authors: * Rafael Antognolli * Scott Bauer - * - * This program is free software; you can redistribute it and/or modify it - * under the terms and conditions of the GNU General Public License, - * version 2, as published by the Free Software Foundation. - * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #ifndef _UAPI_SED_OPAL_H From 6353599813158c02227d8086acb20e07f3306452 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 30 Apr 2019 14:42:42 -0400 Subject: [PATCH 140/164] block: add a SPDX tag to blk-mq-rdma.h This file has no copyright notice, but was added as part of a commit adding another file using the default kernel GPLv2 license. Add a matching SPDX tag. Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- include/linux/blk-mq-rdma.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/blk-mq-rdma.h b/include/linux/blk-mq-rdma.h index 7b6ecf9ac4c3..5cc5f0f36218 100644 --- a/include/linux/blk-mq-rdma.h +++ b/include/linux/blk-mq-rdma.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_BLK_MQ_RDMA_H #define _LINUX_BLK_MQ_RDMA_H From 3dcf60bcb603f56361abb364a4cd2f69677453f0 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 30 Apr 2019 14:42:43 -0400 Subject: [PATCH 141/164] block: add SPDX tags to block layer files missing licensing information Various block layer files do not have any licensing information at all. Add SPDX tags for the default kernel GPLv2 license to those. 
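For reference, the convention followed throughout these SPDX patches, and visible in the diffs below, is a single identifier comment on the very first line of each file: C source files get a C++-style line comment, while headers get a block comment. The first line added to a source file such as block/blk-mq.c is

// SPDX-License-Identifier: GPL-2.0

and the first line added to a header such as block/blk-rq-qos.h is

/* SPDX-License-Identifier: GPL-2.0 */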
Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/blk-cgroup.c | 1 + block/blk-core.c | 1 + block/blk-exec.c | 1 + block/blk-iolatency.c | 1 + block/blk-mq-cpumap.c | 1 + block/blk-mq-sched.c | 1 + block/blk-mq-sysfs.c | 1 + block/blk-mq-tag.c | 1 + block/blk-mq.c | 1 + block/blk-rq-qos.c | 2 ++ block/blk-rq-qos.h | 1 + block/blk-settings.c | 1 + block/blk-stat.c | 1 + block/blk-timeout.c | 1 + block/blk-wbt.c | 1 + block/blk-zoned.c | 1 + block/elevator.c | 1 + block/genhd.c | 1 + block/ioctl.c | 1 + block/ioprio.c | 1 + block/mq-deadline.c | 1 + block/partitions/aix.h | 1 + block/partitions/amiga.h | 1 + block/partitions/ibm.h | 1 + block/partitions/karma.h | 1 + block/partitions/msdos.h | 1 + block/partitions/osf.h | 1 + block/partitions/sgi.h | 1 + block/partitions/sun.h | 1 + block/partitions/sysv68.h | 1 + block/partitions/ultrix.h | 1 + 31 files changed, 32 insertions(+) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 617a2b3f7582..b97b479e4f64 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Common Block IO controller cgroup interface * diff --git a/block/blk-core.c b/block/blk-core.c index a55389ba8779..b044829135c9 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Copyright (C) 1991, 1992 Linus Torvalds * Copyright (C) 1994, Karl Keyte: Added support for disk statistics diff --git a/block/blk-exec.c b/block/blk-exec.c index a34b7d918742..1db44ca0f4a6 100644 --- a/block/blk-exec.c +++ b/block/blk-exec.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Functions related to setting various queue properties from drivers */ diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c index 507212d75ee2..d22e61bced86 100644 --- a/block/blk-iolatency.c +++ b/block/blk-iolatency.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Block rq-qos base io controller * diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c index 03a534820271..48bebf00a5f3 100644 --- a/block/blk-mq-cpumap.c +++ b/block/blk-mq-cpumap.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * CPU <-> hardware queue mapping helpers * diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index aa6bc5c02643..f6e3b10b52eb 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * blk-mq scheduling framework * diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 3f9c3f4ac44c..61efc2a29e58 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 #include #include #include diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index a4931fc7be8a..7513c8eaabee 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Tag allocation using scalable bitmaps. 
Uses active queue tracking to support * fairer distribution of tags between multiple submitters when a shared tag map diff --git a/block/blk-mq.c b/block/blk-mq.c index fc60ed7e940e..4f15adfbab29 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Block multiqueue core code * diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c index d169d7188fa6..3f55b56f24bc 100644 --- a/block/blk-rq-qos.c +++ b/block/blk-rq-qos.c @@ -1,3 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0 + #include "blk-rq-qos.h" /* diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h index 564851889550..2300e038b9fa 100644 --- a/block/blk-rq-qos.h +++ b/block/blk-rq-qos.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ #ifndef RQ_QOS_H #define RQ_QOS_H diff --git a/block/blk-settings.c b/block/blk-settings.c index 6375afaedcec..ec150f88db09 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Functions related to setting various queue properties from drivers */ diff --git a/block/blk-stat.c b/block/blk-stat.c index 696a04176e4d..940f15d600f8 100644 --- a/block/blk-stat.c +++ b/block/blk-stat.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Block stat tracking code * diff --git a/block/blk-timeout.c b/block/blk-timeout.c index 124c26128bf6..8aa68fae96ad 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Functions related to generic timeout handling of requests. */ diff --git a/block/blk-wbt.c b/block/blk-wbt.c index fd166fbb0f65..313f45a37e9d 100644 --- a/block/blk-wbt.c +++ b/block/blk-wbt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * buffered writeback throttling. loosely based on CoDel. We can't drop * packets for IO scheduling, so the logic is something like this: diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 2d98803faec2..ae7e91bd0618 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Zoned block device handling * diff --git a/block/elevator.c b/block/elevator.c index 2e5399d9f40f..ec55d5fc0b3e 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * Block device elevator/IO-scheduler. 
* diff --git a/block/genhd.c b/block/genhd.c index 83f5c33d1e80..ad6826628e79 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * gendisk handling */ diff --git a/block/ioctl.c b/block/ioctl.c index 4825c78a6baa..15a0eb80ada9 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 #include #include #include diff --git a/block/ioprio.c b/block/ioprio.c index f9821080c92c..2e0559f157c8 100644 --- a/block/ioprio.c +++ b/block/ioprio.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * fs/ioprio.c * diff --git a/block/mq-deadline.c b/block/mq-deadline.c index 14288f864e94..1876f5712bfd 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler, * for the blk-mq scheduling framework diff --git a/block/partitions/aix.h b/block/partitions/aix.h index e0c66a987523..b4449f0b9f2b 100644 --- a/block/partitions/aix.h +++ b/block/partitions/aix.h @@ -1 +1,2 @@ +/* SPDX-License-Identifier: GPL-2.0 */ extern int aix_partition(struct parsed_partitions *state); diff --git a/block/partitions/amiga.h b/block/partitions/amiga.h index d094585cadaa..7e63f4d9d969 100644 --- a/block/partitions/amiga.h +++ b/block/partitions/amiga.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * fs/partitions/amiga.h */ diff --git a/block/partitions/ibm.h b/block/partitions/ibm.h index 08fb0804a812..8bf13febb2b6 100644 --- a/block/partitions/ibm.h +++ b/block/partitions/ibm.h @@ -1 +1,2 @@ +/* SPDX-License-Identifier: GPL-2.0 */ int ibm_partition(struct parsed_partitions *); diff --git a/block/partitions/karma.h b/block/partitions/karma.h index c764b2e9df21..48e074d417fb 100644 --- a/block/partitions/karma.h +++ b/block/partitions/karma.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * fs/partitions/karma.h */ diff --git a/block/partitions/msdos.h b/block/partitions/msdos.h index 38c781c490b3..fcacfc486092 100644 --- a/block/partitions/msdos.h +++ b/block/partitions/msdos.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * fs/partitions/msdos.h */ diff --git a/block/partitions/osf.h b/block/partitions/osf.h index 20ed2315ec16..4d8088e7ea8c 100644 --- a/block/partitions/osf.h +++ b/block/partitions/osf.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * fs/partitions/osf.h */ diff --git a/block/partitions/sgi.h b/block/partitions/sgi.h index b9553ebdd5a9..a5b77c3987cf 100644 --- a/block/partitions/sgi.h +++ b/block/partitions/sgi.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * fs/partitions/sgi.h */ diff --git a/block/partitions/sun.h b/block/partitions/sun.h index 2424baa8319f..ae1b9eed3fd7 100644 --- a/block/partitions/sun.h +++ b/block/partitions/sun.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * fs/partitions/sun.h */ diff --git a/block/partitions/sysv68.h b/block/partitions/sysv68.h index bf2f5ffa97ac..4fb6b8ec78ae 100644 --- a/block/partitions/sysv68.h +++ b/block/partitions/sysv68.h @@ -1 +1,2 @@ +/* SPDX-License-Identifier: GPL-2.0 */ extern int sysv68_partition(struct parsed_partitions *state); diff --git a/block/partitions/ultrix.h b/block/partitions/ultrix.h index a3cc00b2bded..9f676cead222 100644 --- a/block/partitions/ultrix.h +++ b/block/partitions/ultrix.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * fs/partitions/ultrix.h */ From 12adb7a013e318de553ccee4a006a718667972b3 Mon Sep 17 00:00:00 
2001 From: Christoph Hellwig Date: Tue, 30 Apr 2019 13:56:16 -0400 Subject: [PATCH 142/164] block: remove the unused blk_queue_dma_pad function Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/blk-settings.c | 16 ---------------- include/linux/blkdev.h | 1 - 2 files changed, 17 deletions(-) diff --git a/block/blk-settings.c b/block/blk-settings.c index ec150f88db09..3facc41476be 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -663,22 +663,6 @@ void disk_stack_limits(struct gendisk *disk, struct block_device *bdev, } EXPORT_SYMBOL(disk_stack_limits); -/** - * blk_queue_dma_pad - set pad mask - * @q: the request queue for the device - * @mask: pad mask - * - * Set dma pad mask. - * - * Appending pad buffer to a request modifies the last entry of a - * scatter list such that it includes the pad buffer. - **/ -void blk_queue_dma_pad(struct request_queue *q, unsigned int mask) -{ - q->dma_pad_mask = mask; -} -EXPORT_SYMBOL(blk_queue_dma_pad); - /** * blk_queue_update_dma_pad - update pad mask * @q: the request queue for the device diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 99aa98f60b9e..bd3e3f09bfa0 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1069,7 +1069,6 @@ extern int bdev_stack_limits(struct queue_limits *t, struct block_device *bdev, extern void disk_stack_limits(struct gendisk *disk, struct block_device *bdev, sector_t offset); extern void blk_queue_stack_limits(struct request_queue *t, struct request_queue *b); -extern void blk_queue_dma_pad(struct request_queue *, unsigned int); extern void blk_queue_update_dma_pad(struct request_queue *, unsigned int); extern int blk_queue_dma_drain(struct request_queue *q, dma_drain_needed_fn *dma_drain_needed, From 2d5abb9a1e8e92b25e781f0c3537a5b3b4b2f033 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Wed, 1 May 2019 06:34:09 -0600 Subject: [PATCH 143/164] bcache: make is_discard_enabled() static It's not used outside this file. Fixes: 631207314d88 ("bcache: fix failure in journal relplay") Signed-off-by: Jens Axboe --- drivers/md/bcache/journal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c index f9afb164b887..12dae9348147 100644 --- a/drivers/md/bcache/journal.c +++ b/drivers/md/bcache/journal.c @@ -318,7 +318,7 @@ void bch_journal_mark(struct cache_set *c, struct list_head *list) } } -bool is_discard_enabled(struct cache_set *s) +static bool is_discard_enabled(struct cache_set *s) { struct cache *ca; unsigned int i; From f34e25898a608380a60135288019c4cb6013bec8 Mon Sep 17 00:00:00 2001 From: Sagi Grimberg Date: Mon, 29 Apr 2019 16:25:48 -0700 Subject: [PATCH 144/164] nvme-tcp: fix possible null deref on a timed out io queue connect If I/O queue connect times out, we might have freed the queue socket already, so check for that on the error path in nvme_tcp_start_queue. 
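A minimal sketch of the guard this adds may help; the structure and names below are simplified stand-ins, not the real nvme-tcp types. The idea is that the error path only tears the queue down if it is still marked as allocated:

#include <stdbool.h>

struct sketch_queue {
	bool allocated;			/* set while the queue/socket exists */
	bool live;			/* set once the queue is usable */
};

static void sketch_stop_queue(struct sketch_queue *q)
{
	q->live = false;		/* the real code also quiesces and closes */
}

static int sketch_start_queue(struct sketch_queue *q, int connect_err)
{
	if (!connect_err) {
		q->live = true;
		return 0;
	}
	/* error path: the queue may already have been torn down on timeout */
	if (q->allocated)
		sketch_stop_queue(q);
	return connect_err;
}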
Signed-off-by: Sagi Grimberg Signed-off-by: Christoph Hellwig --- drivers/nvme/host/tcp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 2405bb9c63cc..2b107a1d152b 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -1423,7 +1423,8 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx) if (!ret) { set_bit(NVME_TCP_Q_LIVE, &ctrl->queues[idx].flags); } else { - __nvme_tcp_stop_queue(&ctrl->queues[idx]); + if (test_bit(NVME_TCP_Q_ALLOCATED, &ctrl->queues[idx].flags)) + __nvme_tcp_stop_queue(&ctrl->queues[idx]); dev_err(nctrl->device, "failed to connect queue: %d ret=%d\n", idx, ret); } From 525aa5a705d86e193726ee465d1a975265fabf19 Mon Sep 17 00:00:00 2001 From: Hannes Reinecke Date: Tue, 30 Apr 2019 18:57:09 +0200 Subject: [PATCH 145/164] nvme-multipath: split bios with the ns_head bio_set before submitting If the bio is moved to a different queue via blk_steal_bios() and the original queue is destroyed in nvme_remove_ns() we'll be ending with a crash in bio_endio() as the mempool for the split bio bvecs had already been destroyed. So split the bio using the original queue (which will remain during the lifetime of the bio) before sending it down to the underlying device. Signed-off-by: Hannes Reinecke Reviewed-by: Ming Lei Signed-off-by: Christoph Hellwig --- drivers/nvme/host/multipath.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index f0716f6ce41f..e6ddc83223df 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -232,6 +232,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q, blk_qc_t ret = BLK_QC_T_NONE; int srcu_idx; + /* + * The namespace might be going away and the bio might + * be moved to a different queue via blk_steal_bios(), + * so we need to use the bio_split pool from the original + * queue to allocate the bvecs from. + */ + blk_queue_split(q, &bio); + srcu_idx = srcu_read_lock(&head->srcu); ns = nvme_find_path(head); if (likely(ns)) { From 592b6e7b0226b198c12439065f725be00c92d559 Mon Sep 17 00:00:00 2001 From: Hannes Reinecke Date: Sun, 28 Apr 2019 20:24:42 -0700 Subject: [PATCH 146/164] nvme-multipath: don't print ANA group state by default Signed-off-by: Hannes Reinecke Signed-off-by: Chaitanya Kulkarni Reviewed-by: Sagi Grimberg Signed-off-by: Christoph Hellwig --- drivers/nvme/host/multipath.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index e6ddc83223df..5c9429d41120 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -429,7 +429,7 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl, unsigned *nr_change_groups = data; struct nvme_ns *ns; - dev_info(ctrl->device, "ANA group %d: %s.\n", + dev_dbg(ctrl->device, "ANA group %d: %s.\n", le32_to_cpu(desc->grpid), nvme_ana_state_names[desc->state]); From 049bf37262c61c99f45438910711b55054b24838 Mon Sep 17 00:00:00 2001 From: Klaus Birkelund Jensen Date: Tue, 30 Apr 2019 18:53:29 +0200 Subject: [PATCH 147/164] nvme-pci: fix psdt field for single segment sgls The shortcut for single segment SGL requests did not set the PSDT field to mark the request as using SGLs. 
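The bug class is easy to show in miniature: a command that describes its data with an SGL must also say so in its flags, or the data pointer will be interpreted the wrong way. The sketch below is purely illustrative (the field layout and the SKETCH_USES_SGL value are made up, not the actual NVMe command format); the actual one-line fix is in the diff below:

#include <stdint.h>

#define SKETCH_USES_SGL	(1u << 6)	/* stand-in for the real PSDT encoding */

struct sketch_cmd {
	uint8_t  flags;
	uint64_t data_addr;
	uint32_t data_len;
};

static void sketch_map_single_segment(struct sketch_cmd *cmd,
				      uint64_t dma_addr, uint32_t len)
{
	cmd->flags |= SKETCH_USES_SGL;	/* forgetting this line is the bug */
	cmd->data_addr = dma_addr;
	cmd->data_len = len;
}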
Fixes: 297910571f08 ("nvme-pci: optimize mapping single segment requests using SGLs") Signed-off-by: Klaus Birkelund Jensen Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index c1eecde6b853..efc1da56521c 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -830,6 +830,7 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev, return BLK_STS_RESOURCE; iod->dma_len = bv->bv_len; + cmnd->flags = NVME_CMD_SGL_METABUF; cmnd->dptr.sgl.addr = cpu_to_le64(iod->first_dma); cmnd->dptr.sgl.length = cpu_to_le32(iod->dma_len); cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4; From 9dc1a38ef1925d23c2933c5867df816386d92ff8 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Tue, 30 Apr 2019 09:33:40 -0600 Subject: [PATCH 148/164] nvme-pci: shutdown on timeout during deletion We do not restart a controller in a deleting state for timeout errors. When in this state, unblock potential request dispatchers with failed completions by shutting down the controller on timeout detection. Reported-by: Yufen Yu Signed-off-by: Keith Busch Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index efc1da56521c..3df0f2b29427 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1277,6 +1277,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) struct nvme_dev *dev = nvmeq->dev; struct request *abort_req; struct nvme_command cmd; + bool shutdown = false; u32 csts = readl(dev->bar + NVME_REG_CSTS); /* If PCI error recovery process is happening, we cannot reset or @@ -1313,12 +1314,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) * shutdown, so we return BLK_EH_DONE. */ switch (dev->ctrl.state) { + case NVME_CTRL_DELETING: + shutdown = true; case NVME_CTRL_CONNECTING: case NVME_CTRL_RESETTING: dev_warn_ratelimited(dev->ctrl.device, "I/O %d QID %d timeout, disable controller\n", req->tag, nvmeq->qid); - nvme_dev_disable(dev, false); + nvme_dev_disable(dev, shutdown); nvme_req(req)->flags |= NVME_REQ_CANCELLED; return BLK_EH_DONE; default: From c8e9e9b7646ebe1c5066ddc420d7630876277eb4 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Tue, 30 Apr 2019 09:33:41 -0600 Subject: [PATCH 149/164] nvme-pci: unquiesce admin queue on shutdown Just like IO queues, the admin queue also will not be restarted after a controller shutdown. Unquiesce this queue so that we do not block request dispatch on a permanently disabled controller. Reported-by: Yufen Yu Signed-off-by: Keith Busch Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 3df0f2b29427..ac10d3ad1e75 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2437,8 +2437,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown) * must flush all entered requests to their failed completion to avoid * deadlocking blk-mq hot-cpu notifier. 
*/ - if (shutdown) + if (shutdown) { nvme_start_queues(&dev->ctrl); + if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q)) + blk_mq_unquiesce_queue(dev->ctrl.admin_q); + } mutex_unlock(&dev->shutdown_lock); } From 665648673ef5384c7194ea6df4b55f2da98646cf Mon Sep 17 00:00:00 2001 From: Minwoo Im Date: Fri, 12 Apr 2019 00:52:39 +0900 Subject: [PATCH 150/164] nvme-pci: remove an unneeded variable initialization Variable "n" will be assigned once kstrtoint() succeeds; otherwise it is never referenced, because kstrtoint() returns an error and we bail out of this function. Signed-off-by: Minwoo Im Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index ac10d3ad1e75..e2ff92de41a7 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -146,7 +146,7 @@ static int io_queue_depth_set(const char *val, const struct kernel_param *kp) static int queue_count_set(const char *val, const struct kernel_param *kp) { - int n = 0, ret; + int n, ret; ret = kstrtoint(val, 10, &n); if (ret) From a97234e1ff1ec9d5a41c6adff5632c61639dee6a Mon Sep 17 00:00:00 2001 From: Minwoo Im Date: Fri, 12 Apr 2019 00:18:32 +0900 Subject: [PATCH 151/164] nvme-pci: check more command sizes All NVMe commands have a fixed size of 64 bytes, which has been asserted with BUILD_BUG_ON(). The remaining command structures in linux/nvme.h also need to be checked here. Signed-off-by: Minwoo Im Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index e2ff92de41a7..9c1a8fd68b3a 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -231,19 +231,26 @@ struct nvme_iod { */ static inline void _nvme_check_size(void) { + BUILD_BUG_ON(sizeof(struct nvme_common_command) != 64); BUILD_BUG_ON(sizeof(struct nvme_rw_command) != 64); + BUILD_BUG_ON(sizeof(struct nvme_identify) != 64); BUILD_BUG_ON(sizeof(struct nvme_create_cq) != 64); BUILD_BUG_ON(sizeof(struct nvme_create_sq) != 64); BUILD_BUG_ON(sizeof(struct nvme_delete_queue) != 64); BUILD_BUG_ON(sizeof(struct nvme_features) != 64); + BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64); BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64); + BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64); + BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64); BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64); + BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64); BUILD_BUG_ON(sizeof(struct nvme_command) != 64); BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE); BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE); BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64); BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512); BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64); + BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64); } static unsigned int max_io_queues(void) From a2faf94e57c5237a9cad31f63eeaf2412ed0e951 Mon Sep 17 00:00:00 2001 From: Minwoo Im Date: Fri, 12 Apr 2019 00:18:31 +0900 Subject: [PATCH 152/164] nvme-fabrics: check more command sizes struct common_command provides a common structure for the NVMe-oF command format. It also needs to be checked for unintended size growth.
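The checks themselves are plain compile-time assertions. Below is a stand-alone illustration of the technique with a minimal local BUILD_BUG_ON definition and a made-up 64-byte structure; the kernel's own macro lives in include/linux/build_bug.h and the real structures are the NVMe command definitions in linux/nvme.h:

#include <stdint.h>

/* Minimal stand-in for the kernel macro: fails to compile if cond is true. */
#define BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

struct example_cmd {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t command_id;
	uint32_t nsid;
	uint8_t  rsvd[56];	/* pad the wire format out to 64 bytes */
};

static inline void example_check_size(void)
{
	/* Compilation breaks here if the structure ever changes size. */
	BUILD_BUG_ON(sizeof(struct example_cmd) != 64);
}

int main(void)
{
	example_check_size();
	return 0;
}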
Signed-off-by: Minwoo Im Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fabrics.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index d4cb826f58ff..592d1e61ef7e 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -1188,6 +1188,7 @@ static void __exit nvmf_exit(void) class_destroy(nvmf_class); nvmf_host_put(nvmf_default_host); + BUILD_BUG_ON(sizeof(struct nvmf_common_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_connect_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_property_get_command) != 64); BUILD_BUG_ON(sizeof(struct nvmf_property_set_command) != 64); From 811015409fd4af80bbecb8e46b3aa24c8986fb74 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 30 Apr 2019 11:36:52 -0400 Subject: [PATCH 153/164] nvme: move command size checks to the core Most command aren't PCIe specific, so move the size checking for them to core.c Signed-off-by: Christoph Hellwig Reviewed-by: Bart Van Assche Reviewed-by: Keith Busch Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/core.c | 27 +++++++++++++++++++++++++++ drivers/nvme/host/pci.c | 31 +++---------------------------- 2 files changed, 30 insertions(+), 28 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 3dd043aa6d1f..e970c5adee28 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3879,10 +3879,37 @@ void nvme_start_queues(struct nvme_ctrl *ctrl) } EXPORT_SYMBOL_GPL(nvme_start_queues); +/* + * Check we didn't inadvertently grow the command structure sizes: + */ +static inline void _nvme_check_size(void) +{ + BUILD_BUG_ON(sizeof(struct nvme_common_command) != 64); + BUILD_BUG_ON(sizeof(struct nvme_rw_command) != 64); + BUILD_BUG_ON(sizeof(struct nvme_identify) != 64); + BUILD_BUG_ON(sizeof(struct nvme_features) != 64); + BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64); + BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64); + BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64); + BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64); + BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64); + BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64); + BUILD_BUG_ON(sizeof(struct nvme_command) != 64); + BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE); + BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE); + BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64); + BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512); + BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64); + BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64); +} + + int __init nvme_core_init(void) { int result = -ENOMEM; + _nvme_check_size(); + nvme_wq = alloc_workqueue("nvme-wq", WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0); if (!nvme_wq) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 9c1a8fd68b3a..3e4fb891a95a 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -226,33 +226,6 @@ struct nvme_iod { struct scatterlist *sg; }; -/* - * Check we didin't inadvertently grow the command struct - */ -static inline void _nvme_check_size(void) -{ - BUILD_BUG_ON(sizeof(struct nvme_common_command) != 64); - BUILD_BUG_ON(sizeof(struct nvme_rw_command) != 64); - BUILD_BUG_ON(sizeof(struct nvme_identify) != 64); - BUILD_BUG_ON(sizeof(struct nvme_create_cq) != 64); - BUILD_BUG_ON(sizeof(struct nvme_create_sq) != 64); - BUILD_BUG_ON(sizeof(struct nvme_delete_queue) != 64); - BUILD_BUG_ON(sizeof(struct 
nvme_features) != 64); - BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64); - BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64); - BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64); - BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64); - BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64); - BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64); - BUILD_BUG_ON(sizeof(struct nvme_command) != 64); - BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE); - BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE); - BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64); - BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512); - BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64); - BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64); -} - static unsigned int max_io_queues(void) { return num_possible_cpus() + write_queues + poll_queues; @@ -2988,6 +2961,9 @@ static struct pci_driver nvme_driver = { static int __init nvme_init(void) { + BUILD_BUG_ON(sizeof(struct nvme_create_cq) != 64); + BUILD_BUG_ON(sizeof(struct nvme_create_sq) != 64); + BUILD_BUG_ON(sizeof(struct nvme_delete_queue) != 64); BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS < 2); return pci_register_driver(&nvme_driver); } @@ -2996,7 +2972,6 @@ static void __exit nvme_exit(void) { pci_unregister_driver(&nvme_driver); flush_workqueue(nvme_wq); - _nvme_check_size(); } MODULE_AUTHOR("Matthew Wilcox "); From 893a74b7a76e6e9c5c7199e6aae946f090622fa2 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 30 Apr 2019 11:37:43 -0400 Subject: [PATCH 154/164] nvme: mark nvme_core_init and nvme_core_exit static Signed-off-by: Christoph Hellwig Reviewed-by: Bart Van Assche Reviewed-by: Keith Busch Reviewed-by: Chaitanya Kulkarni --- drivers/nvme/host/core.c | 4 ++-- drivers/nvme/host/nvme.h | 3 --- 2 files changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index e970c5adee28..cd16d98d1f1a 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3904,7 +3904,7 @@ static inline void _nvme_check_size(void) } -int __init nvme_core_init(void) +static int __init nvme_core_init(void) { int result = -ENOMEM; @@ -3956,7 +3956,7 @@ out: return result; } -void __exit nvme_core_exit(void) +static void __exit nvme_core_exit(void) { ida_destroy(&nvme_subsystems_ida); class_destroy(nvme_subsys_class); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 527d64545023..5ee75b5ff83f 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -577,7 +577,4 @@ static inline struct nvme_ns *nvme_get_ns_from_dev(struct device *dev) return dev_to_disk(dev)->private_data; } -int __init nvme_core_init(void); -void __exit nvme_core_exit(void); - #endif /* _NVME_H */ From 6f53e73b9ec5b3cd097077c5ffcb76df708ce3f8 Mon Sep 17 00:00:00 2001 From: Sagi Grimberg Date: Mon, 29 Apr 2019 16:28:19 -0700 Subject: [PATCH 155/164] nvmet: protect discovery change log event list iteration When we iterate on the discovery subsystem controllers we need to protect against concurrent mutations to it. 
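The pattern being applied is the standard one for shared list walks: hold the lock that guards the list for the whole iteration, so a concurrent add or delete cannot free or relink an entry mid-walk. A rough userspace analogue is sketched below; struct ctrl, list_lock and notify_ctrl are illustrative names, not the nvmet API.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct ctrl {
	int id;
	struct ctrl *next;
};

/* ctrl_list and every node on it are guarded by list_lock. */
static struct ctrl *ctrl_list;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static void notify_ctrl(struct ctrl *c)
{
	printf("discovery change -> controller %d\n", c->id);
}

/* Hold the lock across the walk, mirroring the mutex taken in the patch. */
static void disc_changed(void)
{
	struct ctrl *c;

	pthread_mutex_lock(&list_lock);
	for (c = ctrl_list; c; c = c->next)
		notify_ctrl(c);
	pthread_mutex_unlock(&list_lock);
}

int main(void)
{
	struct ctrl *c = malloc(sizeof(*c));

	c->id = 1;
	pthread_mutex_lock(&list_lock);
	c->next = ctrl_list;
	ctrl_list = c;
	pthread_mutex_unlock(&list_lock);

	disc_changed();
	free(c);
	return 0;
}

The lockdep_assert_held() added in the same hunk plays a complementary role: it documents, and verifies when lockdep is enabled, that the caller already holds the configuration semaphore.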
Signed-off-by: Sagi Grimberg Reviewed-by: Minwoo Im Signed-off-by: Christoph Hellwig --- drivers/nvme/target/discovery.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c index e8e09266bfa5..5baf269f3f8a 100644 --- a/drivers/nvme/target/discovery.c +++ b/drivers/nvme/target/discovery.c @@ -30,14 +30,17 @@ void nvmet_port_disc_changed(struct nvmet_port *port, { struct nvmet_ctrl *ctrl; + lockdep_assert_held(&nvmet_config_sem); nvmet_genctr++; + mutex_lock(&nvmet_disc_subsys->lock); list_for_each_entry(ctrl, &nvmet_disc_subsys->ctrls, subsys_entry) { if (subsys && !nvmet_host_allowed(subsys, ctrl->hostnqn)) continue; __nvmet_disc_changed(port, ctrl); } + mutex_unlock(&nvmet_disc_subsys->lock); } static void __nvmet_subsys_disc_changed(struct nvmet_port *port, @@ -46,12 +49,14 @@ static void __nvmet_subsys_disc_changed(struct nvmet_port *port, { struct nvmet_ctrl *ctrl; + mutex_lock(&nvmet_disc_subsys->lock); list_for_each_entry(ctrl, &nvmet_disc_subsys->ctrls, subsys_entry) { if (host && strcmp(nvmet_host_name(host), ctrl->hostnqn)) continue; __nvmet_disc_changed(port, ctrl); } + mutex_unlock(&nvmet_disc_subsys->lock); } void nvmet_subsys_disc_changed(struct nvmet_subsys *subsys, From 273938bf7ae92112e646f9a46b39aa74b64be4e8 Mon Sep 17 00:00:00 2001 From: Raul E Rangel Date: Thu, 2 May 2019 13:48:11 -0600 Subject: [PATCH 156/164] block: fix function name in comment The comment was out of date. Reviewed-by: Bart Van Assche Signed-off-by: Raul E Rangel Signed-off-by: Jens Axboe --- block/blk-mq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 4f15adfbab29..c9bf9b92d2db 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2063,7 +2063,7 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, list_del_init(&page->lru); /* * Remove kmemleak object previously allocated in - * blk_mq_init_rq_map(). + * blk_mq_alloc_rqs(). */ kmemleak_free(page_address(page)); __free_pages(page, page->private); From e87eb301bee183d82bb3d04bd71b6660889a2588 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 30 Apr 2019 09:52:23 +0800 Subject: [PATCH 157/164] blk-mq: grab .q_usage_counter when queuing request from plug code path Just like aio/io_uring, we need to grab two refcounts for queuing one request: one is for submission, the other is for completion. If the request isn't queued from the plug code path, the refcount grabbed in generic_make_request() serves for submission. In theory, this refcount should have been released after the submission (async run queue) is done. blk_freeze_queue() works together with blk_sync_queue() to avoid races between queue cleanup and IO submission; given that async run queue activities are canceled, because hctx->run_work is scheduled with the refcount held, it is fine to not hold the refcount when running the run queue work function to dispatch IO. However, if a request is staggered into the plug list, and finally queued from the plug code path, the refcount on the submission side is actually missed. And we may start to run the queue after the queue is removed, because the queue's kobject refcount isn't guaranteed to be grabbed in the flush-plug-list context; a kernel oops is then triggered, see the following race: blk_mq_flush_plug_list(): blk_mq_sched_insert_requests() insert requests to sw queue or scheduler queue blk_mq_run_hw_queue Because of the concurrent run queue, all requests inserted above may be completed before the above blk_mq_run_hw_queue is called.
Then queue can be freed during the above blk_mq_run_hw_queue(). Fixes the issue by grab .q_usage_counter before calling blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is safe because the queue is absolutely alive before inserting request. Cc: Dongli Zhang Cc: James Smart Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Bart Van Assche Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-mq-sched.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index f6e3b10b52eb..74c6bb871f7e 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -414,6 +414,14 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx, struct list_head *list, bool run_queue_async) { struct elevator_queue *e; + struct request_queue *q = hctx->queue; + + /* + * blk_mq_sched_insert_requests() is called from flush plug + * context only, and hold one usage counter to prevent queue + * from being released. + */ + percpu_ref_get(&q->q_usage_counter); e = hctx->queue->elevator; if (e && e->type->ops.insert_requests) @@ -427,12 +435,14 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx, if (!hctx->dispatch_busy && !e && !run_queue_async) { blk_mq_try_issue_list_directly(hctx, list); if (list_empty(list)) - return; + goto out; } blk_mq_insert_requests(hctx, ctx, list); } blk_mq_run_hw_queue(hctx, run_queue_async); + out: + percpu_ref_put(&q->q_usage_counter); } static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set, From fbc2a15e3433058582e5635aabe48a3011a644a8 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 30 Apr 2019 09:52:24 +0800 Subject: [PATCH 158/164] blk-mq: move cancel of requeue_work into blk_mq_release With holding queue's kobject refcount, it is safe for driver to schedule requeue. However, blk_mq_kick_requeue_list() may be called after blk_sync_queue() is done because of concurrent requeue activities, then requeue work may not be completed when freeing queue, and kernel oops is triggered. So moving the cancel of requeue_work into blk_mq_release() for avoiding race between requeue and freeing queue. Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . 
Bottomley , Reviewed-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-core.c | 1 - block/blk-mq.c | 2 ++ 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index b044829135c9..2af1040b2fa6 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -238,7 +238,6 @@ void blk_sync_queue(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i; - cancel_delayed_work_sync(&q->requeue_work); queue_for_each_hw_ctx(q, hctx, i) cancel_delayed_work_sync(&hctx->run_work); } diff --git a/block/blk-mq.c b/block/blk-mq.c index c9bf9b92d2db..741cf8d55e9c 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2635,6 +2635,8 @@ void blk_mq_release(struct request_queue *q) struct blk_mq_hw_ctx *hctx; unsigned int i; + cancel_delayed_work_sync(&q->requeue_work); + /* hctx kobj stays in hctx */ queue_for_each_hw_ctx(q, hctx, i) { if (!hctx) From c7e2d94b3d1634988a95ac4d77a72dc7487ece06 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 30 Apr 2019 09:52:25 +0800 Subject: [PATCH 159/164] blk-mq: free hw queue's resource in hctx's release handler Once blk_cleanup_queue() returns, tags shouldn't be used any more, because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2 ("blk-mq: Fix a use-after-free") fixes this issue exactly. However, that commit introduces another issue. Before 45a9c9d909b2, we were allowed to run the queue while cleaning up the queue if the queue's kobj refcount was held. After that commit, the queue can't be run during queue cleanup, otherwise an oops can be triggered easily because some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue(). We have invented ways of addressing this kind of issue before, such as: 8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done") c2856ae2f315 ("blk-mq: quiesce queue before freeing queue") But they still can't cover all cases; recently James reported another issue of this kind: https://marc.info/?l=linux-scsi&m=155389088124782&w=2 This issue can be quite hard to address with the previous approaches, given scsi_run_queue() may run requeues for other LUNs. Fix the above issue by freeing hctx's resources in its release handler; this way is safe because tags aren't needed for freeing such hctx resources. This approach follows the typical design pattern wrt. a kobject's release handler. Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J .
Bottomley , Reported-by: James Smart Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free") Cc: stable@vger.kernel.org Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-core.c | 2 +- block/blk-mq-sysfs.c | 6 ++++++ block/blk-mq.c | 8 ++------ block/blk-mq.h | 2 +- 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2af1040b2fa6..81d209568a26 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -375,7 +375,7 @@ void blk_cleanup_queue(struct request_queue *q) blk_exit_queue(q); if (queue_is_mq(q)) - blk_mq_free_queue(q); + blk_mq_exit_queue(q); percpu_ref_exit(&q->q_usage_counter); diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 61efc2a29e58..7593c4c78975 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -11,6 +11,7 @@ #include #include +#include "blk.h" #include "blk-mq.h" #include "blk-mq-tag.h" @@ -34,6 +35,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj) { struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj); + + if (hctx->flags & BLK_MQ_F_BLOCKING) + cleanup_srcu_struct(hctx->srcu); + blk_free_flush_queue(hctx->fq); + sbitmap_free(&hctx->ctx_map); free_cpumask_var(hctx->cpumask); kfree(hctx->ctxs); kfree(hctx); diff --git a/block/blk-mq.c b/block/blk-mq.c index 741cf8d55e9c..1fdb8de92a10 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2268,12 +2268,7 @@ static void blk_mq_exit_hctx(struct request_queue *q, if (set->ops->exit_hctx) set->ops->exit_hctx(hctx, hctx_idx); - if (hctx->flags & BLK_MQ_F_BLOCKING) - cleanup_srcu_struct(hctx->srcu); - blk_mq_remove_cpuhp(hctx); - blk_free_flush_queue(hctx->fq); - sbitmap_free(&hctx->ctx_map); } static void blk_mq_exit_hw_queues(struct request_queue *q, @@ -2908,7 +2903,8 @@ err_exit: } EXPORT_SYMBOL(blk_mq_init_allocated_queue); -void blk_mq_free_queue(struct request_queue *q) +/* tags can _not_ be used after returning from blk_mq_exit_queue */ +void blk_mq_exit_queue(struct request_queue *q) { struct blk_mq_tag_set *set = q->tag_set; diff --git a/block/blk-mq.h b/block/blk-mq.h index 423ea88ab6fb..633a5a77ee8b 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -37,7 +37,7 @@ struct blk_mq_ctx { struct kobject kobj; } ____cacheline_aligned_in_smp; -void blk_mq_free_queue(struct request_queue *q); +void blk_mq_exit_queue(struct request_queue *q); int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr); void blk_mq_wake_waiters(struct request_queue *q); bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool); From 7c6c5b7c9186e3fb5b10afb8e5f710ae661144c6 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 30 Apr 2019 09:52:26 +0800 Subject: [PATCH 160/164] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Split blk_mq_alloc_and_init_hctx into two parts, and one is blk_mq_alloc_hctx() for allocating all hctx resources, another is blk_mq_init_hctx() for initializing hctx, which serves as counter-part of blk_mq_exit_hctx(). Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org Cc: Martin K . Petersen Cc: Christoph Hellwig Cc: James E . J . 
Bottomley Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-mq.c | 139 ++++++++++++++++++++++++++----------------------- 1 file changed, 75 insertions(+), 64 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 1fdb8de92a10..17e63d80b6d6 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2285,15 +2285,65 @@ static void blk_mq_exit_hw_queues(struct request_queue *q, } } +static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set) +{ + int hw_ctx_size = sizeof(struct blk_mq_hw_ctx); + + BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu), + __alignof__(struct blk_mq_hw_ctx)) != + sizeof(struct blk_mq_hw_ctx)); + + if (tag_set->flags & BLK_MQ_F_BLOCKING) + hw_ctx_size += sizeof(struct srcu_struct); + + return hw_ctx_size; +} + static int blk_mq_init_hctx(struct request_queue *q, struct blk_mq_tag_set *set, struct blk_mq_hw_ctx *hctx, unsigned hctx_idx) { - int node; + hctx->queue_num = hctx_idx; - node = hctx->numa_node; + cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead); + + hctx->tags = set->tags[hctx_idx]; + + if (set->ops->init_hctx && + set->ops->init_hctx(hctx, set->driver_data, hctx_idx)) + goto unregister_cpu_notifier; + + if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, + hctx->numa_node)) + goto exit_hctx; + return 0; + + exit_hctx: + if (set->ops->exit_hctx) + set->ops->exit_hctx(hctx, hctx_idx); + unregister_cpu_notifier: + blk_mq_remove_cpuhp(hctx); + return -1; +} + +static struct blk_mq_hw_ctx * +blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set, + int node) +{ + struct blk_mq_hw_ctx *hctx; + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY; + + hctx = kzalloc_node(blk_mq_hw_ctx_size(set), gfp, node); + if (!hctx) + goto fail_alloc_hctx; + + if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node)) + goto free_hctx; + + atomic_set(&hctx->nr_active, 0); if (node == NUMA_NO_NODE) - node = hctx->numa_node = set->numa_node; + node = set->numa_node; + hctx->numa_node = node; INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn); spin_lock_init(&hctx->lock); @@ -2301,58 +2351,45 @@ static int blk_mq_init_hctx(struct request_queue *q, hctx->queue = q; hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED; - cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead); - - hctx->tags = set->tags[hctx_idx]; - /* * Allocate space for all possible cpus to avoid allocation at * runtime */ hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *), - GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node); + gfp, node); if (!hctx->ctxs) - goto unregister_cpu_notifier; + goto free_cpumask; if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8), - GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node)) + gfp, node)) goto free_ctxs; - hctx->nr_ctx = 0; spin_lock_init(&hctx->dispatch_wait_lock); init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake); INIT_LIST_HEAD(&hctx->dispatch_wait.entry); - if (set->ops->init_hctx && - set->ops->init_hctx(hctx, set->driver_data, hctx_idx)) - goto free_bitmap; - hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size, - GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY); + gfp); if (!hctx->fq) - goto exit_hctx; - - if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node)) - goto free_fq; + goto free_bitmap; if (hctx->flags & BLK_MQ_F_BLOCKING) init_srcu_struct(hctx->srcu); + blk_mq_hctx_kobj_init(hctx); - return 0; + return hctx; - free_fq: - blk_free_flush_queue(hctx->fq); 
- exit_hctx: - if (set->ops->exit_hctx) - set->ops->exit_hctx(hctx, hctx_idx); free_bitmap: sbitmap_free(&hctx->ctx_map); free_ctxs: kfree(hctx->ctxs); - unregister_cpu_notifier: - blk_mq_remove_cpuhp(hctx); - return -1; + free_cpumask: + free_cpumask_var(hctx->cpumask); + free_hctx: + kfree(hctx); + fail_alloc_hctx: + return NULL; } static void blk_mq_init_cpu_queues(struct request_queue *q, @@ -2698,51 +2735,25 @@ struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set, } EXPORT_SYMBOL(blk_mq_init_sq_queue); -static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set) -{ - int hw_ctx_size = sizeof(struct blk_mq_hw_ctx); - - BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu), - __alignof__(struct blk_mq_hw_ctx)) != - sizeof(struct blk_mq_hw_ctx)); - - if (tag_set->flags & BLK_MQ_F_BLOCKING) - hw_ctx_size += sizeof(struct srcu_struct); - - return hw_ctx_size; -} - static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx( struct blk_mq_tag_set *set, struct request_queue *q, int hctx_idx, int node) { struct blk_mq_hw_ctx *hctx; - hctx = kzalloc_node(blk_mq_hw_ctx_size(set), - GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, - node); + hctx = blk_mq_alloc_hctx(q, set, node); if (!hctx) - return NULL; + goto fail; - if (!zalloc_cpumask_var_node(&hctx->cpumask, - GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, - node)) { - kfree(hctx); - return NULL; - } - - atomic_set(&hctx->nr_active, 0); - hctx->numa_node = node; - hctx->queue_num = hctx_idx; - - if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) { - free_cpumask_var(hctx->cpumask); - kfree(hctx); - return NULL; - } - blk_mq_hctx_kobj_init(hctx); + if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) + goto free_hctx; return hctx; + + free_hctx: + kobject_put(&hctx->kobj); + fail: + return NULL; } static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, From 2f8f1336a48bd5186de3476da0a3e2ec06d0533a Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 30 Apr 2019 09:52:27 +0800 Subject: [PATCH 161/164] blk-mq: always free hctx after request queue is freed In normal queue cleanup path, hctx is released after request queue is freed, see blk_mq_release(). However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because of hw queues shrinking. This way easily causes use-after-free, because: one implicit rule is that it is safe to call almost all block layer APIs if the request queue is alive; and one hctx may be retrieved by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues(); finally use-after-free is triggered. Fix this issue by always freeing hctx after releasing the request queue. If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce a per-queue list to hold them, then try to reuse these hctxs if the numa node is matched. Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J .
Bottomley , Reviewed-by: Hannes Reinecke Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-mq.c | 46 ++++++++++++++++++++++++++++++------------ include/linux/blk-mq.h | 2 ++ include/linux/blkdev.h | 7 +++++++ 3 files changed, 42 insertions(+), 13 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 17e63d80b6d6..08a6248d8536 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2269,6 +2269,10 @@ static void blk_mq_exit_hctx(struct request_queue *q, set->ops->exit_hctx(hctx, hctx_idx); blk_mq_remove_cpuhp(hctx); + + spin_lock(&q->unused_hctx_lock); + list_add(&hctx->hctx_list, &q->unused_hctx_list); + spin_unlock(&q->unused_hctx_lock); } static void blk_mq_exit_hw_queues(struct request_queue *q, @@ -2351,6 +2355,8 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set, hctx->queue = q; hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED; + INIT_LIST_HEAD(&hctx->hctx_list); + /* * Allocate space for all possible cpus to avoid allocation at * runtime @@ -2664,15 +2670,17 @@ static int blk_mq_alloc_ctxs(struct request_queue *q) */ void blk_mq_release(struct request_queue *q) { - struct blk_mq_hw_ctx *hctx; - unsigned int i; + struct blk_mq_hw_ctx *hctx, *next; + int i; cancel_delayed_work_sync(&q->requeue_work); - /* hctx kobj stays in hctx */ - queue_for_each_hw_ctx(q, hctx, i) { - if (!hctx) - continue; + queue_for_each_hw_ctx(q, hctx, i) + WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list)); + + /* all hctx are in .unused_hctx_list now */ + list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) { + list_del_init(&hctx->hctx_list); kobject_put(&hctx->kobj); } @@ -2739,9 +2747,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx( struct blk_mq_tag_set *set, struct request_queue *q, int hctx_idx, int node) { - struct blk_mq_hw_ctx *hctx; + struct blk_mq_hw_ctx *hctx = NULL, *tmp; - hctx = blk_mq_alloc_hctx(q, set, node); + /* reuse dead hctx first */ + spin_lock(&q->unused_hctx_lock); + list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) { + if (tmp->numa_node == node) { + hctx = tmp; + break; + } + } + if (hctx) + list_del_init(&hctx->hctx_list); + spin_unlock(&q->unused_hctx_lock); + + if (!hctx) + hctx = blk_mq_alloc_hctx(q, set, node); if (!hctx) goto fail; @@ -2779,10 +2800,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, hctx = blk_mq_alloc_and_init_hctx(set, q, i, node); if (hctx) { - if (hctxs[i]) { + if (hctxs[i]) blk_mq_exit_hctx(q, set, hctxs[i], i); - kobject_put(&hctxs[i]->kobj); - } hctxs[i] = hctx; } else { if (hctxs[i]) @@ -2813,9 +2832,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set, if (hctx->tags) blk_mq_free_map_and_requests(set, j); blk_mq_exit_hctx(q, set, hctx, j); - kobject_put(&hctx->kobj); hctxs[j] = NULL; - } } mutex_unlock(&q->sysfs_lock); @@ -2858,6 +2875,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, if (!q->queue_hw_ctx) goto err_sys_init; + INIT_LIST_HEAD(&q->unused_hctx_list); + spin_lock_init(&q->unused_hctx_lock); + blk_mq_realloc_hw_ctxs(set, q); if (!q->nr_hw_queues) goto err_hctxs; diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index db29928de467..15d1aa53d96c 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -70,6 +70,8 @@ struct blk_mq_hw_ctx { struct dentry *sched_debugfs_dir; #endif + struct list_head hctx_list; + /* Must be the last member - see also blk_mq_hw_ctx_size(). 
*/ struct srcu_struct srcu[0]; }; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index bd3e3f09bfa0..1aafeb923e7b 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -535,6 +535,13 @@ struct request_queue { struct mutex sysfs_lock; + /* + * for reusing dead hctx instance in case of updating + * nr_hw_queues + */ + struct list_head unused_hctx_list; + spinlock_t unused_hctx_lock; + atomic_t mq_freeze_depth; #if defined(CONFIG_BLK_DEV_BSG) From 1b97871b501f1bac0fd39a073c4c8473ee457a55 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 30 Apr 2019 09:52:28 +0800 Subject: [PATCH 162/164] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release hctx is always released after requeue is freed. With holding queue's kobject refcount, it is safe for driver to run queue, so one run queue might be scheduled after blk_sync_queue() is done. So moving the cancel of hctx->run_work into blk_mq_hw_sysfs_release() for avoiding run released queue. Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Bart Van Assche Reviewed-by: Hannes Reinecke Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-core.c | 8 -------- block/blk-mq-sysfs.c | 2 ++ 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 81d209568a26..6722b24a1182 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -233,14 +233,6 @@ void blk_sync_queue(struct request_queue *q) { del_timer_sync(&q->timeout); cancel_work_sync(&q->timeout_work); - - if (queue_is_mq(q)) { - struct blk_mq_hw_ctx *hctx; - int i; - - queue_for_each_hw_ctx(q, hctx, i) - cancel_delayed_work_sync(&hctx->run_work); - } } EXPORT_SYMBOL(blk_sync_queue); diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 7593c4c78975..2280d3cca965 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -36,6 +36,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj) struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj); + cancel_delayed_work_sync(&hctx->run_work); + if (hctx->flags & BLK_MQ_F_BLOCKING) cleanup_srcu_struct(hctx->srcu); blk_free_flush_queue(hctx->fq); From 662156641bc409a28fa313fca1a755105425d278 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Tue, 30 Apr 2019 09:52:29 +0800 Subject: [PATCH 163/164] block: don't drain in-progress dispatch in blk_cleanup_queue() Now freeing hw queue resource is moved to hctx's release handler, we don't need to worry about the race between blk_cleanup_queue and run queue any more. So don't drain in-progress dispatch in blk_cleanup_queue(). This is basically revert of c2856ae2f315 ("blk-mq: quiesce queue before freeing queue"). Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . 
Bottomley , Reviewed-by: Bart Van Assche Reviewed-by: Hannes Reinecke Tested-by: James Smart Signed-off-by: Ming Lei Signed-off-by: Jens Axboe --- block/blk-core.c | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 6722b24a1182..419d600e6637 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -339,18 +339,6 @@ void blk_cleanup_queue(struct request_queue *q) blk_queue_flag_set(QUEUE_FLAG_DEAD, q); - /* - * make sure all in-progress dispatch are completed because - * blk_freeze_queue() can only complete all requests, and - * dispatch may still be in-progress since we dispatch requests - * from more than one contexts. - * - * We rely on driver to deal with the race in case that queue - * initialization isn't done. - */ - if (queue_is_mq(q) && blk_queue_init_done(q)) - blk_mq_quiesce_queue(q); - /* for synchronous bio-based driver finish in-flight integrity i/o */ blk_flush_integrity(); From b8753433fc611e23e31300e1d099001a08955c88 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 7 May 2019 08:53:35 +0200 Subject: [PATCH 164/164] block: fix mismerge in bvec_advance When Jens merged my commit to only allow contiguous page structs in a bio_vec with Ming's 5.1 fix to ensure the bvec length didn't overflow, we failed to keep the removal of the expensive nth_page calls. This commit adds them back as intended. Fixes: 5c61ee2cd586 ("Merge tag 'v5.1-rc6' into for-5.2/block") Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- include/linux/bvec.h | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/include/linux/bvec.h b/include/linux/bvec.h index 545a480528e0..a032f01e928c 100644 --- a/include/linux/bvec.h +++ b/include/linux/bvec.h @@ -133,11 +133,6 @@ static inline struct bio_vec *bvec_init_iter_all(struct bvec_iter_all *iter_all) return &iter_all->bv; } -static inline struct page *bvec_nth_page(struct page *page, int idx) -{ - return idx == 0 ? page : nth_page(page, idx); -} - static inline void bvec_advance(const struct bio_vec *bvec, struct bvec_iter_all *iter_all) { @@ -147,8 +142,7 @@ static inline void bvec_advance(const struct bio_vec *bvec, bv->bv_page++; bv->bv_offset = 0; } else { - bv->bv_page = bvec_nth_page(bvec->bv_page, bvec->bv_offset / - PAGE_SIZE); + bv->bv_page = bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT); bv->bv_offset = bvec->bv_offset & ~PAGE_MASK; } bv->bv_len = min_t(unsigned int, PAGE_SIZE - bv->bv_offset,