1
0
Fork 0
alistair23-linux/include/linux/blk-cgroup.h

645 lines
18 KiB
C
Raw Normal View History

#ifndef _BLK_CGROUP_H
#define _BLK_CGROUP_H
/*
* Common Block IO controller cgroup interface
*
* Based on ideas and code from CFQ, CFS and BFQ:
* Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
*
* Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
* Paolo Valente <paolo.valente@unimore.it>
*
* Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
* Nauman Rafique <nauman@google.com>
*/
#include <linux/cgroup.h>
#include <linux/u64_stats_sync.h>
#include <linux/seq_file.h>
#include <linux/radix-tree.h>
blkcg: implement per-blkg request allocation Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-26 16:05:44 -06:00
#include <linux/blkdev.h>
blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Hello, So, this patch should do. Joe, Vivek, can one of you guys please verify that the oops goes away with this patch? Jens, the original thread can be read at http://thread.gmane.org/gmane.linux.kernel/1720729 The fix converts blkg->refcnt from int to atomic_t. It does some overhead but it should be minute compared to everything else which is going on and the involved cacheline bouncing, so I think it's highly unlikely to cause any noticeable difference. Also, the refcnt in question should be converted to a perpcu_ref for blk-mq anyway, so the atomic_t is likely to go away pretty soon anyway. Thanks. ------- 8< ------- __blkg_release_rcu() may be invoked after the associated request_queue is released with a RCU grace period inbetween. As such, the function and callbacks invoked from it must not dereference the associated request_queue. This is clearly indicated in the comment above the function. Unfortunately, while trying to fix a different issue, 2a4fd070ee85 ("blkcg: move bulk of blkcg_gq release operations to the RCU callback") ignored this and added [un]locking of @blkg->q->queue_lock to __blkg_release_rcu(). This of course can cause oops as the request_queue may be long gone by the time this code gets executed. general protection fault: 0000 [#1] SMP CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1 Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013 task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000 RIP: 0010:[<ffffffff8162e9e5>] [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086 RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000 RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39 R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130 R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0 Stack: ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000 ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0 ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30 Call Trace: [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150 [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300 [<ffffffff81091d81>] kthread+0xe1/0x100 [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0 Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f +b7 RIP [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP <ffff88085403fdf0> The request_queue locking was added because blkcg_gq->refcnt is an int protected with the queue lock and __blkg_release_rcu() needs to put the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and dropping queue locking in the function. Given the general heavy weight of the current request_queue and blkcg operations, this is unlikely to cause any noticeable overhead. Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in the near future, so whatever (most likely negligible) overhead it may add is temporary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-19 15:42:57 -06:00
#include <linux/atomic.h>
/* Max limits for throttle policy */
#define THROTL_IOPS_MAX UINT_MAX
#ifdef CONFIG_BLK_CGROUP
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
enum blkg_rwstat_type {
BLKG_RWSTAT_READ,
BLKG_RWSTAT_WRITE,
BLKG_RWSTAT_SYNC,
BLKG_RWSTAT_ASYNC,
BLKG_RWSTAT_NR,
BLKG_RWSTAT_TOTAL = BLKG_RWSTAT_NR,
};
struct blkcg_gq;
struct blkcg {
struct cgroup_subsys_state css;
spinlock_t lock;
struct radix_tree_root blkg_tree;
struct blkcg_gq *blkg_hint;
struct hlist_head blkg_list;
block, cgroup: implement policy-specific per-blkcg data The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 15:38:42 -06:00
struct blkcg_policy_data *pd[BLKCG_MAX_POLS];
writeback: make backing_dev_info host cgroup-specific bdi_writebacks For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb. v3: inode_attach_wb() in account_page_dirtied() moved inside mapping_cap_account_dirty() block where it's known to be !NULL. Also, an unnecessary NULL check before kfree() removed. Both detected by the kbuild bot. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kbuild test robot <fengguang.wu@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 15:13:37 -06:00
struct list_head all_blkcgs_node;
writeback: make backing_dev_info host cgroup-specific bdi_writebacks For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb. v3: inode_attach_wb() in account_page_dirtied() moved inside mapping_cap_account_dirty() block where it's known to be !NULL. Also, an unnecessary NULL check before kfree() removed. Both detected by the kbuild bot. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kbuild test robot <fengguang.wu@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 15:13:37 -06:00
#ifdef CONFIG_CGROUP_WRITEBACK
struct list_head cgwb_list;
#endif
};
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
struct blkg_stat {
struct u64_stats_sync syncp;
uint64_t cnt;
};
struct blkg_rwstat {
struct u64_stats_sync syncp;
uint64_t cnt[BLKG_RWSTAT_NR];
};
/*
* A blkcg_gq (blkg) is association between a block cgroup (blkcg) and a
* request_queue (q). This is used by blkcg policies which need to track
* information per blkcg - q pair.
*
* There can be multiple active blkcg policies and each blkg:policy pair is
* represented by a blkg_policy_data which is allocated and freed by each
* policy's pd_alloc/free_fn() methods. A policy can allocate private data
* area by allocating larger data structure which embeds blkg_policy_data
* at the beginning.
*/
struct blkg_policy_data {
/* the blkg and policy id this per-policy data belongs to */
struct blkcg_gq *blkg;
int plid;
};
block, cgroup: implement policy-specific per-blkcg data The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 15:38:42 -06:00
/*
* Policies that need to keep per-blkcg data which is independent
* from any request_queue associated to it must specify its size
* with the cpd_size field of the blkcg_policy structure and
blkcg: fix blkcg_policy_data allocation bug e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data") updated per-blkcg policy data to be dynamically allocated. When a policy is registered, its policy data aren't created. Instead, when the policy is activated on a queue, the policy data are allocated if there are blkg's (blkcg_gq's) which are attached to a given blkcg. This is buggy. Consider the following scenario. 1. A blkcg is created. No blkg's attached yet. 2. The policy is registered. No policy data is allocated. 3. The policy is activated on a queue. As the above blkcg doesn't have any blkg's, it won't allocate the matching blkcg_policy_data. 4. An IO is issued from the blkcg and blkg is created and the blkcg still doesn't have the matching policy data allocated. With cfq-iosched, this leads to an oops. It also doesn't free policy data on policy unregistration assuming that freeing of all policy data on blkcg destruction should take care of it; however, this also is incorrect. 1. A blkcg has policy data. 2. The policy gets unregistered but the policy data remains. 3. Another policy gets registered on the same slot. 4. Later, the new policy tries to allocate policy data on the previous blkcg but the slot is already occupied and gets skipped. The policy ends up operating on the policy data of the previous policy. There's no reason to manage blkcg_policy_data lazily. The reason we do lazy allocation of blkg's is that the number of all possible blkg's is the product of cgroups and block devices which can reach a surprising level. blkcg_policy_data is contrained by the number of cgroups and shouldn't be a problem. This patch makes blkcg_policy_data to be allocated for all existing blkcg's on policy registration and freed on unregistration and removes blkcg_policy_data handling from policy [de]activation paths. This makes that blkcg_policy_data are created and removed with the policy they belong to and fixes the above described problems. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data") Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-09 14:39:50 -06:00
* embed a blkcg_policy_data in it. cpd_init() is invoked to let
* each policy handle per-blkcg data.
block, cgroup: implement policy-specific per-blkcg data The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 15:38:42 -06:00
*/
struct blkcg_policy_data {
/* the policy id this per-policy data belongs to */
int plid;
};
/* association between a blk cgroup and a request queue */
struct blkcg_gq {
/* Pointer to the associated request_queue */
struct request_queue *q;
struct list_head q_node;
struct hlist_node blkcg_node;
struct blkcg *blkcg;
/*
* Each blkg gets congested separately and the congestion state is
* propagated to the matching bdi_writeback_congested.
*/
struct bdi_writeback_congested *wb_congested;
/* all non-root blkcg_gq's are guaranteed to have access to parent */
struct blkcg_gq *parent;
blkcg: implement per-blkg request allocation Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-26 16:05:44 -06:00
/* request allocation list for this blkcg-q pair */
struct request_list rl;
/* reference count */
blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Hello, So, this patch should do. Joe, Vivek, can one of you guys please verify that the oops goes away with this patch? Jens, the original thread can be read at http://thread.gmane.org/gmane.linux.kernel/1720729 The fix converts blkg->refcnt from int to atomic_t. It does some overhead but it should be minute compared to everything else which is going on and the involved cacheline bouncing, so I think it's highly unlikely to cause any noticeable difference. Also, the refcnt in question should be converted to a perpcu_ref for blk-mq anyway, so the atomic_t is likely to go away pretty soon anyway. Thanks. ------- 8< ------- __blkg_release_rcu() may be invoked after the associated request_queue is released with a RCU grace period inbetween. As such, the function and callbacks invoked from it must not dereference the associated request_queue. This is clearly indicated in the comment above the function. Unfortunately, while trying to fix a different issue, 2a4fd070ee85 ("blkcg: move bulk of blkcg_gq release operations to the RCU callback") ignored this and added [un]locking of @blkg->q->queue_lock to __blkg_release_rcu(). This of course can cause oops as the request_queue may be long gone by the time this code gets executed. general protection fault: 0000 [#1] SMP CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1 Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013 task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000 RIP: 0010:[<ffffffff8162e9e5>] [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086 RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000 RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39 R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130 R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0 Stack: ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000 ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0 ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30 Call Trace: [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150 [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300 [<ffffffff81091d81>] kthread+0xe1/0x100 [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0 Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f +b7 RIP [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP <ffff88085403fdf0> The request_queue locking was added because blkcg_gq->refcnt is an int protected with the queue lock and __blkg_release_rcu() needs to put the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and dropping queue locking in the function. Given the general heavy weight of the current request_queue and blkcg operations, this is unlikely to cause any noticeable overhead. Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in the near future, so whatever (most likely negligible) overhead it may add is temporary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-19 15:42:57 -06:00
atomic_t refcnt;
/* is this blkg online? protected by both blkcg and q locks */
bool online;
struct blkg_policy_data *pd[BLKCG_MAX_POLS];
struct rcu_head rcu_head;
};
block, cgroup: implement policy-specific per-blkcg data The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 15:38:42 -06:00
typedef void (blkcg_pol_init_cpd_fn)(const struct blkcg *blkcg);
typedef struct blkg_policy_data *(blkcg_pol_alloc_pd_fn)(gfp_t gfp, int node);
typedef void (blkcg_pol_init_pd_fn)(struct blkcg_gq *blkg);
typedef void (blkcg_pol_online_pd_fn)(struct blkcg_gq *blkg);
typedef void (blkcg_pol_offline_pd_fn)(struct blkcg_gq *blkg);
typedef void (blkcg_pol_free_pd_fn)(struct blkg_policy_data *pd);
typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkcg_gq *blkg);
struct blkcg_policy {
int plid;
block, cgroup: implement policy-specific per-blkcg data The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 15:38:42 -06:00
/* policy specific per-blkcg data size */
size_t cpd_size;
/* cgroup files for the policy */
struct cftype *cftypes;
/* operations */
block, cgroup: implement policy-specific per-blkcg data The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 15:38:42 -06:00
blkcg_pol_init_cpd_fn *cpd_init_fn;
blkcg_pol_alloc_pd_fn *pd_alloc_fn;
blkcg_pol_init_pd_fn *pd_init_fn;
blkcg_pol_online_pd_fn *pd_online_fn;
blkcg_pol_offline_pd_fn *pd_offline_fn;
blkcg_pol_free_pd_fn *pd_free_fn;
blkcg_pol_reset_pd_stats_fn *pd_reset_stats_fn;
};
extern struct blkcg blkcg_root;
extern struct cgroup_subsys_state * const blkcg_root_css;
struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
struct request_queue *q);
int blkcg_init_queue(struct request_queue *q);
void blkcg_drain_queue(struct request_queue *q);
void blkcg_exit_queue(struct request_queue *q);
/* Blkio controller policy registration */
int blkcg_policy_register(struct blkcg_policy *pol);
void blkcg_policy_unregister(struct blkcg_policy *pol);
int blkcg_activate_policy(struct request_queue *q,
const struct blkcg_policy *pol);
void blkcg_deactivate_policy(struct request_queue *q,
const struct blkcg_policy *pol);
void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
u64 (*prfill)(struct seq_file *,
struct blkg_policy_data *, int),
const struct blkcg_policy *pol, int data,
bool show_total);
u64 __blkg_prfill_u64(struct seq_file *sf, struct blkg_policy_data *pd, u64 v);
u64 __blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
const struct blkg_rwstat *rwstat);
u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
int off);
u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off);
struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
int off);
struct blkg_conf_ctx {
struct gendisk *disk;
struct blkcg_gq *blkg;
u64 v;
};
int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
const char *input, struct blkg_conf_ctx *ctx);
void blkg_conf_finish(struct blkg_conf_ctx *ctx);
static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
{
return css ? container_of(css, struct blkcg, css) : NULL;
}
static inline struct blkcg *task_blkcg(struct task_struct *tsk)
{
cgroup: clean up cgroup_subsys names and initialization cgroup_subsys is a bit messier than it needs to be. * The name of a subsys can be different from its internal identifier defined in cgroup_subsys.h. Most subsystems use the matching name but three - cpu, memory and perf_event - use different ones. * cgroup_subsys_id enums are postfixed with _subsys_id and each cgroup_subsys is postfixed with _subsys. cgroup.h is widely included throughout various subsystems, it doesn't and shouldn't have claim on such generic names which don't have any qualifier indicating that they belong to cgroup. * cgroup_subsys->subsys_id should always equal the matching cgroup_subsys_id enum; however, we require each controller to initialize it and then BUG if they don't match, which is a bit silly. This patch cleans up cgroup_subsys names and initialization by doing the followings. * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each cgroup_subsys with _cgrp_subsys. * With the above, renaming subsys identifiers to match the userland visible names doesn't cause any naming conflicts. All non-matching identifiers are renamed to match the official names. cpu_cgroup -> cpu mem_cgroup -> memory perf -> perf_event * controllers no longer need to initialize ->subsys_id and ->name. They're generated in cgroup core and set automatically during boot. * Redundant cgroup_subsys declarations removed. * While updating BUG_ON()s in cgroup_init_early(), convert them to WARN()s. BUGging that early during boot is stupid - the kernel can't print anything, even through serial console and the trap handler doesn't even link stack frame properly for back-tracing. This patch doesn't introduce any behavior changes. v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs classid handling into core"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Ingo Molnar <mingo@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Thomas Graf <tgraf@suug.ch>
2014-02-08 08:36:58 -07:00
return css_to_blkcg(task_css(tsk, blkio_cgrp_id));
}
static inline struct blkcg *bio_blkcg(struct bio *bio)
{
if (bio && bio->bi_css)
return css_to_blkcg(bio->bi_css);
return task_blkcg(current);
}
static inline struct cgroup_subsys_state *
task_get_blkcg_css(struct task_struct *task)
{
return task_get_css(task, blkio_cgrp_id);
}
/**
* blkcg_parent - get the parent of a blkcg
* @blkcg: blkcg of interest
*
* Return the parent blkcg of @blkcg. Can be called anytime.
*/
static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
{
return css_to_blkcg(blkcg->css.parent);
}
/**
* blkg_to_pdata - get policy private data
* @blkg: blkg of interest
* @pol: policy of interest
*
* Return pointer to private data associated with the @blkg-@pol pair.
*/
static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
struct blkcg_policy *pol)
{
return blkg ? blkg->pd[pol->plid] : NULL;
}
block, cgroup: implement policy-specific per-blkcg data The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 15:38:42 -06:00
static inline struct blkcg_policy_data *blkcg_to_cpd(struct blkcg *blkcg,
struct blkcg_policy *pol)
{
return blkcg ? blkcg->pd[pol->plid] : NULL;
}
/**
* pdata_to_blkg - get blkg associated with policy private data
* @pd: policy private data of interest
*
* @pd is policy private data. Determine the blkg it's associated with.
*/
static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd)
{
return pd ? pd->blkg : NULL;
}
/**
* blkg_path - format cgroup path of blkg
* @blkg: blkg of interest
* @buf: target buffer
* @buflen: target buffer length
*
* Format the path of the cgroup of @blkg into @buf.
*/
static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen)
{
cgroup: remove cgroup->name cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-12 07:29:50 -07:00
char *p;
cgroup: remove cgroup->name cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-12 07:29:50 -07:00
p = cgroup_path(blkg->blkcg->css.cgroup, buf, buflen);
if (!p) {
strncpy(buf, "<unavailable>", buflen);
cgroup: remove cgroup->name cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-12 07:29:50 -07:00
return -ENAMETOOLONG;
}
memmove(buf, p, buf + buflen - p);
return 0;
}
/**
* blkg_get - get a blkg reference
* @blkg: blkg to get
*
blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Hello, So, this patch should do. Joe, Vivek, can one of you guys please verify that the oops goes away with this patch? Jens, the original thread can be read at http://thread.gmane.org/gmane.linux.kernel/1720729 The fix converts blkg->refcnt from int to atomic_t. It does some overhead but it should be minute compared to everything else which is going on and the involved cacheline bouncing, so I think it's highly unlikely to cause any noticeable difference. Also, the refcnt in question should be converted to a perpcu_ref for blk-mq anyway, so the atomic_t is likely to go away pretty soon anyway. Thanks. ------- 8< ------- __blkg_release_rcu() may be invoked after the associated request_queue is released with a RCU grace period inbetween. As such, the function and callbacks invoked from it must not dereference the associated request_queue. This is clearly indicated in the comment above the function. Unfortunately, while trying to fix a different issue, 2a4fd070ee85 ("blkcg: move bulk of blkcg_gq release operations to the RCU callback") ignored this and added [un]locking of @blkg->q->queue_lock to __blkg_release_rcu(). This of course can cause oops as the request_queue may be long gone by the time this code gets executed. general protection fault: 0000 [#1] SMP CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1 Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013 task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000 RIP: 0010:[<ffffffff8162e9e5>] [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086 RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000 RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39 R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130 R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0 Stack: ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000 ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0 ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30 Call Trace: [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150 [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300 [<ffffffff81091d81>] kthread+0xe1/0x100 [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0 Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f +b7 RIP [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP <ffff88085403fdf0> The request_queue locking was added because blkcg_gq->refcnt is an int protected with the queue lock and __blkg_release_rcu() needs to put the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and dropping queue locking in the function. Given the general heavy weight of the current request_queue and blkcg operations, this is unlikely to cause any noticeable overhead. Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in the near future, so whatever (most likely negligible) overhead it may add is temporary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-19 15:42:57 -06:00
* The caller should be holding an existing reference.
*/
static inline void blkg_get(struct blkcg_gq *blkg)
{
blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Hello, So, this patch should do. Joe, Vivek, can one of you guys please verify that the oops goes away with this patch? Jens, the original thread can be read at http://thread.gmane.org/gmane.linux.kernel/1720729 The fix converts blkg->refcnt from int to atomic_t. It does some overhead but it should be minute compared to everything else which is going on and the involved cacheline bouncing, so I think it's highly unlikely to cause any noticeable difference. Also, the refcnt in question should be converted to a perpcu_ref for blk-mq anyway, so the atomic_t is likely to go away pretty soon anyway. Thanks. ------- 8< ------- __blkg_release_rcu() may be invoked after the associated request_queue is released with a RCU grace period inbetween. As such, the function and callbacks invoked from it must not dereference the associated request_queue. This is clearly indicated in the comment above the function. Unfortunately, while trying to fix a different issue, 2a4fd070ee85 ("blkcg: move bulk of blkcg_gq release operations to the RCU callback") ignored this and added [un]locking of @blkg->q->queue_lock to __blkg_release_rcu(). This of course can cause oops as the request_queue may be long gone by the time this code gets executed. general protection fault: 0000 [#1] SMP CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1 Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013 task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000 RIP: 0010:[<ffffffff8162e9e5>] [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086 RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000 RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39 R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130 R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0 Stack: ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000 ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0 ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30 Call Trace: [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150 [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300 [<ffffffff81091d81>] kthread+0xe1/0x100 [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0 Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f +b7 RIP [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP <ffff88085403fdf0> The request_queue locking was added because blkcg_gq->refcnt is an int protected with the queue lock and __blkg_release_rcu() needs to put the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and dropping queue locking in the function. Given the general heavy weight of the current request_queue and blkcg operations, this is unlikely to cause any noticeable overhead. Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in the near future, so whatever (most likely negligible) overhead it may add is temporary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-19 15:42:57 -06:00
WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
atomic_inc(&blkg->refcnt);
}
void __blkg_release_rcu(struct rcu_head *rcu);
/**
* blkg_put - put a blkg reference
* @blkg: blkg to put
*/
static inline void blkg_put(struct blkcg_gq *blkg)
{
blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Hello, So, this patch should do. Joe, Vivek, can one of you guys please verify that the oops goes away with this patch? Jens, the original thread can be read at http://thread.gmane.org/gmane.linux.kernel/1720729 The fix converts blkg->refcnt from int to atomic_t. It does some overhead but it should be minute compared to everything else which is going on and the involved cacheline bouncing, so I think it's highly unlikely to cause any noticeable difference. Also, the refcnt in question should be converted to a perpcu_ref for blk-mq anyway, so the atomic_t is likely to go away pretty soon anyway. Thanks. ------- 8< ------- __blkg_release_rcu() may be invoked after the associated request_queue is released with a RCU grace period inbetween. As such, the function and callbacks invoked from it must not dereference the associated request_queue. This is clearly indicated in the comment above the function. Unfortunately, while trying to fix a different issue, 2a4fd070ee85 ("blkcg: move bulk of blkcg_gq release operations to the RCU callback") ignored this and added [un]locking of @blkg->q->queue_lock to __blkg_release_rcu(). This of course can cause oops as the request_queue may be long gone by the time this code gets executed. general protection fault: 0000 [#1] SMP CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1 Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013 task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000 RIP: 0010:[<ffffffff8162e9e5>] [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086 RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000 RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39 R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130 R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0 Stack: ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000 ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0 ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30 Call Trace: [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150 [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300 [<ffffffff81091d81>] kthread+0xe1/0x100 [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0 Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f +b7 RIP [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP <ffff88085403fdf0> The request_queue locking was added because blkcg_gq->refcnt is an int protected with the queue lock and __blkg_release_rcu() needs to put the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and dropping queue locking in the function. Given the general heavy weight of the current request_queue and blkcg operations, this is unlikely to cause any noticeable overhead. Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in the near future, so whatever (most likely negligible) overhead it may add is temporary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-19 15:42:57 -06:00
WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
if (atomic_dec_and_test(&blkg->refcnt))
call_rcu(&blkg->rcu_head, __blkg_release_rcu);
}
struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q,
bool update_hint);
/**
* blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
* @d_blkg: loop cursor pointing to the current descendant
2013-08-08 18:11:25 -06:00
* @pos_css: used for iteration
* @p_blkg: target blkg to walk descendants of
*
* Walk @c_blkg through the descendants of @p_blkg. Must be used with RCU
* read locked. If called under either blkcg or queue lock, the iteration
* is guaranteed to include all and only online blkgs. The caller may
2013-08-08 18:11:25 -06:00
* update @pos_css by calling css_rightmost_descendant() to skip subtree.
cgroup: make css_for_each_descendant() and friends include the origin css in the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
2013-08-08 18:11:27 -06:00
* @p_blkg is included in the iteration and the first node to be visited.
*/
2013-08-08 18:11:25 -06:00
#define blkg_for_each_descendant_pre(d_blkg, pos_css, p_blkg) \
css_for_each_descendant_pre((pos_css), &(p_blkg)->blkcg->css) \
if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css), \
(p_blkg)->q, false)))
/**
* blkg_for_each_descendant_post - post-order walk of a blkg's descendants
* @d_blkg: loop cursor pointing to the current descendant
2013-08-08 18:11:25 -06:00
* @pos_css: used for iteration
* @p_blkg: target blkg to walk descendants of
*
* Similar to blkg_for_each_descendant_pre() but performs post-order
cgroup: make css_for_each_descendant() and friends include the origin css in the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
2013-08-08 18:11:27 -06:00
* traversal instead. Synchronization rules are the same. @p_blkg is
* included in the iteration and the last node to be visited.
*/
2013-08-08 18:11:25 -06:00
#define blkg_for_each_descendant_post(d_blkg, pos_css, p_blkg) \
css_for_each_descendant_post((pos_css), &(p_blkg)->blkcg->css) \
if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css), \
(p_blkg)->q, false)))
blkcg: implement per-blkg request allocation Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-26 16:05:44 -06:00
/**
* blk_get_rl - get request_list to use
* @q: request_queue of interest
* @bio: bio which will be attached to the allocated request (may be %NULL)
*
* The caller wants to allocate a request from @q to use for @bio. Find
* the request_list to use and obtain a reference on it. Should be called
* under queue_lock. This function is guaranteed to return non-%NULL
* request_list.
*/
static inline struct request_list *blk_get_rl(struct request_queue *q,
struct bio *bio)
{
struct blkcg *blkcg;
struct blkcg_gq *blkg;
rcu_read_lock();
blkcg = bio_blkcg(bio);
/* bypass blkg lookup and use @q->root_rl directly for root */
if (blkcg == &blkcg_root)
goto root_rl;
/*
* Try to use blkg->rl. blkg lookup may fail under memory pressure
* or if either the blkcg or queue is going away. Fall back to
* root_rl in such cases.
*/
blkg = blkg_lookup_create(blkcg, q);
if (unlikely(IS_ERR(blkg)))
goto root_rl;
blkg_get(blkg);
rcu_read_unlock();
return &blkg->rl;
root_rl:
rcu_read_unlock();
return &q->root_rl;
}
/**
* blk_put_rl - put request_list
* @rl: request_list to put
*
* Put the reference acquired by blk_get_rl(). Should be called under
* queue_lock.
*/
static inline void blk_put_rl(struct request_list *rl)
{
if (rl->blkg->blkcg != &blkcg_root)
blkcg: implement per-blkg request allocation Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-26 16:05:44 -06:00
blkg_put(rl->blkg);
}
/**
* blk_rq_set_rl - associate a request with a request_list
* @rq: request of interest
* @rl: target request_list
*
* Associate @rq with @rl so that accounting and freeing can know the
* request_list @rq came from.
*/
static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
{
rq->rl = rl;
}
/**
* blk_rq_rl - return the request_list a request came from
* @rq: request of interest
*
* Return the request_list @rq is allocated from.
*/
static inline struct request_list *blk_rq_rl(struct request *rq)
{
return rq->rl;
}
struct request_list *__blk_queue_next_rl(struct request_list *rl,
struct request_queue *q);
/**
* blk_queue_for_each_rl - iterate through all request_lists of a request_queue
*
* Should be used under queue_lock.
*/
#define blk_queue_for_each_rl(rl, q) \
for ((rl) = &(q)->root_rl; (rl); (rl) = __blk_queue_next_rl((rl), (q)))
block: Use u64_stats_init() to initialize seqcounts Now that seqcounts are lockdep enabled objects, we need to explicitly initialize runtime allocated seqcounts so that lockdep can track them. Without this patch, Fengguang was seeing: [ 4.127282] INFO: trying to register non-static key. [ 4.128027] the code is fine but needs lockdep annotation. [ 4.128027] turning off the locking correctness validator. [ 4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2 [ 4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ ... ] [ 4.128027] Call Trace: [ 4.128027] [<7908e744>] ? console_unlock+0x353/0x380 [ 4.128027] [<79dc7cf2>] dump_stack+0x48/0x60 [ 4.128027] [<7908953e>] __lock_acquire.isra.26+0x7e3/0xceb [ 4.128027] [<7908a1c5>] lock_acquire+0x71/0x9a [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<7940658b>] throtl_update_dispatch_stats+0x7c/0x153 [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<794079aa>] blk_throtl_bio+0x1c3/0x485 ... Use u64_stats_init() for all affected data structures, which initializes the seqcount. Reported-and-Tested-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Peter Zijlstra <peterz@infradead.org> [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-12 20:42:14 -07:00
static inline void blkg_stat_init(struct blkg_stat *stat)
{
u64_stats_init(&stat->syncp);
}
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
/**
* blkg_stat_add - add a value to a blkg_stat
* @stat: target blkg_stat
* @val: value to add
*
* Add @val to @stat. The caller is responsible for synchronizing calls to
* this function.
*/
static inline void blkg_stat_add(struct blkg_stat *stat, uint64_t val)
{
u64_stats_update_begin(&stat->syncp);
stat->cnt += val;
u64_stats_update_end(&stat->syncp);
}
/**
* blkg_stat_read - read the current value of a blkg_stat
* @stat: blkg_stat to read
*
* Read the current value of @stat. This function can be called without
* synchroniztion and takes care of u64 atomicity.
*/
static inline uint64_t blkg_stat_read(struct blkg_stat *stat)
{
unsigned int start;
uint64_t v;
do {
start = u64_stats_fetch_begin_irq(&stat->syncp);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
v = stat->cnt;
} while (u64_stats_fetch_retry_irq(&stat->syncp, start));
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
return v;
}
/**
* blkg_stat_reset - reset a blkg_stat
* @stat: blkg_stat to reset
*/
static inline void blkg_stat_reset(struct blkg_stat *stat)
{
stat->cnt = 0;
}
/**
* blkg_stat_merge - merge a blkg_stat into another
* @to: the destination blkg_stat
* @from: the source
*
* Add @from's count to @to.
*/
static inline void blkg_stat_merge(struct blkg_stat *to, struct blkg_stat *from)
{
blkg_stat_add(to, blkg_stat_read(from));
}
block: Use u64_stats_init() to initialize seqcounts Now that seqcounts are lockdep enabled objects, we need to explicitly initialize runtime allocated seqcounts so that lockdep can track them. Without this patch, Fengguang was seeing: [ 4.127282] INFO: trying to register non-static key. [ 4.128027] the code is fine but needs lockdep annotation. [ 4.128027] turning off the locking correctness validator. [ 4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2 [ 4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ ... ] [ 4.128027] Call Trace: [ 4.128027] [<7908e744>] ? console_unlock+0x353/0x380 [ 4.128027] [<79dc7cf2>] dump_stack+0x48/0x60 [ 4.128027] [<7908953e>] __lock_acquire.isra.26+0x7e3/0xceb [ 4.128027] [<7908a1c5>] lock_acquire+0x71/0x9a [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<7940658b>] throtl_update_dispatch_stats+0x7c/0x153 [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<794079aa>] blk_throtl_bio+0x1c3/0x485 ... Use u64_stats_init() for all affected data structures, which initializes the seqcount. Reported-and-Tested-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Peter Zijlstra <peterz@infradead.org> [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-12 20:42:14 -07:00
static inline void blkg_rwstat_init(struct blkg_rwstat *rwstat)
{
u64_stats_init(&rwstat->syncp);
}
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
/**
* blkg_rwstat_add - add a value to a blkg_rwstat
* @rwstat: target blkg_rwstat
* @rw: mask of REQ_{WRITE|SYNC}
* @val: value to add
*
* Add @val to @rwstat. The counters are chosen according to @rw. The
* caller is responsible for synchronizing calls to this function.
*/
static inline void blkg_rwstat_add(struct blkg_rwstat *rwstat,
int rw, uint64_t val)
{
u64_stats_update_begin(&rwstat->syncp);
if (rw & REQ_WRITE)
rwstat->cnt[BLKG_RWSTAT_WRITE] += val;
else
rwstat->cnt[BLKG_RWSTAT_READ] += val;
if (rw & REQ_SYNC)
rwstat->cnt[BLKG_RWSTAT_SYNC] += val;
else
rwstat->cnt[BLKG_RWSTAT_ASYNC] += val;
u64_stats_update_end(&rwstat->syncp);
}
/**
* blkg_rwstat_read - read the current values of a blkg_rwstat
* @rwstat: blkg_rwstat to read
*
* Read the current snapshot of @rwstat and return it as the return value.
* This function can be called without synchronization and takes care of
* u64 atomicity.
*/
static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
{
unsigned int start;
struct blkg_rwstat tmp;
do {
start = u64_stats_fetch_begin_irq(&rwstat->syncp);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
tmp = *rwstat;
} while (u64_stats_fetch_retry_irq(&rwstat->syncp, start));
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
return tmp;
}
/**
* blkg_rwstat_total - read the total count of a blkg_rwstat
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
* @rwstat: blkg_rwstat to read
*
* Return the total count of @rwstat regardless of the IO direction. This
* function can be called without synchronization and takes care of u64
* atomicity.
*/
static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 15:38:42 -06:00
{
struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
return tmp.cnt[BLKG_RWSTAT_READ] + tmp.cnt[BLKG_RWSTAT_WRITE];
}
/**
* blkg_rwstat_reset - reset a blkg_rwstat
* @rwstat: blkg_rwstat to reset
*/
static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat)
{
memset(rwstat->cnt, 0, sizeof(rwstat->cnt));
}
/**
* blkg_rwstat_merge - merge a blkg_rwstat into another
* @to: the destination blkg_rwstat
* @from: the source
*
* Add @from's counts to @to.
*/
static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
struct blkg_rwstat *from)
{
struct blkg_rwstat v = blkg_rwstat_read(from);
int i;
u64_stats_update_begin(&to->syncp);
for (i = 0; i < BLKG_RWSTAT_NR; i++)
to->cnt[i] += v.cnt[i];
u64_stats_update_end(&to->syncp);
}
#else /* CONFIG_BLK_CGROUP */
struct blkcg {
};
struct blkg_policy_data {
};
block, cgroup: implement policy-specific per-blkcg data The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit. The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically- computed weights, one for each device. The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures. On the other hand, for each cgroup and device, the corresponding dynamically- computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device. The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional- share policy). This commit addresses the above issues through the following changes: . It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it lets also the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures. . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures. . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 15:38:42 -06:00
struct blkcg_policy_data {
};
struct blkcg_gq {
};
struct blkcg_policy {
};
#define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
static inline struct cgroup_subsys_state *
task_get_blkcg_css(struct task_struct *task)
{
return NULL;
}
#ifdef CONFIG_BLOCK
static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
static inline void blkcg_drain_queue(struct request_queue *q) { }
static inline void blkcg_exit_queue(struct request_queue *q) { }
static inline int blkcg_policy_register(struct blkcg_policy *pol) { return 0; }
static inline void blkcg_policy_unregister(struct blkcg_policy *pol) { }
static inline int blkcg_activate_policy(struct request_queue *q,
const struct blkcg_policy *pol) { return 0; }
static inline void blkcg_deactivate_policy(struct request_queue *q,
const struct blkcg_policy *pol) { }
static inline struct blkcg *bio_blkcg(struct bio *bio) { return NULL; }
blkcg: implement per-blkg request allocation Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-26 16:05:44 -06:00
static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
struct blkcg_policy *pol) { return NULL; }
static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd) { return NULL; }
static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
static inline void blkg_get(struct blkcg_gq *blkg) { }
static inline void blkg_put(struct blkcg_gq *blkg) { }
blkcg: implement per-blkg request allocation Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate 458f27a982 "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-26 16:05:44 -06:00
static inline struct request_list *blk_get_rl(struct request_queue *q,
struct bio *bio) { return &q->root_rl; }
static inline void blk_put_rl(struct request_list *rl) { }
static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl) { }
static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q->root_rl; }
#define blk_queue_for_each_rl(rl, q) \
for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
#endif /* CONFIG_BLOCK */
#endif /* CONFIG_BLK_CGROUP */
#endif /* _BLK_CGROUP_H */