redonkable/alistair23-linux

Author	SHA1	Message	Date
Tejun Heo	3dd06ffa9d	cgroup: rename cgroup_dummy_root and related names The dummy root will be repurposed to serve as the default unified hierarchy. Let's rename things in preparation. * s/cgroup_dummy_root/cgrp_dfl_root/ * s/cgroupfs_root/cgroup_root/ as we don't do fs part directly anymore * s/cgroup_root->top_cgroup/cgroup_root->cgrp/ for brevity This is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:54 -04:00
Tejun Heo	944196278d	cgroup: move ->subsys_mask from cgroupfs_root to cgroup cgroupfs_root->subsys_mask represents the controllers attached to the hierarchy. This patch moves the field to cgroup. Subsystem initialization and rebinding updates the top cgroup's subsys_mask. For !root cgroups, the subsys_mask bits are set from create_css() and cleared from kill_css(), which effectively means that all cgroups will have the same subsys_mask as the top cgroup. While this doesn't make any difference now, this will help implementation of the default unified hierarchy where !root cgroups may have subsets of the top_cgroup's subsys_mask. While at it, __kill_css() is split out of kill_css(). The former doesn't care about the subsys_mask while the latter becomes noop if the controller is already killed and clears the matching bit if not before proceeding to killing the css. This will be used later by the default unified hierarchy implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:54 -04:00
Tejun Heo	5df3603229	cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding Currently, while rebinding, cgroup_dummy_root serves as the anchor point. In addition to the target root, rebind_subsystems() takes @added_mask and @removed_mask. The subsystems specified in the former are expected to be on the dummy root and then moved to the target root. The ones in the latter are moved from non-dummy root to dummy. Now that the dummy root is a fully functional one and we're planning to use it for the default unified hierarchy, this level of distinction between dummy and non-dummy roots is quite awkward. This patch updates rebind_subsystems() to take the target root and one subsystem mask and move the specified subsystmes to the target root which may or may not be the dummy root. IOW, unbinding now becomes moving the subsystems to the dummy root and binding to non-dummy root. This makes the dummy root mostly equivalent to other hierarchies in terms of the mechanism of moving subsystems around; however, we still retain all the semantical restrictions so that this patch doesn't introduce any visible behavior differences. Another noteworthy detail is that rebind_subsystems() guarantees that moving a subsystem to the dummy root never fails so that valid unmounting attempts always succeed. This unifies binding and unbinding of subsystems. The invocation points of ->bind() were inconsistent between the two and now moved after whole rebinding is complete. This doesn't break the current users and generally makes more sense. All rebind_subsystems() users are converted accordingly. Note that cgroup_remount() now makes two calls to rebind_subsystems() to bind and then unbind the requested subsystems. This will allow repurposing of the dummy hierarchy as the default unified hierarchy and shouldn't make any userland visible behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:54 -04:00
Tejun Heo	985ed67014	cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root cgroup_dummy_root is used to host controllers which aren't attached to any other hierarchy. The root is minimally set up during kernfs bootstrap and didn't go through full hierarchy initialization. We're planning to use cgroup_dummy_root for the default unified hierarchy and thus want it to be fully functional. Replace the special initialization, which was collected into cgroup_init() by the previous patch, with an invocation of cgroup_setup_root(). This simplifies the init path and makes cgroup_dummy_root a full hierarchy with its own kernfs_root and all. As this puts the dummy hierarchy on the cgroup_roots list, rename for_each_active_root() to for_each_root() and update its users to skip the dummy root for now. This patch doesn't cause any userland visible behavior changes at this point. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:53 -04:00
Tejun Heo	172a2c0685	cgroup: reorganize cgroup bootstrapping * Fields of init_css_set and css_set_count are now set using initializer instead of programmatically from cgroup_init_early(). * init_cgroup_root() now also takes @opts and performs the optional part of initialization too. The leftover part of cgroup_root_from_opts() is collapsed into its only caller - cgroup_mount(). * Initialization of cgroup_root_count and linking of init_css_set are moved from cgroup_init_early() to to cgroup_init(). None of the early_init users depends on init_css_set being linked. * Subsystem initializations are moved after dummy hierarchy init and init_css_set linking. These changes reorganize the bootstrap logic so that the dummy hierarchy can share the usual hierarchy init path and be made more normal. These changes don't make noticeable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:53 -04:00
Tejun Heo	5d77381fd8	cgroup: relocate setting of CGRP_DEAD In cgroup_destroy_locked(), move setting of CGRP_DEAD above invocations of kill_css(). This doesn't make any visible behavior difference now but will be used to inhibit manipulating controller enable states of a dying cgroup on the unified hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:53 -04:00
Tejun Heo	952aaa1254	cgroup: update cgroup_transfer_tasks() to either succeed or fail cgroup_transfer_tasks() can currently fail in the middle due to memory allocation failure. When that happens, the function just aborts and returns error code and there's no way to tell how many actually got migrated at the point of failure and or to revert the partial migration. Update it to use cgroup_migrate{_add_src\|prepare_dst\|migrate\|finish}() so that the function either succeeds or fails as a whole as long as ->can_attach() doesn't fail. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:03 -05:00
Tejun Heo	0e1d768f1b	cgroup: drop task_lock() protection around task->cgroups For optimization, task_lock() is additionally used to protect task->cgroups. The optimization is pretty dubious as either css_set_rwsem is grabbed anyway or PF_EXITING already protects task->cgroups. It adds only overhead and confusion at this point. Let's drop task_[un]lock() and update comments accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:03 -05:00
Tejun Heo	eaf797abc5	cgroup: update how a newly forked task gets associated with css_set When a new process is forked, cgroup_fork() associates it with the css_set of its parent but doesn't link it into it. After the new process is linked to tasklist, cgroup_post_fork() does the linking. This is problematic for cgroup_transfer_tasks() as there's no way to tell whether there are tasks which are pointing to a css_set but not linked yet. It is impossible to implement an operation which transfer all tasks of a cgroup to another and the current cgroup_transfer_tasks() can easily be tricked into leaving a newly forked process behind if it gets called between cgroup_fork() and cgroup_post_fork(). Let's make association with a css_set and linking atomic by moving it to cgroup_post_fork(). cgroup_fork() sets child->cgroups to init_css_set as a placeholder and cgroup_post_fork() is updated to perform both the association with the parent's cgroup and linking there. This means that a newly created task will point to init_css_set without holding a ref to it much like what it does on the exit path. Empty cg_list is used to indicate that the task isn't holding a ref to the associated css_set. This fixes an actual bug with cgroup_transfer_tasks(); however, I'm not marking it for -stable. The whole thing is broken in multiple other ways which require invasive updates to fix and I don't think it's worthwhile to bother with backporting this particular one. Fortunately, the only user is cpuset and these bugs don't crash the machine. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:03 -05:00
Tejun Heo	1958d2d53d	cgroup: split process / task migration into four steps Currently, process / task migration is a single operation which may fail depending on memory pressure or the involved controllers' ->can_attach() callbacks. One problem with this approach is migration of multiple targets. It's impossible to tell whether a given target will be successfully migrated beforehand and cgroup core can't keep track of enough states to roll back after intermediate failure. This is already an issue with cgroup_transfer_tasks(). Also, we're gonna need multiple target migration for unified hierarchy. This patch splits migration into four stages - cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(), cgroup_migrate() and cgroup_migrate_finish(), where cgroup_migrate_prepare_dst() performs all the operations which may fail due to allocation failure without actually migrating the target. The four separate stages mean that, disregarding ->can_attach() failures, the success or failure of multi target migration can be determined before performing any actual migration. If preparations of all targets succeed, the whole thing will succeed. If not, the whole operation can fail without any side-effect. Since the previous patch to use css_set->mg_tasks to keep track of migration targets, the only thing which may need memory allocation during migration is the target css_sets. cgroup_migrate_prepare() pins all source and target css_sets and link them up. Note that this can be performed without holding threadgroup_lock even if the target is a process. As long as cgroup_mutex is held, no new css_set can be put into play. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:03 -05:00
Tejun Heo	ceb6a081f6	cgroup: separate out cset_group_from_root() from task_cgroup_from_root() This will be used by the planned migration path update. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:02 -05:00
Tejun Heo	b3dc094e93	cgroup: use css_set->mg_tasks to track target tasks during migration Currently, while migrating tasks from one cgroup to another, cgroup_attach_task() builds a flex array of all target tasks; unfortunately, this has a couple issues. * Flex array has size limit. On 64bit, struct task_and_cgroup is 24bytes making the flex element limit around 87k. It is a high number but not impossible to hit. This means that the current cgroup implementation can't migrate a process with more than 87k threads. * Process migration involves memory allocation whose size is dependent on the number of threads the process has. This means that cgroup core can't guarantee success or failure of multi-process migrations as memory allocation failure can happen in the middle. This is in part because cgroup can't grab threadgroup locks of multiple processes at the same time, so when there are multiple processes to migrate, it is imposible to tell how many tasks are to be migrated beforehand. Note that this already affects cgroup_transfer_tasks(). cgroup currently cannot guarantee atomic success or failure of the operation. It may fail in the middle and after such failure cgroup doesn't have enough information to roll back properly. It just aborts with some tasks migrated and others not. To resolve the situation, this patch updates the migration path to use task->cg_list to track target tasks. The previous patch already added css_set->mg_tasks and updated iterations in non-migration paths to include them during task migration. This patch updates migration path to actually make use of it. Instead of putting onto a flex_array, each target task is moved from its css_set->tasks list to css_set->mg_tasks and the migration path keeps trace of all the source css_sets and the associated cgroups. Once all source css_sets are determined, the destination css_set for each is determined, linked to the matching source css_set and put on a separate list. To iterate the target tasks, migration path just needs to iterat through either the source or target css_sets, depending on whether migration has been committed or not, and the tasks on their ->mg_tasks lists. cgroup_taskset is updated to contain the list_heads for source and target css_sets and the iteration cursor. cgroup_taskset_*() are accordingly updated to walk through css_sets and their ->mg_tasks. This resolves the above listed issues with moderate additional complexity. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:01 -05:00
Tejun Heo	c75611282c	cgroup: add css_set->mg_tasks Currently, while migrating tasks from one cgroup to another, cgroup_attach_task() builds a flex array of all target tasks; unfortunately, this has a couple issues. * Flex array has size limit. On 64bit, struct task_and_cgroup is 24bytes making the flex element limit around 87k. It is a high number but not impossible to hit. This means that the current cgroup implementation can't migrate a process with more than 87k threads. * Process migration involves memory allocation whose size is dependent on the number of threads the process has. This means that cgroup core can't guarantee success or failure of multi-process migrations as memory allocation failure can happen in the middle. This is in part because cgroup can't grab threadgroup locks of multiple processes at the same time, so when there are multiple processes to migrate, it is imposible to tell how many tasks are to be migrated beforehand. Note that this already affects cgroup_transfer_tasks(). cgroup currently cannot guarantee atomic success or failure of the operation. It may fail in the middle and after such failure cgroup doesn't have enough information to roll back properly. It just aborts with some tasks migrated and others not. To resolve the situation, we're going to use task->cg_list during migration too. Instead of building a separate array, target tasks will be linked into a dedicated migration list_head on the owning css_set. Tasks on the migration list are treated the same as tasks on the usual tasks list; however, being on a separate list allows cgroup migration code path to keep track of the target tasks by simply keeping the list of css_sets with tasks being migrated, making unpredictable dynamic allocation unnecessary. In prepartion of such migration path update, this patch introduces css_set->mg_tasks list and updates css_set task iterations so that they walk both css_set->tasks and ->mg_tasks. Note that ->mg_tasks isn't used yet. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:01 -05:00
Tejun Heo	f153ad11bc	Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15 Pull in for-3.14-fixes to receive `532de3fc72` ("cgroup: update cgroup_enable_task_cg_lists() to grab siglock") which conflicts with `afeb0f9fd4` ("cgroup: relocate cgroup_enable_task_cg_lists()") and the following cg_lists updates. This is likely to cause further conflicts down the line too, so let's merge it early. As cgroup_enable_task_cg_lists() is relocated in for-3.15, this merge causes conflict in the original position. It's resolved by applying siglock changes to the updated version in the new location. Conflicts: kernel/cgroup.c Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-25 09:56:49 -05:00
Tejun Heo	532de3fc72	cgroup: update cgroup_enable_task_cg_lists() to grab siglock Currently, there's nothing preventing cgroup_enable_task_cg_lists() from missing set PF_EXITING and race against cgroup_exit(). Depending on the timing, cgroup_exit() may finish with the task still linked on css_set leading to list corruption. Fix it by grabbing siglock in cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be visible. This whole on-demand cg_list optimization is extremely fragile and has ample possibility to lead to bugs which can cause things like once-a-year oops during boot. I'm wondering whether the better approach would be just adding "cgroup_disable=all" handling which disables the whole cgroup rather than tempting fate with this on-demand craziness. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org	2014-02-18 18:23:18 -05:00
Li Zefan	dc5736ed7a	cgroup: add a validation check to cgroup_add_cftyps() Fengguang reported this bug: BUG: unable to handle kernel NULL pointer dereference at 0000003c IP: [<cc90b4ad>] cgroup_cfts_commit+0x27/0x1c1 ... Call Trace: [<cc9d1129>] ? kmem_cache_alloc_trace+0x33f/0x3b7 [<cc90c6fc>] cgroup_add_cftypes+0x8f/0xca [<cd78b646>] cgroup_init+0x6a/0x26a [<cd764d7d>] start_kernel+0x4d7/0x57a [<cd7642ef>] i386_start_kernel+0x92/0x96 This happens in a corner case. If CGROUP_SCHED=y but CFS_BANDWIDTH=n && FAIR_GROUP_SCHED=n && RT_GROUP_SCHED=n, we have: cpu_files[] = { { } /* terminate */ } When we pass cpu_files to cgroup_apply_cftypes(), as cpu_files[0].ss is NULL, we'll access NULL pointer. The bug was introduced by commit `de00ffa56e` ("cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes()"). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-18 18:20:09 -05:00
Li Zefan	6534fd6c15	cgroup: fix memory leak in cgroup_mount() We should free the memory allocated in parse_cgroupfs_options() before calling this function again. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-14 10:52:40 -05:00
Li Zefan	bad3466034	cgroup: fix locking in cgroupstats_build() css_set_lock has been converted to css_set_rwsem, and rwsem can't nest inside rcu_read_lock. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-14 10:52:39 -05:00
Fengguang Wu	430af8ad9d	cgroup: fix coccinelle warnings kernel/cgroup.c:2256:1-3: WARNING: PTR_RET can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: coccinelle/api/ptr_ret.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-13 16:42:45 -05:00
Tejun Heo	8541fecc04	cgroup: unexport functions With module support gone, a lot of functions no longer need to be exported. Unexport them. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:43 -05:00
Tejun Heo	9db8de3722	cgroup: cosmetic updates to cgroup_attach_task() cgroup_attach_task() is planned to go through restructuring. Let's tidy it up a bit in preparation. * Update cgroup_attach_task() to receive the target task argument in @leader instead of @tsk. * Rename @tsk to @task. * Rename @retval to @ret. This is purely cosmetic. v2: get_nr_threads() was using uninitialized @task instead of @leader. Fixed. Reported by Dan Carpenter. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com>	2014-02-13 06:58:43 -05:00
Tejun Heo	bc668c7519	cgroup: remove cgroup_taskset_cur_css() and cgroup_taskset_size() The two functions don't have any users left. Remove them along with cgroup_taskset->cur_cgrp. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:43 -05:00
Tejun Heo	cb0f1fe9ba	cgroup: move css_set_rwsem locking outside of cgroup_task_migrate() Instead of repeatedly locking and unlocking css_set_rwsem inside cgroup_task_migrate(), update cgroup_attach_task() to grab it outside of the loop and update cgroup_task_migrate() to use put_css_set_locked(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:41 -05:00
Tejun Heo	89c5509b0d	cgroup: separate out put_css_set_locked() and remove put_css_set_taskexit() put_css_set() is performed in two steps - it first tries to put without grabbing css_set_rwsem if such put wouldn't make the count zero. If that fails, it puts after write-locking css_set_rwsem. This patch separates out the second phase into put_css_set_locked() which should be called with css_set_rwsem locked. Also, put_css_set_taskexit() is droped and put_css_set() is made to take @taskexit. There are only a handful users of these functions. No point in providing different variants. put_css_locked() will be used by later changes. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:40 -05:00
Tejun Heo	889ed9ceaa	cgroup: remove css_scan_tasks() css_scan_tasks() doesn't have any user left. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:40 -05:00
Tejun Heo	96d365e0b8	cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem Currently there are two ways to walk tasks of a cgroup - css_task_iter_start/next/end() and css_scan_tasks(). The latter builds on the former but allows blocking while iterating. Unfortunately, the way css_scan_tasks() is implemented is rather nasty, it uses a priority heap of pointers to extract some number of tasks in task creation order and loops over them invoking the callback and repeats that until it reaches the end. It requires either preallocated heap or may fail under memory pressure, while unlikely to be problematic, the complexity is O(N^2), and in general just nasty. We're gonna convert all css_scan_users() to css_task_iter_start/next/end() and remove css_scan_users(). As css_scan_tasks() users may block, let's convert css_set_lock to a rwsem so that tasks can block during css_task_iter_*() is in progress. While this does increase the chance of possible deadlock scenarios, given the current usage, the probability is relatively low, and even if that happens, the right thing to do is updating the iteration in the similar way to css iterators so that it can handle blocking. Most conversions are trivial; however, task_cgroup_path() now expects to be called with css_set_rwsem locked instead of locking itself. This is because the function is called with RCU read lock held and rwsem locking should nest outside RCU read lock. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:40 -05:00
Tejun Heo	e406d1cfff	cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks() Reimplement cgroup_transfer_tasks() so that it repeatedly fetches the first task in the cgroup and then tranfers it. This achieves the same result without using css_scan_tasks() which is scheduled to be removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:39 -05:00
Tejun Heo	07bc356ed2	cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count() cgroup_task_count() read-locks css_set_lock and walks all tasks to count them and then returns the result. The only thing all the users want is determining whether the cgroup is empty or not. This patch implements cgroup_has_tasks() which tests whether cgroup->cset_links is empty, replaces all cgroup_task_count() usages and unexports it. Note that the test isn't synchronized. This is the same as before. The test has always been racy. This will help planned css_set locking update. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-02-13 06:58:39 -05:00
Tejun Heo	afeb0f9fd4	cgroup: relocate cgroup_enable_task_cg_lists() Move it above so that prototype isn't necessary. Let's also move the definition of use_task_css_set_links next to it. This is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:39 -05:00
Tejun Heo	56fde9e01d	cgroup: enable task_cg_lists on the first cgroup mount Tasks are not linked on their css_sets until cgroup task iteration is actually used. This is to avoid incurring overhead on the fork and exit paths for systems which have cgroup compiled in but don't use it. This lazy binding also affects the task migration path. It has to be careful so that it doesn't link tasks to css_sets when task_cg_lists linking is not enabled yet. Unfortunately, this conditional linking in the migration path interferes with planned migration updates. This patch moves the lazy binding a bit earlier, to the first cgroup mount. It's a clear indication that cgroup is being used on the system and task_cg_lists linking is highly likely to be enabled soon anyway through "tasks" and "cgroup.procs" files. This allows cgroup_task_migrate() to always link @tsk->cg_list. Note that it may still race with cgroup_post_fork() but who wins that race is inconsequential. While at it, make use_task_css_set_links a bool, add sanity checks in cgroup_enable_task_cg_lists() and css_task_iter_start(), and update the former so that it's guaranteed and assumes to run only once. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:38 -05:00
Tejun Heo	3558557305	cgroup: drop CGRP_ROOT_SUBSYS_BOUND Before kernfs conversion, due to the way super_block lookup works, cgroup roots were created and made visible before being fully initialized. This in turn required a special flag to mark that the root hasn't been fully initialized so that the destruction path can tell fully bound ones from half initialized. That flag is CGRP_ROOT_SUBSYS_BOUND and no longer necessary after the kernfs conversion as the lookup and creation of new root are atomic w.r.t. cgroup_mutex. This patch removes the flag and passes the requests subsystem mask to cgroup_setup_root() so that it can set the respective mask bits as subsystems are bound. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:38 -05:00
Tejun Heo	d3ba07c3aa	cgroup: disallow xattr, release_agent and name if sane_behavior Disallow more mount options if sane_behavior. Note that xattr used to generate warning. While at it, simplify option check in cgroup_mount() and update sane_behavior comment in cgroup.h. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:38 -05:00
Tejun Heo	1a11533fbd	Revert "cgroup: use an ordered workqueue for cgroup destruction" This reverts commit `ab3f5faa62`. Explanation from Hugh: It's because more thorough testing, by others here, found that it wasn't always solving the problem: so I asked Tejun privately to hold off from sending it in, until we'd worked out why not. Most of our testing being on a v3,11-based kernel, it was perfectly possible that the problem was merely our own e.g. missing Tejun's `8a2b753844` ("workqueue: fix ordered workqueues in NUMA setups"). But that turned out not to be enough to fix it either. Then Filipe pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched() before we ever get to put the offline on to the workqueue: by the time we get to the workqueue, the ordering has already been lost. So, thanks for the Acks, but I'm afraid that this ordered workqueue solution is just not good enough: we should simply forget that patch and provide a different answer." Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com>	2014-02-12 19:08:28 -05:00
Tejun Heo	776f02fa4e	cgroup: remove cgroupfs_root->refcnt Currently, cgroupfs_root and its ->top_cgroup are separated reference counted and the latter's is ignored. There's no reason to do this separately. This patch removes cgroupfs_root->refcnt and destroys cgroupfs_root when the top_cgroup is released. * cgroup_put() updated to ignore cgroup_is_dead() test for top cgroups. cgroup_free_fn() updated to handle root destruction when releasing a top cgroup. * As root destruction is now bounced through cgroup destruction, it is asynchronous. Update cgroup_mount() so that it waits for pending release which is currently implemented using msleep(). Converting this to proper wait_queue isn't hard but likely unnecessary. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	3c9c825b8b	cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t root->number_of_cgroups is currently an integer protected with cgroup_mutex. Except for sanity checks and proc reporting, the only place it's used is to check whether the root has any child during remount; however, this is a bit flawed as the counter is not decremented when the cgroup is unlinked but when it's released, meaning that there could be an extended period where all cgroups are removed but remount is still not allowed because some internal objects are lingering. While not perfect either, it'd be better to use emptiness test on root->top_cgroup.children. This patch updates cgroup_remount() to test top_cgroup's children instead, which makes number_of_cgroups only actual usage statistics printing in proc implemented in proc_cgroupstats_show(). Let's shorten its name and make it an atomic_t so that we don't have to worry about its synchronization. It's purely auxiliary at this point. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	e61734c55c	cgroup: remove cgroup->name cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	6f30558f37	cgroup: make cgroup hold onto its kernfs_node cgroup currently releases its kernfs_node when it gets removed. While not buggy, this makes cgroup->kn access rules complicated than necessary and leads to things like get/put protection around kernfs_remove() in cgroup_destroy_locked(). In addition, we want to use kernfs_name/path() and friends but also want to be able to determine a cgroup's name between removal and release. This patch makes cgroup hold onto its kernfs_node until freed so that cgroup->kn is always accessible. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	21a2d3430b	cgroup: simplify dynamic cftype addition and removal Dynamic cftype addition and removal using cgroup_add/rm_cftypes() respectively has been quite hairy due to vfs i_mutex. As i_mutex nests outside cgroup_mutex, cgroup_mutex has to be released and regrabbed on each iteration through the hierarchy complicating the process. Now that i_mutex is no longer in play, it can be simplified. * Just holding cgroup_tree_mutex is enough. No need to meddle with cgroup_mutex. * No reason to play the unlock - relock - check serial_nr dancing. Everything can be atomically while holding cgroup_tree_mutex. * cgroup_cfts_prepare() is replaced with direct locking of cgroup_tree_mutex. * cgroup_cfts_commit() no longer fiddles with locking. It just applies the cftypes change to the existing cgroups in the hierarchy. Renamed to cgroup_cfts_apply(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:49 -05:00
Tejun Heo	0adb070426	cgroup: remove cftype_set cftype_set was added primarily to allow registering the same cftype array more than once for different subsystems. Nobody uses or needs such thing and it's already broken because each cftype has ->ss pointer which is initialized during registration. Let's add list_head ->node to cftype and use the first cftype entry in the array to link them instead of allocating separate cftype_set. While at it, trigger WARN if cft seems previously initialized during registration. This simplifies cftype handling a bit. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:48 -05:00
Tejun Heo	80b1358699	cgroup: relocate cgroup_rm_cftypes() cftype handling is about to be revamped. Relocate cgroup_rm_cftypes() above cgroup_add_cftypes() in preparation. This is pure relocation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:48 -05:00
Tejun Heo	86bf4b6875	cgroup: warn if "xattr" is specified with "sane_behavior" Mount option "xattr" is no longer necessary as it's enabled by default on kernfs. Warn if "xattr" is specified with "sane_behavior" so that the option can be removed in the future. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:48 -05:00
Tejun Heo	2bd59d48eb	cgroup: convert to kernfs cgroup filesystem code was derived from the original sysfs implementation which was heavily intertwined with vfs objects and locking with the goal of re-using the existing vfs infrastructure. That experiment turned out rather disastrous and sysfs switched, a long time ago, to distributed filesystem model where a separate representation is maintained which is queried by vfs. Unfortunately, cgroup stuck with the failed experiment all these years and accumulated even more problems over time. Locking and object lifetime management being entangled with vfs is probably the most egregious. vfs is never designed to be misused like this and cgroup ends up jumping through various convoluted dancing to make things work. Even then, operations across multiple cgroups can't be done safely as it'll deadlock with rename locking. Recently, kernfs is separated out from sysfs so that it can be used by users other than sysfs. This patch converts cgroup to use kernfs, which will bring the following benefits. * Separation from vfs internals. Locking and object lifetime management is contained in cgroup proper making things a lot simpler. This removes significant amount of locking convolutions, hairy object lifetime rules and the restriction on multi-cgroup operations. * Can drop a lot of code to implement filesystem interface as most are provided by kernfs. * Proper "severing" semantics, which allows controllers to not worry about lingering file accesses after offline. While the preceding patches did as much as possible to make the transition less painful, large part of the conversion has to be one discrete step making this patch rather large. The rest of the commit message lists notable changes in different areas. Overall ------- * vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn, cgroupfs_root->sb w/ ->kf_root. * All dentry accessors are removed. Helpers to map from kernfs constructs are added. * All vfs plumbing around dentry, inode and bdi removed. * cgroup_mount() now directly looks for matching root and then proceeds to create a new one if not found. Synchronization and object lifetime ----------------------------------- * vfs inode locking removed. Among other things, this removes the need for the convolution in cgroup_cfts_commit(). Future patches will further simplify it. * vfs refcnting replaced with cgroup internal ones. cgroup->refcnt, cgroupfs_root->refcnt added. cgroup_put_root() now directly puts root->refcnt and when it reaches zero proceeds to destroy it thus merging cgroup_put_root() and the former cgroup_kill_sb(). Simliarly, cgroup_put() now directly schedules cgroup_free_rcu() when refcnt reaches zero. * Unlike before, kernfs objects don't hold onto cgroup objects. When cgroup destroys a kernfs node, all existing operations are drained and the association is broken immediately. The same for cgroupfs_roots and mounts. * All operations which come through kernfs guarantee that the associated cgroup is and stays valid for the duration of operation; however, there are two paths which need to find out the associated cgroup from dentry without going through kernfs - css_tryget_from_dir() and cgroupstats_build(). For these two, kernfs_node->priv is RCU managed so that they can dereference it under RCU read lock. File and directory handling --------------------------- * File and directory operations converted to kernfs_ops and kernfs_syscall_ops. * xattrs is implicitly supported by kernfs. No need to worry about it from cgroup. This means that "xattr" mount option is no longer necessary. A future patch will add a deprecated warning message when sane_behavior. * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a private copy of one of the kernfs_ops to set its atomic_write_len. cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated to handle it. * cftype->lockdep_key added so that kernfs lockdep annotation can be per cftype. * Inidividual file entries and open states are now managed by kernfs. No need to worry about them from cgroup. cfent, cgroup_open_file and their friends are removed. * kernfs_nodes are created deactivated and kernfs_activate() invocations added to places where creation of new nodes are committed. * cgroup_rmdir() uses kernfs_[un]break_active_protection() for self-removal. v2: - Li pointed out in an earlier patch that specifying "name=" during mount without subsystem specification should succeed if there's an existing hierarchy with a matching name although it should fail with -EINVAL if a new hierarchy should be created. Prior to the conversion, this used by handled by deferring failure from NULL return from cgroup_root_from_opts(), which was necessary because root was being created before checking for existing ones. Note that cgroup_root_from_opts() returned an ERR_PTR() value for error conditions which require immediate mount failure. As we now have separate search and creation steps, deferring failure from cgroup_root_from_opts() is no longer necessary. cgroup_root_from_opts() is updated to always return ERR_PTR() value on failure. - The logic to match existing roots is updated so that a mount attempt with a matching name but different subsys_mask are rejected. This was handled by a separate matching loop under the comment "Check for name clashes with existing mounts" but got lost during conversion. Merge the check into the main search loop. - Add __rcu __force casting in RCU_INIT_POINTER() in cgroup_destroy_locked() to avoid the sparse address space warning reported by kbuild test bot. Maybe we want an explicit interface to use kn->priv as RCU protected pointer? v3: Make CONFIG_CGROUPS select CONFIG_KERNFS. v4: Rebased on top of `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: kbuild test robot fengguang.wu@intel.com>	2014-02-11 11:52:49 -05:00
Tejun Heo	f2e85d574e	cgroup: relocate functions in preparation of kernfs conversion Relocate cgroup_init/exit_root_id(), cgroup_free_root(), cgroup_kill_sb() and cgroup_file_name() in preparation of kernfs conversion. These are pure relocations to make kernfs conversion easier to follow. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:49 -05:00
Tejun Heo	59f5296b51	cgroup: misc preps for kernfs conversion * Un-inline seq_css(). After kernfs conversion, the function will need to dereference internal data structures. * Add cgroup_get/put_root() and replace direct super_block->s_active manipulatinos with them. These will be converted to kernfs_root refcnting. * Add cgroup_get/put() and replace dget/put() on cgrp->dentry with them. These will be converted to kernfs refcnting. * Update current_css_set_cg_links_read() to use cgroup_name() instead of reaching into the dentry name. The end result is the same. These changes don't make functional differences but will make transition to kernfs easier. v2: Rebased on top of `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:49 -05:00
Tejun Heo	b166492406	cgroup: introduce cgroup_ino() mm/memory-failure.c::hwpoison_filter_task() has been reaching into cgroup to extract the associated ino to be used as a filtering criterion. This is an implementation detail which shouldn't be depended upon from outside cgroup proper and is about to change with the scheduled kernfs conversion. This patch introduces a proper interface to determine the associated ino, cgroup_ino(), and updates hwpoison_filter_task() to use it instead of reaching directly into cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wu Fengguang <fengguang.wu@intel.com>	2014-02-11 11:52:49 -05:00
Tejun Heo	2da440a26c	cgroup: introduce cgroup_init/exit_cftypes() Factor out cft->ss initialization into cgroup_init_cftypes() from cgroup_add_cftypes() and add cft->ss clearing to cgroup_rm_cftypes() through cgroup_exit_cftypes(). This doesn't make any meaningful difference now but the two new functions will be expanded during kernfs transition. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	5f46990787	cgroup: update the meaning of cftype->max_write_len cftype->max_write_len is used to extend the maximum size of writes. It's interpreted in such a way that the actual maximum size is one less than the specified value. The default size is defined by CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its value is decremented by 1 and then compared for equality with max size, which means that the actual default size is CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars. There's no point in having a limit that low. Update its definition so that it means the actual string length sans termination and anything below PAGE_SIZE-1 is treated as PAGE_SIZE-1. .max_write_len for "release_agent" is updated to PATH_MAX-1 and cgroup_release_agent_write() is updated so that the redundant strlen() check is removed and it uses strlcpy() instead of strcpy(). .max_write_len initializations in blk-throttle.c and cfq-iosched.c are no longer necessary and removed. The one in cpuset is kept unchanged as it's an approximated value to begin with. This will also make transition to kernfs smoother. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	de00ffa56e	cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes() Currently, cgroup_subsys->base_cftypes registration is different from dynamic cftypes registartion. Instead of going through cgroup_add_cftypes(), cgroup_init_subsys() invokes cgroup_init_cftsets() which makes use of cgroup_subsys->base_cftset which doesn't involve dynamic allocation. While avoiding dynamic allocation is somewhat nice, having two separate paths for cftypes registration is nasty, especially as we're planning to add more operations during cftypes registration. This patch drops cgroup_init_cftsets() and cgroup_subsys->base_cftset and registers base_cftypes using cgroup_add_cftypes(). This is done as a separate step in cgroup_init() instead of a part of cgroup_init_subsys(). This is because cgroup_init_subsys() can be called very early during boot when kmalloc() isn't available yet. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	8d7e6fb0a1	cgroup: update cgroup name handling Straightforward updates to cgroup name handling in preparation of kernfs conversion. * cgroup_alloc_name() is updated to take const char * isntead of dentry * for name source. * cgroup name formatting is separated out into cgroup_file_name(). While at it, buffer length protection is added. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	d427dfeb12	cgroup: factor out cgroup_setup_root() from cgroup_mount() Factor out new root initialization into cgroup_setup_root() from cgroup_mount(). This makes it easier to follow and will ease kernfs conversion. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00

1 2 3 4 5 ...

606 commits