redonkable/alistair23-linux

Author	SHA1	Message	Date
Nikolay Borisov	118c701e20	btrfs: remove __BTRFS_LEAF_DATA_SIZE __BTRFS_LAF_DATA_SIZE is used only by BTRFS_LEAF_DATA_SIZE. Make the latter subsume the former. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:01 +02:00
Nikolay Borisov	3d9ec8c49a	btrfs: rename btrfs_leaf_data to BTRFS_LEAF_DATA_OFFSET Commit `5f39d397df` ("Btrfs: Create extent_buffer interface for large blocksizes") refactored btrfs_leaf_data function to take extent_buffer rather than struct btrfs_leaf. However, as it turns out the parameter being passed is never used. Furthermore this function no longer returns the leaf data but rather the offset to it. So rename the function to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_* helpers and turn it into a macro. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ removed () from the macro ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Anand Jain	e1ddce71d6	btrfs: reduce arguments for decompress_bio ops struct compressed_bio pointer can be used instead. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Anand Jain	8140dc30a4	btrfs: btrfs_decompress_bio() could accept compressed_bio instead Instead of sending each argument of struct compressed_bio, send the compressed_bio itself. Also by having struct compressed_bio in btrfs_decompress_bio() it would help tracing. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Nikolay Borisov	d2006e6d28	btrfs: Refactor update_space_info Following the factoring out of the creation code udpate_space_info can only be called for already-existing space_info structs. As such it cannot fail. Remove superfluous error handling and make the function return void. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Nikolay Borisov	2be12ef79f	btrfs: Separate space_info create/update Currently the struct space_info creation code is intermixed in the udpate_space_info function. There are well-defined points at which the we actually want to create brand-new space_info structs (e.g. during mount of the filesystem as well as sometimes when adding/initialising new chunks). In such cases update_space_info is called with 0 as the bytes parameter. All of this makes for spaghetti code. Fix it by factoring out the creation code in a separate create_space_info structure. This also allows to simplify the internals. Also remove BUG_ON from do_alloc_chunk since the callers handle errors. Furthermore it will make the update_space_info function not fail, allowing us to remove error handling in callers. This will come in a follow up patch. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Liu Bo	555ba411aa	Btrfs: let btrfs_print_leaf print more about block group This adds chunk_objectid and flags, with flags we can recognize whether the block group is about data or metadata. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Liu Bo	28785f70ef	Btrfs: skip commit transaction if we don't have enough pinned bytes We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space. This changes the code to the above logic. Fixes: `b150a4f10d` ("Btrfs: use a percpu to keep track of possibly pinned bytes") Tested-by: Nikolay Borisov <nborisov@suse.com> Reported-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
David Sterba	4e2814ef04	btrfs: scrub: simplify cleanup of wr_ctx in scrub_free_ctx We don't need to take the mutex and zero out wr_cur_bio, as this is called after the scrub finished. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
David Sterba	e241ddeb9c	btrfs: scrub: inline helper scrub_free_wr_ctx The helper scrub_free_wr_ctx is used only once and fits into scrub_free_ctx as it continues sctx shutdown, no need to keep it separate. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
David Sterba	8fcdac3f20	btrfs: scrub: inline helper scrub_setup_wr_ctx The helper scrub_setup_wr_ctx is used only once and fits into scrub_setup_ctx as it continues intialization, no need to keep it separate. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Jeff Mahoney	c1c4919b11	btrfs: remove root usage from can_overcommit can_overcommit using the root to determine the allocation profile is the only use of a root in the call graph below reserve_metadata_bytes. It turns out that we only need to know whether the allocation is for the chunk root or not -- and we can pass that around as a bool instead. This allows us to pull root usage out of the reservation path all the way up to reserve_metadata_bytes itself, which uses it only to compare against fs_info->chunk_root to set the bool. In turn, this eliminates a bunch of races where we use a particular root too early in the mount process. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Jeff Mahoney	1b86826d12	btrfs: cleanup root usage by btrfs_get_alloc_profile There are two places where we don't already know what kind of alloc profile we need before calling btrfs_get_alloc_profile, but we need access to a root everywhere we call it. This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile() and relegates btrfs_system_alloc_profile to a static for use in those two cases. The next patch will eliminate one of those. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
David Sterba	e03733da5a	btrfs: fix bool type in btrfs_page_exists_in_range We use only a simple bool indicator, int is not a problem here. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
David Sterba	c9fed2bb61	btrfs: remove unused member list from btrfs_end_io_wq The end io work queue items have been tracked by the work queues since "Btrfs: Add async worker threads for pre and post IO checksumming" (`8b71284292`) (2008). Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
David Sterba	ee4ea69852	btrfs: remove unused members dir_path from recorded_ref The two members do not seem to be used since the initial commit. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
David Sterba	b297c9f68f	btrfs: remove unused member list from async_submit_bio The list used to track checksums in the early version (2.6.29), but I was able not pinpoint the commit that stopped using it. Everything apparently works without it for a long time. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
David Sterba	106204f191	btrfs: remove unused member err from reada_extent Seems to be unused since the initial commit, we ignore readahead errors anyway, the full read will handle that if necessary. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Sahil Kang	0bef71093d	btrfs: Remove unnecessary branching in free-space-tree.c Both btrfs_create_free_space_tree and btrfs_clear_free_space_tree contain: if (ret) return ret; return 0; The if statement is only false when ret equals zero, and since we return zero in such cases, we can safely remove the branching. Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Liu Bo	e477094f0d	Btrfs: hardcode GFP_NOFS for btrfs_bio_clone_partial We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Arnd Bergmann	3c91ee6964	Btrfs: work around maybe-uninitialized warning A rewrite of btrfs_submit_direct_hook appears to have introduced a warning: fs/btrfs/inode.c: In function 'btrfs_submit_direct_hook': fs/btrfs/inode.c:8467:14: error: 'bio' may be used uninitialized in this function [-Werror=maybe-uninitialized] Where the 'bio' variable was previously initialized unconditionally, it is now set in the "while (submit_len > 0)" loop that would never execute if submit_len is zero. Assuming this cannot happen in practice, we can avoid the warning by simply replacing the while{} loop with a do{}while() loop so the compiler knows that it will always be entered at least once. Fixes changes introduced in "Btrfs: use bio_clone_bioset_partial to simplify DIO submit". Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Liu Bo	3892ac9086	Btrfs: unify naming of btrfs_io_bio All dio endio functions are using io_bio for struct btrfs_io_bio, this makes btrfs_submit_direct to follow this convention. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Liu Bo	11b5616516	Btrfs: check-integrity use bvec_iter Some check-integrity code depends on bio->bi_vcnt, this changes it to use bio segments because some bios passing here may not have a reliable bi_vcnt. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Liu Bo	629ebf4fad	Btrfs: record error if one block has failed to retry In the nocsum case of dio read endio, it returns immediately if an error gets returned when repairing, which leaves the rest blocks unrepaired. The behavior is different from how buffered read endio works in the same case. This changes it to record error only and go on repairing the rest blocks. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Liu Bo	17347cec15	Btrfs: change how we iterate bios in endio Since dio submit has used bio_clone_fast, the submitted bio may not have a reliable bi_vcnt, for the bio vector iterations in checksum related functions, bio->bi_iter is not modified yet and it's safe to use bio_for_each_segment, while for those bio vector iterations in dio read's endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning bios and use the helper __bio_for_each_segment with the saved bvec_iter to access each bvec. Also for dio reads which don't get split, we also need to save a copy of bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access each bvec in dio read's endio. Note that it doesn't affect other calls of btrfs_bio_clone() because they don't need to use this iterator. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Liu Bo	725130bac5	Btrfs: use bio_clone_bioset_partial to simplify DIO submit Currently when mapping bio to limit bio to a single stripe length, we split bio by adding page to bio one by one, but later we don't modify the vector of bio at all, thus we can use bio_clone_fast to use the original bio vector directly. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Liu Bo	2f8e914042	Btrfs: new helper btrfs_bio_clone_partial This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned bio that only owns a part of the original bio's data. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Liu Bo	015c1bd9f1	Btrfs: use bio_clone_fast to clone our bio For raid1 and raid10, we clone the original bio to the bios which are then sent to different disks. Right now we use bio_clone_bioset to create a clone bio with iterating bi_io_vec to initialize it. This changes it to use bio_clone_fast() which creates a clone bio but only copies the bi_io_vec pointer instead of iterating bi_io_vec. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Josef Bacik	7870d0822b	Btrfs: don't pass the inode through clean_io_failure Instead pass around the failure tree and the io tree. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Josef Bacik	6ec656bc0f	btrfs: remove inode argument from repair_io_failure Once we remove the btree_inode we won't have an inode to pass anymore, just pass the fs_info directly and the inum since we use that to print out the repair message. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Josef Bacik	c6100a4b4e	Btrfs: replace tree->mapping with tree->private_data For extent_io tree's we have carried the address_mapping of the inode around in the io tree in order to pull the inode back out for calling into various tree ops hooks. This works fine when everything that has an extent_io_tree has an inode. But we are going to remove the btree_inode, so we need to change this. Instead just have a generic void * for private data that we can initialize with, and have all the tree ops use that instead. This had a lot of cascading changes but should be relatively straightforward. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor reordering of the callback prototypes ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Sargun Dhillon	2723480a0f	btrfs: Add quota_override knob into sysfs This patch adds the read-write attribute quota_override into sysfs. Any process which has CAP_SYS_RESOURCE can set this flag to on, and once it is set to true, processes with CAP_SYS_RESOURCE can exceed the quota. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: David Sterba <dsterba@suse.com> [ minor changelog edits ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Sargun Dhillon	f29efe2921	btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE This patch introduces the quota override flag to btrfs_fs_info, and a change to quota limit checking code to temporarily allow for quota to be overridden for processes with CAP_SYS_RESOURCE. It's useful for administrative programs, such as log rotation, that may need to temporarily use more disk space in order to free up a greater amount of overall disk space without yielding more disk space to the rest of userland. Eventually, we may want to add the idea of an operator-specific quota, operator reserved space, or something else to allow for administrative override, but this is perhaps the simplest solution. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: David Sterba <dsterba@suse.com> [ minor changelog edits ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Nikolay Borisov	a5ed45f822	btrfs: Convert fs_info->free_chunk_space to atomic64_t The ->free_chunk_space variable is used to track the unallocated space and access to it is protected by a spinlock, which is not used for anything else. Make the code a bit self-explanatory by switching the variable to an atomic64_t type and kill the spinlock. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ not a performance critical code, use of atomic type is ok ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Anand Jain	401b41e5a8	btrfs: add framework to handle device flush error as a volume This adds comments to the flush error handling part of the code, and hopes to maintain the same logic with a framework which can be used to handle the errors at the volume level. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Daichou	6b349dfe80	Btrfs: remove obsolete FIXMEs in qgroup ioctls These FIXMEs were already addressed in 2013. All functions check for qgroup existence: * btrfs_add_qgroup_relation * btrfs_ioctl_qgroup_create * btrfs_limit_qgroup * btrfs_del_qgroup_relation Signed-off-by: Daichou <tommy0705c@gmail.com> [ enhance and reformat changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Dan Carpenter	97d038562a	Btrfs: remove an unused variable "item" is never used. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:57 +02:00
Fabian Frederick	977ec79271	btrfs: kmap() can't fail Remove NULL test on kmap() as it will always return a valid pointer. Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:57 +02:00
Linus Torvalds	54ed0f71f0	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "This fixes a bug on sparc where we may dereference freed stack memory" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: Work around deallocated stack frame reference gcc bug on sparc.	2017-06-15 17:54:51 +09:00
Linus Torvalds	66cea28a94	Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Some fixes that Dave Sterba collected. We've been hitting an early enospc problem on production machines that Omar tracked down to an old int->u64 mistake. I waited a bit on this pull to make sure it was really the problem from production, but it's on ~2100 hosts now and I think we're good. Omar also noticed a commit in the queue would make new early ENOSPC problems. I pulled that out for now, which is why the top three commits are younger than the rest. Otherwise these are all fixes, some explaining very old bugs that we've been poking at for a while" * 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix delalloc accounting leak caused by u32 overflow Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io btrfs: tree-log.c: Wrong printk information about namelen btrfs: fix race with relocation recovery and fs_root setup btrfs: fix memory leak in update_space_info failure path btrfs: use correct types for page indices in btrfs_page_exists_in_range btrfs: fix incorrect error return ret being passed to mapping_set_error btrfs: Make flush bios explicitely sync btrfs: fiemap: Cache and merge fiemap extent before submit it to user	2017-06-10 11:06:05 -07:00
Omar Sandoval	70e7af244f	Btrfs: fix delalloc accounting leak caused by u32 overflow btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication, which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2). For a nodesize of 16kB, this overflow happens at 16k items. Usually, num_items is a small constant passed to btrfs_start_transaction(), but we also use btrfs_calc_trans_metadata_size() for metadata reservations for extent items in btrfs_delalloc_{reserve,release}_metadata(). In drop_outstanding_extents(), num_items is calculated as inode->reserved_extents - inode->outstanding_extents. The difference between these two counters is usually small, but if many delalloc extents are reserved and then the outstanding extents are merged in btrfs_merge_extent_hook(), the difference can become large enough to overflow in btrfs_calc_trans_metadata_size(). The overflow manifests itself as a leak of a multiple of 4 GB in delalloc_block_rsv and the metadata bytes_may_use counter. This in turn can cause early ENOSPC errors. Additionally, these WARN_ONs in extent-tree.c will be hit when unmounting: WARN_ON(fs_info->delalloc_block_rsv.size > 0); WARN_ON(fs_info->delalloc_block_rsv.reserved > 0); WARN_ON(space_info->bytes_pinned > 0 \|\| space_info->bytes_reserved > 0 \|\| space_info->bytes_may_use > 0); Fix it by casting nodesize to a u64 so that btrfs_calc_trans_metadata_size() does a full 64-bit multiplication. While we're here, do the same in btrfs_calc_trunc_metadata_size(); this can't overflow with any existing uses, but it's better to be safe here than have another hard-to-debug problem later on. Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2017-06-09 12:48:36 -07:00
Liu Bo	452e62b71f	Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io Before this, we use 'filled' mode here, ie. if all range has been filled with EXTENT_DEFRAG bits, get to clear it, but if the defrag range joins the adjacent delalloc range, then we'll have EXTENT_DEFRAG bits in extent_state until releasing this inode's pages, and that prevents extent_data from being freed. This clears the bit if any was found within the ordered extent. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2017-06-09 12:48:29 -07:00
Su Yue	286b92f43c	btrfs: tree-log.c: Wrong printk information about namelen In verify_dir_item, it wants to printk name_len of dir_item but printk data_len acutally. Fix it by calling btrfs_dir_name_len instead of btrfs_dir_data_len. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2017-06-09 12:48:07 -07:00
David Miller	d41519a69b	crypto: Work around deallocated stack frame reference gcc bug on sparc. On sparc, if we have an alloca() like situation, as is the case with SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack memory. The result can be that the value is clobbered if a trap or interrupt arrives at just the right instruction. It only occurs if the function ends returning a value from that alloca() area and that value can be placed into the return value register using a single instruction. For example, in lib/libcrc32c.c:crc32c() we end up with a return sequence like: return %i7+8 lduw [%o5+16], %o0 ! MEM[(u32 *)__shash_desc.1_10 + 16B], %o5 holds the base of the on-stack area allocated for the shash descriptor. But the return released the stack frame and the register window. So if an intererupt arrives between 'return' and 'lduw', then the value read at %o5+16 can be corrupted. Add a data compiler barrier to work around this problem. This is exactly what the gcc fix will end up doing as well, and it absolutely should not change the code generated for other cpus (unless gcc on them has the same bug :-) With crucial insight from Eric Sandeen. Cc: <stable@vger.kernel.org> Reported-by: Anatoly Pugachev <matorola@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2017-06-08 17:36:03 +08:00
Jeff Mahoney	a9b3311ef3	btrfs: fix race with relocation recovery and fs_root setup If we have to recover relocation during mount, we'll ultimately have to evict the orphan inode. That goes through the reservation dance, where priority_reclaim_metadata_space and flush_space expect fs_info->fs_root to be valid. That's the next thing to be set up during mount, so we crash, almost always in flush_space trying to join the transaction but priority_reclaim_metadata_space is possible as well. This call path has been problematic in the past WRT whether ->fs_root is valid yet. Commit `957780eb27` (Btrfs: introduce ticketed enospc infrastructure) added new users that are called in the direct path instead of the async path that had already been worked around. The thing is that we don't actually need the fs_root, specifically, for anything. We either use it to determine whether the root is the chunk_root for use in choosing an allocation profile or as a root to pass btrfs_join_transaction before immediately committing it. Anything that isn't the chunk root works in the former case and any root works in the latter. A simple fix is to use a root we know will always be there: the extent_root. Cc: <stable@vger.kernel.org> # v4.8+ Fixes: `957780eb27` (Btrfs: introduce ticketed enospc infrastructure) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-01 16:56:55 +02:00
Jeff Mahoney	896533a7da	btrfs: fix memory leak in update_space_info failure path If we fail to add the space_info kobject, we'll leak the memory for the percpu counter. Fixes: `6ab0a2029c` (btrfs: publish allocation data in sysfs) Cc: <stable@vger.kernel.org> # v3.14+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-01 16:56:31 +02:00
David Sterba	cc2b702c52	btrfs: use correct types for page indices in btrfs_page_exists_in_range Variables start_idx and end_idx are supposed to hold a page index derived from the file offsets. The int type is not the right one though, offsets larger than 1 << 44 will get silently trimmed off the high bits. (1 << 44 is 16TiB) What can go wrong, if start is below the boundary and end gets trimmed: - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot) - the final check "if (page->index <= end_idx)" will unexpectedly fail The function will return false, ie. "there's no page in the range", although there is at least one. btrfs_page_exists_in_range is used to prevent races in: * in hole punching, where we make sure there are not pages in the truncated range, otherwise we'll wait for them to finish and redo truncation, but we're going to replace the pages with holes anyway so the only problem is the intermediate state * lock_extent_direct: we want to make sure there are no pages before we lock and start DIO, to prevent stale data reads For practical occurence of the bug, there are several constaints. The file must be quite large, the affected range must cross the 16TiB boundary and the internal state of the file pages and pending operations must match. Also, we must not have started any ordered data in the range, otherwise we don't even reach the buggy function check. DIO locking tries hard in several places to avoid deadlocks with buffered IO and avoids waiting for ranges. The worst consequence seems to be stale data read. CC: Liu Bo <bo.li.liu@oracle.com> CC: stable@vger.kernel.org # 3.16+ Fixes: `fc4adbff82` ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking") Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-01 16:56:17 +02:00
Colin Ian King	bff5baf8aa	btrfs: fix incorrect error return ret being passed to mapping_set_error The setting of return code ret should be based on the error code passed into function end_extent_writepage and not on ret. Thanks to Liu Bo for spotting this mistake in the original fix I submitted. Detected by CoverityScan, CID#1414312 ("Logically dead code") Fixes: `5dca6eea91` ("Btrfs: mark mapping with error flag to report errors to userspace") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-05-16 15:42:10 +02:00
Jan Kara	8d91012528	btrfs: Make flush bios explicitely sync Commit `b685d3d65a` "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_{FUA\|PREFLUSH\|...} definitions. generic_make_request_checks() however strips REQ_FUA and REQ_PREFLUSH flags from a bio when the storage doesn't report volatile write cache and thus write effectively becomes asynchronous which can lead to performance regressions Fix the problem by making sure all bios which are synchronous are properly marked with REQ_SYNC. CC: David Sterba <dsterba@suse.com> CC: linux-btrfs@vger.kernel.org Fixes: `b685d3d65a` Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-05-16 15:42:01 +02:00
Qu Wenruo	4751832da9	btrfs: fiemap: Cache and merge fiemap extent before submit it to user [BUG] Cycle mount btrfs can cause fiemap to return different result. Like: # mount /dev/vdb5 /mnt/btrfs # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file # xfs_io -c "fiemap -v" /mnt/btrfs/file /mnt/test/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 25088..25215 128 0x1 # umount /mnt/btrfs # mount /dev/vdb5 /mnt/btrfs # xfs_io -c "fiemap -v" /mnt/btrfs/file /mnt/test/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..31]: 25088..25119 32 0x0 1: [32..63]: 25120..25151 32 0x0 2: [64..95]: 25152..25183 32 0x0 3: [96..127]: 25184..25215 32 0x1 But after above fiemap, we get correct merged result if we call fiemap again. # xfs_io -c "fiemap -v" /mnt/btrfs/file /mnt/test/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 25088..25215 128 0x1 [REASON] Btrfs will try to merge extent map when inserting new extent map. btrfs_fiemap(start=0 len=(u64)-1) \|- extent_fiemap(start=0 len=(u64)-1) \|- get_extent_skip_holes(start=0 len=64k) \| \|- btrfs_get_extent_fiemap(start=0 len=64k) \| \|- btrfs_get_extent(start=0 len=64k) \| \| Found on-disk (ino, EXTENT_DATA, 0) \| \|- add_extent_mapping() \| \|- Return (em->start=0, len=16k) \| \|- fiemap_fill_next_extent(logic=0 phys=X len=16k) \| \|- get_extent_skip_holes(start=0 len=64k) \| \|- btrfs_get_extent_fiemap(start=0 len=64k) \| \|- btrfs_get_extent(start=16k len=48k) \| \| Found on-disk (ino, EXTENT_DATA, 16k) \| \|- add_extent_mapping() \| \| \|- try_merge_map() \| \| Merge with previous em start=0 len=16k \| \| resulting em start=0 len=32k \| \|- Return (em->start=0, len=32K) << Merged result \|- Stripe off the unrelated range (0~16K) of return em \|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K) ^^^ Causing split fiemap extent. And since in add_extent_mapping(), em is already merged, in next fiemap() call, we will get merged result. [FIX] Here we introduce a new structure, fiemap_cache, which records previous fiemap extent. And will always try to merge current fiemap_cache result before calling fiemap_fill_next_extent(). Only when we failed to merge current fiemap extent with cached one, we will call fiemap_fill_next_extent() to submit cached one. So by this method, we can merge all fiemap extents. It can also be done in fs/ioctl.c, however the problem is if fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap extent. So I choose to merge it in btrfs. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-05-16 15:41:53 +02:00
Linus Torvalds	1176032cb1	Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has fixes and cleanups Dave Sterba collected for the merge window. The biggest functional fixes are between btrfs raid5/6 and scrub, and raid5/6 and device replacement. Some of our pending qgroup fixes are included as well while I bash on the rest in testing. We also have the usual set of cleanups, including one that makes __btrfs_map_block() much more maintainable, and conversions from atomic_t to refcount_t" * 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits) btrfs: fix the gfp_mask for the reada_zones radix tree Btrfs: fix reported number of inode blocks Btrfs: send, fix file hole not being preserved due to inline extent Btrfs: fix extent map leak during fallocate error path Btrfs: fix incorrect space accounting after failure to insert inline extent Btrfs: fix invalid attempt to free reserved space on failure to cow range btrfs: Handle delalloc error correctly to avoid ordered extent hang btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error btrfs: check if the device is flush capable btrfs: delete unused member nobarriers btrfs: scrub: Fix RAID56 recovery race condition btrfs: scrub: Introduce full stripe lock for RAID56 btrfs: Use ktime_get_real_ts for root ctime Btrfs: handle only applicable errors returned by btrfs_get_extent btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option btrfs: use q which is already obtained from bdev_get_queue Btrfs: switch to div64_u64 if with a u64 divisor Btrfs: update scrub_parity to use u64 stripe_len Btrfs: enable repair during read for raid56 profile btrfs: use clear_page where appropriate ...	2017-05-10 08:33:17 -07:00
Michal Hocko	19809c2da2	mm, vmalloc: use __GFP_HIGHMEM implicitly __vmalloc* allows users to provide gfp flags for the underlying allocation. This API is quite popular $ git grep "=[[:space:]]__vmalloc\\|return[[:space:]]*__vmalloc" \| wc -l 77 The only problem is that many people are not aware that they really want to give __GFP_HIGHMEM along with other flags because there is really no reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages which are mapped to the kernel vmalloc space. About half of users don't use this flag, though. This signals that we make the API unnecessarily too complex. This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM are simplified and drop the flag. Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Cristopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Michal Hocko	752ade68cb	treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Chris Mason	9bcaaea741	btrfs: fix the gfp_mask for the reada_zones radix tree Commits `cc8385b59e` and `7ef70b4d99` added preallocation for the reada radix trees and also switched them over to GFP_KERNEL for the default gfp mask. Since we're doing radix tree insertions under spinlocks, we need to make sure the mask doesn't allow sleeping. This fix keeps the radix preallocation but switches back to the original gfp_mask. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2017-05-04 16:56:11 -07:00
Linus Torvalds	694752922b	Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ was initially a fork of CFQ, but subsequently changed to implement fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant to be used on desktop type single drives, providing good fairness. From Paolo. - Add Kyber IO scheduler. This is a full multiqueue aware scheduler, using a scalable token based algorithm that throttles IO based on live completion IO stats, similary to blk-wbt. From Omar. - A series from Jan, moving users to separately allocated backing devices. This continues the work of separating backing device life times, solving various problems with hot removal. - A series of updates for lightnvm, mostly from Javier. Includes a 'pblk' target that exposes an open channel SSD as a physical block device. - A series of fixes and improvements for nbd from Josef. - A series from Omar, removing queue sharing between devices on mostly legacy drivers. This helps us clean up other bits, if we know that a queue only has a single device backing. This has been overdue for more than a decade. - Fixes for the blk-stats, and improvements to unify the stats and user windows. This both improves blk-wbt, and enables other users to register a need to receive IO stats for a device. From Omar. - blk-throttle improvements from Shaohua. This provides a scalable framework for implementing scalable priotization - particularly for blk-mq, but applicable to any type of block device. The interface is marked experimental for now. - Bucketized IO stats for IO polling from Stephen Bates. This improves efficiency of polled workloads in the presence of mixed block size IO. - A few fixes for opal, from Scott. - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics. From a variety of folks, mostly Sagi and James Smart. - A series from Bart, improving our exposed info and capabilities from the blk-mq debugfs support. - A series from Christoph, cleaning up how handle WRITE_ZEROES. - A series from Christoph, cleaning up the block layer handling of how we track errors in a request. On top of being a nice cleanup, it also shrinks the size of struct request a bit. - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was never used by platforms, and the latter has outlived it's usefulness. - Various little bug fixes and cleanups from a wide variety of folks. * 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits) block: hide badblocks attribute by default blk-mq: unify hctx delay_work and run_work block: add kblock_mod_delayed_work_on() blk-mq: unify hctx delayed_run_work and run_work nbd: fix use after free on module unload MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler blk-mq-sched: alloate reserved tags out of normal pool mtip32xx: use runtime tag to initialize command header scsi: Implement blk_mq_ops.show_rq() blk-mq: Add blk_mq_ops.show_rq() blk-mq: Show operation, cmd_flags and rq_flags names blk-mq: Make blk_flags_show() callers append a newline character blk-mq: Move the "state" debugfs attribute one level down blk-mq: Unregister debugfs attributes earlier blk-mq: Only unregister hctxs for which registration succeeded blk-mq-debugfs: Rename functions for registering and unregistering the mq directory blk-mq: Let blk_mq_debugfs_register() look up the queue name blk-mq: Register <dev>/queue/mq after having registered <dev>/queue ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset ide-pm: always pass 0 error to __blk_end_request_all ..	2017-05-01 10:39:57 -07:00
Linus Torvalds	28b2013587	Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "We have one more fix for btrfs. This gets rid of a new WARN_ON from rc1 that ended up making more noise than we really want. The larger fix for the underflow got delayed a bit and it's better for now to put it under CONFIG_BTRFS_DEBUG" * 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: qgroup: move noisy underflow warning to debugging build	2017-04-28 10:13:17 -07:00
Chris Mason	bce19f9d23	Merge branch 'for-chris-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.12	2017-04-27 14:13:09 -07:00
Filipe Manana	a7e3b975a0	Btrfs: fix reported number of inode blocks Currently when there are buffered writes that were not yet flushed and they fall within allocated ranges of the file (that is, not in holes or beyond eof assuming there are no prealloc extents beyond eof), btrfs simply reports an incorrect number of used blocks through the stat(2) system call (or any of its variants), regardless of mount options or inode flags (compress, compress-force, nodatacow). This is because the number of blocks used that is reported is based on the current number of bytes in the vfs inode plus the number of dealloc bytes in the btrfs inode. The later covers bytes that both fall within allocated regions of the file and holes. Example scenarios where the number of reported blocks is wrong while the buffered writes are not flushed: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec) # The following should have reported 64K... $ du -h /mnt/sdc/foo1 128K /mnt/sdc/foo1 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo1 64K /mnt/sdc/foo1 $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 65536 64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec) # The following should have reported 128K... $ du -h /mnt/sdc/foo2 192K /mnt/sdc/foo2 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo2 128K /mnt/sdc/foo2 So the number of used file blocks is simply incorrect, unlike in other filesystems such as ext4 and xfs for example, but only while the buffered writes are not flushed. Fix this by tracking the number of delalloc bytes that fall within holes and beyond eof of a file, and use instead this new counter when reporting the number of used blocks for an inode. Another different problem that exists is that the delalloc bytes counter is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from the respective range in the inode's iotree) and the vfs inode's bytes counter is only incremented when writeback finishes (through insert_reserved_file_extent()). Therefore while writeback is ongoing we simply report a wrong number of blocks used by an inode if the write operation covers a range previously unallocated. While this change does not fix this problem, it does minimizes it a lot by shortening that time window, as the new dealloc bytes counter (new_delalloc_bytes) is only decremented when writeback finishes right before updating the vfs inode's bytes counter. Fully fixing this second problem is not trivial and will be addressed later by a different patch. Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-04-26 16:27:26 +01:00
Filipe Manana	e1cbfd7bf6	Btrfs: send, fix file hole not being preserved due to inline extent Normally we don't have inline extents followed by regular extents, but there's currently at least one harmless case where this happens. For example, when the page size is 4Kb and compression is enabled: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar In this case we get a compressed inline extent, representing 4Kb of data, followed by a hole extent and then a regular data extent. The inline extent was not expanded/converted to a regular extent exactly because it represents 4Kb of data. This does not cause any apparent problem (such as the issue solved by commit `e1699d2d7b` ("btrfs: add missing memset while reading compressed inline extents")) except trigger an unexpected case in the incremental send code path that makes us issue an operation to write a hole when it's not needed, resulting in more writes at the receiver and wasting space at the receiver. So teach the incremental send code to deal with this particular case. The issue can be currently triggered by running fstests btrfs/137 with compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-04-26 16:27:25 +01:00
Filipe Manana	be2d253cc9	Btrfs: fix extent map leak during fallocate error path If the call to btrfs_qgroup_reserve_data() failed, we were leaking an extent map structure. The failure can happen either due to an -ENOMEM condition or, when quotas are enabled, due to -EDQUOT for example. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>	2017-04-26 16:27:24 +01:00
Filipe Manana	1c81ba237b	Btrfs: fix incorrect space accounting after failure to insert inline extent When using compression, if we fail to insert an inline extent we incorrectly end up attempting to free the reserved data space twice, once through extent_clear_unlock_delalloc(), because we pass it the flag EXTENT_DO_ACCOUNTING, and once through a direct call to btrfs_free_reserved_data_space_noquota(). This results in a trace like the following: [ 834.576240] ------------[ cut here ]------------ [ 834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs] [ 834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2 [ 834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs] [ 834.596103] Call Trace: [ 834.596103] dump_stack+0x67/0x90 [ 834.596103] __warn+0xc2/0xdd [ 834.596103] warn_slowpath_null+0x1d/0x1f [ 834.596103] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.596103] compress_file_range.constprop.42+0x2fa/0x3fc [btrfs] [ 834.596103] ? submit_compressed_extents+0x3a7/0x3a7 [btrfs] [ 834.596103] async_cow_start+0x32/0x4d [btrfs] [ 834.596103] btrfs_scrubparity_helper+0x187/0x3e7 [btrfs] [ 834.596103] btrfs_delalloc_helper+0xe/0x10 [btrfs] [ 834.596103] process_one_work+0x273/0x4e4 [ 834.596103] worker_thread+0x1eb/0x2ca [ 834.596103] ? rescuer_thread+0x2b6/0x2b6 [ 834.596103] kthread+0x100/0x108 [ 834.596103] ? __list_del_entry+0x22/0x22 [ 834.596103] ret_from_fork+0x2e/0x40 [ 834.611656] ---[ end trace 719902fe6bdef08f ]--- So fix this by not calling directly btrfs_free_reserved_data_space_noquota() if an error happened. Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-04-26 16:27:23 +01:00
Filipe Manana	a315e68f6e	Btrfs: fix invalid attempt to free reserved space on failure to cow range When attempting to COW a file range (we are starting writeback and doing COW), if we manage to reserve an extent for the range we will write into but fail after reserving it and before creating the respective ordered extent, we end up in an error path where we attempt to decrement the data space's bytes_may_use counter after we already did it while reserving the extent, leading to a warning/trace like the following: [ 847.621524] ------------[ cut here ]------------ [ 847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg [ 847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2 [ 847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 847.648601] Call Trace: [ 847.648601] dump_stack+0x67/0x90 [ 847.648601] __warn+0xc2/0xdd [ 847.648601] warn_slowpath_null+0x1d/0x1f [ 847.648601] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 847.648601] btrfs_clear_bit_hook+0x140/0x258 [btrfs] [ 847.648601] clear_state_bit+0x87/0x128 [btrfs] [ 847.648601] __clear_extent_bit+0x222/0x2b7 [btrfs] [ 847.648601] clear_extent_bit+0x17/0x19 [btrfs] [ 847.648601] extent_clear_unlock_delalloc+0x3b/0x6b [btrfs] [ 847.648601] cow_file_range.isra.39+0x387/0x39a [btrfs] [ 847.648601] run_delalloc_nocow+0x4d7/0x70e [btrfs] [ 847.648601] ? arch_local_irq_save+0x9/0xc [ 847.648601] run_delalloc_range+0xa7/0x2b5 [btrfs] [ 847.648601] writepage_delalloc.isra.31+0xb9/0x15c [btrfs] [ 847.648601] __extent_writepage+0x249/0x2e8 [btrfs] [ 847.648601] extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs] [ 847.648601] ? arch_local_irq_save+0x9/0xc [ 847.648601] ? mark_lock+0x24/0x201 [ 847.648601] extent_writepages+0x4b/0x5c [btrfs] [ 847.648601] ? btrfs_writepage_start_hook+0xed/0xed [btrfs] [ 847.648601] btrfs_writepages+0x28/0x2a [btrfs] [ 847.648601] do_writepages+0x23/0x2c [ 847.648601] __filemap_fdatawrite_range+0x5a/0x61 [ 847.648601] filemap_fdatawrite_range+0x13/0x15 [ 847.648601] btrfs_fdatawrite_range+0x20/0x46 [btrfs] [ 847.648601] start_ordered_ops+0x19/0x23 [btrfs] [ 847.648601] btrfs_sync_file+0x136/0x42c [btrfs] [ 847.648601] vfs_fsync_range+0x8c/0x9e [ 847.648601] vfs_fsync+0x1c/0x1e [ 847.648601] do_fsync+0x31/0x4a [ 847.648601] SyS_fsync+0x10/0x14 [ 847.648601] entry_SYSCALL_64_fastpath+0x18/0xad [ 847.648601] RIP: 0033:0x7f5b05200800 [ 847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [ 847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800 [ 847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003 [ 847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f [ 847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046 [ 847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740 [ 847.648601] ? trace_hardirqs_off_caller+0x3f/0xaa [ 847.685787] ---[ end trace 2a4a3e15382508e8 ]--- So fix this by not attempting to decrement the data space info's bytes_may_use counter if we already reserved the extent and an error happened before creating the ordered extent. We are already correctly freeing the reserved extent if an error happens, so there's no additional measure needed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-04-26 16:27:22 +01:00
Qu Wenruo	524272607e	btrfs: Handle delalloc error correctly to avoid ordered extent hang [BUG] If run_delalloc_range() returns error and there is already some ordered extents created, btrfs will be hanged with the following backtrace: Call Trace: __schedule+0x2d4/0xae0 schedule+0x3d/0x90 btrfs_start_ordered_extent+0x160/0x200 [btrfs] ? wake_atomic_t_function+0x60/0x60 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] process_one_work+0x2af/0x720 ? process_one_work+0x22b/0x720 worker_thread+0x4b/0x4f0 kthread+0x10f/0x150 ? process_one_work+0x720/0x720 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 [CAUSE] \|<------------------ delalloc range --------------------------->\| \| OE 1 \| OE 2 \| ... \| OE n \| \|<>\| \|<---------- cleanup range --------->\| \|\| \_=> First page handled by end_extent_writepage() in __extent_writepage() The problem is caused by error handler of run_delalloc_range(), which doesn't handle any created ordered extents, leaving them waiting on btrfs_finish_ordered_io() to finish. However after run_delalloc_range() returns error, __extent_writepage() won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered except the first page, and btrfs_finish_ordered_io() won't be triggered for created ordered extents either. So OE 2~n will hang forever, and if OE 1 is larger than one page, it will also hang. [FIX] Introduce btrfs_cleanup_ordered_extents() function to cleanup created ordered extents and finish them manually. The function is based on existing btrfs_endio_direct_write_update_ordered() function, and modify it to act just like btrfs_writepage_endio_hook() but handles specified range other than one page. After fix, delalloc error will be handled like: \|<------------------ delalloc range --------------------------->\| \| OE 1 \| OE 2 \| ... \| OE n \| \|<>\|<-------- ----------->\|<------ old error handler --------->\| \|\| \|\| \|\| \_=> Cleaned up by cleanup_ordered_extents() \_=> First page handled by end_extent_writepage() in __extent_writepage() Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-04-26 16:27:21 +01:00
Qu Wenruo	4dbd80fb91	btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error [BUG] When btrfs_reloc_clone_csum() reports error, it can underflow metadata and leads to kernel assertion on outstanding extents in run_delalloc_nocow() and cow_file_range(). BTRFS info (device vdb5): relocating block group 12582912 flags data BTRFS info (device vdb5): found 1 extents assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 Currently, due to another bug blocking ordered extents, the bug is only reproducible under certain block group layout and using error injection. a) Create one data block group with one 4K extent in it. To avoid the bug that hangs btrfs due to ordered extent which never finishes b) Make btrfs_reloc_clone_csum() always fail c) Relocate that block group [CAUSE] run_delalloc_nocow() and cow_file_range() handles error from btrfs_reloc_clone_csum() wrongly: (The ascii chart shows a more generic case of this bug other than the bug mentioned above) \|<------------------ delalloc range --------------------------->\| \| OE 1 \| OE 2 \| ... \| OE n \| \|<----------- cleanup range --------------->\| \|<----------- ----------->\| \/ btrfs_finish_ordered_io() range So error handler, which calls extent_clear_unlock_delalloc() with EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() will both cover OE n, and free its metadata, causing metadata under flow. [Fix] The fix is to ensure after calling btrfs_add_ordered_extent(), we only call error handler after increasing the iteration offset, so that cleanup range won't cover any created ordered extent. \|<------------------ delalloc range --------------------------->\| \| OE 1 \| OE 2 \| ... \| OE n \| \|<----------- ----------->\|<---------- cleanup range --------->\| \/ btrfs_finish_ordered_io() range Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-04-26 16:27:21 +01:00
Jan Kara	9e11ceee23	btrfs: Convert to separately allocated bdi Allocate struct backing_dev_info separately instead of embedding it inside superblock. This unifies handling of bdi among users. CC: Chris Mason <clm@fb.com> CC: Josef Bacik <jbacik@fb.com> CC: David Sterba <dsterba@suse.com> CC: linux-btrfs@vger.kernel.org Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-04-20 12:09:55 -06:00
David Sterba	338bd52f3c	btrfs: qgroup: move noisy underflow warning to debugging build The WARN_ON and warning from report_reserved_underflow can become very noisy and is visible unconditionally although this is namely for debugging. The patch "btrfs: Add WARN_ON for qgroup reserved underflow" (`18dc22c19b`) went to 4.11-rc1 and the plan was to get the fix as well, but this hasn't happened. CC: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-19 12:40:49 +02:00
Anand Jain	c2a9c7ab47	btrfs: check if the device is flush capable The block layer call chain from submit_bio will check if the write cache is enabled for the given queue before submitting the flush. This will add a code to fail fast if its not. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ updated changelog to reflect current code stat, blkdev_issue_flush is not used yet ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 16:13:27 +02:00
Anand Jain	13e88e1560	btrfs: delete unused member nobarriers The last consumer of nobarriers is removed by the commit [1] and sync won't fail with EOPNOTSUPP anymore. Thus, now when write cache is write through it just return success without actually transpiring such a request to the block device/lun. [1] commit `b25de9d6da` block: remove BIO_EOPNOTSUPP And, as the device/lun write cache state may change dynamically saving such as state won't help either. So deleting the member nobarriers. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 16:12:07 +02:00
Qu Wenruo	28d70e237d	btrfs: scrub: Fix RAID56 recovery race condition When scrubbing a RAID5 which has recoverable data corruption (only one data stripe is corrupted), sometimes scrub will report more csum errors than expected. Sometimes even unrecoverable error will be reported. The problem can be easily reproduced by the following steps: 1) Create a btrfs with RAID5 data profile with 3 devs 2) Mount it with nospace_cache or space_cache=v2 To avoid extra data space usage. 3) Create a 128K file and sync the fs, unmount it Now the 128K file lies at the beginning of the data chunk 4) Locate the physical bytenr of data chunk on dev3 Dev3 is the 1st data stripe. 5) Corrupt the first 64K of the data chunk stripe on dev3 6) Mount the fs and scrub it The correct csum error number should be 16 (assuming using x86_64). Larger csum error number can be reported in a 1/3 chance. And unrecoverable error can also be reported in a 1/10 chance. The root cause of the problem is RAID5/6 recover code has race condition, due to the fact that full scrub is initiated per device. While for other mirror based profiles, each mirror is independent with each other, so race won't cause any big problem. For example: Corrupted \| Correct \| Correct \| \| Scrub dev3 (D1) \| Scrub dev2 (D2) \| Scrub dev1(P) \| ------------------------------------------------------------------------ Read out D1 \|Read out D2 \|Read full stripe \| Check csum \|Check csum \|Check parity \| Csum mismatch \|Csum match, continue \|Parity mismatch \| handle_errored_block \| \|handle_errored_block \| Read out full stripe \| \| Read out full stripe\| D1 csum error(err++) \| \| D1 csum error(err++)\| Recover D1 \| \| Recover D1 \| So D1's csum error is accounted twice, just because handle_errored_block() doesn't have enough protection, and race can happen. On even worse case, for example D1's recovery code is re-writing D1/D2/P, and P's recovery code is just reading out full stripe, then we can cause unrecoverable error. This patch will use previously introduced lock_full_stripe() and unlock_full_stripe() to protect the whole scrub_handle_errored_block() function for RAID56 recovery. So no extra csum error nor unrecoverable error. Reported-by: Goffredo Baroncelli <kreijack@libero.it> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:27 +02:00
Qu Wenruo	0966a7b130	btrfs: scrub: Introduce full stripe lock for RAID56 Unlike mirror based profiles, RAID5/6 recovery needs to read out the whole full stripe. And if we don't do proper protection, it can easily cause race condition. Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe() for RAID5/6. Which store a rb_tree of mutexes for full stripes, so scrub callers can use them to lock a full stripe to avoid race. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:27 +02:00
Deepa Dinamani	fa7aede2ab	btrfs: Use ktime_get_real_ts for root ctime btrfs_root_item maintains the ctime for root updates. This is not part of vfs_inode. Since current_time() uses struct inode* as an argument as Linus suggested, this cannot be used to update root times unless, we modify the signature to use inode. Since btrfs uses nanosecond time granularity, it can also use ktime_get_real_ts directly to obtain timestamp for the root. It is necessary to use the timespec time api here because the same btrfs_set_stack_timespec_*() apis are used for vfs inode times as well. These can be transitioned to using timespec64 when btrfs internally changes to use timespec64 as well. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:27 +02:00
Dan Carpenter	9986277e0e	Btrfs: handle only applicable errors returned by btrfs_get_extent btrfs_get_extent() never returns NULL pointers, so this code introduces a static checker warning. The btrfs_get_extent() is a bit complex, but trust me that it doesn't return NULLs and also if it did we would trigger the BUG_ON(!em) before the last return statement. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> [ updated subject ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:27 +02:00
Qu Wenruo	82bafb38c2	btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option [BUG] The easist way to reproduce the bug is: ------ # mkfs.btrfs -f $dev -n 16K # mount $dev $mnt -o inode_cache # btrfs quota enable $mnt # btrfs quota rescan -w $mnt # btrfs qgroup show $mnt qgroupid rfer excl -------- ---- ---- 0/5 32.00KiB 32.00KiB ^^ Twice the correct value ------ And fstests/btrfs qgroup test group can easily detect them with inode_cache mount option. Although some of them are false alerts since old test cases are using fixed golden output. While new test cases will use "btrfs check" to detect qgroup mismatch. [CAUSE] Inode_cache mount option will make commit_fs_roots() to call btrfs_save_ino_cache() to update fs/subvol trees, and generate new delayed refs. However we call btrfs_qgroup_prepare_account_extents() too early, before commit_fs_roots(). This makes the "old_roots" for newly generated extents are always NULL. For freeing extent case, this makes both new_roots and old_roots to be empty, while correct old_roots should not be empty. This causing qgroup numbers not decreased correctly. [FIX] Modify the timing of calling btrfs_qgroup_prepare_account_extents() to just before btrfs_qgroup_account_extents(), and add needed delayed_refs handler. So qgroup can handle inode_map mount options correctly. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Anand Jain	e884f4f06e	btrfs: use q which is already obtained from bdev_get_queue We have already assigned q from bdev_get_queue() so use it. And rearrange the code for better view. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Liu Bo	42c61ab676	Btrfs: switch to div64_u64 if with a u64 divisor This is fixing code pieces where we use div_u64 when passing a u64 divisor. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Liu Bo	972d721939	Btrfs: update scrub_parity to use u64 stripe_len Commit `3d8da67817` ("Btrfs: fix divide error upon chunk's stripe_len") changed stripe_len in struct map_lookup to u64, but didn't update stripe_len in struct scrub_parity. This updates the type and switches to div64_u64_rem to match u64 divisor. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Liu Bo	c725328c55	Btrfs: enable repair during read for raid56 profile Now that scrub can fix data errors with the help of parity for raid56 profile, repair during read is able to as well. Although the mirror num in raid56 scenario has different meanings, i.e. 0 or 1: read data directly > 1: do recover with parity, it could be fit into how we repair bad block during read. The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the device and position on it. Cc: David Sterba <dsterba@suse.cz> Tested-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
David Sterba	619a974292	btrfs: use clear_page where appropriate There's a helper to clear whole page, with a arch-specific optimized code. The replaced cases do not seem to be in performace critical code, but we still might get some percent gain. Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Qu Wenruo	e501bfe323	btrfs: Prevent scrub recheck from racing with dev replace scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses bbio without protection of bio_counter. This can lead to use-after-free if racing with dev replace cancel. Fix it by increasing bio_counter before calling btrfs_map_sblock() and decreasing the bio_counter when corresponding recover is finished. Cc: Liu Bo <bo.li.liu@oracle.com> Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Qu Wenruo	ae6529c35b	btrfs: Wait for in-flight bios before freeing target device for raid56 When raid56 dev-replace is cancelled by running scrub, we will free target device without waiting for in-flight bios, causing the following NULL pointer deference or general protection failure. BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0 IP: generic_make_request_checks+0x4d/0x610 CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G O 4.11.0-rc2 #72 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs] task: ffff88002875b4c0 task.stack: ffffc90001334000 RIP: 0010:generic_make_request_checks+0x4d/0x610 Call Trace: ? generic_make_request+0xc7/0x360 generic_make_request+0x24/0x360 ? generic_make_request+0xc7/0x360 submit_bio+0x64/0x120 ? page_in_rbio+0x4d/0x80 [btrfs] ? rbio_orig_end_io+0x80/0x80 [btrfs] finish_rmw+0x3f4/0x540 [btrfs] validate_rbio_for_rmw+0x36/0x40 [btrfs] raid_rmw_end_io+0x7a/0x90 [btrfs] bio_endio+0x56/0x60 end_workqueue_fn+0x3c/0x40 [btrfs] btrfs_scrubparity_helper+0xef/0x620 [btrfs] btrfs_endio_raid56_helper+0xe/0x10 [btrfs] process_one_work+0x2af/0x720 ? process_one_work+0x22b/0x720 worker_thread+0x4b/0x4f0 kthread+0x10f/0x150 ? process_one_work+0x720/0x720 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8 In btrfs_dev_replace_finishing(), we will call btrfs_rm_dev_replace_blocked() to wait bios before destroying the target device when scrub is finished normally. However when dev-replace is aborted, either due to error or cancelled by scrub, we didn't wait for bios, this can lead to use-after-free if there are bios holding the target device. Furthermore, for raid56 scrub, at least 2 places are calling btrfs_map_sblock() without protection of bio_counter, leading to the problem. This patch fixes the problem: 1) Wait for bio_counter before freeing target device when canceling replace 2) When calling btrfs_map_sblock() for raid56, use bio_counter to protect the call. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Qu Wenruo	9a33944bdf	btrfs: scrub: Don't append on-disk pages for raid56 scrub In the following situation, scrub will calculate wrong parity to overwrite the correct one: RAID5 full stripe: Before \| Dev 1 \| Dev 2 \| Dev 3 \| \| Data stripe 1 \| Data stripe 2 \| Parity Stripe \| --------------------------------------------------- 0 \| 0x0000 (Bad) \| 0xcdcd \| 0x0000 \| --------------------------------------------------- 4K \| 0xcdcd \| 0xcdcd \| 0x0000 \| ... \| 0xcdcd \| 0xcdcd \| 0x0000 \| --------------------------------------------------- 64K After scrubbing dev3 only: \| Dev 1 \| Dev 2 \| Dev 3 \| \| Data stripe 1 \| Data stripe 2 \| Parity Stripe \| --------------------------------------------------- 0 \| 0xcdcd (Good) \| 0xcdcd \| 0xcdcd (Bad) \| --------------------------------------------------- 4K \| 0xcdcd \| 0xcdcd \| 0x0000 \| ... \| 0xcdcd \| 0xcdcd \| 0x0000 \| --------------------------------------------------- 64K The reason is that after raid56 read rebuild rbio->stripe_pages are all correctly recovered (0xcd for data stripes). However when we check and repair parity in scrub_parity_check_and_repair(), we will append pages in sparity->spages list to rbio->bio_pages[], which contains old on-disk data. And when we submit parity data to disk, we calculate parity using rbio->bio_pages[] first, if rbio->bio_pages[] not found, then fallback to rbio->stripe_pages[]. The patch fix it by not appending pages from sparity->spages. So finish_parity_scrub() will use rbio->stripe_pages[] which is correct. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Qu Wenruo	d51ea5dd22	btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint Newly introduced qgroup reserved space trace points are normally nested into several common qgroup operations. While some other trace points are not well placed to co-operate with them, causing confusing output. This patch re-arrange trace_btrfs_qgroup_release_data() and trace_btrfs_qgroup_free_delayed_ref() trace points so they are triggered before reserved space ones. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Qu Wenruo	3159fe7bae	btrfs: qgroup: Add trace point for qgroup reserved space Introduce the following trace points: qgroup_update_reserve qgroup_meta_reserve These trace points are handy to trace qgroup reserve space related problems. Also export btrfs_qgroup structure, as now we directly pass btrfs_qgroup structure to trace points, so that structure needs to be exported. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
David Sterba	825ad4c964	btrfs: drop redundant parameters from btrfs_map_sblock All callers pass 0 for mirror_num and 1 for need_raid_map. Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
David Sterba	bcc8e07f9e	btrfs: sink GFP flags parameter to tree_mod_log_insert_root All (1) callers pass the same value. Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
David Sterba	176ef8f5e6	btrfs: sink GFP flags parameter to tree_mod_log_insert_move All (1) callers pass the same value. Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Liu Bo	abad60c601	Btrfs: fix wrong failed mirror_num of read-repair on raid56 In raid56 scenario, after trying parity recovery, we didn't set mirror_num for btrfs_bio with failed mirror_num, hence end_bio_extent_readpage() will report a random mirror_num in dmesg log. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
Liu Bo	1bcd7aa17f	Btrfs: set scrub page's io_error if failing to submit io Scrub repairs data by the unit called scrub_block, which may contain several pages. Scrub always tries to look up a good copy of a whole block, but if there's no such copy, it tries to do repair page by page. If we don't set page's io_error when checking this bad copy, in the last step, we may skip this page when repairing bad copy from good copy. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:26 +02:00
David Sterba	171938e528	btrfs: track exclusive filesystem operation in flags There are several operations, usually started from ioctls, that cannot run concurrently. The status is tracked in mutually_exclusive_operation_running as an atomic_t. We can easily track the status as one of the per-filesystem flag bits with same synchronization guarantees. The conversion replaces: * atomic_xchg(..., 1) -> test_and_set_bit(FLAG, ...) * atomic_set(..., 0) -> clear_bit(FLAG, ...) Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
Goldwyn Rodrigues	48a89bc4f2	btrfs: qgroups: Retry after commit on getting EDQUOT We are facing the same problem with EDQUOT which was experienced with ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but here is a quick fix, which may be too big a hammer. Quotas are reserved during the start of an operation, incrementing qg->reserved. However, it is written to disk in a commit_transaction which could take as long as commit_interval. In the meantime there could be deletions which are not accounted for because deletions are accounted for only while committed (free_refroot). So, when we get a EDQUOT flush the data to disk and try again. This fixes fstests btrfs/139. Here is a sample script which shows this issue. DEVICE=/dev/vdb MOUNTPOINT=/mnt TESTVOL=$MOUNTPOINT/tmp QUOTA=5 PROG=btrfs DD_BS="4k" DD_COUNT="256" RUN_TIMES=5000 mkfs.btrfs -f $DEVICE mount -o commit=240 $DEVICE $MOUNTPOINT $PROG subvolume create $TESTVOL $PROG quota enable $TESTVOL $PROG qgroup limit ${QUOTA}G $TESTVOL typeset -i DD_RUN_GOOD typeset -i QUOTA function _check_cmd() { if [[ ${?} > 0 ]]; then echo -n "$(date) E: Running previous command" echo ${} echo "Without sync" $PROG qgroup show -pcreFf ${TESTVOL} echo "With sync" $PROG qgroup show -pcreFf --sync ${TESTVOL} exit 1 fi } while true; do DD_RUN_GOOD=$RUN_TIMES while (( ${DD_RUN_GOOD} != 0 )); do dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT} _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}" DD_RUN_GOOD=(${DD_RUN_GOOD}-1) done $PROG qgroup show -pcref $TESTVOL echo "----------- Cleanup ---------- " rm $TESTVOL/quotatest done Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
Edmund Nadolski	de47c9d3ff	btrfs: replace hardcoded value with SEQ_LAST macro Define the SEQ_LAST macro to replace (u64)-1 in places where said value triggers a special-case ref search behavior. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
Edmund Nadolski	f58d88b336	btrfs: provide enumeration for __merge_refs mode argument Replace hardcoded numeric values for __merge_refs 'mode' argument with descriptive constants. Signed-off-by: Edmund Nadolski <enadolski@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
David Sterba	f486135eba	btrfs: remove unused qgroup members from btrfs_trans_handle The members have been effectively unused since "Btrfs: rework qgroup accounting" (`fcebe4562d`), there's no substitute for assert_qgroups_uptodate so it's removed as well. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
David Sterba	994a5d2bc7	btrfs: remove local blocksize variable in reada_find_extent The name is misleading and the local variable serves no purpose. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
David Sterba	5721b8ad26	btrfs: remove redundant parameter from reada_start_machine_dev We can read fs_info from dev. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
David Sterba	0ceaf28213	btrfs: remove redundant parameter from reada_find_zone We can read fs_info from dev. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
David Sterba	d48d71aa99	btrfs: remove redundant parameter from btree_readahead_hook We can read fs_info from eb. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
David Sterba	7ef70b4d99	btrfs: preallocate radix tree node for global readahead tree We can preallocate the node so insertion does not have to do that under the lock. The GFP flags for the global radix tree are initialized to GFP_NOFS & ~__GFP_DIRECT_RECLAIM but we can use GFP_KERNEL, because readahead is optional and not on any critical writeout path. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
David Sterba	cc8385b59e	btrfs: preallocate radix tree node for readahead We can preallocate the node so insertion does not have to do that under the lock. The GFP flags for the per-device radix tree are initialized to GFP_NOFS & ~__GFP_DIRECT_RECLAIM but we can use GFP_KERNEL, same as an allocation above anyway, but also because readahead is optional and not on any critical writeout path. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
Goldwyn Rodrigues	4d339d0106	btrfs: No need to check !(flags & MS_RDONLY) twice Code cleanup. The code block is for !(*flags & MS_RDONLY). We don't need to check it again. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00

1 2 3 4 5 ...

6221 commits