Commit graph

1554 commits

Author SHA1 Message Date
Linus Torvalds d3bc0e67f8 for-4.18-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAls4zz0ACgkQxWXV+ddt
 WDs0ZhAAplEAcN1BP986BS7GpjrG20vQtP9AnHlnSEJJJmsnykpspBylOcLRkKjF
 LKBfBPCKqIo7kn5ebKT1Kk7zJPkOOEfmxGW7hffVN/oa/oMtmgJbbHDUgl2TDgdu
 rky1O+Bj+S37s5rhiXAJ4oU9ekdpWIlN30GczfynjiPqGigKh/cINsEEhQIIAiJG
 PRDQfSIJeh67x1AP0KE8sJAYSsaeFxT+kHrT/NPs1NFDSzrQSa/QWPFVjGVVuI/Y
 w84Mo0EqdRV7tap7D3QyWyYea6zdP00PG8TyLl0Kba+LckFbzpNN5hP3SUxleBzL
 0ZBJi7/tOqnrMV3YaGm40dLfgD4B+CFt8zDyg2JvWUxxEzfQfYif7KIT2IV8fSqS
 QrVw2NrzQC7EZ4Zu98wCN7dyyOE8yhqbq805YdG3Nj+zT6DqRu01TBo4Yr/Ek8ux
 +ITAtQVbaOZmTIt/qh/Oxc5jRsurAno1FP3XRH+1hfSlS7xc3LfI1CUbX3jAKzXN
 edxdM4/h+d4nekvROnKBH4EheS6+ZVfgzYlYUW9c2rjcJ1RHhDElbh14+IoM6LKJ
 nJ+Cp+744F6W5jaG3oWElJrdhlY31mWUjiZaj2CHl16EcH3MToytrxKMX+OWo95W
 gChnKicrtpO6+9nbED3Tdhp7SkbDysun6jvEpSgdlm8+2H5Kwrw=
 =KWL7
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We have a few regression fixes for qgroup rescan status tracking and
  the vm_fault_t conversion that mixed up the error values"

* tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix mount failure when qgroup rescan is in progress
  Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
  btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf
2018-07-01 12:38:16 -07:00
Chris Mason 717beb96d9 Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
The vm_fault_t conversion commit introduced a ret2 variable for tracking
the integer return values from internal btrfs functions.  It was
sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
and had been removed from the radix.  Something like this:

    ret2 = btrfs_delalloc_reserve_space() // returns zero on success

    lock_page(page)
    if (page->mapping != inode->i_mapping)
	goto out_unlock;

...

out_unlock:
    if (!ret2) {
	    ...
	    return VM_FAULT_LOCKED;
    }

This ends up triggering this WARNING in btrfs_destroy_inode()
    WARN_ON(BTRFS_I(inode)->block_rsv.size);

xfstests generic/095 was able to reliably reproduce the errors.

Since out_unlock: is only used for errors, this fix moves it below the
if (!ret2) check we use to return VM_FAULT_LOCKED for success.

Fixes: a528a24150 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t)
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-28 11:30:50 +02:00
Linus Torvalds 84bfed40fc for-4.18-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsvtAIACgkQxWXV+ddt
 WDvPJQ//eEr+ACNxG5n42e63TaNpLOeoGAsUChwekP6DAIc2n/r78SHX+T/woDqd
 7/dc2eqlYF5Fmn6MPQ1ufL2xlfw0t3OnemK8T5+4sxZDdzeAH9V+kHaqoaLUPExL
 4r6lK5Ywwa2cWghC7WvQg3+bWLX18OEExG63SlEhLo3YJM5uUqVhGVPi6ARrbxNM
 GJvXcQsxjXLqukm4gYvHC6Zra9Awv65uiAU+VCm2y96j1fEJ0yjK/pC1RtoFGCqS
 48Jjuzfq/V3nxy0Wjr3DvpQVEQcKyGha6i/eazZISdRhGSjrYdvIwpUn7gZeoo2Q
 hT8VVergLbVYgIeaOwgwubQzNaG2C/ZTsEjPQrNrA7a/AGsh5C/ommYE9MSyS6L0
 PG0NLUNDXFmEj8WdI97So+1Sm2OCb04DPbPhHbbkhw5L/MPE1TaLN5aUWguj8laB
 NnyPRdVP9jCJAI0OhJY7nPDmsPKe2jogVVsRheTcob+V5G+vIgzDXlGfsW/88Seg
 dHubMaC0nz6u8Cj4dIviitiLXuustyz0JkVdljTLawWWJ/Hs7NlsSf3Q3nj2Kvia
 e8QMID0vLphQyL1hqC0n7M0g2ohq9NUGT1nhLTPSpFl0l8bIA9PQehOx9Q5vx8yp
 tJF+d0qiNfgvadA+KhyvQ8puQb2+zZQ8+Pqfjwd4ySD7SI5dcA4=
 =fCdm
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two regression fixes and an incorrect error value propagation fix from
  'rename exchange'"

* tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix return value on rename exchange failure
  btrfs: fix invalid-free in btrfs_extent_same
  Btrfs: fix physical offset reported by fiemap for inline extents
2018-06-26 08:41:54 -07:00
Filipe Manana c5b4a50b74 Btrfs: fix return value on rename exchange failure
If we failed during a rename exchange operation after starting/joining a
transaction, we would end up replacing the return value, stored in the
local 'ret' variable, with the return value from btrfs_end_transaction().
So this could end up returning 0 (success) to user space despite the
operation having failed and aborted the transaction, because if there are
multiple tasks having a reference on the transaction at the time
btrfs_end_transaction() is called by the rename exchange, that function
returns 0 (otherwise it returns -EIO and not the original error value).
So fix this by not overwriting the return value on error after getting
a transaction handle.

Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-22 12:59:08 +02:00
Linus Torvalds 7a932516f5 vfs/y2038: inode timestamps conversion to timespec64
This is a late set of changes from Deepa Dinamani doing an automated
 treewide conversion of the inode and iattr structures from 'timespec'
 to 'timespec64', to push the conversion from the VFS layer into the
 individual file systems.
 
 There were no conflicts between this and the contents of linux-next
 until just before the merge window, when we saw multiple problems:
 
 - A minor conflict with my own y2038 fixes, which I could address
   by adding another patch on top here.
 - One semantic conflict with late changes to the NFS tree. I addressed
   this by merging Deepa's original branch on top of the changes that
   now got merged into mainline and making sure the merge commit includes
   the necessary changes as produced by coccinelle.
 - A trivial conflict against the removal of staging/lustre.
 - Multiple conflicts against the VFS changes in the overlayfs tree.
   These are still part of linux-next, but apparently this is no longer
   intended for 4.18 [1], so I am ignoring that part.
 
 As Deepa writes:
 
   The series aims to switch vfs timestamps to use struct timespec64.
   Currently vfs uses struct timespec, which is not y2038 safe.
 
   The series involves the following:
   1. Add vfs helper functions for supporting struct timepec64 timestamps.
   2. Cast prints of vfs timestamps to avoid warnings after the switch.
   3. Simplify code using vfs timestamps so that the actual
      replacement becomes easy.
   4. Convert vfs timestamps to use struct timespec64 using a script.
      This is a flag day patch.
 
   Next steps:
   1. Convert APIs that can handle timespec64, instead of converting
      timestamps at the boundaries.
   2. Update internal data structures to avoid timestamp conversions.
 
 Thomas Gleixner adds:
 
   I think there is no point to drag that out for the next merge window.
   The whole thing needs to be done in one go for the core changes which
   means that you're going to play that catchup game forever. Let's get
   over with it towards the end of the merge window.
 
 [1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
 MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
 g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
 L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
 Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
 LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
 nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
 wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
 c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
 tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
 WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
 A3HkjIBrKW5AgQDxfgvm
 =CZX2
 -----END PGP SIGNATURE-----

Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
 "This is a late set of changes from Deepa Dinamani doing an automated
  treewide conversion of the inode and iattr structures from 'timespec'
  to 'timespec64', to push the conversion from the VFS layer into the
  individual file systems.

  As Deepa writes:

   'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timepec64
       timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
       becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
       This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
       timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

  Thomas Gleixner adds:

   'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
  pstore: Remove bogus format string definition
  vfs: change inode times to use struct timespec64
  pstore: Convert internal records to timespec64
  udf: Simplify calls to udf_disk_stamp_to_time
  fs: nfs: get rid of memcpys for inode times
  ceph: make inode time prints to be long long
  lustre: Use long long type to print inode time
  fs: add timespec64_truncate()
2018-06-15 07:31:07 +09:00
Arnd Bergmann 15eefe2a99 Merge branch 'vfs_timespec64' of https://github.com/deepa-hub/vfs into vfs-timespec64
Pull the timespec64 conversion from Deepa Dinamani:
 "The series aims to switch vfs timestamps to use
  struct timespec64. Currently vfs uses struct timespec,
  which is not y2038 safe.

  The flag patch applies cleanly. I've not seen the timestamps
  update logic change often. The series applies cleanly on 4.17-rc6
  and linux-next tip (top commit: next-20180517).

  I'm not sure how to merge this kind of a series with a flag patch.
  We are targeting 4.18 for this.
  Let me know if you have other suggestions.

  The series involves the following:
  1. Add vfs helper functions for supporting struct timepec64 timestamps.
  2. Cast prints of vfs timestamps to avoid warnings after the switch.
  3. Simplify code using vfs timestamps so that the actual
     replacement becomes easy.
  4. Convert vfs timestamps to use struct timespec64 using a script.
     This is a flag day patch.

  I've tried to keep the conversions with the script simple, to
  aid in the reviews. I've kept all the internal filesystem data
  structures and function signatures the same.

  Next steps:
  1. Convert APIs that can handle timespec64, instead of converting
     timestamps at the boundaries.
  2. Update internal data structures to avoid timestamp conversions."

I've pulled it into a branch based on top of the NFS changes that
are now in mainline, so I could resolve the non-obvious conflict
between the two while merging.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-06-14 14:54:00 +02:00
Souptick Joarder a528a24150 btrfs: change return type of btrfs_page_mkwrite to vm_fault_t
Use the new return type vm_fault_t for fault handler. For now, this is
just documenting that the function returns a VM_FAULT value rather than
an errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Reference commit 1c8f422059 ("mm: change return type to vm_fault_t")

vmf_error() is the newly introduced inline function in 4.17-rc6.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-07 17:27:45 +02:00
Deepa Dinamani 95582b0083 vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
  current_time ( ... )
  {
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
  ...
- return timespec_trunc(
+ return timespec64_trunc(
  ... );
  }

@ depends on patch @
identifier xtime;
@@
 struct \( iattr \| inode \| kstat \) {
 ...
-       struct timespec xtime;
+       struct timespec64 xtime;
 ...
 }

@ depends on patch @
identifier t;
@@
 struct inode_operations {
 ...
int (*update_time) (...,
-       struct timespec t,
+       struct timespec64 t,
...);
 ...
 }

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
 fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
 ...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
  ) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
 fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
|
 node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 stat->xtime = node2->i_xtime1;
|
 stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
 node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-06-05 16:57:31 -07:00
Omar Sandoval ad7e1a740d Btrfs: clean up error handling in btrfs_truncate()
btrfs_truncate() uses two variables for error handling, ret and err (if
this sounds familiar, it's because btrfs_truncate_inode_items() did
something similar). This is error prone, as was made evident by "Btrfs:
fix error handling in btrfs_truncate()". We only have err because we
don't want to mask an error if we call btrfs_update_inode() and
btrfs_end_transaction(), so let's make that its own scoped return
variable and use ret everywhere else.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 21:27:32 +02:00
Nikolay Borisov c5794e5178 btrfs: Factor out write portion of btrfs_get_blocks_direct
Now that the read side is extracted into its own function, do the same
to the write side. This leaves btrfs_get_blocks_direct_write with the
sole purpose of handling common locking required. Also flip the
condition in btrfs_get_blocks_direct_write so that the write case
comes first and we check for if (Create) rather than if (!create). This
is purely subjective but I believe makes reading a bit more "linear".
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 19:01:44 +02:00
Nikolay Borisov 1c8d0175df btrfs: Factor out read portion of btrfs_get_blocks_direct
Currently this function handles both the READ and WRITE dio cases. This
is facilitated by a bunch of 'if' statements, a goto short-circuit
statement and a very perverse aliasing of "!created"(READ) case
by setting lockstart = lockend and checking for lockstart < lockend for
detecting the write. Let's simplify this mess by extracting the
READ-only code into a separate __btrfs_get_block_direct_read function.
This is only the first step, the next one will be to factor out the
write side as well. The end goal will be to have the common locking/
unlocking code in btrfs_get_blocks_direct and then it will call either
the read|write subvariants. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 19:01:43 +02:00
Su Yue 090a127afa btrfs: return error value if create_io_em failed in cow_file_range
In cow_file_range(), create_io_em() may fail, but its return value is
not recorded.  Then return value may be 0 even it failed which is a
wrong behavior.

Let cow_file_range() return PTR_ERR(em) if create_io_em() failed.

Fixes: 6f9994dbab ("Btrfs: create a helper to create em for IO")
CC: stable@vger.kernel.org # 4.11+
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:51:08 +02:00
Gu JinXiang c4c129db5d btrfs: drop unused parameter qgroup_reserved
Since commit 7775c8184e ("btrfs: remove unused parameter from
btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
by caller of function btrfs_subvolume_reserve_metadata.  So remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Ethan Lien e73e81b6d0 btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do call balance dirty pages in write side, but in the
buffered write path, most metadata are dirtied only after we reach the
dirty background limit (which by far only counts dirty data pages) and
wakeup the flusher thread. If there are many small, unmergeable random
writes spread in a large btree, we'll find a burst of dirty pages
exceeds the dirty_bytes limit after we wakeup the flusher thread - which
is not what we expect. In our machine, it caused out-of-memory problem
since a page cannot be dropped if it is marked dirty.

Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay,
but since we do btrfs_finish_ordered_io in a separate worker, it will not
stop the flusher consuming dirty pages. Also, we use different worker for
metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle
the size of dirty metadata pages.

[Reproduce steps]
To reproduce the problem, we need to do 4KiB write randomly spread in a
large btree. In our 2GiB RAM machine:

1) Create 4 subvolumes.
2) Run fio on each subvolume:

   [global]
   direct=0
   rw=randwrite
   ioengine=libaio
   bs=4k
   iodepth=16
   numjobs=1
   group_reporting
   size=128G
   runtime=1800
   norandommap
   time_based
   randrepeat=0

3) Take snapshot on each subvolume and repeat fio on existing files.
4) Repeat step (3) until we get large btrees.
   In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
   metadata in each subvolume tree and 12GiB of metadata in extent tree.
5) Stop all fio, take snapshot again, and wait until all delayed work is
   completed.
6) Start all fio. Few seconds later we hit OOM when the flusher starts
   to work.

It can be reproduced even when using nocow write.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Ethan Lien 78d4295b1e btrfs: lift some btrfs_cross_ref_exist checks in nocow path
In nocow path, we check if the extent is snapshotted in
btrfs_cross_ref_exist(). We can do the similar check earlier and avoid
unnecessary search into extent tree.

A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:

[global]
group_reporting
time_based
thread=1
ioengine=libaio
bs=4k
iodepth=32
size=64G
runtime=180
numjobs=8
rw=randwrite

[file1]
filename=/mnt/nocow/testfile

IOPS result:   unpatched     patched

1 fio round:     46670        46958
snapshot
2 fio round:     51826        54498
3 fio round:     59767        61289

After snapshot, the first fio get about 5% performance gain. As we
continually write to the same file, all writes will resume to nocow mode
and eventually we have no performance gain.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Lu Fengqi d19577912d btrfs: Remove fs_info argument from btrfs_uuid_tree_rem
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ rename the function ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Omar Sandoval 399b0bbf5f Btrfs: reserve space for O_TMPFILE orphan item deletion
btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but
it doesn't reserve space to do so. Even before the removal of the
orphan_block_rsv it wasn't using it.

Fixes: ef3b9af50b ("Btrfs: implement inode_operations callback tmpfile")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:00 +02:00
Omar Sandoval a575ceeb13 Btrfs: get rid of unused orphan infrastructure
Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can git rid of it, along with:

- root->orphan_lock, which was used to protect root->orphan_block_rsv
- root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
- BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
  in root->orphan_block_rsv
- btrfs_orphan_commit_root(), which was the last user of any of these
  and does nothing else

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:57 +02:00
Omar Sandoval 27919067f1 Btrfs: fix ENOSPC caused by orphan items reservations
Currently, we keep space reserved for all inode orphan items until the
inode is evicted (i.e., all references to it are dropped). We hit an
issue where an application would keep a bunch of deleted files open (by
design) and thus keep a large amount of space reserved, causing ENOSPC
errors when other operations tried to reserve space. This long-standing
reservation isn't absolutely necessary for a couple of reasons:

- We can almost always make the reservation we need or steal from the
  global reserve for the orphan item
- If we can't, it's not the end of the world if we drop the orphan item
  on the floor and let the next mount clean it up

So, get rid of persistent reservation and just reserve space in
btrfs_evict_inode().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:54 +02:00
Omar Sandoval 4b9d7b59bf Btrfs: refactor btrfs_evict_inode() reserve refill dance
The truncate loop in btrfs_evict_inode() does two things at once:

- It refills the temporary block reserve, potentially stealing from the
  global reserve or committing
- It calls btrfs_truncate_inode_items()

The tangle of continues hides the fact that these two steps are actually
separate. Split the first step out into a separate function both for
clarity and so that we can reuse it in a later patch.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:52 +02:00
Omar Sandoval c08db7d8d2 Btrfs: don't return ino to ino cache if inode item removal fails
In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
item will still be in the tree but we still return the ino to the ino
cache. That will blow up later when someone tries to allocate that ino,
so don't return it to the cache.

Fixes: 581bb05094 ("Btrfs: Cache free inode numbers in memory")
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:51 +02:00
Omar Sandoval 05a5bd7c4d Btrfs: delete dead code in btrfs_orphan_commit_root()
btrfs_orphan_commit_root() tries to delete an orphan item for a
subvolume in the tree root, but we don't actually insert that item in
the first place. See commit 0a0d4415e3 ("Btrfs: delete dead code in
btrfs_orphan_add()"). We can get rid of it.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:49 +02:00
Omar Sandoval 7b40b695b4 Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
Now that we don't add orphan items for truncate, there can't be races on
adding or deleting an orphan item, so this bit is unnecessary.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:48 +02:00
Omar Sandoval f7e9e8fc79 Btrfs: stop creating orphan items for truncate
Currently, we insert an orphan item during a truncate so that if there's
a crash, we don't leak extents past the on-disk i_size. However, since
commit 7f4f6e0a3f ("Btrfs: only update disk_i_size as we remove
extents"), we keep disk_i_size in sync with the extent items as we
truncate, so orphan cleanup will never have any extents to remove. Don't
bother with the superfluous orphan item.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:46 +02:00
Omar Sandoval 0552210997 Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()
btrfs_free_extent() can fail because of ENOMEM. There's no reason to
panic here, we can just abort the transaction.

Fixes: f4b9aa8d3b ("btrfs_truncate")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:45 +02:00
Omar Sandoval fd86a3a315 Btrfs: fix error handling in btrfs_truncate_inode_items()
btrfs_truncate_inode_items() uses two variables for error handling, ret
and err. These are not handled consistently, leading to a couple of
bugs.

- Errors from btrfs_del_items() are handled but not propagated to the
  caller
- If btrfs_run_delayed_refs() fails and aborts the transaction, we
  continue running

Just use ret everywhere and simplify things a bit, fixing both of these
issues.

Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Fixes: 1262133b8d ("Btrfs: account for crcs in delayed ref processing")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:44 +02:00
Omar Sandoval d1342aadbd Btrfs: update stale comments referencing vmtruncate()
Commit a41ad394a0 ("Btrfs: convert to the new truncate sequence")
changed btrfs_setsize() to call truncate_setsize() instead of
vmtruncate() but didn't update the comment above it. truncate_setsize()
never fails (the IS_SWAPFILE() check happens elsewhere), so remove the
comment.

Additionally, the comment above btrfs_page_mkwrite() references
vmtruncate(), but truncate_setsize() does the size write and page
locking now.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:43 +02:00
David Sterba 7b6a221e5b btrfs: rename btrfs_update_iflags to reflect which flags it touches
The btrfs inode flag flavour is now simply called 'inode flags' and the
vfs inode are i_flags.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:20 +02:00
Nikolay Borisov 20a6800402 btrfs: Unexport and rename btrfs_invalidate_inodes
This function is no longer used outside of inode.c so just make it
static. At the same time give a more becoming name, since it's not
really invalidating the inodes but just calling d_prune_alias. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:10 +02:00
David Sterba 093258e6eb btrfs: replace waitqueue_actvie with cond_wake_up
Use the wrappers and reduce the amount of low-level details about the
waitqueue management.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:09 +02:00
Al Viro 7a1b1e7028 btrfs: take the last remnants of ->d_fsdata use out
[spotted while going through ->d_fsdata handling around d_splice_alias();
don't really care which tree that goes through]

The only thing even looking at ->d_fsdata in there (since 2012)
had been kfree(dentry->d_fsdata) in btrfs_dentry_delete().  Which,
incidentally, is all btrfs_dentry_delete() does.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:37 +02:00
Nikolay Borisov 7c8a0d363a btrfs: Add assert in __btrfs_del_delalloc_inode
The invariant is that when nr_delalloc_inodes is 0 then the root
mustn't have any inodes on its delalloc inodes list.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:32 +02:00
Nikolay Borisov 3a2f8c07e1 btrfs: Unexport btrfs_alloc_delalloc_work
It's used only in inode.c so makes no sense to have it exported. Also
move the definition of btrfs_delalloc_work to inode.c since it's used
only this file.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:29 +02:00
Nikolay Borisov 076da91cd9 btrfs: Remove delayed_iput member from btrfs_delalloc_work
When allocating a delalloc work we are always setting the delayed_iput
to 0. So remove the delay_iput member of btrfs_delalloc_work, as a
result also remove it as a parameter from btrfs_alloc_delalloc_work
since it's not used anymore.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:29 +02:00
Nikolay Borisov 4fbb514785 btrfs: Remove delay_iput parameter from __start_delalloc_inodes
It's always set to 0 so remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ rename to start_delalloc_inodes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:28 +02:00
Nikolay Borisov 76f32e240e btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes
It's always set to 0, so just remove it and collapse the constant value
to the only function we are passing it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:28 +02:00
Nikolay Borisov 82b3e53b8d btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots
This parameter was introduced alongside the function in
eb73c1b7ce ("Btrfs: introduce per-subvolume delalloc inode list") to
avoid deadlocks since this function was used in the transaction commit
path. However, commit 8d875f95da ("btrfs: disable strict file flushes
for renames and truncates") removed that usage, rendering the parameter
obsolete.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:28 +02:00
Nikolay Borisov 40012f96b6 btrfs: Remove btrfs_wait_and_free_delalloc_work
This function is called from only 1 place and is effectively a wrapper
over wait_completion/kfree. It doesn't really bring any value having
those two calls in a separate function. Just open code it and remove it.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:20 +02:00
Nikolay Borisov 8ae225a8a4 btrfs: Remove tree argument from extent_writepages
It can be directly referenced from the passed address_space so do that.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:20 +02:00
Nikolay Borisov 81f1d39035 btrfs: Use list_empty instead of list_empty_careful
list_empty_careful usually is a signal of something tricky going on. Its
usage in btrfs is actually not needed since both lists it's used on are
local to a function and cannot be modified concurrently. So switch to
plain list_empty. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:19 +02:00
Nikolay Borisov 2a3ff0adc9 btrfs: Remove redundant tree argument from extent_readpages
This function is called only from btrfs_readpage and is already passed
the mapping. Simplify its signature by moving the code obtaining
reference to the extent tree in the function. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:19 +02:00
Nikolay Borisov 477a30ba5f btrfs: Sink extent_tree arguments in try_release_extent_mapping
This function already gets the page from which the two extent trees
are referenced. Simplify its signature by moving the code getting the
trees inside the function. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:19 +02:00
Misono Tomohiro a79a464d56 btrfs: Allow rmdir(2) to delete an empty subvolume
Change the behavior of rmdir(2) and allow it to delete an empty
subvolume by using btrfs_delete_subvolume() which is used by
btrfs_ioctl_snap_destroy().

This is a change in behaviour and has been requested by users. Deleting
the subvolume by ioctl requires root permissions while the rmdir way
does works with standard tools and syscalls for all users that can
access the subvolume.

The main usecase is to allow 'rm -rf /path/with/subvols' to simply work.
We were not able to find any nasty usability surprises, the intention is
to do the destructive rm. Without allowing rmdir, this would have to be
followed by the ioctl subvolume deletion, which is more of an annoyance.

Implementation details:

The required lock for @dir and inode of @dentry is already acquired in
vfs layer.

We need some check before deleting a subvolume. Permission check is done
in vfs layer, emptiness check is in btrfs_rmdir() and additional check
(i.e. neither the subvolume is a default subvolume nor send is in progress)
is in btrfs_delete_subvolume().

Note that in btrfs_ioctl_snap_destroy(), d_delete() is called after
btrfs_delete_subvolume(). For rmdir(2), d_delete() is called in vfs
layer later.

Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:18 +02:00
Misono Tomohiro f60a2364a4 btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy()
Factor out the second half of btrfs_ioctl_snap_destroy() as
btrfs_delete_subvolume(), which performs some subvolume specific checks
before deletion:

1. send is not in progress
2. the subvolume is not the default subvolume
3. the subvolume does not contain other subvolumes

and actual deletion process. btrfs_delete_subvolume() requires
inode_lock for both @dir and inode of @dentry. The remaining part of
btrfs_ioctl_snap_destroy() is mainly permission checks.

Note that call of d_delete() is not included in btrfs_delete_subvolume()
as this function will also be used by btrfs_rmdir() to delete an empty
subvolume and in that case d_delete() is called in VFS layer.

As a result, btrfs_unlink_subvol() and may_destroy_subvol()
become static functions. No functional changes.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:18 +02:00
Misono Tomohiro ec42f16734 btrfs: Move may_destroy_subvol() from ioctl.c to inode.c
This is a preparation work to refactor btrfs_ioctl_snap_destroy()
and to allow rmdir(2) to delete an empty subvolume.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of the function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:18 +02:00
David Sterba f46b24c945 btrfs: use fs_info for btrfs_handle_em_exist tracepoint
We really want to know to which filesystem the extent map events belong,
but as it cannot be reached from the extent_map pointers, we need to
pass it down the callchain.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:17 +02:00
Nikolay Borisov b25f0d0012 btrfs: Use while loop instead of labels in __endio_write_update_ordered
Currently __endio_write_update_ordered uses labels to implement
what is essentially a simple while loop. This makes the code more
cumbersome to follow than it actually has to be. No functional
changes. No xfstest regressions were found during testing.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:07:15 +02:00
Linus Torvalds d7b66b4ab0 for-4.17-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsG/ecACgkQxWXV+ddt
 WDvr3w/8D12pwR9sPcEwxD4pvoLv7LP1VRQy2u+ivSifdBD7MueKh3y0igUMyARR
 LERsK0zUsTQGkkC6c7ZYd4cT9PikPpXtO1P9iATFAKqR/YMDIV/haSqT8DwbI/qb
 7F+ZMeTy1LzL01YlYBrGVDxP8AWVO2Dml6JolYxzplILSLvdPH6G8xOSjei/p9sm
 RK5ERHJENEI0l/cThpiLoAEWjzciPtR39T5Hq45onHyCs3bjJCcx51/QE8sBsl8x
 +BKvCmL40UKd30YKudJZYDM6NgMgWENhfTtIZQIInv99sMNCxIgTEUdX8ExdyjRZ
 24rst/BuQz4d8r/8zqE/hdFsHRGWwnEiYmGWylanPY5KdQ41ULfXC06xuoNOLoW8
 KQwD8SWv+W5vEJW0UQz5cb3vUgv5RnUzPvcmMfSztLeo2K4zj6zCK5L6XJwIJNbM
 1AJR7R4TRkQdf5QEeziFl738Yv1AgsPQuKSiiFa9YwXMLU8dYXlx14ioUzBL8MLe
 1wZPJ03x/N7eKJ0g6OIAAVfUTFFejv4Z2B2IDoObuLLsPwTdK6tS+9tJ5mos7ngG
 Vf1ZVmhmeJdw1qwK8ROzAJHkK807KgGO7LWmA7tIVLwWuZX14F7xLQIg3Ux3MhIh
 NhoBTFy2AGmdE0hFYv/4FA5dnUOU4VTVYVw3QUV4DMc0XIodZrE=
 =iYyx
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A one-liner that prevents leaking an internal error value 1 out of the
  ftruncate syscall.

  This has been observed in practice. The steps to reproduce make a
  common pattern (open/write/fync/ftruncate) but also need the
  application to not check only for negative values and happens only for
  compressed inlined files.

  The conditions are narrow but as this could break userspace I think
  it's better to merge it now and not wait for the merge window"

* tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix error handling in btrfs_truncate()
2018-05-24 11:47:43 -07:00
Omar Sandoval d50147381a Btrfs: fix error handling in btrfs_truncate()
Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs in some cases. This is coming from
the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().

btrfs_truncate() uses two variables for error handling, ret and err.
When btrfs_truncate_inode_items() returns non-zero, we set err to the
return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
only set err if ret is an error (i.e., negative).

To reproduce the issue: mount a filesystem with -o compress-force=zstd
and the following program will encounter return value of 1 from
ftruncate:

int main(void) {
        char buf[256] = { 0 };
        int ret;
        int fd;

        fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
        if (fd == -1) {
                perror("open");
                return EXIT_FAILURE;
        }

        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                perror("write");
                close(fd);
                return EXIT_FAILURE;
        }

        if (fsync(fd) == -1) {
                perror("fsync");
                close(fd);
                return EXIT_FAILURE;
        }

        ret = ftruncate(fd, 128);
        if (ret) {
                printf("ftruncate() returned %d\n", ret);
                close(fd);
                return EXIT_FAILURE;
        }

        close(fd);
        return EXIT_SUCCESS;
}

Fixes: ddfae63cc8 ("btrfs: move btrfs_truncate_block out of trans handle")
CC: stable@vger.kernel.org # 4.15+
Reported-by: Jun Wu <quark@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-24 11:56:57 +02:00
Linus Torvalds 5997aab0a1 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes all over the place"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: fix io_destroy(2) vs. lookup_ioctx() race
  ext2: fix a block leak
  nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed
  cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed
  unfuck sysfs_mount()
  kernfs: deal with kernfs_fill_super() failures
  cramfs: Fix IS_ENABLED typo
  befs_lookup(): use d_splice_alias()
  affs_lookup: switch to d_splice_alias()
  affs_lookup(): close a race with affs_remove_link()
  fix breakage caused by d_find_alias() semantics change
  fs: don't scan the inode cache before SB_BORN is set
  do d_instantiate/unlock_new_inode combinations safely
  iov_iter: fix memory leak in pipe_get_pages_alloc()
  iov_iter: fix return type of __pipe_get_pages()
2018-05-21 11:54:57 -07:00