1
0
Fork 0
Commit Graph

59753 Commits (2a7bf671186eb5d1a47ef192aefbdf788f5b38fe)

Author SHA1 Message Date
Linus Torvalds 0f75ef6a9c Keyrings ACL
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRyyVvu3V2unywtrAQL3xQ//eifjlELkRAPm2EReWwwahdM+9QL/0bAy
 e8eAzP9EaphQGUhpIzM9Y7Cx+a8XW2xACljY8hEFGyxXhDMoLa35oSoJOeay6vQt
 QcgWnDYsET8Z7HOsFCP3ZQqlbbqfsB6CbIKtZoEkZ8ib7eXpYcy1qTydu7wqrl4A
 AaJalAhlUKKUx9hkGGJTh2xvgmxgSJkxx3cNEWJQ2uGgY/ustBpqqT4iwFDsgA/q
 fcYTQFfNQBsC8/SmvQgxJSc+reUdQdp0z1vd8qjpSdFFcTq1qOtK0qDdz1Bbyl24
 hAxvNM1KKav83C8aF7oHhEwLrkD+XiYKixdEiCJJp+A2i+vy2v8JnfgtFTpTgLNK
 5xu2VmaiWmee9SLCiDIBKE4Ghtkr8DQ/5cKFCwthT8GXgQUtdsdwAaT3bWdCNfRm
 DqgU/AyyXhoHXrUM25tPeF3hZuDn2yy6b1TbKA9GCpu5TtznZIHju40Px/XMIpQH
 8d6s/pg+u/SnkhjYWaTvTcvsQ2FB/vZY/UzAVyosnoMBkVfL4UtAHGbb8FBVj1nf
 Dv5VjSjl4vFjgOr3jygEAeD2cJ7L6jyKbtC/jo4dnOmPrSRShIjvfSU04L3z7FZS
 XFjMmGb2Jj8a7vAGFmsJdwmIXZ1uoTwX56DbpNL88eCgZWFPGKU7TisdIWAmJj8U
 N9wholjHJgw=
 =E3bF
 -----END PGP SIGNATURE-----

Merge tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyring ACL support from David Howells:
 "This changes the permissions model used by keys and keyrings to be
  based on an internal ACL by the following means:

   - Replace the permissions mask internally with an ACL that contains a
     list of ACEs, each with a specific subject with a permissions mask.
     Potted default ACLs are available for new keys and keyrings.

     ACE subjects can be macroised to indicate the UID and GID specified
     on the key (which remain). Future commits will be able to add
     additional subject types, such as specific UIDs or domain
     tags/namespaces.

     Also split a number of permissions to give finer control. Examples
     include splitting the revocation permit from the change-attributes
     permit, thereby allowing someone to be granted permission to revoke
     a key without allowing them to change the owner; also the ability
     to join a keyring is split from the ability to link to it, thereby
     stopping a process accessing a keyring by joining it and thus
     acquiring use of possessor permits.

   - Provide a keyctl to allow the granting or denial of one or more
     permits to a specific subject. Direct access to the ACL is not
     granted, and the ACL cannot be viewed"

* tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Provide KEYCTL_GRANT_PERMISSION
  keys: Replace uid/gid/perm permissions checking with an ACL
2019-07-08 19:56:57 -07:00
Linus Torvalds c84ca912b0 Keyrings namespacing
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRU89Pu3V2unywtrAQIdBBAAmMBsrfv+LUN4Vru/D6KdUO4zdYGcNK6m
 S56bcNfP6oIDEj6HrNNnzKkWIZpdZ61Odv1zle96+v4WZ/6rnLCTpcsdaFNTzaoO
 YT2jk7jplss0ImrMv1DSoykGqO3f0ThMIpGCxHKZADGSu0HMbjSEh+zLPV4BaMtT
 BVuF7P3eZtDRLdDtMtYcgvf5UlbdoBEY8w1FUjReQx8hKGxVopGmCo5vAeiY8W9S
 ybFSZhPS5ka33ynVrLJH2dqDo5A8pDhY8I4bdlcxmNtRhnPCYZnuvTqeAzyUKKdI
 YN9zJeDu1yHs9mi8dp45NPJiKy6xLzWmUwqH8AvR8MWEkrwzqbzNZCEHZ41j74hO
 YZWI0JXi72cboszFvOwqJERvITKxrQQyVQLPRQE2vVbG0bIZPl8i7oslFVhitsl+
 evWqHb4lXY91rI9cC6JIXR1OiUjp68zXPv7DAnxv08O+PGcioU1IeOvPivx8QSx4
 5aUeCkYIIAti/GISzv7xvcYh8mfO76kBjZSB35fX+R9DkeQpxsHmmpWe+UCykzWn
 EwhHQn86+VeBFP6RAXp8CgNCLbrwkEhjzXQl/70s1eYbwvK81VcpDAQ6+cjpf4Hb
 QUmrUJ9iE0wCNl7oqvJZoJvWVGlArvPmzpkTJk3N070X2R0T7x1WCsMlPDMJGhQ2
 fVHvA3QdgWs=
 =Push
 -----END PGP SIGNATURE-----

Merge tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyring namespacing from David Howells:
 "These patches help make keys and keyrings more namespace aware.

  Firstly some miscellaneous patches to make the process easier:

   - Simplify key index_key handling so that the word-sized chunks
     assoc_array requires don't have to be shifted about, making it
     easier to add more bits into the key.

   - Cache the hash value in the key so that we don't have to calculate
     on every key we examine during a search (it involves a bunch of
     multiplications).

   - Allow keying_search() to search non-recursively.

  Then the main patches:

   - Make it so that keyring names are per-user_namespace from the point
     of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
     accessible cross-user_namespace.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

   - Move the user and user-session keyrings to the user_namespace
     rather than the user_struct. This prevents them propagating
     directly across user_namespaces boundaries (ie. the KEY_SPEC_*
     flags will only pick from the current user_namespace).

   - Make it possible to include the target namespace in which the key
     shall operate in the index_key. This will allow the possibility of
     multiple keys with the same description, but different target
     domains to be held in the same keyring.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

   - Make it so that keys are implicitly invalidated by removal of a
     domain tag, causing them to be garbage collected.

   - Institute a network namespace domain tag that allows keys to be
     differentiated by the network namespace in which they operate. New
     keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
     the network domain in force when they are created.

   - Make it so that the desired network namespace can be handed down
     into the request_key() mechanism. This allows AFS, NFS, etc. to
     request keys specific to the network namespace of the superblock.

     This also means that the keys in the DNS record cache are
     thenceforth namespaced, provided network filesystems pass the
     appropriate network namespace down into dns_query().

     For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
     cache keyrings, such as idmapper keyrings, also need to set the
     domain tag - for which they need access to the network namespace of
     the superblock"

* tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Pass the network namespace into request_key mechanism
  keys: Network namespace domain tag
  keys: Garbage collect keys for which the domain has been removed
  keys: Include target namespace in match criteria
  keys: Move the user and user-session keyrings to the user_namespace
  keys: Namespace keyring names
  keys: Add a 'recurse' flag for keyring searches
  keys: Cache the hash value to avoid lots of recalculation
  keys: Simplify key description management
2019-07-08 19:36:47 -07:00
Linus Torvalds 3431a940bb Merge branch 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 AVX512 status update from Ingo Molnar:
 "This adds a new ABI that the main scheduler probably doesn't want to
  deal with but HPC job schedulers might want to use: the
  AVX512_elapsed_ms field in the new /proc/<pid>/arch_status task status
  file, which allows the user-space job scheduler to cluster such tasks,
  to avoid turbo frequency drops"

* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation/filesystems/proc.txt: Add arch_status file
  x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status
  proc: Add /proc/<pid>/arch_status
2019-07-08 17:28:57 -07:00
Linus Torvalds dad1c12ed8 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Remove the unused per rq load array and all its infrastructure, by
   Dietmar Eggemann.

 - Add utilization clamping support by Patrick Bellasi. This is a
   refinement of the energy aware scheduling framework with support for
   boosting of interactive and capping of background workloads: to make
   sure critical GUI threads get maximum frequency ASAP, and to make
   sure background processing doesn't unnecessarily move to cpufreq
   governor to higher frequencies and less energy efficient CPU modes.

 - Add the bare minimum of tracepoints required for LISA EAS regression
   testing, by Qais Yousef - which allows automated testing of various
   power management features, including energy aware scheduling.

 - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
   kernel used to modify the scheduler's CPU affinity logic such as
   migrate_disable() - introduce the task->cpus_ptr value instead of
   taking the address of &task->cpus_allowed directly - by Sebastian
   Andrzej Siewior.

 - Misc optimizations, fixes, cleanups and small enhancements - see the
   Git log for details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched/uclamp: Add uclamp support to energy_compute()
  sched/uclamp: Add uclamp_util_with()
  sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
  sched/uclamp: Set default clamps for RT tasks
  sched/uclamp: Reset uclamp values on RESET_ON_FORK
  sched/uclamp: Extend sched_setattr() to support utilization clamping
  sched/core: Allow sched_setattr() to use the current policy
  sched/uclamp: Add system default clamps
  sched/uclamp: Enforce last task's UCLAMP_MAX
  sched/uclamp: Add bucket local max tracking
  sched/uclamp: Add CPU's clamp buckets refcounting
  sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
  sched/debug: Export the newly added tracepoints
  sched/debug: Add sched_overutilized tracepoint
  sched/debug: Add new tracepoint to track PELT at se level
  sched/debug: Add new tracepoints to track PELT at rq level
  sched/debug: Add a new sched_trace_*() helper functions
  sched/autogroup: Make autogroup_path() always available
  sched/wait: Deduplicate code with do-while
  sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
  ...
2019-07-08 16:39:53 -07:00
Linus Torvalds e192832869 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle are:

   - rwsem scalability improvements, phase #2, by Waiman Long, which are
     rather impressive:

       "On a 2-socket 40-core 80-thread Skylake system with 40 reader
        and writer locking threads, the min/mean/max locking operations
        done in a 5-second testing window before the patchset were:

         40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
         40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

        After the patchset, they became:

         40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
         40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

     There's a lot of changes to the locking implementation that makes
     it similar to qrwlock, including owner handoff for more fair
     locking.

     Another microbenchmark shows how across the spectrum the
     improvements are:

       "With a locking microbenchmark running on 5.1 based kernel, the
        total locking rates (in kops/s) on a 2-socket Skylake system
        with equal numbers of readers and writers (mixed) before and
        after this patchset were:

        # of Threads   Before Patch      After Patch
        ------------   ------------      -----------
             2            2,618             4,193
             4            1,202             3,726
             8              802             3,622
            16              729             3,359
            32              319             2,826
            64              102             2,744"

     The changes are extensive and the patch-set has been through
     several iterations addressing various locking workloads. There
     might be more regressions, but unless they are pathological I
     believe we want to use this new implementation as the baseline
     going forward.

   - jump-label optimizations by Daniel Bristot de Oliveira: the primary
     motivation was to remove IPI disturbance of isolated RT-workload
     CPUs, which resulted in the implementation of batched jump-label
     updates. Beyond the improvement of the real-time characteristics
     kernel, in one test this patchset improved static key update
     overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
     as well.

   - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
     ~10 years of atomic64_t existence the various types used by the
     APIs only had to be self-consistent within each architecture -
     which means they became wildly inconsistent across architectures.
     Mark puts and end to this by reworking all the atomic64
     implementations to use 's64' as the base type for atomic64_t, and
     to ensure that this type is consistently used for parameters and
     return values in the API, avoiding further problems in this area.

   - A large set of small improvements to lockdep by Yuyang Du: type
     cleanups, output cleanups, function return type and othr cleanups
     all around the place.

   - A set of percpu ops cleanups and fixes by Peter Zijlstra.

   - Misc other changes - please see the Git log for more details"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
  locking/lockdep: increase size of counters for lockdep statistics
  locking/atomics: Use sed(1) instead of non-standard head(1) option
  locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
  x86/jump_label: Make tp_vec_nr static
  x86/percpu: Optimize raw_cpu_xchg()
  x86/percpu, sched/fair: Avoid local_clock()
  x86/percpu, x86/irq: Relax {set,get}_irq_regs()
  x86/percpu: Relax smp_processor_id()
  x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  locking/rwsem: Guard against making count negative
  locking/rwsem: Adaptive disabling of reader optimistic spinning
  locking/rwsem: Enable time-based spinning on reader-owned rwsem
  locking/rwsem: Make rwsem->owner an atomic_long_t
  locking/rwsem: Enable readers spinning on writer
  locking/rwsem: Clarify usage of owner's nonspinaable bit
  locking/rwsem: Wake up almost all readers in wait queue
  locking/rwsem: More optimal RT task handling of null owner
  locking/rwsem: Always release wait_lock before waking up tasks
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Make rwsem_spin_on_owner() return owner state
  ...
2019-07-08 16:12:03 -07:00
Richard Weinberger 8009ce956c ubifs: Don't leak orphans on memory during commit
If an orphan has child orphans (xattrs), and due
to a commit the parent orpahn cannot get free()'ed immediately,
put also all child orphans on the erase list.
Otherwise UBIFS will free() them only upon unmount and we
waste memory.

Fixes: 988bec4131 ("ubifs: orphan: Handle xattrs like files")
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08 20:01:34 +02:00
Richard Weinberger ee1438ce5d ubifs: Check link count of inodes when killing orphans.
O_TMPFILE files can change their link count back to non-zero.
This corner case needs to get addressed in the orphans subsystem
too.

Fixes: 474b93704f ("ubifs: Implement O_TMPFILE")
Reported-by: Lars Persson <lists@bofh.nu>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08 20:01:33 +02:00
Michele Dionisio eeabb9866e ubifs: Add support for zstd compression.
zstd shows a good compression rate and is faster than lzo,
also on slow ARM cores.

Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Signed-off-by: Michele Dionisio <michele.dionisio@gmail.com>
[rw: rewrote commit message]
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08 19:43:53 +02:00
Sascha Hauer 817aa09484 ubifs: support offline signed images
HMACs can only be generated on the system the UBIFS image is running on.
To support offline signed images we add a PKCS#7 signature to the UBIFS
image which can be created by mkfs.ubifs.

Both the master node and the superblock need to be authenticated, during
normal runtime both are protected with HMACs. For offline signature
support however only a single signature is desired. We add a signature
covering the superblock node directly behind it. To protect the master
node a hash of the master node is added to the superblock which is used
when the master node doesn't contain a HMAC.

Transition to a read/write filesystem is also supported. During
transition first the master node is rewritten with a HMAC (implicitly,
it is written anyway as the FS is marked dirty). Afterwards the
superblock is rewritten with a HMAC. Once after the image has been
mounted read/write it is HMAC only, the signature is no longer required
or even present on the filesystem.

In an offline signed image the master node is authenticated by the
superblock. In a transition to r/w we have to make sure that the master
node is rewritten before the superblock node. In this case the master
node gets a HMAC and its authenticity no longer depends on the
superblock node. There are some cases in which the current code first
writes the superblock node though, so with this patch writing of the
superblock node is delayed until the master node is written.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08 19:43:52 +02:00
Liu Song 8ba0a2ab84 ubifs: remove unnecessary check in ubifs_log_start_commit
In ubifs_log_start_commit, the value of c->lhead_offs is zero or set
to zero by code bellow.

	/* Switch to the next log LEB */
	if (c->lhead_offs) {
		c->lhead_lnum = ubifs_next_log_lnum(c, c->lhead_lnum);
		ubifs_assert(c->lhead_lnum != c->ltail_lnum);
		c->lhead_offs = 0;
	}

The value of 'len' can not exceed 'max_len' which assigned value by
code bellow.

	max_len = UBIFS_CS_NODE_SZ + c->jhead_cnt * UBIFS_REF_NODE_SZ;

The value of c->lhead_offs changed by code bellow and cannot exceed
'max_len'.

	c->lhead_offs += len;
	if (c->lhead_offs == c->leb_size) {
		c->lhead_lnum = ubifs_next_log_lnum(c, c->lhead_lnum);
		c->lhead_offs = 0;
	}

Usually, the size of PEB is between 64KB and 256KB. So the value of
c->lhead_offs is far less than c->leb_size. The check
'if (c->lhead_offs == c->leb_size)' could never to be true.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08 19:43:51 +02:00
Liu Song 7d8c811bf9 ubifs: Fix typo of output in get_cs_sqnum
"Not a CS node" makes more sense than "Node a CS node".

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08 19:43:43 +02:00
Liu Song d5cf9473a3 ubifs: Simplify redundant code
cbuf's size can be simply assigned.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08 19:43:38 +02:00
Richard Weinberger bacfa94b08 ubifs: Correctly use tnc_next() in search_dh_cookie()
Commit c877154d30 fixed an uninitialized variable and optimized
the function to not call tnc_next() in the first iteration of the
loop. While this seemed perfectly legit and wise, it turned out to
be illegal.
If the lookup function does not find an exact match it will rewind
the cursor by 1.
The rewinded cursor will not match the name hash we are looking for
and this results in a spurious -ENOENT.
So we need to move to the next entry in case of an non-exact match,
but not if the match was exact.

While we are here, update the documentation to avoid further confusion.

Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: c877154d30 ("ubifs: Fix uninitialized variable in search_dh_cookie()")
Fixes: 781f675e2d ("ubifs: Fix unlink code wrt. double hash lookups")
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08 19:13:41 +02:00
Ingo Molnar 552a031ba1 Linux 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0idTweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGZesIAJDKicw2Voyx8K8m
 3pXSK+71RuO/d3Y9M51mdfTMKRP4PHR9/4wVZ9wHPwC4dV6wxgsmIYCF69a1Wety
 LD1MpDCP1DK5wVfPNKVX2xmj7ua6iutPtSsJHzdzM2TlscgsrFKjmUccqJ5JLwL5
 c34nqwXWnzzRyI5Ga9cQSlwzAXq0vDHXyML3AnCosSsLX0lKFrHlK1zttdOPNkfj
 dXRN62g3q+9kVQozzhDXb8atZZ7IkBk8Q0lujpNXW83Ci1VjaVNv3SB8GZTXIlLj
 U15VdyuwfJDfpBgFBN6/unzVaAB6FFrEKy0jT1aeTyKarMKDKgOnJjn10aKjDNno
 /bXsKKc=
 =TVqV
 -----END PGP SIGNATURE-----

Merge tag 'v5.2' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-08 18:04:41 +02:00
Luis Henriques d31d07b97a ceph: fix end offset in truncate_inode_pages_range call
Commit e450f4d1a5 ("ceph: pass inclusive lend parameter to
filemap_write_and_wait_range()") fixed the end offset parameter used to
call filemap_write_and_wait_range and invalidate_inode_pages2_range.
Unfortunately it missed truncate_inode_pages_range, introducing a
regression that is easily detected by xfstest generic/130.

The problem is that when doing direct IO it is possible that an extra page
is truncated from the page cache when the end offset is page aligned.
This can cause data loss if that page hasn't been sync'ed to the OSDs.

While there, change code to use PAGE_ALIGN macro instead.

Cc: stable@vger.kernel.org
Fixes: e450f4d1a5 ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:45 +02:00
Luis Henriques 52dd0f1b3f ceph: use generic_delete_inode() for ->drop_inode
ceph_drop_inode() implementation is not any different from the generic
function, thus there's no point in keeping it around.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:45 +02:00
Yan, Zheng 87bc5b895d ceph: use ceph_evict_inode to cleanup inode's resource
remove_session_caps() relies on __wait_on_freeing_inode(), to wait for
freeing inode to remove its caps. But VFS wakes freeing inode waiters
before calling destroy_inode().

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/40102
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:45 +02:00
Luis Henriques 0f7cf80ae9 ceph: initialize superblock s_time_gran to 1
Having granularity set to 1us results in having inode timestamps with a
accurancy different from the fuse client (i.e. atime, ctime and mtime will
always end with '000').  This patch normalizes this behaviour and sets the
granularity to 1.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:45 +02:00
Ilya Dryomov 94e8577188 libceph: rename r_unsafe_item to r_private_item
This list item remained from when we had safe and unsafe replies
(commit vs ack).  It has since become a private list item for use by
clients.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Jeff Layton 26350535c2 ceph: don't NULL terminate virtual xattrs
The convention with xattrs is to not store the termination with string
data, given that it returns the length. This is how setfattr/getfattr
operate.

Most of ceph's virtual xattr routines use snprintf to plop the string
directly into the destination buffer, but snprintf always NULL
terminates the string. This means that if we send the kernel a buffer
that is the exact length needed to hold the string, it'll end up
truncated.

Add a ceph_fmt_xattr helper function to format the string into an
on-stack buffer that should always be large enough to hold the whole
thing and then memcpy the result into the destination buffer. If it does
turn out that the formatted string won't fit in the on-stack buffer,
then return -E2BIG and do a WARN_ONCE().

Change over most of the virtual xattr routines to use the new helper. A
couple of the xattrs are sourced from strings however, and it's
difficult to know how long they'll be. Just have those memcpy the result
in place after verifying the length.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Jeff Layton 3b421018f4 ceph: return -ERANGE if virtual xattr value didn't fit in buffer
The getxattr manpage states that we should return ERANGE if the
destination buffer size is too small to hold the value.
ceph_vxattrcb_layout does this internally, but we should be doing
this for all vxattrs.

Fix the only caller of getxattr_cb to check the returned size
against the buffer length and return -ERANGE if it doesn't fit.
Drop the same check in ceph_vxattrcb_layout and just rely on the
caller to handle it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Jeff Layton f1d1b51dea ceph: make getxattr_cb return ssize_t
The getxattr_cb functions return size_t, which is unsigned and then
cast that value to int and then ssize_t before returning it. While all
of this works, it relies on implicit casting rules for signed/unsigned
conversions.

Change getxattr_cb to return ssize_t to better conform with what the
caller actually wants. Also, remove some suspicious casts.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Yan, Zheng 49ada6e8dc ceph: more precise CEPH_CLIENT_CAPS_PENDING_CAPSNAP
Client uses this flag to tell mds if there is more cap snap need to
flush. It's mainly for the case that client needs to re-send cap/snap
flushes after mds failover, but CEPH_CAP_ANY_FILE_WR on corresponding
inodes are all released before mds failover.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Yan, Zheng d6cee9dbd8 ceph: kick flushing and flush snaps before sending normal cap message
Otherwise client may send cap flush messages in wrong order.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Yan, Zheng 054f8d41af ceph: clear CEPH_I_KICK_FLUSH flag inside __kick_flushing_caps()
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Jeff Layton 5c30835690 ceph: increment change_attribute on local changes
We don't set SB_I_VERSION on ceph since we need to manage it ourselves,
so we must increment it whenever we update the file times.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:44 +02:00
Jeff Layton 176c77c9c9 ceph: handle change_attr in cap messages
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton a35ead314e ceph: add change_attr field to ceph_inode_info
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton 58981784a6 ceph: allow querying of STATX_BTIME in ceph_getattr
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton ec62b894df ceph: handle btime in cap messages
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton 245ce991cc ceph: add btime field to ceph_inode_info
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Jeff Layton f3848af1bf ceph: have MDS map decoding use entity_addr_t decoder
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:43 +02:00
Yan, Zheng 428138c989 ceph: remove request from waiting list before unregister
Link: https://tracker.ceph.com/issues/40339
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng 6f0f597b5d ceph: don't blindly unregister session that is in opening state
handle_cap_export() may add placeholder caps to session that is in
opening state. These caps' session pointer become wild after session get
unregistered.

The fix is not to unregister session in opening state during mds failovers,
just let client to reconnect later when mds is recovered.

Link: https://tracker.ceph.com/issues/40190
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng 2ef5df1abe ceph: fix infinite loop in get_quota_realm()
get_quota_realm() enters infinite loop if quota inode has no caps.
This can happen after client gets evicted.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng ac6713ccb5 ceph: add selinux support
When creating new file/directory, use security_dentry_init_security() to
prepare selinux context for the new inode, then send openc/mkdir request
to MDS, together with selinux xattr.

security_dentry_init_security() only supports single security module and
only selinux has dentry_init_security hook. So only selinux is supported
for now. We can add support for other security modules once kernel has a
generic version of dentry_init_security()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng 5c31e92dff ceph: rename struct ceph_acls_info to ceph_acl_sec_ctx
Also rename ceph_release_acls_info() to ceph_release_acl_sec_ctx().
And move their definitions to different files. This is preparation
for security label support.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng 057297812d ceph: fix debug print format in __set_xattr()
name is not '\0' terminated.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Hariprasad Kelam 03af439ad9 ceph: fix warning PTR_ERR_OR_ZERO can be used
change1: fix below warning  reported by coccicheck

/fs/ceph/export.c:371:33-39: WARNING: PTR_ERR_OR_ZERO can be used

change2: typecasted PTR_ERR_OR_ZERO to long as dout expecting long

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng d6e4781972 ceph: hold i_ceph_lock when removing caps for freeing inode
ceph_d_revalidate(, LOOKUP_RCU) may call __ceph_caps_issued_mask()
on a freeing inode.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng 8f2a98ef3c ceph: ensure d_name/d_parent stability in ceph_mdsc_lease_send_msg()
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng 41883ba8ee ceph: use READ_ONCE to access d_parent in RCU critical section
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng feab6ac25d ceph: fix dir_lease_is_valid()
It should call __ceph_dentry_dir_lease_touch() under dentry->d_lock.
Besides, ceph_dentry(dentry) can be NULL when called by LOOKUP_RCU
d_revalidate()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Yan, Zheng 543212b3a4 ceph: close race between d_name_cmp() and update_dentry_lease()
d_name_cmp() and update_dentry_lease() lock and unlock dentry->d_lock
respectively. Dentry may get renamed between them. The fix is moving
the dentry name compare into update_dentry_lease().

This patch introduce two version of update_dentry_lease(). One version
is for the case that parent inode is locked. It does not need to check
parent/target inode and dentry name. Another version is for the case
that parent inode is not locked. It checks parent/target inode and
dentry name after locking dentry->d_lock.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
Andrea Parri 749607731e ceph: fix improper use of smp_mb__before_atomic()
This barrier only applies to the read-modify-write operations; in
particular, it does not apply to the atomic64_set() primitive.

Replace the barrier with an smp_mb().

Fixes: fdd4e15838 ("ceph: rework dcache readdir")
Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:42 +02:00
David Disseldorp 718807289d ceph: fix "ceph.dir.rctime" vxattr value
The vxattr value incorrectly places a "09" prefix to the nanoseconds
field, instead of providing it as a zero-pad width specifier after '%'.

Fixes: 3489b42a72 ("ceph: fix three bugs, two in ceph_vxattrcb_file_layout()")
Link: https://tracker.ceph.com/issues/39943
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:41 +02:00
David Disseldorp d0f191d20c ceph: remove unused vxattr length helpers
ceph_listxattr() now calculates the length of vxattrs dynamically, so
these helpers, which incorrectly ignore vxattr.exists_cb(), can be
removed.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:41 +02:00
David Disseldorp 2b2abcac8c ceph: fix listxattr vxattr buffer length calculation
ceph_listxattr() incorrectly returns a length based on the static
ceph_vxattrs_name_size() value, which only takes into account whether
vxattrs are hidden, ignoring vxattr.exists_cb().

When filling the xattr buffer ceph_listxattr() checks VXATTR_FLAG_HIDDEN
and vxattr.exists_cb(). If both are false, we return an incorrect
(oversize) length.

Fix this behaviour by always calculating the vxattrs length at runtime,
taking both vxattr.hidden and vxattr.exists_cb() into account.

This bug is only exposed with the new "ceph.snap.btime" vxattr, as all
other vxattrs with a non-null exists_cb also carry VXATTR_FLAG_HIDDEN.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:41 +02:00
David Disseldorp 100cc610a5 ceph: add ceph.snap.btime vxattr
The ceph.snap.btime virtual xattr provides the snapshot creation (birth)
time in $secs.$nsecs format.

Link: https://tracker.ceph.com/issues/38838
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:41 +02:00
David Disseldorp 193e7b3762 ceph: carry snapshot creation time with inodes
MDS InodeStat v3 wire structures include a trailing snapshot creation
time member. Unmarshall this and retain it for a future vxattr.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:40 +02:00
David Disseldorp e1b8143914 ceph: clean up ceph.dir.pin vxattr name sizeof()
.name_size should use the same string as .name.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:40 +02:00
Dan Carpenter 13c41737b9 ceph: silence a checker warning in mdsc_show()
The problem is that if ceph_mdsc_build_path() fails then we set "path"
to NULL and the "pathlen" variable is uninitialized.  Then we call
ceph_mdsc_free_path(path, pathlen) to clean up.  Since "path" is NULL,
the function is a no-op but Smatch and UBSan still complain that
"pathlen" is uninitialized.

This patch doesn't change run time, it just silence the warnings.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08 14:01:40 +02:00
Greg Kroah-Hartman c33d442328 debugfs: make error message a bit more verbose
When a file/directory is already present in debugfs, and it is attempted
to be created again, be more specific about what file/directory is being
created and where it is trying to be created to give a bit more help to
developers to figure out the problem.

Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20190706154256.GA2683@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-08 10:44:57 +02:00
Ronnie Sahlberg f5f111c231 cifs: refactor and clean up arguments in the reparse point parsing
Will be helpful as we improve handling of special file types.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:44 -05:00
Steve French ff2a09e919 SMB3: query inode number on open via create context
We can cut the number of roundtrips on open (may also
help some rename cases as well) by returning the inode
number in the SMB2 open request itself instead of
querying it afterwards via a query FILE_INTERNAL_INFO.
This should significantly improve the performance of
posix open.

Add SMB2_CREATE_QUERY_ON_DISK_ID create context request
on open calls so that when server supports this we
can save a roundtrip for QUERY_INFO on every open.

Follow on patch will add the response processing for
SMB2_CREATE_QUERY_ON_DISK_ID context and optimize
smb2_open_file to avoid the extra network roundtrip
on every posix open. This patch adds the context on
SMB2/SMB3 open requests.

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:44 -05:00
Steve French 96d3cca124 smb3: Send netname context during negotiate protocol
See MS-SMB2 2.2.3.1.4

Allows hostname to be used by load balancers

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Steve French 9fe5ff1c5d smb3: do not send compression info by default
Since in theory a server could respond with compressed read
responses even if not requested on read request (assuming that
a compression negcontext is sent in negotiate protocol) - do
not send compression information during negotiate protocol
unless the user asks for compression explicitly (compression
is experimental), and add a mount warning that compression
is experimental.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-07-07 22:37:43 -05:00
Steve French 412094a8fb smb3: add new mount option to retrieve mode from special ACE
There is a special ACE used by some servers to allow the mode
bits to be stored.  This can be especially helpful in scenarios
in which the client is trusted, and access checking on the
client vs the POSIX mode bits is sufficient.

Add mount option to allow enabling this behavior.
Follow on patch will add support for chmod and queryinfo
(stat) by retrieving the POSIX mode bits from the special
ACE, SID: S-1-5-88-3

See e.g.
https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/hh509017(v=ws.10)

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-07-07 22:37:43 -05:00
Steve French d5ecebc490 smb3: Allow query of symlinks stored as reparse points
The 'NFS' style symlinks (see MS-FSCC 2.1.2.4) were not
being queried properly in query_symlink. Fix this.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-07-07 22:37:43 -05:00
Ronnie Sahlberg f2caf901c1 cifs: Fix a race condition with cifs_echo_request
There is a race condition with how we send (or supress and don't send)
smb echos that will cause the client to incorrectly think the
server is unresponsive and thus needs to be reconnected.

Summary of the race condition:
 1) Daisy chaining scheduling creates a gap.
 2) If traffic comes unfortunate shortly after
    the last echo, the planned echo is suppressed.
 3) Due to the gap, the next echo transmission is delayed
    until after the timeout, which is set hard to twice
    the echo interval.

This is fixed by changing the timeouts from 2 to three times the echo interval.

Detailed description of the bug: https://lutz.donnerhacke.de/eng/Blog/Groundhog-Day-with-SMB-remount

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Ronnie Sahlberg 3e2725796c cifs: always add credits back for unsolicited PDUs
not just if CONFIG_CIFS_DEBUG2 is enabled.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Hariprasad Kelam 0aa3a24be0 fs: cifs: cifsssmb: Change return type of convert_ace_to_cifs_ace
Change return from int to void of  convert_ace_to_cifs_ace as it never
fails.

fixes below issue reported by coccicheck
fs/cifs/cifssmb.c:3606:7-9: Unneeded variable: "rc". Return "0" on line
3620

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Steve French e7348e35a3 add some missing definitions
query on disk id structure definition was missing

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Colin Ian King 63d614a608 cifs: fix typo in debug message with struct field ia_valid
Field ia_valid is being debugged with the field name iavalid, fix this.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Aurelien Aptel 3190b59a05 smb3: minor cleanup of compound_send_recv
Trivial cleanup. Will make future multichannel code smaller
as well.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Steve French e7a1a2df4d CIFS: Fix module dependency
KEYS is required not that CONFIG_CIFS_ACL is always on
and the ifdef for it removed.

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Steve French 73cf8085dc cifs: simplify code by removing CONFIG_CIFS_ACL ifdef
SMB3 ACL support is needed for many use cases now and should not be
ifdeffed out, even for SMB1 (CIFS).  Remove the CONFIG_CIFS_ACL
ifdef so ACL support is always built into cifs.ko

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Steve French 6552d6a026 cifs: Fix check for matching with existing mount
If we mount the same share twice, we check the flags to see if the
second mount matches the earlier mount, but we left some flags out.

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Paulo Alcantara (SUSE) 29fbeb7a90 cifs: Properly handle auto disabling of serverino option
Fix mount options comparison when serverino option is turned off later
in cifs_autodisable_serverino() and thus avoiding mismatch of new cifs
mounts.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <paulo@paulo.ac>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilove@microsoft.com>
2019-07-07 22:37:43 -05:00
Steve French dc179268cd smb3: if max_credits is specified then display it in /proc/mounts
If "max_credits" is overridden from its default by specifying
it on the smb3 mount then display it in /proc/mounts

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:43 -05:00
Steve French 43cdae88de Fix match_server check to allow for auto dialect negotiate
When using multidialect negotiate (default or specifying vers=3.0 which
allows any smb3 dialect), fix how we check for an existing server session.
Before this fix if you mounted a second time to the same server (e.g. a
different share on the same server) we would only reuse the existing smb
session if a single dialect were requested (e.g. specifying vers=2.1 or vers=3.0
or vers=3.1.1 on the mount command). If a default mount (e.g. not
specifying vers=) is done then would always create a new socket connection
and SMB3 (or SMB3.1.1) session each time we connect to a different share
on the same server rather than reusing the existing one.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-07-07 22:37:42 -05:00
Aurelien Aptel 5fc3681fa5 cifs: add missing GCM module dependency
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:42 -05:00
Steve French 2b2f754807 SMB3.1.1: Add GCM crypto to the encrypt and decrypt functions
SMB3.1.1 GCM performs much better than the older CCM default:
more than twice as fast in the write patch (copy to the Samba
server on localhost for example) and 80% faster on the read
patch (copy from the server).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-07-07 22:37:42 -05:00
Steve French 9ac63ec776 SMB3: Add SMB3.1.1 GCM to negotiated crypto algorigthms
GCM is faster. Request it during negotiate protocol.
Followon patch will add callouts to GCM crypto

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-07-07 22:37:42 -05:00
Kefeng Wang 06f2fca7ff fs: cifs: Drop unlikely before IS_ERR(_OR_NULL)
IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag,
so no need to do that again from its callers. Drop it.

Cc: linux-cifs@vger.kernel.org
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
2019-07-07 22:37:42 -05:00
YueHaibing d81f09748d cifs: Use kmemdup in SMB2_ioctl_init()
Use kmemdup rather than duplicating its implementation

This was reported by coccinelle.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-07-07 22:37:42 -05:00
Darrick J. Wong 211bbf3c38 xfs: don't update lastino for FSBULKSTAT_SINGLE
The kernel test robot found a regression of xfs/054 in the conversion of
bulkstat to use the new iwalk infrastructure -- if a caller set *lastip
= 128 and invoked FSBULKSTAT_SINGLE, the bstat info would be for inode
128, but *lastip would be increased by the kernel to 129.

FSBULKSTAT_SINGLE never incremented lastip before, so it's incorrect to
make such an update to the internal lastino value now.

Fixes: 2810bd6840 ("xfs: convert bulkstat to new iwalk infrastructure")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-06 21:05:10 -07:00
Benjamin Coddington 9f7761cf04 NFS: Cleanup if nfs_match_client is interrupted
Don't bail out before cleaning up a new allocation if the wait for
searching for a matching nfs client is interrupted.  Memory leaks.

Reported-by: syzbot+7fe11b49c1cc30e3fce2@syzkaller.appspotmail.com
Fixes: 950a578c61 ("NFS: make nfs_match_client killable")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:53 -04:00
Darrick J. Wong 9026b3a973 nfs: disable client side deduplication
The NFS protocol doesn't support deduplication, so turn it off again.

Fixes: ce96e888fe ("Fix nfs4.2 return -EINVAL when do dedupe operation")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:53 -04:00
Dave Wysochanski 1a7441b282 NFSv4: Add lease_time and lease_expired to 'nfs4:' line of mountstats
On the NFS client there is no low-impact way to determine the nfs4
lease time or whether the lease is expired, so add these to mountstats
with times displayed in seconds.

If the lease is not expired, display lease_expired=0. Otherwise,
display lease_expired=seconds_since_expired, similar to 'age:' line
in mountstats.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:52 -04:00
Trond Myklebust 2b17d725f9 NFS: Clean up writeback code
Now that the VM promises never to recurse back into the filesystem
layer on writeback, remove all the GFP_NOFS references etc from
the generic writeback code.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:52 -04:00
Trond Myklebust c98ebe2937 Merge branch 'multipath_tcp' 2019-07-06 14:54:52 -04:00
Trond Myklebust 28ade856c0 Merge branch 'containers' 2019-07-06 14:54:52 -04:00
NeilBrown 5a0c257f8e NFS: send state management on a single connection.
With NFSv4.1, different network connections need to be explicitly
bound to a session.  During session startup, this is not possible
so only a single connection must be used for session startup.

So add a task flag to disable the default round-robin choice of
connections (when nconnect > 1) and force the use of a single
connection.
Then use that flag on all requests for session management - for
consistence, include NFSv4.0 management (SETCLIENTID) and session
destruction

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:50 -04:00
Trond Myklebust 53c3263071 NFS: Allow multiple connections to a NFSv2 or NFSv3 server
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:50 -04:00
Trond Myklebust fd87c8b73a NFS: Display the "nconnect" mount option if it is set.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2019-07-06 14:54:50 -04:00
Trond Myklebust bb71e4a5d7 pNFS: Allow multiple connections to the DS
If the user specifies -onconnect=<number> mount option, and the transport
protocol is TCP, then set up <number> connections to the pNFS data server
as well. The connections will all go to the same IP address.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2019-07-06 14:54:50 -04:00
Trond Myklebust 6619079d05 NFSv4: Allow multiple connections to NFSv4.x (x>0) servers
If the user specifies the -onconn=<number> mount option, and the transport
protocol is TCP, then set up <number> connections to the server. The
connections will all go to the same IP address.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2019-07-06 14:54:50 -04:00
Trond Myklebust 28cc5cd8c6 NFS: Add a mount option to specify number of TCP connections to use
Allow the user to specify that the client should use multiple connections
to the server. For the moment, this functionality will be limited to
TCP and to NFSv4.x (x>0).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2019-07-06 14:54:50 -04:00
Trond Myklebust bf11fbdb20 NFS: Add sysfs support for per-container identifier
In order to identify containers to the NFS client, we add a per-net
sysfs attribute that udev can fill with the appropriate identifier.
The identifier could be a unique hostname, but in most cases it
will probably be a persisted uuid.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:49 -04:00
Trond Myklebust 1c341b7775 NFS: Add deferred cache invalidation for close-to-open consistency violations
If the client detects that close-to-open cache consistency has been
violated, and that the file or directory has been changed on the
server, then do a cache invalidation when we're done working with
the file.
The reason we don't do an immediate cache invalidation is that we
want to avoid performance problems due to false positives. Also,
note that we cannot guarantee cache consistency in this situation
even if we do invalidate the cache.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:49 -04:00
Trond Myklebust 10b7a70cbb NFS: Cleanup - add nfs_clients_exit to mirror nfs_clients_init
Add a helper to clean up the struct nfs_net when it is being destroyed.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:49 -04:00
Trond Myklebust 996bc4f405 NFS: Create a root NFS directory in /sys/fs/nfs
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:49 -04:00
Trond Myklebust 44942b4e45 NFSv4: Handle the special Linux file open access mode
According to the open() manpage, Linux reserves the access mode 3
to mean "check for read and write permission on the file and return
a file descriptor that can't be used for reading or writing."

Currently, the NFSv4 code will ask the server to open the file,
and will use an incorrect share access mode of 0. Since it has
an incorrect share access mode, the client later forgets to send
a corresponding close, meaning it can leak stateids on the server.

Fixes: ce4ef7c0a8 ("NFS: Split out NFS v4 file operations")
Cc: stable@vger.kernel.org # 3.6+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:25 -04:00
Trond Myklebust 1bf85d8c98 NFSv4: Handle open for execute correctly
When mapping the NFSv4 context to an open mode and access mode,
we need to treat the FMODE_EXEC flag differently. For the open
mode, FMODE_EXEC means we need read share access. For the access
mode checking, we need to verify that the user actually has
execute access.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06 14:54:25 -04:00
Linus Torvalds ceacbc0e14 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixlet from Al Viro:
 "Fix bogus default y in Kconfig (VALIDATE_FS_PARSER)

  That thing should not be turned on by default, especially since it's
  not quiet in case it finds no problems. Geert has sent the obvious fix
  quite a few times, but it fell through the cracks"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: VALIDATE_FS_PARSER should default to n
2019-07-06 09:53:08 -07:00
Linus Torvalds a8f46b5afe Two more quick bugfixes for nfsd, fixing a regression causing mount
failures on high-memory machines and fixing the DRC over RDMA.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl0fiP4VHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG++dwP/RkuKV6sjmopr5/SNK334UDGbpAk
 h+VWAJSrnrLDqk2Ezwz4cO8XrPxMEiQQdKtiyIM51oshq9EN0u0gdOS7ycdz9mAm
 qm1WgekRH3vNiNYU+im2r7CXXrBUYeggi1clbOoJnEwsKxV0qFG74OxO8fB5gNMP
 Jeq46nofwSAjxLQBwTkHJhs0cV7rmJhq6mVWJ4lzD6JTzMH7FqO0zoJJHfyP5xb+
 SXSheqWK4eTKhvPJR0JwyiUBUXYzNZyDoNlyRCSVylfxha3cTxF0GeG1Pm2uS+sm
 V8QOnqud+4cF5Qa2zAZ2T/w7dgsfouAODQOi/gzDwE0EM+FojbOnk0CJwL7wuzB3
 flkmWOMER0RV92z7gWqh5JDQFoHeVkldZfFTrYfVWIcPV+pGLiRayzt+dlSbpaYj
 09jMlBLHXxHwCqPT5u8GFKveMNluYyIoKc3s38eojX2u3eg9HNsXoCfmbC2RGaZ4
 iT2E5isl7donclTHDKEU7RkWnaboSQoB+oodMWH8TN7p9FfgpsaObWTebrsbvvBN
 DMrdc+nJ+x78krI9XKpSQOmPpJH9siaIFn6nVq5oLNjytaH+UrrArvkLMKhuSW6L
 8u5fU3SvL0Eriuz7EtgZwosy+VPvFpXgQVMdJ+z/cm32mOeFgcz4BNEMaQuFJVaN
 fbY6fM46Xngo0A3x
 =f2gw
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Two more quick bugfixes for nfsd: fixing a regression causing mount
  failures on high-memory machines and fixing the DRC over RDMA"

* tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix overflow causing non-working mounts on 1 TB machines
  svcrdma: Ignore source port when computing DRC hash
2019-07-05 19:00:37 -07:00
Pankaj Gupta b21fec4140 xfs: disable map_sync for async flush
Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and xfs.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-05 15:19:10 -07:00
Pankaj Gupta e46bfc3f03 ext4: disable map_sync for async flush
Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and ext4.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-05 15:19:10 -07:00
Darrick J. Wong 036f463fe1 xfs: online scrub needn't bother zeroing its temporary buffer
The xattr scrubber functions use the temporary memory buffer either for
storing bitmaps or for testing if attribute value extraction works.  The
bitmap code always zeroes what it needs and the value extraction sets
the buffer contents, so it's not necessary to waste CPU time zeroing on
allocation.

Note that while we never read the contents that the attr value
extraction function sets, we do need to call it to check the remote
attribute header and CRCs to check for corruption.

A flame graph analysis showed that we were spending 7% of a xfs_scrub
run (the whole program, not just the attr scrubber itself) allocating
and zeroing 64k segments needlessly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:56 -07:00
Darrick J. Wong 6d6ccedd76 xfs: only allocate memory for scrubbing attributes when we need it
In examining a flame graph of time spent running xfs_scrub on various
filesystems, I noticed that we spent nearly 7% of the total runtime on
allocating a zeroed 65k buffer for every SCRUB_TYPE_XATTR invocation.
We do this even if none of the attribute values were anywhere near 64k
in size, even if there were no attribute blocks to check space on, and
even if it just turns out there are no attributes at all.

Therefore, rearrange the xattr buffer setup code to support reallocating
with a bigger buffer and redistribute the callers of that function so
that we only allocate memory just prior to needing it, and only allocate
as much as we need.  If we can't get memory with the ILOCK held we'll
bail out with EDEADLOCK which will allocate the maximum memory.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:56 -07:00
Darrick J. Wong 0081675933 xfs: refactor attr scrub memory allocation function
Move the code that allocates memory buffers for the extended attribute
scrub code into a separate function so we can reduce memory allocations
in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:55 -07:00
Darrick J. Wong 3addd24880 xfs: refactor extended attribute buffer pointer functions
Replace the open-coded attribute buffer pointer calculations with helper
functions to make it more obvious what we're doing with our freeform
memory allocation w.r.t. either storing xattr values or computing btree
block free space.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:55 -07:00
Darrick J. Wong 2c3b83d7ca xfs: attribute scrub should use seen_enough to pass error values
When we're iterating all the attributes using the built-in xattr
iterator, we can use the seen_enough variable to pass error codes back
to the main scrub function instead of flattening them into 0/1.  This
will be used in a more exciting fashion in upcoming patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-05 10:29:54 -07:00
Colin Ian King e02d48eaae btrfs: fix memory leak of path on error return path
Currently if the allocation of roots or tmp_ulist fails the error handling
does not free up the allocation of path causing a memory leak. Fix this and
other similar leaks by moving the call of btrfs_free_path from label out
to label out_free_ulist.

Kudos to David Sterba for spotting the issue in my original fix and suggesting
the correct way to fix the leak and Anand Jain for spotting a double free
issue.

Addresses-Coverity: ("Resource leak")
Fixes: 5911c8fe05 ("btrfs: fiemap: preallocate ulists for btrfs_check_shared")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-05 18:47:57 +02:00
Geert Uytterhoeven 75f2d86b20 fs: VALIDATE_FS_PARSER should default to n
CONFIG_VALIDATE_FS_PARSER is a debugging tool to check that the parser
tables are vaguely sane.  It was set to default to 'Y' for the moment to
catch errors in upcoming fs conversion development.

Make sure it is not enabled by default in the final release of v5.1.

Fixes: 31d921c7fb ("vfs: Add configuration parser helpers")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-05 11:22:11 -04:00
Linus Torvalds a5fff14a0c Merge branch 'akpm' (patches from Andrew)
Merge more fixes from Andrew Morton:
 "5 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  swap_readpage(): avoid blk_wake_io_task() if !synchronous
  devres: allow const resource arguments
  mm/vmscan.c: prevent useless kswapd loops
  fs/userfaultfd.c: disable irqs for fault_pending and event locks
  mm/page_alloc.c: fix regression with deferred struct page init
2019-07-05 11:39:56 +09:00
Linus Torvalds cde357c392 dax fix v5.2-rc8
- Fix xarray entry association for mixed mappings
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdHpMDAAoJEB7SkWpmfYgCJ0wP/2fgMgoh1YBv4DU4gsNs328G
 pyTX3i6d+KoEmfvlcoOzU3lNtjf2S81H3QIkfTJG75uE9/3jNKOh+tPunj3/wIOv
 iHbpBZVK5OpE2f9FFNM785cTt7hBqBtR38N/PdKGFPSMzN9vX794rmvS8Ri0bHd4
 9zmSvdnrkdu0U4BGmBRZdUOCUIdDrtQClKJBRtG5Ksb194zf7lt/jm4k9WFMqfYA
 89mR/KHQhmhDnjyBynQa0TRtShlf/DsxtWiPyLT9FzD1RZt9+tFVLANRQEmFFAp2
 eb+b+LT35AdEEwErv7RkCCGSGKA7KXy7+hyETsoPMBXG08Q77nogG+zb/j0wCwvK
 SsQpo1aqmIeJRBQOXbYeqhn6VR9YuVMFlgaSfOcn/noQDHUIeKis2pnOMeKX36MN
 xHTQm9hSgG+0Des5UoAd4eNH5fDmwuUJK4o4kVGG3pKPuTzyW1gLyo7ItewleE7c
 7rOhLMU55mqUizxsBfEOO6qiJED64iQ2K3mve5I22YetaYCfdrNnU8x5iS9FEcir
 CATYHilKevowwVZg2a7Iy5FquYsQghMhZe4iF5gzjF+zEpj0Wi8S0nVsCtX8Js3S
 p/f3TFmc2i4ZCFPbOAAdYOuagU6pFbuSFQyjtljo5sjD3JKBXitgEaAepFXMtVSA
 EXEQYBgj5CoQ8ClF/eiA
 =NNXl
 -----END PGP SIGNATURE-----

Merge tag 'dax-fix-5.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax fix from Dan Williams:
 "A single dax fix that has been soaking awaiting other fixes under
  discussion to join it. As it is getting late in the cycle lets proceed
  with this fix and save follow-on changes for post-v5.3-rc1.

   - Fix xarray entry association for mixed mappings"

* tag 'dax-fix-5.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: Fix xarray entry association for mixed mappings
2019-07-05 11:32:11 +09:00
Linus Torvalds 2cd7cdc7e4 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull do_move_mount() fix from Al Viro:
 "Regression fix"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: move_mount: reject moving kernel internal mounts
2019-07-05 11:21:36 +09:00
Eric Biggers cbcfa130a9 fs/userfaultfd.c: disable irqs for fault_pending and event locks
When IOCB_CMD_POLL is used on a userfaultfd, aio_poll() disables IRQs
and takes kioctx::ctx_lock, then userfaultfd_ctx::fd_wqh.lock.

This may have to wait for userfaultfd_ctx::fd_wqh.lock to be released by
userfaultfd_ctx_read(), which in turn can be waiting for
userfaultfd_ctx::fault_pending_wqh.lock or
userfaultfd_ctx::event_wqh.lock.

But elsewhere the fault_pending_wqh and event_wqh locks are taken with
IRQs enabled.  Since the IRQ handler may take kioctx::ctx_lock, lockdep
reports that a deadlock is possible.

Fix it by always disabling IRQs when taking the fault_pending_wqh and
event_wqh locks.

Commit ae62c16e10 ("userfaultfd: disable irqs when taking the
waitqueue lock") didn't fix this because it only accounted for the
fd_wqh lock, not the other locks nested inside it.

Link: http://lkml.kernel.org/r/20190627075004.21259-1-ebiggers@kernel.org
Fixes: bfe4037e72 ("aio: implement IOCB_CMD_POLL")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reported-by: syzbot+fab6de82892b6b9c6191@syzkaller.appspotmail.com
Reported-by: syzbot+53c0b767f7ca0dc0c451@syzkaller.appspotmail.com
Reported-by: syzbot+a3accb352f9c22041cfa@syzkaller.appspotmail.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05 11:12:07 +09:00
Al Viro 037f11b475 mnt_init(): call shmem_init() unconditionally
No point having two call sites (earlier in init_rootfs() from
mnt_init() in case we are going to use shmem-style rootfs,
later from do_basic_setup() unconditionally), along with the
logics in shmem_init() itself to make the second call a no-op...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:59 -04:00
Al Viro 33488845f2 constify ksys_mount() string arguments
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:59 -04:00
Al Viro fd3e007f6c don't bother with registering rootfs
init_mount_tree() can get to rootfs_fs_type directly and that simplifies
a lot of things.  We don't need to register it, we don't need to look
it up *and* we don't need to bother with preventing subsequent userland
mounts.  That's the way we should've done that from the very beginning.

There is a user-visible change, namely the disappearance of "rootfs"
from /proc/filesystems.  Note that it's been unmountable all along
and it didn't show up in /proc/mounts; however, it *is* a user-visible
change and theoretically some script might've been using its presence
in /proc/filesystems to tell 2.4.11+ from earlier kernels.

*IF* any complaints about behaviour change do show up, we could fake
it in /proc/filesystems.  I very much doubt we'll have to, though.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:59 -04:00
Al Viro 14a253ce42 init_rootfs(): don't bother with init_ramfs_fs()
the only thing done by the latter is making ramfs visible
to mount(2); we don't need it there - rootfs is separate
and, in fact, made visible to mount(2) in the same init_rootfs().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:59 -04:00
David Howells 7ab2fa7693 vfs: Convert openpromfs to use the new mount API
Convert the openpromfs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:59 -04:00
David Howells 4799974555 vfs: Convert efivarfs to use the new mount API
Convert the efivarfs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

[AV: get rid of efivarfs_sb nonsense - it has never been used]

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Garrett <matthew.garrett@nebula.com>
cc: Jeremy Kerr <jk@ozlabs.org>
cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
cc: linux-efi@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:59 -04:00
David Howells 6bc62f2067 vfs: Convert configfs to use the new mount API
Convert the configfs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Joel Becker <jlbec@evilplan.org>
cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:58 -04:00
David Howells bc99a664e9 vfs: Convert binfmt_misc to use the new mount API
Convert the binfmt_misc filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:58 -04:00
Al Viro c23a0bbab3 convenience helper: get_tree_single()
counterpart of mount_single(); switch fusectl to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:58 -04:00
Al Viro 2ac295d4f0 convenience helper get_tree_nodev()
counterpart of mount_nodev().  Switch hugetlb and pseudo to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 22:01:58 -04:00
Al Viro e4e59906cf fs/namespace.c: shift put_mountpoint() to callers of unhash_mnt()
make unhash_mnt() return the mountpoint to be dropped, let callers
deal with it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 18:58:38 -04:00
Al Viro adc9b5c091 __detach_mounts(): lookup_mountpoint() can't return ERR_PTR() anymore
... not since 1e9c75fb9c ("mnt: fix __detach_mounts infinite loop")

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 18:58:37 -04:00
Al Viro 1cfb7072c1 nfs: dget_parent() never returns NULL
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 18:58:37 -04:00
Al Viro 516162b92d ceph: don't open-code the check for dead lockref
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04 18:58:23 -04:00
Josef Bacik 28a32d2b1a btrfs: move the subvolume reservation stuff out of extent-tree.c
This is just two functions, put it in root-tree.c since it involves root
items.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04 17:26:18 +02:00
Josef Bacik 867363429d btrfs: migrate the delalloc space stuff to it's own home
We have code for data and metadata reservations for delalloc.  There's
quite a bit of code here, and it's used in a lot of places so I've
separated it out to it's own file.  inode.c and file.c are already
pretty large, and this code is complicated enough to live in its own
space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04 17:26:17 +02:00
Josef Bacik fb6dea2660 btrfs: migrate btrfs_trans_release_chunk_metadata
Move this into transaction.c with the rest of the transaction related
code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04 17:26:17 +02:00
Josef Bacik 6ef03debdb btrfs: migrate the delayed refs rsv code
These belong with the delayed refs related code, not in extent-tree.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04 17:26:17 +02:00
Goldwyn Rodrigues 9978059be8 btrfs: Evaluate io_tree in find_lock_delalloc_range()
Simplification.  No point passing the tree variable when it can be
evaluated from inode. The tests now use the io_tree from btrfs_inode as
opposed to creating one.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04 17:26:17 +02:00
Andreas Gruenbacher bb4cb25dd3 gfs2: Remove unused gfs2_iomap_alloc argument
Remove the unused flags argument of gfs2_iomap_alloc.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-07-04 17:24:25 +02:00
Darrick J. Wong bf3cb39447 xfs: allow single bulkstat of special inodes
Create a new bulk ireq flag that enables userspace to ask us for a
special inode number instead of interpreting @ino as a literal inode
number.  This enables us to query the root inode easily.

The reason for adding the ability to query specifically the root
directory inode is that certain programs (xfsdump and xfsrestore) want
to confirm when they've been pointed to the root directory.  The
userspace code assumes the root directory is always the first result
from calling bulkstat with lastino == 0, but this isn't true if the
(initial btree roots + initial AGFL + inode alignment padding) is itself
long enough to be allocated to new inodes if all of those blocks should
happen to be free at the same time.  Rather than make userspace guess
at internal filesystem state, we provide a direct query.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-04 07:52:24 -07:00
Darrick J. Wong 13d59a2a61 xfs: specify AG in bulk req
Add a new xfs_bulk_ireq flag to constrain the iteration to a single AG.
If the passed-in startino value is zero then we start with the first
inode in the AG that the user passes in; otherwise, we iterate only
within the same AG as the passed-in inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-04 07:52:23 -07:00
Greg Kroah-Hartman 0979cf95d2 orangefs: fix build warning from debugfs cleanup patch
Stephen writes:
	After merging the driver-core tree, today's linux-next build (x86_64
	allmodconfig) produced this warning:

	fs/orangefs/orangefs-debugfs.c: In function 'orangefs_debugfs_init':
	fs/orangefs/orangefs-debugfs.c:193:1: warning: label 'out' defined but not used [-Wunused-label]
	 out:
	 ^~~
	fs/orangefs/orangefs-debugfs.c: In function 'orangefs_kernel_debug_init':
	fs/orangefs/orangefs-debugfs.c:204:17: warning: unused variable 'ret' [-Wunused-variable]
	  struct dentry *ret;
	                 ^~~
Fix this up and change the return type of the function to void as it can
not fail, which cleans up some more code and variables as well.

Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: devel@lists.orangefs.org
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: f095adba36 ("orangefs: no need to check return value of debugfs_create functions")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-04 10:30:33 +02:00
Greg Kroah-Hartman d71cac5971 ubifs: fix build warning after debugfs cleanup patch
Stephen writes:
	After merging the driver-core tree, today's linux-next build (arm
	multi_v7_defconfig) produced this warning:

	fs/ubifs/debug.c: In function 'dbg_debugfs_init_fs':
	fs/ubifs/debug.c:2812:6: warning: unused variable 'err' [-Wunused-variable]
	  int err, n;
	      ^~~

So fix this up properly.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Richard Weinberger <richard@nod.at>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-04 08:33:35 +02:00
Darrick J. Wong fba9760a43 xfs: wire up the v5 inumbers ioctl
Wire up the v5 INUMBERS ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:28 -07:00
Darrick J. Wong 0448b6f488 xfs: wire up new v5 bulkstat ioctls
Wire up the new v5 BULKSTAT ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:27 -07:00
Darrick J. Wong 5f19c7fc68 xfs: introduce v5 inode group structure
Introduce a new "v5" inode group structure that fixes the alignment
and padding problems of the existing structure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:27 -07:00
Darrick J. Wong 7035f9724f xfs: introduce new v5 bulkstat structure
Introduce a new version of the in-core bulkstat structure that supports
our new v5 format features.  This structure also fills the gaps in the
previous structure.  We leave wiring up the ioctls for the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:26 -07:00
Darrick J. Wong 8bfe9d1810 xfs: rename bulkstat functions
Rename the bulkstat functions to 'fsbulkstat' so that they match the
ioctl names.  We will be introducing a new set of bulkstat/inumbers
ioctls soon, and it will be important to keep the names straight.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:26 -07:00
Darrick J. Wong 6f71fb6838 xfs: remove various bulk request typedef usage
Remove xfs_bstat_t, xfs_fsop_bulkreq_t, xfs_inogrp_t, and similarly
named compat typedefs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 20:36:25 -07:00
J. Bruce Fields 791234448d nfsd: decode implementation id
Decode the implementation ID and display in nfsd/clients/#/info.  It may
be help identify the client.  It won't be used otherwise.

(When this went into the protocol, I thought the implementation ID would
be a slippery slope towards implementation-specific workarounds as with
the http user-agent.  But I guess I was wrong, the risk seems pretty low
now.)

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 20:54:03 -04:00
J. Bruce Fields 6f4859b8a7 nfsd: create xdr_netobj_dup helper
Move some repeated code to a common helper.  No change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:51 -04:00
J. Bruce Fields 89c905becc nfsd: allow forced expiration of NFSv4 clients
NFSv4 clients are automatically expired and all their locks removed if
they don't contact the server for a certain amount of time (the lease
period, 90 seconds by default).

There can still be situations where that's not enough, so allow
userspace to force expiry by writing "expire\n" to the new
nfsd/client/#/ctl file.

(The generic "ctl" name is because I expect we may want to allow other
operations on clients in the future.)

The write will not return until the client is expired and all of its
locks and other state removed.

The fault injection code also provides a way of expiring clients, but it
fails if there are any in-progress RPC's referencing the client.  Also,
its method of selecting a client to expire is a little more
primitive--it uses an IP address, which can't always uniquely specify an
NFSv4 client.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields a204f25e37 nfsd: create get_nfsdfs_clp helper
Factor our some common code.  No change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields 0c4b62b042 nfsd4: show layout stateids
These are also minimal for now, I'm not sure what information would be
useful.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields 16d36e0999 nfsd: show lock and deleg stateids
These entries are pretty minimal for now.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields 78599c42ae nfsd4: add file to display list of client's opens
Add a nfsd/clients/#/opens file to list some information about all the
opens held by the given client, including open modes, device numbers,
inode numbers, and open owners.

Open owners are totally opaque but seem to sometimes have some useful
ascii strings included, so passing through printable ascii characters
and escaping the rest seems useful while still being machine-readable.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields 169319f13c nfsd: add more information to client info file
Add ip address, full client-provided identifier, and minor version.
There's much more that could possibly be useful but this is a start.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields ea053e164c nfsd: escape high characters in binary data
I'm exposing some information about NFS clients in pseudofiles.  I
expect to eventually have simple tools to help read those pseudofiles.

But it's also helpful if the raw files are human-readable to the extent
possible.  It aids debugging and makes them usable on systems that don't
have the latest nfs-utils.

A minor challenge there is opaque client-generated protocol objects like
state owners and client identifiers.  Some clients generate those to
include handy information in plain ascii.  But they may also include
arbitrary byte sequences.

I think the simplest approach is to limit to isprint(c) && isascii(c)
and escape everything else.

That means you can just cat the file and get something that looks OK.
Also, I'm trying to keep these files legal YAML, which requires them to
UTF-8, and this is a simple way to guarantee that.

Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields 3bade247fc nfsd: copy client's address including port number to cl_addr
rpc_copy_addr() copies only the IP address and misses any port numbers.
It seems potentially useful to keep the port number around too.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields 97ad4031e2 nfsd4: add a client info file
Add a new nfsd/clients/#/info file with some basic information about
each NFSv4 client.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields bf5ed3e3bb nfsd: make client/ directory names small ints
We want clientid's on the wire to be randomized for reasons explained in
ebd7c72c63 "nfsd: randomize SETCLIENTID reply to help distinguish
servers".  But I'd rather have mostly small integers for the clients/
directory.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:50 -04:00
J. Bruce Fields e8a79fb14f nfsd: add nfsd/clients directory
I plan to expose some information about nfsv4 clients here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:49 -04:00
J. Bruce Fields 59f8e91b75 nfsd4: use reference count to free client
Keep a second reference count which is what is really used to decide
when to free the client's memory.

Next I'm going to add an nfsd/clients/ directory with a subdirectory for
each NFSv4 client.  File objects under nfsd/clients/ will hold these
references.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:49 -04:00
J. Bruce Fields 14ed14cc7c nfsd: rename cl_refcount
Rename this to a more descriptive name: it counts the number of
in-progress rpc's referencing this client.

Next I'm going to add a second refcount with a slightly different use.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:49 -04:00
J. Bruce Fields 2c830dd720 nfsd: persist nfsd filesystem across mounts
Keep around one internal mount of the nfsd filesystem so that we can add
stuff to it when clients come and go, regardless of whether anyone has
it mounted.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:49 -04:00
J. Bruce Fields 689d7ba489 nfsd: fix cleanup of nfsd_reply_cache_init on failure
The failure to unregister the shrinker results will result in corruption
when the nfsd_net is freed.

Also clean up the drc_slab while we're here.

Reported-by: syzbot+83a43746cebef3508b49@syzkaller.appspotmail.com
Fixes: db17b61765c2 ("nfsd4: drc containerization")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:09 -04:00
J. Bruce Fields 30498dcc12 nfsd4: remove outdated nfsd4_decode_time comment
Commit bf8d909705 "nfsd: Decode and send 64bit time values" fixed the
code without updating the comment.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:09 -04:00
J. Bruce Fields bdba53687e nfsd: use 64-bit seconds fields in nfsd v4 code
After commit 95582b0083 "vfs: change inode times to use struct
timespec64" there are spots in the NFSv4 decoding where we decode the
protocol into a struct timeval and then convert that into a timeval64.

That's unnecesary in the NFSv4 case since the on-the-wire protocol also
uses 64-bit values.  So just fix up our code to use timeval64 everywhere.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:09 -04:00
Geert Uytterhoeven e977cc8308 nfsd: Spelling s/EACCESS/EACCES/
The correct spelling is EACCES:

include/uapi/asm-generic/errno-base.h:#define EACCES 13 /* Permission denied */

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:09 -04:00
YueHaibing 291adeb254 lockd: Make two symbols static
Fix sparse warnings:

fs/lockd/clntproc.c:57:6: warning: symbol 'nlmclnt_put_lockowner' was not declared. Should it be static?
fs/lockd/svclock.c:409:35: warning: symbol 'nlmsvc_lock_ops' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:09 -04:00
Benjamin Coddington f85d93385e locks: Cleanup lm_compare_owner and lm_owner_key
After the update to use nlm_lockowners for the NLM server, there are no
more users of lm_compare_owner and lm_owner_key.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:09 -04:00
Benjamin Coddington 646d73e91b lockd: Show pid of lockd for remote locks
Use the pid of lockd instead of the remote lock's svid for the fl_pid for
local POSIX locks.  This allows proper enumeration of which local process
owns which lock.  The svid is meaningless to local lock readers.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:09 -04:00
Benjamin Coddington 9adfac6d73 lockd: Remove lm_compare_owner and lm_owner_key
Now that the NLM server allocates an nlm_lockowner for fl_owner, there's
no need for special hashing or comparison.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:08 -04:00
Benjamin Coddington 89e0edfbea lockd: Convert NLM service fl_owner to nlm_lockowner
Do as the NLM client: allocate and track a struct nlm_lockowner for use as
the fl_owner for locks created by the NLM sever.  This allows us to keep
the svid within this structure for matching locks, and will allow us to
track the pid of lockd in a future patch.  It should also allow easier
reference of the nlm_host in conflicting locks, and simplify lock hashing
and comparison.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
[bfields@redhat.com: fix type of some error returns]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:08 -04:00
Benjamin Coddington 9de3ec1d57 lockd: prepare nlm_lockowner for use by the server
The nlm_lockowner structure that the client uses to track locks is
generally useful to the server as well.  Very similar functions to handle
allocation and tracking of the nlm_lockowner will follow. Rename the client
functions for clarity.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:08 -04:00
J. Bruce Fields 22a46eb440 nfsd: note inadequate stats locking
After 89a26b3d29 "nfsd: split DRC global spinlock into per-bucket
locks", there is no longer a single global spinlock to protect these
stats.

So, really we need to fix that.  For now, at least fix the comment.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:08 -04:00
J. Bruce Fields 3ba75830ce nfsd4: drc containerization
The nfsd duplicate reply cache should not be shared between network
namespaces.

The most straightforward way to fix this is just to move every global in
the code to per-net-namespace memory, so that's what we do.

Still todo: sort out which members of nfsd_stats should be global and
which per-net-namespace.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:08 -04:00
J. Bruce Fields b401170f6d nfsd: don't call nfsd_reply_cache_shutdown twice
The caller is cleaning up on ENOMEM, don't try to do it here too.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:52:08 -04:00
Paul Menzel 3b2d4dcf71 nfsd: Fix overflow causing non-working mounts on 1 TB machines
Since commit 10a68cdf10 (nfsd: fix performance-limiting session
calculation) (Linux 5.1-rc1 and 4.19.31), shares from NFS servers with
1 TB of memory cannot be mounted anymore. The mount just hangs on the
client.

The gist of commit 10a68cdf10 is the change below.

    -avail = clamp_t(int, avail, slotsize, avail/3);
    +avail = clamp_t(int, avail, slotsize, total_avail/3);

Here are the macros.

    #define min_t(type, x, y)       __careful_cmp((type)(x), (type)(y), <)
    #define clamp_t(type, val, lo, hi) min_t(type, max_t(type, val, lo), hi)

`total_avail` is 8,434,659,328 on the 1 TB machine. `clamp_t()` casts
the values to `int`, which for 32-bit integers can only hold values
−2,147,483,648 (−2^31) through 2,147,483,647 (2^31 − 1).

`avail` (in the function signature) is just 65536, so that no overflow
was happening. Before the commit the assignment would result in 21845,
and `num = 4`.

When using `total_avail`, it is causing the assignment to be
18446744072226137429 (printed as %lu), and `num` is then 4164608182.

My next guess is, that `nfsd_drc_mem_used` is then exceeded, and the
server thinks there is no memory available any more for this client.

Updating the arguments of `clamp_t()` and `min_t()` to `unsigned long`
fixes the issue.

Now, `avail = 65536` (before commit 10a68cdf10 `avail = 21845`), but
`num = 4` remains the same.

Fixes: c54f24e338 (nfsd: fix performance-limiting session calculation)
Cc: stable@vger.kernel.org
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-03 17:51:31 -04:00
Jaegeuk Kim 4969c06a0d f2fs: support swap file w/ DIO
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-03 08:52:54 -07:00
Hariprasad Kelam a7a9250e18 fs: xfs: xfs_log: Change return type from int to void
Change return types of below functions as they never fails
xfs_log_mount_cancel
xlog_recover_cancel
xlog_recover_cancel_intents

fix below issue reported by coccicheck
fs/xfs/xfs_log_recover.c:4886:7-12: Unneeded variable: "error". Return
"0" on line 4926

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-07-03 08:21:58 -07:00
Darrick J. Wong 3e5a428b26 xfs: poll waiting for quotacheck
Create a pwork destroy function that uses polling instead of
uninterruptible sleep to wait for work items to finish so that we can
touch the softlockup watchdog.  IOWs, gross hack.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 08:21:58 -07:00
Greg Kroah-Hartman 1a829ff2a6 ceph: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

This cleanup allows the return value of the functions to be made void,
as no logic should care if these files succeed or not.

Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Sage Weil <sage@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612145538.GA18772@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 16:57:18 +02:00
Greg Kroah-Hartman 702d6a834b ubifs: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Richard Weinberger <richard@nod.at>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612152120.GA17450@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 16:57:17 +02:00
Greg Kroah-Hartman f095adba36 orangefs: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: devel@lists.orangefs.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612152204.GA17511@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 16:57:17 +02:00
Greg Kroah-Hartman 15b6ff9516 nfsd: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612152603.GB18440@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 16:57:17 +02:00
Greg Kroah-Hartman d03ae4778b debugfs: provide pr_fmt() macro
Use a common "debugfs: " prefix for all pr_* calls in a single place.

Cc: Mark Brown <broonie@kernel.org>
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190703071653.2799-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 16:55:52 +02:00
Greg Kroah-Hartman 43e23b6c0b debugfs: log errors when something goes wrong
As it is not recommended that debugfs calls be checked, it was pointed
out that major errors should still be logged somewhere so that
developers and users have a chance to figure out what went wrong.  To
help with this, error logging has been added to the debugfs core so that
it is not needed to be present in every individual file that calls
debugfs.

Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190703071653.2799-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 16:55:51 +02:00
Darrick J. Wong 40786717c8 xfs: multithreaded iwalk implementation
Create a parallel iwalk implementation and switch quotacheck to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-03 07:33:26 -07:00
Fuqian Huang 90f15ac9fa ext2: Use kmemdup rather than duplicating its implementation
kmemdup is introduced to duplicate a region of memory in a neat way.
Rather than kmalloc/kzalloc + memset, which the programmer needs to
write the size twice (sometimes lead to mistakes), kmemdup improves
readability, leads to smaller code and also reduce the chances of mistakes.
Suggestion to use kmemdup rather than using kmalloc/kzalloc + memset.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Link: https://lore.kernel.org/r/20190703131727.25735-1-huangfq.daxian@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
2019-07-03 16:24:25 +02:00
Christoph Hellwig 35af80aef9 gfs2: don't use buffer_heads in gfs2_allocate_page_backing
Rewrite gfs2_allocate_page_backing to call gfs2_iomap_get_alloc and operate on
struct iomap directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-07-03 14:45:18 +02:00
Christoph Hellwig 7770c93a46 gfs2: use iomap_bmap instead of generic_block_bmap
No need to indirect through get_blocks and buffer_heads when we can just use
the iomap version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-07-03 14:45:18 +02:00
Christoph Hellwig 378b6cbfb8 gfs2: mark stuffed_readpage static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-07-03 14:45:18 +02:00
Christoph Hellwig 59c01c5046 gfs2: merge gfs2_writepage_common into gfs2_writepage
There is no need to keep these two functions separate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-07-03 14:45:18 +02:00
Christoph Hellwig eadd753580 gfs2: merge gfs2_writeback_aops and gfs2_ordered_aops
The only difference between the two is that gfs2_ordered_aops sets the
set_page_dirty method to __set_page_dirty_buffers, but given that
__set_page_dirty_buffers is the default, if no method is set, there is no need
to to do that.  Merge the two sets of operations into one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-07-03 14:45:09 +02:00
Linus Torvalds 6e692c3b72 SMB3 fix (for stable as well) for crash mishandling one of the Windows reparse point symlink tags
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl0b67IACgkQiiy9cAdy
 T1H/Ngv/XNc9l/OEHwyWZ1QnSCBKZyLyD5ZcKQRFFkfiktmQ8FtPUzf4qKlHxX1h
 ssefwBIbkW1+DG2sgvrL7OfqnPDnSezVoifvRmbh0nFX8anWhtChZMc0s+xiLtz2
 SbDBugNSkc8l9fvQz5A6VPJ3TcNA+VsSE2rr1HuimS9S4RAy1RsPhhWNyUh3GV5A
 SWuD7bsnxZ7/H2l+hx+s2O5RLDFoeniEIGFTsH9/f7Q19YGJtf6arnUlyUaZjkXK
 bPV2jZyalRUznK7RSFDLu49fS2zH8/m6MfBYyat31SZVtLFcQC/ijhKYTWr8wrKu
 +iQPlX+IDk4rfH/++7PXJJv1sKFLZNEs22dOi1YG0FgkRtMNA8HzmJqVFLcgoB2d
 QD7Ahj4dE0ghXv1dLMjfKdchNbkrWiygfpje54AkhU9SWUIS/EljDbQSq3e/wpAW
 i9HxCGCmmTPFzVKDVhyaBXHi6h5pzd7FfNNS4iJ2Lsy5PRLOHBMxaX1wknu/8vP0
 IIWuB9Hh
 =1zkr
 -----END PGP SIGNATURE-----

Merge tag '5.2-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fix from Steve French:
 "SMB3 fix (for stable as well) for crash mishandling one of the Windows
  reparse point symlink tags"

* tag '5.2-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix crash querying symlinks stored as reparse-points
2019-07-03 16:06:36 +08:00
Christoph Hellwig e0ec0a6ba6 gfs2: remove the unused gfs2_stuffed_write_end function
This function was overlooked when the write_begin and write_end address space
operations were removed as part of gfs2's iomap conversion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-07-03 08:56:01 +02:00
Christoph Hellwig f3915f83e8 gfs2: use page_offset in gfs2_page_mkwrite
Without casting page->index to a guaranteed 64-bit type, the value might be
treated as 32-bit on 32-bit platforms and thus get truncated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-07-03 08:53:01 +02:00
Jaegeuk Kim cad3836f9e f2fs: allocate blocks for pinned file
This patch allows fallocate to allocate physical blocks for pinned file.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:40:42 -07:00
Sahitya Tummala 56659ce838 f2fs: fix is_idle() check for discard type
The discard thread should issue upto dpolicy->max_requests at once
and wait for all those discard requests at once it reaches
dpolicy->max_requests. It should then sleep for dpolicy->min_interval
timeout before issuing the next batch of discard requests. But in the
current code of is_idle(), it checks for dcc_info->queued_discard and
aborts issuing the discard batch of max_requests. This
dcc_info->queued_discard will be true always once one discard command
is issued.

It is thus resulting into this type of discard request pattern -

- Issue discard request#1
- is_idle() returns false, discard thread waits for request#1 and then
  sleeps for min_interval 50ms.
- Issue discard request#2
- is_idle() returns false, discard thread waits for request#2 and then
  sleeps for min_interval 50ms.
- and so on for all other discard requests, assuming f2fs is idle w.r.t
  other conditions.

With this fix, the pattern will look like this -

- Issue discard request#1
- Issue discard request#2
  and so on upto max_requests of 8
- Issue discard request#8
- wait for min_interval 50ms.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:40:42 -07:00
Jaegeuk Kim db6ec53b7e f2fs: add a rw_sem to cover quota flag changes
Two paths to update quota and f2fs_lock_op:

1.
 - lock_op
 |  - quota_update
 `- unlock_op

2.
 - quota_update
 - lock_op
 `- unlock_op

But, we need to make a transaction on quota_update + lock_op in #2 case.
So, this patch introduces:
1. lock_op
2. down_write
3. check __need_flush
4. up_write
5. if there is dirty quota entries, flush them
6. otherwise, good to go

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:40:41 -07:00
Chao Yu c83414aedf f2fs: set SBI_NEED_FSCK for xattr corruption case
If xattr is corrupted, let's print kernel message and set SBI_NEED_FSCK
for further repair.

Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:40:41 -07:00
Chao Yu 10f966bbf5 f2fs: use generic EFSBADCRC/EFSCORRUPTED
f2fs uses EFAULT as error number to indicate filesystem is corrupted
all the time, but generic filesystems use EUCLEAN for such condition,
we need to change to follow others.

This patch adds two new macros as below to wrap more generic error
code macros, and spread them in code.

EFSBADCRC	EBADMSG		/* Bad CRC detected */
EFSCORRUPTED	EUCLEAN		/* Filesystem is corrupted */

Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:40:41 -07:00
Geert Uytterhoeven f91108b801 f2fs: Use DIV_ROUND_UP() instead of open-coding
Replace the open-coded divisions with round-up by calls to the
DIV_ROUND_UP() helper macro.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:40:41 -07:00
Chao Yu 2d821c1217 f2fs: print kernel message if filesystem is inconsistent
As Pavel reported, once we detect filesystem inconsistency in
f2fs_inplace_write_data(), it will be better to print kernel message as
we did in other places.

Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:40:41 -07:00
Joe Perches dcbb4c10e6 f2fs: introduce f2fs_<level> macros to wrap f2fs_printk()
- Add and use f2fs_<level> macros
- Convert f2fs_msg to f2fs_printk
- Remove level from f2fs_printk and embed the level in the format
- Coalesce formats and align multi-line arguments
- Remove unnecessary duplicate extern f2fs_msg f2fs.h

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:40:40 -07:00
Chao Yu 8740edc3e5 f2fs: avoid get_valid_blocks() for cleanup
No logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:40:40 -07:00
Qiuyang Sun 04f0b2eaa3 f2fs: ioctl for removing a range from F2FS
This ioctl shrinks a given length (aligned to sections) from end of the
main area. Any cursegs and valid blocks will be moved out before
invalidating the range.

This feature can be used for adjusting partition sizes online.

History of the patch:

Sahitya Tummala:
 - Add this ioctl for f2fs_compat_ioctl() as well.
 - Fix debugfs status to reflect the online resize changes.
 - Fix potential race between online resize path and allocate new data
   block path or gc path.

Others:
 - Rename some identifiers.
 - Add some error handling branches.
 - Clear sbi->next_victim_seg[BG_GC/FG_GC] in shrinking range.
 - Implement this interface as ext4's, and change the parameter from shrunk
bytes to new block count of F2FS.
 - During resizing, force to empty sit_journal and forbid adding new
   entries to it, in order to avoid invalid segno in journal after resize.
 - Reduce sbi->user_block_count before resize starts.
 - Commit the updated superblock first, and then update in-memory metadata
   only when the former succeeds.
 - Target block count must align to sections.
 - Write checkpoint before and after committing the new superblock, w/o
CP_FSCK_FLAG respectively, so that the FS can be fixed by fsck even if
resize fails after the new superblock is committed.
 - In free_segment_range(), reduce granularity of gc_mutex.
 - Add protection on curseg migration.
 - Add freeze_bdev() and thaw_bdev() for resize fs.
 - Remove CUR_MAIN_SECS and use MAIN_SECS directly for allocation.
 - Recover super_block and FS metadata when resize fails.
 - No need to clear CP_FSCK_FLAG in update_ckpt_flags().
 - Clean up the sb and fs metadata update functions for resize_fs.

Geert Uytterhoeven:
 - Use div_u64*() for 64-bit divisions

Arnd Bergmann:
 - Not all architectures support get_user() with a 64-bit argument:
    ERROR: "__get_user_bad" [fs/f2fs/f2fs.ko] undefined!
    Use copy_from_user() here, this will always work.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-02 15:39:24 -07:00
Gabriel Krisman Bertazi 96fcaf86c3 ext4: fix coverity warning on error path of filename setup
Fix the following coverity warning reported by Dan Carpenter:

fs/ext4/namei.c:1311 ext4_fname_setup_ci_filename()
	  warn: 'cf_name->len' unsigned <= 0

Fixes: 3ae72562ad ("ext4: optimize case-insensitive lookups")
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
2019-07-02 17:56:12 -04:00
Kimberly Brown 78e9605d4f ext4: replace ktype default_attrs with default_groups
The kobj_type default_attrs field is being replaced by the
default_groups field. Replace the default_attrs field in ext4_sb_ktype
and ext4_feat_ktype with default_groups. Use the ATTRIBUTE_GROUPS macro
to create ext4_groups and ext4_feat_groups.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-07-02 17:38:55 -04:00
Hariprasad Kelam 7451c54abc ecryptfs: Change return type of ecryptfs_process_flags
Change return type of ecryptfs_process_flags from int to void as it
never fails.

fixes below issue reported by coccicheck

s/ecryptfs/crypto.c:870:5-7: Unneeded variable: "rc". Return "0" on line
883

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
[tyhicks: Remove the return value line from the function documentation]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2019-07-02 19:28:02 +00:00
Christoph Hellwig 25b2995a35 mm: remove MEMORY_DEVICE_PUBLIC support
The code hasn't been used since it was added to the tree, and doesn't
appear to actually be usable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:43 -03:00
Darrick J. Wong 677717fbd4 xfs: refactor INUMBERS to use iwalk functions
Now that we have generic functions to walk inode records, refactor the
INUMBERS implementation to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:06 -07:00
Darrick J. Wong 04b8fba2e1 xfs: refactor iwalk code to handle walking inobt records
Refactor xfs_iwalk_ag_start and xfs_iwalk_ag so that the bits that are
particular to bulkstat (trimming the start irec, starting inode
readahead, and skipping empty groups) can be controlled via flags in the
iwag structure.

This enables us to add a new function to walk all inobt records which
will be used for the new INUMBERS implementation in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:06 -07:00
Darrick J. Wong 2b5eb82601 xfs: refactor xfs_iwalk_grab_ichunk
In preparation for reusing the iwalk code for the inogrp walking code
(aka INUMBERS), move the initial inobt lookup and retrieval code out of
xfs_iwalk_grab_ichunk so that we call the masking code only when we need
to trim out the inodes that came before the cursor in the inobt record
(aka BULKSTAT).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:06 -07:00
Darrick J. Wong 688f7c3678 xfs: clean up long conditionals in xfs_iwalk_ichunk_ra
Refactor xfs_iwalk_ichunk_ra to avoid long conditionals.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:06 -07:00
Darrick J. Wong 5e29f3b720 xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino
Now that the inode chunk grabbing function is a static function in the
iwalk code, change its behavior so that @agino is the inode where we
want to /start/ the iteration.  This reduces cognitive friction with the
callers and simplifes the code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong da1d9e5912 xfs: move bulkstat ichunk helpers to iwalk code
Now that we've reworked the bulkstat code to use iwalk, we can move the
old bulkstat ichunk helpers to xfs_iwalk.c.  No functional changes here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong 938c710d99 xfs: calculate inode walk prefetch more carefully
The existing inode walk prefetch is based on the old bulkstat code,
which simply allocated 4 pages worth of memory and prefetched that many
inobt records, regardless of however many inodes the caller requested.
65536 inodes is a lot to prefetch (~32M on x64, ~512M on arm64) so let's
scale things down a little more intelligently based on the number of
inodes requested, etc.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong 2810bd6840 xfs: convert bulkstat to new iwalk infrastructure
Create a new ibulk structure incore to help us deal with bulk inode stat
state tracking and then convert the bulkstat code to use the new iwalk
iterator.  This disentangles inode walking from bulk stat control for
simpler code and enables us to isolate the formatter functions to the
ioctl handling code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong f16fe3ecde xfs: bulkstat should copy lastip whenever userspace supplies one
When userspace passes in a @lastip pointer we should copy the results
back, even if the @ocount pointer is NULL.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong ebd126a651 xfs: convert quotacheck to use the new iwalk functions
Convert quotacheck to use the new iwalk iterator to dig through the
inodes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong a211432c27 xfs: create simplified inode walk function
Create a new iterator function to simplify walking inodes in an XFS
filesystem.  This new iterator will replace the existing open-coded
walking that goes on in various places.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Darrick J. Wong 5bb46e3e18 xfs: create iterator error codes
Currently, xfs doesn't have generic error codes defined for "stop
iterating"; we just reuse the XFS_BTREE_QUERY_* return values.  This
looks a little weird if we're not actually iterating a btree index.
Before we start adding more iterators, we should create general
XFS_ITER_{CONTINUE,ABORT} return values and define the XFS_BTREE_QUERY_*
ones from that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-07-02 09:40:05 -07:00
Josef Bacik 67f9c2209e btrfs: migrate the global_block_rsv helpers to block-rsv.c
These helpers belong in block-rsv.c

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:55 +02:00
Josef Bacik 550fa228ee btrfs: migrate the block-rsv code to block-rsv.c
This moves everything out of extent-tree.c to block-rsv.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:54 +02:00
Josef Bacik 424a47805a btrfs: stop using block_rsv_release_bytes everywhere
block_rsv_release_bytes() is the internal to the block_rsv code, and
shouldn't be called directly by anything else.  Switch all users to the
exported helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:54 +02:00
Josef Bacik fcec36224f btrfs: cleanup the target logic in __btrfs_block_rsv_release
This works for all callers already, but if we wanted to use the helper
for the global_block_rsv it would end up trying to refill itself.  Fix
the logic to be able to be used no matter which block rsv is passed in
to this helper.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:54 +02:00
Josef Bacik fed14b323d btrfs: export __btrfs_block_rsv_release
The delalloc reserve stuff calls this directly because it cares about
the qgroup accounting stuff, so export it so we can move it around.  Fix
btrfs_block_rsv_release() to just be a static inline since it just calls
__btrfs_block_rsv_release() with NULL for the qgroup stuff.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:54 +02:00
Josef Bacik 0b50174ad5 btrfs: export btrfs_block_rsv_add_bytes
This is used in a few places, we need to make sure it's exported so we
can move it around.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:54 +02:00
Josef Bacik d12ffdd1aa btrfs: move btrfs_block_rsv definitions into it's own header
Prep work for separating out all of the block_rsv related code into its
own file.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:53 +02:00
Goldwyn Rodrigues 9b4851bc48 btrfs: Simplify update of space_info in __reserve_metadata_bytes()
We don't need an if-else-if chain where we can use a simple OR since
both conditions are performing the same action. The short-circuit for OR
will ensure that if the first condition is true, can_overcommit() is not
called.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:53 +02:00
Josef Bacik 83d731a5b2 btrfs: unexport can_overcommit
Now that we've moved all of the users to space-info.c, unexport it and
name it back to can_overcommit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:53 +02:00
Josef Bacik 0d9764f6d0 btrfs: move reserve_metadata_bytes and supporting code to space-info.c
This moves all of the metadata reservation code into space-info.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:53 +02:00
Josef Bacik 5da6afeb32 btrfs: move dump_space_info to space-info.c
We'll need this exported so we can use it in all the various was we need
to use it.  This is prep work to move reserve_metadata_bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:53 +02:00
Josef Bacik c2a67a76ec btrfs: export block_rsv_use_bytes
We are going to need this to move the metadata reservation stuff to
space_info.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:53 +02:00
Josef Bacik b338b013e1 btrfs: move btrfs_space_info_add_*_bytes to space-info.c
Now that we've moved all the pre-requisite stuff, move these two
functions.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:52 +02:00
Josef Bacik bb96c4e574 btrfs: move the space info update macro to space-info.h
Also rename it to btrfs_space_info_update_* so it's clear what we're
updating.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:52 +02:00
Josef Bacik 41783ef24d btrfs: move and export can_overcommit
This is the first piece of moving the space reservation code to
space-info.c

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:52 +02:00
Josef Bacik 280c290881 btrfs: move the space_info handling code to space-info.c
These are the basic init and lookup functions and some helper functions,
fairly straightforward before the bad stuff starts.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:52 +02:00
Josef Bacik d44b72aa12 btrfs: export space_info_add_*_bytes
Prep work for consolidating all of the space_info code into one file.
We need to export these so multiple files can use them.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:52 +02:00
Josef Bacik fc471cb0c8 btrfs: rename do_chunk_alloc to btrfs_chunk_alloc
Really we just need the enum, but as we break more things up it'll help
to have this external to extent-tree.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:51 +02:00
Josef Bacik 8719aaae8d btrfs: move space_info to space-info.h
Migrate the struct definition and the one helper that's in ctree.h into
space-info.h

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:51 +02:00
David Sterba e749af443f btrfs: lift bio_set_dev from bio allocation helpers
The block device is passed around for the only purpose to set it in new
bios. Move the assignment one level up. This is a preparatory patch for
further bdev cleanups.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:51 +02:00
David Sterba e1ea2beee2 btrfs: use raid_attr for minimum stripe count in btrfs_calc_avail_data_space
Minimum stripe count matches the minimum devices required for a given
profile. The open coded assignments match the raid_attr table.

What's changed here is the meaning for RAID5/6. Previously their
min_stripes would be 1, while newly it's devs_min. This however shold be
the same as before because it's not possible to create filesystem on
fewer devices than the raid_attr table allows.

There's no adjustment regarding the parity stripes (like
calc_data_stripes does), because we're interested in overall space that
would fit on the devices.

Missing devices make no difference for the whole calculation, we have
the size stored in the structures.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:51 +02:00
David Sterba 4f080f5711 btrfs: use raid_attr to adjust minimal stripe size in btrfs_calc_avail_data_space
Special case for DUP can be replaced by lookup to the attribute table,
where the dev_stripes is the right coefficient.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:51 +02:00
David Sterba f262fa8de6 btrfs: drop default value assignments in enums
A few more instances whre we don't need to specify the values as long as
they are the same that enum assigns automatically. All of the enums are
in-memory only and nothing relies on the exact values.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:50 +02:00
David Sterba 2792237d0c btrfs: use common helpers for extent IO state insertion messages
Print the error messages using the helpers that also print the
filesystem identification.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:50 +02:00
Josef Bacik 63611e738a btrfs: run delayed iput at unlink time
We have been seeing issues in production where a cleaner script will end
up unlinking a bunch of files that have pending iputs.  This means they
will get their final iput's run at btrfs-cleaner time and thus are not
throttled, which impacts the workload.

Since we are unlinking these files we can just drop the delayed iput at
unlink time.  We are already holding a reference to the inode so this
will not be the final iput and thus is completely safe to do at this
point.  Doing this means we are more likely to be doing the final iput
at unlink time, and thus will get the IO charged to the caller and get
throttled appropriately without affecting the main workload.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:50 +02:00
Filipe Manana 179006688a Btrfs: add missing inode version, ctime and mtime updates when punching hole
If the range for which we are punching a hole covers only part of a page,
we end up updating the inode item but we skip the update of the inode's
iversion, mtime and ctime. Fix that by ensuring we update those properties
of the inode.

A patch for fstests test case generic/059 that tests this as been sent
along with this fix.

Fixes: 2aaa665581 ("Btrfs: add hole punching")
Fixes: e8c1c76e80 ("Btrfs: add missing inode update when punching hole")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:50 +02:00
Filipe Manana 803f0f64d1 Btrfs: fix fsync not persisting dentry deletions due to inode evictions
In order to avoid searches on a log tree when unlinking an inode, we check
if the inode being unlinked was logged in the current transaction, as well
as the inode of its parent directory. When any of the inodes are logged,
we proceed to delete directory items and inode reference items from the
log, to ensure that if a subsequent fsync of only the inode being unlinked
or only of the parent directory when the other is not fsync'ed as well,
does not result in the entry still existing after a power failure.

That check however is not reliable when one of the inodes involved (the
one being unlinked or its parent directory's inode) is evicted, since the
logged_trans field is transient, that is, it is not stored on disk, so it
is lost when the inode is evicted and loaded into memory again (which is
set to zero on load). As a consequence the checks currently being done by
btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
return true if the inode was evicted before, regardless of the inode
having been logged or not before (and in the current transaction), this
results in the dentry being unlinked still existing after a log replay
if after the unlink operation only one of the inodes involved is fsync'ed.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/dir
  $ touch /mnt/dir/foo
  $ xfs_io -c fsync /mnt/dir/foo

  # Keep an open file descriptor on our directory while we evict inodes.
  # We just want to evict the file's inode, the directory's inode must not
  # be evicted.
  $ ( cd /mnt/dir; while true; do :; done ) &
  $ pid=$!

  # Wait a bit to give time to background process to chdir to our test
  # directory.
  $ sleep 0.5

  # Trigger eviction of the file's inode.
  $ echo 2 > /proc/sys/vm/drop_caches

  # Unlink our file and fsync the parent directory. After a power failure
  # we don't expect to see the file anymore, since we fsync'ed the parent
  # directory.
  $ rm -f $SCRATCH_MNT/dir/foo
  $ xfs_io -c fsync /mnt/dir

  <power failure>

  $ mount /dev/sdb /mnt
  $ ls /mnt/dir
  foo
  $
   --> file still there, unlink not persisted despite explicit fsync on dir

Fix this by checking if the inode has the full_sync bit set in its runtime
flags as well, since that bit is set everytime an inode is loaded from
disk, or for other less common cases such as after a shrinking truncate
or failure to allocate extent maps for holes, and gets cleared after the
first fsync. Also consider the inode as possibly logged only if it was
last modified in the current transaction (besides having the full_fsync
flag set).

Fixes: 3a5f1d458a ("Btrfs: Optimize btree walking while logging inodes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:50 +02:00
Nikolay Borisov 89b798ad1b btrfs: Use btrfs_get_io_geometry appropriately
Presently btrfs_map_block is used not only to do everything necessary to
map a bio to the underlying allocation profile but it's also used to
identify how much data could be written based on btrfs' stripe logic
without actually submitting anything. This is achieved by passing NULL
for 'bbio_ret' parameter.

This patch refactors all callers that require just the mapping length
by switching them to using btrfs_io_geometry instead of calling
btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:50 +02:00
Nikolay Borisov 5f1411265e btrfs: Introduce btrfs_io_geometry infrastructure
Add a structure that holds various parameters for IO calculations and a
helper that fills the values. This will help further refactoring and
reduction of functions that in some way open-coded the calculations.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:49 +02:00
David Sterba c9d713d5b5 btrfs: improve messages when updating feature flags
Currently the messages printed after setting an incompat feature are
cryptis, we can easily make it better as the textual description is
passed to the helpers. Old:

  setting 128 feature flag

updated:

  setting incompat feature flag for RAID56 (0x80)

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:49 +02:00
Arnd Bergmann 6c64460cdc btrfs: shut up bogus -Wmaybe-uninitialized warning
gcc sometimes can't determine whether a variable has been initialized
when both the initialization and the use are conditional:

fs/btrfs/props.c: In function 'inherit_props':
fs/btrfs/props.c:389:4: error: 'num_bytes' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    btrfs_block_rsv_release(fs_info, trans->block_rsv,

This code is fine. Unfortunately, I cannot think of a good way to
rephrase it in a way that makes gcc understand this, so I add a bogus
initialization the way one should not.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ gcc 8 and 9 don't emit the warning ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:49 +02:00
Filipe Manana 9e967495e0 Btrfs: prevent send failures and crashes due to concurrent relocation
Send always operates on read-only trees and always expected that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However relocation can COW
nodes and leafs from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer' content
changes while send is still using it. When this happens send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:

  [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
  [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984

Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send is still using it, it
got COWed and got reused as a leaf while send is still using, producing
the following trace:

 [ 3478.466280] ------------[ cut here ]------------
 [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
 [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
 [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
 [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
 [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
 (...)
 [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
 [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
 [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
 [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
 [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
 [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 [ 3478.475840] FS:  00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
 [ 3478.476489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
 [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [ 3478.479003] Call Trace:
 [ 3478.479600]  ? do_raw_spin_unlock+0x49/0xc0
 [ 3478.480202]  tree_advance+0x173/0x1d0 [btrfs]
 [ 3478.480810]  btrfs_compare_trees+0x30c/0x690 [btrfs]
 [ 3478.481388]  ? process_extent+0x1280/0x1280 [btrfs]
 [ 3478.481954]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
 [ 3478.482510]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
 [ 3478.483062]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
 [ 3478.483581]  ? rq_clock_task+0x2e/0x60
 [ 3478.484086]  ? wake_up_new_task+0x1f3/0x370
 [ 3478.484582]  ? do_vfs_ioctl+0xa2/0x6f0
 [ 3478.485075]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
 [ 3478.485552]  do_vfs_ioctl+0xa2/0x6f0
 [ 3478.486016]  ? __fget+0x113/0x200
 [ 3478.486467]  ksys_ioctl+0x70/0x80
 [ 3478.486911]  __x64_sys_ioctl+0x16/0x20
 [ 3478.487337]  do_syscall_64+0x60/0x1b0
 [ 3478.487751]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
 (...)
 [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
 [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
 [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
 [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
 [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
 [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
 (...)
 [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---

Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.

To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:49 +02:00
David Sterba 71a9c4885e btrfs: document BTRFS_MAX_MIRRORS
The real meaning of that constant is not clear from the context due to
the target device inclusion.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:49 +02:00
David Sterba a07e8a468e btrfs: use mask for RAID56 profiles
We don't need to enumerate the profiles, use the mask for consistency.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:48 +02:00
David Sterba c7369b3fae btrfs: add mask for all RAID1 types
Preparatory patch for additional RAID1 profiles with more copies. The
mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
work the same for the other profiles.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:48 +02:00