Commit graph

810304 commits

Author SHA1 Message Date
Linus Torvalds 1ac5cd4978 block: don't use un-ordered __set_current_state(TASK_UNINTERRUPTIBLE)
This mostly reverts commit 849a370016 ("block: avoid ordered task
state change for polled IO").  It was wrongly claiming that the ordering
wasn't necessary.  The memory barrier _is_ necessary.

If something is truly polling and not going to sleep, it's the whole
state setting that is unnecessary, not the memory barrier.  Whenever you
set your state to a sleeping state, you absolutely need the memory
barrier.

Note that sometimes the memory barrier can be elsewhere.  For example,
the ordering might be provided by an external lock, or by setting the
process state to sleeping before adding yourself to the wait queue list
that is used for waking up (where the wait queue lock itself will
guarantee that any wakeup will correctly see the sleeping state).

But none of those cases were true here.

NOTE! Some of the polling paths may indeed be able to drop the state
setting entirely, at which point the memory barrier also goes away.

(Also note that this doesn't revert the TASK_RUNNING cases: there is no
race between a wakeup and setting the process state to TASK_RUNNING,
since the end result doesn't depend on ordering).

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-02 10:46:03 -08:00
Eric Dumazet d63967e475 isdn: fix kernel-infoleak in capi_unlocked_ioctl
Since capi_ioctl() copies 64 bytes after calling
capi20_get_manufacturer() we need to ensure to not leak
information to user.

BUG: KMSAN: kernel-infoleak in _copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
CPU: 0 PID: 11245 Comm: syz-executor633 Not tainted 4.20.0-rc7+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x173/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
 kmsan_internal_check_memory+0x9d4/0xb00 mm/kmsan/kmsan.c:704
 kmsan_copy_to_user+0xab/0xc0 mm/kmsan/kmsan_hooks.c:601
 _copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
 capi_ioctl include/linux/uaccess.h:177 [inline]
 capi_unlocked_ioctl+0x1a0b/0x1bf0 drivers/isdn/capi/capi.c:939
 do_vfs_ioctl+0xebd/0x2bf0 fs/ioctl.c:46
 ksys_ioctl fs/ioctl.c:713 [inline]
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl+0x1da/0x270 fs/ioctl.c:718
 __x64_sys_ioctl+0x4a/0x70 fs/ioctl.c:718
 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
 entry_SYSCALL_64_after_hwframe+0x63/0xe7
RIP: 0033:0x440019
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffdd4659fb8 EFLAGS: 00000213 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440019
RDX: 0000000020000080 RSI: 00000000c0044306 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000213 R12: 00000000004018a0
R13: 0000000000401930 R14: 0000000000000000 R15: 0000000000000000

Local variable description: ----data.i@capi_unlocked_ioctl
Variable was created at:
 capi_ioctl drivers/isdn/capi/capi.c:747 [inline]
 capi_unlocked_ioctl+0x82/0x1bf0 drivers/isdn/capi/capi.c:939
 do_vfs_ioctl+0xebd/0x2bf0 fs/ioctl.c:46

Bytes 12-63 of 64 are uninitialized
Memory access of size 64 starts at ffff88807ac5fce8
Data copied to user address 0000000020000080

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Karsten Keil <isdn@linux-pingi.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-02 10:31:39 -08:00
Stefano Brivio 7adf324609 ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error
In ip6_neigh_lookup(), we must not return errors coming from
neigh_create(): if creation of a neighbour entry fails, the lookup should
return NULL, in the same way as it's done in __neigh_lookup().

Otherwise, callers legitimately checking for a non-NULL return value of
the lookup function might dereference an invalid pointer.

For instance, on neighbour table overflow, ndisc_router_discovery()
crashes ndisc_update() by passing ERR_PTR(-ENOBUFS) as 'neigh' argument.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: f8a1b43b70 ("net/ipv6: Create a neigh_lookup for FIB entries")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-02 10:29:20 -08:00
Eric Dumazet 202700e307 net/hamradio/6pack: use mod_timer() to rearm timers
Using del_timer() + add_timer() is generally unsafe on SMP,
as noticed by syzbot. Use mod_timer() instead.

kernel BUG at kernel/time/timer.c:1136!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 1026 Comm: kworker/u4:4 Not tainted 4.20.0+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events_unbound flush_to_ldisc
RIP: 0010:add_timer kernel/time/timer.c:1136 [inline]
RIP: 0010:add_timer+0xa81/0x1470 kernel/time/timer.c:1134
Code: 4d 89 7d 40 48 c7 85 70 fe ff ff 00 00 00 00 c7 85 7c fe ff ff ff ff ff ff 48 89 85 90 fe ff ff e9 e6 f7 ff ff e8 cf 42 12 00 <0f> 0b e8 c8 42 12 00 0f 0b e8 c1 42 12 00 4c 89 bd 60 fe ff ff e9
RSP: 0018:ffff8880a7fdf5a8 EFLAGS: 00010293
RAX: ffff8880a7846340 RBX: dffffc0000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff816f3ee1 RDI: ffff88808a514ff8
RBP: ffff8880a7fdf760 R08: 0000000000000007 R09: ffff8880a7846c58
R10: ffff8880a7846340 R11: 0000000000000000 R12: ffff88808a514ff8
R13: ffff88808a514ff8 R14: ffff88808a514dc0 R15: 0000000000000030
FS:  0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000061c500 CR3: 00000000994d9000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 decode_prio_command drivers/net/hamradio/6pack.c:903 [inline]
 sixpack_decode drivers/net/hamradio/6pack.c:971 [inline]
 sixpack_receive_buf drivers/net/hamradio/6pack.c:457 [inline]
 sixpack_receive_buf+0xf9c/0x1470 drivers/net/hamradio/6pack.c:434
 tty_ldisc_receive_buf+0x164/0x1c0 drivers/tty/tty_buffer.c:465
 tty_port_default_receive_buf+0x114/0x190 drivers/tty/tty_port.c:38
 receive_buf drivers/tty/tty_buffer.c:481 [inline]
 flush_to_ldisc+0x3b2/0x590 drivers/tty/tty_buffer.c:533
 process_one_work+0xd0c/0x1ce0 kernel/workqueue.c:2153
 worker_thread+0x143/0x14a0 kernel/workqueue.c:2296
 kthread+0x357/0x430 kernel/kthread.c:246
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Andreas Koensgen <ajk@comnets.uni-bremen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-02 10:27:01 -08:00
Xue Chaojing 53fe3ed19d net-next/hinic:add shutdown callback
If there is no shutdown callback, our board will report pcie UNF errors
after restarting. This patch add shutdown callback for hinic.

Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-02 10:13:35 -08:00
Linus Torvalds d9a7fa67b4 Merge branch 'next-seccomp' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull seccomp updates from James Morris:

 - Add SECCOMP_RET_USER_NOTIF

 - seccomp fixes for sparse warnings and s390 build (Tycho)

* 'next-seccomp' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  seccomp, s390: fix build for syscall type change
  seccomp: fix poor type promotion
  samples: add an example of seccomp user trap
  seccomp: add a return code to trap to userspace
  seccomp: switch system call argument type to void *
  seccomp: hoist struct seccomp_data recalculation higher
2019-01-02 09:48:13 -08:00
Bartlomiej Zolnierkiewicz 399382f801 drm/nouveau: fix incorrect FB_BACKLIGHT usage in Kconfig
Making FB_BACKLIGHT tristate by commit b4a1ed0cd1 ("fbdev:
make FB_BACKLIGHT a tristate") caused unmet dependencies in
some configurations:

WARNING: unmet direct dependencies detected for FB_BACKLIGHT
  Depends on [m]: HAS_IOMEM [=y] && FB [=m]
  Selected by [y]:
  - DRM_NOUVEAU [=y] && HAS_IOMEM [=y] && DRM [=y] && PCI [=y] && MMU [=y] && DRM_NOUVEAU_BACKLIGHT [=y]
  Selected by [m]:
  - FB_NVIDIA [=m] && HAS_IOMEM [=y] && FB [=m] && PCI [=y] && FB_NVIDIA_BACKLIGHT [=y]

Fix it by making DRM_NOUVEAU select BACKLIGHT_CLASS_DEVICE and
BACKLIGHT_LCD_SUPPORT instead of FB_BACKLIGHT.

Fixes: b4a1ed0cd1 ("fbdev: make FB_BACKLIGHT a tristate")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Cc: Rob Clark <robdclark@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
2019-01-02 18:47:37 +01:00
Linus Torvalds f218a29c25 Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull integrity updates from James Morris:
 "In Linux 4.19, a new LSM hook named security_kernel_load_data was
  upstreamed, allowing LSMs and IMA to prevent the kexec_load syscall.
  Different signature verification methods exist for verifying the
  kexec'ed kernel image. This adds additional support in IMA to prevent
  loading unsigned kernel images via the kexec_load syscall,
  independently of the IMA policy rules, based on the runtime "secure
  boot" flag. An initial IMA kselftest is included.

  In addition, this pull request defines a new, separate keyring named
  ".platform" for storing the preboot/firmware keys needed for verifying
  the kexec'ed kernel image's signature and includes the associated IMA
  kexec usage of the ".platform" keyring.

  (David Howell's and Josh Boyer's patches for reading the
  preboot/firmware keys, which were previously posted for a different
  use case scenario, are included here)"

* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  integrity: Remove references to module keyring
  ima: Use inode_is_open_for_write
  ima: Support platform keyring for kernel appraisal
  efi: Allow the "db" UEFI variable to be suppressed
  efi: Import certificates from UEFI Secure Boot
  efi: Add an EFI signature blob parser
  efi: Add EFI signature data types
  integrity: Load certs to the platform keyring
  integrity: Define a trusted platform keyring
  selftests/ima: kexec_load syscall test
  ima: don't measure/appraise files on efivarfs
  x86/ima: retry detecting secure boot mode
  docs: Extend trusted keys documentation for TPM 2.0
  x86/ima: define arch_get_ima_policy() for x86
  ima: add support for arch specific policies
  ima: refactor ima_init_policy()
  ima: prevent kexec_load syscall based on runtime secureboot flag
  x86/ima: define arch_ima_get_secureboot
  integrity: support new struct public_key_signature encoding field
2019-01-02 09:43:14 -08:00
Yangtao Li 260f71eff4 sunrpc: convert to DEFINE_SHOW_ATTRIBUTE
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:49 -05:00
Santosh kumar pradhan 10e037d1e0 sunrpc: Add xprt after nfs4_test_session_trunk()
Multipathing: In case of NFSv3, rpc_clnt_test_and_add_xprt() adds
the xprt to xprt switch (i.e. xps) if rpc_call_null_helper() returns
success. But in case of NFSv4.1, it needs to do EXCHANGEID to verify
the path along with check for session trunking.

Add the xprt in nfs4_test_session_trunk() only when
nfs4_detect_session_trunking() returns success. Also release refcount
hold by rpc_clnt_setup_test_and_add_xprt().

Signed-off-by: Santosh kumar pradhan <santoshkumar.pradhan@wdc.com>
Tested-by: Suresh Jayaraman <suresh.jayaraman@wdc.com>
Reported-by: Aditya Agnihotri <aditya.agnihotri@wdc.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:19 -05:00
J. Bruce Fields cb24e35b4f sunrpc: convert unnecessary GFP_ATOMIC to GFP_NOFS
It's OK to sleep here, we just don't want to recurse into the filesystem
as a writeout could be waiting on this.

Future work: the documentation for GFP_NOFS says "Please try to avoid
using this flag directly and instead use memalloc_nofs_{save,restore} to
mark the whole scope which cannot/shouldn't recurse into the FS layer
with a short explanation why. All allocation requests will inherit
GFP_NOFS implicitly."

But I'm not sure where to do this.  Should the workqueue be arranging
that for us in the case of workqueues created with WQ_MEM_RECLAIM?

Reported-by: Trond Myklebust <trondmy@hammer.space>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:19 -05:00
J. Bruce Fields 81c88b18de sunrpc: handle ENOMEM in rpcb_getport_async
If we ignore the error we'll hit a null dereference a little later.

Reported-by: syzbot+4b98281f2401ab849f4b@syzkaller.appspotmail.com
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:19 -05:00
NeilBrown c2c7d84fd1 NFS: remove unnecessary test for IS_ERR(cred)
As gte_current_cred() cannot return an error,
this test is not necessary.
It hasn't been necessary for years, but it wasn't so obvious
before.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:19 -05:00
Chuck Lever 07e10308ee xprtrdma: Prevent leak of rpcrdma_rep objects
If a reply has been processed but the RPC is later retransmitted
anyway, the req->rl_reply field still contains the only pointer to
the old rpcrdma rep. When the next reply comes in, the reply handler
will stomp on the rl_reply field, leaking the old rep.

A trace event is added to capture such leaks.

This problem seems to be worsened by the restructuring of the RPC
Call path in v4.20. Fully addressing this issue will require at
least a re-architecture of the disconnect logic, which is not
appropriate during -rc.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:19 -05:00
Olga Kornievskaia 9aeaf8cfcb NFSv4.2 fix async copy reboot recovery
Original commit (e4648aa4f9 "NFS recover from destination server
reboot for copies") used memcmp() and then it was changed to use
nfs4_stateid_match_other() but that function returns opposite of
memcmp. As the result, recovery can't find the copy leading
to copy hanging.

Fixes: 80f4236886 ("NFSv4: Split out NFS v4.2 copy completion functions")
Fixes: cb7a8384dc ("NFS: Split out the body of nfs4_reclaim_open_state")
Signed-of-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:19 -05:00
Chuck Lever f85adb1bf5 xprtrdma: Don't leak freed MRs
Defensive clean up. Don't set frwr->fr_mr until we know that the
scatterlist allocation has succeeded.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever af65ed404c xprtrdma: Add documenting comment for rpcrdma_buffer_destroy
Make a note of the function's dependency on an earlier ib_drain_qp.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever 995d312a28 xprtrdma: Replace outdated comment for rpcrdma_ep_post
Since commit 7c8d9e7c88 ("xprtrdma: Move Receive posting to
Receive handler"), rpcrdma_ep_post is no longer responsible for
posting Receive buffers. Update the documenting comment to reflect
this change.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever e0f86bc4f9 xprtrdma: Update comments in frwr_op_send
Commit f287762308 ("xprtrdma: Chain Send to FastReg WRs") was
written before commit ce5b371782 ("xprtrdma: Replace all usage of
"frmr" with "frwr""), but was merged afterwards. Thus it still
refers to FRMR and MWs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever acf0a39f4f SUNRPC: Fix some kernel doc complaints
Clean up some warnings observed when building with "make W=1".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever dc5820bd21 SUNRPC: Simplify defining common RPC trace events
Clean up, no functional change is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever 5b2095d0ce NFS: Fix NFSv4 symbolic trace point output
These symbolic values were not being displayed in string form.
TRACE_DEFINE_ENUM was missing in many cases. It also turns out that
__print_symbolic wants an unsigned long in the first field...

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever 53b2c1cb9b xprtrdma: Trace mapping, alloc, and dereg failures
These are rare, but can be helpful at tracking down DMAR and other
problems.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever 395069fc37 xprtrdma: Add trace points for calls to transport switch methods
Name them "trace_xprtrdma_op_*" so they can be easily enabled as a
group. No trace point is added where the generic layer already has
observability.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever ba217ec64a xprtrdma: Relocate the xprtrdma_mr_map trace points
The mr_map trace points were capturing information about the previous
use of the MR rather than about the segment that was just mapped.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever aba1183179 xprtrdma: Clean up of xprtrdma chunk trace points
The chunk-related trace points capture nearly the same information
as the MR-related trace points.

Also, rename them so globbing can be used to enable or disable
these trace points more easily.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever 9bef848f44 xprtrdma: Remove unused fields from rpcrdma_ia
Clean up. The last use of these fields was in commit 173b8f49b3
("xprtrdma: Demote "connect" log messages") .

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever ddbb347f0c xprtrdma: Cull dprintk() call sites
Clean up: Remove dprintk() call sites that report rare or impossible
errors. Leave a few that display high-value low noise status
information.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever 92f4433e56 xprtrdma: Simplify locking that protects the rl_allreqs list
Clean up: There's little chance of contention between the use of
rb_lock and rb_reqslock, so merge the two. This avoids having to
take both in some (possibly future) cases.

Transport tear-down is already serialized, thus there is no need for
locking at all when destroying rpcrdma_reqs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever 236b0943d1 xprtrdma: Expose transport header errors
For better observability of parsing errors, return the error code
generated in the decoders to the upper layer consumer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:18 -05:00
Chuck Lever 889ee07f7e xprtrdma: Remove request_module from backchannel
Since commit ffe1f0df58 ("rpcrdma: Merge svcrdma and xprtrdma
modules into one"), the forward and backchannel components are part
of the same kernel module. A separate request_module() call in the
backchannel code is no longer necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:17 -05:00
Chuck Lever 15303d9ecd xprtrdma: Recognize XDRBUF_SPARSE_PAGES
Commit 431f6eb357 ("SUNRPC: Add a label for RPC calls that require
allocation on receive") didn't update similar logic in rpc_rdma.c.
I don't think this is a bug, per-se; the commit just adds more
careful checking for broken upper layer behavior.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:17 -05:00
Chuck Lever 0dfbb5f05e NFS: Make "port=" mount option optional for RDMA mounts
Having to specify "proto=rdma,port=20049" is cumbersome.

RFC 8267 Section 6.3 requires NFSv4 clients to use "the alternative
well-known port number", which is 20049. Make the use of the well-
known port number automatic, just as it is for NFS/TCP and port
2049.

For NFSv2/3, Section 4.2 allows clients to simply choose 20049 as
the default or use rpcbind. I don't know of an NFS/RDMA server
implementation that registers it's NFS/RDMA service with rpcbind,
so automatically choosing 20049 seems like the better choice. The
other widely-deployed NFS/RDMA client, Solaris, also uses 20049
as the default port.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:17 -05:00
Chuck Lever 0a93fbcb16 xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
Place the associated RPC transaction's XID in the upper 32 bits of
each RDMA segment's rdma_offset field. There are two reasons to do
this:

- The R_key only has 8 bits that are different from registration to
  registration. The XID adds more uniqueness to each RDMA segment to
  reduce the likelihood of a software bug on the server reading from
  or writing into memory it's not supposed to.

- On-the-wire RDMA Read and Write requests do not otherwise carry
  any identifier that matches them up to an RPC. The XID in the
  upper 32 bits will act as an eye-catcher in network captures.

Suggested-by: Tom Talpey <ttalpey@microsoft.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:17 -05:00
Chuck Lever 5f62412be3 xprtrdma: Remove rpcrdma_memreg_ops
Clean up: Now that there is only FRWR, there is no need for a memory
registration switch. The indirect calls to the memreg operations can
be replaced with faster direct calls.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:17 -05:00
Chuck Lever ba69cd122e xprtrdma: Remove support for FMR memory registration
FMR is not supported on most recent RDMA devices. It is also less
secure than FRWR because an FMR memory registration can expose
adjacent bytes to remote reading or writing. As discussed during the
RDMA BoF at LPC 2018, it is time to remove support for FMR in the
NFS/RDMA client stack.

Note that NFS/RDMA server-side uses either local memory registration
or FRWR. FMR is not used.

There are a few Infiniband/RoCE devices in the kernel tree that do
not appear to support MEM_MGT_EXTENSIONS (FRWR), and therefore will
not support client-side NFS/RDMA after this patch. These are:

 - mthca
 - qib
 - hns (RoCE)

Users of these devices can use NFS/TCP on IPoIB instead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:17 -05:00
Chuck Lever a78868497c xprtrdma: Reduce max_frwr_depth
Some devices advertise a large max_fast_reg_page_list_len
capability, but perform optimally when MRs are significantly smaller
than that depth -- probably when the MR itself is no larger than a
page.

By default, the RDMA R/W core API uses max_sge_rd as the maximum
page depth for MRs. For some devices, the value of max_sge_rd is
1, which is also not optimal. Thus, when max_sge_rd is larger than
1, use that value. Otherwise use the value of the
max_fast_reg_page_list_len attribute.

I've tested this with CX-3 Pro, FastLinq, and CX-5 devices. It
reproducibly improves the throughput of large I/Os by several
percent.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:17 -05:00
Chuck Lever 6946f82380 xprtrdma: Fix ri_max_segs and the result of ro_maxpages
With certain combinations of krb5i/p, MR size, and r/wsize, I/O can
fail with EMSGSIZE. This is because the calculated value of
ri_max_segs (the max number of MRs per RPC) exceeded
RPCRDMA_MAX_HDR_SEGS, which caused Read or Write list encoding to
walk off the end of the transport header.

Once that was addressed, the ro_maxpages result has to be corrected
to account for the number of MRs needed for Reply chunks, which is
2 MRs smaller than a normal Read or Write chunk.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:16 -05:00
Chuck Lever 0c0829bcf5 xprtrdma: Don't wake pending tasks until disconnect is done
Transport disconnect processing does a "wake pending tasks" at
various points.

Suppose an RPC Reply is being processed. The RPC task that Reply
goes with is waiting on the pending queue. If a disconnect wake-up
happens before reply processing is done, that reply, even if it is
good, is thrown away, and the RPC has to be sent again.

This window apparently does not exist for socket transports because
there is a lock held while a reply is being received which prevents
the wake-up call until after reply processing is done.

To resolve this, all RPC replies being processed on an RPC-over-RDMA
transport have to complete before pending tasks are awoken due to a
transport disconnect.

Callers that already hold the transport write lock may invoke
->ops->close directly. Others use a generic helper that schedules
a close when the write lock can be taken safely.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:16 -05:00
Chuck Lever 3d433ad812 xprtrdma: No qp_event disconnect
After thinking about this more, and auditing other kernel ULP imple-
mentations, I believe that a DISCONNECT cm_event will occur after a
fatal QP event. If that's the case, there's no need for an explicit
disconnect in the QP event handler.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:16 -05:00
Chuck Lever 6d2d0ee27c xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue
To address a connection-close ordering problem, we need the ability
to drain the RPC completions running on rpcrdma_receive_wq for just
one transport. Give each transport its own RPC completion workqueue,
and drain that workqueue when disconnecting the transport.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:16 -05:00
Chuck Lever 6ceea36890 xprtrdma: Refactor Receive accounting
Clean up: Divide the work cleanly:

- rpcrdma_wc_receive is responsible only for RDMA Receives
- rpcrdma_reply_handler is responsible only for RPC Replies
- the posted send and receive counts both belong in rpcrdma_ep

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:16 -05:00
Chuck Lever b674c4b4a1 xprtrdma: Ensure MRs are DMA-unmapped when posting LOCAL_INV fails
The recovery case in frwr_op_unmap_sync needs to DMA unmap each MR.
frwr_release_mr does not DMA-unmap, but the recycle worker does.

Fixes: 61da886bf7 ("xprtrdma: Explicitly resetting MRs is ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:16 -05:00
Chuck Lever e2f34e2671 xprtrdma: Yet another double DMA-unmap
While chasing yet another set of DMAR fault reports, I noticed that
the frwr recycler conflates whether or not an MR has been DMA
unmapped with frwr->fr_state. Actually the two have only an indirect
relationship. It's in fact impossible to guess reliably whether the
MR has been DMA unmapped based on its fr_state field, especially as
the surrounding code and its assumptions have changed over time.

A better approach is to track the DMA mapping status explicitly so
that the recycler is less brittle to unexpected situations, and
attempts to DMA-unmap a second time are prevented.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.20
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02 12:05:16 -05:00
Moni Shoua 2f1927b090 IB/core: Add advise_mr to the list of known ops
We need to add advise_mr to the list of operation setters on the ib_device
or otherwise callers to ib_set_device_ops() for advise_mr operation will
not have their callback registered.

When the advise_mr series was merged with the device ops series the
SET_DEVICE_OPS() was missed.

Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-02 09:40:34 -07:00
Leon Romanovsky ccffa54548 Revert "IB/mlx5: Fix long EEH recover time with NVMe offloads"
Longer term testing shows this patch didn't play well with MR cache and
caused to call traces during remove_mkeys().

This reverts commit bb7e22a8ab.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-02 09:40:34 -07:00
Yishai Hadas 7422edce73 IB/mlx5: Allow XRC INI usage via verbs in DEVX context
From device point of view both XRC target and initiator are XRC transport
type.

Fix to use the expected UID as was handled for the XRC target case to
allow its usage via verbs in DEVX context.

Fixes: 5aa3771ded ("IB/mlx5: Allow XRC usage via verbs in DEVX context")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-02 09:40:34 -07:00
Guo Ren f50fd2d852 csky: Add perf support for C-SKY
This adds basic perf support for all C-SKY CPUs. Hardware events are
only supported by 807/810/860.

Signed-off-by: Guo Ren <ren_guo@c-sky.com>
2019-01-02 22:17:11 +08:00
Adrian Hunter b25756df5b perf session: Add comment for perf_session__register_idle_thread()
Add a comment to perf_session__register_idle_thread() to bring attention to
a pitfall with the idle task thread structure. The pitfall is that there
should really be a 'struct thread' for the idle task of each cpu, but there
is only one that can have pid == tid == 0.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20181221120620.9659-9-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-02 11:05:06 -03:00
Adrian Hunter 256d92bc93 perf thread-stack: Fix thread stack processing for the idle task
perf creates a single 'struct thread' to represent the idle task. That
is because threads are identified by PID and TID, and the idle task
always has PID == TID == 0.

However, there are actually separate idle tasks for each CPU. That
creates a problem for thread stack processing which assumes that each
thread has a single stack, not one stack per CPU.

Fix that by passing through the CPU number, and in the case of the idle
"thread", pick the thread stack from an array based on the CPU number.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20181221120620.9659-8-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-01-02 11:03:17 -03:00