1
0
Fork 0
Commit Graph

61691 Commits (359f642700f2ff05d9c94cd9216c97af7b8e9553)

Author SHA1 Message Date
Greg Edwards 359f642700 block: move bio_integrity_{intervals,bytes} into blkdev.h
This allows bio_integrity_bytes() to be called from drivers instead of
open coding it.

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 15:49:41 -06:00
Jens Axboe eca53cb63f Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block
Pull NVMe updates from Christoph:

"Highlights:

 - massively improved tracepoints (Keith Busch)
 - support for larger inline data in the RDMA host and target
   (Steve Wise)
 - RDMA setup/teardown path fixes and refactor (Sagi Grimberg)
 - Command Supported and Effects log support for the NVMe target
   (Chaitanya Kulkarni)
 - buffered I/O support for the NVMe target (Chaitanya Kulkarni)

 plus the usual set of cleanups and small enhancements."

* 'nvme-4.19' of git://git.infradead.org/nvme:
  nvmet: don't use uuid_le type
  nvmet: check fileio lba range access boundaries
  nvmet: fix file discard return status
  nvme-rdma: centralize admin/io queue teardown sequence
  nvme-rdma: centralize controller setup sequence
  nvme-rdma: unquiesce queues when deleting the controller
  nvme-rdma: mark expected switch fall-through
  nvme: add disk name to trace events
  nvme: add controller name to trace events
  nvme: use hw qid in trace events
  nvme: cache struct nvme_ctrl reference to struct nvme_request
  nvmet-rdma: add an error flow for post_recv failures
  nvmet-rdma: add unlikely check in the fast path
  nvmet-rdma: support max(16KB, PAGE_SIZE) inline data
  nvme-rdma: support up to 4 segments of inline data
  nvmet: add buffered I/O support for file backed ns
  nvmet: add commands supported and effects log page
  nvme: move init of keep_alive work item to controller initialization
  nvme.h: resync with nvme-cli
2018-07-25 08:43:30 -06:00
Christoph Hellwig c55183c9aa block: unexport bio_clone_bioset
Now only used by the bounce code, so move it there and mark the function
static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:26 -06:00
Christoph Hellwig 071f52fbce block: remove bio_clone_kmalloc
Unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:22 -06:00
Revanth Rajashekar 40c6f9c28e nvme.h: resync with nvme-cli
Added some feature ids present in nvme-cli but not kernel.

Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:12 +02:00
Tejun Heo 636620b66d blkcg: Track DISCARD statistics and output them in cgroup io.stat
Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat.  Two
fields, dbytes and dios, to respectively count the total bytes and
number of discards are added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:23 -06:00
Michael Callahan bdca3c87fb block: Track DISCARD statistics and output them in stat and diskstat
Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats.  These are tracked with the same four stats as reads
and writes:

Number of discard ios completed.
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requests

This is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.

tj: Refreshed on top of v4.17 and other previous updates.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:22 -06:00
Michael Callahan ddcf35d397 block: Add and use op_stat_group() for indexing disk_stat fields.
Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should et updated.

In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.

Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function.  It's now
indexed by op_is_write().

tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:20 -06:00
Michael Callahan dbae2c5513 block: Define and use STAT_READ and STAT_WRITE
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:18 -06:00
Michael Callahan 59767fbd49 block: Add part_stat_read_accum to read across field entries.
Add a part_stat_read_accum macro to genhd.h to read and sum across
field entries.  For example to sum up the number read and write
sectors completed.  In addition to being ar reasonable cleanup by
itself this will make it easier to add new stat fields in the future.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:16 -06:00
Tejun Heo 3f289dcb4b block: make bdev_ops->rw_page() take a REQ_OP instead of bool
c11f0c0b5b ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @op with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.

Unfortunately, we want to track discards separately and @is_write
isn't enough information.  This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:14 -06:00
Vladimir Zapolskiy 05814a1037 block: remove blkdev_entry_to_request() macro
Remove blkdev_entry_to_request() macro, which remained unused through
the observable history, also note that it repeats list_entry_rq() macro
verbatim.

Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-13 08:12:26 -06:00
Josef Bacik d706751215 block: introduce blk-iolatency io controller
Current IO controllers for the block layer are less than ideal for our
use case.  The io.max controller is great at hard limiting, but it is
not work conserving.  This patch introduces io.latency.  You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets.  This makes use of
the rq-qos infrastructure and works much like the wbt stuff.  There are
a few differences from wbt

 - It's bio based, so the latency covers the whole block layer in addition to
   the actual io.
 - We will throttle all IO types that comes in here if we need to.
 - We use the mean latency over the 100ms window.  This is because writes can
   be particularly fast, which could give us a false sense of the impact of
   other workloads on our protected workload.
 - By default there's no throttling, we set the queue_depth to INT_MAX so that
   we can have as many outstanding bio's as we're allowed to.  Only at
   throttle time do we pay attention to the actual queue depth.
 - We backcharge cgroups for root cg issued IO and induce artificial
   delays in order to deal with cases like metadata only or swap heavy
   workloads.

In testing this has worked out relatively well.  Protected workloads
will throttle noisy workloads down to 1 io at time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).

Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group.  We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.

Another test we run is a slow memory allocator in the unprotected group.
Before this would eventually push us into swap and cause the whole box
to die and not recover at all.  With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik a79050434b blk-rq-qos: refactor out common elements of blk-wbt
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis.  Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Tejun Heo 2cf855837b memcontrol: schedule throttling if we are congested
Memory allocations can induce swapping via kswapd or direct reclaim.  If
we are having IO done for us by kswapd and don't actually go into direct
reclaim we may never get scheduled for throttling.  So instead check to
see if our cgroup is congested, and if so schedule the throttling.
Before we return to user space the throttling stuff will only throttle
if we actually required it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik d09d8df3a2 blkcg: add generic throttling mechanism
Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies.  The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources.  Do this by tracking throttling in a per-blkg
basis, and if we require throttling flag the task that it needs to check
before it returns to user space and possibly sleep there.

This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Tejun Heo 0d3bd88d54 swap,blkcg: issue swap io with the appropriate context
For backcharging we need to know who the page belongs to when swapping
it out.  We don't worry about things that do ->rw_page (zram etc) at the
moment, we're only worried about pages that actually go to a block
device.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik 0d1e0c7cd5 blk: introduce REQ_SWAP
Just like REQ_META, it's important to know the IO coming down is swap
in order to guard against potential IO priority inversion issues with
cgroups.  Add REQ_SWAP and use it for all swap IO, and add it to our
bio_issue_as_root_blkg helper.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik 903d23f0a3 blk-cgroup: allow controllers to output their own stats
blk-iolatency has a few stats that it would like to print out, and
instead of adding a bunch of crap to the generic code just provide a
helper so that controllers can add stuff to the stat line if they want
to.

Hide it behind a boot option since it changes the output of io.stat from
normal, and these stats are only interesting to developers.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik c7c98fd376 block: introduce bio_issue_as_root_blkg
Instead of forcing all file systems to get the right context on their
bio's, simply check for REQ_META to see if we need to issue as the root
blkg.  We don't want to force all bio's to have the root blkg associated
with them if REQ_META is set, as some controllers (blk-iolatency) need
to know who the originating cgroup is so it can backcharge them for the
work they are doing.  This helper will make sure that the controllers do
the proper thing wrt the IO priority and backcharging.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Josef Bacik 08e18eab0c block: add bi_blkg to the bio for cgroups
Currently io.low uses a bi_cg_private to stash its private data for the
blkg, however other blkcg policies may want to use this as well.  Since
we can get the private data out of the blkg, move this to bi_blkg in the
bio and make it generic, then we can use bio_associate_blkg() to attach
the blkg to the bio.

Theoretically we could simply replace the bi_css with this since we can
get to all the same information from the blkg, however you have to
lookup the blkg, so for example wbc_init_bio() would have to lookup and
possibly allocate the blkg for the css it was trying to attach to the
bio.  This could be problematic and result in us either not attaching
the css at all to the bio, or falling back to the root blkcg if we are
unable to allocate the corresponding blkg.

So for now do this, and in the future if possible we could just replace
the bi_css with bi_blkg and update the helpers to do the correct
translation.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Ming Lei 6e76871730 blk-mq: dequeue request one by one from sw queue if hctx is busy
It won't be efficient to dequeue request one by one from sw queue,
but we have to do that when queue is busy for better merge performance.

This patch takes the Exponential Weighted Moving Average(EWMA) to figure
out if queue is busy, then only dequeue request one by one from sw queue
when queue is busy.

Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Ming Lei 97889f9ac2 blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().

This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:

1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.

2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.

In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.

Fixes: 705cda97ee ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: Andrew Jones <drjones@redhat.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Ming Lei 5815839b3c blk-mq: introduce new lock for protecting hctx->dispatch_wait
Now hctx->lock is only acquired when adding hctx->dispatch_wait to
one wait queue, but not held when removing it from the wait queue.

IO hang can be observed easily if SCHED RESTART is disabled, that means
now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait().

This patch fixes the issue by introducing hctx->dispatch_wait_lock and
holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(),
since we need to avoid acquiring hctx->lock in irq context.

Fixes: eb619fdb2d ("blk-mq: fix issue with shared tag queue re-running")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche 6a5ac98465 block: Make struct request_queue smaller for CONFIG_BLK_DEV_ZONED=n
Exclude zoned block device members from struct request_queue for
CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
the code that uses these struct request_queue members if
CONFIG_BLK_DEV_ZONED != n.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche 7c8542b798 block: Inline blk_queue_nr_zones()
Since the implementation of blk_queue_nr_zones() is trivial and since
it only has a single caller, inline this function.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Bart Van Assche 6b1d83d274 block: Remove bdev_nr_zones()
Remove this function since it has no callers. This function was
introduced in commit 6cc77e9cb0 ("block: introduce zoned block
devices zone write locking").

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjorling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:52 -06:00
Linus Torvalds 6f27a64092 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:

 - Prevent an out-of-bounds access in mtrr_write()

 - Break a circular dependency in the new hyperv IPI acceleration code

 - Address the build breakage related to inline functions by enforcing
   gnu_inline and explicitly bringing native_save_fl() out of line,
   which also adds a set of _ARM_ARG macros which provide 32/64bit
   safety.

 - Initialize the shadow CR4 per cpu variable before using it.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mtrr: Don't copy out-of-bounds data in mtrr_write
  x86/hyper-v: Fix the circular dependency in IPI enlightenment
  x86/paravirt: Make native_save_fl() extern inline
  x86/asm: Add _ASM_ARG* constants for argument registers to <asm/asm.h>
  compiler-gcc.h: Add __attribute__((gnu_inline)) to all inline declarations
  x86/mm/32: Initialize the CR4 shadow before __flush_tlb_all()
2018-07-08 13:26:55 -07:00
Linus Torvalds 6fb2489d7f Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:

 - The hopefully final fix for the reported race problems in
   kthread_parkme(). The previous attempt still left a hole and was
   partially wrong.

 - Plug a race in the remote tick mechanism which triggers a warning
   about updates not being done correctly. That's a false positive if
   the race condition is hit as the remote CPU is idle. Plug it by
   checking the condition again when holding run queue lock.

 - Fix a bug in the utilization estimation of a run queue which causes
   the estimation to be 0 when a run queue is throttled.

 - Advance the global expiration of the period timer when the timer is
   restarted after a idle period. Otherwise the expiry time is stale and
   the timer fires prematurely.

 - Cure the drift between the bandwidth timer and the runqueue
   accounting, which leads to bogus throttling of runqueues

 - Place the call to cpufreq_update_util() correctly so the function
   will observe the correct number of running RT tasks and not a stale
   one.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kthread, sched/core: Fix kthread_parkme() (again...)
  sched/util_est: Fix util_est_dequeue() for throttled cfs_rq
  sched/fair: Advance global expiration when period timer is restarted
  sched/fair: Fix bandwidth timer clock drift condition
  sched/rt: Fix call to cpufreq_update_util()
  sched/nohz: Skip remote tick on idle task entirely
2018-07-08 12:41:23 -07:00
Linus Torvalds 97f4e14229 While cleaning out my INBOX, I found a few patches that were lost
in the noise. These are minor bug fixes and clean ups. Those include:
 
  - Avoiding a string overflow
  - Code that didn't match the comment (but should)
  - A small code optimization (use of a conditional)
  - Quieting printf warnings
  - Nuking unused code
  - Fixing function graph interrupt annotation
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWz7ARhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmMqAQDTS7uvFLRR603WXyOazX6Y7FeiYFWp
 MUUZjnbG9u0bawEAulW53AM0OL3EAAaZKtPi8VtsT+uktR1GIynXrp+yoww=
 =yQDv
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes and cleanups from Steven Rostedt:
 "While cleaning out my INBOX, I found a few patches that were lost in
  the noise. These are minor bug fixes and clean ups. Those include:

   - avoid a string overflow

   - code that didn't match the comment (but should)

   - a small code optimization (use of a conditional)

   - quiet printf warnings

   - nuke unused code

   - fix function graph interrupt annotation"

* tag 'trace-v4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix missing return symbol in function_graph output
  ftrace: Nuke clear_ftrace_function
  tracing: Use __printf markup to silence compiler
  tracing: Optimize trace_buffer_iter() logic
  tracing: Make create_filter() code match the comments
  tracing: Avoid string overflow
2018-07-05 19:29:07 -07:00
Yisheng Xie 5ccba64a56 ftrace: Nuke clear_ftrace_function
clear_ftrace_function is not used outside of ftrace.c and is not help to
use a function, so nuke it per Steve's suggestion.

Link: http://lkml.kernel.org/r/1517537689-34947-1-git-send-email-xieyisheng1@huawei.com

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-03 18:33:19 -04:00
Nick Desaulniers d03db2bc26 compiler-gcc.h: Add __attribute__((gnu_inline)) to all inline declarations
Functions marked extern inline do not emit an externally visible
function when the gnu89 C standard is used. Some KBUILD Makefiles
overwrite KBUILD_CFLAGS. This is an issue for GCC 5.1+ users as without
an explicit C standard specified, the default is gnu11. Since c99, the
semantics of extern inline have changed such that an externally visible
function is always emitted. This can lead to multiple definition errors
of extern inline functions at link time of compilation units whose build
files have removed an explicit C standard compiler flag for users of GCC
5.1+ or Clang.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@redhat.com
Cc: akataria@vmware.com
Cc: akpm@linux-foundation.org
Cc: andrea.parri@amarulasolutions.com
Cc: ard.biesheuvel@linaro.org
Cc: aryabinin@virtuozzo.com
Cc: astrachan@google.com
Cc: boris.ostrovsky@oracle.com
Cc: brijesh.singh@amd.com
Cc: caoj.fnst@cn.fujitsu.com
Cc: geert@linux-m68k.org
Cc: ghackmann@google.com
Cc: gregkh@linuxfoundation.org
Cc: jan.kiszka@siemens.com
Cc: jarkko.sakkinen@linux.intel.com
Cc: jpoimboe@redhat.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: kstewart@linuxfoundation.org
Cc: linux-efi@vger.kernel.org
Cc: linux-kbuild@vger.kernel.org
Cc: manojgupta@google.com
Cc: mawilcox@microsoft.com
Cc: michal.lkml@markovi.net
Cc: mjg59@google.com
Cc: mka@chromium.org
Cc: pombredanne@nexb.com
Cc: rientjes@google.com
Cc: rostedt@goodmis.org
Cc: sedat.dilek@gmail.com
Cc: thomas.lendacky@amd.com
Cc: tstellar@redhat.com
Cc: tweek@google.com
Cc: virtualization@lists.linux-foundation.org
Cc: will.deacon@arm.com
Cc: yamada.masahiro@socionext.com
Link: http://lkml.kernel.org/r/20180621162324.36656-2-ndesaulniers@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-03 10:56:27 +02:00
Peter Zijlstra 1cef1150ef kthread, sched/core: Fix kthread_parkme() (again...)
Gaurav reports that commit:

  85f1abe001 ("kthread, sched/wait: Fix kthread_parkme() completion issue")

isn't working for him. Because of the following race:

> controller Thread                               CPUHP Thread
> takedown_cpu
> kthread_park
> kthread_parkme
> Set KTHREAD_SHOULD_PARK
>                                                 smpboot_thread_fn
>                                                 set Task interruptible
>
>
> wake_up_process
>  if (!(p->state & state))
>                 goto out;
>
>                                                 Kthread_parkme
>                                                 SET TASK_PARKED
>                                                 schedule
>                                                 raw_spin_lock(&rq->lock)
> ttwu_remote
> waiting for __task_rq_lock
>                                                 context_switch
>
>                                                 finish_lock_switch
>
>
>
>                                                 Case TASK_PARKED
>                                                 kthread_park_complete
>
>
> SET Running

Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
handling is buggered because the TASK_DEAD thing is done with
preemption disabled, the current code can still complete early on
preemption :/

So basically revert that earlier fix and go with a variant of the
alternative mentioned in the commit. Promote TASK_PARKED to special
state to avoid the store-store issue on task->state leading to the
WARN in kthread_unpark() -> __kthread_bind().

But in addition, add wait_task_inactive() to kthread_park() to ensure
the task really is PARKED when we return from kthread_park(). This
avoids the whole kthread still gets migrated nonsense -- although it
would be really good to get this done differently.

Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 85f1abe001 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-03 09:17:30 +02:00
Linus Torvalds 4e33d7d479 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Verify netlink attributes properly in nf_queue, from Eric Dumazet.

 2) Need to bump memory lock rlimit for test_sockmap bpf test, from
    Yonghong Song.

 3) Fix VLAN handling in lan78xx driver, from Dave Stevenson.

 4) Fix uninitialized read in nf_log, from Jann Horn.

 5) Fix raw command length parsing in mlx5, from Alex Vesker.

 6) Cleanup loopback RDS connections upon netns deletion, from Sowmini
    Varadhan.

 7) Fix regressions in FIB rule matching during create, from Jason A.
    Donenfeld and Roopa Prabhu.

 8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren.

 9) More bpfilter build fixes/adjustments from Masahiro Yamada.

10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper
    Dangaard Brouer.

11) fib_tests.sh file permissions were broken, from Shuah Khan.

12) Make sure BH/preemption is disabled in data path of mac80211, from
    Denis Kenzior.

13) Don't ignore nla_parse_nested() return values in nl80211, from
    Johannes berg.

14) Properly account sock objects ot kmemcg, from Shakeel Butt.

15) Adjustments to setting bpf program permissions to read-only, from
    Daniel Borkmann.

16) TCP Fast Open key endianness was broken, it always took on the host
    endiannness. Whoops. Explicitly make it little endian. From Yuching
    Cheng.

17) Fix prefix route setting for link local addresses in ipv6, from
    David Ahern.

18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva.

19) Various bpf sockmap fixes, from John Fastabend.

20) Use after free for GRO with ESP, from Sabrina Dubroca.

21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from
    Eric Biggers.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
  qede: Adverstise software timestamp caps when PHC is not available.
  qed: Fix use of incorrect size in memcpy call.
  qed: Fix setting of incorrect eswitch mode.
  qed: Limit msix vectors in kdump kernel to the minimum required count.
  ipvlan: call dev_change_flags when ipvlan mode is reset
  ipv6: sr: fix passing wrong flags to crypto_alloc_shash()
  net: fix use-after-free in GRO with ESP
  tcp: prevent bogus FRTO undos with non-SACK flows
  bpf: sockhash, add release routine
  bpf: sockhash fix omitted bucket lock in sock_close
  bpf: sockmap, fix smap_list_map_remove when psock is in many maps
  bpf: sockmap, fix crash when ipv6 sock is added
  net: fib_rules: bring back rule_exists to match rule during add
  hv_netvsc: split sub-channel setup into async and sync
  net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
  atm: zatm: Fix potential Spectre v1
  s390/qeth: consistently re-enable device features
  s390/qeth: don't clobber buffer on async TX completion
  s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]
  s390/qeth: fix race when setting MAC address
  ...
2018-07-02 11:18:28 -07:00
Sabrina Dubroca 603d4cf8fe net: fix use-after-free in GRO with ESP
Since the addition of GRO for ESP, gro_receive can consume the skb and
return -EINPROGRESS. In that case, the lower layer GRO handler cannot
touch the skb anymore.

Commit 5f114163f2 ("net: Add a skb_gro_flush_final helper.") converted
some of the gro_receive handlers that can lead to ESP's gro_receive so
that they wouldn't access the skb when -EINPROGRESS is returned, but
missed other spots, mainly in tunneling protocols.

This patch finishes the conversion to using skb_gro_flush_final(), and
adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and
GUE.

Fixes: 5f114163f2 ("net: Add a skb_gro_flush_final helper.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-02 20:34:04 +09:00
Linus Torvalds d7563ca5bf Staging/IIO fixes for 4.18-rc3
Here are a few small staging and IIO driver fixes for 4.18-rc3.
 
 Nothing major or big, all just fixes for reported problems since
 4.18-rc1.  All of these have been in linux-next this week with no
 reported problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWziQ7w8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynwZACgjpcPVTEiA0W02s4B/GhmVt1NMeUAnjgeLDzY
 yRz6SX18lSmjHqCUXAk1
 =/2QW
 -----END PGP SIGNATURE-----

Merge tag 'staging-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging/IIO fixes from Greg KH:
 "Here are a few small staging and IIO driver fixes for 4.18-rc3.

  Nothing major or big, all just fixes for reported problems since
  4.18-rc1. All of these have been in linux-next this week with no
  reported problems"

* tag 'staging-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: android: ion: Return an ERR_PTR in ion_map_kernel
  staging: comedi: quatech_daqp_cs: fix no-op loop daqp_ao_insn_write()
  iio: imu: inv_mpu6050: Fix probe() failure on older ACPI based machines
  iio: buffer: fix the function signature to match implementation
  iio: mma8452: Fix ignoring MMA8452_INT_DRDY
  iio: tsl2x7x/tsl2772: avoid potential division by zero
  iio: pressure: bmp280: fix relative humidity unit
2018-07-01 12:20:20 -07:00
Linus Torvalds c2aee376cf USB fixes for 4.18-rc3
Here is a number of USB gadget and other driver fixes for 4.18-rc3.
 
 There's a bunch of them here, most of them being gadget driver and xhci
 host controller fixes for reported issues (as normal), but there are
 also some new device ids, and some fixes for the typec code.
 
 There is an acpi core patch in here that was acked by the acpi
 maintainer as it is needed for the typec fixes in order to properly
 solve a problem in that driver.
 
 All of these have been in linux-next this week with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWziS+Q8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yn8QQCfU8PnLo+uPoV+DU7Nm1486mHTP4YAoLU3Q0nE
 oRmnzA3TnxytluWgbC7M
 =H8MG
 -----END PGP SIGNATURE-----

Merge tag 'usb-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here is a number of USB gadget and other driver fixes for 4.18-rc3.

  There's a bunch of them here, most of them being gadget driver and
  xhci host controller fixes for reported issues (as normal), but there
  are also some new device ids, and some fixes for the typec code.

  There is an acpi core patch in here that was acked by the acpi
  maintainer as it is needed for the typec fixes in order to properly
  solve a problem in that driver.

  All of these have been in linux-next this week with no reported
  issues"

* tag 'usb-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (33 commits)
  usb: chipidea: host: fix disconnection detect issue
  usb: typec: tcpm: fix logbuffer index is wrong if _tcpm_log is re-entered
  typec: tcpm: Fix a msecs vs jiffies bug
  NFC: pn533: Fix wrong GFP flag usage
  usb: cdc_acm: Add quirk for Uniden UBC125 scanner
  staging/typec: fix tcpci_rt1711h build errors
  usb: typec: ucsi: Fix for incorrect status data issue
  usb: typec: ucsi: acpi: Workaround for cache mode issue
  acpi: Add helper for deactivating memory region
  usb: xhci: increase CRS timeout value
  usb: xhci: tegra: fix runtime PM error handling
  usb: xhci: remove the code build warning
  xhci: Fix kernel oops in trace_xhci_free_virt_device
  xhci: Fix perceived dead host due to runtime suspend race with event handler
  dwc2: gadget: Fix ISOC IN DDMA PID bitfield value calculation
  usb: gadget: dwc2: fix memory leak in gadget_init()
  usb: gadget: composite: fix delayed_status race condition when set_interface
  usb: dwc2: fix isoc split in transfer with no data
  usb: dwc2: alloc dma aligned buffer for isoc split in
  usb: dwc2: fix the incorrect bitmaps for the ports of multi_tt hub
  ...
2018-07-01 11:50:16 -07:00
David S. Miller 271b955e52 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-07-01

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) A bpf_fib_lookup() helper fix to change the API before freeze to
   return an encoding of the FIB lookup result and return the nexthop
   device index in the params struct (instead of device index as return
   code that we had before), from David.

2) Various BPF JIT fixes to address syzkaller fallout, that is, do not
   reject progs when set_memory_*() fails since it could still be RO.
   Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was
   an issue, and a memory leak in s390 JIT found during review, from
   Daniel.

3) Multiple fixes for sockmap/hash to address most of the syzkaller
   triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(),
   a missing sock_map_release() routine to hook up to callbacks, and a
   fix for an omitted bucket lock in sock_close(), from John.

4) Two bpftool fixes to remove duplicated error message on program load,
   and another one to close the libbpf object after program load. One
   additional fix for nfp driver's BPF offload to avoid stopping offload
   completely if replace of program failed, from Jakub.

5) Couple of BPF selftest fixes that bail out in some of the test
   scripts if the user does not have the right privileges, from Jeffrin.

6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set
   where we need to set the flag that some of the test cases are expected
   to fail, from Kleber.

7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF
   since it has no relation to it and lirc2 users often have configs
   without cgroups enabled and thus would not be able to use it, from Sean.

8) Fix a selftest failure in sockmap by removing a useless setrlimit()
   call that would set a too low limit where at the same time we are
   already including bpf_rlimit.h that does the job, from Yonghong.

9) Fix BPF selftest config with missing missing NET_SCHED, from Anders.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-01 09:27:44 +09:00
Linus Torvalds 22d3e0c36e Kbuild fixes for v4.18
- introduce __diag_* macros and suppress -Wattribute-alias warnings from GCC 8
 
 - fix stack protector test script for x86_64
 
 - fix line number handling in Kconfig
 
 - document that '#' starts a comment in Kconfig
 
 - handle P_SYMBOL property in dump debugging of Kconfig
 
 - correct help message of LD_DEAD_CODE_DATA_ELIMINATION
 
 - fix occasional segmentation faults in Kconfig
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbN450AAoJED2LAQed4NsGFdAP/0fc2NhkzQMvz1EBEc2n93LC
 FUXew75tsX2ZewssoLzb4Iepkb/mHU+fjhBaE65S+Xu2/6mNfId9a7HAtywvFyO2
 ZUQPXHjMHnLEPRKuzQy34uCy9/wWCiqi8rpWUsOEohmNIcLaF0vMZf5Ifod7wIr7
 pnix3b9Q+dY+l49TSsSv4MX7F9qs5fXRhEarcQ3jYEb3yRUEXgmli3hV1wRita/n
 tJhFDiIdJDeISDkgmHUuOhjFnv5Yf3WJTXi/ILZ2zvpGjjqNDAwxtyzGnPMShQEc
 fxk3/1nkg9h/ScVAaGavrYYmiiH8XsqWY2q6p52jTK3kD+yTXaVakPSmxw8UHImh
 aNWQutzMF8GYEsb+ld1ncsNrwfgd40mA25mEyb/ZPSw2IdNBrXtIVbw7XiBLi8eH
 recAlRN0MouzD7+sXafgtoKopqanQbB/rMqDO4ULfnVvZLWDmZVbfreCc+qrJtiJ
 mqydBMUVxrvB+qf5SHQ7WlDmXWHY1xQuxXzS0gRVGT14EsyD6yhC2D62pEHnB7uG
 zE1pGemOCzOlGY6nDAbtQVR1n5AAWEZYveZXUuFn+vuqR7ZtYxCFUFOS0u621zFI
 HMI9B81ifdNV2efT2VTVi6Tnnvn44sAXOYjaULX6566EyX0/mOL5CWZlTqn5SKOn
 PwNxc7ZeCylTbkZww2c4
 =oABr
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - introduce __diag_* macros and suppress -Wattribute-alias warnings
   from GCC 8

 - fix stack protector test script for x86_64

 - fix line number handling in Kconfig

 - document that '#' starts a comment in Kconfig

 - handle P_SYMBOL property in dump debugging of Kconfig

 - correct help message of LD_DEAD_CODE_DATA_ELIMINATION

 - fix occasional segmentation faults in Kconfig

* tag 'kbuild-fixes-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: loop boundary condition fix
  kbuild: reword help of LD_DEAD_CODE_DATA_ELIMINATION
  kconfig: handle P_SYMBOL in print_symbol()
  kconfig: document Kconfig source file comments
  kconfig: fix line numbers for if-entries in menu tree
  stack-protector: Fix test with 32-bit userland and CONFIG_64BIT=y
  powerpc: Remove -Wattribute-alias pragmas
  disable -Wattribute-alias warning for SYSCALL_DEFINEx()
  kbuild: add macro for controlling warnings to linux/compiler.h
2018-06-30 13:05:30 -07:00
Linus Torvalds e6e5bec43c for-linus-20180629
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAls2oVcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr89D/9PaJCtLpCEU1HdaohIDXDyJKBdtCllIsqJ
 YAJTlUkFRkMvsbdEA3U43rVRVlu7zxP3aRg79VRI+kdZ8y8XSiisAF/QUbG/Bwd3
 NuFqKj4Hm5xFTtLMKIUlumR2/exD4ZpPHKnWmD4idp1PZRoTUwxeIjqVv77L6Nfy
 c4/GVlz2t9buWxH/5PSa//rsT49xM6CLpyIwO2lFsNEk2HYE/uAWya5KZVJ1YKVw
 mSjUdyFtJ//lFK9lVU4YkdNF0d5w2PcEoCfNyPe0n9F43iITACo2om7rkSkTgciS
 izPAs1DkqNdWPta2twULt656WlUoGlwnWyFchU34N/I/pDiww4kdCQFfNIOi6qBW
 7rPQ4xpywiw/U93C2GLEeE09++T8+yO4KuELhIcN5+D3D2VGRIuaGcf4xeY0CX3A
 a335EZIxBGOfxyejknBDg3BsoUcLvejbANKD8ltis3zciGf/QyLt+FFqx25WCzkN
 N028ppml8WPSiqhSlFDMyfscO1dx/WSYh1q7ANrBiPPaEjFYmakzEDT0cZW4JsUh
 av4jqYt3cqrFIXyhRHJM2wjFymQv6aVnmLa4zOKCFbwm9KzMWUUFjc4R962T/sQK
 0g+eCSFGidii4JZ4ghOAQk2dqCTIZqV72pmLlpJBGu+SAIQxkh+29h0LOi5U3v1Y
 FnTtJ1JhCg==
 =0uqV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180629' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Small set of fixes for this series. Mostly just minor fixes, the only
  oddball in here is the sg change.

  The sg change came out of the stall fix for NVMe, where we added a
  mempool and limited us to a single page allocation. CONFIG_SG_DEBUG
  sort-of ruins that, since we'd need to account for that. That's
  actually a generic problem, since lots of drivers need to allocate SG
  lists. So this just removes support for CONFIG_SG_DEBUG, which I added
  back in 2007 and to my knowledge it was never useful.

  Anyway, outside of that, this pull contains:

   - clone of request with special payload fix (Bart)

   - drbd discard handling fix (Bart)

   - SATA blk-mq stall fix (me)

   - chunk size fix (Keith)

   - double free nvme rdma fix (Sagi)"

* tag 'for-linus-20180629' of git://git.kernel.dk/linux-block:
  sg: remove ->sg_magic member
  drbd: Fix drbd_request_prepare() discard handling
  blk-mq: don't queue more if we get a busy return
  block: Fix cloning of requests with a special payload
  nvme-rdma: fix possible double free of controller async event buffer
  block: Fix transfer when chunk sectors exceeds max
2018-06-30 10:47:46 -07:00
Daniel Borkmann 85782e037f bpf: undo prog rejection on read-only lock failure
Partially undo commit 9facc33687 ("bpf: reject any prog that failed
read-only lock") since it caused a regression, that is, syzkaller was
able to manage to cause a panic via fault injection deep in set_memory_ro()
path by letting an allocation fail: In x86's __change_page_attr_set_clr()
it was able to change the attributes of the primary mapping but not in
the alias mapping via cpa_process_alias(), so the second, inner call
to the __change_page_attr() via __change_page_attr_set_clr() had to split
a larger page and failed in the alloc_pages() with the artifically triggered
allocation error which is then propagated down to the call site.

Thus, for set_memory_ro() this means that it returned with an error, but
from debugging a probe_kernel_write() revealed EFAULT on that memory since
the primary mapping succeeded to get changed. Therefore the subsequent
hdr->locked = 0 reset triggered the panic as it was performed on read-only
memory, so call-site assumptions were infact wrong to assume that it would
either succeed /or/ not succeed at all since there's no such rollback in
set_memory_*() calls from partial change of mappings, in other words, we're
left in a state that is "half done". A later undo via set_memory_rw() is
succeeding though due to matching permissions on that part (aka due to the
try_preserve_large_page() succeeding). While reproducing locally with
explicitly triggering this error, the initial splitting only happens on
rare occasions and in real world it would additionally need oom conditions,
but that said, it could partially fail. Therefore, it is definitely wrong
to bail out on set_memory_ro() error and reject the program with the
set_memory_*() semantics we have today. Shouldn't have gone the extra mile
since no other user in tree today infact checks for any set_memory_*()
errors, e.g. neither module_enable_ro() / module_disable_ro() for module
RO/NX handling which is mostly default these days nor kprobes core with
alloc_insn_page() / free_insn_page() as examples that could be invoked long
after bootup and original 314beb9bca ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks") did neither when it got first introduced to BPF
so "improving" with bailing out was clearly not right when set_memory_*()
cannot handle it today.

Kees suggested that if set_memory_*() can fail, we should annotate it with
__must_check, and all callers need to deal with it gracefully given those
set_memory_*() markings aren't "advisory", but they're expected to actually
do what they say. This might be an option worth to move forward in future
but would at the same time require that set_memory_*() calls from supporting
archs are guaranteed to be "atomic" in that they provide rollback if part
of the range fails, once that happened, the transition from RW -> RO could
be made more robust that way, while subsequent RO -> RW transition /must/
continue guaranteeing to always succeed the undo part.

Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
Fixes: 9facc33687 ("bpf: reject any prog that failed read-only lock")
Cc: Laura Abbott <labbott@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-29 10:47:35 -07:00
Jens Axboe 9544bc5347 sg: remove ->sg_magic member
This was introduced more than a decade ago when sg chaining was
added, but we never really caught anything with it. The scatterlist
entry size can be critical, since drivers allocate it, so remove
the magic member. Recently it's been triggering allocation stalls
and failures in NVMe.

Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-29 08:48:06 -06:00
Linus Torvalds 5e4e8c55c9 Power management fixes for 4.18-rc3
- Fix the initialization time error handling in the recently added
    Kryo cpufreq driver (Dan Carpenter).
 
  - Fix up the recently added coverage of performance states in the
    generic power domains (genpd) framework (Viresh Kumar).
 
  - Add missing documentation of the new hwp_dynamic_boost sysfs knob
    in the intel_pstate driver (Rafael Wysocki).
 
  - Fix incorrect sysfs path in the intel_pstate driver documentation
    (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJbNefKAAoJEILEb/54YlRxOgoQALdhhtDm7fDHEh+TJhN4Y8X8
 9PfvRwLQEypyQi2uBl1dQHXjCjt6IYhBzHXN8nMgxPBHQDsMqMmk+lQa5KB2RazM
 xXdMTb6nO4Rd333rsm6hSN+hRFGIOnzFYLmEMMV7nYNIGK6y3Xfo+XFlIBpoLE/h
 mHh4A75mFDIf5bABx+mYVhCfQ4ADwDXqdadkdYKVN4RwmnbLw80dV1FrfBGF+JnS
 9oUPkdueG13q4QA3DveAxtTrO3X7DnlMXbOYDrNoQia2rHKtxzy3ubkTAsbvWn91
 AuEWYWGTbutdBY+ga7smLBGzXUA7iIjNQfvMNZiqB0CGSpD0TzdVXJ9/+h0w+ru7
 Dkw/3ytV7k8LLFO3wZvIc6DQ7Hsq2PxL/Rbm0WNvfv6JhdVmiw4tuP2F8TV9uv28
 ej1w2n1NGUM0Yfi7wL7AyICZMOiwzxZgBpY63AGpk7mNvhZ7oqHzC/ENvJ3l6SLb
 lCviClAeBM+c/VOEXjleMtmf7VSRfvRhrKAl0dL7xLqbW2BrTw3vCmpuYKqMIlQB
 5rpCslgiOqufBvHHdca7UU+9iwuaI66nD6o4KV6VnB1a01THs7D0kgvV0Wj+Xocb
 xdKQ7TbzI6VdYZJ4pPM2ub1rb+fWnQ7d/ASNy/O3C3lV3FYfxPw82idBdfdRq3sa
 0kIEAQLHJIC1TrrZzNsg
 =842O
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix up recently added features (the Kryo cpufreq driver and
  performance states coverage in the generic power domains framework),
  add missing documentation for a recently added sysfs knob in the
  intel_pstate driver and fix an error in its documentation.

  Specifics:

   - Fix the initialization time error handling in the recently added
     Kryo cpufreq driver (Dan Carpenter).

   - Fix up the recently added coverage of performance states in the
     generic power domains (genpd) framework (Viresh Kumar).

   - Add missing documentation of the new hwp_dynamic_boost sysfs knob
     in the intel_pstate driver (Rafael Wysocki).

   - Fix incorrect sysfs path in the intel_pstate driver documentation
     (Rafael Wysocki)"

* tag 'pm-4.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Documentation: intel_pstate: Describe hwp_dynamic_boost sysfs knob
  Documentation: admin-guide: intel_pstate: Fix sysfs path
  PM / Domains: Rename opp_node to np
  PM / Domains: Fix return value of of_genpd_opp_to_performance_state()
  cpufreq: qcom-kryo: Fix error handling in probe()
2018-06-29 07:14:41 -07:00
Linus Torvalds ea5f39f2f9 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "7 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  proc: add Alexey to MAINTAINERS
  kasan: depend on CONFIG_SLUB_DEBUG
  include/linux/dax.h: dax_iomap_fault() returns vm_fault_t
  x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
  slub: fix failure when we delete and create a slab cache
  Revert mm/vmstat.c: fix vmstat_update() preemption BUG
  lib/percpu_ida.c: don't do alloc from per-CPU list if there is none
2018-06-28 11:42:56 -07:00
Souptick Joarder f77bc3a82c include/linux/dax.h: dax_iomap_fault() returns vm_fault_t
Commit 1c8f422059 ("mm: change return type to vm_fault_t") missed a
conversion.  It's not a big problem at present because mainline is still
using

	typedef int vm_fault_t;

Fixes: 1c8f422059 ("mm: change return type to vm_fault_t")
Link: http://lkml.kernel.org/r/20180620172046.GA27894@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 11:16:44 -07:00
Mikulas Patocka d50d82faa0 slub: fix failure when we delete and create a slab cache
In kernel 4.17 I removed some code from dm-bufio that did slab cache
merging (commit 21bb13276768: "dm bufio: remove code that merges slab
caches") - both slab and slub support merging caches with identical
attributes, so dm-bufio now just calls kmem_cache_create and relies on
implicit merging.

This uncovered a bug in the slub subsystem - if we delete a cache and
immediatelly create another cache with the same attributes, it fails
because of duplicate filename in /sys/kernel/slab/.  The slub subsystem
offloads freeing the cache to a workqueue - and if we create the new
cache before the workqueue runs, it complains because of duplicate
filename in sysfs.

This patch fixes the bug by moving the call of kobject_del from
sysfs_slab_remove_workfn to shutdown_cache.  kobject_del must be called
while we hold slab_mutex - so that the sysfs entry is deleted before a
cache with the same attributes could be created.

Running device-mapper-test-suite with:

  dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

triggered:

  Buffer I/O error on dev dm-0, logical block 1572848, async page read
  device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
  device-mapper: thin: 253:1: aborting current metadata transaction
  sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
  CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
  Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
  Workqueue: dm-thin do_worker [dm_thin_pool]
  Call Trace:
   dump_stack+0x5a/0x73
   sysfs_warn_dup+0x58/0x70
   sysfs_create_dir_ns+0x77/0x80
   kobject_add_internal+0xba/0x2e0
   kobject_init_and_add+0x70/0xb0
   sysfs_slab_add+0xb1/0x250
   __kmem_cache_create+0x116/0x150
   create_cache+0xd9/0x1f0
   kmem_cache_create_usercopy+0x1c1/0x250
   kmem_cache_create+0x18/0x20
   dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
   dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
   __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
   dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
   metadata_operation_failed+0x59/0x100 [dm_thin_pool]
   alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
   process_cell+0x2a3/0x550 [dm_thin_pool]
   do_worker+0x28d/0x8f0 [dm_thin_pool]
   process_one_work+0x171/0x370
   worker_thread+0x49/0x3f0
   kthread+0xf8/0x130
   ret_from_fork+0x35/0x40
  kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
  kmem_cache_create(dm_bufio_buffer-16) failed with error -17

Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 11:16:44 -07:00
Linus Torvalds a11e1d432b Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL
The poll() changes were not well thought out, and completely
unexplained.  They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead.  That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case.  The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
  individually because that would just be unnecessarily messy  - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 10:40:47 -07:00
David S. Miller 04c6faa175 mlx5-fixes-2018-06-26
Fixes for mlx5 core and netdev driver:
 
 Two fixes from Alex Vesker to address command interface issues
  - Race in command interface polling mode
  - Incorrect raw command length parsing
 
 From Shay Agroskin, Fix wrong size allocation for QoS ETC TC regitster.
 
 From Or Gerlitz and Eli Cohin, Address backward compatability issues for when
 Eswitch capability is not advertised for the PF host driver
     - Fix required capability for manipulating MPFS
     - E-Switch, Disallow vlan/spoofcheck setup if not being esw manager
     - Avoid dealing with vport IB/eth representors if not being e-switch manager
     - E-Switch, Avoid setup attempt if not being e-switch manager
     - Don't attempt to dereference the ppriv struct if not being eswitch manager
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJbMtMEAAoJEEg/ir3gV/o+NjAIAMGrerpwg8ADBj+b9tSWm4WV
 2yAJ561kBObwhA+uDJtH7mGUO3+AnkcWz9vynGqFdmkOikUcbpPkBb9D+rmFbkX2
 E585pwR3pH7lEzYEG4xO6SwuQcQ4OytFNxz94AT6CgNEXqrmbrD7A5Vsgk265yZq
 pJzL1OVfkXKOtb2x5PpCOh19/28OxAzyMQfoklsE2Wn7j8/2RWX0UUDxuF8jS+He
 9loaurT4Fsfo5JYE+o+k38knHFBkdTUZBD9/bZrtaMcrD68bZdJTpZm6eYwRXW3S
 7J88SmH/xTy74f1KY4qf0JOTxnaWtm/r4YaCXf1QD05W2/U9FQpIW1ipMKH51vk=
 =te2H
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-fixes-2018-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-fixes-2018-06-26

Fixes for mlx5 core and netdev driver:

Two fixes from Alex Vesker to address command interface issues
 - Race in command interface polling mode
 - Incorrect raw command length parsing

From Shay Agroskin, Fix wrong size allocation for QoS ETC TC regitster.

From Or Gerlitz and Eli Cohin, Address backward compatability issues for when
Eswitch capability is not advertised for the PF host driver
    - Fix required capability for manipulating MPFS
    - E-Switch, Disallow vlan/spoofcheck setup if not being esw manager
    - Avoid dealing with vport IB/eth representors if not being e-switch manager
    - E-Switch, Avoid setup attempt if not being e-switch manager
    - Don't attempt to dereference the ppriv struct if not being eswitch manager
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28 16:21:35 +09:00
Linus Torvalds c92067ae06 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:

 - the main change is a fix for my brain-dead patch to PS/2 button
   reporting for some protocols that made it in 4.17

 - there is a new driver for Spreadtum vibrator that I intended to send
   during merge window but ended up not sending the 2nd pull request.
   Given that this is a brand new driver we should not see regressions
   here

 - a fixup to Elantech PS/2 driver to avoid decoding errors on Thinkpad
   P52

 - addition of few more ACPI IDs for Silead and Elan drivers

 - RMI4 is switched to using IRQ domain code instead of rolling its own
   implementation

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: psmouse - fix button reporting for basic protocols
  Input: xpad - fix GPD Win 2 controller name
  Input: elan_i2c_smbus - fix more potential stack buffer overflows
  Input: elan_i2c - add ELAN0618 (Lenovo v330 15IKB) ACPI ID
  Input: elantech - fix V4 report decoding for module with middle key
  Input: elantech - enable middle button of touchpads on ThinkPad P52
  Input: do not assign new tracking ID when changing tool type
  Input: make input_report_slot_state() return boolean
  Input: synaptics-rmi4 - fix axis-swap behavior
  Input: synaptics-rmi4 - fix the error return code in rmi_probe_interrupts()
  Input: synaptics-rmi4 - convert irq distribution to irq_domain
  Input: silead - add MSSL0002 ACPI HID
  Input: goldfish_events - fix checkpatch warnings
  Input: add Spreadtrum vibrator driver
2018-06-27 09:16:53 -07:00
Or Gerlitz 0efc856249 net/mlx5: E-Switch, Avoid setup attempt if not being e-switch manager
In smartnic env, the host (PF) driver might not be an e-switch
manager, hence the FW will err on driver attempts to deal with
setting/unsetting the eswitch and as a result the overall setup
of sriov will fail.

Fix that by avoiding the operation if e-switch management is not
allowed for this driver instance. While here, move to use the
correct name for the esw manager capability name.

Fixes: 81848731ff ('net/mlx5: E-Switch, Add SR-IOV (FDB) support')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Guy Kushnir <guyk@mellanox.com>
Reviewed-by: Eli Cohen <eli@melloanox.com>
Tested-by: Eli Cohen <eli@melloanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-26 15:26:29 -07:00