Commit graph

588522 commits

Author SHA1 Message Date
Michal Hocko 855b018325 oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow
standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue
the same task multiple times - e.g.  by sacrificing the same child
multiple times.

This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
which is set in oom_kill_process atomically and oom reaper is disabled
if the flag was already set.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Michal Hocko 03049269de mm, oom_reaper: implement OOM victims queuing
wake_oom_reaper has allowed only 1 oom victim to be queued.  The main
reason for that was the simplicity as other solutions would require some
way of queuing.  The current approach is racy and that was deemed
sufficient as the oom_reaper is considered a best effort approach to
help with oom handling when the OOM victim cannot terminate in a
reasonable time.  The race could lead to missing an oom victim which can
get stuck

out_of_memory
  wake_oom_reaper
    cmpxchg // OK
    			oom_reaper
			  oom_reap_task
			    __oom_reap_task
oom_victim terminates
			      atomic_inc_not_zero // fail
out_of_memory
  wake_oom_reaper
    cmpxchg // fails
			  task_to_reap = NULL

This race requires 2 OOM invocations in a short time period which is not
very likely but certainly not impossible.  E.g.  the original victim
might have not released a lot of memory for some reason.

The situation would improve considerably if wake_oom_reaper used a more
robust queuing.  This is what this patch implements.  This means adding
oom_reaper_list list_head into task_struct (eat a hole before embeded
thread_struct for that purpose) and a oom_reaper_lock spinlock for
queuing synchronization.  wake_oom_reaper will then add the task on the
queue and oom_reaper will dequeue it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Michal Hocko bc448e897b mm, oom_reaper: report success/failure
Inform about the successful/failed oom_reaper attempts and dump all the
held locks to tell us more who is blocking the progress.

[akpm@linux-foundation.org: fix CONFIG_MMU=n build]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Michal Hocko 36324a990c oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.

The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp.  PF_EXITING check.  We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everyhing reclaimable has been torn down already.

This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct.  This is
not guaranteed now but further steps should make sure that mmap_sem for
write should be blocked killable which will help to reduce such a lock
contention.  This is not done by this patch.

Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Michal Hocko aac4536355 mm, oom: introduce oom reaper
This patch (of 5):

This is based on the idea from Mel Gorman discussed during LSFMM 2015
and independently brought up by Oleg Nesterov.

The OOM killer currently allows to kill only a single task in a good
hope that the task will terminate in a reasonable time and frees up its
memory.  Such a task (oom victim) will get an access to memory reserves
via mark_oom_victim to allow a forward progress should there be a need
for additional memory during exit path.

It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for an event (e.g.
lock) which is blocked by another task looping in the page allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory owned
by the oom victim under an assumption that such a memory won't be needed
when its owner is killed and kicked from the userspace anyway.  There is
one notable exception to this, though, if the OOM victim was in the
process of coredumping the result would be incomplete.  This is
considered a reasonable constrain because the overall system health is
more important than debugability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g.  allocating memory).  Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt.  basically any
lock blocking forward progress as described above.  In order to prevent
from blocking on the lock without any forward progress we are using only
a trylock and retry 10 times with a short sleep in between.  Users of
mmap_sem which need it for write should be carefully reviewed to use
_killable waiting as much as possible and reduce allocations requests
done with the lock held to absolute minimum to reduce the risk even
further.

The API between oom killer and oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only
NULL->mm transition and oom_reaper clear this atomically once it is done
with the work.  This means that only a single mm_struct can be reaped at
the time.  As the operation is potentially disruptive we are trying to
limit it to the ncessary minimum and the reaper blocks any updates while
it operates on an mm.  mm_struct is pinned by mm_count to allow parallel
exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).

Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Andrew Morton 69b27baf00 sched: add schedule_timeout_idle()
This will be needed in the patch "mm, oom: introduce oom reaper".

Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Tony Luck 2d5ae5c2c7 [IA64] Enable preadv2 and pwritev2 syscalls for ia64
New system calls added in:
      f17d8b3545
      vfs: vfs: Define new syscalls preadv2,pwritev2

Signed-off-by: Tony Luck <tony.luck@intel.com>
2016-03-25 14:37:32 -07:00
Rafael J. Wysocki 8e653b6544 Fix permissions of drivers/power/avs/rockchip-io-domain.c
The permissions of this file were modified by commit (f447671b9e PM /
AVS: rockchip-io: add io selectors and supplies for rk3399) by mistake,
so fix them.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-25 22:36:17 +01:00
Geliang Tang 5ee61e95b6 libceph: use KMEM_CACHE macro
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:57 +01:00
Geliang Tang 99ec269779 ceph: use kmem_cache_zalloc
Use kmem_cache_zalloc() instead of kmem_cache_alloc() with flag GFP_ZERO.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:56 +01:00
Geliang Tang 03d9440676 rbd: use KMEM_CACHE macro
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:56 +01:00
Yan, Zheng 200fd27c8f ceph: use lookup request to revalidate dentry
If dentry has no lease, ceph_d_revalidate() previously return 0.
This causes VFS to invalidate the dentry and create a new dentry
for later lookup. Invalidating a dentry also detach any underneath
mount points. So mount point inside cephfs can disapear mystically
(even the mount point is not modified by other hosts).

The fix is using lookup request to revalidate dentry without lease.
This can partly solve the mount points disapear issue (as long as
the mount point is not modified by other hosts)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:56 +01:00
Yan, Zheng 641235d8f8 ceph: kill ceph_get_dentry_parent_inode()
use vfs helper dget_parent() instead

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:55 +01:00
Yan, Zheng 315f240880 ceph: fix security xattr deadlock
When security is enabled, security module can call filesystem's
getxattr/setxattr callbacks during d_instantiate(). For cephfs,
d_instantiate() is usually called by MDS' dispatch thread, while
handling MDS reply. If the MDS reply does not include xattrs and
corresponding caps, getxattr/setxattr need to send a new request
to MDS and waits for the reply. This makes MDS' dispatch sleep,
nobody handles later MDS replies.

The fix is make sure lookup/atomic_open reply include xattrs and
corresponding caps. So getxattr can be handled by cached xattrs.
This requires some modification to both MDS and request message.
(Client tells MDS what caps it wants; MDS encodes proper caps in
the reply)

Smack security module may call setxattr during d_instantiate().
Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
to us. So just make setxattr return error when called by MDS'
dispatch thread.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:55 +01:00
Yan, Zheng 29dccfa5af ceph: don't request vxattrs from MDS
It's uselese because MDS reply does not carry any vxattr.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:55 +01:00
Yan, Zheng 132ca7e1de ceph: fix mounting same fs multiple times
Now __ceph_open_session() only accepts closed client. An opened
client will tigger BUG_ON().

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:54 +01:00
Yan, Zheng 4531126753 ceph: remove unnecessary NULL check
If page->mapping is NULL, releasepage() callback does not get called.
Remove the unnecessary NULL check to make static code analysis tool
happy

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:54 +01:00
Yan, Zheng a3d714c336 ceph: avoid updating directory inode's i_size accidentally
Directory inode's i_size is used by readdir cache.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:53 +01:00
Yan, Zheng af5e5eb574 ceph: fix race during filling readdir cache
Readdir cache uses page cache to save dentry pointers. When adding
dentry pointers to middle of a page, we need to make sure the page
already exists. Otherwise the beginning part of the page will be
invalid pointers.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:53 +01:00
Ilya Dryomov 89f081730c libceph: use sizeof_footer() more
Don't open-code sizeof_footer() in read_partial_message() and
ceph_msg_revoke().  Also, after switching to sizeof_footer(), it's now
possible to use con_out_kvec_add() in prepare_write_message_footer().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-03-25 18:51:53 +01:00
Ilya Dryomov 34b759b4a2 ceph: kill ceph_empty_snapc
ceph_empty_snapc->num_snaps == 0 at all times.  Passing such a snapc to
ceph_osdc_alloc_request() (possibly through ceph_osdc_new_request()) is
equivalent to passing NULL, as ceph_osdc_alloc_request() uses it only
for sizing the request message.

Further, in all four cases the subsequent ceph_osdc_build_request() is
passed NULL for snapc, meaning that 0 is encoded for seq and num_snaps
and making ceph_empty_snapc entirely useless.  The two cases where it
actually mattered were removed in commits 8605609049 ("ceph: avoid
sending unnessesary FLUSHSNAP message") and 23078637e0 ("ceph: fix
queuing inode to mdsdir's snaprealm").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by:  Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:52 +01:00
Anton Protopopov ce4355932a ceph: fix a wrong comparison
A negative value rc compared to the positive value ENOENT in the
finish_read() function.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:52 +01:00
Deepa Dinamani 8bbd47140c ceph: replace CURRENT_TIME by current_fs_time()
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:52 +01:00
Yan, Zheng 5b64640cf6 ceph: scattered page writeback
This patch makes ceph_writepages_start() try using single OSD request
to write all dirty pages within a strip unit. When a nonconsecutive
dirty page is found, ceph_writepages_start() tries starting a new write
operation to existing OSD request. If it succeeds, it uses the new
operation to writeback the dirty page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:51 +01:00
Yan, Zheng 2c63f49a72 libceph: add helper that duplicates last extent operation
This helper duplicates last extent operation in OSD request, then
adjusts the new extent operation's offset and length. The helper
is for scatterd page writeback, which adds nonconsecutive dirty
pages to single OSD request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov 3f1af42ad0 libceph: enable large, variable-sized OSD requests
Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests.  The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.

r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once.  ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc().  req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov 9e767adbd3 libceph: osdc->req_mempool should be backed by a slab pool
ceph_osd_request_cache was introduced a long time ago.  Also, osd_req
is about to get a flexible array member, which ceph_osd_request_cache
is going to be aware of.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:43 +01:00
Ilya Dryomov ae458f5a17 libceph: make r_request msg_size calculation clearer
Although msg_size is calculated correctly, the terms are grouped in
a misleading way - snaps appears to not have room for a u32 length.
Move calculation closer to its use and regroup terms.

No functional change.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:42 +01:00
Yan, Zheng 7665d85b73 libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op
This avoids defining large array of r_reply_op_{len,result} in
in struct ceph_osd_request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:42 +01:00
Ilya Dryomov de2aa102ea libceph: rename ceph_osd_req_op::payload_len to indata_len
Follow userspace nomenclature on this - the next commit adds
outdata_len.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:41 +01:00
Yan, Zheng a587d71b0a ceph: remove useless BUG_ON
ceph_osdc_start_request() never return -EOLDSNAP

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:41 +01:00
Yan, Zheng 133e91566c ceph: don't enable rbytes mount option by default
When rbytes mount option is enabled, directory size is recursive
size. Recursive size is not updated instantly. This can cause
directory size to change between successive stat(1)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:41 +01:00
Yan, Zheng d1eee0c0e1 ceph: encode ctime in cap message
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:40 +01:00
Ilya Dryomov b5d91704f5 libceph: behave in mon_fault() if cur_mon < 0
This can happen if __close_session() in ceph_monc_stop() races with
a connection reset.  We need to ignore such faults, otherwise it's
likely we would take !hunting, call __schedule_delayed() and end up
with delayed_work() executing on invalid memory, among other things.

The (two!) con->private tests are useless, as nothing ever clears
con->private.  Nuke them.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:40 +01:00
Ilya Dryomov bee3a37c47 libceph: reschedule tick in mon_fault()
Doing __schedule_delayed() in the hunting branch is pointless, as the
tick will have already been scheduled by then.

What we need to do instead is *reschedule* it in the !hunting branch,
after reopen_session() changes hunt_mult, which affects the delay.
This helps with spacing out connection attempts and avoiding things
like two back-to-back attempts followed by a longer period of waiting
around.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:40 +01:00
Ilya Dryomov 1752b50ca2 libceph: introduce and switch to reopen_session()
hunting is now set in __open_session() and cleared in finish_hunting(),
instead of all around.  The "session lost" message is printed not only
on connection resets, but also on keepalive timeouts.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov 168b9090c7 libceph: monc hunt rate is 3s with backoff up to 30s
Unless we are in the process of setting up a client (i.e. connecting to
the monitor cluster for the first time), apply a backoff: every time we
want to reopen a session, increase our timeout by a multiple (currently
2); when we complete the connection, reduce that multipler by 50%.

Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov 58d81b1294 libceph: monc ping rate is 10s
Split ping interval and ping timeout: ping interval is 10s; keepalive
timeout is 30s.

Make monc_ping_timeout a constant while at it - it's not actually
exported as a mount option (and the rest of tick-related settings won't
be either), so it's got no place in ceph_options.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:39 +01:00
Ilya Dryomov 0e04dc26cc libceph: pick a different monitor when reconnecting
Don't try to reconnect to the same monitor when we fail to establish
a session within a timeout or it's lost.

For that, pick_new_mon() needs to see the old value of cur_mon, so
don't clear it in __close_session() - all calls to __close_session()
but one are followed by __open_session() anyway.  __open_session() is
only called when a new session needs to be established, so the "already
open?" branch, which is now in the way, is simply dropped.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00
Ilya Dryomov 82dcabad75 libceph: revamp subs code, switch to SUBSCRIBE2 protocol
It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime".  To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs.  Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.

Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0).  SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).

Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
in before we validate the epoch and successfully install the new map
can mess up mon_client sub state.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00
Ilya Dryomov 0f9af169a1 libceph: decouple hunting and subs management
Coupling hunting state with subscribe state is not a good idea.  Clear
hunting when we complete the authentication handshake.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00
Ilya Dryomov 02ac956c42 libceph: move debugfs initialization into __ceph_open_session()
Our debugfs dir name is a concatenation of cluster fsid and client
unique ID ("global_id").  It used to be the case that we learned
global_id first, nowadays we always learn fsid first - the monmap is
sent before any auth replies are.  ceph_debugfs_client_init() call in
ceph_monc_handle_map() is therefore never executed and can be removed.

Its counterpart in handle_auth_reply() doesn't really belong there
either: having to do monc->client and unlocking early to work around
lockdep is a testament to that.  Move it into __ceph_open_session(),
where it can be called unconditionally.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:37 +01:00
Linus Torvalds 1701f68040 Revert "ppdev: use new parport device model"
This reverts commit e7223f1860.

It causes problems when a ppdev tries to register before the parport
driver has been registered with the device model. That will trigger the

        BUG_ON(!drv->bus->p);

at drivers/base/driver.c:153. The call chain is

  kernel_init ->
    kernel_init_freeable ->
      do_one_initcall ->
        ppdev_init ->
          __parport_register_driver ->
            driver_register *BOOM*

Reported-by: kernel test robot <fengguang.wu@intel.com>
Reported-by: Ross Zwisler <zwisler@gmail.com>
Reported-by: Petr Mladek <pmladek@suse.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 09:02:13 -07:00
Linus Torvalds 3b3b3bd977 IEEE 1394 subsystem patch:
- Occurrences of timeval were supposed to be eliminated last round,
     now remove a last forgotten one.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJW9PnoAAoJEHnzb7JUXXnQlBAQAOZELI61TS2XAk+94EvxODgY
 FrTOzZFGDUlXlkdyI7Smlla1D6LwV47FCRpZYTuf4z1xD5+nT9RcvxjiaAGUZxmJ
 swhdei74LPRIW7OfVtODz7kLsNLBJM1Le6+kVfugxDPxaOMFn7oqn7TsLakg005l
 gyzhP7NnZm15IuUnGpjjZgx1utNvVeMZ+0GdhUfTzVHPyHzYh0iquYC8yCnW2AfT
 Z9PN5j5XKdE/sJDTtWwq2hvSaDfUr6cybNkk+jYGIn9qxtTdUebg+ZddoU0aLo6L
 DsJPUdAQ2mo9rrjJchR5v+jbbZ2AYDFFUz8GfJceUvc2dcllm7WQsYQO4AsRM0A7
 YllY8xzkK11Mid7ZxoHpj6F8c6ZJztPZ2qT/H58KgGngkn4frkPCGMkverTRF2Yp
 D/bxQ/sRifK3kLWxTjHEcomifXasvHtj4Cwfklcu1RqCJuKgMawhrtjktjqTZKFC
 vfG9OoRb6xV2RJtAnDVz+QuaGklPLP/NekQzFGryHYWHCEjI62sH8+WUhVkOs+m5
 1/23N4mqJ/sOurp3Gj0a19lKiAbAAYHMp2v8vLWC7dfQ7VhJC/K0vEQm0HL0Bvwf
 FqIUg6QqfVYjckdL9Fh/0XCMKrOF9Qb59JWOWXPnFjuc7KD2/s/vpfgby6cP+14/
 D1NrNVkp2bEiLZ9avdSL
 =S0zD
 -----END PGP SIGNATURE-----

Merge tag 'firewire-update2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire leftover from Stefan Richter:
 "Occurrences of timeval were supposed to be eliminated last round, now
  remove a last forgotten one"

* tag 'firewire-update2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: nosy: Replace timeval with timespec64
2016-03-25 08:52:25 -07:00
Linus Torvalds f98c2135f8 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
 "Just a couple of dma-buf related fixes and some amdgpu fixes, along
  with a regression fix for radeon off but default feature, but makes my
  30" monitor happy again"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux:
  drm/radeon/mst: cleanup code indentation
  drm/radeon/mst: fix regression in lane/link handling.
  drm/amdgpu: add invalidate_page callback for userptrs
  drm/amdgpu: Revert "remove the userptr rmn->lock"
  drm/amdgpu: clean up path handling for powerplay
  drm/amd/powerplay: fix memory leak of tdp_table
  dma-buf/fence: fix fence_is_later v2
  dma-buf: Update docs for SYNC ioctl
  drm: remove excess description
  dma-buf, drm, ion: Propagate error code from dma_buf_start_cpu_access()
  drm/atmel-hlcdc: use helper to get crtc state
  drm/atomic: use helper to get crtc state
2016-03-25 08:48:31 -07:00
Linus Torvalds 11caf57f6a asm-generic changes for 4.6
There are only three patches this time, most other changes to
 files in include/asm-generic tend to go through the tree of whoever
 depends on the change.
 
 Two patches are cleanups for stuff that is no longer needed,
 the main change is to adapt the generic version of BUG_ON()
 for CONFIG_BUG=n to make it behave consistently with BUG().
 
 This avoids undefined behavior along with a number of warnings
 about that undefined behavior in randconfig builds when
 we keep going on after hitting a BUG_ON().
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAVvRXDGCrR//JCVInAQKUFRAAmp23pohv08LZzXL8Qu7XfFN+b1RkZ936
 WYBeiA9PEWufQs2hgXaEUXy0onO7ah4cs2NWfkaBPyxT+I9mN+ThdzqVrlTE+AEO
 2K0f2RaZANC238zB86Yv/YvTj7FegH0DDdMBq/P06vlYdgBegx49U3pMpguxl3d0
 /q9MyqTzo9j4uOEK4ix4/Dko+4eKIS5Y/xeb0TkeKA6HiBVzAhGLZFl+eMku07Bf
 ap8B705hBDXSBFeWcK9AvKjHZCM+FCkb+C3TXo9x5tUu8g5OIG1t962OQvT9ldsP
 rvo5ppRh/TAY2Z9chN3cKrsvshbHiZ9uRzeksCunL+SK+dOhEIPCVzLXndQpi3RD
 NgeNKgo6gKYdle44pEj0EH2ktuvr0u8sbjQg9SY2miC1H4DmEbCakSqtQegHXTKd
 chJ6xyNiQXktdfo0pFOtCA2gjqiAriugttBqUtGcK9zRqjGGpP5hOUQVm3jR7UMp
 Hjb+oj5o+Gjz5J1t5zsjbhFINDCHAgXRzqqaoT9RfE9+QlUftUhu+N9KVFgzhe9I
 93VHaqgGIRoi856BO7UZSaMGhy7ljm1nQ18jP9aZl/tBco0kpd3AO8og9dJ0u2j+
 3fEqAHH30ia8GJCfIDnolxTL6uaqcCIeAoLgGcmn+QZS7ka+tD+000rtgd2pdy9/
 gy/VPpFG064=
 =8tPL
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann:
 "There are only three patches this time, most other changes to files in
  include/asm-generic tend to go through the tree of whoever depends on
  the change.

  Two patches are cleanups for stuff that is no longer needed, the main
  change is to adapt the generic version of BUG_ON() for CONFIG_BUG=n to
  make it behave consistently with BUG().

  This avoids undefined behavior along with a number of warnings about
  that undefined behavior in randconfig builds when we keep going on
  after hitting a BUG_ON()"

* tag 'asm-generic-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  asm-generic: remove old nonatomic-io wrapper files
  asm-generic: default BUG_ON(x) to if(x)BUG()
  asm-generic: page.h: Remove useless get_user_page and free_user_page
2016-03-24 23:13:48 -07:00
Dave Airlie 4604202ca8 Merge branch 'drm-next-4.6' of git://people.freedesktop.org/~agd5f/linux into drm-next
some amd fixes
* 'drm-next-4.6' of git://people.freedesktop.org/~agd5f/linux:
  drm/radeon/mst: cleanup code indentation
  drm/radeon/mst: fix regression in lane/link handling.
  drm/amdgpu: add invalidate_page callback for userptrs
  drm/amdgpu: Revert "remove the userptr rmn->lock"
  drm/amdgpu: clean up path handling for powerplay
  drm/amd/powerplay: fix memory leak of tdp_table
2016-03-25 16:02:06 +10:00
Linus Torvalds 3d66c6ba3f Power management and ACPI material for v4.6-rc1, part 2
- Fix for an intel_pstate driver issue related to the handling of
    MSR updates uncovered by the recent cpufreq rework (Rafael Wysocki).
 
  - cpufreq core cleanups related to starting governors and frequency
    synchronization during resume from system suspend and a locking
    fix for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).
 
  - acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
    Michael Neuling, Richard Cochran, Shilpasri Bhat).
 
  - intel_idle driver update preventing some Skylake-H systems
    from hanging during initialization by disabling deep C-states
    mishandled by the platform in the problematic configurations (Len
    Brown).
 
  - Intel Xeon Phi Processor x200 support for intel_idle (Dasaratharaman
    Chandramouli).
 
  - cpuidle menu governor updates to make it always honor PM QoS
    latency constraints (and prevent C1 from being used as the
    fallback C-state on x86 when they are set below its exit latency)
    and to restore the previous behavior to fall back to C1 if the next
    timer event is set far enough in the future that was changed in 4.4
    which led to an energy consumption regression (Rik van Riel, Rafael
    Wysocki).
 
  - New device ID for a future AMD UART controller in the ACPI driver
    for AMD SoCs (Wang Hongcheng).
 
  - Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
    scaling (AVS) driver (David Wu).
 
  - ACPI PCI resources management fix for the handling of IO space
    resources on architectures where the IO space is memory mapped
    (IA64 and ARM64) broken by the introduction of common ACPI
    resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).
 
  - Fix for the ACPI backend of the generic device properties API
    to make it parse non-device (data node only) children of an
    ACPI device correctly (Irina Tirdea).
 
  - Fixes for the handling of global suspend flags (introduced in 4.4)
    during hibernation and resume from it (Lukas Wunner).
 
  - Support for obtaining configuration information from Device Trees
    in the PM clocks framework (Jon Hunter).
 
  - ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
    King, Geert Uytterhoeven).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJW9JaRAAoJEILEb/54YlRx/GAQAJujANWilWHZYm24a9JDcIE9
 rsNZIC/FdeBVilPtRTZQnig/Pj32Z4Jm7IZ/DLOq0Deu1YK/9uv3y59M3BcX6WyL
 H5VR80L8geUJZ7RRk0WfM5D4X82ovzwpE/kWt2Z7HDuvJSCBmFBZOvNrXbaRncKD
 jIvat/p6uCuxt5c08+ebnBLQ6tOs8wLTWiCx3fO128GIrGRGN2xFV6hzRWVGnJ4g
 WXGAR+AdLxRMZz4PPmqdTfRj4TNSR071GjKyaeKfZUjQGAsf5O9A77JFjeNVomDx
 g1K37Byid2bTByzVavlEXPJZ7eKb5dAhlo7IJ9HAcOAXChLqH2Czjrpd+1XjR9MF
 SV/78rCnF8eet83QYLbGV/Mzf7gbJP2Xp6wiaM22VAPpGe+sYfphJoQka9XRTfId
 OgAjyYMYdWAKo5DhxVNI8WyN0W5dsoBFPxnaUFhHSGDCIJH7Ksy20m6y3plG2Bxf
 ahoiQhmd9ohjtB5JbRnf4MY0hjekp8Srdf+DoNKsk/+JscIyROpYY3msQ3smUKo+
 f628MC/wAosMpSV+l+KOYkbjCbtB49IabWtZ//NVD9hYB3E1f6aTN59yFbWB+1rp
 L7Y8iaxzSkyJy/yYVuBal3rSk356+BvvoXBlLXmBsyu1TMlcDjALIYztSiTVT5MB
 RZBhgNwdkxNCYJfU3ex+
 =hUVj
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management and ACPI updates from Rafael Wysocki:
 "The second batch of power management and ACPI updates for v4.6.

  Included are fixups on top of the previous PM/ACPI pull request and
  other material that didn't make into it but still should go into 4.6.

  Among other things, there's a fix for an intel_pstate driver issue
  uncovered by recent cpufreq changes, a workaround for a boot hang on
  Skylake-H related to the handling of deep C-states by the platform and
  a PCI/ACPI fix for the handling of IO port resources on non-x86
  architectures plus some new device IDs and similar.

  Specifics:

   - Fix for an intel_pstate driver issue related to the handling of MSR
     updates uncovered by the recent cpufreq rework (Rafael Wysocki).

   - cpufreq core cleanups related to starting governors and frequency
     synchronization during resume from system suspend and a locking fix
     for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).

   - acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
     Michael Neuling, Richard Cochran, Shilpasri Bhat).

   - intel_idle driver update preventing some Skylake-H systems from
     hanging during initialization by disabling deep C-states mishandled
     by the platform in the problematic configurations (Len Brown).

   - Intel Xeon Phi Processor x200 support for intel_idle
     (Dasaratharaman Chandramouli).

   - cpuidle menu governor updates to make it always honor PM QoS
     latency constraints (and prevent C1 from being used as the fallback
     C-state on x86 when they are set below its exit latency) and to
     restore the previous behavior to fall back to C1 if the next timer
     event is set far enough in the future that was changed in 4.4 which
     led to an energy consumption regression (Rik van Riel, Rafael
     Wysocki).

   - New device ID for a future AMD UART controller in the ACPI driver
     for AMD SoCs (Wang Hongcheng).

   - Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
     scaling (AVS) driver (David Wu).

   - ACPI PCI resources management fix for the handling of IO space
     resources on architectures where the IO space is memory mapped
     (IA64 and ARM64) broken by the introduction of common ACPI
     resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).

   - Fix for the ACPI backend of the generic device properties API to
     make it parse non-device (data node only) children of an ACPI
     device correctly (Irina Tirdea).

   - Fixes for the handling of global suspend flags (introduced in 4.4)
     during hibernation and resume from it (Lukas Wunner).

   - Support for obtaining configuration information from Device Trees
     in the PM clocks framework (Jon Hunter).

   - ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
     King, Geert Uytterhoeven)"

* tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
  PM / AVS: rockchip-io: add io selectors and supplies for rk3399
  intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
  intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
  ACPI / PM: Runtime resume devices when waking from hibernate
  PM / sleep: Clear pm_suspend_global_flags upon hibernate
  cpufreq: governor: Always schedule work on the CPU running update
  cpufreq: Always update current frequency before startig governor
  cpufreq: Introduce cpufreq_update_current_freq()
  cpufreq: Introduce cpufreq_start_governor()
  cpufreq: powernv: Add sysfs attributes to show throttle stats
  cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
  PCI: ACPI: IA64: fix IO port generic range check
  ACPI / util: cast data to u64 before shifting to fix sign extension
  cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
  cpuidle: menu: Fall back to polling if next timer event is near
  cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
  intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
  cpufreq: Make cpufreq_quick_get() safe to call
  ACPI / property: fix data node parsing in acpi_get_next_subnode()
  ACPI / APD: Add device HID for future AMD UART controller
  ...
2016-03-24 22:59:58 -07:00
Linus Torvalds 8407ef4685 RTC for 4.6 #2
Drivers:
  - abx80x: handle both XT and RC oscillators, XT failure bit and autocalibration
  - m41t80: avoid out of range year values
  - rv8803: workaround an i2c HW issue
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJW9JdpAAoJEKbNnwlvZCyz9hYP/RU/8fJtdO2veoY+xLVta7UV
 mJahSrMn7CMkZeVh7vy/T8tIJDNB+G4TqkG6V8kT41yltzTZszn/Qm/t6vrEcwY3
 GXj11NtIiOeo6UgGeEKj0feM9Imy/OtHvCfXc7XmTV78f0DYE5MvH+7rTXmFzNm0
 qHE+aGPfp2CGYS8HhDYwedvXo9qphSCfbU3akm2PSdFfOkoCuIBndJgLkKsqdKXY
 St9hN0Z7xo3FayxtTbELftnW1xkvgsJiGcU19b3Uo89V3NS1L+5eJkhI7cx/kQxz
 bMlBBKkvpdmRwy2ng42NGipLXEXwTTOqx4ZlhKLjwhGAF1m4R9ThrCrrXrBT01LM
 8TwO5HNqUEViQYXMSK95UF2F5ldE5S+kfYUjopdoV1N6/04Uz+X03o9+zMAJq4JH
 ZWjcLIYXb1lvmt8KzUD847O69001NtWC0zVgHoQJ3CldguyGX11BYzkRCxRn/u9y
 OWEivF8K0hXYw0Dcmnls6/lrDP1xQc7s6BiFg00UTEjGAmlZVR3eSNBxCXXUneyR
 nsv1dlvB9+mcVFFpEQI/MFFNdoBBx3KLRIE+pFJIadQ4yaWkUhrOJpTW4Y646QAj
 1VMDV5Z6961sMqVkyqNMZqaqnVgrZY0DZWS9PMzYlNnrDSIXu3svu78xq+foBy0+
 OuOHjLcIIKxFSRysMZu7
 =PMVI
 -----END PGP SIGNATURE-----

Merge tag 'rtc-4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

Pull more RTC updates from Alexandre Belloni:
 "A second pull request for v4.6 with a few fixesi before -rc1.  The new
  features for abx80x actually make the RTC behave correctly.

  Drivers:
   - abx80x: handle both XT and RC oscillators, XT failure bit and
     autocalibration
   - m41t80: avoid out of range year values
   - rv8803: workaround an i2c HW issue"

* tag 'rtc-4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
  rtc: abx80x: handle the oscillator failure bit
  rtc: abx80x: handle autocalibration
  rtc: rv8803: workaround i2c HW issue
  rtc: mcp795: add devicetree support
  rtc: asm9260: remove incorrect __init/__exit annotations
  rtc: m41t80: avoid out of range year values
  rtc: s3c: Don't print an error on probe deferral
  rtc: rv3029: stop mentioning rv3029c2
2016-03-24 22:49:08 -07:00
Linus Torvalds b4ae78edf7 Update hwmon mailing list and web page
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW9GjwAAoJEMsfJm/On5mBI8MP/jOIu9g+53OnDp4rjKaurPxZ
 uq/cAiSsOb58jjmoAjGnsjn6PHkF9og6gkH9UoKePcr4f7VbTixPu9mYj1RYEqaE
 p603ASRyRXEEM0RgdQVrNrQbKnAD8COlh66FmvFYFQ6moqbZW3SAv7KqfUghXtKU
 TfPjDVUcxVx6AZ+bsgKi0XeueMgtK7viGxKUEqgayvyQ/JVnHZZKRkJRH5p41QgY
 MOt7MG51WDeEYI8jycXiUrLoyh4H2VQmmwJY7tosyBuRqHdM/Jnv1jjBmiskpZoQ
 NlQ3/8YyDG4KQxWCjmDHxrIUVR9rVfD+pW4dpOWAnps08byJtdo2tiLWM3x6v0g2
 I9HuPqNYf6013a03iCztymKhQZFrP6KZzVzUN7Zq1g6sgM+VCl8VTqB1n2CyI3OC
 rL842jP2CZ3D2Z2JVSW1ryiw5Ogk6Qc/H7OWPVxtVESkMOwuYBxc1gWmN9Ps1Rdz
 2OjkCEA/e00o4xqzJikLjlLDgyJYNw7lOnPAgKP3jxEgc44t7K381740jNbazPCZ
 uTv6DcPBUx66D0Qfpv/AKRlNkFjP3GUVKxb87ENkQaNTBjIuh5x3Qorvd6QoKq8l
 iRHjFWr6+juW2Pubm/+yAzj+BFQIsphHeLnYVrusF9fMC+5EwbRmzJxYqxgfEq3h
 +W76SO4jrkZmy44iwPC4
 =I4zr
 -----END PGP SIGNATURE-----

Merge tag 'hwmon-for-linus-v4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull more hwmon updates from Guenter Roeck:
 "Update hwmon mailing list and web page"

* tag 'hwmon-for-linus-v4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  MAINTAINERS: Update mailing list and web page for hwmon subsystem
2016-03-24 22:45:32 -07:00