Commit graph

1206 commits

Author SHA1 Message Date
Ilya Dryomov fe5da05e97 libceph: redo callbacks and factor out MOSDOpReply decoding
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
very confusing.  Redo this so that only one of them is called:

    ->r_unsafe_callback(true), on ack
    ->r_unsafe_callback(false), on commit

or

    ->r_callback, on ack|commit

Decode everything in decode_MOSDOpReply() to reduce clutter.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:28 +02:00
Ilya Dryomov 85e084feb4 libceph: drop msg argument from ceph_osdc_callback_t
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success.  This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:27 +02:00
Ilya Dryomov bb873b5391 libceph: switch to calc_target(), part 2
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target.  Encoding now happens within ceph_osdc_start_request().

Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded.  If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.

Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op().  While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:27 +02:00
Ilya Dryomov 63244fa123 libceph: introduce ceph_osd_request_target, calc_target()
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:26 +02:00
Ilya Dryomov f81f16339a libceph: rename ceph_calc_pg_primary()
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:25 +02:00
Ilya Dryomov d9591f5e28 libceph: rename ceph_oloc_oid_to_pg()
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:24 +02:00
Ilya Dryomov fcd00b68bb libceph: DEFINE_RB_FUNCS macro
Given

    struct foo {
        u64 id;
        struct rb_node bar_node;
    };

generate insert_bar(), erase_bar() and lookup_bar() functions with

    DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)

The key is assumed to be an integer (u64, int, etc), compared with
< and >.  nodefld has to be initialized with RB_CLEAR_NODE().

Start using it for MDS, MON and OSD requests and OSD sessions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:23 +02:00
Ilya Dryomov d30291b985 libceph: variable-sized ceph_object_id
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
expect one - long rbd image names:

- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"

We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh.  (A format 2 header name is
a small system-generated string.)

Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string.  Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:22 +02:00
Ilya Dryomov 13d1ad16d0 libceph: move message allocation out of ceph_osdc_alloc_request()
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed.  Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:

    req = ceph_osdc_alloc_request(...);
    <fill in req->r_base_oid>
    <fill in req->r_base_oloc>
    ceph_osdc_alloc_messages(req);

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:21 +02:00
Ilya Dryomov 3ed97d6345 libceph: make ceph_osdc_put_request() accept NULL
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26 00:36:20 +02:00
Ilya Dryomov 6c1ea260f8 libceph: make authorizer destruction independent of ceph_auth_client
Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:

    [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
    [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262671.584334] PGD 0
    [262671.635847] Oops: 0000 [#1] SMP
    [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
    [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
    [262672.268937] Workqueue: ceph-msgr con_work [libceph]
    [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
    [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
    [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
    [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
    [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
    [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
    [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
    [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
    [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
    [262673.417556] Stack:
    [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
    [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
    [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
    [262673.805230] Call Trace:
    [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
    [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
    [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
    [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
    [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
    [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
    [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
    [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
    [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
    [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
    [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
    [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
    [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
    [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
    [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
    [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
    [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
    [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0

What happens is the following:

    (1) new MON session is established
    (2) old "none" ac is destroyed
    (3) new "cephx" ac is constructed
    ...
    (4) old OSD session (w/ "none" authorizer) is put
          ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)

osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services.  By the time
we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!

To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session.  Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op.  Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.

Fixes: http://tracker.ceph.com/issues/15447

Reported-by: Alan Zhang <alan.zhang@linux.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-04-25 20:54:13 +02:00
Kirill A. Shutemov 09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Geliang Tang 99ec269779 ceph: use kmem_cache_zalloc
Use kmem_cache_zalloc() instead of kmem_cache_alloc() with flag GFP_ZERO.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:56 +01:00
Yan, Zheng 200fd27c8f ceph: use lookup request to revalidate dentry
If dentry has no lease, ceph_d_revalidate() previously return 0.
This causes VFS to invalidate the dentry and create a new dentry
for later lookup. Invalidating a dentry also detach any underneath
mount points. So mount point inside cephfs can disapear mystically
(even the mount point is not modified by other hosts).

The fix is using lookup request to revalidate dentry without lease.
This can partly solve the mount points disapear issue (as long as
the mount point is not modified by other hosts)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:56 +01:00
Yan, Zheng 641235d8f8 ceph: kill ceph_get_dentry_parent_inode()
use vfs helper dget_parent() instead

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:55 +01:00
Yan, Zheng 315f240880 ceph: fix security xattr deadlock
When security is enabled, security module can call filesystem's
getxattr/setxattr callbacks during d_instantiate(). For cephfs,
d_instantiate() is usually called by MDS' dispatch thread, while
handling MDS reply. If the MDS reply does not include xattrs and
corresponding caps, getxattr/setxattr need to send a new request
to MDS and waits for the reply. This makes MDS' dispatch sleep,
nobody handles later MDS replies.

The fix is make sure lookup/atomic_open reply include xattrs and
corresponding caps. So getxattr can be handled by cached xattrs.
This requires some modification to both MDS and request message.
(Client tells MDS what caps it wants; MDS encodes proper caps in
the reply)

Smack security module may call setxattr during d_instantiate().
Unlike getxattr, we can't force MDS to issue CEPH_CAP_XATTR_EXCL
to us. So just make setxattr return error when called by MDS'
dispatch thread.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:55 +01:00
Yan, Zheng 29dccfa5af ceph: don't request vxattrs from MDS
It's uselese because MDS reply does not carry any vxattr.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:55 +01:00
Yan, Zheng 132ca7e1de ceph: fix mounting same fs multiple times
Now __ceph_open_session() only accepts closed client. An opened
client will tigger BUG_ON().

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:54 +01:00
Yan, Zheng 4531126753 ceph: remove unnecessary NULL check
If page->mapping is NULL, releasepage() callback does not get called.
Remove the unnecessary NULL check to make static code analysis tool
happy

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:54 +01:00
Yan, Zheng a3d714c336 ceph: avoid updating directory inode's i_size accidentally
Directory inode's i_size is used by readdir cache.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:53 +01:00
Yan, Zheng af5e5eb574 ceph: fix race during filling readdir cache
Readdir cache uses page cache to save dentry pointers. When adding
dentry pointers to middle of a page, we need to make sure the page
already exists. Otherwise the beginning part of the page will be
invalid pointers.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:53 +01:00
Ilya Dryomov 34b759b4a2 ceph: kill ceph_empty_snapc
ceph_empty_snapc->num_snaps == 0 at all times.  Passing such a snapc to
ceph_osdc_alloc_request() (possibly through ceph_osdc_new_request()) is
equivalent to passing NULL, as ceph_osdc_alloc_request() uses it only
for sizing the request message.

Further, in all four cases the subsequent ceph_osdc_build_request() is
passed NULL for snapc, meaning that 0 is encoded for seq and num_snaps
and making ceph_empty_snapc entirely useless.  The two cases where it
actually mattered were removed in commits 8605609049 ("ceph: avoid
sending unnessesary FLUSHSNAP message") and 23078637e0 ("ceph: fix
queuing inode to mdsdir's snaprealm").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by:  Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:52 +01:00
Anton Protopopov ce4355932a ceph: fix a wrong comparison
A negative value rc compared to the positive value ENOENT in the
finish_read() function.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:52 +01:00
Deepa Dinamani 8bbd47140c ceph: replace CURRENT_TIME by current_fs_time()
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_fs_time() instead.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:52 +01:00
Yan, Zheng 5b64640cf6 ceph: scattered page writeback
This patch makes ceph_writepages_start() try using single OSD request
to write all dirty pages within a strip unit. When a nonconsecutive
dirty page is found, ceph_writepages_start() tries starting a new write
operation to existing OSD request. If it succeeds, it uses the new
operation to writeback the dirty page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:51 +01:00
Yan, Zheng a587d71b0a ceph: remove useless BUG_ON
ceph_osdc_start_request() never return -EOLDSNAP

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:41 +01:00
Yan, Zheng 133e91566c ceph: don't enable rbytes mount option by default
When rbytes mount option is enabled, directory size is recursive
size. Recursive size is not updated instantly. This can cause
directory size to change between successive stat(1)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:41 +01:00
Yan, Zheng d1eee0c0e1 ceph: encode ctime in cap message
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-03-25 18:51:40 +01:00
Ilya Dryomov 82dcabad75 libceph: revamp subs code, switch to SUBSCRIBE2 protocol
It is currently hard-coded in the mon_client that mdsmap and monmap
subs are continuous, while osdmap sub is always "onetime".  To better
handle full clusters/pools in the osd_client, we need to be able to
issue continuous osdmap subs.  Revamp subs code to allow us to specify
for each sub whether it should be continuous or not.

Although not strictly required for the above, switch to SUBSCRIBE2
protocol while at it, eliminating the ambiguity between a request for
"every map since X" and a request for "just the latest" when we don't
have a map yet (i.e. have epoch 0).  SUBSCRIBE2 feature bit is now
required - it's been supported since pre-argonaut (2010).

Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling
in before we validate the epoch and successfully install the new map
can mess up mon_client sub state.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25 18:51:38 +01:00
Linus Torvalds 3c2de27d79 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:

 - Preparations of parallel lookups (the remaining main obstacle is the
   need to move security_d_instantiate(); once that becomes safe, the
   rest will be a matter of rather short series local to fs/*.c

 - preadv2/pwritev2 series from Christoph

 - assorted fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
  splice: handle zero nr_pages in splice_to_pipe()
  vfs: show_vfsstat: do not ignore errors from show_devname method
  dcache.c: new helper: __d_add()
  don't bother with __d_instantiate(dentry, NULL)
  untangle fsnotify_d_instantiate() a bit
  uninline d_add()
  replace d_add_unique() with saner primitive
  quota: use lookup_one_len_unlocked()
  cifs_get_root(): use lookup_one_len_unlocked()
  nfs_lookup: don't bother with d_instantiate(dentry, NULL)
  kill dentry_unhash()
  ceph_fill_trace(): don't bother with d_instantiate(dn, NULL)
  autofs4: don't bother with d_instantiate(dentry, NULL) in ->lookup()
  configfs: move d_rehash() into configfs_create() for regular files
  ceph: don't bother with d_rehash() in splice_dentry()
  namei: teach lookup_slow() to skip revalidate
  namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()
  lookup_one_len_unlocked(): use lookup_dcache()
  namei: simplify invalidation logics in lookup_dcache()
  namei: change calling conventions for lookup_{fast,slow} and follow_managed()
  ...
2016-03-19 18:52:29 -07:00
Al Viro f8b31710e4 ceph_fill_trace(): don't bother with d_instantiate(dn, NULL)
... and use d_add(dn, NULL) in case we need to hash a negative
unhashed rather than using d_rehash() directly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-14 00:16:06 -04:00
Al Viro f7380af04b ceph: don't bother with d_rehash() in splice_dentry()
d_splice_alias() guarantees that it'll be always hashed

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-14 00:15:51 -04:00
Yan, Zheng 5ea5c5e0a7 ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support
Add support for the format change of MClientReply/MclientCaps.
Also add code that denies access to inodes with pool_ns layouts.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-03-04 21:00:37 +01:00
Yan, Zheng db6aed7023 ceph: fix snap context leak in error path
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-02-04 18:25:15 +01:00
Dan Carpenter 1418bf076d ceph: checking for IS_ERR instead of NULL
ceph_osdc_alloc_request() returns NULL on error, it never returns error
pointers.

Fixes: 5be0389dac ('ceph: re-send AIO write request when getting -EOLDSNAP error')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-02-04 18:25:08 +01:00
Linus Torvalds 00e3f5cc30 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "The two main changes are aio support in CephFS, and a series that
  fixes several issues in the authentication key timeout/renewal code.

  On top of that are a variety of cleanups and minor bug fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: remove outdated comment
  libceph: kill off ceph_x_ticket_handler::validity
  libceph: invalidate AUTH in addition to a service ticket
  libceph: fix authorizer invalidation, take 2
  libceph: clear messenger auth_retry flag if we fault
  libceph: fix ceph_msg_revoke()
  libceph: use list_for_each_entry_safe
  ceph: use i_size_{read,write} to get/set i_size
  ceph: re-send AIO write request when getting -EOLDSNAP error
  ceph: Asynchronous IO support
  ceph: Avoid to propagate the invalid page point
  ceph: fix double page_unlock() in page_mkwrite()
  rbd: delete an unnecessary check before rbd_dev_destroy()
  libceph: use list_next_entry instead of list_entry_next
  ceph: ceph_frag_contains_value can be boolean
  ceph: remove unused functions in ceph_frag.h
2016-01-24 12:34:13 -08:00
Al Viro 5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Yan, Zheng 99c88e6900 ceph: use i_size_{read,write} to get/set i_size
Cap message from MDS can update i_size. In that case, we don't
hold i_mutex. So it's unsafe to directly access inode->i_size
while holding i_mutex.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21 19:36:08 +01:00
Yan, Zheng 5be0389dac ceph: re-send AIO write request when getting -EOLDSNAP error
When receiving -EOLDSNAP from OSD, we need to re-send corresponding
write request. Due to locking issue, we can send new request inside
another OSD request's complete callback. So we use worker to re-send
request for AIO write.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21 19:36:08 +01:00
Yan, Zheng c8fe9b17d0 ceph: Asynchronous IO support
The basic idea of AIO support is simple, just call kiocb::ki_complete()
in OSD request's complete callback. But there are several special cases.

when IO span multiple objects, we need to wait until all OSD requests
are complete, then call kiocb::ki_complete(). Error handling in this case
is tricky too. For simplify, AIO both span multiple objects and extends
i_size are not allowed.

Another special case is check EOF for reading (other client can write to
the file and extend i_size concurrently). For simplify, the direct-IO/AIO
code path does do the check, fallback to normal syn read instead.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21 19:36:07 +01:00
Minfei Huang 458c4703ae ceph: Avoid to propagate the invalid page point
The variant pagep will still get the invalid page point, although ceph
fails in function ceph_update_writeable_page.

To fix this issue, Assigne the page to pagep until there is no failure
in function ceph_update_writeable_page.

Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21 19:36:07 +01:00
Yan, Zheng f9cac5ac08 ceph: fix double page_unlock() in page_mkwrite()
ceph_update_writeable_page() unlocks the page on errors, so
page_mkwrite() should not unlock the page again.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-01-21 19:36:07 +01:00
Vladimir Davydov 5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Linus Torvalds ddf1d6238d Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "Andreas' xattr cleanup series.

  It's a followup to his xattr work that went in last cycle; -0.5KLoC"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xattr handlers: Simplify list operation
  ocfs2: Replace list xattr handler operations
  nfs: Move call to security_inode_listsecurity into nfs_listxattr
  xfs: Change how listxattr generates synthetic attributes
  tmpfs: listxattr should include POSIX ACL xattrs
  tmpfs: Use xattr handler infrastructure
  btrfs: Use xattr handler infrastructure
  vfs: Distinguish between full xattr names and proper prefixes
  posix acls: Remove duplicate xattr name definitions
  gfs2: Remove gfs2_xattr_acl_chmod
  vfs: Remove vfs_xattr_cmp
2016-01-11 13:32:10 -08:00
Al Viro 6b2553918d replace ->follow_link() with new method that could stay in RCU mode
new method: ->get_link(); replacement of ->follow_link().  The differences
are:
	* inode and dentry are passed separately
	* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
	* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.

It's a flagday change - the old method is gone, all in-tree instances
converted.  Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode.  That'll change
in the next commits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:54 -05:00
Andreas Gruenbacher 97d7929922 posix acls: Remove duplicate xattr name definitions
Remove POSIX_ACL_XATTR_{ACCESS,DEFAULT} and GFS2_POSIX_ACL_{ACCESS,DEFAULT}
and replace them with the definitions in <include/uapi/linux/xattr.h>.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:25:17 -05:00
Linus Torvalds ca4ba96e02 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There are several patches from Ilya fixing RBD allocation lifecycle
  issues, a series adding a nocephx_sign_messages option (and associated
  bug fixes/cleanups), several patches from Zheng improving the
  (directory) fsync behavior, a big improvement in IO for direct-io
  requests when striping is enabled from Caifeng, and several other
  small fixes and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: clear msg->con in ceph_msg_release() only
  libceph: add nocephx_sign_messages option
  libceph: stop duplicating client fields in messenger
  libceph: drop authorizer check from cephx msg signing routines
  libceph: msg signing callouts don't need con argument
  libceph: evaluate osd_req_op_data() arguments only once
  ceph: make fsync() wait unsafe requests that created/modified inode
  ceph: add request to i_unsafe_dirops when getting unsafe reply
  libceph: introduce ceph_x_authorizer_cleanup()
  ceph: don't invalidate page cache when inode is no longer used
  rbd: remove duplicate calls to rbd_dev_mapping_clear()
  rbd: set device_type::release instead of device::release
  rbd: don't free rbd_dev outside of the release callback
  rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails
  libceph: use local variable cursor instead of &msg->cursor
  libceph: remove con argument in handle_reply()
  ceph: combine as many iovec as possile into one OSD request
  ceph: fix message length computation
  ceph: fix a comment typo
  rbd: drop null test before destroy functions
2015-11-13 09:24:40 -08:00
Michal Hocko c62d25556b mm, fs: introduce mapping_gfp_constraint()
There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.

Let's introduce a helper function which makes the restriction explicit and
easier to track.  This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Ilya Dryomov 79dbd1baa6 libceph: msg signing callouts don't need con argument
We can use msg->con instead - at the point we sign an outgoing message
or check the signature on the incoming one, msg->con is always set.  We
wouldn't know how to sign a message without an associated session (i.e.
msg->con == NULL) and being able to sign a message using an explicitly
provided authorizer is of no use.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02 23:37:45 +01:00
Yan, Zheng 68cd5b4b76 ceph: make fsync() wait unsafe requests that created/modified inode
If we get a unsafe reply for request that created/modified inode,
add the unsafe request to a list in the newly created/modified
inode. So we can make fsync() wait these unsafe requests.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02 23:36:48 +01:00