io_uring-2019-03-06

-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyAJvAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphb+EACFaKI2HIdjExQ5T7Cxebzwky+Qiro3FV55 ziW00FZrkJ5g0h4ItBzh/5SDlcNQYZDMlA3s4xzWIMadWl5PjMPq1uJul0cITbSl WIJO5hpgNMXeUEhvcXUl6+f/WzpgYUxN40uW8N5V7EKlooaFVfudDqJGlvEv+UgB g8NWQYThSG+/e7r9OGwK0xDRVKfpjxVvmqmnDH3DrxKaDgSOwTf4xn1u41wKwfQ3 3uPfQ+GBeTqt4a2AhOi7K6KQFNnj5Jz5CXYMiOZI2JGtLPcL6dmyBVD7K0a0HUr+ rs4ghNdd1+puvPGNK4TX8qV0uiNrMctoRNVA/JDd1ZTYEKTmNLxeFf+olfYHlwuK K5FRs60/lgNzNkzcUpFvJHitPwYtxYJdB36PyswE1FZP1YviEeVoKNt9W8aIhEoA 549uj90brfA74eCINGhq98pJqj9CNyCPw3bfi76f5Ej2utwYDb9S5Cp2gfSa853X qc/qNda9efEq7ikwCbPzhekRMXZo6TSXtaSmC2C+Vs5+mD1Scc4kdAvdCKGQrtr9 aoy0iQMYO2NDZ/G5fppvXtMVuEPAZWbsGftyOe15IlMysjRze2ycJV8cFahKEVM9 uBeXLyH1pqGU/j7ABP4+XRZ/sbHJTwjKJbnXhTgBsdU8XO/CR3U+kRQFTsidKMfH Wlo3uH2h2A== =p78E -----END PGP SIGNATURE----- Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block Pull io_uring IO interface from Jens Axboe: "Second attempt at adding the io_uring interface. Since the first one, we've added basic unit testing of the three system calls, that resides in liburing like the other unit tests that we have so far. It'll take a while to get full coverage of it, but we're working towards it. I've also added two basic test programs to tools/io_uring. One uses the raw interface and has support for all the various features that io_uring supports outside of standard IO, like fixed files, fixed IO buffers, and polled IO. The other uses the liburing API, and is a simplified version of cp(1). This adds support for a new IO interface, io_uring. io_uring allows an application to communicate with the kernel through two rings, the submission queue (SQ) and completion queue (CQ) ring. This allows for very efficient handling of IOs, see the v5 posting for some basic numbers: https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/ Outside of just efficiency, the interface is also flexible and extendable, and allows for future use cases like the upcoming NVMe key-value store API, networked IO, and so on. It also supports async buffered IO, something that we've always failed to support in the kernel. Outside of basic IO features, it supports async polled IO as well. This particular feature has already been tested at Facebook months ago for flash storage boxes, with 25-33% improvements. It makes polled IO actually useful for real world use cases, where even basic flash sees a nice win in terms of efficiency, latency, and performance. These boxes were IOPS bound before, now they are not. This series adds three new system calls. One for setting up an io_uring instance (io_uring_setup(2)), one for submitting/completing IO (io_uring_enter(2)), and one for aux functions like registrating file sets, buffers, etc (io_uring_register(2)). Through the help of Arnd, I've coordinated the syscall numbers so merge on that front should be painless. Jon did a writeup of the interface a while back, which (except for minor details that have been tweaked) is still accurate. Find that here: https://lwn.net/Articles/776703/ Huge thanks to Al Viro for helping getting the reference cycle code correct, and to Jann Horn for his extensive reviews focused on both security and bugs in general. There's a userspace library that provides basic functionality for applications that don't need or want to care about how to fiddle with the rings directly. It has helpers to allow applications to easily set up an io_uring instance, and submit/complete IO through it without knowing about the intricacies of the rings. It also includes man pages (thanks to Jeff Moyer), and will continue to grow support helper functions and features as time progresses. Find it here: git://git.kernel.dk/liburing Fio has full support for the raw interface, both in the form of an IO engine (io_uring), but also with a small test application (t/io_uring) that can exercise and benchmark the interface" * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block: io_uring: add a few test tools io_uring: allow workqueue item to handle multiple buffered requests io_uring: add support for IORING_OP_POLL io_uring: add io_kiocb ref count io_uring: add submission polling io_uring: add file set registration net: split out functions related to registering inflight socket files io_uring: add support for pre-mapped user IO buffers block: implement bio helper to add iter bvec pages to bio io_uring: batch io_kiocb allocation io_uring: use fget/fput_many() for file references fs: add fget_many() and fput_many() io_uring: support for IO polling io_uring: add fsync support Add io_uring IO interface
2019-03-08 14:48:40 -08:00 · 2019-03-08 14:48:40 -08:00 · 38e7571c07
parent 80201fe175 21b4aa5d20
commit 38e7571c07
32 changed files with 4783 additions and 146 deletions
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@ -429,3 +429,6 @@
 421	i386	rt_sigtimedwait_time64	sys_rt_sigtimedwait		__ia32_compat_sys_rt_sigtimedwait_time64
 422	i386	futex_time64		sys_futex			__ia32_sys_futex
 423	i386	sched_rr_get_interval_time64	sys_sched_rr_get_interval	__ia32_sys_sched_rr_get_interval
+425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
+426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
+427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@ -345,6 +345,9 @@
 334	common	rseq			__x64_sys_rseq
 # don't use numbers 387 through 423, add new calls after the last
 # 'common' entry
+425	common	io_uring_setup		__x64_sys_io_uring_setup
+426	common	io_uring_enter		__x64_sys_io_uring_enter
+427	common	io_uring_register	__x64_sys_io_uring_register

 #
 # x32-specific system call numbers start at 512 to avoid cache impact
--- a/block/bio.c
+++ b/block/bio.c
@ -836,6 +836,40 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);

+static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned int len;
+	size_t size;
+
+	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
+		return -EINVAL;
+
+	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
+	size = bio_add_page(bio, bv->bv_page, len,
+				bv->bv_offset + iter->iov_offset);
+	if (size == len) {
+		struct page *page;
+		int i;
+
+		/*
+		 * For the normal O_DIRECT case, we could skip grabbing this
+		 * reference and then not have to put them again when IO
+		 * completes. But this breaks some in-kernel users, like
+		 * splicing to/from a loop device, where we release the pipe
+		 * pages unconditionally. If we can fix that case, we can
+		 * get rid of the get here and the need to call
+		 * bio_release_pages() at IO completion time.
+		 */
+		mp_bvec_for_each_page(page, bv, i)
+			get_page(page);
+		iov_iter_advance(iter, size);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))

 /**
@ -884,23 +918,35 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 }

 /**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * bio_iov_iter_get_pages - add user or kernel pages to a bio
 * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
+ * @iter: iov iterator describing the region to be added
+ *
+ * This takes either an iterator pointing to user memory, or one pointing to
+ * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
+ * map them into the kernel. On IO completion, the caller should put those
+ * pages. For now, when adding kernel pages, we still grab a reference to the
+ * page. This isn't strictly needed for the common case, but some call paths
+ * end up releasing pages from eg a pipe and we can't easily control these.
+ * See comment in __bio_iov_bvec_add_pages().
 *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
 * The function tries, but does not guarantee, to pin as many pages as
- * fit into the bio, or are requested in *iter, whatever is smaller.
- * If MM encounters an error pinning the requested pages, it stops.
- * Error is returned only if 0 pages could be pinned.
+ * fit into the bio, or are requested in *iter, whatever is smaller. If
+ * MM encounters an error pinning the requested pages, it stops. Error
+ * is returned only if 0 pages could be pinned.
 */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
+	const bool is_bvec = iov_iter_is_bvec(iter);
 	unsigned short orig_vcnt = bio->bi_vcnt;

 	do {
-		int ret = __bio_iov_iter_get_pages(bio, iter);
+		int ret;
+
+		if (is_bvec)
+			ret = __bio_iov_bvec_add_pages(bio, iter);
+		else
+			ret = __bio_iov_iter_get_pages(bio, iter);

 		if (unlikely(ret))
 			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
--- a/fs/Makefile
+++ b/fs/Makefile
@ -31,6 +31,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_IO_URING)		+= io_uring.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
--- a/fs/file.c
+++ b/fs/file.c
@ -706,7 +706,7 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }

-static struct file *__fget(unsigned int fd, fmode_t mask)
+static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
 {
 	struct files_struct *files = current->files;
 	struct file *file;
@ -721,7 +721,7 @@ loop:
 		 */
 		if (file->f_mode & mask)
 			file = NULL;
-		else if (!get_file_rcu(file))
+		else if (!get_file_rcu_many(file, refs))
 			goto loop;
 	}
 	rcu_read_unlock();
@ -729,15 +729,20 @@ loop:
 	return file;
 }

+struct file *fget_many(unsigned int fd, unsigned int refs)
+{
+	return __fget(fd, FMODE_PATH, refs);
+}
+
 struct file *fget(unsigned int fd)
 {
-	return __fget(fd, FMODE_PATH);
+	return __fget(fd, FMODE_PATH, 1);
 }
 EXPORT_SYMBOL(fget);

 struct file *fget_raw(unsigned int fd)
 {
-	return __fget(fd, 0);
+	return __fget(fd, 0, 1);
 }
 EXPORT_SYMBOL(fget_raw);

@ -768,7 +773,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 			return 0;
 		return (unsigned long)file;
 	} else {
-		file = __fget(fd, mask);
+		file = __fget(fd, mask, 1);
 		if (!file)
 			return 0;
 		return FDPUT_FPUT | (unsigned long)file;
--- a/fs/file_table.c
+++ b/fs/file_table.c
@ -326,9 +326,9 @@ void flush_delayed_fput(void)

 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);

-void fput(struct file *file)
+void fput_many(struct file *file, unsigned int refs)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
+	if (atomic_long_sub_and_test(refs, &file->f_count)) {
 		struct task_struct *task = current;

 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
@ -347,6 +347,11 @@ void fput(struct file *file)
 	}
 }

+void fput(struct file *file)
+{
+	fput_many(file, 1);
+}
+
 /*
 * synchronous analog of fput(); for kernel threads that might be needed
 * in some umount() (and thus can't use flush_delayed_fput() without
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
--- a/include/linux/file.h
+++ b/include/linux/file.h
@ -13,6 +13,7 @@
 struct file;

 extern void fput(struct file *);
+extern void fput_many(struct file *, unsigned int);

 struct file_operations;
 struct vfsmount;
@ -44,6 +45,7 @@ static inline void fdput(struct fd fd)
 }

 extern struct file *fget(unsigned int fd);
+extern struct file *fget_many(unsigned int fd, unsigned int refs);
 extern struct file *fget_raw(unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
 extern unsigned long __fdget_raw(unsigned int fd);
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@ -961,7 +961,9 @@ static inline struct file *get_file(struct file *f)
 	atomic_long_inc(&f->f_count);
 	return f;
 }
-#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count)
+#define get_file_rcu_many(x, cnt)	\
+	atomic_long_add_unless(&(x)->f_count, (cnt), 0)
+#define get_file_rcu(x) get_file_rcu_many((x), 1)
 #define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
 #define file_count(x)	atomic_long_read(&(x)->f_count)

@ -3511,4 +3513,13 @@ extern void inode_nohighmem(struct inode *inode);
 extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
 		       int advice);

+#if defined(CONFIG_IO_URING)
+extern struct sock *io_uring_get_socket(struct file *file);
+#else
+static inline struct sock *io_uring_get_socket(struct file *file)
+{
+	return NULL;
+}
+#endif
+
 #endif /* _LINUX_FS_H */
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@ -40,7 +40,7 @@ struct user_struct {
 	kuid_t uid;

 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
 	atomic_long_t locked_vm;
 #endif

--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@ -69,6 +69,7 @@ struct file_handle;
 struct sigaltstack;
 struct rseq;
 union bpf_attr;
+struct io_uring_params;

 #include <linux/types.h>
 #include <linux/aio_abi.h>
@ -314,6 +315,13 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id,
 				struct io_event __user *events,
 				struct old_timespec32 __user *timeout,
 				const struct __aio_sigset *sig);
+asmlinkage long sys_io_uring_setup(u32 entries,
+				struct io_uring_params __user *p);
+asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
+				u32 min_complete, u32 flags,
+				const sigset_t __user *sig, size_t sigsz);
+asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
+				void __user *arg, unsigned int nr_args);

 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@ -10,6 +10,7 @@

 void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
+void unix_destruct_scm(struct sk_buff *skb);
 void unix_gc(void);
 void wait_for_unix_gc(void);
 struct sock *unix_get_socket(struct file *filp);
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@ -824,8 +824,15 @@ __SYSCALL(__NR_futex_time64, sys_futex)
 __SYSCALL(__NR_sched_rr_get_interval_time64, sys_sched_rr_get_interval)
 #endif

+#define __NR_io_uring_setup 425
+__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
+#define __NR_io_uring_enter 426
+__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+#define __NR_io_uring_register 427
+__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+
 #undef __NR_syscalls
-#define __NR_syscalls 424
+#define __NR_syscalls 428

 /*
 * 32 bit systems traditionally used different
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Header file for the io_uring interface.
+ *
+ * Copyright (C) 2019 Jens Axboe
+ * Copyright (C) 2019 Christoph Hellwig
+ */
+#ifndef LINUX_IO_URING_H
+#define LINUX_IO_URING_H
+
+#include <linux/fs.h>
+#include <linux/types.h>
+
+/*
+ * IO submission data structure (Submission Queue Entry)
+ */
+struct io_uring_sqe {
+	__u8	opcode;		/* type of operation for this sqe */
+	__u8	flags;		/* IOSQE_ flags */
+	__u16	ioprio;		/* ioprio for the request */
+	__s32	fd;		/* file descriptor to do IO on */
+	__u64	off;		/* offset into file */
+	__u64	addr;		/* pointer to buffer or iovecs */
+	__u32	len;		/* buffer size or number of iovecs */
+	union {
+		__kernel_rwf_t	rw_flags;
+		__u32		fsync_flags;
+		__u16		poll_events;
+	};
+	__u64	user_data;	/* data to be passed back at completion time */
+	union {
+		__u16	buf_index;	/* index into fixed buffers, if used */
+		__u64	__pad2[3];
+	};
+};
+
+/*
+ * sqe->flags
+ */
+#define IOSQE_FIXED_FILE	(1U << 0)	/* use fixed fileset */
+
+/*
+ * io_uring_setup() flags
+ */
+#define IORING_SETUP_IOPOLL	(1U << 0)	/* io_context is polled */
+#define IORING_SETUP_SQPOLL	(1U << 1)	/* SQ poll thread */
+#define IORING_SETUP_SQ_AFF	(1U << 2)	/* sq_thread_cpu is valid */
+
+#define IORING_OP_NOP		0
+#define IORING_OP_READV		1
+#define IORING_OP_WRITEV	2
+#define IORING_OP_FSYNC		3
+#define IORING_OP_READ_FIXED	4
+#define IORING_OP_WRITE_FIXED	5
+#define IORING_OP_POLL_ADD	6
+#define IORING_OP_POLL_REMOVE	7
+
+/*
+ * sqe->fsync_flags
+ */
+#define IORING_FSYNC_DATASYNC	(1U << 0)
+
+/*
+ * IO completion data structure (Completion Queue Entry)
+ */
+struct io_uring_cqe {
+	__u64	user_data;	/* sqe->data submission passed back */
+	__s32	res;		/* result code for this event */
+	__u32	flags;
+};
+
+/*
+ * Magic offsets for the application to mmap the data it needs
+ */
+#define IORING_OFF_SQ_RING		0ULL
+#define IORING_OFF_CQ_RING		0x8000000ULL
+#define IORING_OFF_SQES			0x10000000ULL
+
+/*
+ * Filled with the offset for mmap(2)
+ */
+struct io_sqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 flags;
+	__u32 dropped;
+	__u32 array;
+	__u32 resv1;
+	__u64 resv2;
+};
+
+/*
+ * sq_ring->flags
+ */
+#define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
+
+struct io_cqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 overflow;
+	__u32 cqes;
+	__u64 resv[2];
+};
+
+/*
+ * io_uring_enter(2) flags
+ */
+#define IORING_ENTER_GETEVENTS	(1U << 0)
+#define IORING_ENTER_SQ_WAKEUP	(1U << 1)
+
+/*
+ * Passed in for io_uring_setup(2). Copied back with updated info on success
+ */
+struct io_uring_params {
+	__u32 sq_entries;
+	__u32 cq_entries;
+	__u32 flags;
+	__u32 sq_thread_cpu;
+	__u32 sq_thread_idle;
+	__u32 resv[5];
+	struct io_sqring_offsets sq_off;
+	struct io_cqring_offsets cq_off;
+};
+
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+#define IORING_REGISTER_BUFFERS		0
+#define IORING_UNREGISTER_BUFFERS	1
+#define IORING_REGISTER_FILES		2
+#define IORING_UNREGISTER_FILES		3
+
+#endif
--- a/init/Kconfig
+++ b/init/Kconfig
@ -1414,6 +1414,15 @@ config AIO
 	  by some high performance threaded applications. Disabling
 	  this option saves about 7k.

+config IO_URING
+	bool "Enable IO uring support" if EXPERT
+	select ANON_INODES
+	default y
+	help
+	  This option enables support for the io_uring interface, enabling
+	  applications to submit and complete IO through submission and
+	  completion rings that are shared between the kernel and application.
+
 config ADVISE_SYSCALLS
 	bool "Enable madvise/fadvise syscalls" if EXPERT
 	default y
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@ -48,6 +48,9 @@ COND_SYSCALL(io_pgetevents_time32);
 COND_SYSCALL(io_pgetevents);
 COND_SYSCALL_COMPAT(io_pgetevents_time32);
 COND_SYSCALL_COMPAT(io_pgetevents);
+COND_SYSCALL(io_uring_setup);
+COND_SYSCALL(io_uring_enter);
+COND_SYSCALL(io_uring_register);

 /* fs/xattr.c */

--- a/net/Makefile
+++ b/net/Makefile
@ -18,7 +18,7 @@ obj-$(CONFIG_NETFILTER)		+= netfilter/
 obj-$(CONFIG_INET)		+= ipv4/
 obj-$(CONFIG_TLS)		+= tls/
 obj-$(CONFIG_XFRM)		+= xfrm/
-obj-$(CONFIG_UNIX)		+= unix/
+obj-$(CONFIG_UNIX_SCM)		+= unix/
 obj-$(CONFIG_NET)		+= ipv6/
 obj-$(CONFIG_BPFILTER)		+= bpfilter/
 obj-$(CONFIG_PACKET)		+= packet/
--- a/net/unix/Kconfig
+++ b/net/unix/Kconfig
@ -19,6 +19,11 @@ config UNIX

 	  Say Y unless you know what you are doing.

+config UNIX_SCM
+	bool
+	depends on UNIX
+	default y
+
 config UNIX_DIAG
 	tristate "UNIX: socket monitoring interface"
 	depends on UNIX
--- a/net/unix/Makefile
+++ b/net/unix/Makefile
@ -10,3 +10,5 @@ unix-$(CONFIG_SYSCTL)	+= sysctl_net_unix.o

 obj-$(CONFIG_UNIX_DIAG)	+= unix_diag.o
 unix_diag-y		:= diag.o
+
+obj-$(CONFIG_UNIX_SCM)	+= scm.o
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@ -119,6 +119,8 @@
 #include <linux/freezer.h>
 #include <linux/file.h>

+#include "scm.h"
+
 struct hlist_head unix_socket_table[2 * UNIX_HASH_SIZE];
 EXPORT_SYMBOL_GPL(unix_socket_table);
 DEFINE_SPINLOCK(unix_table_lock);
@ -1496,67 +1498,6 @@ out:
 	return err;
 }

-static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
-{
-	int i;
-
-	scm->fp = UNIXCB(skb).fp;
-	UNIXCB(skb).fp = NULL;
-
-	for (i = scm->fp->count-1; i >= 0; i--)
-		unix_notinflight(scm->fp->user, scm->fp->fp[i]);
-}
-
-static void unix_destruct_scm(struct sk_buff *skb)
-{
-	struct scm_cookie scm;
-	memset(&scm, 0, sizeof(scm));
-	scm.pid  = UNIXCB(skb).pid;
-	if (UNIXCB(skb).fp)
-		unix_detach_fds(&scm, skb);
-
-	/* Alas, it calls VFS */
-	/* So fscking what? fput() had been SMP-safe since the last Summer */
-	scm_destroy(&scm);
-	sock_wfree(skb);
-}
-
-/*
- * The "user->unix_inflight" variable is protected by the garbage
- * collection lock, and we just read it locklessly here. If you go
- * over the limit, there might be a tiny race in actually noticing
- * it across threads. Tough.
- */
-static inline bool too_many_unix_fds(struct task_struct *p)
-{
-	struct user_struct *user = current_user();
-
-	if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
-		return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
-	return false;
-}
-
-static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
-{
-	int i;
-
-	if (too_many_unix_fds(current))
-		return -ETOOMANYREFS;
-
-	/*
-	 * Need to duplicate file references for the sake of garbage
-	 * collection.  Otherwise a socket in the fps might become a
-	 * candidate for GC while the skb is not yet queued.
-	 */
-	UNIXCB(skb).fp = scm_fp_dup(scm->fp);
-	if (!UNIXCB(skb).fp)
-		return -ENOMEM;
-
-	for (i = scm->fp->count - 1; i >= 0; i--)
-		unix_inflight(scm->fp->user, scm->fp->fp[i]);
-	return 0;
-}
-
 static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds)
 {
 	int err = 0;
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@ -86,77 +86,13 @@
 #include <net/scm.h>
 #include <net/tcp_states.h>

+#include "scm.h"
+
 /* Internal data structures and random procedures: */

-static LIST_HEAD(gc_inflight_list);
 static LIST_HEAD(gc_candidates);
-static DEFINE_SPINLOCK(unix_gc_lock);
 static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);

-unsigned int unix_tot_inflight;
-
-struct sock *unix_get_socket(struct file *filp)
-{
-	struct sock *u_sock = NULL;
-	struct inode *inode = file_inode(filp);
-
-	/* Socket ? */
-	if (S_ISSOCK(inode->i_mode) && !(filp->f_mode & FMODE_PATH)) {
-		struct socket *sock = SOCKET_I(inode);
-		struct sock *s = sock->sk;
-
-		/* PF_UNIX ? */
-		if (s && sock->ops && sock->ops->family == PF_UNIX)
-			u_sock = s;
-	}
-	return u_sock;
-}
-
-/* Keep the number of times in flight count for the file
- * descriptor if it is for an AF_UNIX socket.
- */
-
-void unix_inflight(struct user_struct *user, struct file *fp)
-{
-	struct sock *s = unix_get_socket(fp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (s) {
-		struct unix_sock *u = unix_sk(s);
-
-		if (atomic_long_inc_return(&u->inflight) == 1) {
-			BUG_ON(!list_empty(&u->link));
-			list_add_tail(&u->link, &gc_inflight_list);
-		} else {
-			BUG_ON(list_empty(&u->link));
-		}
-		unix_tot_inflight++;
-	}
-	user->unix_inflight++;
-	spin_unlock(&unix_gc_lock);
-}
-
-void unix_notinflight(struct user_struct *user, struct file *fp)
-{
-	struct sock *s = unix_get_socket(fp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (s) {
-		struct unix_sock *u = unix_sk(s);
-
-		BUG_ON(!atomic_long_read(&u->inflight));
-		BUG_ON(list_empty(&u->link));
-
-		if (atomic_long_dec_and_test(&u->inflight))
-			list_del_init(&u->link);
-		unix_tot_inflight--;
-	}
-	user->unix_inflight--;
-	spin_unlock(&unix_gc_lock);
-}
-
 static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 			  struct sk_buff_head *hitlist)
 {
--- a/net/unix/scm.c
+++ b/net/unix/scm.c
@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/socket.h>
+#include <linux/net.h>
+#include <linux/fs.h>
+#include <net/af_unix.h>
+#include <net/scm.h>
+#include <linux/init.h>
+
+#include "scm.h"
+
+unsigned int unix_tot_inflight;
+EXPORT_SYMBOL(unix_tot_inflight);
+
+LIST_HEAD(gc_inflight_list);
+EXPORT_SYMBOL(gc_inflight_list);
+
+DEFINE_SPINLOCK(unix_gc_lock);
+EXPORT_SYMBOL(unix_gc_lock);
+
+struct sock *unix_get_socket(struct file *filp)
+{
+	struct sock *u_sock = NULL;
+	struct inode *inode = file_inode(filp);
+
+	/* Socket ? */
+	if (S_ISSOCK(inode->i_mode) && !(filp->f_mode & FMODE_PATH)) {
+		struct socket *sock = SOCKET_I(inode);
+		struct sock *s = sock->sk;
+
+		/* PF_UNIX ? */
+		if (s && sock->ops && sock->ops->family == PF_UNIX)
+			u_sock = s;
+	} else {
+		/* Could be an io_uring instance */
+		u_sock = io_uring_get_socket(filp);
+	}
+	return u_sock;
+}
+EXPORT_SYMBOL(unix_get_socket);
+
+/* Keep the number of times in flight count for the file
+ * descriptor if it is for an AF_UNIX socket.
+ */
+void unix_inflight(struct user_struct *user, struct file *fp)
+{
+	struct sock *s = unix_get_socket(fp);
+
+	spin_lock(&unix_gc_lock);
+
+	if (s) {
+		struct unix_sock *u = unix_sk(s);
+
+		if (atomic_long_inc_return(&u->inflight) == 1) {
+			BUG_ON(!list_empty(&u->link));
+			list_add_tail(&u->link, &gc_inflight_list);
+		} else {
+			BUG_ON(list_empty(&u->link));
+		}
+		unix_tot_inflight++;
+	}
+	user->unix_inflight++;
+	spin_unlock(&unix_gc_lock);
+}
+
+void unix_notinflight(struct user_struct *user, struct file *fp)
+{
+	struct sock *s = unix_get_socket(fp);
+
+	spin_lock(&unix_gc_lock);
+
+	if (s) {
+		struct unix_sock *u = unix_sk(s);
+
+		BUG_ON(!atomic_long_read(&u->inflight));
+		BUG_ON(list_empty(&u->link));
+
+		if (atomic_long_dec_and_test(&u->inflight))
+			list_del_init(&u->link);
+		unix_tot_inflight--;
+	}
+	user->unix_inflight--;
+	spin_unlock(&unix_gc_lock);
+}
+
+/*
+ * The "user->unix_inflight" variable is protected by the garbage
+ * collection lock, and we just read it locklessly here. If you go
+ * over the limit, there might be a tiny race in actually noticing
+ * it across threads. Tough.
+ */
+static inline bool too_many_unix_fds(struct task_struct *p)
+{
+	struct user_struct *user = current_user();
+
+	if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
+		return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
+	return false;
+}
+
+int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
+{
+	int i;
+
+	if (too_many_unix_fds(current))
+		return -ETOOMANYREFS;
+
+	/*
+	 * Need to duplicate file references for the sake of garbage
+	 * collection.  Otherwise a socket in the fps might become a
+	 * candidate for GC while the skb is not yet queued.
+	 */
+	UNIXCB(skb).fp = scm_fp_dup(scm->fp);
+	if (!UNIXCB(skb).fp)
+		return -ENOMEM;
+
+	for (i = scm->fp->count - 1; i >= 0; i--)
+		unix_inflight(scm->fp->user, scm->fp->fp[i]);
+	return 0;
+}
+EXPORT_SYMBOL(unix_attach_fds);
+
+void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
+{
+	int i;
+
+	scm->fp = UNIXCB(skb).fp;
+	UNIXCB(skb).fp = NULL;
+
+	for (i = scm->fp->count-1; i >= 0; i--)
+		unix_notinflight(scm->fp->user, scm->fp->fp[i]);
+}
+EXPORT_SYMBOL(unix_detach_fds);
+
+void unix_destruct_scm(struct sk_buff *skb)
+{
+	struct scm_cookie scm;
+
+	memset(&scm, 0, sizeof(scm));
+	scm.pid  = UNIXCB(skb).pid;
+	if (UNIXCB(skb).fp)
+		unix_detach_fds(&scm, skb);
+
+	/* Alas, it calls VFS */
+	/* So fscking what? fput() had been SMP-safe since the last Summer */
+	scm_destroy(&scm);
+	sock_wfree(skb);
+}
+EXPORT_SYMBOL(unix_destruct_scm);
--- a/net/unix/scm.h
+++ b/net/unix/scm.h
@ -0,0 +1,10 @@
+#ifndef NET_UNIX_SCM_H
+#define NET_UNIX_SCM_H
+
+extern struct list_head gc_inflight_list;
+extern spinlock_t unix_gc_lock;
+
+int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb);
+void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb);
+
+#endif
--- a/tools/io_uring/Makefile
+++ b/tools/io_uring/Makefile
@ -0,0 +1,18 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for io_uring test tools
+CFLAGS += -Wall -Wextra -g -D_GNU_SOURCE
+LDLIBS += -lpthread
+
+all: io_uring-cp io_uring-bench
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+io_uring-bench: syscall.o io_uring-bench.o
+	$(CC) $(CFLAGS) $(LDLIBS) -o $@ $^
+
+io_uring-cp: setup.o syscall.o queue.o
+
+clean:
+	$(RM) io_uring-cp io_uring-bench *.o
+
+.PHONY: all clean
--- a/tools/io_uring/README
+++ b/tools/io_uring/README
@ -0,0 +1,29 @@
+This directory includes a few programs that demonstrate how to use io_uring
+in an application. The examples are:
+
+io_uring-cp
+	A very basic io_uring implementation of cp(1). It takes two
+	arguments, copies the first argument to the second. This example
+	is part of liburing, and hence uses the simplified liburing API
+	for setting up an io_uring instance, submitting IO, completing IO,
+	etc. The support functions in queue.c and setup.c are straight
+	out of liburing.
+
+io_uring-bench
+	Benchmark program that does random reads on a number of files. This
+	app demonstrates the various features of io_uring, like fixed files,
+	fixed buffers, and polled IO. There are options in the program to
+	control which features to use. Arguments is the file (or files) that
+	io_uring-bench should operate on. This uses the raw io_uring
+	interface.
+
+liburing can be cloned with git here:
+
+	git://git.kernel.dk/liburing
+
+and contains a number of unit tests as well for testing io_uring. It also
+comes with man pages for the three system calls.
+
+Fio includes an io_uring engine, you can clone fio here:
+
+	git://git.kernel.dk/fio
--- a/tools/io_uring/barrier.h
+++ b/tools/io_uring/barrier.h
@ -0,0 +1,16 @@
+#ifndef LIBURING_BARRIER_H
+#define LIBURING_BARRIER_H
+
+#if defined(__x86_64) || defined(__i386__)
+#define read_barrier()	__asm__ __volatile__("":::"memory")
+#define write_barrier()	__asm__ __volatile__("":::"memory")
+#else
+/*
+ * Add arch appropriate definitions. Be safe and use full barriers for
+ * archs we don't have support for.
+ */
+#define read_barrier()	__sync_synchronize()
+#define write_barrier()	__sync_synchronize()
+#endif
+
+#endif
--- a/tools/io_uring/io_uring-bench.c
+++ b/tools/io_uring/io_uring-bench.c
@ -0,0 +1,616 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Simple benchmark program that uses the various features of io_uring
+ * to provide fast random access to a device/file. It has various
+ * options that are control how we use io_uring, see the OPTIONS section
+ * below. This uses the raw io_uring interface.
+ *
+ * Copyright (C) 2018-2019 Jens Axboe
+ */
+#include <stdio.h>
+#include <errno.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <stddef.h>
+#include <signal.h>
+#include <inttypes.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/mman.h>
+#include <sys/uio.h>
+#include <linux/fs.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <pthread.h>
+#include <sched.h>
+
+#include "liburing.h"
+#include "barrier.h"
+
+#ifndef IOCQE_FLAG_CACHEHIT
+#define IOCQE_FLAG_CACHEHIT	(1U << 0)
+#endif
+
+#define min(a, b)		((a < b) ? (a) : (b))
+
+struct io_sq_ring {
+	unsigned *head;
+	unsigned *tail;
+	unsigned *ring_mask;
+	unsigned *ring_entries;
+	unsigned *flags;
+	unsigned *array;
+};
+
+struct io_cq_ring {
+	unsigned *head;
+	unsigned *tail;
+	unsigned *ring_mask;
+	unsigned *ring_entries;
+	struct io_uring_cqe *cqes;
+};
+
+#define DEPTH			128
+
+#define BATCH_SUBMIT		32
+#define BATCH_COMPLETE		32
+
+#define BS			4096
+
+#define MAX_FDS			16
+
+static unsigned sq_ring_mask, cq_ring_mask;
+
+struct file {
+	unsigned long max_blocks;
+	unsigned pending_ios;
+	int real_fd;
+	int fixed_fd;
+};
+
+struct submitter {
+	pthread_t thread;
+	int ring_fd;
+	struct drand48_data rand;
+	struct io_sq_ring sq_ring;
+	struct io_uring_sqe *sqes;
+	struct iovec iovecs[DEPTH];
+	struct io_cq_ring cq_ring;
+	int inflight;
+	unsigned long reaps;
+	unsigned long done;
+	unsigned long calls;
+	unsigned long cachehit, cachemiss;
+	volatile int finish;
+
+	__s32 *fds;
+
+	struct file files[MAX_FDS];
+	unsigned nr_files;
+	unsigned cur_file;
+};
+
+static struct submitter submitters[1];
+static volatile int finish;
+
+/*
+ * OPTIONS: Set these to test the various features of io_uring.
+ */
+static int polled = 1;		/* use IO polling */
+static int fixedbufs = 1;	/* use fixed user buffers */
+static int register_files = 1;	/* use fixed files */
+static int buffered = 0;	/* use buffered IO, not O_DIRECT */
+static int sq_thread_poll = 0;	/* use kernel submission/poller thread */
+static int sq_thread_cpu = -1;	/* pin above thread to this CPU */
+static int do_nop = 0;		/* no-op SQ ring commands */
+
+static int io_uring_register_buffers(struct submitter *s)
+{
+	if (do_nop)
+		return 0;
+
+	return io_uring_register(s->ring_fd, IORING_REGISTER_BUFFERS, s->iovecs,
+					DEPTH);
+}
+
+static int io_uring_register_files(struct submitter *s)
+{
+	unsigned i;
+
+	if (do_nop)
+		return 0;
+
+	s->fds = calloc(s->nr_files, sizeof(__s32));
+	for (i = 0; i < s->nr_files; i++) {
+		s->fds[i] = s->files[i].real_fd;
+		s->files[i].fixed_fd = i;
+	}
+
+	return io_uring_register(s->ring_fd, IORING_REGISTER_FILES, s->fds,
+					s->nr_files);
+}
+
+static int gettid(void)
+{
+	return syscall(__NR_gettid);
+}
+
+static unsigned file_depth(struct submitter *s)
+{
+	return (DEPTH + s->nr_files - 1) / s->nr_files;
+}
+
+static void init_io(struct submitter *s, unsigned index)
+{
+	struct io_uring_sqe *sqe = &s->sqes[index];
+	unsigned long offset;
+	struct file *f;
+	long r;
+
+	if (do_nop) {
+		sqe->opcode = IORING_OP_NOP;
+		return;
+	}
+
+	if (s->nr_files == 1) {
+		f = &s->files[0];
+	} else {
+		f = &s->files[s->cur_file];
+		if (f->pending_ios >= file_depth(s)) {
+			s->cur_file++;
+			if (s->cur_file == s->nr_files)
+				s->cur_file = 0;
+			f = &s->files[s->cur_file];
+		}
+	}
+	f->pending_ios++;
+
+	lrand48_r(&s->rand, &r);
+	offset = (r % (f->max_blocks - 1)) * BS;
+
+	if (register_files) {
+		sqe->flags = IOSQE_FIXED_FILE;
+		sqe->fd = f->fixed_fd;
+	} else {
+		sqe->flags = 0;
+		sqe->fd = f->real_fd;
+	}
+	if (fixedbufs) {
+		sqe->opcode = IORING_OP_READ_FIXED;
+		sqe->addr = (unsigned long) s->iovecs[index].iov_base;
+		sqe->len = BS;
+		sqe->buf_index = index;
+	} else {
+		sqe->opcode = IORING_OP_READV;
+		sqe->addr = (unsigned long) &s->iovecs[index];
+		sqe->len = 1;
+		sqe->buf_index = 0;
+	}
+	sqe->ioprio = 0;
+	sqe->off = offset;
+	sqe->user_data = (unsigned long) f;
+}
+
+static int prep_more_ios(struct submitter *s, unsigned max_ios)
+{
+	struct io_sq_ring *ring = &s->sq_ring;
+	unsigned index, tail, next_tail, prepped = 0;
+
+	next_tail = tail = *ring->tail;
+	do {
+		next_tail++;
+		read_barrier();
+		if (next_tail == *ring->head)
+			break;
+
+		index = tail & sq_ring_mask;
+		init_io(s, index);
+		ring->array[index] = index;
+		prepped++;
+		tail = next_tail;
+	} while (prepped < max_ios);
+
+	if (*ring->tail != tail) {
+		/* order tail store with writes to sqes above */
+		write_barrier();
+		*ring->tail = tail;
+		write_barrier();
+	}
+	return prepped;
+}
+
+static int get_file_size(struct file *f)
+{
+	struct stat st;
+
+	if (fstat(f->real_fd, &st) < 0)
+		return -1;
+	if (S_ISBLK(st.st_mode)) {
+		unsigned long long bytes;
+
+		if (ioctl(f->real_fd, BLKGETSIZE64, &bytes) != 0)
+			return -1;
+
+		f->max_blocks = bytes / BS;
+		return 0;
+	} else if (S_ISREG(st.st_mode)) {
+		f->max_blocks = st.st_size / BS;
+		return 0;
+	}
+
+	return -1;
+}
+
+static int reap_events(struct submitter *s)
+{
+	struct io_cq_ring *ring = &s->cq_ring;
+	struct io_uring_cqe *cqe;
+	unsigned head, reaped = 0;
+
+	head = *ring->head;
+	do {
+		struct file *f;
+
+		read_barrier();
+		if (head == *ring->tail)
+			break;
+		cqe = &ring->cqes[head & cq_ring_mask];
+		if (!do_nop) {
+			f = (struct file *) (uintptr_t) cqe->user_data;
+			f->pending_ios--;
+			if (cqe->res != BS) {
+				printf("io: unexpected ret=%d\n", cqe->res);
+				if (polled && cqe->res == -EOPNOTSUPP)
+					printf("Your filesystem doesn't support poll\n");
+				return -1;
+			}
+		}
+		if (cqe->flags & IOCQE_FLAG_CACHEHIT)
+			s->cachehit++;
+		else
+			s->cachemiss++;
+		reaped++;
+		head++;
+	} while (1);
+
+	s->inflight -= reaped;
+	*ring->head = head;
+	write_barrier();
+	return reaped;
+}
+
+static void *submitter_fn(void *data)
+{
+	struct submitter *s = data;
+	struct io_sq_ring *ring = &s->sq_ring;
+	int ret, prepped;
+
+	printf("submitter=%d\n", gettid());
+
+	srand48_r(pthread_self(), &s->rand);
+
+	prepped = 0;
+	do {
+		int to_wait, to_submit, this_reap, to_prep;
+
+		if (!prepped && s->inflight < DEPTH) {
+			to_prep = min(DEPTH - s->inflight, BATCH_SUBMIT);
+			prepped = prep_more_ios(s, to_prep);
+		}
+		s->inflight += prepped;
+submit_more:
+		to_submit = prepped;
+submit:
+		if (to_submit && (s->inflight + to_submit <= DEPTH))
+			to_wait = 0;
+		else
+			to_wait = min(s->inflight + to_submit, BATCH_COMPLETE);
+
+		/*
+		 * Only need to call io_uring_enter if we're not using SQ thread
+		 * poll, or if IORING_SQ_NEED_WAKEUP is set.
+		 */
+		if (!sq_thread_poll || (*ring->flags & IORING_SQ_NEED_WAKEUP)) {
+			unsigned flags = 0;
+
+			if (to_wait)
+				flags = IORING_ENTER_GETEVENTS;
+			if ((*ring->flags & IORING_SQ_NEED_WAKEUP))
+				flags |= IORING_ENTER_SQ_WAKEUP;
+			ret = io_uring_enter(s->ring_fd, to_submit, to_wait,
+						flags, NULL);
+			s->calls++;
+		}
+
+		/*
+		 * For non SQ thread poll, we already got the events we needed
+		 * through the io_uring_enter() above. For SQ thread poll, we
+		 * need to loop here until we find enough events.
+		 */
+		this_reap = 0;
+		do {
+			int r;
+			r = reap_events(s);
+			if (r == -1) {
+				s->finish = 1;
+				break;
+			} else if (r > 0)
+				this_reap += r;
+		} while (sq_thread_poll && this_reap < to_wait);
+		s->reaps += this_reap;
+
+		if (ret >= 0) {
+			if (!ret) {
+				to_submit = 0;
+				if (s->inflight)
+					goto submit;
+				continue;
+			} else if (ret < to_submit) {
+				int diff = to_submit - ret;
+
+				s->done += ret;
+				prepped -= diff;
+				goto submit_more;
+			}
+			s->done += ret;
+			prepped = 0;
+			continue;
+		} else if (ret < 0) {
+			if (errno == EAGAIN) {
+				if (s->finish)
+					break;
+				if (this_reap)
+					goto submit;
+				to_submit = 0;
+				goto submit;
+			}
+			printf("io_submit: %s\n", strerror(errno));
+			break;
+		}
+	} while (!s->finish);
+
+	finish = 1;
+	return NULL;
+}
+
+static void sig_int(int sig)
+{
+	printf("Exiting on signal %d\n", sig);
+	submitters[0].finish = 1;
+	finish = 1;
+}
+
+static void arm_sig_int(void)
+{
+	struct sigaction act;
+
+	memset(&act, 0, sizeof(act));
+	act.sa_handler = sig_int;
+	act.sa_flags = SA_RESTART;
+	sigaction(SIGINT, &act, NULL);
+}
+
+static int setup_ring(struct submitter *s)
+{
+	struct io_sq_ring *sring = &s->sq_ring;
+	struct io_cq_ring *cring = &s->cq_ring;
+	struct io_uring_params p;
+	int ret, fd;
+	void *ptr;
+
+	memset(&p, 0, sizeof(p));
+
+	if (polled && !do_nop)
+		p.flags |= IORING_SETUP_IOPOLL;
+	if (sq_thread_poll) {
+		p.flags |= IORING_SETUP_SQPOLL;
+		if (sq_thread_cpu != -1) {
+			p.flags |= IORING_SETUP_SQ_AFF;
+			p.sq_thread_cpu = sq_thread_cpu;
+		}
+	}
+
+	fd = io_uring_setup(DEPTH, &p);
+	if (fd < 0) {
+		perror("io_uring_setup");
+		return 1;
+	}
+	s->ring_fd = fd;
+
+	if (fixedbufs) {
+		ret = io_uring_register_buffers(s);
+		if (ret < 0) {
+			perror("io_uring_register_buffers");
+			return 1;
+		}
+	}
+
+	if (register_files) {
+		ret = io_uring_register_files(s);
+		if (ret < 0) {
+			perror("io_uring_register_files");
+			return 1;
+		}
+	}
+
+	ptr = mmap(0, p.sq_off.array + p.sq_entries * sizeof(__u32),
+			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
+			IORING_OFF_SQ_RING);
+	printf("sq_ring ptr = 0x%p\n", ptr);
+	sring->head = ptr + p.sq_off.head;
+	sring->tail = ptr + p.sq_off.tail;
+	sring->ring_mask = ptr + p.sq_off.ring_mask;
+	sring->ring_entries = ptr + p.sq_off.ring_entries;
+	sring->flags = ptr + p.sq_off.flags;
+	sring->array = ptr + p.sq_off.array;
+	sq_ring_mask = *sring->ring_mask;
+
+	s->sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
+			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
+			IORING_OFF_SQES);
+	printf("sqes ptr    = 0x%p\n", s->sqes);
+
+	ptr = mmap(0, p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe),
+			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
+			IORING_OFF_CQ_RING);
+	printf("cq_ring ptr = 0x%p\n", ptr);
+	cring->head = ptr + p.cq_off.head;
+	cring->tail = ptr + p.cq_off.tail;
+	cring->ring_mask = ptr + p.cq_off.ring_mask;
+	cring->ring_entries = ptr + p.cq_off.ring_entries;
+	cring->cqes = ptr + p.cq_off.cqes;
+	cq_ring_mask = *cring->ring_mask;
+	return 0;
+}
+
+static void file_depths(char *buf)
+{
+	struct submitter *s = &submitters[0];
+	unsigned i;
+	char *p;
+
+	buf[0] = '\0';
+	p = buf;
+	for (i = 0; i < s->nr_files; i++) {
+		struct file *f = &s->files[i];
+
+		if (i + 1 == s->nr_files)
+			p += sprintf(p, "%d", f->pending_ios);
+		else
+			p += sprintf(p, "%d, ", f->pending_ios);
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	struct submitter *s = &submitters[0];
+	unsigned long done, calls, reap, cache_hit, cache_miss;
+	int err, i, flags, fd;
+	char *fdepths;
+	void *ret;
+
+	if (!do_nop && argc < 2) {
+		printf("%s: filename\n", argv[0]);
+		return 1;
+	}
+
+	flags = O_RDONLY | O_NOATIME;
+	if (!buffered)
+		flags |= O_DIRECT;
+
+	i = 1;
+	while (!do_nop && i < argc) {
+		struct file *f;
+
+		if (s->nr_files == MAX_FDS) {
+			printf("Max number of files (%d) reached\n", MAX_FDS);
+			break;
+		}
+		fd = open(argv[i], flags);
+		if (fd < 0) {
+			perror("open");
+			return 1;
+		}
+
+		f = &s->files[s->nr_files];
+		f->real_fd = fd;
+		if (get_file_size(f)) {
+			printf("failed getting size of device/file\n");
+			return 1;
+		}
+		if (f->max_blocks <= 1) {
+			printf("Zero file/device size?\n");
+			return 1;
+		}
+		f->max_blocks--;
+
+		printf("Added file %s\n", argv[i]);
+		s->nr_files++;
+		i++;
+	}
+
+	if (fixedbufs) {
+		struct rlimit rlim;
+
+		rlim.rlim_cur = RLIM_INFINITY;
+		rlim.rlim_max = RLIM_INFINITY;
+		if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) {
+			perror("setrlimit");
+			return 1;
+		}
+	}
+
+	arm_sig_int();
+
+	for (i = 0; i < DEPTH; i++) {
+		void *buf;
+
+		if (posix_memalign(&buf, BS, BS)) {
+			printf("failed alloc\n");
+			return 1;
+		}
+		s->iovecs[i].iov_base = buf;
+		s->iovecs[i].iov_len = BS;
+	}
+
+	err = setup_ring(s);
+	if (err) {
+		printf("ring setup failed: %s, %d\n", strerror(errno), err);
+		return 1;
+	}
+	printf("polled=%d, fixedbufs=%d, buffered=%d", polled, fixedbufs, buffered);
+	printf(" QD=%d, sq_ring=%d, cq_ring=%d\n", DEPTH, *s->sq_ring.ring_entries, *s->cq_ring.ring_entries);
+
+	pthread_create(&s->thread, NULL, submitter_fn, s);
+
+	fdepths = malloc(8 * s->nr_files);
+	cache_hit = cache_miss = reap = calls = done = 0;
+	do {
+		unsigned long this_done = 0;
+		unsigned long this_reap = 0;
+		unsigned long this_call = 0;
+		unsigned long this_cache_hit = 0;
+		unsigned long this_cache_miss = 0;
+		unsigned long rpc = 0, ipc = 0;
+		double hit = 0.0;
+
+		sleep(1);
+		this_done += s->done;
+		this_call += s->calls;
+		this_reap += s->reaps;
+		this_cache_hit += s->cachehit;
+		this_cache_miss += s->cachemiss;
+		if (this_cache_hit && this_cache_miss) {
+			unsigned long hits, total;
+
+			hits = this_cache_hit - cache_hit;
+			total = hits + this_cache_miss - cache_miss;
+			hit = (double) hits / (double) total;
+			hit *= 100.0;
+		}
+		if (this_call - calls) {
+			rpc = (this_done - done) / (this_call - calls);
+			ipc = (this_reap - reap) / (this_call - calls);
+		} else
+			rpc = ipc = -1;
+		file_depths(fdepths);
+		printf("IOPS=%lu, IOS/call=%ld/%ld, inflight=%u (%s), Cachehit=%0.2f%%\n",
+				this_done - done, rpc, ipc, s->inflight,
+				fdepths, hit);
+		done = this_done;
+		calls = this_call;
+		reap = this_reap;
+		cache_hit = s->cachehit;
+		cache_miss = s->cachemiss;
+	} while (!finish);
+
+	pthread_join(s->thread, &ret);
+	close(s->ring_fd);
+	free(fdepths);
+	return 0;
+}
--- a/tools/io_uring/io_uring-cp.c
+++ b/tools/io_uring/io_uring-cp.c
@ -0,0 +1,251 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Simple test program that demonstrates a file copy through io_uring. This
+ * uses the API exposed by liburing.
+ *
+ * Copyright (C) 2018-2019 Jens Axboe
+ */
+#include <stdio.h>
+#include <fcntl.h>
+#include <string.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <assert.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+
+#include "liburing.h"
+
+#define QD	64
+#define BS	(32*1024)
+
+static int infd, outfd;
+
+struct io_data {
+	int read;
+	off_t first_offset, offset;
+	size_t first_len;
+	struct iovec iov;
+};
+
+static int setup_context(unsigned entries, struct io_uring *ring)
+{
+	int ret;
+
+	ret = io_uring_queue_init(entries, ring, 0);
+	if (ret < 0) {
+		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int get_file_size(int fd, off_t *size)
+{
+	struct stat st;
+
+	if (fstat(fd, &st) < 0)
+		return -1;
+	if (S_ISREG(st.st_mode)) {
+		*size = st.st_size;
+		return 0;
+	} else if (S_ISBLK(st.st_mode)) {
+		unsigned long long bytes;
+
+		if (ioctl(fd, BLKGETSIZE64, &bytes) != 0)
+			return -1;
+
+		*size = bytes;
+		return 0;
+	}
+
+	return -1;
+}
+
+static void queue_prepped(struct io_uring *ring, struct io_data *data)
+{
+	struct io_uring_sqe *sqe;
+
+	sqe = io_uring_get_sqe(ring);
+	assert(sqe);
+
+	if (data->read)
+		io_uring_prep_readv(sqe, infd, &data->iov, 1, data->offset);
+	else
+		io_uring_prep_writev(sqe, outfd, &data->iov, 1, data->offset);
+
+	io_uring_sqe_set_data(sqe, data);
+}
+
+static int queue_read(struct io_uring *ring, off_t size, off_t offset)
+{
+	struct io_uring_sqe *sqe;
+	struct io_data *data;
+
+	sqe = io_uring_get_sqe(ring);
+	if (!sqe)
+		return 1;
+
+	data = malloc(size + sizeof(*data));
+	data->read = 1;
+	data->offset = data->first_offset = offset;
+
+	data->iov.iov_base = data + 1;
+	data->iov.iov_len = size;
+	data->first_len = size;
+
+	io_uring_prep_readv(sqe, infd, &data->iov, 1, offset);
+	io_uring_sqe_set_data(sqe, data);
+	return 0;
+}
+
+static void queue_write(struct io_uring *ring, struct io_data *data)
+{
+	data->read = 0;
+	data->offset = data->first_offset;
+
+	data->iov.iov_base = data + 1;
+	data->iov.iov_len = data->first_len;
+
+	queue_prepped(ring, data);
+	io_uring_submit(ring);
+}
+
+static int copy_file(struct io_uring *ring, off_t insize)
+{
+	unsigned long reads, writes;
+	struct io_uring_cqe *cqe;
+	off_t write_left, offset;
+	int ret;
+
+	write_left = insize;
+	writes = reads = offset = 0;
+
+	while (insize || write_left) {
+		unsigned long had_reads;
+		int got_comp;
+
+		/*
+		 * Queue up as many reads as we can
+		 */
+		had_reads = reads;
+		while (insize) {
+			off_t this_size = insize;
+
+			if (reads + writes >= QD)
+				break;
+			if (this_size > BS)
+				this_size = BS;
+			else if (!this_size)
+				break;
+
+			if (queue_read(ring, this_size, offset))
+				break;
+
+			insize -= this_size;
+			offset += this_size;
+			reads++;
+		}
+
+		if (had_reads != reads) {
+			ret = io_uring_submit(ring);
+			if (ret < 0) {
+				fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
+				break;
+			}
+		}
+
+		/*
+		 * Queue is full at this point. Find at least one completion.
+		 */
+		got_comp = 0;
+		while (write_left) {
+			struct io_data *data;
+
+			if (!got_comp) {
+				ret = io_uring_wait_completion(ring, &cqe);
+				got_comp = 1;
+			} else
+				ret = io_uring_get_completion(ring, &cqe);
+			if (ret < 0) {
+				fprintf(stderr, "io_uring_get_completion: %s\n",
+							strerror(-ret));
+				return 1;
+			}
+			if (!cqe)
+				break;
+
+			data = (struct io_data *) (uintptr_t) cqe->user_data;
+			if (cqe->res < 0) {
+				if (cqe->res == -EAGAIN) {
+					queue_prepped(ring, data);
+					continue;
+				}
+				fprintf(stderr, "cqe failed: %s\n",
+						strerror(-cqe->res));
+				return 1;
+			} else if ((size_t) cqe->res != data->iov.iov_len) {
+				/* Short read/write, adjust and requeue */
+				data->iov.iov_base += cqe->res;
+				data->iov.iov_len -= cqe->res;
+				data->offset += cqe->res;
+				queue_prepped(ring, data);
+				continue;
+			}
+
+			/*
+			 * All done. if write, nothing else to do. if read,
+			 * queue up corresponding write.
+			 */
+			if (data->read) {
+				queue_write(ring, data);
+				write_left -= data->first_len;
+				reads--;
+				writes++;
+			} else {
+				free(data);
+				writes--;
+			}
+		}
+	}
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	struct io_uring ring;
+	off_t insize;
+	int ret;
+
+	if (argc < 3) {
+		printf("%s: infile outfile\n", argv[0]);
+		return 1;
+	}
+
+	infd = open(argv[1], O_RDONLY);
+	if (infd < 0) {
+		perror("open infile");
+		return 1;
+	}
+	outfd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
+	if (outfd < 0) {
+		perror("open outfile");
+		return 1;
+	}
+
+	if (setup_context(QD, &ring))
+		return 1;
+	if (get_file_size(infd, &insize))
+		return 1;
+
+	ret = copy_file(&ring, insize);
+
+	close(infd);
+	close(outfd);
+	io_uring_queue_exit(&ring);
+	return ret;
+}
--- a/tools/io_uring/liburing.h
+++ b/tools/io_uring/liburing.h
@ -0,0 +1,143 @@
+#ifndef LIB_URING_H
+#define LIB_URING_H
+
+#include <sys/uio.h>
+#include <signal.h>
+#include <string.h>
+#include "../../include/uapi/linux/io_uring.h"
+
+/*
+ * Library interface to io_uring
+ */
+struct io_uring_sq {
+	unsigned *khead;
+	unsigned *ktail;
+	unsigned *kring_mask;
+	unsigned *kring_entries;
+	unsigned *kflags;
+	unsigned *kdropped;
+	unsigned *array;
+	struct io_uring_sqe *sqes;
+
+	unsigned sqe_head;
+	unsigned sqe_tail;
+
+	size_t ring_sz;
+};
+
+struct io_uring_cq {
+	unsigned *khead;
+	unsigned *ktail;
+	unsigned *kring_mask;
+	unsigned *kring_entries;
+	unsigned *koverflow;
+	struct io_uring_cqe *cqes;
+
+	size_t ring_sz;
+};
+
+struct io_uring {
+	struct io_uring_sq sq;
+	struct io_uring_cq cq;
+	int ring_fd;
+};
+
+/*
+ * System calls
+ */
+extern int io_uring_setup(unsigned entries, struct io_uring_params *p);
+extern int io_uring_enter(unsigned fd, unsigned to_submit,
+	unsigned min_complete, unsigned flags, sigset_t *sig);
+extern int io_uring_register(int fd, unsigned int opcode, void *arg,
+	unsigned int nr_args);
+
+/*
+ * Library interface
+ */
+extern int io_uring_queue_init(unsigned entries, struct io_uring *ring,
+	unsigned flags);
+extern int io_uring_queue_mmap(int fd, struct io_uring_params *p,
+	struct io_uring *ring);
+extern void io_uring_queue_exit(struct io_uring *ring);
+extern int io_uring_get_completion(struct io_uring *ring,
+	struct io_uring_cqe **cqe_ptr);
+extern int io_uring_wait_completion(struct io_uring *ring,
+	struct io_uring_cqe **cqe_ptr);
+extern int io_uring_submit(struct io_uring *ring);
+extern struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring);
+
+/*
+ * Command prep helpers
+ */
+static inline void io_uring_sqe_set_data(struct io_uring_sqe *sqe, void *data)
+{
+	sqe->user_data = (unsigned long) data;
+}
+
+static inline void io_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd,
+				    void *addr, unsigned len, off_t offset)
+{
+	memset(sqe, 0, sizeof(*sqe));
+	sqe->opcode = op;
+	sqe->fd = fd;
+	sqe->off = offset;
+	sqe->addr = (unsigned long) addr;
+	sqe->len = len;
+}
+
+static inline void io_uring_prep_readv(struct io_uring_sqe *sqe, int fd,
+				       struct iovec *iovecs, unsigned nr_vecs,
+				       off_t offset)
+{
+	io_uring_prep_rw(IORING_OP_READV, sqe, fd, iovecs, nr_vecs, offset);
+}
+
+static inline void io_uring_prep_read_fixed(struct io_uring_sqe *sqe, int fd,
+					    void *buf, unsigned nbytes,
+					    off_t offset)
+{
+	io_uring_prep_rw(IORING_OP_READ_FIXED, sqe, fd, buf, nbytes, offset);
+}
+
+static inline void io_uring_prep_writev(struct io_uring_sqe *sqe, int fd,
+				        struct iovec *iovecs, unsigned nr_vecs,
+					off_t offset)
+{
+	io_uring_prep_rw(IORING_OP_WRITEV, sqe, fd, iovecs, nr_vecs, offset);
+}
+
+static inline void io_uring_prep_write_fixed(struct io_uring_sqe *sqe, int fd,
+					     void *buf, unsigned nbytes,
+					     off_t offset)
+{
+	io_uring_prep_rw(IORING_OP_WRITE_FIXED, sqe, fd, buf, nbytes, offset);
+}
+
+static inline void io_uring_prep_poll_add(struct io_uring_sqe *sqe, int fd,
+					  short poll_mask)
+{
+	memset(sqe, 0, sizeof(*sqe));
+	sqe->opcode = IORING_OP_POLL_ADD;
+	sqe->fd = fd;
+	sqe->poll_events = poll_mask;
+}
+
+static inline void io_uring_prep_poll_remove(struct io_uring_sqe *sqe,
+					     void *user_data)
+{
+	memset(sqe, 0, sizeof(*sqe));
+	sqe->opcode = IORING_OP_POLL_REMOVE;
+	sqe->addr = (unsigned long) user_data;
+}
+
+static inline void io_uring_prep_fsync(struct io_uring_sqe *sqe, int fd,
+				       int datasync)
+{
+	memset(sqe, 0, sizeof(*sqe));
+	sqe->opcode = IORING_OP_FSYNC;
+	sqe->fd = fd;
+	if (datasync)
+		sqe->fsync_flags = IORING_FSYNC_DATASYNC;
+}
+
+#endif
--- a/tools/io_uring/queue.c
+++ b/tools/io_uring/queue.c
@ -0,0 +1,164 @@
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+
+#include "liburing.h"
+#include "barrier.h"
+
+static int __io_uring_get_completion(struct io_uring *ring,
+				     struct io_uring_cqe **cqe_ptr, int wait)
+{
+	struct io_uring_cq *cq = &ring->cq;
+	const unsigned mask = *cq->kring_mask;
+	unsigned head;
+	int ret;
+
+	*cqe_ptr = NULL;
+	head = *cq->khead;
+	do {
+		/*
+		 * It's necessary to use a read_barrier() before reading
+		 * the CQ tail, since the kernel updates it locklessly. The
+		 * kernel has the matching store barrier for the update. The
+		 * kernel also ensures that previous stores to CQEs are ordered
+		 * with the tail update.
+		 */
+		read_barrier();
+		if (head != *cq->ktail) {
+			*cqe_ptr = &cq->cqes[head & mask];
+			break;
+		}
+		if (!wait)
+			break;
+		ret = io_uring_enter(ring->ring_fd, 0, 1,
+					IORING_ENTER_GETEVENTS, NULL);
+		if (ret < 0)
+			return -errno;
+	} while (1);
+
+	if (*cqe_ptr) {
+		*cq->khead = head + 1;
+		/*
+		 * Ensure that the kernel sees our new head, the kernel has
+		 * the matching read barrier.
+		 */
+		write_barrier();
+	}
+
+	return 0;
+}
+
+/*
+ * Return an IO completion, if one is readily available
+ */
+int io_uring_get_completion(struct io_uring *ring,
+			    struct io_uring_cqe **cqe_ptr)
+{
+	return __io_uring_get_completion(ring, cqe_ptr, 0);
+}
+
+/*
+ * Return an IO completion, waiting for it if necessary
+ */
+int io_uring_wait_completion(struct io_uring *ring,
+			     struct io_uring_cqe **cqe_ptr)
+{
+	return __io_uring_get_completion(ring, cqe_ptr, 1);
+}
+
+/*
+ * Submit sqes acquired from io_uring_get_sqe() to the kernel.
+ *
+ * Returns number of sqes submitted
+ */
+int io_uring_submit(struct io_uring *ring)
+{
+	struct io_uring_sq *sq = &ring->sq;
+	const unsigned mask = *sq->kring_mask;
+	unsigned ktail, ktail_next, submitted;
+	int ret;
+
+	/*
+	 * If we have pending IO in the kring, submit it first. We need a
+	 * read barrier here to match the kernels store barrier when updating
+	 * the SQ head.
+	 */
+	read_barrier();
+	if (*sq->khead != *sq->ktail) {
+		submitted = *sq->kring_entries;
+		goto submit;
+	}
+
+	if (sq->sqe_head == sq->sqe_tail)
+		return 0;
+
+	/*
+	 * Fill in sqes that we have queued up, adding them to the kernel ring
+	 */
+	submitted = 0;
+	ktail = ktail_next = *sq->ktail;
+	while (sq->sqe_head < sq->sqe_tail) {
+		ktail_next++;
+		read_barrier();
+
+		sq->array[ktail & mask] = sq->sqe_head & mask;
+		ktail = ktail_next;
+
+		sq->sqe_head++;
+		submitted++;
+	}
+
+	if (!submitted)
+		return 0;
+
+	if (*sq->ktail != ktail) {
+		/*
+		 * First write barrier ensures that the SQE stores are updated
+		 * with the tail update. This is needed so that the kernel
+		 * will never see a tail update without the preceeding sQE
+		 * stores being done.
+		 */
+		write_barrier();
+		*sq->ktail = ktail;
+		/*
+		 * The kernel has the matching read barrier for reading the
+		 * SQ tail.
+		 */
+		write_barrier();
+	}
+
+submit:
+	ret = io_uring_enter(ring->ring_fd, submitted, 0,
+				IORING_ENTER_GETEVENTS, NULL);
+	if (ret < 0)
+		return -errno;
+
+	return 0;
+}
+
+/*
+ * Return an sqe to fill. Application must later call io_uring_submit()
+ * when it's ready to tell the kernel about it. The caller may call this
+ * function multiple times before calling io_uring_submit().
+ *
+ * Returns a vacant sqe, or NULL if we're full.
+ */
+struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring)
+{
+	struct io_uring_sq *sq = &ring->sq;
+	unsigned next = sq->sqe_tail + 1;
+	struct io_uring_sqe *sqe;
+
+	/*
+	 * All sqes are used
+	 */
+	if (next - sq->sqe_head > *sq->kring_entries)
+		return NULL;
+
+	sqe = &sq->sqes[sq->sqe_tail & *sq->kring_mask];
+	sq->sqe_tail = next;
+	return sqe;
+}
--- a/tools/io_uring/setup.c
+++ b/tools/io_uring/setup.c
@ -0,0 +1,103 @@
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+
+#include "liburing.h"
+
+static int io_uring_mmap(int fd, struct io_uring_params *p,
+			 struct io_uring_sq *sq, struct io_uring_cq *cq)
+{
+	size_t size;
+	void *ptr;
+	int ret;
+
+	sq->ring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned);
+	ptr = mmap(0, sq->ring_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
+	if (ptr == MAP_FAILED)
+		return -errno;
+	sq->khead = ptr + p->sq_off.head;
+	sq->ktail = ptr + p->sq_off.tail;
+	sq->kring_mask = ptr + p->sq_off.ring_mask;
+	sq->kring_entries = ptr + p->sq_off.ring_entries;
+	sq->kflags = ptr + p->sq_off.flags;
+	sq->kdropped = ptr + p->sq_off.dropped;
+	sq->array = ptr + p->sq_off.array;
+
+	size = p->sq_entries * sizeof(struct io_uring_sqe),
+	sq->sqes = mmap(0, size, PROT_READ | PROT_WRITE,
+				MAP_SHARED | MAP_POPULATE, fd,
+				IORING_OFF_SQES);
+	if (sq->sqes == MAP_FAILED) {
+		ret = -errno;
+err:
+		munmap(sq->khead, sq->ring_sz);
+		return ret;
+	}
+
+	cq->ring_sz = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
+	ptr = mmap(0, cq->ring_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
+	if (ptr == MAP_FAILED) {
+		ret = -errno;
+		munmap(sq->sqes, p->sq_entries * sizeof(struct io_uring_sqe));
+		goto err;
+	}
+	cq->khead = ptr + p->cq_off.head;
+	cq->ktail = ptr + p->cq_off.tail;
+	cq->kring_mask = ptr + p->cq_off.ring_mask;
+	cq->kring_entries = ptr + p->cq_off.ring_entries;
+	cq->koverflow = ptr + p->cq_off.overflow;
+	cq->cqes = ptr + p->cq_off.cqes;
+	return 0;
+}
+
+/*
+ * For users that want to specify sq_thread_cpu or sq_thread_idle, this
+ * interface is a convenient helper for mmap()ing the rings.
+ * Returns -1 on error, or zero on success.  On success, 'ring'
+ * contains the necessary information to read/write to the rings.
+ */
+int io_uring_queue_mmap(int fd, struct io_uring_params *p, struct io_uring *ring)
+{
+	int ret;
+
+	memset(ring, 0, sizeof(*ring));
+	ret = io_uring_mmap(fd, p, &ring->sq, &ring->cq);
+	if (!ret)
+		ring->ring_fd = fd;
+	return ret;
+}
+
+/*
+ * Returns -1 on error, or zero on success. On success, 'ring'
+ * contains the necessary information to read/write to the rings.
+ */
+int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags)
+{
+	struct io_uring_params p;
+	int fd;
+
+	memset(&p, 0, sizeof(p));
+	p.flags = flags;
+
+	fd = io_uring_setup(entries, &p);
+	if (fd < 0)
+		return fd;
+
+	return io_uring_queue_mmap(fd, &p, ring);
+}
+
+void io_uring_queue_exit(struct io_uring *ring)
+{
+	struct io_uring_sq *sq = &ring->sq;
+	struct io_uring_cq *cq = &ring->cq;
+
+	munmap(sq->sqes, *sq->kring_entries * sizeof(struct io_uring_sqe));
+	munmap(sq->khead, sq->ring_sz);
+	munmap(cq->khead, cq->ring_sz);
+	close(ring->ring_fd);
+}
--- a/tools/io_uring/syscall.c
+++ b/tools/io_uring/syscall.c
@ -0,0 +1,40 @@
+/*
+ * Will go away once libc support is there
+ */
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <sys/uio.h>
+#include <signal.h>
+#include "liburing.h"
+
+#if defined(__x86_64) || defined(__i386__)
+#ifndef __NR_sys_io_uring_setup
+#define __NR_sys_io_uring_setup		425
+#endif
+#ifndef __NR_sys_io_uring_enter
+#define __NR_sys_io_uring_enter		426
+#endif
+#ifndef __NR_sys_io_uring_register
+#define __NR_sys_io_uring_register	427
+#endif
+#else
+#error "Arch not supported yet"
+#endif
+
+int io_uring_register(int fd, unsigned int opcode, void *arg,
+		      unsigned int nr_args)
+{
+	return syscall(__NR_sys_io_uring_register, fd, opcode, arg, nr_args);
+}
+
+int io_uring_setup(unsigned entries, struct io_uring_params *p)
+{
+	return syscall(__NR_sys_io_uring_setup, entries, p);
+}
+
+int io_uring_enter(unsigned fd, unsigned to_submit, unsigned min_complete,
+		   unsigned flags, sigset_t *sig)
+{
+	return syscall(__NR_sys_io_uring_enter, fd, to_submit, min_complete,
+			flags, sig, _NSIG / 8);
+}