remarkable-linux/fs
Andi Kleen 6a46079cf5 HWPOISON: The high level memory error handler in the VM v7
Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This is done at the VM level by marking a page hwpoisoned
and taking the appropriate action based on the type of page
it is.
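
As a rough illustration of "mark the page, then act on its type", here is a
toy userspace model in C. It is only a sketch: the flag names, the
toy_page struct, the outcome values and the handlers are all hypothetical
and are not the kernel's data structures or its actual recovery behaviour.
It only shows the idea of a first-match table keyed on page state.

```c
#include <stddef.h>

/* Hypothetical state bits for this toy model (NOT real kernel page flags). */
enum {
    TOY_PG_DIRTY  = 1u << 0,   /* page holds data not yet written back */
    TOY_PG_SLAB   = 1u << 1,   /* page is used by a kernel allocator   */
    TOY_PG_POISON = 1u << 2,   /* page has been marked as corrupted    */
};

/* Hypothetical outcomes, chosen for illustration only. */
enum toy_outcome { TOY_RECOVERED, TOY_DATA_LOST, TOY_IGNORED };

struct toy_page {
    unsigned flags;
};

/* Clean page cache: the copy on disk is still good, so drop the page. */
static enum toy_outcome drop_clean(struct toy_page *p)
{
    (void)p;
    return TOY_RECOVERED;
}

/* Dirty page cache: the only up-to-date copy was in the bad page. */
static enum toy_outcome drop_dirty(struct toy_page *p)
{
    p->flags &= ~TOY_PG_DIRTY;
    return TOY_DATA_LOST;
}

/* Kernel-internal page: this toy model does not try to recover it. */
static enum toy_outcome ignore_page(struct toy_page *p)
{
    (void)p;
    return TOY_IGNORED;
}

/* An entry applies when (flags & mask) == match; the first hit wins. */
struct toy_state {
    unsigned mask;
    unsigned match;
    enum toy_outcome (*action)(struct toy_page *p);
};

static const struct toy_state toy_states[] = {
    { TOY_PG_SLAB,  TOY_PG_SLAB,  ignore_page },
    { TOY_PG_DIRTY, TOY_PG_DIRTY, drop_dirty  },
    { TOY_PG_DIRTY, 0,            drop_clean  },
};

/* Mark the page poisoned first, then dispatch on what kind of page it is. */
enum toy_outcome toy_handle_poisoned_page(struct toy_page *p)
{
    size_t i;

    p->flags |= TOY_PG_POISON;
    for (i = 0; i < sizeof(toy_states) / sizeof(toy_states[0]); i++) {
        if ((p->flags & toy_states[i].mask) == toy_states[i].match)
            return toy_states[i].action(p);
    }
    /* Not reached with the table above: the last two entries cover all. */
    return TOY_IGNORED;
}
```

Ordering the table from most specific to most general keeps the dispatch
loop trivial, at the cost of making entry order significant.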

The code that does this is portable and lives in mm/memory-failure.c.

To quote the overview comment:

High level machine check handler. Handles pages reported by the
hardware as being corrupted usually due to a 2bit ECC memory or cache
failure.

This focuses on pages detected as corrupted in the background.
When the current CPU tries to consume corruption the currently
running process can just be killed directly instead. This implies
that if the error cannot be handled for some reason it's safe to
just ignore it because no corruption has been consumed yet. Instead
when that happens another machine check will happen.

Handles page cache pages in various states. The tricky part
here is that we can access any page asynchronous to other VM
users, because memory failures could happen anytime and anywhere,
possibly violating some of their assumptions. This is why this code
has to be extremely careful. Generally it tries to use normal locking
rules, as in get the standard locks, even if that means the
error handling takes potentially a long time.

Some of the operations here are somewhat inefficient and have non
linear algorithmic complexity, because the data structures have not
been optimized for this case. This is in particular the case
for the mapping from a vma to a process. Since this case is expected
to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl, vm.memory_failure_early_kill.
The default is early kill.
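
To check which strategy is active from userspace, the sysctl can be read
through procfs. The following is a minimal sketch, assuming the usual
mapping of vm.* sysctls to /proc/sys/vm/ and assuming that a nonzero value
selects early kill, as the name suggests. The helper names
read_sysctl_int() and kill_policy_name() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Conventional procfs location of the vm.memory_failure_early_kill sysctl. */
#define EARLY_KILL_PATH "/proc/sys/vm/memory_failure_early_kill"

/*
 * Read a non-negative integer sysctl value from `path`.
 * Returns the value on success, or -1 if the file cannot be opened
 * (for example on a kernel without this handler) or cannot be parsed.
 */
int read_sysctl_int(const char *path)
{
    FILE *f;
    int value;
    int parsed;

    assert(path != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    parsed = fscanf(f, "%d", &value);
    fclose(f);

    if (parsed != 1 || value < 0)
        return -1;
    return value;
}

/* Describe the policy; ASSUMES nonzero means "kill as soon as detected". */
const char *kill_policy_name(int value)
{
    if (value < 0)
        return "unknown";
    return value ? "early kill" : "late kill (unmap, kill on access)";
}
```

A caller would pass EARLY_KILL_PATH to read_sysctl_int() and print the
result of kill_policy_name(). Taking the path as a parameter keeps the
parser testable against an ordinary file.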

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.

Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
Nick Piggin (who did a lot of great work) and others.

Cc: npiggin@suse.de
Cc: riel@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
2009-09-16 11:50:15 +02:00
..
9p 9p: remove unnecessary v9fses->options which duplicates the mount string 2009-08-17 16:42:28 -05:00
adfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
affs affs: add ->sync_fs 2009-06-11 21:36:14 -04:00
afs AFS: Stop readlink() on AFS crashing due to NULL 'file' ptr 2009-08-27 12:22:08 -07:00
autofs switch follow_down() 2009-06-11 21:36:01 -04:00
autofs4 autofs4 - fix missed case when changing to use struct path 2009-08-31 17:44:05 -10:00
befs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-06-17 08:46:57 -07:00
bfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
btrfs Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block 2009-09-14 17:55:15 -07:00
cachefiles enforce ->sync_fs is only called for rw superblock 2009-06-11 21:36:06 -04:00
cifs cifs: consolidate reconnect logic in smb_init routines 2009-09-03 15:30:48 +00:00
coda
configfs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
cramfs
debugfs debugfs: use specified mode to possibly mark files read/write only 2009-06-15 21:30:28 -07:00
devpts devpts: remove module-related code 2009-06-24 08:15:24 -04:00
dlm Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2009-08-12 17:44:53 -07:00
ecryptfs eCryptfs: parse_tag_3_packet check tag 3 packet encrypted key size 2009-07-28 14:26:06 -07:00
efs get rid of BKL in fs/efs 2009-06-17 00:36:36 -04:00
exofs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
exportfs
ext2 ext2: Update comment about generic_osync_inode 2009-09-14 17:08:16 +02:00
ext3 ext3: Remove syncing logic from ext3_file_write 2009-09-14 17:08:16 +02:00
ext4 ext4: Remove syncing logic from ext4_file_write 2009-09-14 17:08:16 +02:00
fat fat: Opencode sync_page_range_nolock() 2009-09-14 17:08:17 +02:00
freevxfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
fscache FS-Cache: Fixup renamed filenames in comments in internal.h 2009-05-27 10:20:13 -07:00
fuse writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
gfs2 Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block 2009-09-14 17:55:15 -07:00
hfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
hfsplus headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
hostfs hostfs: set maximum filesize in superblock for proper LFS support 2009-06-30 18:56:03 -07:00
hpfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
hppfs
hugetlbfs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
isofs isofs: fix Joliet regression 2009-07-10 19:18:59 -07:00
jbd jbd: fix race between write_metadata_buffer and get_write_access 2009-07-21 11:54:42 +02:00
jbd2 jbd2: fix race between write_metadata_buffer and get_write_access 2009-07-13 17:55:35 -04:00
jffs2 jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()' 2009-09-08 11:09:04 -07:00
jfs jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()' 2009-09-08 11:09:04 -07:00
lockd lockd: Replace nsm_display_address() with rpc_ntop() 2009-08-09 15:09:39 -04:00
minix Making fs/minix/minix.h double including safe 2009-06-22 11:34:42 -07:00
ncpfs NLS: update handling of Unicode 2009-06-15 21:44:43 -07:00
nfs Merge branch 'nfs-for-2.6.32' 2009-09-11 14:59:37 -04:00
nfs_common
nfsd Merge branch 'nfs-for-2.6.32' 2009-09-11 14:59:37 -04:00
nilfs2 fs/Kconfig: move nilfs2 outside misc filesystems 2009-09-14 18:27:16 +09:00
nls NLS: update handling of Unicode 2009-06-15 21:44:43 -07:00
notify inotify: update the group mask on mark addition 2009-08-28 12:51:14 -04:00
ntfs ntfs: Use new syncing helpers and update comments 2009-09-14 17:08:16 +02:00
ocfs2 ocfs2: Update syncing after splicing to match generic version 2009-09-14 17:08:16 +02:00
omfs switch omfs to simple_fsync() 2009-06-11 21:36:13 -04:00
openpromfs
partitions Seperate read and write statistics of in_flight requests 2009-09-14 08:24:52 +02:00
proc HWPOISON: The high level memory error handler in the VM v7 2009-09-16 11:50:15 +02:00
qnx4 fs/qnx4: sanitize includes 2009-06-11 21:36:12 -04:00
quota quota: Silence lockdep on quota_on 2009-07-30 17:31:23 +02:00
ramfs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
reiserfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
romfs
smbfs push BKL down into ->put_super 2009-06-11 21:36:07 -04:00
squashfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
sysfs Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block 2009-09-11 09:17:05 -07:00
sysv get rid of BKL in fs/sysv 2009-06-17 00:36:37 -04:00
ubifs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
udf udf: Fix possible corruption when close races with write 2009-09-14 19:13:01 +02:00
ufs ufs: sector_t cannot be negative 2009-06-18 13:03:46 -07:00
xfs xfs: Convert sync_page_range() to simple filemap_write_and_wait_range() 2009-09-14 17:08:17 +02:00
aio.c eventfd: revised interface and cleanups 2009-06-30 18:55:58 -07:00
anon_inodes.c fs: Provide empty .set_page_dirty() aop for anon inodes 2009-06-18 14:46:10 +02:00
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf.c binfmt_elf: fix PT_INTERP bss handling 2009-09-10 20:11:12 +10:00
binfmt_elf_fdpic.c elf_core_dump: use rcu_read_lock() to access ->real_parent 2009-06-18 13:03:52 -07:00
binfmt_em86.c
binfmt_flat.c flat: fix uninitialized ptr with shared libs 2009-08-07 10:39:57 -07:00
binfmt_misc.c
binfmt_script.c
binfmt_som.c
bio-integrity.c block: Create bip slabs with embedded integrity vectors 2009-07-01 10:56:25 +02:00
bio.c block: fix sg SG_DXFER_TO_FROM_DEV regression 2009-07-10 20:31:53 +02:00
block_dev.c vfs: Rename generic_file_aio_write_nolock 2009-09-14 17:08:15 +02:00
buffer.c writeback: switch to per-bdi threads for flushing data 2009-09-11 09:20:25 +02:00
char_dev.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6 2009-09-11 09:19:35 -07:00
compat.c exec: do not sleep in TASK_TRACED under ->cred_guard_mutex 2009-09-05 11:30:42 -07:00
compat_binfmt_elf.c
compat_ioctl.c compat_ioctl: hook up compat handler for FIEMAP ioctl 2009-08-07 10:39:56 -07:00
dcache.c sched: Pull up the might_sleep() check into cond_resched() 2009-07-18 15:51:44 +02:00
dcookies.c
direct-io.c block: Do away with the notion of hardsect_size 2009-05-22 23:22:54 +02:00
drop_caches.c mm: remove __invalidate_mapping_pages variant 2009-06-16 19:47:43 -07:00
eventfd.c eventfd: revised interface and cleanups 2009-06-30 18:55:58 -07:00
eventpoll.c epoll: fix nested calls support 2009-06-18 13:03:41 -07:00
exec.c exec: do not sleep in TASK_TRACED under ->cred_guard_mutex 2009-09-05 11:30:42 -07:00
fcntl.c headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
fifo.c
file.c
file_table.c fs: move mark_files_ro into file_table.c 2009-06-11 21:36:02 -04:00
filesystems.c
fs-writeback.c vfs: Remove generic_osync_inode() and sync_page_range{_nolock}() 2009-09-14 17:08:17 +02:00
fs_struct.c
generic_acl.c
inode.c vfs: add __destroy_inode 2009-08-07 14:38:29 -03:00
internal.h Trim a bit of crap from fs.h 2009-06-11 21:36:07 -04:00
ioctl.c fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls 2009-06-24 08:15:27 -04:00
ioprio.c
Kconfig fs/Kconfig: move nilfs2 outside misc filesystems 2009-09-14 18:27:16 +09:00
Kconfig.binfmt
libfs.c vfs: make get_sb_pseudo set s_maxbytes to value that can be cast to signed 2009-08-18 16:31:12 -07:00
locks.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-11 13:23:18 -07:00
Makefile
mbcache.c
mpage.c
namei.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 2009-09-11 08:55:49 -07:00
namespace.c vfs: mnt_want_write_file(): fix special file handling 2009-08-07 10:39:56 -07:00
nfsctl.c
no-block.c
open.c CRED: Add some configurable debugging [try #6] 2009-09-02 21:29:01 +10:00
pipe.c lockdep: Fix lockdep annotation for pipe_double_lock() 2009-07-22 21:14:14 +02:00
pnode.c
pnode.h
posix_acl.c
read_write.c
read_write.h
readdir.c
select.c poll/select: initialize triggered field of struct poll_wqueues 2009-08-15 18:40:11 -07:00
seq_file.c seq_file: add function to write binary data 2009-06-18 13:03:57 -07:00
signalfd.c
splice.c Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block 2009-09-14 17:55:15 -07:00
stack.c
stat.c
super.c writeback: switch to per-bdi threads for flushing data 2009-09-11 09:20:25 +02:00
sync.c fsync: wait for data writeout completion before calling ->fsync 2009-09-14 17:08:17 +02:00
timerfd.c
utimes.c
xattr.c VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx. 2009-09-10 10:11:22 +10:00
xattr_acl.c