1
0
Fork 0
Commit Graph

44061 Commits (c42843f2f0bbc9d716a32caf667d18fc2bf3bc4c)

Author SHA1 Message Date
Wu Fengguang c42843f2f0 writeback: introduce smoothed global dirty limit
The start of a heavy weight application (ie. KVM) may instantly knock
down determine_dirtyable_memory() if the swap is not enabled or full.
global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
dirty thresholds that are _much_ lower than the global/bdi dirty pages.

balance_dirty_pages() will then heavily throttle all dirtiers including
the light ones, until the dirty pages drop below the new dirty thresholds.
During this _deep_ dirty-exceeded state, the system may appear rather
unresponsive to the users.

About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
threshold to heavy dirtiers than light ones, and the dirty pages will
be throttled around the heavy dirtiers' dirty threshold and reasonably
below the light dirtiers' dirty threshold. In this state, only the heavy
dirtiers will be throttled and the dirty pages are carefully controlled
to not exceed the light dirtiers' dirty threshold. However if the
threshold itself suddenly drops below the number of dirty pages, the
light dirtiers will get heavily throttled.

So introduce global_dirty_limit for tracking the global dirty threshold
with policies

- follow downwards slowly
- follow up in one shot

global_dirty_limit can effectively mask out the impact of sudden drop of
dirtyable memory. It will be used in the next patch for two new type of
dirty limits. Note that the new dirty limits are not going to avoid
throttling the light dirtiers, but could limit their sleep time to 200ms.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09 22:09:02 -07:00
Wu Fengguang e98be2d599 writeback: bdi write bandwidth estimation
The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.

It tries to update the bandwidth only when disk is fully utilized.
Any inactive period of more than one second will be skipped.

The estimated bandwidth will be reflecting how fast the device can
writeout when _fully utilized_, and won't drop to 0 when it goes idle.
The value will remain constant at disk idle time. At busy write time, if
not considering fluctuations, it will also remain high unless be knocked
down by possible concurrent reads that compete for the disk time and
bandwidth with async writes.

The estimation is not done purely in the flusher because there is no
guarantee for write_cache_pages() to return timely to update bandwidth.

The bdi->avg_write_bandwidth smoothing is very effective for filtering
out sudden spikes, however may be a little biased in long term.

The overheads are low because the bdi bandwidth update only occurs at
200ms intervals.

The 200ms update interval is suitable, because it's not possible to get
the real bandwidth for the instance at all, due to large fluctuations.

The NFS commits can be as large as seconds worth of data. One XFS
completion may be as large as half second worth of data if we are going
to increase the write chunk to half second worth of data. In ext4,
fluctuations with time period of around 5 seconds is observed. And there
is another pattern of irregular periods of up to 20 seconds on SSD tests.

That's why we are not only doing the estimation at 200ms intervals, but
also averaging them over a period of 3 seconds and then go further to do
another level of smoothing in avg_write_bandwidth.

CC: Li Shaohua <shaohua.li@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09 22:09:01 -07:00
Jan Kara f7d2b1ecd0 writeback: account per-bdi accumulated written pages
Introduce the BDI_WRITTEN counter. It will be used for estimating the
bdi's write bandwidth.

Peter Zijlstra <a.p.zijlstra@chello.nl>:
Move BDI_WRITTEN accounting into __bdi_writeout_inc().
This will cover and fix fuse, which only calls bdi_writeout_inc().

CC: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09 22:09:01 -07:00
Wu Fengguang d46db3d582 writeback: make writeback_control.nr_to_write straight
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09 22:09:01 -07:00
Wu Fengguang e84d0a4f8e writeback: trace event writeback_queue_io
Note that it adds a little overheads to account the moved/enqueued
inodes from b_dirty to b_io. The "moved" accounting may be later used to
limit the number of inodes that can be moved in one shot, in order to
keep spinlock hold time under control.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang 251d6a471c writeback: trace event writeback_single_inode
It is valuable to know how the dirty inodes are iterated and their IO size.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written

v2: add trace event writeback_single_inode_requeue as proposed by Dave.

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang 846d5a091b writeback: remove .nonblocking and .encountered_congestion
Remove two unused struct writeback_control fields:

	.encountered_congestion	(completely unused)
	.nonblocking		(never set, checked/showed in XFS,NFS/btrfs)

The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.

Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang b7a2441f99 writeback: remove writeback_control.more_io
When wbc.more_io was first introduced, it indicates whether there are
at least one superblock whose s_more_io contains more IO work. Now with
the per-bdi writeback, it can be replaced with a simple b_more_io test.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:23 +08:00
Wu Fengguang e185dda89d writeback: avoid extra sync work at enqueue time
This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:22 +08:00
Christoph Hellwig f758eeabeb writeback: split inode_wb_list_lock into bdi_writeback.list_lock
Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads.  It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.

It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
                      inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
                &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                ------------------------
                &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:21 +08:00
Wu Fengguang cb9bd1159c writeback: introduce writeback_control.inodes_written
The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happen to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but also mis-interpreted as "no progress".

So introduce writeback_control.inodes_written to count the inodes get
cleaned from VFS POV.  A non-zero value means there are some progress on
writeback, in which case more writeback can be tried.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:20 +08:00
Wu Fengguang 6e6938b6d3 writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage
sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.

Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.

Although ext4 is tested to no longer livelock with commit f446daaea9,
it may due to some "redirty_tail() after pages_skipped" effect which
is by no means a guarantee for _all_ the file systems.

Note that writeback_inodes_sb() is called by not only sync(), they are
treated the same because the other callers also need livelock prevention.

Impact:  It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until finished with the current inode.

Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08 08:25:20 +08:00
Linus Torvalds 0e833d8cfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (40 commits)
  tg3: Fix tg3_skb_error_unmap()
  net: tracepoint of net_dev_xmit sees freed skb and causes panic
  drivers/net/can/flexcan.c: add missing clk_put
  net: dm9000: Get the chip in a known good state before enabling interrupts
  drivers/net/davinci_emac.c: add missing clk_put
  af-packet: Add flag to distinguish VID 0 from no-vlan.
  caif: Fix race when conditionally taking rtnl lock
  usbnet/cdc_ncm: add missing .reset_resume hook
  vlan: fix typo in vlan_dev_hard_start_xmit()
  net/ipv4: Check for mistakenly passed in non-IPv4 address
  iwl4965: correctly validate temperature value
  bluetooth l2cap: fix locking in l2cap_global_chan_by_psm
  ath9k: fix two more bugs in tx power
  cfg80211: don't drop p2p probe responses
  Revert "net: fix section mismatches"
  drivers/net/usb/catc.c: Fix potential deadlock in catc_ctrl_run()
  sctp: stop pending timers and purge queues when peer restart asoc
  drivers/net: ks8842 Fix crash on received packet when in PIO mode.
  ip_options_compile: properly handle unaligned pointer
  iwlagn: fix incorrect PCI subsystem id for 6150 devices
  ...
2011-06-04 23:16:00 +09:00
Linus Torvalds 4f1ba49efa Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block:
  block: Use hlist_entry() for io_context.cic_list.first
  cfq-iosched: Remove bogus check in queue_fail path
  xen/blkback: potential null dereference in error handling
  xen/blkback: don't call vbd_size() if bd_disk is NULL
  block: blkdev_get() should access ->bd_disk only after success
  CFQ: Fix typo and remove unnecessary semicolon
  block: remove unwanted semicolons
  Revert "block: Remove extra discard_alignment from hd_struct."
  nbd: adjust 'max_part' according to part_shift
  nbd: limit module parameters to a sane value
  nbd: pass MSG_* flags to kernel_recvmsg()
  block: improve the bio_add_page() and bio_add_pc_page() descriptions
2011-06-04 08:11:26 +09:00
Linus Torvalds cb37bbd90a Merge branch 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile
* 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
  asm-generic/unistd.h: support sendmmsg syscall
  tile: enable CONFIG_BUGVERBOSE
2011-06-04 08:03:16 +09:00
Linus Torvalds 55db4c64ed Revert "tty: make receive_buf() return the amout of bytes received"
This reverts commit b1c43f82c5.

It was broken in so many ways, and results in random odd pty issues.

It re-introduced the buggy schedule_work() in flush_to_ldisc() that can
cause endless work-loops (see commit a5660b41af6a: "tty: fix endless
work loop when the buffer fills up").

It also used an "unsigned int" return value fo the ->receive_buf()
function, but then made multiple functions return a negative error code,
and didn't actually check for the error in the caller.

And it didn't actually work at all.  BenH bisected down odd tty behavior
to it:
  "It looks like the patch is causing some major malfunctions of the X
   server for me, possibly related to PTYs.  For example, cat'ing a
   large file in a gnome terminal hangs the kernel for -minutes- in a
   loop of what looks like flush_to_ldisc/workqueue code, (some ftrace
   data in the quoted bits further down).

   ...

   Some more data: It -looks- like what happens is that the
   flush_to_ldisc work queue entry constantly re-queues itself (because
   the PTY is full ?) and the workqueue thread will basically loop
   forver calling it without ever scheduling, thus starving the consumer
   process that could have emptied the PTY."

which is pretty much exactly the problem we fixed in a5660b41af.

Milton Miller pointed out the 'unsigned int' issue.

Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: Milton Miller <miltonm@bga.com>
Cc: Stefan Bigler <stefan.bigler@keymile.com>
Cc: Toby Gray <toby.gray@realvnc.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-04 06:33:24 +09:00
John W. Linville 7b29dc21ea Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 into for-davem 2011-06-03 14:31:50 -04:00
Koki Sanagi ec764bf083 net: tracepoint of net_dev_xmit sees freed skb and causes panic
Because there is a possibility that skb is kfree_skb()ed and zero cleared
after ndo_start_xmit, we should not see the contents of skb like skb->len and
skb->dev->name after ndo_start_xmit. But trace_net_dev_xmit does that
and causes panic by NULL pointer dereference.
This patch fixes trace_net_dev_xmit not to see the contents of skb directly.

If you want to reproduce this panic,

1. Get tracepoint of net_dev_xmit on
2. Create 2 guests on KVM
2. Make 2 guests use virtio_net
4. Execute netperf from one to another for a long time as a network burden
5. host will panic(It takes about 30 minutes)

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-02 14:06:31 -07:00
Chris Metcalf b36a968927 asm-generic/unistd.h: support sendmmsg syscall
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
2011-06-02 14:31:38 -04:00
Ben Greear a3bcc23e89 af-packet: Add flag to distinguish VID 0 from no-vlan.
Currently, user-space cannot determine if a 0 tcp_vlan_tci
means there is no VLAN tag or the VLAN ID was zero.

Add flag to make this explicit.  User-space can check for
TP_STATUS_VLAN_VALID || tp_vlan_tci > 0, which will be backwards
compatible. Older could would have just checked for tp_vlan_tci,
so it will work no worse than before.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-01 21:18:03 -07:00
Linus Torvalds f0f52a9463 Merge git://git.infradead.org/iommu-2.6
* git://git.infradead.org/iommu-2.6:
  intel-iommu: Fix off-by-one in RMRR setup
  intel-iommu: Add domain check in domain_remove_one_dev_info
  intel-iommu: Remove Host Bridge devices from identity mapping
  intel-iommu: Use coherent DMA mask when requested
  intel-iommu: Dont cache iova above 32bit
  intel-iommu: Speed up processing of the identity_mapping function
  intel-iommu: Check for identity mapping candidate using system dma mask
  intel-iommu: Only unlink device domains from iommu
  intel-iommu: Enable super page (2MiB, 1GiB, etc.) support
  intel-iommu: Flush unmaps at domain_exit
  intel-iommu: Remove obsolete comment from detect_intel_iommu
  intel-iommu: fix VT-d PMR disable for TXT on S3 resume
2011-06-02 05:48:50 +09:00
Eliad Peller 333ba73252 cfg80211: don't drop p2p probe responses
Commit 0a35d36 ("cfg80211: Use capability info to detect mesh beacons")
assumed that probe response with both ESS and IBSS bits cleared
means that the frame was sent by a mesh sta.

However, these capabilities are also being used in the p2p_find phase,
and the mesh-validation broke it.

Rename the WLAN_CAPABILITY_IS_MBSS macro, and verify that mesh ies
exist before assuming this frame was sent by a mesh sta.

Signed-off-by: Eliad Peller <eliad@wizery.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-06-01 14:34:01 -04:00
Linus Torvalds 3f303103b8 Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6:
  mtd: fix physmap.h warnings
2011-06-01 21:47:39 +09:00
Youquan Song 6dd9a7c737 intel-iommu: Enable super page (2MiB, 1GiB, etc.) support
There are no externally-visible changes with this. In the loop in the
internal __domain_mapping() function, we simply detect if we are mapping:
  - size >= 2MiB, and
  - virtual address aligned to 2MiB, and
  - physical address aligned to 2MiB, and
  - on hardware that supports superpages.

(and likewise for larger superpages).

We automatically use a superpage for such mappings. We never have to
worry about *breaking* superpages, since we trust that we will always
*unmap* the same range that was mapped. So all we need to do is ensure
that dma_pte_clear_range() will also cope with superpages.

Adjust pfn_to_dma_pte() to take a superpage 'level' as an argument, so
it can return a PTE at the appropriate level rather than always
extending the page tables all the way down to level 1. Again, this is
simplified by the fact that we should never encounter existing small
pages when we're creating a mapping; any old mapping that used the same
virtual range will have been entirely removed and its obsolete page
tables freed.

Provide an 'intel_iommu=sp_off' argument on the command line as a
chicken bit. Not that it should ever be required.

==

The original commit seen in the iommu-2.6.git was Youquan's
implementation (and completion) of my own half-baked code which I'd
typed into an email. Followed by half a dozen subsequent 'fixes'.

I've taken the unusual step of rewriting history and collapsing the
original commits in order to keep the main history simpler, and make
life easier for the people who are going to have to backport this to
older kernels. And also so I can give it a more coherent commit comment
which (hopefully) gives a better explanation of what's going on.

The original sequence of commits leading to identical code was:

Youquan Song (3):
      intel-iommu: super page support
      intel-iommu: Fix superpage alignment calculation error
      intel-iommu: Fix superpage level calculation error in dma_pfn_level_pte()

David Woodhouse (4):
      intel-iommu: Precalculate superpage support for dmar_domain
      intel-iommu: Fix hardware_largepage_caps()
      intel-iommu: Fix inappropriate use of superpages in __domain_mapping()
      intel-iommu: Fix phys_pfn in __domain_mapping for sglist pages

Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2011-06-01 12:26:35 +01:00
Randy Dunlap 63da029015 mtd: fix physmap.h warnings
Fix build warnings in physmap.h:

include/linux/mtd/physmap.h:25: warning: 'struct platform_device' declared inside parameter list
include/linux/mtd/physmap.h:25: warning: its scope is only this definition or declaration, which is probably not what you want
include/linux/mtd/physmap.h:26: warning: 'struct platform_device' declared inside parameter list
include/linux/mtd/physmap.h:27: warning: 'struct platform_device' declared inside parameter list

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2011-06-01 11:36:49 +01:00
Wei Yongjun a000c01e60 sctp: stop pending timers and purge queues when peer restart asoc
If the peer restart the asoc, we should not only fail any unsent/unacked
data, but also stop the T3-rtx, SACK, T4-rto timers, and teardown ASCONF
queues.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-31 15:29:17 -07:00
Namhyung Kim ea9d6553b3 block: remove unwanted semicolons
Since those defined functions require additional semicolon
from the caller, they could cause potential syntax errors
when used in if-else statements.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-31 13:45:53 +02:00
Jens Axboe a1706ac4c0 Revert "block: Remove extra discard_alignment from hd_struct."
It was not a good idea to start dereferencing disk->queue from
the fs sysfs strategy for displaying discard alignment. We ran
into first a NULL pointer deref, and after fixing that we sometimes
see unvalid disk->queue pointer values.

Since discard is the only one of the bunch actually looking into
the queue, just revert the change.

This reverts commit 23ceb5b771.

Conflicts:
	fs/partitions/check.c
2011-05-30 07:42:51 +02:00
Michael S. Tsirkin 7ab358c23c virtio: add api for delayed callbacks
Add an API that tells the other side that callbacks
should be delayed until a lot of work has been done.
Implement using the new event_idx feature.

Note: it might seem advantageous to let the drivers
ask for a callback after a specific capacity has
been reached. However, as a single head can
free many entries in the descriptor table,
we don't really have a clue about capacity
until get_buf is called. The API is the simplest
to implement at the moment, we'll see what kind of
hints drivers can pass when there's more than one
user of the feature.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-05-30 11:14:16 +09:30
Michael S. Tsirkin bf7035bf20 virtio ring: inline function to check for events
With the new used_event and avail_event and features, both
host and guest need similar logic to check whether events are
enabled, so it helps to put the common code in the header.

Note that Xen has similar logic for notification hold-off
in include/xen/interface/io/ring.h with req_event and req_prod
corresponding to event_idx + 1 and new_idx respectively.
+1 comes from the fact that req_event and req_prod in Xen start at 1,
while event index in virtio starts at 0.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-05-30 11:14:14 +09:30
Michael S. Tsirkin 770b31a85e virtio: event index interface
Define a new feature bit for the guest and host to utilize
an event index (like Xen) instead if a flag bit to enable/disable
interrupts and kicks.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-05-30 11:14:14 +09:30
Rusty Russell a1b383870a virtio: add full three-clause BSD text to headers.
It's unclear to me if it's important, but it's obviously causing my
technical colleages some headaches and I'd hate such imprecision to
slow virtio adoption.

I've emailed this to all non-trivial contributors for approval, too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Ryan Harper <ryanh@us.ibm.com>
Acked-by: Anthony Liguori <aliguori@us.ibm.com>
Acked-by: Eric Van Hensbergen <ericvh@gmail.com>
Acked-by: john cooper <john.cooper@redhat.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
2011-05-30 11:14:14 +09:30
Linus Torvalds cd1acdf172 Merge branch 'pnfs-submit' of git://git.open-osd.org/linux-open-osd
* 'pnfs-submit' of git://git.open-osd.org/linux-open-osd: (32 commits)
  pnfs-obj: pg_test check for max_io_size
  NFSv4.1: define nfs_generic_pg_test
  NFSv4.1: use pnfs_generic_pg_test directly by layout driver
  NFSv4.1: change pg_test return type to bool
  NFSv4.1: unify pnfs_pageio_init functions
  pnfs-obj: objlayout_encode_layoutcommit implementation
  pnfs: encode_layoutcommit
  pnfs-obj: report errors and .encode_layoutreturn Implementation.
  pnfs: encode_layoutreturn
  pnfs: layoutret_on_setattr
  pnfs: layoutreturn
  pnfs-obj: osd raid engine read/write implementation
  pnfs: support for non-rpc layout drivers
  pnfs-obj: define per-inode private structure
  pnfs: alloc and free layout_hdr layoutdriver methods
  pnfs-obj: objio_osd device information retrieval and caching
  pnfs-obj: decode layout, alloc/free lseg
  pnfs-obj: pnfs_osd XDR client implementation
  pnfs-obj: pnfs_osd XDR definitions
  pnfs-obj: objlayoutdriver module skeleton
  ...
2011-05-29 14:10:13 -07:00
Linus Torvalds 6345d24daf mm: Fix boot crash in mm_alloc()
Thomas Gleixner reports that we now have a boot crash triggered by
CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at   (null)
    IP: [<c11ae035>] find_next_bit+0x55/0xb0
    Call Trace:
     [<c11addda>] cpumask_any_but+0x2a/0x70
     [<c102396b>] flush_tlb_mm+0x2b/0x80
     [<c1022705>] pud_populate+0x35/0x50
     [<c10227ba>] pgd_alloc+0x9a/0xf0
     [<c103a3fc>] mm_init+0xec/0x120
     [<c103a7a3>] mm_alloc+0x53/0xd0

which was introduced by commit de03c72cfc ("mm: convert
mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
mm_init() vs mm_init_cpumask

Thomas wrote a patch to just fix the ordering of initialization, but I
hate the new double allocation in the fork path, so I ended up instead
doing some more radical surgery to clean it all up.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Ingo Molnar <mingo@elte.hu>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-29 11:32:28 -07:00
Linus Torvalds cab0d85c8d Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
  [S390] mm: fix mmu_gather rework
  [S390] mm: fix storage key handling
2011-05-29 11:30:20 -07:00
Linus Torvalds a74d70b63f Merge branch 'for-2.6.40' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.40' of git://linux-nfs.org/~bfields/linux: (22 commits)
  nfsd: make local functions static
  NFSD: Remove unused variable from nfsd4_decode_bind_conn_to_session()
  NFSD: Check status from nfsd4_map_bcts_dir()
  NFSD: Remove setting unused variable in nfsd_vfs_read()
  nfsd41: error out on repeated RECLAIM_COMPLETE
  nfsd41: compare request's opcnt with session's maxops at nfsd4_sequence
  nfsd v4.1 lOCKT clientid field must be ignored
  nfsd41: add flag checking for create_session
  nfsd41: make sure nfs server process OPEN with EXCLUSIVE4_1 correctly
  nfsd4: fix wrongsec handling for PUTFH + op cases
  nfsd4: make fh_verify responsibility of nfsd_lookup_dentry caller
  nfsd4: introduce OPDESC helper
  nfsd4: allow fh_verify caller to skip pseudoflavor checks
  nfsd: distinguish functions of NFSD_MAY_* flags
  svcrpc: complete svsk processing on cb receive failure
  svcrpc: take advantage of tcp autotuning
  SUNRPC: Don't wait for full record to receive tcp data
  svcrpc: copy cb reply instead of pages
  svcrpc: close connection if client sends short packet
  svcrpc: note network-order types in svc_process_calldir
  ...
2011-05-29 11:21:12 -07:00
Linus Torvalds b11b06d90a Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
  dm kcopyd: return client directly and not through a pointer
  dm kcopyd: reserve fewer pages
  dm io: use fixed initial mempool size
  dm kcopyd: alloc pages from the main page allocator
  dm kcopyd: add gfp parm to alloc_pl
  dm kcopyd: remove superfluous page allocation spinlock
  dm kcopyd: preallocate sub jobs to avoid deadlock
  dm kcopyd: avoid pointless job splitting
  dm mpath: do not fail paths after integrity errors
  dm table: reject devices without request fns
  dm table: allow targets to support discards internally
2011-05-29 11:20:48 -07:00
Linus Torvalds f1d1c9fa8f Merge branch 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  SUNRPC: Support for RPC over AF_LOCAL transports
  SUNRPC: Remove obsolete comment
  SUNRPC: Use AF_LOCAL for rpcbind upcalls
  SUNRPC: Clean up use of curly braces in switch cases
  NFS: Revert NFSROOT default mount options
  SUNRPC: Rename xs_encode_tcp_fragment_header()
  nfs,rcu: convert call_rcu(nfs_free_delegation_callback) to kfree_rcu()
  nfs41: Correct offset for LAYOUTCOMMIT
  NFS: nfs_update_inode: print current and new inode size in debug output
  NFSv4.1: Fix the handling of NFS4ERR_SEQ_MISORDERED errors
  NFSv4: Handle expired stateids when the lease is still valid
  SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...
2011-05-29 11:20:02 -07:00
Linus Torvalds daa94222b6 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
  ACPI EC: remove redundant code
  ACPI: Add D3 cold state
  ACPI: processor: fix processor_physically_present in UP kernel
  ACPI: Split out custom_method functionality into an own driver
  ACPI: Cleanup custom_method debug stuff
  ACPI EC: enable MSI workaround for Quanta laptops
  ACPICA: Update to version 20110413
  ACPICA: Execute an orphan _REG method under the EC device
  ACPICA: Move ACPI_NUM_PREDEFINED_REGIONS to a more appropriate place
  ACPICA: Update internal address SpaceID for DataTable regions
  ACPICA: Add more methods eligible for NULL package element removal
  ACPICA: Split all internal Global Lock functions to new file - evglock
  ACPI: EC: add another DMI check for ASUS hardware
  ACPI EC: remove dead code
  ACPICA: Fix code divergence of global lock handling
  ACPICA: Use acpi_os_create_lock interface
  ACPI: osl, add acpi_os_create_lock interface
  ACPI:Fix goto flows in thermal-sys
2011-05-29 11:19:16 -07:00
Linus Torvalds f310642123 Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
  x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param
  x86 idle: deprecate "no-hlt" cmdline param
  x86 idle APM: deprecate CONFIG_APM_CPU_IDLE
  x86 idle floppy: deprecate disable_hlt()
  x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it
  x86 idle: clarify AMD erratum 400 workaround
  idle governor: Avoid lock acquisition to read pm_qos before entering idle
  cpuidle: menu: fixed wrapping timers at 4.294 seconds
2011-05-29 11:18:09 -07:00
Benny Halevy 18ad0a9f2c NFSv4.1: change pg_test return type to bool
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2011-05-29 20:56:54 +03:00
Benny Halevy cbe8260369 pnfs: layoutreturn
NFSv4.1 LAYOUTRETURN implementation

Currently, does not support layout-type payload encoding.

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
[call pnfs_return_layout right before pnfs_destroy_layout]
[remove assert_spin_locked from pnfs_clear_lseg_list]
[remove wait parameter from the layoutreturn path.]
[remove return_type field from nfs4_layoutreturn_args]
[remove range from nfs4_layoutreturn_args]
[no need to send layoutcommit from _pnfs_return_layout]
[don't wait on sync layoutreturn]
[fix layout stateid in layoutreturn args]
[fixed NULL deref in _pnfs_return_layout]
[removed recaim member of nfs4_layoutreturn_args]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2011-05-29 20:54:36 +03:00
Benny Halevy d20581aa4b pnfs: support for non-rpc layout drivers
Non-rpc layout driver such as for objects and blocks
implement their own I/O path and error handling logic.
Therefore bypass NFS-based error handling for these layout drivers.

[fix lseg ref-count bugs, and null de-refs]
[Fall out from: non-rpc layout drivers]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[get rid of PNFS_USE_RPC_CODE]
[get rid of __nfs4_write_done_cb]
[revert useless change in nfs4_write_done_cb]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2011-05-29 20:53:51 +03:00
Benny Halevy 38b7c401f6 pnfs-obj: pnfs_osd XDR definitions
* Add the pnfs_osd_xdr.h header

* defintions the pnfs_osd_layout structure including all it's
  sub-types and constants.
* Declare the pnfs_osd_xdr_decode_layout API + all needed
  inline helpers.

* Define the pnfs_osd_deviceaddr structure and all its subtypes and
  constants.
* Declare API for decoding of a pnfs_osd_deviceaddr from XDR stream.

* Define the pnfs_osd_ioerr structure, its substructures and constants.
* Declare API for encoding of a pnfs_osd_ioerr into XDR stream.

* Define the pnfs_osd_layoutupdate structure and its substructures.
* Declare API for encoding of a pnfs_osd_layoutupdate into XDR stream.

[Remove server definitions]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2011-05-29 20:52:36 +03:00
Benny Halevy f7da7a129d SUNRPC: introduce xdr_init_decode_pages
Initialize xdr_stream and xdr_buf using an array of page pointers
and length of buffer.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2011-05-29 20:52:32 +03:00
Mikulas Patocka fa34ce7307 dm kcopyd: return client directly and not through a pointer
Return client directly from dm_kcopyd_client_create, not through a
parameter, making it consistent with dm_io_client_create.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-05-29 13:03:13 +01:00
Mikulas Patocka 5f43ba2950 dm kcopyd: reserve fewer pages
Reserve just the minimum of pages needed to process one job.

Because we allocate pages from page allocator, we don't need to reserve
a large number of pages.  The maximum job size is SUB_JOB_SIZE and we
calculate the number of reserved pages based on this.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-05-29 13:03:11 +01:00
Mikulas Patocka bda8efec5c dm io: use fixed initial mempool size
Replace the arbitrary calculation of an initial io struct mempool size
with a constant.

The code calculated the number of reserved structures based on the request
size and used a "magic" multiplication constant of 4.  This patch changes
it to reserve a fixed number - itself still chosen quite arbitrarily.
Further testing might show if there is a better number to choose.

Note that if there is no memory pressure, we can still allocate an
arbitrary number of "struct io" structures.  One structure is enough to
process the whole request.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-05-29 13:03:09 +01:00
Mike Snitzer 4c25932701 dm table: allow targets to support discards internally
Permit a target to support discards regardless of whether or not all its
underlying devices do.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-05-29 12:52:55 +01:00
Heiko Carstens a43a9d93d4 [S390] mm: fix storage key handling
page_get_storage_key() and page_set_storage_key() expect a page address
and not its page frame number. This got inconsistent with 2d42552d
"[S390] merge page_test_dirty and page_clear_dirty".

Result is that we read/write storage keys from random pages and do not
have a working dirty bit tracking at all.
E.g. SetPageUpdate() doesn't clear the dirty bit of requested pages, which
for example ext4 doesn't like very much and panics after a while.

Unable to handle kernel paging request at virtual user address (null)
Oops: 0004 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 1 Not tainted 2.6.39-07551-g139f37f-dirty #152
Process flush-94:0 (pid: 1576, task: 000000003eb34538, ksp: 000000003c287b70)
Krnl PSW : 0704c00180000000 0000000000316b12 (jbd2_journal_file_inode+0x10e/0x138)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0700000000000000
           0000000000316a62 000000003eb34cd0 0000000000000025 000000003c287b88
           0000000000000001 000000003c287a70 000000003f1ec678 000000003f1ec000
           0000000000000000 000000003e66ec00 0000000000316a62 000000003c287988
Krnl Code: 0000000000316b04: f0a0000407f4       srp     4(11,%r0),2036,0
           0000000000316b0a: b9020022           ltgr    %r2,%r2
           0000000000316b0e: a7740015           brc     7,316b38
          >0000000000316b12: e3d0c0000024       stg     %r13,0(%r12)
           0000000000316b18: 4120c010           la      %r2,16(%r12)
           0000000000316b1c: 4130d060           la      %r3,96(%r13)
           0000000000316b20: e340d0600004       lg      %r4,96(%r13)
           0000000000316b26: c0e50002b567       brasl   %r14,36d5f4
Call Trace:
([<0000000000316a62>] jbd2_journal_file_inode+0x5e/0x138)
 [<00000000002da13c>] mpage_da_map_and_submit+0x2e8/0x42c
 [<00000000002daac2>] ext4_da_writepages+0x2da/0x504
 [<00000000002597e8>] writeback_single_inode+0xf8/0x268
 [<0000000000259f06>] writeback_sb_inodes+0xd2/0x18c
 [<000000000025a700>] writeback_inodes_wb+0x80/0x168
 [<000000000025aa92>] wb_writeback+0x2aa/0x324
 [<000000000025abde>] wb_do_writeback+0xd2/0x274
 [<000000000025ae3a>] bdi_writeback_thread+0xba/0x1c4
 [<00000000001737be>] kthread+0xa6/0xb0
 [<000000000056c1da>] kernel_thread_starter+0x6/0xc
 [<000000000056c1d4>] kernel_thread_starter+0x0/0xc
INFO: lockdep is turned off.
Last Breaking-Event-Address:
 [<0000000000316a8a>] jbd2_journal_file_inode+0x86/0x138

Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2011-05-29 12:40:51 +02:00