alistair23-linux

redonkable

Author	SHA1	Message	Date
Vivek Goyal	02977e4af7	blkio: Add root group to td->tg_list o Currently all the dynamically allocated groups, except root grp is added to td->tg_list. This was not a problem so far but in next patch I will travel through td->tg_list to process any updates of limits on the group. If root group is not in tg_list, then root group's updates are not processed. o It is better to root group also to tg_list instead of doing special processing for it during limit updates. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-10-01 14:49:48 +02:00
Vivek Goyal	61014e96e6	blkio: deletion of a cgroup was causing oops o Now a cgroup list of blkg elements can contain blkg from multiple policies. Before sending an unlink event, make sure blkg belongs to the policy. If policy does not own the blkg, do not send update for this blkg. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-10-01 14:49:44 +02:00
Vivek Goyal	13f98250f5	blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n Currently throttling related files were visible even if user had disabled throttling using config options. It was switching off background throttling of bio but not the cgroup files. This patch fixes it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-10-01 14:49:41 +02:00
Malahal Naineni	efb012b361	block: set the bounce_pfn to the actual DMA limit rather than to max memory The bounce_pfn of the request queue in 64 bit systems is set to the current max_low_pfn. Adding more memory later makes this incorrect. Memory allocated beyond this boot time max_low_pfn appear to require bounce buffers (bounce buffers are actually not allocated but used in calculating segments that may result in "over max segments limit" errors). Signed-off-by: Malahal Naineni <malahal@us.ibm.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-10-01 14:45:27 +02:00
Jens Axboe	260a67a9e5	block: revert bad fix for memory hotplug causing bounces Revert "block: set the bounce_pfn to the actual DMA limit rather than to max memory" This reverts commit `c49825facf`. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-10-01 14:42:43 +02:00
Malahal Naineni	c49825facf	block: set the bounce_pfn to the actual DMA limit rather than to max memory The bounce_pfn of the request queue in 64 bit systems is set to the current max_low_pfn. Adding more memory later makes this incorrect. Memory allocated beyond this boot time max_low_pfn appear to require bounce buffers (bounce buffers are actually not allocated but used in calculating segments that may result in "over max segments limit" errors). Signed-off-by: Malahal Naineni <malahal@us.ibm.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-24 20:27:16 +02:00
Mark Lord	4b1977698c	block: Prevent hang_check firing during long I/O During long I/O operations, the hang_check timer may fire, trigger stack dumps that unnecessarily alarm the user. Eg. hdparm --security-erase NULL /dev/sdb ## can take hours to complete So, if hang_check is armed, we should wake up periodically to prevent it from triggering. This patch uses a wake-up interval equal to half the hang_check timer period, which keeps overhead low enough. Signed-off-by: Mark Lord <mlord@pobox.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-24 15:52:09 +02:00
Corrado Zoccolo	749ef9f842	cfq: improve fsync performance for small files Fsync performance for small files achieved by cfq on high-end disks is lower than what deadline can achieve, due to idling introduced between the sync write happening in process context and the journal commit. Moreover, when competing with a sequential reader, a process writing small files and fsync-ing them is starved. This patch fixes the two problems by: - marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE flag set, - force all queues that have REQ_NOIDLE requests to be put in the noidle tree. Having the queue associated to the fsync-ing process and the one associated to journal commits in the noidle tree allows: - switching between them without idling, - fairness vs. competing idling queues, since they will be serviced only after the noidle tree expires its slice. Acked-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-20 15:24:50 +02:00
Jan Kara	01ea50638b	block: Fix race during disk initialization When a new disk is being discovered, add_disk() first ties the bdev to gendisk (via register_disk()->blkdev_get()) and only after that calls bdi_register_bdev(). Because register_disk() also creates disk's kobject, it can happen that userspace manages to open and modify the device's data (or inode) before its BDI is properly initialized leading to a warning in __mark_inode_dirty(). Fix the problem by registering BDI early enough. This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=16312 Cc: stable@kernel.org Reported-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-16 20:36:36 +02:00
Vivek Goyal	8e89d13f4e	blkio: Implementation of IOPS limit logic o core logic of implementing IOPS throttling. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-16 08:44:00 +02:00
Vivek Goyal	7702e8f45b	blk-cgroup: cgroup changes for IOPS limit support o cgroup changes for IOPS throttling rules. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-16 08:42:58 +02:00
Vivek Goyal	e43473b7f2	blkio: Core implementation of throttle policy o Actual implementation of throttling policy in block layer. Currently it implements READ and WRITE bytes per second throttling logic. IOPS throttling comes in later patches. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-16 08:42:52 +02:00
Vivek Goyal	4c9eefa16c	blk-cgroup: Introduce cgroup changes for throttling policy o cgroup changes for throttle policy. o Introduces READ and WRITE bytes per second throttling rules. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-16 08:42:12 +02:00
Vivek Goyal	062a644d61	blk-cgroup: Prepare the base for supporting more than one IO control policies o This patch prepares the base for introducing new IO control policies. Currently all the code is written knowing there is only one policy and that is proportional bandwidth. Creating infrastructure for newer policies to come in. o Also there were many functions which were generated using macro. It was very confusing. Got rid of those. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-16 08:42:04 +02:00
Vivek Goyal	af41d7bd9b	blk-cgroup: Kill the header printed at the start of blkio.weight_device file o Kill extra "dev weight" header which is printed when somebody reads blkio.weight_device file. This really seems to be out of convention. No other blkio files are printing any header at the start of file. I think it is ok to just print values and how to interpret values should be part of documentation. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-16 08:40:42 +02:00
Will Drewry	b5af921ec0	init: add support for root devices specified by partition UUID This is the third patch in a series which adds support for storing partition metadata, optionally, off of the hd_struct. One major use for that data is being able to resolve partition by other identities than just the index on a block device. Device enumeration varies by platform and there's a benefit to being able to use something like EFI GPT's GUIDs to determine the correct block device and partition to mount as the root. This change adds that support to root= by adding support for the following syntax: root=PARTUUID=hex-uuid Signed-off-by: Will Drewry <wad@chromium.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-15 16:14:03 +02:00
Will Drewry	6d1d8050b4	block, partition: add partition_meta_info to hd_struct I'm reposting this patch series as v4 since there have been no additional comments, and I cleaned up one extra bit of unneeded code (in 3/3). The patches are against Linus's tree: `2bfc96a127` (2.6.36-rc3). Would this patchset be suitable for inclusion in an mm branch? This changes adds a partition_meta_info struct which itself contains a union of structures that provide partition table specific metadata. This change leaves the union empty. The subsequent patch includes an implementation for CONFIG_EFI_PARTITION-based metadata. Signed-off-by: Will Drewry <wad@chromium.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-15 16:13:18 +02:00
Namhyung Kim	144177991c	block: fix an address space warning in blk-map.c Change type of 2nd parameter of blk_rq_aligned() into unsigned long and remove unnecessary casting. Now we can call it with 'uaddr' instead of 'ubuf' in __blk_rq_map_user() so that it can remove following warnings from sparse: block/blk-map.c:57:31: warning: incorrect type in argument 2 (different address spaces) block/blk-map.c:57:31: expected void *addr block/blk-map.c:57:31: got void [noderef] <asn:1>*ubuf However blk_rq_map_kern() needs one more local variable to handle it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-15 13:08:27 +02:00
San Mehat	8dcbdc742f	block: block_dump: Add number of sectors to debug output Signed-off-by: San Mehat <san@android.com> Signed-off-by: Linus Walleij <linus.walleij@stericsson.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-14 08:48:01 +02:00
Martin K. Petersen	13f05c8d8e	block/scsi: Provide a limit on the number of integrity segments Some controllers have a hardware limit on the number of protection information scatter-gather list segments they can handle. Introduce a max_integrity_segments limit in the block layer and provide a new scsi_host_template setting that allows HBA drivers to provide a value suitable for the hardware. Add support for honoring the integrity segment limit when merging both bios and requests. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>	2010-09-10 20:50:10 +02:00
Martin K. Petersen	c8bf133682	Consolidate min_not_zero We have several users of min_not_zero, each of them using their own definition. Move the define to kernel.h. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>	2010-09-10 20:07:38 +02:00
Adrian Hunter	8d57a98ccd	block: add secure discard Secure discard is the same as discard except that all copies of the discarded sectors (perhaps created by garbage collection) must also be erased. Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Madhusudhan Chikkature <madhu.cr@ti.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ben Gardiner <bengardiner@nanometrics.ca> Cc: <linux-mmc@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-12 08:43:30 -07:00
Dmitry Monakhov	18edc8eaa6	blkdev: fix blkdev_issue_zeroout return value - If function called without barrier option retvalue is incorrect Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-08 12:31:08 -04:00
Mike Snitzer	3383977fad	block: update request stacking methods to support discards Propagate REQ_DISCARD in cmd_flags when cloning a discard request. Skip blk_rq_check_limits's existing checks for discard requests because discard limits will have already been checked in blkdev_issue_discard. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-08 12:11:33 -04:00
FUJITA Tomonori	16f2319fd6	block: set up rq->rq_disk properly for flush requests q->bar_rq.rq_disk is NULL. Use the rq_disk of the original request instead. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:52:41 +02:00
FUJITA Tomonori	28e18d0188	block: set REQ_TYPE_FS on flush requests the block layer doesn't set rq->cmd_type on flush requests. By definition, it should be REQ_TYPE_FS (the lower layers build a command and interpret the result of it, that is, the block layer doesn't know the details). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:52:40 +02:00
Jens Axboe	10d1f9e2cc	block: fix problem with sending down discard that isn't of correct granularity If the queue doesn't have a limit set, or it just set UINT_MAX like we default to, we could be sending down a discard request that isn't of the correct granularity if the block size is > 512b. Fix this by adjusting max_discard_sectors down to the proper alignment. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:26:33 +02:00
Dave Chinner	f10d9f617a	blkdev: check for valid request queue before issuing flush Issuing a blkdev_issue_flush() on an unconfigured loop device causes a panic as q->make_request_fn is not configured. This can occur when trying to mount the unconfigured loop device as an XFS filesystem. There are no guards that catch the bio before the request function is called because we don't add a payload to the bio. Instead, manually check this case as soon as we have a pointer to the queue to flush. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:26:29 +02:00
Arnd Bergmann	15392efb9d	block: remove BKL from partition ioctls The blkpg_ioctl and blkdev_reread_part access fields of the bdev and gendisk structures, yet they always do so under the protection of bdev->bd_mutex, which seems sufficient. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:26:08 +02:00
Arnd Bergmann	6de4370310	block: remove BKL from BLKROSET and BLKFLSBUF We only call the functions set_device_ro(), invalidate_bdev(), sync_filesystem() and sync_blockdev() while holding the BKL in these commands. All of these are also done in other code paths without the BKL, which leads me to the conclusion that the BKL is not needed here either. The reason we hold it here is that it was originally pushed down into the ioctl function from vfs_ioctl. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:26:08 +02:00
Arnd Bergmann	62c2a7d969	block: push BKL into blktrace ioctls The blktrace driver currently needs the BKL, but we should not need to take that in the block layer, so just push it down into the driver itself. It is quite likely that the BKL is not actually required in blktrace code and could be removed in a follow-on patch. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:26:08 +02:00
Arnd Bergmann	8a6cfeb6de	block: push down BKL into .locked_ioctl As a preparation for the removal of the big kernel lock in the block layer, this removes the BKL from the common ioctl handling code, moving it into every single driver still using it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:25:00 +02:00
FUJITA Tomonori	00fff26539	block: remove q->prepare_flush_fn completely This removes q->prepare_flush_fn completely (changes the blk_queue_ordered API). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:24:15 +02:00
FUJITA Tomonori	b6a903151d	block: permit PREFLUSH and POSTFLUSH without prepare_flush_fn This is preparation for removing q->prepare_flush_fn. Temporarily, blk_queue_ordered() permits QUEUE_ORDERED_DO_PREFLUSH and QUEUE_ORDERED_DO_POSTFLUSH without prepare_flush_fn. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:23:56 +02:00
FUJITA Tomonori	8749534fe6	block: introduce REQ_FLUSH flag SCSI-ml needs a way to mark a request as flush request in q->prepare_flush_fn because it needs to identify them later (e.g. in q->request_fn or prep_rq_fn). queue_flush sets REQ_HARDBARRIER in rq->cmd_flags however the block layer also sends normal REQ_TYPE_FS requests with REQ_HARDBARRIER. So SCSI-ml can't use REQ_HARDBARRIER to identify flush requests. We could change the block layer to clear REQ_HARDBARRIER bit before sending non flush requests to the lower layers. However, introducing the new flag looks cleaner (surely easier). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: James Bottomley <James.Bottomley@suse.de> Cc: David S. Miller <davem@davemloft.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:23:53 +02:00
James Bottomley	28018c242a	block: implement an unprep function corresponding directly to prep Reviewed-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:23:47 +02:00
Jens Axboe	3ffb52e73b	block: fixup missing conversion from BIO_RW_DISCARD to REQ_DISCARD Didn't cause a merge conflict, so fixed this one up manually post merge. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:23:41 +02:00
Andi Kleen	2c8919dee6	gcc-4.6: block: fix unused but set variables in blk-merge Just some dead code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:23:10 +02:00
Christoph Hellwig	66ac028019	block: don't allocate a payload for discard request Allocating a fixed payload for discard requests always was a horrible hack, and it's not coming to byte us when adding support for discard in DM/MD. So change the code to leave the allocation of a payload to the lowlevel driver. Unfortunately that means we'll need another hack, which allows us to update the various block layer length fields indicating that we have a payload. Instead of hiding this in sd.c, which we already partially do for UNMAP support add a documented helper in the core block layer for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:23:08 +02:00
Christoph Hellwig	7b6d91daee	block: unify flags for struct bio and struct request Remove the current bio flags and reuse the request flags for the bio, too. This allows to more easily trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superfluous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:20:39 +02:00
Christoph Hellwig	33659ebbae	block: remove wrappers for request type/flags Remove all the trivial wrappers for the cmd_type and cmd_flags fields in struct requests. This allows much easier grepping for different request types instead of unwinding through macros. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:17:56 +02:00
Jens Axboe	956bcb7c1a	block: add helpers for the trivial queue flag sysfs show/store entries The code for nonrot, random, and io stats are completely identical. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:13:50 +02:00
Jens Axboe	e2e1a148bc	block: add sysfs knob for turning off disk entropy contributions There are two reasons for doing this: - On SSD disks, the completion times aren't as random as they are for rotational drives. So it's questionable whether they should contribute to the random pool in the first place. - Calling add_disk_randomness() has a lot of overhead. This adds /sys/block/<dev>/queue/add_random that will allow you to switch off on a per-device basis. The default setting is on, so there should be no functional changes from this patch. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:13:00 +02:00
Tao Ma	1b99973f1c	block: Don't count_vm_events for discard bio in submit_bio. In submit_bio, we count vm events by check READ/WRITE. But actually DISCARD_NOBARRIER also has the WRITE flag set. It looks as if in blkdev_issue_discard, we also add a page as the payload and the bio_has_data check isn't enough. So add another check for discard bio. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-24 08:14:22 +02:00
Jens Axboe	9e495db1a1	cfq: fix recursive call in cfq_blkiocg_update_completion_stats() `e98ef89b` has a typo, causing cfq_blkiocg_update_completion_stats() to call itself instead of blkiocg_update_completion_stats(). Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-21 09:10:55 +02:00
Vivek Goyal	e98ef89b30	cfq-iosched: Fixed boot warning with BLK_CGROUP=y and CFQ_GROUP_IOSCHED=n Hi Jens, Few days back Ingo noticed a CFQ boot time warning. This patch fixes it. The issue here is that with CFQ_GROUP_IOSCHED=n, CFQ should not really be making blkio stat related calls. > Hm, it's still not entirely fixed, as of 2.6.35-rc2-00131-g7908a9e. With > some > configs i get bad spinlock warnings during bootup: > > [ 28.968013] initcall net_olddevs_init+0x0/0x82 returned 0 after 93750 > usecs > [ 28.972003] calling b44_init+0x0/0x55 @ 1 > [ 28.976009] bus: 'pci': add driver b44 > [ 28.976374] sda: > [ 28.978157] BUG: spinlock bad magic on CPU#1, async/0/117 > [ 28.980000] lock: 7e1c5bbc, .magic: 00000000, .owner: <none>/-1, +.owner_cpu: 0 > [ 28.980000] Pid: 117, comm: async/0 Not tainted +2.6.35-rc2-tip-01092-g010e7ef-dirty #8183 > [ 28.980000] Call Trace: > [ 28.980000] [<41ba6d55>] ? printk+0x20/0x24 > [ 28.980000] [<4134b7b7>] spin_bug+0x7c/0x87 > [ 28.980000] [<4134b853>] do_raw_spin_lock+0x1e/0x123 > [ 28.980000] [<41ba92ca>] ? _raw_spin_lock_irqsave+0x12/0x20 > [ 28.980000] [<41ba92d2>] _raw_spin_lock_irqsave+0x1a/0x20 > [ 28.980000] [<4133476f>] blkiocg_update_io_add_stats+0x25/0xfb > [ 28.980000] [<41335dae>] ? cfq_prio_tree_add+0xb1/0xc1 > [ 28.980000] [<41337bc7>] cfq_insert_request+0x8c/0x425 Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-18 19:57:47 +02:00
Jeff Moyer	c10b61f091	cfq: Don't allow queue merges for queues that have no process references Hi, A user reported a kernel bug when running a particular program that did the following: created 32 threads - each thread took a mutex, grabbed a global offset, added a buffer size to that offset, released the lock - read from the given offset in the file - created a new thread to do the same - exited The result is that cfq's close cooperator logic would trigger, as the threads were issuing I/O within the mean seek distance of one another. This workload managed to routinely trigger a use after free bug when walking the list of merge candidates for a particular cfqq (cfqq->new_cfqq). The logic used for merging queues looks like this: static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq) { int process_refs, new_process_refs; struct cfq_queue *__cfqq; /* Avoid a circular list and skip interim queue merges */ while ((__cfqq = new_cfqq->new_cfqq)) { if (__cfqq == cfqq) return; new_cfqq = __cfqq; } process_refs = cfqq_process_refs(cfqq); /* If the process for the cfqq has gone away, there is no sense in merging the queues. */ if (process_refs == 0) return; /* Merge in the direction of the lesser amount of work. */ new_process_refs = cfqq_process_refs(new_cfqq); if (new_process_refs >= process_refs) { cfqq->new_cfqq = new_cfqq; atomic_add(process_refs, &new_cfqq->ref); } else { new_cfqq->new_cfqq = cfqq; atomic_add(new_process_refs, &cfqq->ref); } } When a merge candidate is found, we add the process references for the queue with less references to the queue with more. The actual merging of queues happens when a new request is issued for a given cfqq. In the case of the test program, it only does a single pread call to read in 1MB, so the actual merge never happens.
Normally, this is fine, as when the queue exits, we simply drop the references we took on the other cfqqs in the merge chain: /* If this queue was scheduled to merge with another queue, be sure to drop the reference taken on that queue (and others in the merge chain). See cfq_setup_merge and cfq_merge_cfqqs. */ __cfqq = cfqq->new_cfqq; while (__cfqq) { if (__cfqq == cfqq) { WARN(1, "cfqq->new_cfqq loop detected\n"); break; } next = __cfqq->new_cfqq; cfq_put_queue(__cfqq); __cfqq = next; } However, there is a hole in this logic. Consider the following (and keep in mind that each I/O keeps a reference to the cfqq): q1->new_cfqq = q2 // q2 now has 2 process references q3->new_cfqq = q2 // q2 now has 3 process references // the process associated with q2 exits // q2 now has 2 process references // queue 1 exits, drops its reference on q2 // q2 now has 1 process reference // q3 exits, so has 0 process references, and hence drops its references // to q2, which leaves q2 also with 0 process references q4 comes along and wants to merge with q3 q3->new_cfqq still points at q2! We follow that link and end up at an already freed cfqq. So, the fix is to not follow a merge chain if the top-most queue does not have a process reference, otherwise any queue in the chain could be already freed. I also changed the logic to disallow merging with a queue that does not have any process references. Previously, we did this check for one of the merge candidates, but not the other. That doesn't really make sense. Without the attached patch, my system would BUG within a couple of seconds of running the reproducer program. With the patch applied, my system ran the program for over an hour without issues. This addresses the following bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=16217 Thanks a ton to Phil Carns for providing the bug report and an excellent reproducer. [ Note for stable: this applies to 2.6.32/33/34 ].
Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Phil Carns <carns@mcs.anl.gov> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-17 20:17:35 +02:00
Christoph Hellwig	fbbf055692	block: fix DISCARD_BARRIER requests Filesystems assume that DISCARD_BARRIER are full barriers, so that they don't have to track in-progress discard operation when submitting new I/O. But currently we only treat them as elevator barriers, which don't actually do the necessary queue drains. Also remove the unlikely around both the DISCARD and BARRIER requests - they happen far too often for a static mispredict. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-17 10:10:53 +02:00
Mike Snitzer	1abec4fdbb	block: make blk_init_free_list and elevator_init idempotent blk_init_allocated_queue_node may fail and the caller _could_ retry. Accommodate the unlikely event that blk_init_allocated_queue_node is called on an already initialized (possibly partially) request_queue. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-04 13:47:06 +02:00
Mike Snitzer	c86d1b8ae6	block: avoid unconditionally freeing previously allocated request_queue On blk_init_allocated_queue_node failure, only free the request_queue if it is wasn't previously allocated outside the block layer (e.g. blk_init_queue_node was blk_init_allocated_queue_node caller). This addresses an interface bug introduced by the following commit: `01effb0` block: allow initialization of previously allocated request_queue Otherwise the request_queue may be free'd out from underneath a caller that is managing the request_queue directly (e.g. caller uses blk_alloc_queue + blk_init_allocated_queue_node). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-04 13:47:06 +02:00

1233 Commits (02977e4af7ed3b478c505e50491ffdf3e1314cf4)