Commit graph

1453 commits

Author SHA1 Message Date
Martin K. Petersen 3c5820c743 block: Optimal I/O limit wrapper
Implement blk_limits_io_opt() and make blk_queue_io_opt() a wrapper
around it. DM needs this to avoid poking at the queue_limits directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:52 +02:00
Nikanth Karthikesan a9327cac44 Seperate read and write statistics of in_flight requests
Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.

This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-14 08:24:52 +02:00
Jens Axboe 1f98a13f62 bio: first step in sanitizing the bio->bi_rw flag testing
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-11 14:33:31 +02:00
Geert Uytterhoeven 0d03d59d9b md: Fix "strchr" [drivers/md/dm-log-userspace.ko] undefined!
Commit b8313b6da7 ("dm log: remove incorrect
field from userspace table output") added a call to strstr() with a
single-character "needle" string parameter.

Unfortunately some versions of gcc replace such calls to strstr() by calls
to strchr() behind our back.  This causes linking errors if strchr() is
defined as an inline function in <asm/string.h> (e.g. on m68k):

| WARNING: "strchr" [drivers/md/dm-log-userspace.ko] undefined!

Avoid this by explicitly calling strchr() instead.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-10 14:55:01 -07:00
Dan Williams 9134d02bc0 Merge commit 'md/for-linus' into async-tx-next
Conflicts:
	drivers/md/raid5.c
2009-09-08 17:55:54 -07:00
Dan Williams bbb20089a3 Merge branch 'dmaengine' into async-tx-next
Conflicts:
	crypto/async_tx/async_xor.c
	drivers/dma/ioat/dma_v2.h
	drivers/dma/ioat/pci.c
	drivers/md/raid5.c
2009-09-08 17:55:21 -07:00
Dan Williams 0403e38277 dmaengine: add fence support
Some engines optimize operation by reading ahead in the descriptor chain
such that descriptor2 may start execution before descriptor1 completes.
If descriptor2 depends on the result from descriptor1 then a fence is
required (on descriptor2) to disable this optimization.  The async_tx
api could implicitly identify dependencies via the 'depend_tx'
parameter, but that would constrain cases where the dependency chain
only specifies a completion order rather than a data dependency.  So,
provide an ASYNC_TX_FENCE to explicitly identify data dependencies.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:42:50 -07:00
Dan Williams f9dd213437 Merge branch 'md-raid6-accel' into ioat3.2
Conflicts:
	include/linux/dmaengine.h
2009-09-08 17:42:29 -07:00
Mikulas Patocka ae0b7448e9 dm snapshot: fix on disk chunk size validation
Fix some problems seen in the chunk size processing when activating a
pre-existing snapshot.

For a new snapshot, the chunk size can either be supplied by the creator
or a default value can be used.  For an existing snapshot, the
chunk size in the snapshot header on disk should always be used.

If someone attempts to load an existing snapshot and has the 'default
chunk size' option set, the kernel uses its default value even when it
is incorrect for the snapshot being loaded.  This patch ensures the
correct on-disk value is always used.

Secondly, when the code does use the chunk size stored on the disk it is
prudent to revalidate it, so the code can exit cleanly if it got
corrupted as happened in
https://bugzilla.redhat.com/show_bug.cgi?id=461506 .

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:43 +01:00
Mikulas Patocka 2defcc3fb4 dm exception store: split set_chunk_size
Break the function set_chunk_size to two functions in preparation for
the fix in the following patch.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:41 +01:00
Mikulas Patocka 61578dcd3f dm snapshot: fix header corruption race on invalidation
If a persistent snapshot fills up, a race can corrupt the on-disk header
which causes a crash on any future attempt to activate the snapshot
(typically while booting).  This patch fixes the race.

When the snapshot overflows, __invalidate_snapshot is called, which calls
snapshot store method drop_snapshot. It goes to persistent_drop_snapshot that
calls write_header. write_header constructs the new header in the "area"
location.

Concurrently, an existing kcopyd job may finish, call copy_callback
and commit_exception method, that goes to persistent_commit_exception.
persistent_commit_exception doesn't do locking, relying on the fact that
callbacks are single-threaded, but it can race with snapshot invalidation and
overwrite the header that is just being written while the snapshot is being
invalidated.

The result of this race is a corrupted header being written that can
lead to a crash on further reactivation (if chunk_size is zero in the
corrupted header).

The fix is to use separate memory areas for each.

See the bug: https://bugzilla.redhat.com/show_bug.cgi?id=461506

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:39 +01:00
Mikulas Patocka 02d2fd31de dm snapshot: refactor zero_disk_area to use chunk_io
Refactor chunk_io to prepare for the fix in the following patch.

Pass an area pointer to chunk_io and simplify zero_disk_area to use
chunk_io.  No functional change.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:37 +01:00
Jonathan Brassow 7ec23d5094 dm log: userspace add luid to distinguish between concurrent log instances
Device-mapper userspace logs (like the clustered log) are
identified by a universally unique identifier (UUID).  This
identifier is used to associate requests from the kernel to
a specific log in userspace.  The UUID must be unique everywhere,
since multiple machines may use this identifier when communicating
about a particular log, as is the case for cluster logs.

Sometimes, device-mapper/LVM may re-use a UUID.  This is the
case during pvmoves, when moving from one segment of an LV
to another, or when resizing a mirror, etc.  In these cases,
a new log is created with the same UUID and loaded in the
"inactive" slot.  When a device-mapper "resume" is issued,
the "live" table is deactivated and the new "inactive" table
becomes "live".  (The "inactive" table can also be removed
via a device-mapper 'clear' command.)

The above two issues were colliding.  More than one log was being
created with the same UUID, and there was no way to distinguish
between them.  So, sometimes the wrong log would be swapped
out during the exchange.

The solution is to create a locally unique identifier,
'luid', to go along with the UUID.  This new identifier is used
to determine exactly which log is being referenced by the kernel
when the log exchange is made.  The identifier is not
universally safe, but it does not need to be, since
create/destroy/suspend/resume operations are bound to a specific
machine; and these are the operations that make up the exchange.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:34 +01:00
Jonathan Brassow d2b698644c dm raid1: do not allow log_failure variable to unset after being set
This patch fixes a bug which was triggering a case where the primary leg
could not be changed on failure even when the mirror was in-sync.

The case involves the failure of the primary device along with
the transient failure of the log device.  The problem is that
bios can be put on the 'failures' list (due to log failure)
before 'fail_mirror' is called due to the primary device failure.
Normally, this is fine, but if the log device failure is transient,
a subsequent iteration of the work thread, 'do_mirror', will
reset 'log_failure'.  The 'do_failures' function then resets
the 'in_sync' variable when processing bios on the failures list.
The 'in_sync' variable is what is used to determine if the
primary device can be switched in the event of a failure.  Since
this has been reset, the primary device is incorrectly assumed
to be not switchable.

The case has been seen in the cluster mirror context, where one
machine realizes the log device is dead before the other machines.
As the responsibilities of the server migrate from one node to
another (because the mirror is being reconfigured due to the failure),
the new server may think for a moment that the log device is fine -
thus resetting the 'log_failure' variable.

In any case, it is inappropiate for us to reset the 'log_failure'
variable.  The above bug simply illustrates that it can actually
hurt us.

Cc: stable@kernel.org
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:32 +01:00
Jonathan Brassow b8313b6da7 dm log: remove incorrect field from userspace table output
The output of 'dmsetup table' includes an internal field that should not
be there.  This patch removes it.  To make the fix simpler, we first
reorder a constructor argument

The 'device size' argument is generated internally.  Currently it is
placed as the last space-separated word of the constructor string.
However, we need to use a version of the string without this word, so we
move it to the beginning instead so it is trivial to skip past it.

We keep a copy of the arguments passed to userspace for creating a log,
just in case we need to resend them.  These are the same arguments that
are desired in the STATUSTYPE_TABLE request, except for one.  When
creating the userspace log, the userspace daemon must know the size of
the mirror, so that is added to the arguments given in the constructor
table.  We were printing this extra argument out as well, which is a
mistake.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:30 +01:00
Jonathan Brassow 4142a96917 dm log: fix userspace status output
Fix 'dmsetup table' output.

There is a missing ' ' at the end of the string causing two
words to run together.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:28 +01:00
Mike Snitzer 40bea43127 dm stripe: expose correct io hints
Set sensible I/O hints for striped DM devices in the topology
infrastructure added for 2.6.31 for userspace tools to
obtain via sysfs.

Add .io_hints to 'struct target_type' to allow the I/O hints portion
(io_min and io_opt) of the 'struct queue_limits' to be set by each
target and implement this for dm-stripe.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:25 +01:00
Mike Snitzer a963a95622 dm table: add more context to terse warning messages
A couple of recent warning messages make it difficult for the reader to
determine exactly what is wrong.  This patch adds more information to
those messages.

The messages were added by these commits:
  5dea271b6d ("dm table: pass correct dev area size
to device_area_is_valid")
  ea9df47cc9 ("dm table: fix blk_stack_limits arg
to use bytes not sectors")

The patch also corrects references to logical_block_size in printk format
strings from %hu to %u.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:24 +01:00
Mikulas Patocka f6a1ed1086 dm table: fix queue_limit checking device iterator
The logic to check for valid device areas is inverted relative to proper
use with iterate_devices.

The iterate_devices method calls its callback for every underlying
device in the target.  If any callback returns non-zero, iterate_devices
exits immediately.  But the callback device_area_is_valid() returns 0 on
error and 1 on success.  The overall effect without is that an error is
issued only if every device is invalid.

This patch renames device_area_is_valid to device_area_is_invalid and
inverts the logic so that one invalid device is sufficient to raise
an error.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:22 +01:00
Mike Snitzer 8811f46c1f dm snapshot: implement iterate devices
Implement the .iterate_devices for the origin and snapshot targets.
dm-snapshot's lack of .iterate_devices resulted in the inability to
properly establish queue_limits for both targets.

With 4K sector drives: an unfortunate side-effect of not establishing
proper limits in either targets' DM device was that IO to the devices
would fail even though both had been created without error.

Commit af4874e03e ("dm target:s introduce
iterate devices fn") in 2.6.31-rc1 should have implemented .iterate_devices
for dm-snap.c's origin and snapshot targets.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:19 +01:00
Kiyoshi Ueda a77e28c7e1 dm multipath: fix oops when request based io fails when no paths
The patch posted at http://marc.info/?l=dm-devel&m=124539787228784&w=2
which was merged into cec47e3d4a ("dm:
prepare for request based option") introduced a regression in
request-based dm.

If map_request() calls dm_kill_unmapped_request() to complete a cloned
bio without dispatching it, clone->bio is still set when
dm_end_request() is called and the BUG_ON(clone->bio) is incorrect.

The patch fixes this bug by freeing bio in dm_end_request() if the clone
has bio.  I've redone my tests to cover all I/O paths and confirmed
there's no other regression.

Here is the oops I hit in request-based dm when I do I/O to a multipath
device which doesn't have any active path nor queue_if_no_path setting:

------------[ cut here ]------------
kernel BUG at /root/2.6.31-rc4.rqdm/drivers/md/dm.c:828!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 1
Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_service_time dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac sg sr_mod e1000e button cdrom serio_raw rtc_cmos rtc_core rtc_lib piix lpfc scsi_transport_fc ata_piix libata megaraid_sas sd_mod scsi_mod crc_t10dif ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 7, comm: ksoftirqd/1 Not tainted 2.6.31-rc4.rqdm #1 Express5800/120Lj [N8100-1417]
RIP: 0010:[<ffffffffa023629d>]  [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod]
RSP: 0018:ffff8800280a1f08  EFLAGS: 00010282
RAX: ffffffffa02544e0 RBX: ffff8802aa1111d0 RCX: ffff8802aa1111e0
RDX: ffff8802ab913e70 RSI: 0000000000000000 RDI: ffff8802ab913e70
RBP: ffff8800280a1f28 R08: ffffc90005457040 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 00000000fffffffb
R13: ffff8802ab913e88 R14: ffff8802ab9c1438 R15: 0000000000000100
FS:  0000000000000000(0000) GS:ffff88002809e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003d54a98640 CR3: 000000029f0a1000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksoftirqd/1 (pid: 7, threadinfo ffff8802ae50e000, task ffff8802ae4f8040)
Stack:
 ffff8800280a1f38 0000000000000020 ffffffff814f30a0 0000000000000004
<0> ffff8800280a1f58 ffffffff8116b245 ffff8800280a1f38 ffff8800280a1f38
<0> ffff8800280a1f58 0000000000000001 ffff8800280a1fa8 ffffffff810477bc
Call Trace:
 <IRQ>
 [<ffffffff8116b245>] blk_done_softirq+0x75/0x90
 [<ffffffff810477bc>] __do_softirq+0xcc/0x210
 [<ffffffff81047170>] ? ksoftirqd+0x0/0x110
 [<ffffffff8100ce7c>] call_softirq+0x1c/0x50
 <EOI>
 [<ffffffff8100e785>] do_softirq+0x65/0xa0
 [<ffffffff81047170>] ? ksoftirqd+0x0/0x110
 [<ffffffff810471e0>] ksoftirqd+0x70/0x110
 [<ffffffff81059559>] kthread+0x99/0xb0
 [<ffffffff8100cd7a>] child_rip+0xa/0x20
 [<ffffffff8100c73c>] ? restore_args+0x0/0x30
 [<ffffffff810594c0>] ? kthread+0x0/0xb0
 [<ffffffff8100cd70>] ? child_rip+0x0/0x20
Code: 44 89 e6 48 89 df e8 23 fb f2 e0 be 01 00 00 00 4c 89 f7 e8 f6 fd ff ff 5b 41 5c 41 5d 41 5e c9 c3 4c 89 ef e8 85 fe ff ff eb ed <0f> 0b eb fe 41 8b 85 dc 00 00 00 48 83 bb 10 01 00 00 00 89 83
RIP  [<ffffffffa023629d>] dm_softirq_done+0xbd/0x100 [dm_mod]
 RSP <ffff8800280a1f08>
---[ end trace 16af0a1d8542da55 ]---

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-09-04 20:40:16 +01:00
Dan Williams 07a3b417dc md/raid456: distribute raid processing over multiple cores
Now that the resources to handle stripe_head operations are allocated
percpu it is possible for raid5d to distribute stripe handling over
multiple cores.  This conversion also adds a call to cond_resched() in
the non-multicore case to prevent one core from getting monopolized for
raid operations.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:13 -07:00
Yuri Tikhonov b774ef491b md/raid6: remove synchronous infrastructure
These routines have been replaced by there asynchronous counterparts.

Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:13 -07:00
Yuri Tikhonov 6c0069c0ae md/raid6: asynchronous handle_stripe6
1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to
   raid_run_ops
2/ Implement a handler for sh->reconstruct_state similar to the raid5 case
   (adds handling of Q parity)
3/ Prevent handle_parity_checks6 from running concurrently with 'compute'
   operations
4/ Hook up raid_run_ops

Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:13 -07:00
Dan Williams d82dfee0ad md/raid6: asynchronous handle_parity_check6
[ Based on an original patch by Yuri Tikhonov ]

Implement the state machine for handling the RAID-6 parities check and
repair functionality.  Note that the raid6 case does not need to check
for new failures, like raid5, as it will always writeback the correct
disks.  The raid5 case can be updated to check zero_sum_result to avoid
getting confused by new failures rather than retrying the entire check
operation.

Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:13 -07:00
Yuri Tikhonov a9b39a741a md/raid6: asynchronous handle_stripe_dirtying6
In the synchronous implementation of stripe dirtying we processed a
degraded stripe with one call to handle_stripe_dirtying6().  I.e.
compute the missing blocks from the other drives, then copy in the new
data and reconstruct the parities.

In the asynchronous case we do not perform stripe operations directly.
Instead, operations are scheduled with flags to be later serviced by
raid_run_ops.  So, for the degraded case the final reconstruction step
can only be carried out after all blocks have been brought up to date by
being read, or computed.  Like the raid5 case schedule_reconstruction()
sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and
through operation chaining can handle compute and reconstruct in a
single raid_run_ops pass.

[dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating]
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:12 -07:00
Yuri Tikhonov 5599becca4 md/raid6: asynchronous handle_stripe_fill6
Modify handle_stripe_fill6 to work asynchronously by introducing
fetch_block6 as the raid6 analog of fetch_block5 (schedule compute
operations for missing/out-of-sync disks).

[dan.j.williams@intel.com: compute D+Q in one pass]
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:12 -07:00
Yuri Tikhonov c0f7bddbe6 md/raid5,6: common schedule_reconstruction for raid5/6
Extend schedule_reconstruction5 for reuse by the raid6 path.  Add
support for generating Q and BUG() if a request is made to perform
'prexor'.

Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:12 -07:00
Dan Williams ac6b53b6e6 md/raid6: asynchronous raid6 operations
[ Based on an original patch by Yuri Tikhonov ]

The raid_run_ops routine uses the asynchronous offload api and
the stripe_operations member of a stripe_head to carry out xor+pq+copy
operations asynchronously, outside the lock.

The operations performed by RAID-6 are the same as in the RAID-5 case
except for no support of STRIPE_OP_PREXOR operations. All the others
are supported:
STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate missing blocks (1 or 2) in the cache from the other blocks
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_RECONSTRUCT
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct

The flow is the same as in the RAID-5 case, and reuses some routines, namely:
1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
2/ ops_complete_compute (updated to set up to 2 targets uptodate)
3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)

[neilb@suse.de: fixes to get it to pass mdadm regression suite]
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:12 -07:00
Dan Williams 4e7d2c0aef md/raid5: factor out mark_uptodate from ops_complete_compute5
ops_complete_compute5 can be reused in the raid6 path if it is updated to
generically handle a second target.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:11 -07:00
Dan Williams cb3c82992f async_tx: raid6 recovery self test
Port drivers/md/raid6test/test.c to use the async raid6 recovery
routines.  This is meant as a unit test for raid6 acceleration drivers.  In
addition to the 16-drive test case this implements tests for the 4-disk and
5-disk special cases (dma devices can not generically handle less than 2
sources), and adds a test for the D+Q case.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:28 -07:00
Dan Williams ad283ea4a3 async_tx: add sum check flags
Replace the flat zero_sum_result with a collection of flags to contain
the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed
solomon syndrome) zero-sum result.  Use the SUM_CHECK_ namespace instead
of DMA_ since these flags will be used on non-dma-zero-sum enabled
platforms.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
Dan Williams d6f38f31f3 md/raid5,6: add percpu scribble region for buffer lists
Use percpu memory rather than stack for storing the buffer lists used in
parity calculations.  Include space for dma address conversions and pass
that to async_tx via the async_submit_ctl.scribble pointer.

[ Impact: move memory pressure from stack to heap ]

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
Dan Williams 36d1c6476b md/raid6: move the spare page to a percpu allocation
In preparation for asynchronous handling of raid6 operations move the
spare page to a percpu allocation to allow multiple simultaneous
synchronous raid6 recovery operations.

Make this allocation cpu hotplug aware to maximize allocation
efficiency.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
Chandra Seetharaman 2bfd2e1337 [SCSI] scsi_dh: Use scsi_dh_set_params() in multipath.
Use scsi_dh_set_params() set parameters provided. Save the parameters in
parse_hw_handler() and use it in parse_path().

Reported-by: Eddie Williams <Eddie.Williams@steeleye.com>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Tested-by: Eddie Williams <Eddie.Williams@steeleye.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2009-08-22 17:52:15 -05:00
Linus Torvalds 435a71d9ef Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  Fix new incorrect error return from do_md_stop.
2009-08-18 13:54:08 -07:00
NeilBrown 80ffb3ccea Fix new incorrect error return from do_md_stop.
Recent commit c8c00a6915
changed the exit paths in do_md_stop and was not quite
careful enough.  There is one path were 'err' now needs
to be cleared but it isn't.
So setting an array to readonly (with mdadm --readonly) will
work, but will incorrectly report and error: ENXIO.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-18 10:35:26 +10:00
Randy Dunlap 894ef820b1 dm-log-userspace: fix printk format warning
drivers/md/dm-log-userspace-transfer.c:110: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'size_t'

Previously posted and acked, but apparently lost.
http://lkml.indiana.edu/hypermail/linux/kernel/0906.2/02074.html

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: dm-devel@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-16 08:35:58 -07:00
NeilBrown 4d484a4a7a md: allow upper limit for resync/reshape to be set when array is read-only
Normally we only allow the upper limit for a reshape to be decreased
when the array not performing a sync/recovery/reshape, otherwise there
could be races.  But if an array is part-way through a reshape when it
is assembled the reshape is started immediately leaving no window
to set an upper bound.

If the array is started read-only, the reshape will be suspended until
the array becomes writable, so that provides a window during which it
is perfectly safe to reduce the upper limit of a reshape.

So: allow the upper limit (sync_max) to be reduced even if the reshape
thread is running, as long as the array is still read-only.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:41:50 +10:00
NeilBrown 1a67dde0ab md/raid5: Properly remove excess drives after shrinking a raid5/6
We were removing the drives, from the array, but not
removing symlinks from /sys/.... and not marking the device
as having been removed.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:41:49 +10:00
NeilBrown a639755cf8 md/raid5: make sure a reshape restarts at the correct address.
This "if" don't allow for the possibility that the number of devices
doesn't change, and so sector_nr isn't set correctly in that case.
So change '>' to '>='.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:13:00 +10:00
NeilBrown 67ac6011db md/raid5: allow new reshape modes to be restarted in the middle.
md/raid5 doesn't allow a reshape to restart if it involves writing
over the same part of disk that it would be reading from.
This happens at the beginning of a reshape that increases the number
of devices, at the end of a reshape that decreases the number of
devices, and continuously for a reshape that does not change the
number of devices.

The current code is correct for the "increase number of devices"
case as the critical section at the start is handled by userspace
performing a backup.

It does not work for reducing the number of devices, or the
no-change case.
For 'reducing', we need to invert the test.  For no-change we cannot
really be sure things will be safe, so simply require the array
to be read-only, which is how the user-space code which carefully
starts such arrays works.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:06:24 +10:00
NeilBrown 51d5668cb2 md: never advance 'events' counter by more than 1.
When assembling arrays, md allows two devices to have different event
counts as long as the difference is only '1'.  This is to cope with
a system failure between updating the metadata on two difference
devices.

However there are currently times when we update the event count by
2.  This was done to keep the event count even when the array is clean
and odd when it is dirty, which allows us to avoid writing common
update to spare devices and so allow those spares to go to sleep.

This is bad for the above reason.  So change it to never increase by
two.  This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain, but that is only a
small cost.  The spares will get a few more updates but that will
still be spared (;-) most updates and can still go to sleep.

Prior to this patch there was a small chance that after a crash an
array would fail to assemble due to the overly large event count
mismatch.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 09:54:02 +10:00
NeilBrown c8c00a6915 Remove deadlock potential in md_open
A recent commit:
  commit 449aad3e25

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
   reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
   takes bd_mutex in the call the revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nexted which was introduced
by
   commit d63a5a74de
do avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex in to two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL.  So we can be certain that no IO is in flight as the
array is being destroyed.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-10 12:50:52 +10:00
NeilBrown 449aad3e25 md: Use revalidate_disk to effect changes in size of device.
As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode.  So use that instead of mucking about with locks and
i_size_write.

Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:58 +10:00
NeilBrown 64bd660b51 md: allow raid5_quiesce to work properly when reshape is happening.
The ->quiesce method is not supposed to stop resync/recovery/reshape,
just normal IO.
But in raid5 we don't have a way to know which stripes are being
used for normal IO and which for resync etc, so we need to wait for
all stripes to be idle to be sure that all writes have completed.

However reshape keeps at least some stripe busy for an extended period
of time, so a call to raid5_quiesce can block for several seconds
needlessly.
So arrange for reshape etc to pause briefly while raid5_quiesce is
trying to quiesce the array so that the active_stripes count can
drop to zero.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:58 +10:00
NeilBrown e516402c0d md/raid5: set reshape_position correctly when reshape starts.
As the internal reshape_progress counter is the main driver
for reshape, the fact that reshape_position sometimes starts with the
wrong value has minimal effect.  It is visible in sysfs and that
is all.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:57 +10:00
NeilBrown 70471dafe3 md: Handle growth of v1.x metadata correctly.
The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.

Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:57 +10:00
NeilBrown 3673f305fa md: avoid array overflow with bad v1.x metadata
We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array.  This isn't really safe.
So range-check the value first.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:56 +10:00
NeilBrown 3a981b03f3 md: when a level change reduces the number of devices, remove the excess.
When an array is changed from RAID6 to RAID5, fewer drives are
needed.  So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:55 +10:00
Andre Noll ac5e7113e7 md: Push down data integrity code to personalities.
This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity.  The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-03 10:59:47 +10:00
Dan Williams 95fc17aac4 md/raid6: release spare page at ->stop()
Add missing call to safe_put_page from stop() by unifying open coded
raid5_conf_t de-allocation under free_conf().

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-31 12:39:15 +10:00
Mike Snitzer 5dea271b6d dm table: pass correct dev area size to device_area_is_valid
Incorrect device area lengths are being passed to device_area_is_valid().

The regression appeared in 2.6.31-rc1 through commit
754c5fc7eb.

With the dm-stripe target, the size of the target (ti->len) was used
instead of the stripe_width (ti->len/#stripes).  An example of a
consequent incorrect error message is:

  device-mapper: table: 254:0: sdb too small for target

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:42 +01:00
Mike Snitzer a732c207d1 dm: remove queue next_ordered workaround for barriers
This patch removes DM's bio-based vs request-based conditional setting
of next_ordered.  For bio-based DM the next_ordered check is no longer a
concern (as that check is now in the __make_request path).  For
request-based DM the default of QUEUE_ORDERED_NONE is now appropriate.

bio-based DM was changed to work-around the previously misplaced
next_ordered check with this commit:
99360b4c18

request-based DM does not yet support barriers but reacted to the above
bio-based DM change with this commit:
5d67aa2366

The above changes are no longer needed given Neil Brown's recent fix to
put the next_ordered check in the __make_request path:
db64f680ba

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:40 +01:00
Mikulas Patocka 69885683d2 dm raid1: wake kmirrord when requeueing delayed bios after remote recovery
The recent commit 7513c2a761 (dm raid1:
add is_remote_recovering hook for clusters) changed do_writes() to
update the ms->writes list but forgot to wake up kmirrord to process it.

The rule is that when anything is being added on ms->reads, ms->writes
or ms->failures and the list was empty before we must call
wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
processing).  Otherwise the bios could sit on the list indefinitely.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-07-23 20:30:37 +01:00
Dan Williams a11034b428 md/raid6: release spare page at ->stop()
Add missing call to safe_put_page from stop() by unifying open coded
raid5_conf_t de-allocation under free_conf().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-07-14 11:48:16 -07:00
Jens Axboe 8aa7e847d8 Fix congestion_wait() sync/async vs read/write confusion
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-10 20:31:53 +02:00
Joe Perches ad361c9884 Remove multiple KERN_ prefixes from printk formats
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics.  printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.

<level> is now included in the output on each additional use.

Remove all uses of multiple KERN_<level>s in formats.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-08 10:30:03 -07:00
Linus Torvalds 2027bd9f92 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cfq-iosched: remove redundant check for NULL cfqq in cfq_set_request()
  blocK: Restore barrier support for md and probably other virtual devices.
  block: get rid of queue-private command filter
  block: Create bip slabs with embedded integrity vectors
  cfq-iosched: get rid of the need for __GFP_NOFAIL in cfq_find_alloc_queue()
  cfq-iosched: move cfqq initialization out of cfq_find_alloc_queue()
  Trivial typo fixes in Documentation/block/data-integrity.txt.
2009-07-01 10:41:09 -07:00
Linus Torvalds 544ae5f96e Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: use interruptible wait when duration is controlled by userspace.
  md/raid5: suspend shouldn't affect read requests.
  md: tidy up error paths in md_alloc
  md: fix error path when duplicate name is found on md device creation.
  md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes.
  md: Use new topology calls to indicate alignment and I/O sizes
2009-07-01 10:31:26 -07:00
Martin K. Petersen 7878cba9f0 block: Create bip slabs with embedded integrity vectors
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs.  Each bip slab
has an embedded bio_vec array at the end.  This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version.  Only the largest bip slab is backed by a mempool.  The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
2009-07-01 10:56:25 +02:00
NeilBrown e62e58a5ff md: use interruptible wait when duration is controlled by userspace.
User space can set various limits on an md array so that resync waits
when it gets to a certain point, or so that I/O is blocked for a short
while.
When md is waiting against one of these limit, it should use an
interruptible wait so as not to add to the load average, and so are
not to trigger a warning if the wait goes on for too long.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 13:15:35 +10:00
NeilBrown a5c308d4d1 md/raid5: suspend shouldn't affect read requests.
md allows write to regions on an array to be suspended temporarily.
This allows user-space to participate is aspects of reshape.
In particular, data can be copied with not risk of a race.
We should not be blocking read requests though, so don't.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 13:15:35 +10:00
NeilBrown 0909dc448c md: tidy up error paths in md_alloc
As the recent bug in md_alloc showed, having a single exit path for
unlocking and putting is a good idea.  So restructure md_alloc to have
a single mutex_unlock and mddev_put, and use gotos where necessary.

Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 12:27:21 +10:00
NeilBrown 1ec22eb2b4 md: fix error path when duplicate name is found on md device creation.
When an md device is created by name (rather than number) we need to
check that the name is not already in use.  If this check finds a
duplicate, we return an error without dropping the lock or freeing
the newly create mddev.
This patch fixes that.

Cc: stable@kernel.org
Found-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 12:27:21 +10:00
NeilBrown b8d966efd9 md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes.
If we try to modify one of the md/ sysfs files
  suspend_lo or suspend_hi
when the array is not active, we dereference a NULL.
Protect against that.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 11:14:04 +10:00
Martin K. Petersen 8f6c2e4b32 md: Use new topology calls to indicate alignment and I/O sizes
Switch MD over to the new disk_stack_limits() function which checks for
aligment and adjusts preferred I/O sizes when stacking.

Also indicate preferred I/O sizes where applicable.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-07-01 11:13:45 +10:00
Mike Snitzer ea9df47cc9 dm table: fix blk_stack_limits arg to use bytes not sectors
The offset passed to blk_stack_limits() must be in bytes not sectors.
Fixes false warnings like the following:
device-mapper: table: 254:1: target device sda6 is misaligned

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: Frans Pop <elendil@planet.nl>
Tested-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-30 15:18:17 +01:00
Milan Broz 874d2f61d3 dm exception store: really fix type lookup
Fix exception store name handling.

We need to reference exception store by zero terminated string.

Fixes regression introduced in commit f6bd4eb73c

Cc: Yi Yang <yi.y.yang@intel.com>
Cc: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-30 15:18:14 +01:00
Kiyoshi Ueda f40c67f0f7 dm mpath: change to be request based
This patch converts dm-multipath target to request-based from bio-based.

Basically, the patch just converts the I/O unit from struct bio
to struct request.
In the course of the conversion, it also changes the I/O queueing
mechanism.  The change in the I/O queueing is described in details
as follows.

I/O queueing mechanism change
-----------------------------
In I/O submission, map_io(), there is no mechanism change from
bio-based, since the clone request is ready for retry as it is.
However, in I/O complition, do_end_io(), there is a mechanism change
from bio-based, since the clone request is not ready for retry.

In do_end_io() of bio-based, the clone bio has all needed memory
for resubmission.  So the target driver can queue it and resubmit
it later without memory allocations.
The mechanism has almost no overhead.

On the other hand, in do_end_io() of request-based, the clone request
doesn't have clone bios, so the target driver can't resubmit it
as it is.  To resubmit the clone request, memory allocation for
clone bios is needed, and it takes some overheads.
To avoid the overheads just for queueing, the target driver doesn't
queue the clone request inside itself.
Instead, the target driver asks dm core for queueing and remapping
the original request of the clone request, since the overhead for
queueing is just a freeing memory for the clone request.

As a result, the target driver doesn't need to record/restore
the information of the original request for resubmitting
the clone request.  So dm_bio_details in dm_mpath_io is removed.

multipath_busy()
---------------------
The target driver returns "busy", only when the following case:
  o The target driver will map I/Os, if map() function is called
  and
  o The mapped I/Os will wait on underlying device's queue due to
    their congestions, if map() function is called now.

In other cases, the target driver doesn't return "busy".
Otherwise, dm core will keep the I/Os and the target driver can't
do what it wants.
(e.g. the target driver can't map I/Os now, so wants to kill I/Os.)

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:37 +01:00
Kiyoshi Ueda 523d9297d4 dm: disable interrupt when taking map_lock
This patch disables interrupt when taking map_lock to avoid
lockdep warnings in request-based dm.

request-based dm takes map_lock after taking queue_lock with
disabling interrupt:
  spin_lock_irqsave(queue_lock)
  q->request_fn() == dm_request_fn()
    => dm_get_table()
         => read_lock(map_lock)
while queue_lock could be (but isn't) taken in interrupt context.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:37 +01:00
Kiyoshi Ueda 5d67aa2366 dm: do not set QUEUE_ORDERED_DRAIN if request based
Request-based dm doesn't have barrier support yet.
So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm.
Since the device type is decided at the first table loading time,
the flag set is deferred until then.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:36 +01:00
Kiyoshi Ueda e6ee8c0b76 dm: enable request based option
This patch enables request-based dm.

o Request-based dm and bio-based dm coexist, since there are
  some target drivers which are more fitting to bio-based dm.
  Also, there are other bio-based devices in the kernel
  (e.g. md, loop).
  Since bio-based device can't receive struct request,
  there are some limitations on device stacking between
  bio-based and request-based.

                     type of underlying device
                   bio-based      request-based
   ----------------------------------------------
    bio-based         OK                OK
    request-based     --                OK

  The device type is recognized by the queue flag in the kernel,
  so dm follows that.

o The type of a dm device is decided at the first table binding time.
  Once the type of a dm device is decided, the type can't be changed.

o Mempool allocations are deferred to at the table loading time, since
  mempools for request-based dm are different from those for bio-based
  dm and needed mempool type is fixed by the type of table.

o Currently, request-based dm supports only tables that have a single
  target.  To support multiple targets, we need to support request
  splitting or prevent bio/request from spanning multiple targets.
  The former needs lots of changes in the block layer, and the latter
  needs that all target drivers support merge() function.
  Both will take a time.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:36 +01:00
Kiyoshi Ueda cec47e3d4a dm: prepare for request based option
This patch adds core functions for request-based dm.

When struct mapped device (md) is initialized, md->queue has
an I/O scheduler and the following functions are used for
request-based dm as the queue functions:
    make_request_fn: dm_make_request()
    pref_fn:         dm_prep_fn()
    request_fn:      dm_request_fn()
    softirq_done_fn: dm_softirq_done()
    lld_busy_fn:     dm_lld_busy()
Actual initializations are done in another patch (PATCH 2).

Below is a brief summary of how request-based dm behaves, including:
  - making request from bio
  - cloning, mapping and dispatching request
  - completing request and bio
  - suspending md
  - resuming md

  bio to request
  ==============
  md->queue->make_request_fn() (dm_make_request()) calls __make_request()
  for a bio submitted to the md.
  Then, the bio is kept in the queue as a new request or merged into
  another request in the queue if possible.

  Cloning and Mapping
  ===================
  Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
  when requests are dispatched after they are sorted by the I/O scheduler.

  dm_request_fn() checks busy state of underlying devices using
  target's busy() function and stops dispatching requests to keep them
  on the dm device's queue if busy.
  It helps better I/O merging, since no merge is done for a request
  once it is dispatched to underlying devices.

  Actual cloning and mapping are done in dm_prep_fn() and map_request()
  called from dm_request_fn().
  dm_prep_fn() clones not only request but also bios of the request
  so that dm can hold bio completion in error cases and prevent
  the bio submitter from noticing the error.
  (See the "Completion" section below for details.)

  After the cloning, the clone is mapped by target's map_rq() function
    and inserted to underlying device's queue using
    blk_insert_cloned_request().

  Completion
  ==========
  Request completion can be hooked by rq->end_io(), but then, all bios
  in the request will have been completed even error cases, and the bio
  submitter will have noticed the error.
  To prevent the bio completion in error cases, request-based dm clones
  both bio and request and hooks both bio->bi_end_io() and rq->end_io():
      bio->bi_end_io(): end_clone_bio()
      rq->end_io():     end_clone_request()

  Summary of the request completion flow is below:
  blk_end_request() for a clone request
    => blk_update_request()
       => bio->bi_end_io() == end_clone_bio() for each clone bio
          => Free the clone bio
          => Success: Complete the original bio (blk_update_request())
             Error:   Don't complete the original bio
    => blk_finish_request()
       => rq->end_io() == end_clone_request()
          => blk_complete_request()
             => dm_softirq_done()
                => Free the clone request
                => Success: Complete the original request (blk_end_request())
                   Error:   Requeue the original request

  end_clone_bio() completes the original request on the size of
  the original bio in successful cases.
  Even if all bios in the original request are completed by that
  completion, the original request must not be completed yet to keep
  the ordering of request completion for the stacking.
  So end_clone_bio() uses blk_update_request() instead of
  blk_end_request().
  In error cases, end_clone_bio() doesn't complete the original bio.
  It just frees the cloned bio and gives over the error handling to
  end_clone_request().

  end_clone_request(), which is called with queue lock held, completes
  the clone request and the original request in a softirq context
  (dm_softirq_done()), which has no queue lock, to avoid a deadlock
  issue on submission of another request during the completion:
      - The submitted request may be mapped to the same device
      - Request submission requires queue lock, but the queue lock
        has been held by itself and it doesn't know that

  The clone request has no clone bio when dm_softirq_done() is called.
  So target drivers can't resubmit it again even error cases.
  Instead, they can ask dm core for requeueing and remapping
  the original request in that cases.

  suspend
  =======
  Request-based dm uses stopping md->queue as suspend of the md.
  For noflush suspend, just stops md->queue.

  For flush suspend, inserts a marker request to the tail of md->queue.
  And dispatches all requests in md->queue until the marker comes to
  the front of md->queue.  Then, stops dispatching request and waits
  for the all dispatched requests to complete.
  After that, completes the marker request, stops md->queue and
  wake up the waiter on the suspend queue, md->wait.

  resume
  ======
  Starts md->queue.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
Jonthan Brassow f5db4af466 dm raid1: add userspace log
This patch contains a device-mapper mirror log module that forwards
requests to userspace for processing.

The structures used for communication between kernel and userspace are
located in include/linux/dm-log-userspace.h.  Due to the frequency,
diversity, and 2-way communication nature of the exchanges between
kernel and userspace, 'connector' was chosen as the interface for
communication.

The first log implementations written in userspace - "clustered-disk"
and "clustered-core" - support clustered shared storage.   A userspace
daemon (in the LVM2 source code repository) uses openAIS/corosync to
process requests in an ordered fashion with the rest of the nodes in the
cluster so as to prevent log state corruption.  Other implementations
with no association to LVM or openAIS/corosync, are certainly possible.

(Imagine if two machines are writing to the same region of a mirror.
They would both mark the region dirty, but you need a cluster-aware
entity that can handle properly marking the region clean when they are
done.  Otherwise, you might clear the region when the first machine is
done, not the second.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:35 +01:00
Mike Snitzer 754c5fc7eb dm: calculate queue limits during resume not load
Currently, device-mapper maintains a separate instance of 'struct
queue_limits' for each table of each device.  When the configuration of
a device is to be changed, first its table is loaded and this structure
is populated, then the device is 'resumed' and the calculated
queue_limits are applied.

This places restrictions on how userspace may process related devices,
where it is often advantageous to 'load' tables for several devices
at once before 'resuming' them together.  As the new queue_limits
only take effect after the 'resume', if they are changing and one
device uses another, the latter must be 'resumed' before the former
may be 'loaded'.

This patch moves the calculation of these queue_limits out of
the 'load' operation into 'resume'.  Since we are no longer
pre-calculating this struct, we no longer need to maintain copies
within our dm structs.

dm_set_device_limits() now passes the 'start' of the device's
data area (aka pe_start) as the 'offset' to blk_stack_limits().

init_valid_queue_limits() is replaced by blk_set_default_limits().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: martin.petersen@oracle.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:34 +01:00
Mike Snitzer 18d8594dd9 dm log: fix create_log_context to use logical_block_size of log device
create_log_context() must use the logical_block_size from the log disk,
where the I/O happens, not the target's logical_block_size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:33 +01:00
Mike Snitzer af4874e03e dm target:s introduce iterate devices fn
Add .iterate_devices to 'struct target_type' to allow a function to be
called for all devices in a DM target.  Implemented it for all targets
except those in dm-snap.c (origin and snapshot).

(The raid1 version number jumps to 1.12 because we originally reserved
1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors'
instead.)

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: martin.petersen@oracle.com
2009-06-22 10:12:33 +01:00
Mike Snitzer 1197764e40 dm table: establish queue limits by copying table limits
Copy the table's queue_limits to the DM device's request_queue.  This
properly initializes the queue's topology limits and also avoids having
to track the evolution of 'struct queue_limits' in
dm_table_set_restrictions()

Also fixes a bug that was introduced in dm_table_set_restrictions() via
commit ae03bf639a.  In addition to
establishing 'bounce_pfn' in the queue's limits blk_queue_bounce_limit()
also performs an allocation to setup the ISA DMA pool.  This allocation
resulted in "sleeping function called from invalid context" when called
from dm_table_set_restrictions().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:32 +01:00
Mike Snitzer 5ab97588fb dm table: replace struct io_restrictions with struct queue_limits
Use blk_stack_limits() to stack block limits (including topology) rather
than duplicate the equivalent within Device Mapper.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:32 +01:00
Mike Snitzer be6d4305db dm table: validate device logical_block_size
Impose necessary and sufficient conditions on a devices's table such
that any incoming bio which respects its logical_block_size can be
processed successfully.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:31 +01:00
Mike Snitzer 02acc3a4fa dm table: ensure targets are aligned to logical_block_size
Ensure I/O is aligned to the logical block size of target devices.

Rename check_device_area() to device_area_is_valid() for clarity and
establish the device limits including the logical block size prior to
calling it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:30 +01:00
Milan Broz 60935eb21d dm ioctl: support cookies for udev
Add support for passing a 32 bit "cookie" into the kernel with the
DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls.  The (unsigned)
value of this cookie is returned to userspace alongside the uevents
issued by these ioctls in the variable DM_COOKIE.

This means the userspace process issuing these ioctls can be notified
by udev after udev has completed any actions triggered.

To minimise the interface extension, we pass the cookie into the
kernel in the event_nr field which is otherwise unused when calling
these ioctls.  Incrementing the version number allows userspace to
determine in advance whether or not the kernel supports the cookie.
If the kernel does support this but userspace does not, there should
be no impact as the new variable will just get ignored.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:30 +01:00
Peter Rajnoha 486d220fe4 dm: sysfs add suspended attribute
Add a file named 'suspended' to each device-mapper device directory in
sysfs.  It holds the value 1 while the device is suspended.  Otherwise
it holds 0.

Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:29 +01:00
Jonthan Brassow 1b6da75459 dm table: improve warning message when devices not freed before destruction
Report any devices forgotten to be freed before a table is destroyed.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:29 +01:00
Kiyoshi Ueda f392ba889b dm mpath: add service time load balancer
This patch adds a service time oriented dynamic load balancer,
dm-service-time, which selects the path with the shortest estimated
service time for the incoming I/O.
The service time is estimated by dividing the in-flight I/O size
by a performance value of each path.

The performance value can be given as a table argument at the table
loading time.  If no performance value is given, all paths are
considered equal.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:28 +01:00
Kiyoshi Ueda fd5e033908 dm mpath: add queue length load balancer
This patch adds a dynamic load balancer, dm-queue-length, which
balances the number of in-flight I/Os across the paths.

The code is based on the patch posted by Stefan Bader:
https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:27 +01:00
Kiyoshi Ueda 02ab823fd1 dm mpath: add start_io and nr_bytes to path selectors
This patch makes two additions to the dm path selector interface for
dynamic load balancers:
  o a new hook, start_io()
  o a new parameter 'nr_bytes' to select_path()/start_io()/end_io()
    to pass the size of the I/O

start_io() is called when a target driver actually submits I/O
to the selected path.
Path selectors can use it to start accounting of the I/O.
(e.g. counting the number of in-flight I/Os.)
The start_io hook is based on the patch posted by Stefan Bader:
https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html

nr_bytes, the size of the I/O, is so path selectors can take the
size of the I/O into account when deciding which path to use.
dm-service-time uses it to estimate service time, for example.
(Added the nr_bytes member to dm_mpath_io instead of using existing
 details.bi_size, since request-based dm patch deletes it.)

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:27 +01:00
Mikulas Patocka 2bd0234525 dm snapshot: use barrier when writing exception store
Send barrier requests when updating the exception area.

Exception area updates need to be ordered w.r.t. data writes, so that
the writes are not reordered in hardware disk cache.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:26 +01:00
Mikulas Patocka 51aa322849 dm io: retry after barrier error
If -EOPNOTSUPP was returned and the request was a barrier request, retry it
without barrier.

Retry all regions for now. Barriers are submitted only for one-region requests,
so it doesn't matter.  (In the future, retries can be limited to the actual
regions that failed.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:26 +01:00
Mikulas Patocka 5af443a7e1 dm io: record eopnotsupp
Add another field, eopnotsupp_bits. It is subset of error_bits, representing
regions that returned -EOPNOTSUPP.  (The bit is set in both error_bits and
eopnotsupp_bits).

This value will be used in further patches.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:25 +01:00
Mikulas Patocka 494b3ee7d4 dm snapshot: support barriers
Flush support for dm-snapshot target.

This patch just forwards the flush request to either the origin or the snapshot
device.  (It doesn't flush exception store metadata.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:25 +01:00
Mikulas Patocka 8627921fa2 dm mpath: support barriers
Flush support for dm-multipath target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:24 +01:00
Mikulas Patocka c927259e34 dm delay: support barriers
Flush support for dm-delay target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:23 +01:00
Mikulas Patocka 647c7db14e dm crypt: support flush
Flush support for dm-crypt target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:23 +01:00
Mikulas Patocka 374bf7e7f6 dm: stripe support flush
Flush support for the stripe target.

This sets ti->num_flush_requests to the number of stripes and
remaps individual flush requests to the appropriate stripe devices.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:22 +01:00
Mikulas Patocka 433bcac564 dm: linear support flush
Flush support for the linear target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:22 +01:00
Mikulas Patocka 52b1fd5a27 dm: send empty barriers to targets in dm_flush
Pass empty barrier flushes to the targets in dm_flush().

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:21 +01:00
Alasdair G Kergon 9015df24a8 dm: initialise tio in alloc_tio
Move repeated dm_target_io initialisation inside alloc_tio().

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:21 +01:00
Mikulas Patocka f9ab94cee3 dm: introduce num_flush_requests
Introduce num_flush_requests for a target to set to say how many flush
instructions (empty barriers) it wants to receive.  These are sent by
__clone_and_map_empty_barrier with map_info->flush_request going from 0
to (num_flush_requests - 1).

Old targets without flush support won't receive any flush requests.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-06-22 10:12:20 +01:00