Commit Graph

3007 Commits

Author SHA1 Message Date
Joe Thornber 4d298d4d3d dm cache: fix missing ERR_PTR returns and handling
commit 766a78882ddf79b162243649d7dfdbac1fb6fb88 upstream.

Commit 9b1cc9f251 ("dm cache: share cache-metadata object across
inactive and active DM tables") mistakenly ignored the use of ERR_PTR
returns.  Restore missing IS_ERR checks and ERR_PTR returns where
appropriate.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05 22:35:37 -08:00
Joe Thornber b2c38d8fcf dm thin: don't allow messages to be sent to a pool target in READ_ONLY or FAIL mode
commit 2a7eaea02b99b6e267b1e89c79acc6e9a51cee3b upstream.

You can't modify the metadata in these modes.  It's better to fail these
messages immediately than let the block-manager deny write locks on
metadata blocks.  Otherwise these failed metadata changes will trigger
'needs_check' to get set in the metadata superblock -- requiring repair
using the thin_check utility.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05 22:35:36 -08:00
Dinesh K Garg ac77931ced md: dm-req-crypt: Change HW FDE to use AES 128
Change encryption algorithm for HW FDE to use AES 128

Change-Id: I3294a42e58c260738c1e77bd9406c1877e19ba28
Acked-by: Baranidharan Muthukumaran <bmuthuku@qti.qualcomm.com>
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2015-01-29 21:45:49 -08:00
Dinesh K Garg 7a4756618d md: dm-req-crypt: Reduce the max hw_sectors for dm-req-crypt queue
Split IO requests for dm-req-crypt to improve encryption performance
of smaller file sizes.

Change-Id: I4e0a6f011e92accf7f797d4a37b0640185cbe175
Acked-by: Baranidharan Muthukumaran <bmuthuku@qti.qualcomm.com>
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2015-01-29 21:43:58 -08:00
NeilBrown a4f2a8dd08 md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
commit 108cef3aa41669610e1836fe638812dd067d72de upstream.

It is critical that fetch_block() and handle_stripe_dirtying()
are consistent in their analysis of what needs to be loaded.
Otherwise raid5 can wait forever for a block that won't be loaded.

Currently when writing to a RAID5 that is resyncing, to a location
beyond the resync offset, handle_stripe_dirtying chooses a
reconstruct-write cycle, but fetch_block() assumes a
read-modify-write, and a lockup can happen.

So treat that case just like RAID6, just as we do in
handle_stripe_dirtying.  RAID6 always does reconstruct-write.

This bug was introduced when the behaviour of handle_stripe_dirtying
was changed in 3.7, so the patch is suitable for any kernel since,
though it will need careful merging for some versions.

Cc: stable@vger.kernel.org (v3.7+)
Fixes: a7854487cd
Reported-by: Henry Cai <henryplusplus@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:57 -08:00
Joe Thornber 5b39b457f9 dm cache: share cache-metadata object across inactive and active DM tables
commit 9b1cc9f251affdd27f29fe46d0989ba76c33faf6 upstream.

If a DM table is reloaded with an inactive table when the device is not
suspended (normal procedure for LVM2), then there will be two dm-bufio
objects that can diverge.  This can lead to a situation where the
inactive table uses bufio to read metadata at the same time the active
table writes metadata -- resulting in the inactive table having stale
metadata buffers once it is promoted to the active table slot.

Fix this by using reference counting and a global list of cache metadata
objects to ensure there is only one metadata object per metadata device.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29 17:40:55 -08:00
Dinesh K Garg d4e4400e72 dm: dm-req-crypt: Split read requests to reduce decryption latency
Split an IO into multiple requests so the the crypto accelerators
can be exercized in parallel to reduce latency.

Change-Id: I24b15568b5afd375ad39bf3b74f60743f0e1dde9
Acked-by: Baranidharan Muthukumaran <bmuthuku@qti.qualcomm.com>
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2015-01-14 12:20:01 -08:00
Dan Carpenter 81f250e1cf dm space map metadata: fix sm_bootstrap_get_nr_blocks()
commit c1c6156fe4d4577444b769d7edd5dd503e57bbc9 upstream.

This function isn't right and it causes a static checker warning:

	drivers/md/dm-thin.c:3016 maybe_resize_data_dev()
	error: potentially using uninitialized 'sb_data_size'.

It should set "*count" and return zero on success the same as the
sm_metadata_get_nr_blocks() function does earlier.

Fixes: 3241b1d3e0 ('dm: add persistent data library')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08 09:58:15 -08:00
Darrick J. Wong e1102c5b03 dm bufio: fix memleak when using a dm_buffer's inline bio
commit 445559cdcb98a141f5de415b94fd6eaccab87e6d upstream.

When dm-bufio sets out to use the bio built into a struct dm_buffer to
issue an IO, it needs to call bio_reset after it's done with the bio
so that we can free things attached to the bio such as the integrity
payload.  Therefore, inject our own endio callback to take care of
the bio_reset after calling submit_io's end_io callback.

Test case:
1. modprobe scsi_debug delay=0 dif=1 dix=199 ato=1 dev_size_mb=300
2. Set up a dm-bufio client, e.g. dm-verity, on the scsi_debug device
3. Repeatedly read metadata and watch kmalloc-192 leak!

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08 09:58:15 -08:00
NeilBrown 15225559b3 md/bitmap: always wait for writes on unplug.
commit 4b5060ddae2b03c5387321fafc089d242225697a upstream.

If two threads call bitmap_unplug at the same time, then
one might schedule all the writes, and the other might
decide that it doesn't need to wait.  But really it does.

It rarely hurts to wait when it isn't absolutely necessary,
and the current code doesn't really focus on 'absolutely necessary'
anyway.  So just wait always.

This can potentially lead to data corruption if a crash happens
at an awkward time and data was written before the bitmap was
updated.  It is very unlikely, but this should go to -stable
just to be safe.  Appropriate for any -stable.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08 09:58:15 -08:00
AnilKumar Chimata 3387a1b918 dm: Stop the dm_request calls in idle
Even if there are no pending peek work, dm layer trying to schedule
the work with 10ms time delay. This patch fixes the issue by putting
back the table entry.

Change-Id: I0e5df117fae74ae37a621862f254220706f0d840
Signed-off-by: AnilKumar Chimata <anilc@codeaurora.org>
2014-12-10 23:37:30 -08:00
Dinesh K Garg f281aa1822 md: Adding support of ICE in dm-req-crypt
Storage hardware can have embedded crypto engine which can greatly
reduce degradation in IO performance if crypto operations are performed
on data. Added support in dm-req-crypt so that it can work either in
transparent mode or crypto mode. In transparent mode, dm-req-crypt will
not perform any crypto operation by itself. In crypto mode, dm-req-crypt
will perform crypto operation on data using a seperate crypto engine
(SW based CE or HW based CE).

Change-Id: I8f27840899566c1a608ca13ce6b7480c9866fb6a
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-12-06 01:09:49 -08:00
Heinz Mauelshagen 513f8da83b dm raid: ensure superblock's size matches device's logical block size
commit 40d43c4b4cac4c2647bf07110d7b07d35f399a84 upstream.

The dm-raid superblock (struct dm_raid_superblock) is padded to 512
bytes and that size is being used to read it in from the metadata
device into one preallocated page.

Reading or writing this on a 512-byte sector device works fine but on
a 4096-byte sector device this fails.

Set the dm-raid superblock's size to the logical block size of the
metadata device, because IO at that size is guaranteed too work.  Also
add a size check to avoid silent partial metadata loss in case the
superblock should ever grow past the logical block size or PAGE_SIZE.

[includes pointer math fix from Dan Carpenter]
Reported-by: "Liuhua Wang" <lwang@suse.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:22:53 -08:00
Joe Thornber fe30b804a2 dm btree: fix a recursion depth bug in btree walking code
commit 9b460d3699324d570a4d4161c3741431887f102f upstream.

The walk code was using a 'ro_spine' to hold it's locked btree nodes.
But this data structure is designed for the rolling lock scheme, and
as such automatically unlocks blocks that are two steps up the call
chain.  This is not suitable for the simple recursive walk algorithm,
which retraces its steps.

This code is only used by the persistent array code, which in turn is
only used by dm-cache.  In order to trigger it you need to have a
mapping tree that is more than 2 levels deep; which equates to 8-16
million cache blocks.  For instance a 4T ssd with a very small block
size of 32k only just triggers this bug.

The fix just places the locked blocks on the stack, and stops using
the ro_spine altogether.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:22:53 -08:00
Alexey Khoroshilov fdef68fb1b dm log userspace: fix memory leak in dm_ulog_tfr_init failure path
commit 56ec16cb1e1ce46354de8511eef962a417c32c92 upstream.

If cn_add_callback() fails in dm_ulog_tfr_init(), it does not
deallocate prealloced memory but calls cn_del_callback().

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14 08:47:55 -08:00
Joe Thornber 3f62631789 dm bufio: update last_accessed when relinking a buffer
commit eb76faf53b1ff7a77ce3f78cc98ad392ac70c2a0 upstream.

The 'last_accessed' member of the dm_buffer structure was only set when
the the buffer was created.  This led to each buffer being discarded
after dm_bufio_max_age time even if it was used recently.  In practice
this resulted in all thinp metadata being evicted soon after being read
-- this is particularly problematic for metadata intensive workloads
like multithreaded small random IO.

'last_accessed' is now updated each time the buffer is moved to the head
of the LRU list, so the buffer is now properly discarded if it was not
used in dm_bufio_max_age time.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14 08:47:55 -08:00
Dinesh K Garg 8be3bfa1fa md: dm-req-crypt Recalculate nr_phys_segments
Recalculate nr_phys_segments after pages are allocated
for write requests. Move _req_crypt_io_pool allocation
and de-allocation to ctr and dtr instead of driver init
and exit.

Change-Id: I8576dce1f7c9bc39dcc975762562fb84a349bba7
Acked-by: Baranidharan Muthukumaran <bmuthuku@qti.qualcomm.com>
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-10-31 12:07:48 -07:00
Dinesh K Garg 66b2ca9d3a md: dm-req-crypt: Initialize global variables in constructor
dm-req-crypt was not initializing number of engines available for
crypto operation. Hence, number of engines were getting accumulated
everytime a device based of dm-req-crypt was created. This caused
crash in qcrypto module while retrieving the crypto engine.

Change-Id: I06b5b296a80ae4f9f6bfd024222be9f47a29bfce
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-10-30 10:34:35 -07:00
Amir Samuelov 812c6796d0 dm: dm-req-crypt: fix null pointer reference
When dm_get_device(&dev) fails the "dev" variable is NULL,
so calling dm_put_device(dev) crash the device.

[  125.744115] [<ffffffc000837ecc>] dm_put_device+0x14/0x58
[  125.749408] [<ffffffc000843794>] req_crypt_dtr+0xa8/0xbc
[  125.754696] [<ffffffc000843bc0>] req_crypt_ctr+0x418/0x440
[  125.760176] [<ffffffc0008387fc>] dm_table_add_target+0x24c/0x2f8
[  125.766167] [<ffffffc00083bfac>] table_load+0x27c/0x2b0
[  125.771377] [<ffffffc00083b90c>] ctl_ioctl+0x3a8/0x448
[  125.776496] [<ffffffc00083b9bc>] dm_ctl_ioctl+0x10/0x20
[  125.781706] [<ffffffc000301cd0>] vfs_ioctl+0x20/0x3c
[  125.786648] [<ffffffc000302618>] do_vfs_ioctl+0x49c/0x568
[  125.792031] [<ffffffc000302740>] SyS_ioctl+0x5c/0x88

Change-Id: Icef2841b1aede259ee0f697f21b1976cd4485991
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
2014-10-12 12:07:59 +03:00
NeilBrown 06905ff8a6 md/raid5: disable 'DISCARD' by default due to safety concerns.
commit 8e0e99ba64c7ba46133a7c8a3e3f7de01f23bd93 upstream.

It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint.  Some devices in some cases don't do what it
says on the label.

The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero.  If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks.  If all were zero, this would
be the case.  If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.

As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.

As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.

If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
    raid456.devices_handle_discard_safely=Y

As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7

Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf
Signed-off-by: NeilBrown <neilb@suse.de>
[bwh: Backported to 3.10: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-09 12:18:43 -07:00
Linux Build Service Account 35a37bc8a1 Merge "dm-req-crypt: check argc before access argv[5]" 2014-10-06 22:52:47 -07:00
NeilBrown a8f165f517 md/raid1: fix_read_error should act on all non-faulty devices.
commit b8cb6b4c121e1bf1963c16ed69e7adcb1bc301cd upstream.

If a devices is being recovered it is not InSync and is not Faulty.

If a read error is experienced on that device, fix_read_error()
will be called, but it ignores non-InSync devices.  So it will
neither fix the error nor fail the device.

It is incorrect that fix_read_error() ignores non-InSync devices.
It should only ignore Faulty devices.  So fix it.

This became a bug when we allowed reading from a device that was being
recovered.  It is suitable for any subsequent -stable kernel.

Fixes: da8840a747
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05 14:54:14 -07:00
Mikulas Patocka d283ad1a60 dm crypt: fix access beyond the end of allocated space
commit d49ec52ff6ddcda178fc2476a109cf1bd1fa19ed upstream.

The DM crypt target accesses memory beyond allocated space resulting in
a crash on 32 bit x86 systems.

This bug is very old (it dates back to 2.6.25 commit 3a7f6c990a "dm
crypt: use async crypto").  However, this bug was masked by the fact
that kmalloc rounds the size up to the next power of two.  This bug
wasn't exposed until 3.17-rc1 commit 298a9fa08a ("dm crypt: use per-bio
data").  By switching to using per-bio data there was no longer any
padding beyond the end of a dm-crypt allocated memory block.

To minimize allocation overhead dm-crypt puts several structures into one
block allocated with kmalloc.  The block holds struct ablkcipher_request,
cipher-specific scratch pad (crypto_ablkcipher_reqsize(any_tfm(cc))),
struct dm_crypt_request and an initialization vector.

The variable dmreq_start is set to offset of struct dm_crypt_request
within this memory block.  dm-crypt allocates the block with this size:
cc->dmreq_start + sizeof(struct dm_crypt_request) + cc->iv_size.

When accessing the initialization vector, dm-crypt uses the function
iv_of_dmreq, which performs this calculation: ALIGN((unsigned long)(dmreq
+ 1), crypto_ablkcipher_alignmask(any_tfm(cc)) + 1).

dm-crypt allocated "cc->iv_size" bytes beyond the end of dm_crypt_request
structure.  However, when dm-crypt accesses the initialization vector, it
takes a pointer to the end of dm_crypt_request, aligns it, and then uses
it as the initialization vector.  If the end of dm_crypt_request is not
aligned on a crypto_ablkcipher_alignmask(any_tfm(cc)) boundary the
alignment causes the initialization vector to point beyond the allocated
space.

Fix this bug by calculating the variable iv_size_padding and adding it
to the allocated size.

Also correct the alignment of dm_crypt_request.  struct dm_crypt_request
is specific to dm-crypt (it isn't used by the crypto subsystem at all),
so it is aligned on __alignof__(struct dm_crypt_request).

Also align per_bio_data_size on ARCH_KMALLOC_MINALIGN, so that it is
aligned as if the block was allocated with kmalloc.

Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05 14:54:12 -07:00
Linux Build Service Account b3634c8a61 Merge "md: dm-req-crypt: add message when mapping completed" 2014-09-29 07:55:18 -07:00
Dinesh K Garg 95f503fcba md: dm-req-crypt does not release the device
dm-req-crypt does not release the device which it got during
construction of dm-req-crypt based node. This causes issue if
dm-req-crypt based device is created and destroyed without
rebooting the device.

Change-Id: Ifeb1210a6e1cf365b8a656556082806a24f3e582
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-09-24 09:12:43 -07:00
Amir Samuelov e7706d9c92 md: dm-req-crypt: add message when mapping completed
Add info message when mapping a block-device to
the dm-req-crypt device-mapper target.
This is the first step before this driver can be used for
encryption of a block device.
It is called once per power-up when disk-encryption is enabled.

Change-Id: I51866b95cbe77d1a3d39fcd5d5d5297c78950fa2
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
2014-09-21 13:11:37 +03:00
NeilBrown b08633de6b md/raid1,raid10: always abort recover on write error.
commit 2446dba03f9dabe0b477a126cbeb377854785b47 upstream.

Currently we don't abort recovery on a write error if the write error
to the recovering device was triggerd by normal IO (as opposed to
recovery IO).

This means that for one bitmap region, the recovery might write to the
recovering device for a few sectors, then not bother for subsequent
sectors (as it never writes to failed devices).  In this case
the bitmap bit will be cleared, but it really shouldn't.

The result is that if the recovering device fails and is then re-added
(after fixing whatever hardware problem triggerred the failure),
the second recovery won't redo the region it was in the middle of,
so some of the device will not be recovered properly.

If we abort the recovery, the region being processes will be cancelled
(bit not cleared) and the whole region will be retried.

As the bug can result in data corruption the patch is suitable for
-stable.  For kernels prior to 3.11 there is a conflict in raid10.c
which will require care.

Original-from: jiao hui <jiaohui@bwstor.com.cn>
Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17 09:04:01 -07:00
NeilBrown 26584e18c4 md/raid10: Fix memory leak when raid10 reshape completes.
commit b39685526f46976bcd13aa08c82480092befa46c upstream.

When a raid10 commences a resync/recovery/reshape it allocates
some buffer space.
When a resync/recovery completes the buffer space is freed.  But not
when the reshape completes.
This can result in a small memory leak.

There is a subtle side-effect of this bug.  When a RAID10 is reshaped
to a larger array (more devices), the reshape is immediately followed
by a "resync" of the new space.  This "resync" will use the buffer
space which was allocated for "reshape".  This can cause problems
including a "BUG" in the SCSI layer.  So this is suitable for -stable.

Fixes: 3ea7daa5d7
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17 09:04:00 -07:00
NeilBrown 1075d2bdf6 md/raid10: fix memory leak when reshaping a RAID10.
commit ce0b0a46955d1bb389684a2605dbcaa990ba0154 upstream.

raid10 reshape clears unwanted bits from a bio->bi_flags using
a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
was added.
Since then it clears that bit but shouldn't.  This results in a
memory leak.

So change to used the approved method of clearing unwanted bits.

As this causes a memory leak which can consume all of memory
the fix is suitable for -stable.

Fixes: a38352e0ac
Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17 09:04:00 -07:00
NeilBrown 318a3d59ce md/raid6: avoid data corruption during recovery of double-degraded RAID6
commit 9c4bdf697c39805078392d5ddbbba5ae5680e0dd upstream.

During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.

If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.

This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.

Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then.  In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().

Fixes: 6c0069c0ae
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17 09:04:00 -07:00
Zhen Kong ef7f7bef08 dm-req-crypt: Remove high priority flag from workqueue
Remove WQ_HIGHPRI flag for the workqueue. WQ_HIGHPRI and
WQ_UNBOUND flags cannot be used together.

Change-Id: I4ca4eb6596552866049ea6402e492f463eeaed2c
Acked-by: Baranidharan Muthukumaran <bmuthuku@qti.qualcomm.com>
Signed-off-by: Zhen Kong <zkong@codeaurora.org>
2014-09-09 12:38:07 -07:00
Linux Build Service Account c2f950d396 Merge "dm: dm-req-crypt: Add multi-core crypto support" 2014-08-26 09:00:03 -07:00
William Clark 4d60bae866 dm: dm-req-crypt: Add multi-core crypto support
Add support for multiple crypto instances, ping-pong between
available crypto instances to increase performance

Change-Id: I562d77bf0f70ce6cab173dae18bc40fc5e766ba1
Signed-off-by: William Clark <wclark@codeaurora.org>
2014-08-22 13:34:01 -07:00
Ian Maund 6440f462f9 Merge upstream tag 'v3.10.49' into msm-3.10
* commit 'v3.10.49': (529 commits)
  Linux 3.10.49
  ACPI / battery: Retry to get battery information if failed during probing
  x86, ioremap: Speed up check for RAM pages
  Score: Modify the Makefile of Score, remove -mlong-calls for compiling
  Score: The commit is for compiling successfully.
  Score: Implement the function csum_ipv6_magic
  score: normalize global variables exported by vmlinux.lds
  rtmutex: Plug slow unlock race
  rtmutex: Handle deadlock detection smarter
  rtmutex: Detect changes in the pi lock chain
  rtmutex: Fix deadlock detector for real
  ring-buffer: Check if buffer exists before polling
  drm/radeon: stop poisoning the GART TLB
  drm/radeon: fix typo in golden register setup on evergreen
  ext4: disable synchronous transaction batching if max_batch_time==0
  ext4: clarify error count warning messages
  ext4: fix unjournalled bg descriptor while initializing inode bitmap
  dm io: fix a race condition in the wake up code for sync_io
  Drivers: hv: vmbus: Fix a bug in the channel callback dispatch code
  clk: spear3xx: Use proper control register offset
  ...

In addition to bringing in upstream commits, this merge also makes minor
changes to mainitain compatibility with upstream:

The definition of list_next_entry in qcrypto.c and ipa_dp.c has been
removed, as upstream has moved the definition to list.h. The implementation
of list_next_entry was identical between the two.

irq.c, for both arm and arm64 architecture, has had its calls to
__irq_set_affinity_locked updated to reflect changes to the API upstream.

Finally, as we have removed the sleep_length member variable of the
tick_sched struct, all changes made by upstream commit ec804bd do not
apply to our tree and have been removed from this merge. Only
kernel/time/tick-sched.c is impacted.

Change-Id: I63b7e0c1354812921c94804e1f3b33d1ad6ee3f1
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-08-20 13:23:09 -07:00
Amir Samuelov a262d0c39c dm-req-crypt: check argc before access argv[5]
for backward compatibility, argc == 5 is allowed.
however, check argc >= 6 before access argv[5].

Change-Id: Ib7baf741b302110bb170b8c915bfa3dbb223f606
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
2014-08-17 14:18:28 +03:00
Joonwoo Park 67449d3d3e arch: convert smp_mb__*()
Convert deprecated smp_mb__*() barriers.

Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-08-15 11:45:28 -07:00
Peter Zijlstra c5ac12693f arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 4e857c58efeb99393cba5a5d0d8ec7117183137c
[joonwoop@codeaurora.org: fixed trivial merge conflict.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-08-15 11:45:28 -07:00
Mike Snitzer 8a09a31a13 dm cache metadata: do not allow the data block size to change
commit 048e5a07f282c57815b3901d4a68a77fa131ce0a upstream.

The block size for the dm-cache's data device must remained fixed for
the life of the cache.  Disallow any attempt to change the cache's data
block size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28 08:00:07 -07:00
Mike Snitzer 4bbfb80c25 dm thin metadata: do not allow the data block size to change
commit 9aec8629ec829fc9403788cd959e05dd87988bd1 upstream.

The block size for the thin-pool's data device must remained fixed for
the life of the thin-pool.  Disallow any attempt to change the
thin-pool's data block size.

It should be noted that attempting to change the data block size via
thin-pool table reload will be ignored as a side-effect of the thin-pool
handover that the thin-pool target does during thin-pool table reload.

Here is an example outcome of attempting to load a thin-pool table that
reduced the thin-pool's data block size from 1024K to 512K.

Before:
kernel: device-mapper: thin: 253:4: growing the data device from 204800 to 409600 blocks

After:
kernel: device-mapper: thin metadata: changing the data block size (from 2048 to 1024) is not supported
kernel: device-mapper: table: 253:4: thin-pool: Error creating metadata object
kernel: device-mapper: ioctl: error adding target to table

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28 08:00:07 -07:00
Joe Thornber 09edef606e dm io: fix a race condition in the wake up code for sync_io
commit 10f1d5d111e8aed46a0f1179faf9a3cf422f689e upstream.

There's a race condition between the atomic_dec_and_test(&io->count)
in dec_count() and the waking of the sync_io() thread.  If the thread
is spuriously woken immediately after the decrement it may exit,
making the on stack io struct invalid, yet the dec_count could still
be using it.

Fix this race by using a completion in sync_io() and dec_count().

Reported-by: Minfei Huang <huangminfei@ucloud.cn>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17 15:58:02 -07:00
NeilBrown 527d01946f md: flush writes before starting a recovery.
commit 133d4527eab8d199a62eee6bd433f0776842df2e upstream.

When we write to a degraded array which has a bitmap, we
make sure the relevant bit in the bitmap remains set when
the write completes (so a 're-add' can quickly rebuilt a
temporarily-missing device).

If, immediately after such a write starts, we incorporate a spare,
commence recovery, and skip over the region where the write is
happening (because the 'needs recovery' flag isn't set yet),
then that write will not get to the new device.

Once the recovery finishes the new device will be trusted, but will
have incorrect data, leading to possible corruption.

We cannot set the 'needs recovery' flag when we start the write as we
do not know easily if the write will be "degraded" or not.  That
depends on details of the particular raid level and particular write
request.

This patch fixes a corruption issue of long standing and so it
suitable for any -stable kernel.  It applied correctly to 3.0 at
least and will minor editing to earlier kernels.

Reported-by: Bill <billstuff2001@sbcglobal.net>
Tested-by: Bill <billstuff2001@sbcglobal.net>
Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 11:14:02 -07:00
Lukas Czerner 563d232b4c dm thin: update discard_granularity to reflect the thin-pool blocksize
commit 09869de57ed2728ae3c619803932a86cb0e2c4f8 upstream.

DM thinp already checks whether the discard_granularity of the data
device is a factor of the thin-pool block size.  But when using the
dm-thin-pool's discard passdown support, DM thinp was not selecting the
max of the underlying data device's discard_granularity and the
thin-pool's block size.

Update set_discard_limits() to set discard_granularity to the max of
these values.  This enables blkdev_issue_discard() to properly align the
discards that are sent to the DM thin device on a full block boundary.
As such each discard will now cover an entire DM thin-pool block and the
block will be reclaimed.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 11:14:01 -07:00
Amir Samuelov 70b031848c platform: msm: fix PFT for 64-bit
Fix compile warning for 64-bit platform.

Change-Id: Ie19f331a893f3265eef70e70945e1edd3268c4b0
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
2014-07-06 13:51:12 +03:00
Amir Samuelov 8307869921 platform: msm: fix PFT when using direct-io
When read/write a file using dircet-io (O_DIRECT),
we can't get the inode from a bio by walking the path
bio->bi_io_vec->bv_page->mapping->host
since the page is anonymous and not mapped.
On the other more typical cases when not using O_DIRECT,
the page cache is used and the bv_page is available.

Change-Id: I349cad54a978ed9919f960d55f0f95c1e53262e5
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
2014-06-23 21:38:40 +03:00
Ian Maund 491fb5c232 Merge upstream tag 'v3.10.40' into msm-3.10
* commit 'v3.10.40': (203 commits)
  Linux 3.10.40
  ARC: !PREEMPT: Ensure Return to kernel mode is IRQ safe
  drm: cirrus: add power management support
  Input: synaptics - add min/max quirk for ThinkPad Edge E431
  Input: synaptics - add min/max quirk for ThinkPad T431s, L440, L540, S1 Yoga and X1
  lockd: ensure we tear down any live sockets when socket creation fails during lockd_up
  dm thin: fix dangling bio in process_deferred_bios error path
  dm transaction manager: fix corruption due to non-atomic transaction commit
  Skip intel_crt_init for Dell XPS 8700
  mtd: sm_ftl: heap corruption in sm_create_sysfs_attributes()
  mtd: nuc900_nand: NULL dereference in nuc900_nand_enable()
  mtd: atmel_nand: Disable subpage NAND write when using Atmel PMECC
  tgafb: fix data copying
  gpio: mxs: Allow for recursive enable_irq_wake() call
  rtlwifi: rtl8188ee: initialize packet_beacon
  rtlwifi: rtl8192se: Fix regression due to commit 1bf4bbb
  rtlwifi: rtl8192se: Fix too long disable of IRQs
  rtlwifi: rtl8192cu: Fix too long disable of IRQs
  rtlwifi: rtl8188ee: Fix too long disable of IRQs
  rtlwifi: rtl8723ae: Fix too long disable of IRQs
  ...

Change-Id: If5388cf980cb123e35e1b29275ba288c89c5aa18
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-06-18 13:10:54 -07:00
NeilBrown 88580b125a md: always set MD_RECOVERY_INTR when interrupting a reshape thread.
commit 2ac295a544dcae9299cba13ce250419117ae7fd1 upstream.

Commit 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97
   md: fix problem when adding device to read-only array with bitmap.

added a called to md_reap_sync_thread() which cause a reshape thread
to be interrupted (in particular, it could cause md_thread() to never even
call md_do_sync()).
However it didn't set MD_RECOVERY_INTR so ->finish_reshape() would not
know that the reshape didn't complete.

This only happens when mddev->ro is set and normally reshape threads
don't run in that situation.  But raid5 and raid10 can start a reshape
thread during "run" is the array is in the middle of a reshape.
They do this even if ->ro is set.

So it is best to set MD_RECOVERY_INTR before abortingg the
sync thread, just in case.

Though it rare for this to trigger a problem it can cause data corruption
because the reshape isn't finished properly.
So it is suitable for any stable which the offending commit was applied to.
(3.2 or later)

Fixes: 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11 12:03:24 -07:00
NeilBrown d86a5d297d md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync".
commit 3991b31ea072b070081ca3bfa860a077eda67de5 upstream.

If mddev->ro is set, md_to_sync will (correctly) abort.
However in that case MD_RECOVERY_INTR isn't set.

If a RESHAPE had been requested, then ->finish_reshape() will be
called and it will think the reshape was successful even though
nothing happened.

Normally a resync will not be requested if ->ro is set, but if an
array is stopped while a reshape is on-going, then when the array is
started, the reshape will be restarted.  If the array is also set
read-only at this point, the reshape will instantly appear to success,
resulting in data corruption.

Consequently, this patch is suitable for any -stable kernel.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11 12:03:24 -07:00
Heinz Mauelshagen 50faf9370b dm cache: always split discards on cache block boundaries
commit f1daa838e861ae1a0fb7cd9721a21258430fcc8c upstream.

The DM cache target cannot cope with discards that span multiple cache
blocks, so each discard bio that spans more than one cache block must
get split by the DM core.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11 12:03:23 -07:00
Linux Build Service Account 0ccc34fcb6 Merge "dm-crypt: sort writes" 2014-06-11 10:56:56 -07:00
Linux Build Service Account 057c6edf81 Merge "dm-crypt: offload writes to thread" 2014-06-11 10:56:55 -07:00
Linux Build Service Account 3e924cec74 Merge "dm-crypt: remove io_pool" 2014-06-11 10:56:54 -07:00
Linux Build Service Account 382821f43c Merge "dm-crypt: avoid deadlock in mempools" 2014-06-11 10:56:53 -07:00
Linux Build Service Account 416b849eab Merge "dm-crypt: don't allocate pages for a partial request" 2014-06-11 10:56:52 -07:00
Linux Build Service Account cfeac85536 Merge "dm-crypt: use unbound workqueue for request processing" 2014-06-11 10:56:51 -07:00
Linux Build Service Account 2a2d1f4e98 Merge "dm-crypt: use per-bio data" 2014-06-11 10:56:50 -07:00
Linux Build Service Account 07b061120d Merge "dm-crypt: remove per-cpu structure" 2014-06-11 10:56:48 -07:00
Mikulas Patocka 019c8ec9e3 dm crypt: fix cpu hotplug crash by removing per-cpu structure
commit 610f2de3559c383caf8fbbf91e9968102dff7ca0 upstream.

The DM crypt target used per-cpu structures to hold pointers to a
ablkcipher_request structure.  The code assumed that the work item keeps
executing on a single CPU, so it didn't use synchronization when
accessing this structure.

If a CPU is disabled by writing 0 to /sys/devices/system/cpu/cpu*/online,
the work item could be moved to another CPU.  This causes dm-crypt
crashes, like the following, because the code starts using an incorrect
ablkcipher_request:

 smpboot: CPU 7 is now offline
 BUG: unable to handle kernel NULL pointer dereference at 0000000000000130
 IP: [<ffffffffa1862b3d>] crypt_convert+0x12d/0x3c0 [dm_crypt]
 ...
 Call Trace:
  [<ffffffffa1864415>] ? kcryptd_crypt+0x305/0x470 [dm_crypt]
  [<ffffffff81062060>] ? finish_task_switch+0x40/0xc0
  [<ffffffff81052a28>] ? process_one_work+0x168/0x470
  [<ffffffff8105366b>] ? worker_thread+0x10b/0x390
  [<ffffffff81053560>] ? manage_workers.isra.26+0x290/0x290
  [<ffffffff81058d9f>] ? kthread+0xaf/0xc0
  [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
  [<ffffffff813464ac>] ? ret_from_fork+0x7c/0xb0
  [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120

Fix this bug by removing the per-cpu definition.  The structure
ablkcipher_request is accessed via a pointer from convert_context.
Consequently, if the work item is rescheduled to a different CPU, the
thread still uses the same ablkcipher_request.

This change may undermine performance improvements intended by commit
c0297721 ("dm crypt: scale to multiple cpus") on select hardware.  In
practice no performance difference was observed on recent hardware.  But
regardless, correctness is more important than performance.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 13:25:39 -07:00
NeilBrown 658864cc2c md: avoid possible spinning md thread at shutdown.
commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream.

If an md array with externally managed metadata (e.g. DDF or IMSM)
is in use, then we should not set safemode==2 at shutdown because:

1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
2/ The safemode management code doesn't cope with safemode==2 on external metadata
   and md_check_recover enters an infinite loop.

Even at shutdown, an infinite-looping process can be problematic, so this
could cause shutdown to hang.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 13:25:32 -07:00
Mikulas Patocka 364a180149 dm-crypt: sort writes
Write requests are sorted in a red-black tree structure and are submitted
in the sorted order.

In theory the sorting should be performed by the underlying disk scheduler,
however, in practice the disk scheduler accepts and sorts only 128 requests.
In order to sort more requests, we need to implement our own sorting.

CRs-fixed: 670391
Change-Id: Iffd9345fa1253f5cf2a556893ed36e08f1ac51aa
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:09
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-06-05 16:45:53 -07:00
Mikulas Patocka 980c01db22 dm-crypt: offload writes to thread
Submitting write bios directly in the encryption thread caused serious
performance degradation. On multiprocessor machine encryption requests
finish in a different order than they were submitted in. Consequently, write
requests would be submitted in a different order and it could cause severe
performance degradation.

This patch moves submitting write requests to a separate thread so that
the requests can be sorted before submitting.

Sorting is implemented in the next patch.

Note: it is required that a previous patch "dm-crypt: don't allocate pages
for a partial request." is applied before applying this patch. Without
that, this patch could introduce a crash.

CRs-fixed: 670391
Change-Id: I886ed2da0ff174d3539ea18e27170d7fd1062680
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:08
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-06-05 16:45:53 -07:00
Mikulas Patocka 30147f3526 dm-crypt: remove io_pool
Remove io_pool and _crypt_io_pool because they are unused.

CRs-fixed: 670391
Change-Id: I71400ecda66902c56c3af981e8b739f156db1e27
Signed-off-by: Mikulas Patocka <mpatock@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:08
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-06-05 16:45:52 -07:00
Mikulas Patocka e1254d136d dm-crypt: avoid deadlock in mempools
This patch fixes a theoretical deadlock introduced in the previous patch.

The function crypt_alloc_buffer may be called concurrently. If we allocate
from the mempool concurrently, there is a possibility of deadlock.
For example, if we have mempool of 256 pages, two processes, each wanting 256,
pages allocate from the mempool concurrently, it may deadlock in a situation
where both processes have allocated 128 pages and the mempool is exhausted.

In order to avoid this scenarios, we allocate the pages under a mutex.

In order to not degrade performance with excessive locking, we try
non-blocking allocations without a mutex first and if it fails, we fallback
to a blocking allocation with a mutex.

CRs-fixed: 670391
Change-Id: I6c391dece4ba44fe0b2e9b75ea2b9235bf1b525b
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:07
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-06-05 16:45:52 -07:00
Mikulas Patocka db4c8ba31c dm-crypt: don't allocate pages for a partial request
This patch changes crypt_alloc_buffer so that it always allocates pages for
a full request.

This change enables further simplification and removing of one refcounts
in the next patches.

Note: the next patch is needed to fix a theoretical deadlock

CRs-fixed: 670391
Change-Id: I7bcadac8b3450976366c701fceb1fee7cb18df85
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
[joonwoop@codeaurora.org: resolve trivial merge conflicts]
Patch-mainline: dm-devel @ 04/05/14, 14:07
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-06-05 16:45:52 -07:00
Mikulas Patocka 0a03cc03c9 dm-crypt: use unbound workqueue for request processing
Use unbound workqueue so that work is automatically ballanced between
available CPUs.

CRs-fixed: 670391
Change-Id: I169099d0b5b27535633c9d3aaab2037b5fea6aa9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
[joonwoop@codeaurora.org: resolve trivial merge conflict]
Patch-mainline: dm-devel @ 04/05/14, 14:06
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-06-05 16:45:52 -07:00
Mikulas Patocka 5b03d4b7bb dm-crypt: use per-bio data
This patch changes dm-crypt so that it uses auxiliary data allocated with
the bio.

Dm-crypt requires two allocations per request - struct dm_crypt_io and
struct ablkcipher_request (with other data appended to it). It used
mempool for the allocation.

Some requests may require more dm_crypt_ios and ablkcipher_requests,
however most requests need just one of each of these two structures to
complete.

This patch changes it so that the first dm_crypt_io and ablkcipher_request
and allocated with the bio (using target per_bio_data_size option). If the
request needs additional values, they are allocated from the mempool.

CRs-fixed: 670391
Change-Id: I8abc48a021391398f3b35bdd4ac9efbbec3a9fa5
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:05
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-06-05 16:45:52 -07:00
Mikulas Patocka 58423ac034 dm-crypt: remove per-cpu structure
Dm-crypt used per-cpu structures to hold pointers to ablkcipher_request.
The code assumed that the work item keeps executing on a single CPU, so it
used no synchronization when accessing this structure.

When we disable a CPU by writing zero to
/sys/devices/system/cpu/cpu*/online, the work item could be moved to
another CPU. This causes crashes in dm-crypt because the code starts using
a wrong ablkcipher_request.

This patch fixes this bug by removing the percpu definition. The structure
ablkcipher_request is accessed via a pointer from convert_context.
Consequently, if the work item is rescheduled to a different CPU, the
thread still uses the same ablkcipher_request.

CRs-fixed: 670391
Change-Id: Ib94281c719aedb1d53b853d38085d2e3f884665b
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Patch-mainline: dm-devel @ 04/05/14, 14:04
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-06-05 16:45:51 -07:00
AnilKumar Chimata 02a4f15149 crypto: msm: Move qcrypto.h header file
Move header file from architecture folder to include/linux
folder.

Change-Id: I20a653b272ec21419706cb02bc7c1beac20802eb
Signed-off-by: AnilKumar Chimata <anilc@codeaurora.org>
2014-06-06 04:04:23 +05:30
Amir Samuelov da92eea47b dm: dm-req-crypt: add support for Per-File-Encryption
Support Per-File-Encryption (PFE) based on file tagging.
The Per-File-Tagger (PFT) reports if a file should be encrypted or not.
The Per-File-Encryption can be used after Full-Disk-Encryption (FDE) was
completed or without FDE.
The PFE and FDE uses different keys, which are managed by the Trust-Zone.

Change-Id: I727ef11e252649f895a5e3f8a49ca848cea50795
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
2014-06-03 09:55:10 +03:00
NeilBrown fdc9aa48b7 md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails.
commit da1aab3dca9aa88ae34ca392470b8943159e25fe upstream.

When performing a user-request check/repair (MD_RECOVERY_REQUEST is set)
on a raid1, we allocate multiple bios each with their own set of pages.

If the page allocations for one bio fails, we currently do *not* free
the pages allocated for the previous bios, nor do we free the bio itself.

This patch frees all the already-allocate pages, and makes sure that
all the bios are freed as well.

This bug can cause a memory leak which can ultimately OOM a machine.
It was introduced in 3.10-rc1.

Fixes: a07876064a
Cc: Kent Overstreet <koverstreet@google.com>
Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30 21:52:12 -07:00
Dinesh K Garg 40acb5587b Setting correct HW crypto driver instance for FDE
HW crypto provides different instance for different use cases.
Dm-req-crypt should use the one for full disk encryption use case.

Change-Id: I22d3f6ab1dd6f83b0d5b83a681236719e2340934
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-05-15 14:48:36 -07:00
Mike Snitzer e2e8845036 dm thin: fix dangling bio in process_deferred_bios error path
commit fe76cd88e654124d1431bb662a0fc6e99ca811a5 upstream.

If unable to ensure_next_mapping() we must add the current bio, which
was removed from the @bios list via bio_list_pop, back to the
deferred_bios list before all the remaining @bios.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13 13:59:46 +02:00
Joe Thornber ff3296214c dm transaction manager: fix corruption due to non-atomic transaction commit
commit a9d45396f5956d0b615c7ae3b936afd888351a47 upstream.

The persistent-data library used by dm-thin, dm-cache, etc is
transactional.  If anything goes wrong, such as an io error when writing
new metadata or a power failure, then we roll back to the last
transaction.

Atomicity when committing a transaction is achieved by:

a) Never overwriting data from the previous transaction.
b) Writing the superblock last, after all other metadata has hit the
   disk.

This commit and the following commit ("dm: take care to copy the space
map roots before locking the superblock") fix a bug associated with (b).
When committing it was possible for the superblock to still be written
in spite of an io error occurring during the preceeding metadata flush.
With these commits we're careful not to take the write lock out on the
superblock until after the metadata flush has completed.

Change the transaction manager's semantics for dm_tm_commit() to assume
all data has been flushed _before_ the single superblock that is passed
in.

As a prerequisite, split the block manager's block unlocking and
flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush().  Now
the unlocking must be done separately.

This issue was discovered by forcing io errors at the crucial time
using dm-flakey.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13 13:59:45 +02:00
Ian Maund 356fb13538 Merge upstream linux-stable v3.10.36 into msm-3.10
* commit 'v3.10.36': (494 commits)
  Linux 3.10.36
  netfilter: nf_conntrack_dccp: fix skb_header_pointer API usages
  mm: close PageTail race
  net: mvneta: rename MVNETA_GMAC2_PSC_ENABLE to MVNETA_GMAC2_PCS_ENABLE
  x86: fix boot on uniprocessor systems
  Input: cypress_ps2 - don't report as a button pads
  Input: synaptics - add manual min/max quirk for ThinkPad X240
  Input: synaptics - add manual min/max quirk
  Input: mousedev - fix race when creating mixed device
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  Linux 3.10.35
  sched/autogroup: Fix race with task_groups list
  e100: Fix "disabling already-disabled device" warning
  xhci: Fix resume issues on Renesas chips in Samsung laptops
  Input: wacom - make sure touch_max is set for touch devices
  KVM: VMX: fix use after free of vmx->loaded_vmcs
  KVM: x86: handle invalid root_hpa everywhere
  KVM: MMU: handle invalid root_hpa at __direct_map
  Input: elantech - improve clickpad detection
  ARM: highbank: avoid L2 cache smc calls when PL310 is not present
  ...

Change-Id: Ib68f565291702c53df09e914e637930c5d3e5310
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-04-23 16:23:49 -07:00
Dinesh K Garg def50e6b5e dm: Clean up dm-req-crypt
dm-req-crypt defines optional functions from device mapper. These
functions are not required for dm-req-crypt. dm-req-crypt has few
memory de-allocation issues. This change removes unnecessary
functions and fixes memory deallocation issues as well.

Change-Id: Id36bd13989940ce41571a8a10277730048df6f6d
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-04-14 21:49:31 -07:00
Ian Maund f1b32d4e47 Merge upstream linux-stable v3.10.28 into msm-3.10
The following commits have been reverted from this merge, as they are
known to introduce new bugs and are currently incompatible with our
audio implementation. Investigation of these commits is ongoing, and
they are expected to be brought in at a later time:

86e6de7 ALSA: compress: fix drain calls blocking other compress functions (v6)
16442d4 ALSA: compress: fix drain calls blocking other compress functions

This merge commit also includes a change in block, necessary for
compilation. Upstream has modified elevator_init_fn to prevent race
conditions, requring updates to row_init_queue and test_init_queue.

* commit 'v3.10.28': (1964 commits)
  Linux 3.10.28
  ARM: 7938/1: OMAP4/highbank: Flush L2 cache before disabling
  drm/i915: Don't grab crtc mutexes in intel_modeset_gem_init()
  serial: amba-pl011: use port lock to guard control register access
  mm: Make {,set}page_address() static inline if WANT_PAGE_VIRTUAL
  md/raid5: Fix possible confusion when multiple write errors occur.
  md/raid10: fix two bugs in handling of known-bad-blocks.
  md/raid10: fix bug when raid10 recovery fails to recover a block.
  md: fix problem when adding device to read-only array with bitmap.
  drm/i915: fix DDI PLLs HW state readout code
  nilfs2: fix segctor bug that causes file system corruption
  thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only
  ftrace/x86: Load ftrace_ops in parameter not the variable holding it
  SELinux: Fix possible NULL pointer dereference in selinux_inode_permission()
  writeback: Fix data corruption on NFS
  hwmon: (coretemp) Fix truncated name of alarm attributes
  vfs: In d_path don't call d_dname on a mount point
  staging: comedi: adl_pci9111: fix incorrect irq passed to request_irq()
  staging: comedi: addi_apci_1032: fix subdevice type/flags bug
  mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
  GFS2: Increase i_writecount during gfs2_setattr_chown
  perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h
  perf scripting perl: Fix build error on Fedora 12
  ARM: 7815/1: kexec: offline non panic CPUs on Kdump panic
  Linux 3.10.27
  sched: Guarantee new group-entities always have weight
  sched: Fix hrtimer_cancel()/rq->lock deadlock
  sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
  sched: Fix race on toggling cfs_bandwidth_used
  x86, fpu, amd: Clear exceptions in AMD FXSAVE workaround
  netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
  SCSI: sd: Reduce buffer size for vpd request
  intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.
  mac80211: move "bufferable MMPDU" check to fix AP mode scan
  ACPI / Battery: Add a _BIX quirk for NEC LZ750/LS
  ACPI / TPM: fix memory leak when walking ACPI namespace
  mfd: rtsx_pcr: Disable interrupts before cancelling delayed works
  clk: exynos5250: fix sysmmu_mfc{l,r} gate clocks
  clk: samsung: exynos5250: Add CLK_IGNORE_UNUSED flag for the sysreg clock
  clk: samsung: exynos4: Correct SRC_MFC register
  clk: clk-divider: fix divisor > 255 bug
  ahci: add PCI ID for Marvell 88SE9170 SATA controller
  parisc: Ensure full cache coherency for kmap/kunmap
  drm/nouveau/bios: make jump conditional
  ARM: shmobile: mackerel: Fix coherent DMA mask
  ARM: shmobile: armadillo: Fix coherent DMA mask
  ARM: shmobile: kzm9g: Fix coherent DMA mask
  ARM: dts: exynos5250: Fix MDMA0 clock number
  ARM: fix "bad mode in ... handler" message for undefined instructions
  ARM: fix footbridge clockevent device
  net: Loosen constraints for recalculating checksum in skb_segment()
  bridge: use spin_lock_bh() in br_multicast_set_hash_max
  netpoll: Fix missing TXQ unlock and and OOPS.
  net: llc: fix use after free in llc_ui_recvmsg
  virtio-net: fix refill races during restore
  virtio_net: don't leak memory or block when too many frags
  virtio-net: make all RX paths handle errors consistently
  virtio_net: fix error handling for mergeable buffers
  vlan: Fix header ops passthru when doing TX VLAN offload.
  net: rose: restore old recvmsg behavior
  rds: prevent dereference of a NULL device
  ipv6: always set the new created dst's from in ip6_rt_copy
  net: fec: fix potential use after free
  hamradio/yam: fix info leak in ioctl
  drivers/net/hamradio: Integer overflow in hdlcdrv_ioctl()
  net: inet_diag: zero out uninitialized idiag_{src,dst} fields
  ip_gre: fix msg_name parsing for recvfrom/recvmsg
  net: unix: allow bind to fail on mutex lock
  ipv6: fix illegal mac_header comparison on 32bit
  netvsc: don't flush peers notifying work during setting mtu
  tg3: Initialize REG_BASE_ADDR at PCI config offset 120 to 0
  net: unix: allow set_peek_off to fail
  net: drop_monitor: fix the value of maxattr
  ipv6: don't count addrconf generated routes against gc limit
  packet: fix send path when running with proto == 0
  virtio: delete napi structures from netdev before releasing memory
  macvtap: signal truncated packets
  tun: update file current position
  macvtap: update file current position
  macvtap: Do not double-count received packets
  rds: prevent BUG_ON triggered on congestion update to loopback
  net: do not pretend FRAGLIST support
  IPv6: Fixed support for blackhole and prohibit routes
  HID: Revert "Revert "HID: Fix logitech-dj: missing Unifying device issue""
  gpio-rcar: R-Car GPIO IRQ share interrupt
  clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
  irqchip: renesas-irqc: Fix irqc_probe error handling
  Linux 3.10.26
  sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
  ext4: fix bigalloc regression
  arm64: Use Normal NonCacheable memory for writecombine
  arm64: Do not flush the D-cache for anonymous pages
  arm64: Avoid cache flushing in flush_dcache_page()
  ARM: KVM: arch_timers: zero CNTVOFF upon return to host
  ARM: hyp: initialize CNTVOFF to zero
  clocksource: arch_timer: use virtual counters
  arm64: Remove unused cpu_name ascii in arch/arm64/mm/proc.S
  arm64: dts: Reserve the memory used for secondary CPU release address
  arm64: check for number of arguments in syscall_get/set_arguments()
  arm64: fix possible invalid FPSIMD initialization state
  ...

Change-Id: Ia0e5d71b536ab49ec3a1179d59238c05bdd03106
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-03-24 14:28:34 -07:00
Heinz Mauelshagen bd86e7cb77 dm cache: fix access beyond end of origin device
commit e893fba90c09f9b57fb97daae204ea9cc2c52fa5 upstream.

In order to avoid wasting cache space a partial block at the end of the
origin device is not cached.  Unfortunately, the check for such a
partial block at the end of the origin device was flawed.

Fix accesses beyond the end of the origin device that occured due to
attempted promotion of an undetected partial block by:

- initializing the per bio data struct to allow cache_end_io to work properly
- recognizing access to the partial block at the end of the origin device
- avoiding out of bounds access to the discard bitset

Otherwise, users can experience errors like the following:

 attempt to access beyond end of device
 dm-5: rw=0, want=20971520, limit=20971456
 ...
 device-mapper: cache: promotion failed; couldn't copy block

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23 21:38:18 -07:00
Heinz Mauelshagen e88217a8ee dm cache: fix truncation bug when copying a block to/from >2TB fast device
commit 8b9d96666529a979acf4825391efcc7c8a3e9f12 upstream.

During demotion or promotion to a cache's >2TB fast device we must not
truncate the cache block's associated sector to 32bits.  The 32bit
temporary result of from_cblock() caused a 32bit multiplication when
calculating the sector of the fast device in issue_copy_real().

Use an intermediate 64bit type to store the 32bit from_cblock() to allow
for proper 64bit multiplication.

Here is an example of how this bug manifests on an ext4 filesystem:

 EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
 JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23 21:38:18 -07:00
Mike Snitzer fe8ee730af dm thin: fix the error path for the thin device constructor
commit 1acacc0784aab45627b6009e0e9224886279ac0b upstream.

dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
fails in thin_ctr().  Otherwise __pool_destroy() will fail because the
pool will still have an open thin device:

 device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
 device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.

Also, must establish error code if failing thin_ctr() because the pool
is in fail_io mode.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06 21:30:14 -08:00
Mike Snitzer 4f48d3328e dm thin: avoid metadata commit if a pool's thin devices haven't changed
commit 4d1662a30dde6e545086fe0e8fd7e474c4e0b639 upstream.

Commit 905e51b ("dm thin: commit outstanding data every second")
introduced a periodic commit.  This commit occurs regardless of whether
any thin devices have made changes.

Fix the periodic commit to check if any of a pool's thin devices have
changed using dm_pool_changed_this_transaction().

Reported-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06 21:30:13 -08:00
Hannes Reinecke f4124bc34f dm mpath: fix stalls when handling invalid ioctls
commit a1989b330093578ea5470bea0a00f940c444c466 upstream.

An invalid ioctl will never be valid, irrespective of whether multipath
has active paths or not.  So for invalid ioctls we do not have to wait
for multipath to activate any paths, but can rather return an error
code immediately.  This fix resolves numerous instances of:

 udevd[]: worker [] unexpectedly returned with status 0x0100

that have been seen during testing.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06 21:30:13 -08:00
Dinesh K Garg d9b69c0618 dm: Adding support of AES 256 bits in dm-req-crypt
dm-req-crypt currently uses AES 128 bit. Updating dm-req-crypt
to use AES 256 bit instead.

Change-Id: I2e8a4484b25496d53f1f9aa83ec6c0ed7b947901
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-02-26 11:27:03 -08:00
Oleg Nesterov 4d4ef86d44 md/raid5: Fix CPU hotplug callback registration
commit 789b5e0315284463617e106baad360cb9e8db3ac upstream.

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:

	register_cpu_notifier(&foobar_cpu_notifier);

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	put_online_cpus();

A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.

So reorganize the code in raid5 this way to fix the deadlock with callback
registration.

Cc: linux-raid@vger.kernel.org
Fixes: 36d1c6476b
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22 12:41:29 -08:00
NeilBrown 9f2d289933 md/raid1: restore ability for check and repair to fix read errors.
commit 1877db75589a895bbdc4c4c3f23558e57b521141 upstream.

commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
    md/raid1: fix bio handling problems in process_checks()

Move the bio_reset() to a point before where BIO_UPTODATE is checked,
so that check now always report that the bio is uptodate, even if it is not.

This causes process_check() to sometimes treat read-errors as
successful matches so the good data isn't written out.

This patch preserves the flag until it is needed.

Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed
an even worse bug).  So suitable for any -stable since 3.10.

Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru>
Fixed: 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22 12:41:29 -08:00
Mikulas Patocka 4f66403630 dm sysfs: fix a module unload race
commit 2995fa78e423d7193f3b57835f6c1c75006a0315 upstream.

This reverts commit be35f48610 ("dm: wait until embedded kobject is
released before destroying a device") and provides an improved fix.

The kobject release code that calls the completion must be placed in a
non-module file, otherwise there is a module unload race (if the process
calling dm_kobject_release is preempted and the DM module unloaded after
the completion is triggered, but before dm_kobject_release returns).

To fix this race, this patch moves the completion code to dm-builtin.c
which is always compiled directly into the kernel if BLK_DEV_DM is
selected.

The patch introduces a new dm_kobject_holder structure, its purpose is
to keep the completion and kobject in one place, so that it can be
accessed from non-module code without the need to export the layout of
struct mapped_device to that code.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13 13:48:02 -08:00
Joe Thornber 88972eec55 dm space map metadata: fix bug in resizing of thin metadata
commit fca028438fb903852beaf7c3fe1cd326651af57d upstream.

This bug was introduced in commit 7e664b3dec431e ("dm space map metadata:
fix extending the space map").

When extending a dm-thin metadata volume we:

- Switch the space map into a simple bootstrap mode, which allocates
  all space linearly from the newly added space.
- Add new bitmap entries for the new space
- Increment the reference counts for those newly allocated bitmap
  entries
- Commit changes to disk
- Switch back out of bootstrap mode.

But, the disk commit may allocate space itself, if so this fact will be
lost when switching out of bootstrap mode.

The bug exhibited itself as an error when the bitmap_root, with an
erroneous ref count of 0, was subsequently decremented as part of a
later disk commit.  This would cause the disk commit to fail, and thinp
to enter read_only mode.  The metadata was not damaged (thin_check
passed).

The fix is to put the increments + commit into a loop, running until
the commit has not allocated extra space.  In practise this loop only
runs twice.

With this fix the following device mapper testsuite test passes:
 dmtest run --suite thin-provisioning -n thin_remove_works_after_resize

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13 13:48:01 -08:00
Joe Thornber f15396a369 dm space map metadata: fix extending the space map
commit 7e664b3dec431eebf0c5df5ff704d6197634cf35 upstream.

When extending a metadata space map we should do the first commit whilst
still in bootstrap mode -- a mode where all blocks get allocated in the
new area.

That way the commit overhead is allocated from the newly added space.
Otherwise we risk running out of space.

With this fix, and the previous commit "dm space map common: make sure
new space is used during extend", the following device mapper testsuite
test passes:
 dmtest run --suite thin-provisioning -n /resize_metadata_no_io/

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13 13:48:01 -08:00
Joe Thornber 247373410a dm space map common: make sure new space is used during extend
commit 12c91a5c2d2a8e8cc40a9552313e1e7b0a2d9ee3 upstream.

When extending a low level space map we should update nr_blocks at
the start so the new space is used for the index entries.

Otherwise extend can fail, e.g.: sm_metadata_extend call sequence
that fails:
 -> sm_ll_extend
    -> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block
    => returns -ENOSPC because smm->begin == smm->ll.nr_blocks

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13 13:48:01 -08:00
Mikulas Patocka eef2b6df86 dm: wait until embedded kobject is released before destroying a device
commit be35f486108227e10fe5d96fd42fb2b344c59983 upstream.

There may be other parts of the kernel holding a reference on the dm
kobject.  We must wait until all references are dropped before
deallocating the mapped_device structure.

The dm_kobject_release method signals that all references are dropped
via completion.  But dm_kobject_release doesn't free the kobject (which
is embedded in the mapped_device structure).

This is the sequence of operations:
* when destroying a DM device, call kobject_put from dm_sysfs_exit
* wait until all users stop using the kobject, when it happens the
  release method is called
* the release method signals the completion and should return without
  delay
* the dm device removal code that waits on the completion continues
* the dm device removal code drops the dm_mod reference the device had
* the dm device removal code frees the mapped_device structure that
  contains the kobject

Using kobject this way should avoid the module unload race that was
mentioned at the beginning of this thread:
https://lkml.org/lkml/2014/1/4/83

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13 13:48:01 -08:00
Mike Snitzer 025d61e0f1 dm thin: initialize dm_thin_new_mapping returned by get_next_mapping
commit 16961b042db8cc5cf75d782b4255193ad56e1d4f upstream.

As additional members are added to the dm_thin_new_mapping structure
care should be taken to make sure they get initialized before use.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13 13:48:01 -08:00
Joe Thornber 614319dfb0 dm thin: fix discard support to a previously shared block
commit 19fa1a6756ed9e92daa9537c03b47d6b55cc2316 upstream.

If a snapshot is created and later deleted the origin dm_thin_device's
snapshotted_time will have been updated to reflect the snapshot's
creation time.  The 'shared' flag in the dm_thin_lookup_result struct
returned from dm_thin_find_block() is an approximation based on
snapshotted_time -- this is done to avoid 0(n), or worse, time
complexity.  In this case, the shared flag would be true.

But because the 'shared' flag reflects an approximation a block can be
incorrectly assumed to be shared (e.g. false positive for 'shared'
because the snapshot no longer exists).  This could result in discards
issued to a thin device not being passed down to the pool's underlying
data device.

To fix this we double check that a thin block is really still in-use
after a mapping is removed using dm_pool_block_is_used().  If the
reference count for a block is now zero the discard is allowed to be
passed down.

Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
structure -- reflects that the 'shared' flag in the response from
dm_thin_find_block() can only be held as definitive if false is
returned.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13 13:48:01 -08:00
Kent Overstreet f4ac67e8e7 bcache: Data corruption fix
commit ef71ec00002d92a08eb27e9d036e3d48835b6597 upstream.

The code that handles overlapping extents that we've just read back in from disk
was depending on the behaviour of the code that handles overlapping extents as
we're inserting into a btree node in the case of an insert that forced an
existing extent to be split: on insert, if we had to split we'd also insert a
new extent to represent the top part of the old extent - and then that new
extent would get written out.

The code that read the extents back in thus not bother with splitting extents -
if it saw an extent that ovelapped in the middle of an older extent, it would
trim the old extent to only represent the bottom part, assuming that the
original insert would've inserted a new extent to represent the top part.

I still haven't figured out _how_ it can happen, but I'm now pretty convinced
(and testing has confirmed) that there's some kind of an obscure corner case
(probably involving extent merging, and multiple overwrites in different sets)
that breaks this. The fix is to change the mergesort fixup code to split extents
itself when required.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06 11:08:16 -08:00
NeilBrown 6ba854e9bf md/raid5: fix long-standing problem with bitmap handling on write failure.
commit 9f97e4b128d2ea90a5f5063ea0ee3b0911f4c669 upstream.

Before a write starts we set a bit in the write-intent bitmap.
When the write completes we clear that bit if the write was successful
to all devices.  However if the write wasn't fully successful we
should not clear the bit.  If the faulty drive is subsequently
re-added, the fact that the bit is still set ensure that we will
re-write the data that is missing.

This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
bitmap bit when this flag is not set.
Currently we correctly set the flag if a write starts when some
devices are failed or missing.  But we do *not* set the flag if some
device failed during the write attempt.
This is wrong and can result in clearing the bit inappropriately.

So: set the flag when a write fails.

This bug has been present since bitmaps were introduces, so the fix is
suitable for any -stable kernel.

Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06 11:08:12 -08:00
Dinesh K Garg 4aab49333d dm: updating crypto algorithm used by dm-req-crypt
dm-req-crypt uses AES-XTS algorithm implemented by HW crypto
engine. Crypto driver renamed the aes-xts algorithm to avoid
conflict with SW based implementation of AES-XTS. dm-req-crypt
must use the new name for AES-XTS provided by crypto driver.

Change-Id: I286b6435af33fd64673c7c1af4bb2447f3115868
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-02-03 09:28:19 -08:00
NeilBrown c44bc4496d md/raid5: Fix possible confusion when multiple write errors occur.
commit 1cc03eb93245e63b0b7a7832165efdc52e25b4e6 upstream.

commit 5d8c71f9e5
    md: raid5 crash during degradation

Fixed a crash in an overly simplistic way which could leave
R5_WriteError or R5_MadeGood set in the stripe cache for devices
for which it is no longer relevant.
When those devices are removed and spares added the flags are still
set and can cause incorrect behaviour.

commit 14a75d3e07
    md/raid5: preferentially read from replacement device if possible.

Fixed the same bug if a more effective way, so we can now revert
the original commit.

Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 5d8c71f9e5
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25 08:27:12 -08:00
NeilBrown ddd618aa9e md/raid10: fix two bugs in handling of known-bad-blocks.
commit b50c259e25d9260b9108dc0c2964c26e5ecbe1c1 upstream.

If we discover a bad block when reading we split the request and
potentially read some of it from a different device.

The code path of this has two bugs in RAID10.
1/ we get a spin_lock with _irq, but unlock without _irq!!
2/ The calculation of 'sectors_handled' is wrong, as can be clearly
   seen by comparison with raid1.c

This leads to at least 2 warnings and a probable crash is a RAID10
ever had known bad blocks.

Fixes: 856e08e237
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25 08:27:12 -08:00
NeilBrown 16342c21c9 md/raid10: fix bug when raid10 recovery fails to recover a block.
commit e8b849158508565e0cd6bc80061124afc5879160 upstream.

commit e875ecea26
    md/raid10 record bad blocks as needed during recovery.

added code to the "cannot recover this block" path to record a bad
block rather than fail the whole recovery.
Unfortunately this new case was placed *after* r10bio was freed rather
than *before*, yet it still uses r10bio.
This is will crash with a null dereference.

So move the freeing of r10bio down where it is safe.

Fixes: e875ecea26
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25 08:27:12 -08:00
NeilBrown bb4a65df30 md: fix problem when adding device to read-only array with bitmap.
commit 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97 upstream.

If an array is started degraded, and then the missing device
is found it can be re-added and a minimal bitmap-based recovery
will bring it fully up-to-date.

If the array is read-only a recovery would not be allowed.
But also if the array is read-only and the missing device was
present very recently, then there could be no need for any
recovery at all, so we simply include the device in the read-only
array without any recovery.

However... if the missing device was removed a little longer ago
it could be missing some updates, but if a bitmap is present it will
be conditionally accepted pending a bitmap-based update.  We don't
currently detect this case properly and will include that old
device into the read-only array with no recovery even though it really
needs a recovery.

This patch keeps track of whether a bitmap-based-recovery is really
needed or not in the new Bitmap_sync rdev flag.  If that is set,
then the device will not be added to a read-only array.

Cc: Andrei Warkentin <andreiw@vmware.com>
Fixes: d70ed2e4fa
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25 08:27:12 -08:00
Dinesh K Garg e6304b0351 dm: Request based dm-crypt
dm-crypt provides bios based device mapper module. dm-crypt
operates on packets with 512 bytes size which is not effiicent
way for HW based crypto blocks. dm-req-crypt is developed to
address this. dm-req-crypt works on requests which carry upto
512KB of data for unmerged requests.

Change-Id: I7d6a63d516dc2dbe80f46c06dd0722847d55bc9f
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-01-18 14:11:36 -08:00
Joe Thornber f4cf4b1b2d dm thin: switch to read only mode if a mapping insert fails
commit fafc7a815e40255d24e80a1cb7365892362fa398 upstream.

Switch the thin pool to read-only mode when dm_thin_insert_block() fails
since there is little reason to expect the cause of the failure to be
resolved without further action by user space.

This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/

The quantity of errors logged in this case must be reduced.

before patch:

device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
<snip ... these repeat for a long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode

after patch:

device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20 07:45:11 -08:00
Mikulas Patocka 135949c10b dm table: fail dm_table_create on dm_round_up overflow
commit 5b2d06576c5410c10d95adfd5c4d8b24de861d87 upstream.

The dm_round_up function may overflow to zero.  In this case,
dm_table_create() must fail rather than go on to allocate an empty array
with alloc_targets().

This fixes a possible memory corruption that could be caused by passing
too large a number in "param->target_count".

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20 07:45:11 -08:00
Mike Snitzer 2c54d62aa8 dm space map metadata: return on failure in sm_metadata_new_block
commit f62b6b8f498658a9d537c7d380e9966f15e1b2a1 upstream.

Commit 2fc48021f4 ("dm persistent
metadata: add space map threshold callback") introduced a regression
to the metadata block allocation path that resulted in errors being
ignored.  This regression was uncovered by running the following
device-mapper-test-suite test:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/

The ignored error codes in sm_metadata_new_block() could crash the
kernel through use of either the dm-thin or dm-cache targets, e.g.:

device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
device-mapper: space map metadata: unable to allocate new metadata block
general protection fault: 0000 [#1] SMP
...
Workqueue: dm-thin do_worker [dm_thin_pool]
task: ffff880035ce2ab0 ti: ffff88021a054000 task.ti: ffff88021a054000
RIP: 0010:[<ffffffffa0331385>]  [<ffffffffa0331385>] metadata_ll_load_ie+0x15/0x30 [dm_persistent_data]
RSP: 0018:ffff88021a055a68  EFLAGS: 00010202
RAX: 003fc8243d212ba0 RBX: ffff88021a780070 RCX: ffff88021a055a78
RDX: ffff88021a055a78 RSI: 0040402222a92a80 RDI: ffff88021a780070
RBP: ffff88021a055a68 R08: ffff88021a055ba4 R09: 0000000000000010
R10: 0000000000000000 R11: 00000002a02e1000 R12: ffff88021a055ad4
R13: 0000000000000598 R14: ffffffffa0338470 R15: ffff88021a055ba4
FS:  0000000000000000(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f467c0291b8 CR3: 0000000001a0b000 CR4: 00000000000007e0
Stack:
 ffff88021a055ab8 ffffffffa0332020 ffff88021a055b30 0000000000000001
 ffff88021a055b30 0000000000000000 ffff88021a055b18 0000000000000000
 ffff88021a055ba4 ffff88021a055b98 ffff88021a055ae8 ffffffffa033304c
Call Trace:
 [<ffffffffa0332020>] sm_ll_lookup_bitmap+0x40/0xa0 [dm_persistent_data]
 [<ffffffffa033304c>] sm_metadata_count_is_more_than_one+0x8c/0xc0 [dm_persistent_data]
 [<ffffffffa0333825>] dm_tm_shadow_block+0x65/0x110 [dm_persistent_data]
 [<ffffffffa0331b00>] sm_ll_mutate+0x80/0x300 [dm_persistent_data]
 [<ffffffffa0330e60>] ? set_ref_count+0x10/0x10 [dm_persistent_data]
 [<ffffffffa0331dba>] sm_ll_inc+0x1a/0x20 [dm_persistent_data]
 [<ffffffffa0332270>] sm_disk_new_block+0x60/0x80 [dm_persistent_data]
 [<ffffffff81520036>] ? down_write+0x16/0x40
 [<ffffffffa001e5c4>] dm_pool_alloc_data_block+0x54/0x80 [dm_thin_pool]
 [<ffffffffa001b23c>] alloc_data_block+0x9c/0x130 [dm_thin_pool]
 [<ffffffffa001c27e>] provision_block+0x4e/0x180 [dm_thin_pool]
 [<ffffffffa001fe9a>] ? dm_thin_find_block+0x6a/0x110 [dm_thin_pool]
 [<ffffffffa001c57a>] process_bio+0x1ca/0x1f0 [dm_thin_pool]
 [<ffffffff8111e2ed>] ? mempool_free+0x8d/0xa0
 [<ffffffffa001d755>] process_deferred_bios+0xc5/0x230 [dm_thin_pool]
 [<ffffffffa001d911>] do_worker+0x51/0x60 [dm_thin_pool]
 [<ffffffff81067872>] process_one_work+0x182/0x3b0
 [<ffffffff81068c90>] worker_thread+0x120/0x3a0
 [<ffffffff81068b70>] ? manage_workers+0x160/0x160
 [<ffffffff8106eb2e>] kthread+0xce/0xe0
 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8152af6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff8152af6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff8106ea60>] ? kthread_freezable_should_stop+0x70/0x70

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20 07:45:11 -08:00
Mikulas Patocka 729d38d163 dm delay: fix a possible deadlock due to shared workqueue
commit 718822c1c112dc99e0c72c8968ee1db9d9d910f0 upstream.

The dm-delay target uses a shared workqueue for multiple instances.  This
can cause deadlock if two or more dm-delay targets are stacked on the top
of each other.

This patch changes dm-delay to use a per-instance workqueue.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20 07:45:11 -08:00
Joe Thornber 1cfc455291 dm array: fix a reference counting bug in shadow_ablock
commit ed9571f0cf1fe09d3506302610f3ccdfa1d22c4a upstream.

An old array block could have its reference count decremented below
zero when it is being replaced in the btree by a new array block.

The fix is to increment the old ablock's reference count just before
inserting a new ablock into the btree.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20 07:45:11 -08:00
Mikulas Patocka 20d68d38ef dm snapshot: avoid snapshot space leak on crash
commit 230c83afdd9cd384348475bea1e14b80b3b6b1b8 upstream.

There is a possible leak of snapshot space in case of crash.

The reason for space leaking is that chunks in the snapshot device are
allocated sequentially, but they are finished (and stored in the metadata)
out of order, depending on the order in which copying finished.

For example, supposed that the metadata contains the following records
SUPERBLOCK
METADATA (blocks 0 ... 250)
DATA 0
DATA 1
DATA 2
...
DATA 250

Now suppose that you allocate 10 new data blocks 251-260. Suppose that
copying of these blocks finish out of order (block 260 finished first
and the block 251 finished last). Now, the snapshot device looks like
this:
SUPERBLOCK
METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256)
DATA 0
DATA 1
DATA 2
...
DATA 250
DATA 251
DATA 252
DATA 253
DATA 254
DATA 255
METADATA (blocks 255, 254, 253, 252, 251)
DATA 256
DATA 257
DATA 258
DATA 259
DATA 260

Now, if the machine crashes after writing the first metadata block but
before writing the second metadata block, the space for areas DATA 250-255
is leaked, it contains no valid data and it will never be used in the
future.

This patch makes dm-snapshot complete exceptions in the same order they
were allocated, thus fixing this bug.

Note: when backporting this patch to the stable kernel, change the version
field in the following way:
* if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0}
* if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it
  to {1, 10, 2}
Userspace reads the version to determine if the bug was fixed, so the
version change is needed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20 07:45:10 -08:00
Mikulas Patocka d468a287d0 dm bufio: initialize read-only module parameters
commit 4cb57ab4a2e61978f3a9b7d4f53988f30d61c27f upstream.

Some module parameters in dm-bufio are read-only. These parameters
inform the user about memory consumption. They are not supposed to be
changed by the user.

However, despite being read-only, these parameters can be set on
modprobe or insmod command line, for example:
modprobe dm-bufio current_allocated_bytes=12345

The kernel doesn't expect that these variables can be non-zero at module
initialization and if the user sets them, it results in BUG.

This patch initializes the variables in the module init routine, so that
user-supplied values are ignored.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20 07:45:10 -08:00
NeilBrown 8e43aac1ec md: fix calculation of stacking limits on level change.
commit 02e5f5c0a0f726e66e3d8506ea1691e344277969 upstream.

The various ->run routines of md personalities assume that the 'queue'
has been initialised by the blk_set_stacking_limits() call in
md_alloc().

However when the level is changed (by level_store()) the ->run routine
for the new level is called for an array which has already had the
stacking limits modified.  This can result in incorrect final
settings.

So call blk_set_stacking_limits() before ->run in level_store().

A specific consequence of this bug is that it causes
discard_granularity to be set incorrectly when reshaping a RAID4 to a
RAID0.

This is suitable for any -stable kernel since 3.3 in which
blk_set_stacking_limits() was introduced.

Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 10:57:15 -08:00
Mikulas Patocka 4c52f00134 dm: allocate buffer for messages with small number of arguments using GFP_NOIO
commit f36afb3957353d2529cb2b00f78fdccd14fc5e9c upstream.

dm-mpath and dm-thin must process messages even if some device is
suspended, so we allocate argv buffer with GFP_NOIO. These messages have
a small fixed number of arguments.

On the other hand, dm-switch needs to process bulk data using messages
so excessive use of GFP_NOIO could cause trouble.

The patch also lowers the default number of arguments from 64 to 8, so
that there is smaller load on GFP_NOIO allocations.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 10:56:44 -08:00
Joe Thornber 8fafee9829 dm cache: fix a race condition between queuing new migrations and quiescing for a shutdown
commit 66cb1910df17b38334153462ec8166e48058035f upstream.

The code that was trying to do this was inadequate.  The postsuspend
method (in ioctl context), needs to wait for the worker thread to
acknowledge the request to quiesce.  Otherwise the migration count may
drop to zero temporarily before the worker thread realises we're
quiescing.  In this case the target will be taken down, but the worker
thread may have issued a new migration, which will cause an oops when
it completes.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 10:56:43 -08:00
Joe Thornber 0c5fd99e89 dm array: fix bug in growing array
commit 9c1d4de56066e4d6abc66ec188faafd7b303fb08 upstream.

Entries would be lost if the old tail block was partially filled.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 10:56:42 -08:00
Shiva Krishna Merla 9fb1b9d041 dm mpath: fix race condition between multipath_dtr and pg_init_done
commit 954a73d5d3073df2231820c718fdd2f18b0fe4c9 upstream.

Whenever multipath_dtr() is happening we must prevent queueing any
further path activation work.  Implement this by adding a new
'pg_init_disabled' flag to the multipath structure that denotes future
path activation work should be skipped if it is set.  By disabling
pg_init and then re-enabling in flush_multipath_work() we also avoid the
potential for pg_init to be initiated while suspending an mpath device.

Without this patch a race condition exists that may result in a kernel
panic:

1) If after pg_init_done() decrements pg_init_in_progress to 0, a call
   to wait_for_pg_init_completion() assumes there are no more pending path
   management commands.
2) If pg_init_required is set by pg_init_done(), due to retryable
   mode_select errors, then process_queued_ios() will again queue the
   path activation work.
3) If free_multipath() completes before activate_path() work is called a
   NULL pointer dereference like the following can be seen when
   accessing members of the recently destructed multipath:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
RIP: 0010:[<ffffffffa003db1b>]  [<ffffffffa003db1b>] activate_path+0x1b/0x30 [dm_multipath]
[<ffffffff81090ac0>] worker_thread+0x170/0x2a0
[<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40

[switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer]
Signed-off-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Reviewed-by: Krishnasamy Somasundaram <somasundaram.krishnasamy@netapp.com>
Tested-by: Speagle Andy <Andy.Speagle@netapp.com>
Acked-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 10:56:41 -08:00
Lukasz Dorau ed840bec21 md: Fix skipping recovery for read-only arrays.
commit 61e4947c99c4494336254ec540c50186d186150b upstream.

Since:
        commit 7ceb17e87b
        md: Allow devices to be re-added to a read-only array.

spares are activated on a read-only array. In case of raid1 and raid10
personalities it causes that not-in-sync devices are marked in-sync
without checking if recovery has been finished.

If a read-only array is degraded and one of its devices is not in-sync
(because the array has been only partially recovered) recovery will be skipped.

This patch adds checking if recovery has been finished before marking a device
in-sync for raid1 and raid10 personalities. In case of raid5 personality
such condition is already present (at raid5.c:6029).

Bug was introduced in 3.10 and causes data corruption.

Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13 12:05:32 +09:00
Bian Yu 0465496671 md: avoid deadlock when md_set_badblocks.
commit 905b0297a9533d7a6ee00a01a990456636877dd6 upstream.

When operate harddisk and hit errors, md_set_badblocks is called after
scsi_restart_operations which already disabled the irq. but md_set_badblocks
will call write_sequnlock_irq and enable irq. so softirq can preempt the
current thread and that may cause a deadlock. I think this situation should
use write_sequnlock_irqsave/irqrestore instead.

I met the situation and the call trace is below:
[  638.919974] BUG: spinlock recursion on CPU#0, scsi_eh_13/1010
[  638.921923]  lock: 0xffff8800d4d51fc8, .magic: dead4ead, .owner: scsi_eh_13/1010, .owner_cpu: 0
[  638.923890] CPU: 0 PID: 1010 Comm: scsi_eh_13 Not tainted 3.12.0-rc5+ #37
[  638.925844] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS 4.6.5 03/05/2013
[  638.927816]  ffff880037ad4640 ffff880118c03d50 ffffffff8172ff85 0000000000000007
[  638.929829]  ffff8800d4d51fc8 ffff880118c03d70 ffffffff81730030 ffff8800d4d51fc8
[  638.931848]  ffffffff81a72eb0 ffff880118c03d90 ffffffff81730056 ffff8800d4d51fc8
[  638.933884] Call Trace:
[  638.935867]  <IRQ>  [<ffffffff8172ff85>] dump_stack+0x55/0x76
[  638.937878]  [<ffffffff81730030>] spin_dump+0x8a/0x8f
[  638.939861]  [<ffffffff81730056>] spin_bug+0x21/0x26
[  638.941836]  [<ffffffff81336de4>] do_raw_spin_lock+0xa4/0xc0
[  638.943801]  [<ffffffff8173f036>] _raw_spin_lock+0x66/0x80
[  638.945747]  [<ffffffff814a73ed>] ? scsi_device_unbusy+0x9d/0xd0
[  638.947672]  [<ffffffff8173fb1b>] ? _raw_spin_unlock+0x2b/0x50
[  638.949595]  [<ffffffff814a73ed>] scsi_device_unbusy+0x9d/0xd0
[  638.951504]  [<ffffffff8149ec47>] scsi_finish_command+0x37/0xe0
[  638.953388]  [<ffffffff814a75e8>] scsi_softirq_done+0xa8/0x140
[  638.955248]  [<ffffffff8130e32b>] blk_done_softirq+0x7b/0x90
[  638.957116]  [<ffffffff8104fddd>] __do_softirq+0xfd/0x330
[  638.958987]  [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[  638.960861]  [<ffffffff8174a5cc>] call_softirq+0x1c/0x30
[  638.962724]  [<ffffffff81004c7d>] do_softirq+0x8d/0xc0
[  638.964565]  [<ffffffff8105024e>] irq_exit+0x10e/0x150
[  638.966390]  [<ffffffff8174ad4a>] smp_apic_timer_interrupt+0x4a/0x60
[  638.968223]  [<ffffffff817499af>] apic_timer_interrupt+0x6f/0x80
[  638.970079]  <EOI>  [<ffffffff810b964f>] ? __lock_release+0x6f/0x100
[  638.971899]  [<ffffffff8173fa6a>] ? _raw_spin_unlock_irq+0x3a/0x50
[  638.973691]  [<ffffffff8173fa60>] ? _raw_spin_unlock_irq+0x30/0x50
[  638.975475]  [<ffffffff81562393>] md_set_badblocks+0x1f3/0x4a0
[  638.977243]  [<ffffffff81566e07>] rdev_set_badblocks+0x27/0x80
[  638.978988]  [<ffffffffa00d97bb>] raid5_end_read_request+0x36b/0x4e0 [raid456]
[  638.980723]  [<ffffffff811b5a1d>] bio_endio+0x1d/0x40
[  638.982463]  [<ffffffff81304ff3>] req_bio_endio.isra.65+0x83/0xa0
[  638.984214]  [<ffffffff81306b9f>] blk_update_request+0x7f/0x350
[  638.985967]  [<ffffffff81306ea1>] blk_update_bidi_request+0x31/0x90
[  638.987710]  [<ffffffff813085e0>] __blk_end_bidi_request+0x20/0x50
[  638.989439]  [<ffffffff8130862f>] __blk_end_request_all+0x1f/0x30
[  638.991149]  [<ffffffff81308746>] blk_peek_request+0x106/0x250
[  638.992861]  [<ffffffff814a62a9>] ? scsi_kill_request.isra.32+0xe9/0x130
[  638.994561]  [<ffffffff814a633a>] scsi_request_fn+0x4a/0x3d0
[  638.996251]  [<ffffffff813040a7>] __blk_run_queue+0x37/0x50
[  638.997900]  [<ffffffff813045af>] blk_run_queue+0x2f/0x50
[  638.999553]  [<ffffffff814a5750>] scsi_run_queue+0xe0/0x1c0
[  639.001185]  [<ffffffff814a7721>] scsi_run_host_queues+0x21/0x40
[  639.002798]  [<ffffffff814a2e87>] scsi_restart_operations+0x177/0x200
[  639.004391]  [<ffffffff814a4fe9>] scsi_error_handler+0xc9/0xe0
[  639.005996]  [<ffffffff814a4f20>] ? scsi_unjam_host+0xd0/0xd0
[  639.007600]  [<ffffffff81072f6b>] kthread+0xdb/0xe0
[  639.009205]  [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170
[  639.010821]  [<ffffffff81748cac>] ret_from_fork+0x7c/0xb0
[  639.012437]  [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170

This bug was introduce in commit  2e8ac30312
(the first time rdev_set_badblock was call from interrupt context),
so this patch is appropriate for 3.5 and subsequent kernels.

Signed-off-by: Bian Yu <bianyu@kedacom.com>
Reviewed-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13 12:05:32 +09:00
Shaohua Li 01e608d727 raid5: avoid finding "discard" stripe
commit d47648fcf0611812286f68131b40251c6fa54f5e upstream.

SCSI discard will damage discard stripe bio setting, eg, some fields are
changed. If the stripe is reused very soon, we have wrong bios setting. We
remove discard stripe from hash list, so next time the strip will be fully
initialized.

Suitable for backport to 3.7+.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13 12:05:31 +09:00
Shaohua Li 7e44a92662 raid5: set bio bi_vcnt 0 for discard request
commit 37c61ff31e9b5e3fcf3cc6579f5c68f6ad40c4b1 upstream.

SCSI layer will add new payload for discard request. If two bios are merged
to one, the second bio has bi_vcnt 1 which is set in raid5. This will confuse
SCSI and cause oops.

Suitable for backport to 3.7+

Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13 12:05:31 +09:00
Kent Overstreet 955a23e181 bcache: Fixed incorrect order of arguments to bio_alloc_bioset()
commit d4eddd42f592a0cf06818fae694a3d271f842e4d upstream.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13 12:05:30 +09:00
Mikulas Patocka 2d99b6dd66 dm snapshot: fix data corruption
commit e9c6a182649f4259db704ae15a91ac820e63b0ca upstream.

This patch fixes a particular type of data corruption that has been
encountered when loading a snapshot's metadata from disk.

When we allocate a new chunk in persistent_prepare, we increment
ps->next_free and we make sure that it doesn't point to a metadata area
by further incrementing it if necessary.

When we load metadata from disk on device activation, ps->next_free is
positioned after the last used data chunk. However, if this last used
data chunk is followed by a metadata area, ps->next_free is positioned
erroneously to the metadata area. A newly-allocated chunk is placed at
the same location as the metadata area, resulting in data or metadata
corruption.

This patch changes the code so that ps->next_free skips the metadata
area when metadata are loaded in function read_exceptions.

The patch also moves a piece of code from persistent_prepare_exception
to a separate function skip_metadata to avoid code duplication.

CVE-2013-4299

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-04 04:31:06 -08:00
Kent Overstreet 84c8b3b9e1 bcache: Fix a null ptr deref regression
commit 2fe80d3bbf1c8bd9efc5b8154207c8dd104e7306 upstream.

Commit c0f04d88e46d ("bcache: Fix flushes in writeback mode") was fixing
a reported data corruption bug, but it seems some last minute
refactoring or rebasing introduced a null pointer deref.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-13 16:08:35 -07:00
NeilBrown f38af5d3f6 dm-raid: silence compiler warning on rebuilds_per_group.
commit 3f6bbd3ffd7b733dd705e494663e5761aa2cb9c1 upstream.

This doesn't really need to be initialised, but it doesn't hurt,
silences the compiler, and as it is a counter it makes sense for it to
start at zero.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:11 -07:00
Mike Snitzer e9d60f6991 dm mpath: disable WRITE SAME if it fails
commit f84cb8a46a771f36a04a02c61ea635c968ed5f6a upstream.

Workaround the SCSI layer's problematic WRITE SAME heuristics by
disabling WRITE SAME in the DM multipath device's queue_limits if an
underlying device disabled it.

The WRITE SAME heuristics, with both the original commit 5db44863b6
("[SCSI] sd: Implement support for WRITE SAME") and the updated commit
66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling
WRITE SAME(10) even without successfully determining it is supported.
After the first failed WRITE SAME the SCSI layer will disable WRITE SAME
for the device (by setting sdkp->device->no_write_same which results in
'max_write_same_sectors' in device's queue_limits to be set to 0).

When a device is stacked ontop of such a SCSI device any changes to that
SCSI device's queue_limits do not automatically propagate up the stack.
As such, a DM multipath device will not have its WRITE SAME support
disabled.  This causes the block layer to continue to issue WRITE SAME
requests to the mpath device which causes paths to fail and (if mpath IO
isn't configured to queue when no paths are available) it will result in
actual IO errors to the upper layers.

This fix doesn't help configurations that have additional devices
stacked ontop of the mpath device (e.g. LVM created linear DM devices
ontop).  A proper fix that restacks all the queue_limits from the bottom
of the device stack up will need to be explored if SCSI will continue to
use this model of optimistically allowing op codes and then disabling
them after they fail for the first time.

Before this patch:

EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
device-mapper: multipath: Failing path 8:112.
end_request: I/O error, dev dm-6, sector 4616
dm-6: WRITE SAME failed. Manually zeroing.
end_request: I/O error, dev dm-6, sector 4616
end_request: I/O error, dev dm-6, sector 5640
end_request: I/O error, dev dm-6, sector 6664
end_request: I/O error, dev dm-6, sector 7688
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
end_request: I/O error, dev dm-6, sector 524296
Aborting journal on device dm-6-8.
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.

# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
33553920

After this patch:

EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.

# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
0

It should be noted that WRITE SAME support wasn't enabled in DM
multipath until v3.10.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:11 -07:00
Mikulas Patocka 0f64fad39c dm-snapshot: fix performance degradation due to small hash size
commit 60e356f381954d79088d0455e357db48cfdd6857 upstream.

LVM2, since version 2.02.96, creates origin with zero size, then loads
the snapshot driver and then loads the origin.  Consequently, the
snapshot driver sees the origin size zero and sets the hash size to the
lower bound 64.  Such small hash table causes performance degradation.

This patch changes it so that the hash size is determined by the size of
snapshot volume, not minimum of origin and snapshot size.  It doesn't
make sense to set the snapshot size significantly larger than the origin
size, so we do not need to take origin size into account when
calculating the hash size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:11 -07:00
Mikulas Patocka 4541f4e356 dm snapshot: workaround for a false positive lockdep warning
commit 5ea330a75bd86b2b2a01d7b85c516983238306fb upstream.

The kernel reports a lockdep warning if a snapshot is invalidated because
it runs out of space.

The lockdep warning was triggered by commit 0976dfc1d0
("workqueue: Catch more locking problems with flush_work()") in v3.5.

The warning is false positive.  The real cause for the warning is that
the lockdep engine treats different instances of md->lock as a single
lock.

This patch is a workaround - we use flush_workqueue instead of flush_work.
This code path is not performance sensitive (it is called only on
initialization or invalidation), thus it doesn't matter that we flush the
whole workqueue.

The real fix for the problem would be to teach the lockdep engine to treat
different instances of md->lock as separate locks.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:11 -07:00
Kent Overstreet 30d0e7953b bcache: Fix flushes in writeback mode
commit c0f04d88e46d14de51f4baebb6efafb7d59e9f96 upstream.

In writeback mode, when we get a cache flush we need to make sure we
issue a flush to the backing device.

The code for sending down an extra flush was wrong - by cloning the bio
we were probably getting flags that didn't make sense for a bare flush,
and also the old code was firing for FUA bios, for which we don't need
to send a flush to the backing device.

This was causing data corruption somehow - the mechanism was never
determined, but this patch fixes it for the users that were seeing it.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:09 -07:00
Kent Overstreet df8b0d944c bcache: Fix for handling overlapping extents when reading in a btree node
commit 84786438ed17978d72eeced580ab757e4da8830b upstream.

btree_sort_fixup() was overly clever, because it was trying to avoid
pulling a key off the btree iterator in more than one place.

This led to a really obscure bug where we'd break early from the loop in
btree_sort_fixup() if the current key overlapped with keys in more than
one older set, and the next key it overlapped with was zero size.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:09 -07:00
Kent Overstreet 7866bece34 bcache: Fix a shrinker deadlock
commit a698e08c82dfb9771e0bac12c7337c706d729b6d upstream.

GFP_NOIO means we could be getting called recursively - mca_alloc() ->
mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then.
Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:09 -07:00
Kent Overstreet ff1a51b9bf bcache: Fix a dumb CPU spinning bug in writeback
commit 79e3dab90d9f826ceca67c7890e048ac9169de49 upstream.

schedule_timeout() != schedule_timeout_uninterruptible()

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:09 -07:00
Kent Overstreet 77dbabefd6 bcache: Fix a flush/fua performance bug
commit 1394d6761b6e9e15ee7c632a6d48791188727b40 upstream.

bch_journal_meta() was missing the flush to make the journal write
actually go down (instead of waiting up to journal_delay_ms)...

Whoops

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:09 -07:00
Kent Overstreet 48c4100e10 bcache: Fix a writeback performance regression
commit c2a4f3183a1248f615a695fbd8905da55ad11bba upstream.

Background writeback works by scanning the btree for dirty data and
adding those keys into a fixed size buffer, then for each dirty key in
the keybuf writing it to the backing device.

When read_dirty() finishes and it's time to scan for more dirty data, we
need to wait for the outstanding writeback IO to finish - they still
take up slots in the keybuf (so that foreground writes can check for
them to avoid races) - without that wait, we'll continually rescan when
we'll be able to add at most a key or two to the keybuf, and that takes
locks that starves foreground IO.  Doh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:09 -07:00
Kent Overstreet 808eea9d29 bcache: Fix for when no journal entries are found
commit c426c4fd46f709ade2bddd51c5738729c7ae1db5 upstream.

The journal replay code didn't handle this case, causing it to go into
an infinite loop...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:09 -07:00
Gabriel de Perthuis bb34311525 bcache: Strip endline when writing the label through sysfs
commit aee6f1cfff3ce240eb4b43b41ca466b907acbd2e upstream.

sysfs attributes with unusual characters have crappy failure modes
in Squeeze (udev 164); later versions of udev are unaffected.

This should make these characters more unusual.

Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:09 -07:00
Kent Overstreet 0a22f485ee bcache: Fix a dumb journal discard bug
commit 6d9d21e35fbfa2934339e96934f862d118abac23 upstream.

That switch statement was obviously wrong, leading to some sort of weird
spinning on rare occasion with discards enabled...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-05 07:13:09 -07:00
Kent Overstreet ae61fd4496 bcache: FUA fixes
commit e49c7c374e7aacd1f04ecbc21d9dbbeeea4a77d6 upstream.

Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node
writes have... weird ordering requirements.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29 09:47:40 -07:00
Kumar Amit Mehta ba5c60fc8f md: bcache: io.c: fix a potential NULL pointer dereference
commit 5c694129c8db6d89c9be109049a16510b2f70f6d upstream.

bio_alloc_bioset returns NULL on failure. This fix adds a missing check
for potential NULL pointer dereferencing.

Signed-off-by: Kumar Amit Mehta <gmate.amit@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29 09:47:40 -07:00
Mikulas Patocka dabd258221 dm verity: fix inability to use a few specific devices sizes
commit b1bf2de07271932326af847a3c6a01fdfd29d4be upstream.

Fix a boundary condition that caused failure for certain device sizes.

The problem is reported at
  http://code.google.com/p/cryptsetup/issues/detail?id=160

For certain device sizes the number of hashes at a specific level was
calculated incorrectly.

It happens for example for a device with data and metadata block size 4096
that has 16385 blocks and algorithm sha256.

The user can test if he is affected by this bug by running the
"veritysetup verify" command and also by activating the dm-verity kernel
driver and reading the whole block device. If it passes without an error,
then the user is not affected.

The condition for the bug is:

Split the total number of data blocks (data_block_bits) into bit strings,
each string has hash_per_block_bits bits. hash_per_block_bits is
rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you
can say that you convert data_blocks_bits to 2^hash_per_block_bits base.

If there some zero bit string below the most significant bit string and at
least one bit below this zero bit string is set, then the bug happens.

The same bug exists in the userspace veritysetup tool, so you must use
fixed veritysetup too if you want to use devices that are affected by
this boundary condition.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 16:51:02 +08:00
Mikulas Patocka ba0e6dd5e6 dm ioctl: set noio flag to avoid __vmalloc deadlock
commit 1c0e883e86ece31880fac2f84b260545d66a39e0 upstream.

Set noio flag while calling __vmalloc() because it doesn't fully respect
gfp flags to avoid a possible deadlock (see commit
502624bdad).

This should be backported to stable kernels 3.8 and newer. The kernel 3.8
doesn't have memalloc_noio_save(), so we should set and restore process
flag PF_MEMALLOC instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 16:51:02 +08:00
Hannes Reinecke 565ff95cdc dm mpath: fix ioctl deadlock when no paths
commit 6c182cd88d179cbbd06f4f8a8a19b6977940753f upstream.

When multipath needs to retry an ioctl the reference to the
current live table needs to be dropped. Otherwise a deadlock
occurs when all paths are down:
- dm_blk_ioctl takes a reference to the current table
  and spins in multipath_ioctl().
- A new table is being loaded, but upon resume the process
  hangs in dm_table_destroy() waiting for references to
  drop to zero.

With this patch the reference to the old table is dropped
prior to retry, thereby avoiding the deadlock.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 16:51:02 +08:00
NeilBrown 5a4de61514 md/raid10: remove use-after-free bug.
commit 0eb25bb027a100f5a9df8991f2f628e7d851bc1e upstream.

We always need to be careful when calling generic_make_request, as it
can start a chain of events which might free something that we are
using.

Here is one place I wasn't careful enough.  If the wbio2 is not in
use, then it might get freed at the first generic_make_request call.
So perform all necessary tests first.

This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an
oops, so fix is suitable for any -stable since then.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 16:50:55 +08:00
NeilBrown c1dadcc108 md/raid5: fix interaction of 'replace' and 'recovery'.
commit f94c0b6658c7edea8bc19d13be321e3860a3fa54 upstream.

If a device in a RAID4/5/6 is being replaced while another is being
recovered, then the writes to the replacement device currently don't
happen, resulting in corruption when the replacement completes and the
new drive takes over.

This is because the replacement writes are only triggered when
's.replacing' is set and not when the similar 's.sync' is set (which
is the case during resync and recovery - it means all devices need to
be read).

So schedule those writes when s.replacing is set as well.

In this case we cannot use "STRIPE_INSYNC" to record that the
replacement has happened as that is needed for recording that any
parity calculation is complete.  So introduce STRIPE_REPLACED to
record if the replacement has happened.

For safety we should also check that STRIPE_COMPUTE_RUN is not set.
This has a similar effect to the "s.locked == 0" test.  The latter
ensure that now IO has been flagged but not started.  The former
checks if any parity calculation has been flagged by not started.
We must wait for both of these to complete before triggering the
'replace'.

Add a similar test to the subsequent check for "are we finished yet".
This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
but it makes it more obvious that the REPLACE will happen before we
think we are finished.

Finally if a NeedReplace device is not UPTODATE then that is an
error.  We really must trigger a warning.

This bug was introduced in commit 9a3e1101b8
(md/raid5:  detect and handle replacements during recovery.)
which introduced replacement for raid5.
That was in 3.3-rc3, so any stable kernel since then would benefit
from this fix.

Reported-by: qindehua <13691222965@163.com>
Tested-by: qindehua <qindehua@163.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 16:50:54 +08:00
NeilBrown 8afb90da9f md/raid1: fix bio handling problems in process_checks()
commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a upstream.

Recent change to use bio_copy_data() in raid1 when repairing
an array is faulty.

The underlying may have changed the bio in various ways using
bio_advance and these need to be undone not just for the 'sbio' which
is being copied to, but also the 'pbio' (primary) which is being
copied from.

So perform the reset on all bios that were read from and do it early.

This also ensure that the sbio->bi_io_vec[j].bv_len passed to
memcmp is correct.

This fixes a crash during a 'check' of a RAID1 array.  The crash was
introduced in 3.10 so this is suitable for 3.10-stable.

Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 16:50:54 +08:00
NeilBrown 1d83ab8548 md: Remove recent change which allows devices to skip recovery.
commit 5024c298311f3b97c85cb034f9edaa333fdb9338 upstream.

commit 7ceb17e87b
    md: Allow devices to be re-added to a read-only array.

allowed a bit more than just that.  It also allows devices to be added
to a read-write array and to end up skipping recovery.

This patch removes the offending piece of code pending a rewrite for a
subsequent release.

More specifically:
 If the array has a bitmap, then the device will still need a bitmap
 based resync ('saved_raid_disk' is set under different conditions
 is a bitmap is present).
 If the array doesn't have a bitmap, then this is correct as long as
 nothing has been written to the array since the metadata was checked
 by ->validate_super.  However there is no locking to ensure that there
 was no write.

Bug was introduced in 3.10 and causes data corruption so
patch is suitable for 3.10-stable.

Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-04 16:50:53 +08:00
Kent Overstreet 5e20e8b371 bcache: Journal replay fix
commit faa5673617656ee58369a3cfe4a312cfcdc59c81 upstream.

The journal replay code starts by finding something that looks like a
valid journal entry, then it does a binary search over the unchecked
region of the journal for the journal entries with the highest sequence
numbers.

Trouble is, the logic was wrong - journal_read_bucket() returns true if
it found journal entries we need, but if the range of journal entries
we're looking for loops around the end of the journal - in that case
journal_read_bucket() could return true when it hadn't found the highest
sequence number we'd seen yet, and in that case the binary search did
the wrong thing. Whoops.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 16:30:10 -07:00
Kent Overstreet 99e56bf5ab bcache: Fix GC_SECTORS_USED() calculation
commit 29ebf465b9050f241c4433a796a32e6c896a9dcd upstream.

Part of the job of garbage collection is to add up however many sectors
of live data it finds in each bucket, but that doesn't work very well if
it doesn't reset GC_SECTORS_USED() when it starts. Whoops.

This wouldn't have broken anything horribly, but allocation tries to
preferentially reclaim buckets that are mostly empty and that's not
gonna work with an incorrect GC_SECTORS_USED() value.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 16:30:10 -07:00
Kent Overstreet fe1b710530 bcache: Fix a sysfs splat on shutdown
commit c9502ea4424b31728703d113fc6b30bfead14633 upstream.

If we stopped a bcache device when we were already detaching (or
something like that), bcache_device_unlink() would try to remove a
symlink from sysfs that was already gone because the bcache dev kobject
had already been removed from sysfs.

So keep track of whether we've removed stuff from sysfs.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 16:30:10 -07:00
Kent Overstreet 63a53870bb bcache: Shutdown fix
commit 5caa52afc5abd1396e4af720469abb5843a71eb8 upstream.

Stopping a cache set is supposed to make it stop attached backing
devices, but somewhere along the way that code got lost. Fixing this
mainly has the effect of fixing our reboot notifier.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 16:30:09 -07:00
Kent Overstreet 3fcbc17636 bcache: Advertise that flushes are supported
commit 54d12f2b4fd0f218590d1490b41a18d0e2328a9a upstream.

Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
out unless we say we support them...

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 16:30:09 -07:00
Kent Overstreet c0d8f45560 bcache: Fix a dumb race
commit 6aa8f1a6ca41c49721d2de4e048d3da8d06411f9 upstream.

In the far-too-complicated closure code - closures can have destructors,
for probably dubious reasons; they get run after the closure is no
longer waiting on anything but before dropping the parent ref, intended
just for freeing whatever memory the closure is embedded in.

Trouble is, when remaining goes to 0 and we've got nothing more to run -
we also have to unlock the closure, setting remaining to -1. If there's
a destructor, that unlock isn't doing anything - nobody could be trying
to lock it if we're about to free it - but if the unlock _is needed...
that check for a destructor was racy. Argh.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 16:30:09 -07:00
NeilBrown 2dc04d3333 md/raid10: fix two problems with RAID10 resync.
commit 7bb23c4934059c64cbee2e41d5d24ce122285176 upstream.

1/ When an different between blocks is found, data is copied from
   one bio to the other.  However bv_len is used as the length to
   copy and this could be zero.  So use r10_bio->sectors to calculate
   length instead.
   Using bv_len was probably always a bit dubious, but the introduction
   of bio_advance made it much more likely to be a problem.

2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
   except on bios that we schedule for a read.  This ensures that
   missing/failed devices don't confuse the loop at the top of
   sync_request write.
   Commit 8be185f2c9 "raid10: Use bio_reset()"
   removed a loop which set BIO_UPTDATE on all appropriate bios.
   So we need to re-add that flag.

These bugs were introduced in 3.10, so this patch is suitable for
3.10-stable, and can remove a potential for data corruption.

Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-25 14:07:28 -07:00
NeilBrown 4b5c14511c md/raid10: fix two bugs affecting RAID10 reshape.
commit 78eaa0d4cbcdb345992fa3dd22b3bcbb473cc064 upstream.

1/ If a RAID10 is being reshaped to a fewer number of devices
 and is stopped while this is ongoing, then when the array is
 reassembled the 'mirrors' array will be allocated too small.
 This will lead to an access error or memory corruption.

2/ A sanity test for a reshaping RAID10 array is restarted
 is slightly incorrect.

Due to the first bug, this is suitable for any -stable
kernel since 3.5 where this code was introduced.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-25 14:07:28 -07:00
NeilBrown 509c317d09 md/raid10: fix bug which causes all RAID10 reshapes to move no data.
commit 1376512065b23f39d5f9a160948f313397dde972 upstream.

The recent comment:
commit 7e83ccbecd
    md/raid10: Allow skipping recovery when clean arrays are assembled

Causes raid10 to skip a recovery in certain cases where it is safe to
do so.  Unfortunately it also causes a reshape to be skipped which is
never safe.  The result is that an attempt to reshape a RAID10 will
appear to complete instantly, but no data will have been moves so the
array will now contain garbage.
(If nothing is written, you can recovery by simple performing the
reverse reshape which will also complete instantly).

Bug was introduced in 3.10, so this is suitable for 3.10-stable.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Martin Wilck <mwilck@arcor.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-25 14:07:28 -07:00
Linus Torvalds 82ea4be61f A few bugfixes for md
Some tagged for -stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUbl1mznsnt1WYoG5AQKGlQ//eixdawF+DUK5hadqZ9EDni+BAVzb7m69
 +zU6ilQ7UOh7bxtAoJqrgFVykK+LG8wvYsEBwMjB9oRDLA96/YDXXiBzXHvd6mGh
 g271lwMTQ9h+O8L6psLUX6qsrH3i7SJmF8ySPKi6Fe5ruT8ToOB8Ii8XQebEZdXo
 VOzRz2VgSTcBdrTyKPDsBJByDQX36hsK8Gs5YSl5F3nvyV4dvGWMlyoTF1TRRt9K
 YCCZ8pSk3kTXaSdl0syrJxI17pEUC8mtcA01S6JD/GV49CGO8LYAckVJ4ijWw7VV
 IGGlH0DsYSMgJ7yyuLz4ifaqRnsWsAGW0WyiZYYKvjtNUiyBuBBbo2cQ1lNkR5p4
 jnLhpJJVh0hLCPn6wcCWIBIdT/mFaBpXkvZPd3ks5kefGXsfpVPm0fK8r0fzkzgy
 tJCZtZFZHeK1qsgaDsiS76S2ZNcFh0HQVIa84Q200/XUDgh8dYlD0+7oIsVu0UBZ
 72Aop+Ak9+k4vKTvB9/hpcY+Rt0MI7zKewXBDSDK1sXhIHLQqv8rCEeNYiuPPqr/
 ghRukn+C/Wtr7JYBsX+jMjxtmSzYtwBOihwLoZCH9pp3C5jTvyQk9s8n1j13V2RK
 sAFtfpCVoQ8tTa7IITKRMfftzHn1WiPlPsj6VbigJ6A4N98csgv7x2rF7FyqcF0X
 aoj69nQ3i/4=
 =8iy3
 -----END PGP SIGNATURE-----

Merge tag 'md-3.10-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "A few bugfixes for md

  Some tagged for -stable"

* tag 'md-3.10-fixes' of git://neil.brown.name/md:
  md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
  md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
  md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
  md: md_stop_writes() should always freeze recovery.
2013-06-13 10:13:29 -07:00
H. Peter Anvin 5026d7a9b2 md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME.  This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.

After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed.  This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.

However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.

Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.

[neilb: added raid5]

This patch is appropriate for any -stable since 3.7 when write_same
support was added.

Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13 14:49:54 +10:00