commit 766a78882ddf79b162243649d7dfdbac1fb6fb88 upstream.
Commit 9b1cc9f251 ("dm cache: share cache-metadata object across
inactive and active DM tables") mistakenly ignored the use of ERR_PTR
returns. Restore missing IS_ERR checks and ERR_PTR returns where
appropriate.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2a7eaea02b99b6e267b1e89c79acc6e9a51cee3b upstream.
You can't modify the metadata in these modes. It's better to fail these
messages immediately than let the block-manager deny write locks on
metadata blocks. Otherwise these failed metadata changes will trigger
'needs_check' to get set in the metadata superblock -- requiring repair
using the thin_check utility.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 108cef3aa41669610e1836fe638812dd067d72de upstream.
It is critical that fetch_block() and handle_stripe_dirtying()
are consistent in their analysis of what needs to be loaded.
Otherwise raid5 can wait forever for a block that won't be loaded.
Currently when writing to a RAID5 that is resyncing, to a location
beyond the resync offset, handle_stripe_dirtying chooses a
reconstruct-write cycle, but fetch_block() assumes a
read-modify-write, and a lockup can happen.
So treat that case just like RAID6, just as we do in
handle_stripe_dirtying. RAID6 always does reconstruct-write.
This bug was introduced when the behaviour of handle_stripe_dirtying
was changed in 3.7, so the patch is suitable for any kernel since,
though it will need careful merging for some versions.
Cc: stable@vger.kernel.org (v3.7+)
Fixes: a7854487cd
Reported-by: Henry Cai <henryplusplus@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9b1cc9f251affdd27f29fe46d0989ba76c33faf6 upstream.
If a DM table is reloaded with an inactive table when the device is not
suspended (normal procedure for LVM2), then there will be two dm-bufio
objects that can diverge. This can lead to a situation where the
inactive table uses bufio to read metadata at the same time the active
table writes metadata -- resulting in the inactive table having stale
metadata buffers once it is promoted to the active table slot.
Fix this by using reference counting and a global list of cache metadata
objects to ensure there is only one metadata object per metadata device.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Split an IO into multiple requests so the the crypto accelerators
can be exercized in parallel to reduce latency.
Change-Id: I24b15568b5afd375ad39bf3b74f60743f0e1dde9
Acked-by: Baranidharan Muthukumaran <bmuthuku@qti.qualcomm.com>
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
commit c1c6156fe4d4577444b769d7edd5dd503e57bbc9 upstream.
This function isn't right and it causes a static checker warning:
drivers/md/dm-thin.c:3016 maybe_resize_data_dev()
error: potentially using uninitialized 'sb_data_size'.
It should set "*count" and return zero on success the same as the
sm_metadata_get_nr_blocks() function does earlier.
Fixes: 3241b1d3e0 ('dm: add persistent data library')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 445559cdcb98a141f5de415b94fd6eaccab87e6d upstream.
When dm-bufio sets out to use the bio built into a struct dm_buffer to
issue an IO, it needs to call bio_reset after it's done with the bio
so that we can free things attached to the bio such as the integrity
payload. Therefore, inject our own endio callback to take care of
the bio_reset after calling submit_io's end_io callback.
Test case:
1. modprobe scsi_debug delay=0 dif=1 dix=199 ato=1 dev_size_mb=300
2. Set up a dm-bufio client, e.g. dm-verity, on the scsi_debug device
3. Repeatedly read metadata and watch kmalloc-192 leak!
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4b5060ddae2b03c5387321fafc089d242225697a upstream.
If two threads call bitmap_unplug at the same time, then
one might schedule all the writes, and the other might
decide that it doesn't need to wait. But really it does.
It rarely hurts to wait when it isn't absolutely necessary,
and the current code doesn't really focus on 'absolutely necessary'
anyway. So just wait always.
This can potentially lead to data corruption if a crash happens
at an awkward time and data was written before the bitmap was
updated. It is very unlikely, but this should go to -stable
just to be safe. Appropriate for any -stable.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Even if there are no pending peek work, dm layer trying to schedule
the work with 10ms time delay. This patch fixes the issue by putting
back the table entry.
Change-Id: I0e5df117fae74ae37a621862f254220706f0d840
Signed-off-by: AnilKumar Chimata <anilc@codeaurora.org>
Storage hardware can have embedded crypto engine which can greatly
reduce degradation in IO performance if crypto operations are performed
on data. Added support in dm-req-crypt so that it can work either in
transparent mode or crypto mode. In transparent mode, dm-req-crypt will
not perform any crypto operation by itself. In crypto mode, dm-req-crypt
will perform crypto operation on data using a seperate crypto engine
(SW based CE or HW based CE).
Change-Id: I8f27840899566c1a608ca13ce6b7480c9866fb6a
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
commit 40d43c4b4cac4c2647bf07110d7b07d35f399a84 upstream.
The dm-raid superblock (struct dm_raid_superblock) is padded to 512
bytes and that size is being used to read it in from the metadata
device into one preallocated page.
Reading or writing this on a 512-byte sector device works fine but on
a 4096-byte sector device this fails.
Set the dm-raid superblock's size to the logical block size of the
metadata device, because IO at that size is guaranteed too work. Also
add a size check to avoid silent partial metadata loss in case the
superblock should ever grow past the logical block size or PAGE_SIZE.
[includes pointer math fix from Dan Carpenter]
Reported-by: "Liuhua Wang" <lwang@suse.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9b460d3699324d570a4d4161c3741431887f102f upstream.
The walk code was using a 'ro_spine' to hold it's locked btree nodes.
But this data structure is designed for the rolling lock scheme, and
as such automatically unlocks blocks that are two steps up the call
chain. This is not suitable for the simple recursive walk algorithm,
which retraces its steps.
This code is only used by the persistent array code, which in turn is
only used by dm-cache. In order to trigger it you need to have a
mapping tree that is more than 2 levels deep; which equates to 8-16
million cache blocks. For instance a 4T ssd with a very small block
size of 32k only just triggers this bug.
The fix just places the locked blocks on the stack, and stops using
the ro_spine altogether.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 56ec16cb1e1ce46354de8511eef962a417c32c92 upstream.
If cn_add_callback() fails in dm_ulog_tfr_init(), it does not
deallocate prealloced memory but calls cn_del_callback().
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit eb76faf53b1ff7a77ce3f78cc98ad392ac70c2a0 upstream.
The 'last_accessed' member of the dm_buffer structure was only set when
the the buffer was created. This led to each buffer being discarded
after dm_bufio_max_age time even if it was used recently. In practice
this resulted in all thinp metadata being evicted soon after being read
-- this is particularly problematic for metadata intensive workloads
like multithreaded small random IO.
'last_accessed' is now updated each time the buffer is moved to the head
of the LRU list, so the buffer is now properly discarded if it was not
used in dm_bufio_max_age time.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Recalculate nr_phys_segments after pages are allocated
for write requests. Move _req_crypt_io_pool allocation
and de-allocation to ctr and dtr instead of driver init
and exit.
Change-Id: I8576dce1f7c9bc39dcc975762562fb84a349bba7
Acked-by: Baranidharan Muthukumaran <bmuthuku@qti.qualcomm.com>
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
dm-req-crypt was not initializing number of engines available for
crypto operation. Hence, number of engines were getting accumulated
everytime a device based of dm-req-crypt was created. This caused
crash in qcrypto module while retrieving the crypto engine.
Change-Id: I06b5b296a80ae4f9f6bfd024222be9f47a29bfce
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
commit 8e0e99ba64c7ba46133a7c8a3e3f7de01f23bd93 upstream.
It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint. Some devices in some cases don't do what it
says on the label.
The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero. If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks. If all were zero, this would
be the case. If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.
As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.
As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.
If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
raid456.devices_handle_discard_safely=Y
As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7
Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf
Signed-off-by: NeilBrown <neilb@suse.de>
[bwh: Backported to 3.10: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b8cb6b4c121e1bf1963c16ed69e7adcb1bc301cd upstream.
If a devices is being recovered it is not InSync and is not Faulty.
If a read error is experienced on that device, fix_read_error()
will be called, but it ignores non-InSync devices. So it will
neither fix the error nor fail the device.
It is incorrect that fix_read_error() ignores non-InSync devices.
It should only ignore Faulty devices. So fix it.
This became a bug when we allowed reading from a device that was being
recovered. It is suitable for any subsequent -stable kernel.
Fixes: da8840a747
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d49ec52ff6ddcda178fc2476a109cf1bd1fa19ed upstream.
The DM crypt target accesses memory beyond allocated space resulting in
a crash on 32 bit x86 systems.
This bug is very old (it dates back to 2.6.25 commit 3a7f6c990a "dm
crypt: use async crypto"). However, this bug was masked by the fact
that kmalloc rounds the size up to the next power of two. This bug
wasn't exposed until 3.17-rc1 commit 298a9fa08a ("dm crypt: use per-bio
data"). By switching to using per-bio data there was no longer any
padding beyond the end of a dm-crypt allocated memory block.
To minimize allocation overhead dm-crypt puts several structures into one
block allocated with kmalloc. The block holds struct ablkcipher_request,
cipher-specific scratch pad (crypto_ablkcipher_reqsize(any_tfm(cc))),
struct dm_crypt_request and an initialization vector.
The variable dmreq_start is set to offset of struct dm_crypt_request
within this memory block. dm-crypt allocates the block with this size:
cc->dmreq_start + sizeof(struct dm_crypt_request) + cc->iv_size.
When accessing the initialization vector, dm-crypt uses the function
iv_of_dmreq, which performs this calculation: ALIGN((unsigned long)(dmreq
+ 1), crypto_ablkcipher_alignmask(any_tfm(cc)) + 1).
dm-crypt allocated "cc->iv_size" bytes beyond the end of dm_crypt_request
structure. However, when dm-crypt accesses the initialization vector, it
takes a pointer to the end of dm_crypt_request, aligns it, and then uses
it as the initialization vector. If the end of dm_crypt_request is not
aligned on a crypto_ablkcipher_alignmask(any_tfm(cc)) boundary the
alignment causes the initialization vector to point beyond the allocated
space.
Fix this bug by calculating the variable iv_size_padding and adding it
to the allocated size.
Also correct the alignment of dm_crypt_request. struct dm_crypt_request
is specific to dm-crypt (it isn't used by the crypto subsystem at all),
so it is aligned on __alignof__(struct dm_crypt_request).
Also align per_bio_data_size on ARCH_KMALLOC_MINALIGN, so that it is
aligned as if the block was allocated with kmalloc.
Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dm-req-crypt does not release the device which it got during
construction of dm-req-crypt based node. This causes issue if
dm-req-crypt based device is created and destroyed without
rebooting the device.
Change-Id: Ifeb1210a6e1cf365b8a656556082806a24f3e582
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
Add info message when mapping a block-device to
the dm-req-crypt device-mapper target.
This is the first step before this driver can be used for
encryption of a block device.
It is called once per power-up when disk-encryption is enabled.
Change-Id: I51866b95cbe77d1a3d39fcd5d5d5297c78950fa2
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
commit 2446dba03f9dabe0b477a126cbeb377854785b47 upstream.
Currently we don't abort recovery on a write error if the write error
to the recovering device was triggerd by normal IO (as opposed to
recovery IO).
This means that for one bitmap region, the recovery might write to the
recovering device for a few sectors, then not bother for subsequent
sectors (as it never writes to failed devices). In this case
the bitmap bit will be cleared, but it really shouldn't.
The result is that if the recovering device fails and is then re-added
(after fixing whatever hardware problem triggerred the failure),
the second recovery won't redo the region it was in the middle of,
so some of the device will not be recovered properly.
If we abort the recovery, the region being processes will be cancelled
(bit not cleared) and the whole region will be retried.
As the bug can result in data corruption the patch is suitable for
-stable. For kernels prior to 3.11 there is a conflict in raid10.c
which will require care.
Original-from: jiao hui <jiaohui@bwstor.com.cn>
Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b39685526f46976bcd13aa08c82480092befa46c upstream.
When a raid10 commences a resync/recovery/reshape it allocates
some buffer space.
When a resync/recovery completes the buffer space is freed. But not
when the reshape completes.
This can result in a small memory leak.
There is a subtle side-effect of this bug. When a RAID10 is reshaped
to a larger array (more devices), the reshape is immediately followed
by a "resync" of the new space. This "resync" will use the buffer
space which was allocated for "reshape". This can cause problems
including a "BUG" in the SCSI layer. So this is suitable for -stable.
Fixes: 3ea7daa5d7
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ce0b0a46955d1bb389684a2605dbcaa990ba0154 upstream.
raid10 reshape clears unwanted bits from a bio->bi_flags using
a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
was added.
Since then it clears that bit but shouldn't. This results in a
memory leak.
So change to used the approved method of clearing unwanted bits.
As this causes a memory leak which can consume all of memory
the fix is suitable for -stable.
Fixes: a38352e0ac
Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9c4bdf697c39805078392d5ddbbba5ae5680e0dd upstream.
During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.
If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.
This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.
Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then. In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().
Fixes: 6c0069c0ae
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Remove WQ_HIGHPRI flag for the workqueue. WQ_HIGHPRI and
WQ_UNBOUND flags cannot be used together.
Change-Id: I4ca4eb6596552866049ea6402e492f463eeaed2c
Acked-by: Baranidharan Muthukumaran <bmuthuku@qti.qualcomm.com>
Signed-off-by: Zhen Kong <zkong@codeaurora.org>
Add support for multiple crypto instances, ping-pong between
available crypto instances to increase performance
Change-Id: I562d77bf0f70ce6cab173dae18bc40fc5e766ba1
Signed-off-by: William Clark <wclark@codeaurora.org>
* commit 'v3.10.49': (529 commits)
Linux 3.10.49
ACPI / battery: Retry to get battery information if failed during probing
x86, ioremap: Speed up check for RAM pages
Score: Modify the Makefile of Score, remove -mlong-calls for compiling
Score: The commit is for compiling successfully.
Score: Implement the function csum_ipv6_magic
score: normalize global variables exported by vmlinux.lds
rtmutex: Plug slow unlock race
rtmutex: Handle deadlock detection smarter
rtmutex: Detect changes in the pi lock chain
rtmutex: Fix deadlock detector for real
ring-buffer: Check if buffer exists before polling
drm/radeon: stop poisoning the GART TLB
drm/radeon: fix typo in golden register setup on evergreen
ext4: disable synchronous transaction batching if max_batch_time==0
ext4: clarify error count warning messages
ext4: fix unjournalled bg descriptor while initializing inode bitmap
dm io: fix a race condition in the wake up code for sync_io
Drivers: hv: vmbus: Fix a bug in the channel callback dispatch code
clk: spear3xx: Use proper control register offset
...
In addition to bringing in upstream commits, this merge also makes minor
changes to mainitain compatibility with upstream:
The definition of list_next_entry in qcrypto.c and ipa_dp.c has been
removed, as upstream has moved the definition to list.h. The implementation
of list_next_entry was identical between the two.
irq.c, for both arm and arm64 architecture, has had its calls to
__irq_set_affinity_locked updated to reflect changes to the API upstream.
Finally, as we have removed the sleep_length member variable of the
tick_sched struct, all changes made by upstream commit ec804bd do not
apply to our tree and have been removed from this merge. Only
kernel/time/tick-sched.c is impacted.
Change-Id: I63b7e0c1354812921c94804e1f3b33d1ad6ee3f1
Signed-off-by: Ian Maund <imaund@codeaurora.org>
for backward compatibility, argc == 5 is allowed.
however, check argc >= 6 before access argv[5].
Change-Id: Ib7baf741b302110bb170b8c915bfa3dbb223f606
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
commit 048e5a07f282c57815b3901d4a68a77fa131ce0a upstream.
The block size for the dm-cache's data device must remained fixed for
the life of the cache. Disallow any attempt to change the cache's data
block size.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9aec8629ec829fc9403788cd959e05dd87988bd1 upstream.
The block size for the thin-pool's data device must remained fixed for
the life of the thin-pool. Disallow any attempt to change the
thin-pool's data block size.
It should be noted that attempting to change the data block size via
thin-pool table reload will be ignored as a side-effect of the thin-pool
handover that the thin-pool target does during thin-pool table reload.
Here is an example outcome of attempting to load a thin-pool table that
reduced the thin-pool's data block size from 1024K to 512K.
Before:
kernel: device-mapper: thin: 253:4: growing the data device from 204800 to 409600 blocks
After:
kernel: device-mapper: thin metadata: changing the data block size (from 2048 to 1024) is not supported
kernel: device-mapper: table: 253:4: thin-pool: Error creating metadata object
kernel: device-mapper: ioctl: error adding target to table
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 10f1d5d111e8aed46a0f1179faf9a3cf422f689e upstream.
There's a race condition between the atomic_dec_and_test(&io->count)
in dec_count() and the waking of the sync_io() thread. If the thread
is spuriously woken immediately after the decrement it may exit,
making the on stack io struct invalid, yet the dec_count could still
be using it.
Fix this race by using a completion in sync_io() and dec_count().
Reported-by: Minfei Huang <huangminfei@ucloud.cn>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 133d4527eab8d199a62eee6bd433f0776842df2e upstream.
When we write to a degraded array which has a bitmap, we
make sure the relevant bit in the bitmap remains set when
the write completes (so a 're-add' can quickly rebuilt a
temporarily-missing device).
If, immediately after such a write starts, we incorporate a spare,
commence recovery, and skip over the region where the write is
happening (because the 'needs recovery' flag isn't set yet),
then that write will not get to the new device.
Once the recovery finishes the new device will be trusted, but will
have incorrect data, leading to possible corruption.
We cannot set the 'needs recovery' flag when we start the write as we
do not know easily if the write will be "degraded" or not. That
depends on details of the particular raid level and particular write
request.
This patch fixes a corruption issue of long standing and so it
suitable for any -stable kernel. It applied correctly to 3.0 at
least and will minor editing to earlier kernels.
Reported-by: Bill <billstuff2001@sbcglobal.net>
Tested-by: Bill <billstuff2001@sbcglobal.net>
Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 09869de57ed2728ae3c619803932a86cb0e2c4f8 upstream.
DM thinp already checks whether the discard_granularity of the data
device is a factor of the thin-pool block size. But when using the
dm-thin-pool's discard passdown support, DM thinp was not selecting the
max of the underlying data device's discard_granularity and the
thin-pool's block size.
Update set_discard_limits() to set discard_granularity to the max of
these values. This enables blkdev_issue_discard() to properly align the
discards that are sent to the DM thin device on a full block boundary.
As such each discard will now cover an entire DM thin-pool block and the
block will be reclaimed.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When read/write a file using dircet-io (O_DIRECT),
we can't get the inode from a bio by walking the path
bio->bi_io_vec->bv_page->mapping->host
since the page is anonymous and not mapped.
On the other more typical cases when not using O_DIRECT,
the page cache is used and the bv_page is available.
Change-Id: I349cad54a978ed9919f960d55f0f95c1e53262e5
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
* commit 'v3.10.40': (203 commits)
Linux 3.10.40
ARC: !PREEMPT: Ensure Return to kernel mode is IRQ safe
drm: cirrus: add power management support
Input: synaptics - add min/max quirk for ThinkPad Edge E431
Input: synaptics - add min/max quirk for ThinkPad T431s, L440, L540, S1 Yoga and X1
lockd: ensure we tear down any live sockets when socket creation fails during lockd_up
dm thin: fix dangling bio in process_deferred_bios error path
dm transaction manager: fix corruption due to non-atomic transaction commit
Skip intel_crt_init for Dell XPS 8700
mtd: sm_ftl: heap corruption in sm_create_sysfs_attributes()
mtd: nuc900_nand: NULL dereference in nuc900_nand_enable()
mtd: atmel_nand: Disable subpage NAND write when using Atmel PMECC
tgafb: fix data copying
gpio: mxs: Allow for recursive enable_irq_wake() call
rtlwifi: rtl8188ee: initialize packet_beacon
rtlwifi: rtl8192se: Fix regression due to commit 1bf4bbb
rtlwifi: rtl8192se: Fix too long disable of IRQs
rtlwifi: rtl8192cu: Fix too long disable of IRQs
rtlwifi: rtl8188ee: Fix too long disable of IRQs
rtlwifi: rtl8723ae: Fix too long disable of IRQs
...
Change-Id: If5388cf980cb123e35e1b29275ba288c89c5aa18
Signed-off-by: Ian Maund <imaund@codeaurora.org>
commit 2ac295a544dcae9299cba13ce250419117ae7fd1 upstream.
Commit 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97
md: fix problem when adding device to read-only array with bitmap.
added a called to md_reap_sync_thread() which cause a reshape thread
to be interrupted (in particular, it could cause md_thread() to never even
call md_do_sync()).
However it didn't set MD_RECOVERY_INTR so ->finish_reshape() would not
know that the reshape didn't complete.
This only happens when mddev->ro is set and normally reshape threads
don't run in that situation. But raid5 and raid10 can start a reshape
thread during "run" is the array is in the middle of a reshape.
They do this even if ->ro is set.
So it is best to set MD_RECOVERY_INTR before abortingg the
sync thread, just in case.
Though it rare for this to trigger a problem it can cause data corruption
because the reshape isn't finished properly.
So it is suitable for any stable which the offending commit was applied to.
(3.2 or later)
Fixes: 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3991b31ea072b070081ca3bfa860a077eda67de5 upstream.
If mddev->ro is set, md_to_sync will (correctly) abort.
However in that case MD_RECOVERY_INTR isn't set.
If a RESHAPE had been requested, then ->finish_reshape() will be
called and it will think the reshape was successful even though
nothing happened.
Normally a resync will not be requested if ->ro is set, but if an
array is stopped while a reshape is on-going, then when the array is
started, the reshape will be restarted. If the array is also set
read-only at this point, the reshape will instantly appear to success,
resulting in data corruption.
Consequently, this patch is suitable for any -stable kernel.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f1daa838e861ae1a0fb7cd9721a21258430fcc8c upstream.
The DM cache target cannot cope with discards that span multiple cache
blocks, so each discard bio that spans more than one cache block must
get split by the DM core.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 610f2de3559c383caf8fbbf91e9968102dff7ca0 upstream.
The DM crypt target used per-cpu structures to hold pointers to a
ablkcipher_request structure. The code assumed that the work item keeps
executing on a single CPU, so it didn't use synchronization when
accessing this structure.
If a CPU is disabled by writing 0 to /sys/devices/system/cpu/cpu*/online,
the work item could be moved to another CPU. This causes dm-crypt
crashes, like the following, because the code starts using an incorrect
ablkcipher_request:
smpboot: CPU 7 is now offline
BUG: unable to handle kernel NULL pointer dereference at 0000000000000130
IP: [<ffffffffa1862b3d>] crypt_convert+0x12d/0x3c0 [dm_crypt]
...
Call Trace:
[<ffffffffa1864415>] ? kcryptd_crypt+0x305/0x470 [dm_crypt]
[<ffffffff81062060>] ? finish_task_switch+0x40/0xc0
[<ffffffff81052a28>] ? process_one_work+0x168/0x470
[<ffffffff8105366b>] ? worker_thread+0x10b/0x390
[<ffffffff81053560>] ? manage_workers.isra.26+0x290/0x290
[<ffffffff81058d9f>] ? kthread+0xaf/0xc0
[<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
[<ffffffff813464ac>] ? ret_from_fork+0x7c/0xb0
[<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120
Fix this bug by removing the per-cpu definition. The structure
ablkcipher_request is accessed via a pointer from convert_context.
Consequently, if the work item is rescheduled to a different CPU, the
thread still uses the same ablkcipher_request.
This change may undermine performance improvements intended by commit
c0297721 ("dm crypt: scale to multiple cpus") on select hardware. In
practice no performance difference was observed on recent hardware. But
regardless, correctness is more important than performance.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream.
If an md array with externally managed metadata (e.g. DDF or IMSM)
is in use, then we should not set safemode==2 at shutdown because:
1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
2/ The safemode management code doesn't cope with safemode==2 on external metadata
and md_check_recover enters an infinite loop.
Even at shutdown, an infinite-looping process can be problematic, so this
could cause shutdown to hang.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Write requests are sorted in a red-black tree structure and are submitted
in the sorted order.
In theory the sorting should be performed by the underlying disk scheduler,
however, in practice the disk scheduler accepts and sorts only 128 requests.
In order to sort more requests, we need to implement our own sorting.
CRs-fixed: 670391
Change-Id: Iffd9345fa1253f5cf2a556893ed36e08f1ac51aa
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:09
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Submitting write bios directly in the encryption thread caused serious
performance degradation. On multiprocessor machine encryption requests
finish in a different order than they were submitted in. Consequently, write
requests would be submitted in a different order and it could cause severe
performance degradation.
This patch moves submitting write requests to a separate thread so that
the requests can be sorted before submitting.
Sorting is implemented in the next patch.
Note: it is required that a previous patch "dm-crypt: don't allocate pages
for a partial request." is applied before applying this patch. Without
that, this patch could introduce a crash.
CRs-fixed: 670391
Change-Id: I886ed2da0ff174d3539ea18e27170d7fd1062680
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:08
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Remove io_pool and _crypt_io_pool because they are unused.
CRs-fixed: 670391
Change-Id: I71400ecda66902c56c3af981e8b739f156db1e27
Signed-off-by: Mikulas Patocka <mpatock@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:08
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
This patch fixes a theoretical deadlock introduced in the previous patch.
The function crypt_alloc_buffer may be called concurrently. If we allocate
from the mempool concurrently, there is a possibility of deadlock.
For example, if we have mempool of 256 pages, two processes, each wanting 256,
pages allocate from the mempool concurrently, it may deadlock in a situation
where both processes have allocated 128 pages and the mempool is exhausted.
In order to avoid this scenarios, we allocate the pages under a mutex.
In order to not degrade performance with excessive locking, we try
non-blocking allocations without a mutex first and if it fails, we fallback
to a blocking allocation with a mutex.
CRs-fixed: 670391
Change-Id: I6c391dece4ba44fe0b2e9b75ea2b9235bf1b525b
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:07
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
This patch changes crypt_alloc_buffer so that it always allocates pages for
a full request.
This change enables further simplification and removing of one refcounts
in the next patches.
Note: the next patch is needed to fix a theoretical deadlock
CRs-fixed: 670391
Change-Id: I7bcadac8b3450976366c701fceb1fee7cb18df85
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
[joonwoop@codeaurora.org: resolve trivial merge conflicts]
Patch-mainline: dm-devel @ 04/05/14, 14:07
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Use unbound workqueue so that work is automatically ballanced between
available CPUs.
CRs-fixed: 670391
Change-Id: I169099d0b5b27535633c9d3aaab2037b5fea6aa9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
[joonwoop@codeaurora.org: resolve trivial merge conflict]
Patch-mainline: dm-devel @ 04/05/14, 14:06
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
This patch changes dm-crypt so that it uses auxiliary data allocated with
the bio.
Dm-crypt requires two allocations per request - struct dm_crypt_io and
struct ablkcipher_request (with other data appended to it). It used
mempool for the allocation.
Some requests may require more dm_crypt_ios and ablkcipher_requests,
however most requests need just one of each of these two structures to
complete.
This patch changes it so that the first dm_crypt_io and ablkcipher_request
and allocated with the bio (using target per_bio_data_size option). If the
request needs additional values, they are allocated from the mempool.
CRs-fixed: 670391
Change-Id: I8abc48a021391398f3b35bdd4ac9efbbec3a9fa5
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Patch-mainline: dm-devel @ 04/05/14, 14:05
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Dm-crypt used per-cpu structures to hold pointers to ablkcipher_request.
The code assumed that the work item keeps executing on a single CPU, so it
used no synchronization when accessing this structure.
When we disable a CPU by writing zero to
/sys/devices/system/cpu/cpu*/online, the work item could be moved to
another CPU. This causes crashes in dm-crypt because the code starts using
a wrong ablkcipher_request.
This patch fixes this bug by removing the percpu definition. The structure
ablkcipher_request is accessed via a pointer from convert_context.
Consequently, if the work item is rescheduled to a different CPU, the
thread still uses the same ablkcipher_request.
CRs-fixed: 670391
Change-Id: Ib94281c719aedb1d53b853d38085d2e3f884665b
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Patch-mainline: dm-devel @ 04/05/14, 14:04
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Support Per-File-Encryption (PFE) based on file tagging.
The Per-File-Tagger (PFT) reports if a file should be encrypted or not.
The Per-File-Encryption can be used after Full-Disk-Encryption (FDE) was
completed or without FDE.
The PFE and FDE uses different keys, which are managed by the Trust-Zone.
Change-Id: I727ef11e252649f895a5e3f8a49ca848cea50795
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
commit da1aab3dca9aa88ae34ca392470b8943159e25fe upstream.
When performing a user-request check/repair (MD_RECOVERY_REQUEST is set)
on a raid1, we allocate multiple bios each with their own set of pages.
If the page allocations for one bio fails, we currently do *not* free
the pages allocated for the previous bios, nor do we free the bio itself.
This patch frees all the already-allocate pages, and makes sure that
all the bios are freed as well.
This bug can cause a memory leak which can ultimately OOM a machine.
It was introduced in 3.10-rc1.
Fixes: a07876064a
Cc: Kent Overstreet <koverstreet@google.com>
Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
HW crypto provides different instance for different use cases.
Dm-req-crypt should use the one for full disk encryption use case.
Change-Id: I22d3f6ab1dd6f83b0d5b83a681236719e2340934
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
commit fe76cd88e654124d1431bb662a0fc6e99ca811a5 upstream.
If unable to ensure_next_mapping() we must add the current bio, which
was removed from the @bios list via bio_list_pop, back to the
deferred_bios list before all the remaining @bios.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a9d45396f5956d0b615c7ae3b936afd888351a47 upstream.
The persistent-data library used by dm-thin, dm-cache, etc is
transactional. If anything goes wrong, such as an io error when writing
new metadata or a power failure, then we roll back to the last
transaction.
Atomicity when committing a transaction is achieved by:
a) Never overwriting data from the previous transaction.
b) Writing the superblock last, after all other metadata has hit the
disk.
This commit and the following commit ("dm: take care to copy the space
map roots before locking the superblock") fix a bug associated with (b).
When committing it was possible for the superblock to still be written
in spite of an io error occurring during the preceeding metadata flush.
With these commits we're careful not to take the write lock out on the
superblock until after the metadata flush has completed.
Change the transaction manager's semantics for dm_tm_commit() to assume
all data has been flushed _before_ the single superblock that is passed
in.
As a prerequisite, split the block manager's block unlocking and
flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush(). Now
the unlocking must be done separately.
This issue was discovered by forcing io errors at the crucial time
using dm-flakey.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dm-req-crypt defines optional functions from device mapper. These
functions are not required for dm-req-crypt. dm-req-crypt has few
memory de-allocation issues. This change removes unnecessary
functions and fixes memory deallocation issues as well.
Change-Id: Id36bd13989940ce41571a8a10277730048df6f6d
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
The following commits have been reverted from this merge, as they are
known to introduce new bugs and are currently incompatible with our
audio implementation. Investigation of these commits is ongoing, and
they are expected to be brought in at a later time:
86e6de7 ALSA: compress: fix drain calls blocking other compress functions (v6)
16442d4 ALSA: compress: fix drain calls blocking other compress functions
This merge commit also includes a change in block, necessary for
compilation. Upstream has modified elevator_init_fn to prevent race
conditions, requring updates to row_init_queue and test_init_queue.
* commit 'v3.10.28': (1964 commits)
Linux 3.10.28
ARM: 7938/1: OMAP4/highbank: Flush L2 cache before disabling
drm/i915: Don't grab crtc mutexes in intel_modeset_gem_init()
serial: amba-pl011: use port lock to guard control register access
mm: Make {,set}page_address() static inline if WANT_PAGE_VIRTUAL
md/raid5: Fix possible confusion when multiple write errors occur.
md/raid10: fix two bugs in handling of known-bad-blocks.
md/raid10: fix bug when raid10 recovery fails to recover a block.
md: fix problem when adding device to read-only array with bitmap.
drm/i915: fix DDI PLLs HW state readout code
nilfs2: fix segctor bug that causes file system corruption
thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only
ftrace/x86: Load ftrace_ops in parameter not the variable holding it
SELinux: Fix possible NULL pointer dereference in selinux_inode_permission()
writeback: Fix data corruption on NFS
hwmon: (coretemp) Fix truncated name of alarm attributes
vfs: In d_path don't call d_dname on a mount point
staging: comedi: adl_pci9111: fix incorrect irq passed to request_irq()
staging: comedi: addi_apci_1032: fix subdevice type/flags bug
mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
GFS2: Increase i_writecount during gfs2_setattr_chown
perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h
perf scripting perl: Fix build error on Fedora 12
ARM: 7815/1: kexec: offline non panic CPUs on Kdump panic
Linux 3.10.27
sched: Guarantee new group-entities always have weight
sched: Fix hrtimer_cancel()/rq->lock deadlock
sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
sched: Fix race on toggling cfs_bandwidth_used
x86, fpu, amd: Clear exceptions in AMD FXSAVE workaround
netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
SCSI: sd: Reduce buffer size for vpd request
intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.
mac80211: move "bufferable MMPDU" check to fix AP mode scan
ACPI / Battery: Add a _BIX quirk for NEC LZ750/LS
ACPI / TPM: fix memory leak when walking ACPI namespace
mfd: rtsx_pcr: Disable interrupts before cancelling delayed works
clk: exynos5250: fix sysmmu_mfc{l,r} gate clocks
clk: samsung: exynos5250: Add CLK_IGNORE_UNUSED flag for the sysreg clock
clk: samsung: exynos4: Correct SRC_MFC register
clk: clk-divider: fix divisor > 255 bug
ahci: add PCI ID for Marvell 88SE9170 SATA controller
parisc: Ensure full cache coherency for kmap/kunmap
drm/nouveau/bios: make jump conditional
ARM: shmobile: mackerel: Fix coherent DMA mask
ARM: shmobile: armadillo: Fix coherent DMA mask
ARM: shmobile: kzm9g: Fix coherent DMA mask
ARM: dts: exynos5250: Fix MDMA0 clock number
ARM: fix "bad mode in ... handler" message for undefined instructions
ARM: fix footbridge clockevent device
net: Loosen constraints for recalculating checksum in skb_segment()
bridge: use spin_lock_bh() in br_multicast_set_hash_max
netpoll: Fix missing TXQ unlock and and OOPS.
net: llc: fix use after free in llc_ui_recvmsg
virtio-net: fix refill races during restore
virtio_net: don't leak memory or block when too many frags
virtio-net: make all RX paths handle errors consistently
virtio_net: fix error handling for mergeable buffers
vlan: Fix header ops passthru when doing TX VLAN offload.
net: rose: restore old recvmsg behavior
rds: prevent dereference of a NULL device
ipv6: always set the new created dst's from in ip6_rt_copy
net: fec: fix potential use after free
hamradio/yam: fix info leak in ioctl
drivers/net/hamradio: Integer overflow in hdlcdrv_ioctl()
net: inet_diag: zero out uninitialized idiag_{src,dst} fields
ip_gre: fix msg_name parsing for recvfrom/recvmsg
net: unix: allow bind to fail on mutex lock
ipv6: fix illegal mac_header comparison on 32bit
netvsc: don't flush peers notifying work during setting mtu
tg3: Initialize REG_BASE_ADDR at PCI config offset 120 to 0
net: unix: allow set_peek_off to fail
net: drop_monitor: fix the value of maxattr
ipv6: don't count addrconf generated routes against gc limit
packet: fix send path when running with proto == 0
virtio: delete napi structures from netdev before releasing memory
macvtap: signal truncated packets
tun: update file current position
macvtap: update file current position
macvtap: Do not double-count received packets
rds: prevent BUG_ON triggered on congestion update to loopback
net: do not pretend FRAGLIST support
IPv6: Fixed support for blackhole and prohibit routes
HID: Revert "Revert "HID: Fix logitech-dj: missing Unifying device issue""
gpio-rcar: R-Car GPIO IRQ share interrupt
clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
irqchip: renesas-irqc: Fix irqc_probe error handling
Linux 3.10.26
sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
ext4: fix bigalloc regression
arm64: Use Normal NonCacheable memory for writecombine
arm64: Do not flush the D-cache for anonymous pages
arm64: Avoid cache flushing in flush_dcache_page()
ARM: KVM: arch_timers: zero CNTVOFF upon return to host
ARM: hyp: initialize CNTVOFF to zero
clocksource: arch_timer: use virtual counters
arm64: Remove unused cpu_name ascii in arch/arm64/mm/proc.S
arm64: dts: Reserve the memory used for secondary CPU release address
arm64: check for number of arguments in syscall_get/set_arguments()
arm64: fix possible invalid FPSIMD initialization state
...
Change-Id: Ia0e5d71b536ab49ec3a1179d59238c05bdd03106
Signed-off-by: Ian Maund <imaund@codeaurora.org>
commit e893fba90c09f9b57fb97daae204ea9cc2c52fa5 upstream.
In order to avoid wasting cache space a partial block at the end of the
origin device is not cached. Unfortunately, the check for such a
partial block at the end of the origin device was flawed.
Fix accesses beyond the end of the origin device that occured due to
attempted promotion of an undetected partial block by:
- initializing the per bio data struct to allow cache_end_io to work properly
- recognizing access to the partial block at the end of the origin device
- avoiding out of bounds access to the discard bitset
Otherwise, users can experience errors like the following:
attempt to access beyond end of device
dm-5: rw=0, want=20971520, limit=20971456
...
device-mapper: cache: promotion failed; couldn't copy block
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8b9d96666529a979acf4825391efcc7c8a3e9f12 upstream.
During demotion or promotion to a cache's >2TB fast device we must not
truncate the cache block's associated sector to 32bits. The 32bit
temporary result of from_cblock() caused a 32bit multiplication when
calculating the sector of the fast device in issue_copy_real().
Use an intermediate 64bit type to store the 32bit from_cblock() to allow
for proper 64bit multiplication.
Here is an example of how this bug manifests on an ext4 filesystem:
EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1acacc0784aab45627b6009e0e9224886279ac0b upstream.
dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
fails in thin_ctr(). Otherwise __pool_destroy() will fail because the
pool will still have an open thin device:
device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.
Also, must establish error code if failing thin_ctr() because the pool
is in fail_io mode.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4d1662a30dde6e545086fe0e8fd7e474c4e0b639 upstream.
Commit 905e51b ("dm thin: commit outstanding data every second")
introduced a periodic commit. This commit occurs regardless of whether
any thin devices have made changes.
Fix the periodic commit to check if any of a pool's thin devices have
changed using dm_pool_changed_this_transaction().
Reported-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a1989b330093578ea5470bea0a00f940c444c466 upstream.
An invalid ioctl will never be valid, irrespective of whether multipath
has active paths or not. So for invalid ioctls we do not have to wait
for multipath to activate any paths, but can rather return an error
code immediately. This fix resolves numerous instances of:
udevd[]: worker [] unexpectedly returned with status 0x0100
that have been seen during testing.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dm-req-crypt currently uses AES 128 bit. Updating dm-req-crypt
to use AES 256 bit instead.
Change-Id: I2e8a4484b25496d53f1f9aa83ec6c0ed7b947901
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
commit 789b5e0315284463617e106baad360cb9e8db3ac upstream.
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:
register_cpu_notifier(&foobar_cpu_notifier);
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
put_online_cpus();
A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.
So reorganize the code in raid5 this way to fix the deadlock with callback
registration.
Cc: linux-raid@vger.kernel.org
Fixes: 36d1c6476b
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1877db75589a895bbdc4c4c3f23558e57b521141 upstream.
commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
md/raid1: fix bio handling problems in process_checks()
Move the bio_reset() to a point before where BIO_UPTODATE is checked,
so that check now always report that the bio is uptodate, even if it is not.
This causes process_check() to sometimes treat read-errors as
successful matches so the good data isn't written out.
This patch preserves the flag until it is needed.
Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed
an even worse bug). So suitable for any -stable since 3.10.
Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru>
Fixed: 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2995fa78e423d7193f3b57835f6c1c75006a0315 upstream.
This reverts commit be35f48610 ("dm: wait until embedded kobject is
released before destroying a device") and provides an improved fix.
The kobject release code that calls the completion must be placed in a
non-module file, otherwise there is a module unload race (if the process
calling dm_kobject_release is preempted and the DM module unloaded after
the completion is triggered, but before dm_kobject_release returns).
To fix this race, this patch moves the completion code to dm-builtin.c
which is always compiled directly into the kernel if BLK_DEV_DM is
selected.
The patch introduces a new dm_kobject_holder structure, its purpose is
to keep the completion and kobject in one place, so that it can be
accessed from non-module code without the need to export the layout of
struct mapped_device to that code.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fca028438fb903852beaf7c3fe1cd326651af57d upstream.
This bug was introduced in commit 7e664b3dec431e ("dm space map metadata:
fix extending the space map").
When extending a dm-thin metadata volume we:
- Switch the space map into a simple bootstrap mode, which allocates
all space linearly from the newly added space.
- Add new bitmap entries for the new space
- Increment the reference counts for those newly allocated bitmap
entries
- Commit changes to disk
- Switch back out of bootstrap mode.
But, the disk commit may allocate space itself, if so this fact will be
lost when switching out of bootstrap mode.
The bug exhibited itself as an error when the bitmap_root, with an
erroneous ref count of 0, was subsequently decremented as part of a
later disk commit. This would cause the disk commit to fail, and thinp
to enter read_only mode. The metadata was not damaged (thin_check
passed).
The fix is to put the increments + commit into a loop, running until
the commit has not allocated extra space. In practise this loop only
runs twice.
With this fix the following device mapper testsuite test passes:
dmtest run --suite thin-provisioning -n thin_remove_works_after_resize
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7e664b3dec431eebf0c5df5ff704d6197634cf35 upstream.
When extending a metadata space map we should do the first commit whilst
still in bootstrap mode -- a mode where all blocks get allocated in the
new area.
That way the commit overhead is allocated from the newly added space.
Otherwise we risk running out of space.
With this fix, and the previous commit "dm space map common: make sure
new space is used during extend", the following device mapper testsuite
test passes:
dmtest run --suite thin-provisioning -n /resize_metadata_no_io/
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 12c91a5c2d2a8e8cc40a9552313e1e7b0a2d9ee3 upstream.
When extending a low level space map we should update nr_blocks at
the start so the new space is used for the index entries.
Otherwise extend can fail, e.g.: sm_metadata_extend call sequence
that fails:
-> sm_ll_extend
-> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block
=> returns -ENOSPC because smm->begin == smm->ll.nr_blocks
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit be35f486108227e10fe5d96fd42fb2b344c59983 upstream.
There may be other parts of the kernel holding a reference on the dm
kobject. We must wait until all references are dropped before
deallocating the mapped_device structure.
The dm_kobject_release method signals that all references are dropped
via completion. But dm_kobject_release doesn't free the kobject (which
is embedded in the mapped_device structure).
This is the sequence of operations:
* when destroying a DM device, call kobject_put from dm_sysfs_exit
* wait until all users stop using the kobject, when it happens the
release method is called
* the release method signals the completion and should return without
delay
* the dm device removal code that waits on the completion continues
* the dm device removal code drops the dm_mod reference the device had
* the dm device removal code frees the mapped_device structure that
contains the kobject
Using kobject this way should avoid the module unload race that was
mentioned at the beginning of this thread:
https://lkml.org/lkml/2014/1/4/83
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 16961b042db8cc5cf75d782b4255193ad56e1d4f upstream.
As additional members are added to the dm_thin_new_mapping structure
care should be taken to make sure they get initialized before use.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 19fa1a6756ed9e92daa9537c03b47d6b55cc2316 upstream.
If a snapshot is created and later deleted the origin dm_thin_device's
snapshotted_time will have been updated to reflect the snapshot's
creation time. The 'shared' flag in the dm_thin_lookup_result struct
returned from dm_thin_find_block() is an approximation based on
snapshotted_time -- this is done to avoid 0(n), or worse, time
complexity. In this case, the shared flag would be true.
But because the 'shared' flag reflects an approximation a block can be
incorrectly assumed to be shared (e.g. false positive for 'shared'
because the snapshot no longer exists). This could result in discards
issued to a thin device not being passed down to the pool's underlying
data device.
To fix this we double check that a thin block is really still in-use
after a mapping is removed using dm_pool_block_is_used(). If the
reference count for a block is now zero the discard is allowed to be
passed down.
Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
structure -- reflects that the 'shared' flag in the response from
dm_thin_find_block() can only be held as definitive if false is
returned.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ef71ec00002d92a08eb27e9d036e3d48835b6597 upstream.
The code that handles overlapping extents that we've just read back in from disk
was depending on the behaviour of the code that handles overlapping extents as
we're inserting into a btree node in the case of an insert that forced an
existing extent to be split: on insert, if we had to split we'd also insert a
new extent to represent the top part of the old extent - and then that new
extent would get written out.
The code that read the extents back in thus not bother with splitting extents -
if it saw an extent that ovelapped in the middle of an older extent, it would
trim the old extent to only represent the bottom part, assuming that the
original insert would've inserted a new extent to represent the top part.
I still haven't figured out _how_ it can happen, but I'm now pretty convinced
(and testing has confirmed) that there's some kind of an obscure corner case
(probably involving extent merging, and multiple overwrites in different sets)
that breaks this. The fix is to change the mergesort fixup code to split extents
itself when required.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9f97e4b128d2ea90a5f5063ea0ee3b0911f4c669 upstream.
Before a write starts we set a bit in the write-intent bitmap.
When the write completes we clear that bit if the write was successful
to all devices. However if the write wasn't fully successful we
should not clear the bit. If the faulty drive is subsequently
re-added, the fact that the bit is still set ensure that we will
re-write the data that is missing.
This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
bitmap bit when this flag is not set.
Currently we correctly set the flag if a write starts when some
devices are failed or missing. But we do *not* set the flag if some
device failed during the write attempt.
This is wrong and can result in clearing the bit inappropriately.
So: set the flag when a write fails.
This bug has been present since bitmaps were introduces, so the fix is
suitable for any -stable kernel.
Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dm-req-crypt uses AES-XTS algorithm implemented by HW crypto
engine. Crypto driver renamed the aes-xts algorithm to avoid
conflict with SW based implementation of AES-XTS. dm-req-crypt
must use the new name for AES-XTS provided by crypto driver.
Change-Id: I286b6435af33fd64673c7c1af4bb2447f3115868
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
commit 1cc03eb93245e63b0b7a7832165efdc52e25b4e6 upstream.
commit 5d8c71f9e5
md: raid5 crash during degradation
Fixed a crash in an overly simplistic way which could leave
R5_WriteError or R5_MadeGood set in the stripe cache for devices
for which it is no longer relevant.
When those devices are removed and spares added the flags are still
set and can cause incorrect behaviour.
commit 14a75d3e07
md/raid5: preferentially read from replacement device if possible.
Fixed the same bug if a more effective way, so we can now revert
the original commit.
Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 5d8c71f9e5
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b50c259e25d9260b9108dc0c2964c26e5ecbe1c1 upstream.
If we discover a bad block when reading we split the request and
potentially read some of it from a different device.
The code path of this has two bugs in RAID10.
1/ we get a spin_lock with _irq, but unlock without _irq!!
2/ The calculation of 'sectors_handled' is wrong, as can be clearly
seen by comparison with raid1.c
This leads to at least 2 warnings and a probable crash is a RAID10
ever had known bad blocks.
Fixes: 856e08e237
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e8b849158508565e0cd6bc80061124afc5879160 upstream.
commit e875ecea26
md/raid10 record bad blocks as needed during recovery.
added code to the "cannot recover this block" path to record a bad
block rather than fail the whole recovery.
Unfortunately this new case was placed *after* r10bio was freed rather
than *before*, yet it still uses r10bio.
This is will crash with a null dereference.
So move the freeing of r10bio down where it is safe.
Fixes: e875ecea26
Reported-by: Damian Nowak <spam@nowaker.net>
URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97 upstream.
If an array is started degraded, and then the missing device
is found it can be re-added and a minimal bitmap-based recovery
will bring it fully up-to-date.
If the array is read-only a recovery would not be allowed.
But also if the array is read-only and the missing device was
present very recently, then there could be no need for any
recovery at all, so we simply include the device in the read-only
array without any recovery.
However... if the missing device was removed a little longer ago
it could be missing some updates, but if a bitmap is present it will
be conditionally accepted pending a bitmap-based update. We don't
currently detect this case properly and will include that old
device into the read-only array with no recovery even though it really
needs a recovery.
This patch keeps track of whether a bitmap-based-recovery is really
needed or not in the new Bitmap_sync rdev flag. If that is set,
then the device will not be added to a read-only array.
Cc: Andrei Warkentin <andreiw@vmware.com>
Fixes: d70ed2e4fa
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dm-crypt provides bios based device mapper module. dm-crypt
operates on packets with 512 bytes size which is not effiicent
way for HW based crypto blocks. dm-req-crypt is developed to
address this. dm-req-crypt works on requests which carry upto
512KB of data for unmerged requests.
Change-Id: I7d6a63d516dc2dbe80f46c06dd0722847d55bc9f
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
commit fafc7a815e40255d24e80a1cb7365892362fa398 upstream.
Switch the thin pool to read-only mode when dm_thin_insert_block() fails
since there is little reason to expect the cause of the failure to be
resolved without further action by user space.
This issue was noticed with the device-mapper-test-suite using:
dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
The quantity of errors logged in this case must be reduced.
before patch:
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: dm_thin_insert_block() failed
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map metadata: unable to allocate new metadata block
<snip ... these repeat for a long while ... >
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: space map common: dm_tm_shadow_block() failed
device-mapper: thin: 253:4: no free metadata space available.
device-mapper: thin: 253:4: switching pool to read-only mode
after patch:
device-mapper: space map metadata: unable to allocate new metadata block
device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28
device-mapper: thin: 253:4: switching pool to read-only mode
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5b2d06576c5410c10d95adfd5c4d8b24de861d87 upstream.
The dm_round_up function may overflow to zero. In this case,
dm_table_create() must fail rather than go on to allocate an empty array
with alloc_targets().
This fixes a possible memory corruption that could be caused by passing
too large a number in "param->target_count".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 718822c1c112dc99e0c72c8968ee1db9d9d910f0 upstream.
The dm-delay target uses a shared workqueue for multiple instances. This
can cause deadlock if two or more dm-delay targets are stacked on the top
of each other.
This patch changes dm-delay to use a per-instance workqueue.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ed9571f0cf1fe09d3506302610f3ccdfa1d22c4a upstream.
An old array block could have its reference count decremented below
zero when it is being replaced in the btree by a new array block.
The fix is to increment the old ablock's reference count just before
inserting a new ablock into the btree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 230c83afdd9cd384348475bea1e14b80b3b6b1b8 upstream.
There is a possible leak of snapshot space in case of crash.
The reason for space leaking is that chunks in the snapshot device are
allocated sequentially, but they are finished (and stored in the metadata)
out of order, depending on the order in which copying finished.
For example, supposed that the metadata contains the following records
SUPERBLOCK
METADATA (blocks 0 ... 250)
DATA 0
DATA 1
DATA 2
...
DATA 250
Now suppose that you allocate 10 new data blocks 251-260. Suppose that
copying of these blocks finish out of order (block 260 finished first
and the block 251 finished last). Now, the snapshot device looks like
this:
SUPERBLOCK
METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256)
DATA 0
DATA 1
DATA 2
...
DATA 250
DATA 251
DATA 252
DATA 253
DATA 254
DATA 255
METADATA (blocks 255, 254, 253, 252, 251)
DATA 256
DATA 257
DATA 258
DATA 259
DATA 260
Now, if the machine crashes after writing the first metadata block but
before writing the second metadata block, the space for areas DATA 250-255
is leaked, it contains no valid data and it will never be used in the
future.
This patch makes dm-snapshot complete exceptions in the same order they
were allocated, thus fixing this bug.
Note: when backporting this patch to the stable kernel, change the version
field in the following way:
* if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0}
* if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it
to {1, 10, 2}
Userspace reads the version to determine if the bug was fixed, so the
version change is needed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4cb57ab4a2e61978f3a9b7d4f53988f30d61c27f upstream.
Some module parameters in dm-bufio are read-only. These parameters
inform the user about memory consumption. They are not supposed to be
changed by the user.
However, despite being read-only, these parameters can be set on
modprobe or insmod command line, for example:
modprobe dm-bufio current_allocated_bytes=12345
The kernel doesn't expect that these variables can be non-zero at module
initialization and if the user sets them, it results in BUG.
This patch initializes the variables in the module init routine, so that
user-supplied values are ignored.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 02e5f5c0a0f726e66e3d8506ea1691e344277969 upstream.
The various ->run routines of md personalities assume that the 'queue'
has been initialised by the blk_set_stacking_limits() call in
md_alloc().
However when the level is changed (by level_store()) the ->run routine
for the new level is called for an array which has already had the
stacking limits modified. This can result in incorrect final
settings.
So call blk_set_stacking_limits() before ->run in level_store().
A specific consequence of this bug is that it causes
discard_granularity to be set incorrectly when reshaping a RAID4 to a
RAID0.
This is suitable for any -stable kernel since 3.3 in which
blk_set_stacking_limits() was introduced.
Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f36afb3957353d2529cb2b00f78fdccd14fc5e9c upstream.
dm-mpath and dm-thin must process messages even if some device is
suspended, so we allocate argv buffer with GFP_NOIO. These messages have
a small fixed number of arguments.
On the other hand, dm-switch needs to process bulk data using messages
so excessive use of GFP_NOIO could cause trouble.
The patch also lowers the default number of arguments from 64 to 8, so
that there is smaller load on GFP_NOIO allocations.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 66cb1910df17b38334153462ec8166e48058035f upstream.
The code that was trying to do this was inadequate. The postsuspend
method (in ioctl context), needs to wait for the worker thread to
acknowledge the request to quiesce. Otherwise the migration count may
drop to zero temporarily before the worker thread realises we're
quiescing. In this case the target will be taken down, but the worker
thread may have issued a new migration, which will cause an oops when
it completes.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9c1d4de56066e4d6abc66ec188faafd7b303fb08 upstream.
Entries would be lost if the old tail block was partially filled.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 954a73d5d3073df2231820c718fdd2f18b0fe4c9 upstream.
Whenever multipath_dtr() is happening we must prevent queueing any
further path activation work. Implement this by adding a new
'pg_init_disabled' flag to the multipath structure that denotes future
path activation work should be skipped if it is set. By disabling
pg_init and then re-enabling in flush_multipath_work() we also avoid the
potential for pg_init to be initiated while suspending an mpath device.
Without this patch a race condition exists that may result in a kernel
panic:
1) If after pg_init_done() decrements pg_init_in_progress to 0, a call
to wait_for_pg_init_completion() assumes there are no more pending path
management commands.
2) If pg_init_required is set by pg_init_done(), due to retryable
mode_select errors, then process_queued_ios() will again queue the
path activation work.
3) If free_multipath() completes before activate_path() work is called a
NULL pointer dereference like the following can be seen when
accessing members of the recently destructed multipath:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
RIP: 0010:[<ffffffffa003db1b>] [<ffffffffa003db1b>] activate_path+0x1b/0x30 [dm_multipath]
[<ffffffff81090ac0>] worker_thread+0x170/0x2a0
[<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
[switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer]
Signed-off-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com>
Reviewed-by: Krishnasamy Somasundaram <somasundaram.krishnasamy@netapp.com>
Tested-by: Speagle Andy <Andy.Speagle@netapp.com>
Acked-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 61e4947c99c4494336254ec540c50186d186150b upstream.
Since:
commit 7ceb17e87b
md: Allow devices to be re-added to a read-only array.
spares are activated on a read-only array. In case of raid1 and raid10
personalities it causes that not-in-sync devices are marked in-sync
without checking if recovery has been finished.
If a read-only array is degraded and one of its devices is not in-sync
(because the array has been only partially recovered) recovery will be skipped.
This patch adds checking if recovery has been finished before marking a device
in-sync for raid1 and raid10 personalities. In case of raid5 personality
such condition is already present (at raid5.c:6029).
Bug was introduced in 3.10 and causes data corruption.
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d47648fcf0611812286f68131b40251c6fa54f5e upstream.
SCSI discard will damage discard stripe bio setting, eg, some fields are
changed. If the stripe is reused very soon, we have wrong bios setting. We
remove discard stripe from hash list, so next time the strip will be fully
initialized.
Suitable for backport to 3.7+.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 37c61ff31e9b5e3fcf3cc6579f5c68f6ad40c4b1 upstream.
SCSI layer will add new payload for discard request. If two bios are merged
to one, the second bio has bi_vcnt 1 which is set in raid5. This will confuse
SCSI and cause oops.
Suitable for backport to 3.7+
Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e9c6a182649f4259db704ae15a91ac820e63b0ca upstream.
This patch fixes a particular type of data corruption that has been
encountered when loading a snapshot's metadata from disk.
When we allocate a new chunk in persistent_prepare, we increment
ps->next_free and we make sure that it doesn't point to a metadata area
by further incrementing it if necessary.
When we load metadata from disk on device activation, ps->next_free is
positioned after the last used data chunk. However, if this last used
data chunk is followed by a metadata area, ps->next_free is positioned
erroneously to the metadata area. A newly-allocated chunk is placed at
the same location as the metadata area, resulting in data or metadata
corruption.
This patch changes the code so that ps->next_free skips the metadata
area when metadata are loaded in function read_exceptions.
The patch also moves a piece of code from persistent_prepare_exception
to a separate function skip_metadata to avoid code duplication.
CVE-2013-4299
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2fe80d3bbf1c8bd9efc5b8154207c8dd104e7306 upstream.
Commit c0f04d88e46d ("bcache: Fix flushes in writeback mode") was fixing
a reported data corruption bug, but it seems some last minute
refactoring or rebasing introduced a null pointer deref.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3f6bbd3ffd7b733dd705e494663e5761aa2cb9c1 upstream.
This doesn't really need to be initialised, but it doesn't hurt,
silences the compiler, and as it is a counter it makes sense for it to
start at zero.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f84cb8a46a771f36a04a02c61ea635c968ed5f6a upstream.
Workaround the SCSI layer's problematic WRITE SAME heuristics by
disabling WRITE SAME in the DM multipath device's queue_limits if an
underlying device disabled it.
The WRITE SAME heuristics, with both the original commit 5db44863b6
("[SCSI] sd: Implement support for WRITE SAME") and the updated commit
66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling
WRITE SAME(10) even without successfully determining it is supported.
After the first failed WRITE SAME the SCSI layer will disable WRITE SAME
for the device (by setting sdkp->device->no_write_same which results in
'max_write_same_sectors' in device's queue_limits to be set to 0).
When a device is stacked ontop of such a SCSI device any changes to that
SCSI device's queue_limits do not automatically propagate up the stack.
As such, a DM multipath device will not have its WRITE SAME support
disabled. This causes the block layer to continue to issue WRITE SAME
requests to the mpath device which causes paths to fail and (if mpath IO
isn't configured to queue when no paths are available) it will result in
actual IO errors to the upper layers.
This fix doesn't help configurations that have additional devices
stacked ontop of the mpath device (e.g. LVM created linear DM devices
ontop). A proper fix that restacks all the queue_limits from the bottom
of the device stack up will need to be explored if SCSI will continue to
use this model of optimistically allowing op codes and then disabling
them after they fail for the first time.
Before this patch:
EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
device-mapper: multipath: Failing path 8:112.
end_request: I/O error, dev dm-6, sector 4616
dm-6: WRITE SAME failed. Manually zeroing.
end_request: I/O error, dev dm-6, sector 4616
end_request: I/O error, dev dm-6, sector 5640
end_request: I/O error, dev dm-6, sector 6664
end_request: I/O error, dev dm-6, sector 7688
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
end_request: I/O error, dev dm-6, sector 524296
Aborting journal on device dm-6-8.
end_request: I/O error, dev dm-6, sector 524288
Buffer I/O error on device dm-6, logical block 65536
lost page write due to I/O error on dm-6
JBD2: Error -5 detected when updating journal superblock for dm-6-8.
# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
33553920
After this patch:
EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
end_request: critical target error, dev dm-6, sector 528
dm-6: WRITE SAME failed. Manually zeroing.
# cat /sys/block/sdh/queue/write_same_max_bytes
0
# cat /sys/block/dm-6/queue/write_same_max_bytes
0
It should be noted that WRITE SAME support wasn't enabled in DM
multipath until v3.10.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 60e356f381954d79088d0455e357db48cfdd6857 upstream.
LVM2, since version 2.02.96, creates origin with zero size, then loads
the snapshot driver and then loads the origin. Consequently, the
snapshot driver sees the origin size zero and sets the hash size to the
lower bound 64. Such small hash table causes performance degradation.
This patch changes it so that the hash size is determined by the size of
snapshot volume, not minimum of origin and snapshot size. It doesn't
make sense to set the snapshot size significantly larger than the origin
size, so we do not need to take origin size into account when
calculating the hash size.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5ea330a75bd86b2b2a01d7b85c516983238306fb upstream.
The kernel reports a lockdep warning if a snapshot is invalidated because
it runs out of space.
The lockdep warning was triggered by commit 0976dfc1d0
("workqueue: Catch more locking problems with flush_work()") in v3.5.
The warning is false positive. The real cause for the warning is that
the lockdep engine treats different instances of md->lock as a single
lock.
This patch is a workaround - we use flush_workqueue instead of flush_work.
This code path is not performance sensitive (it is called only on
initialization or invalidation), thus it doesn't matter that we flush the
whole workqueue.
The real fix for the problem would be to teach the lockdep engine to treat
different instances of md->lock as separate locks.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c0f04d88e46d14de51f4baebb6efafb7d59e9f96 upstream.
In writeback mode, when we get a cache flush we need to make sure we
issue a flush to the backing device.
The code for sending down an extra flush was wrong - by cloning the bio
we were probably getting flags that didn't make sense for a bare flush,
and also the old code was firing for FUA bios, for which we don't need
to send a flush to the backing device.
This was causing data corruption somehow - the mechanism was never
determined, but this patch fixes it for the users that were seeing it.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 84786438ed17978d72eeced580ab757e4da8830b upstream.
btree_sort_fixup() was overly clever, because it was trying to avoid
pulling a key off the btree iterator in more than one place.
This led to a really obscure bug where we'd break early from the loop in
btree_sort_fixup() if the current key overlapped with keys in more than
one older set, and the next key it overlapped with was zero size.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a698e08c82dfb9771e0bac12c7337c706d729b6d upstream.
GFP_NOIO means we could be getting called recursively - mca_alloc() ->
mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then.
Whoops.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1394d6761b6e9e15ee7c632a6d48791188727b40 upstream.
bch_journal_meta() was missing the flush to make the journal write
actually go down (instead of waiting up to journal_delay_ms)...
Whoops
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c2a4f3183a1248f615a695fbd8905da55ad11bba upstream.
Background writeback works by scanning the btree for dirty data and
adding those keys into a fixed size buffer, then for each dirty key in
the keybuf writing it to the backing device.
When read_dirty() finishes and it's time to scan for more dirty data, we
need to wait for the outstanding writeback IO to finish - they still
take up slots in the keybuf (so that foreground writes can check for
them to avoid races) - without that wait, we'll continually rescan when
we'll be able to add at most a key or two to the keybuf, and that takes
locks that starves foreground IO. Doh.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c426c4fd46f709ade2bddd51c5738729c7ae1db5 upstream.
The journal replay code didn't handle this case, causing it to go into
an infinite loop...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aee6f1cfff3ce240eb4b43b41ca466b907acbd2e upstream.
sysfs attributes with unusual characters have crappy failure modes
in Squeeze (udev 164); later versions of udev are unaffected.
This should make these characters more unusual.
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6d9d21e35fbfa2934339e96934f862d118abac23 upstream.
That switch statement was obviously wrong, leading to some sort of weird
spinning on rare occasion with discards enabled...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e49c7c374e7aacd1f04ecbc21d9dbbeeea4a77d6 upstream.
Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node
writes have... weird ordering requirements.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b1bf2de07271932326af847a3c6a01fdfd29d4be upstream.
Fix a boundary condition that caused failure for certain device sizes.
The problem is reported at
http://code.google.com/p/cryptsetup/issues/detail?id=160
For certain device sizes the number of hashes at a specific level was
calculated incorrectly.
It happens for example for a device with data and metadata block size 4096
that has 16385 blocks and algorithm sha256.
The user can test if he is affected by this bug by running the
"veritysetup verify" command and also by activating the dm-verity kernel
driver and reading the whole block device. If it passes without an error,
then the user is not affected.
The condition for the bug is:
Split the total number of data blocks (data_block_bits) into bit strings,
each string has hash_per_block_bits bits. hash_per_block_bits is
rounddown(log2(metadata_block_size/hash_digest_size)). Equivalently, you
can say that you convert data_blocks_bits to 2^hash_per_block_bits base.
If there some zero bit string below the most significant bit string and at
least one bit below this zero bit string is set, then the bug happens.
The same bug exists in the userspace veritysetup tool, so you must use
fixed veritysetup too if you want to use devices that are affected by
this boundary condition.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1c0e883e86ece31880fac2f84b260545d66a39e0 upstream.
Set noio flag while calling __vmalloc() because it doesn't fully respect
gfp flags to avoid a possible deadlock (see commit
502624bdad).
This should be backported to stable kernels 3.8 and newer. The kernel 3.8
doesn't have memalloc_noio_save(), so we should set and restore process
flag PF_MEMALLOC instead.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6c182cd88d179cbbd06f4f8a8a19b6977940753f upstream.
When multipath needs to retry an ioctl the reference to the
current live table needs to be dropped. Otherwise a deadlock
occurs when all paths are down:
- dm_blk_ioctl takes a reference to the current table
and spins in multipath_ioctl().
- A new table is being loaded, but upon resume the process
hangs in dm_table_destroy() waiting for references to
drop to zero.
With this patch the reference to the old table is dropped
prior to retry, thereby avoiding the deadlock.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0eb25bb027a100f5a9df8991f2f628e7d851bc1e upstream.
We always need to be careful when calling generic_make_request, as it
can start a chain of events which might free something that we are
using.
Here is one place I wasn't careful enough. If the wbio2 is not in
use, then it might get freed at the first generic_make_request call.
So perform all necessary tests first.
This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an
oops, so fix is suitable for any -stable since then.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f94c0b6658c7edea8bc19d13be321e3860a3fa54 upstream.
If a device in a RAID4/5/6 is being replaced while another is being
recovered, then the writes to the replacement device currently don't
happen, resulting in corruption when the replacement completes and the
new drive takes over.
This is because the replacement writes are only triggered when
's.replacing' is set and not when the similar 's.sync' is set (which
is the case during resync and recovery - it means all devices need to
be read).
So schedule those writes when s.replacing is set as well.
In this case we cannot use "STRIPE_INSYNC" to record that the
replacement has happened as that is needed for recording that any
parity calculation is complete. So introduce STRIPE_REPLACED to
record if the replacement has happened.
For safety we should also check that STRIPE_COMPUTE_RUN is not set.
This has a similar effect to the "s.locked == 0" test. The latter
ensure that now IO has been flagged but not started. The former
checks if any parity calculation has been flagged by not started.
We must wait for both of these to complete before triggering the
'replace'.
Add a similar test to the subsequent check for "are we finished yet".
This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
but it makes it more obvious that the REPLACE will happen before we
think we are finished.
Finally if a NeedReplace device is not UPTODATE then that is an
error. We really must trigger a warning.
This bug was introduced in commit 9a3e1101b8
(md/raid5: detect and handle replacements during recovery.)
which introduced replacement for raid5.
That was in 3.3-rc3, so any stable kernel since then would benefit
from this fix.
Reported-by: qindehua <13691222965@163.com>
Tested-by: qindehua <qindehua@163.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a upstream.
Recent change to use bio_copy_data() in raid1 when repairing
an array is faulty.
The underlying may have changed the bio in various ways using
bio_advance and these need to be undone not just for the 'sbio' which
is being copied to, but also the 'pbio' (primary) which is being
copied from.
So perform the reset on all bios that were read from and do it early.
This also ensure that the sbio->bi_io_vec[j].bv_len passed to
memcmp is correct.
This fixes a crash during a 'check' of a RAID1 array. The crash was
introduced in 3.10 so this is suitable for 3.10-stable.
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5024c298311f3b97c85cb034f9edaa333fdb9338 upstream.
commit 7ceb17e87b
md: Allow devices to be re-added to a read-only array.
allowed a bit more than just that. It also allows devices to be added
to a read-write array and to end up skipping recovery.
This patch removes the offending piece of code pending a rewrite for a
subsequent release.
More specifically:
If the array has a bitmap, then the device will still need a bitmap
based resync ('saved_raid_disk' is set under different conditions
is a bitmap is present).
If the array doesn't have a bitmap, then this is correct as long as
nothing has been written to the array since the metadata was checked
by ->validate_super. However there is no locking to ensure that there
was no write.
Bug was introduced in 3.10 and causes data corruption so
patch is suitable for 3.10-stable.
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit faa5673617656ee58369a3cfe4a312cfcdc59c81 upstream.
The journal replay code starts by finding something that looks like a
valid journal entry, then it does a binary search over the unchecked
region of the journal for the journal entries with the highest sequence
numbers.
Trouble is, the logic was wrong - journal_read_bucket() returns true if
it found journal entries we need, but if the range of journal entries
we're looking for loops around the end of the journal - in that case
journal_read_bucket() could return true when it hadn't found the highest
sequence number we'd seen yet, and in that case the binary search did
the wrong thing. Whoops.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 29ebf465b9050f241c4433a796a32e6c896a9dcd upstream.
Part of the job of garbage collection is to add up however many sectors
of live data it finds in each bucket, but that doesn't work very well if
it doesn't reset GC_SECTORS_USED() when it starts. Whoops.
This wouldn't have broken anything horribly, but allocation tries to
preferentially reclaim buckets that are mostly empty and that's not
gonna work with an incorrect GC_SECTORS_USED() value.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c9502ea4424b31728703d113fc6b30bfead14633 upstream.
If we stopped a bcache device when we were already detaching (or
something like that), bcache_device_unlink() would try to remove a
symlink from sysfs that was already gone because the bcache dev kobject
had already been removed from sysfs.
So keep track of whether we've removed stuff from sysfs.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5caa52afc5abd1396e4af720469abb5843a71eb8 upstream.
Stopping a cache set is supposed to make it stop attached backing
devices, but somewhere along the way that code got lost. Fixing this
mainly has the effect of fixing our reboot notifier.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 54d12f2b4fd0f218590d1490b41a18d0e2328a9a upstream.
Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
out unless we say we support them...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6aa8f1a6ca41c49721d2de4e048d3da8d06411f9 upstream.
In the far-too-complicated closure code - closures can have destructors,
for probably dubious reasons; they get run after the closure is no
longer waiting on anything but before dropping the parent ref, intended
just for freeing whatever memory the closure is embedded in.
Trouble is, when remaining goes to 0 and we've got nothing more to run -
we also have to unlock the closure, setting remaining to -1. If there's
a destructor, that unlock isn't doing anything - nobody could be trying
to lock it if we're about to free it - but if the unlock _is needed...
that check for a destructor was racy. Argh.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7bb23c4934059c64cbee2e41d5d24ce122285176 upstream.
1/ When an different between blocks is found, data is copied from
one bio to the other. However bv_len is used as the length to
copy and this could be zero. So use r10_bio->sectors to calculate
length instead.
Using bv_len was probably always a bit dubious, but the introduction
of bio_advance made it much more likely to be a problem.
2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
except on bios that we schedule for a read. This ensures that
missing/failed devices don't confuse the loop at the top of
sync_request write.
Commit 8be185f2c9 "raid10: Use bio_reset()"
removed a loop which set BIO_UPTDATE on all appropriate bios.
So we need to re-add that flag.
These bugs were introduced in 3.10, so this patch is suitable for
3.10-stable, and can remove a potential for data corruption.
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 78eaa0d4cbcdb345992fa3dd22b3bcbb473cc064 upstream.
1/ If a RAID10 is being reshaped to a fewer number of devices
and is stopped while this is ongoing, then when the array is
reassembled the 'mirrors' array will be allocated too small.
This will lead to an access error or memory corruption.
2/ A sanity test for a reshaping RAID10 array is restarted
is slightly incorrect.
Due to the first bug, this is suitable for any -stable
kernel since 3.5 where this code was introduced.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1376512065b23f39d5f9a160948f313397dde972 upstream.
The recent comment:
commit 7e83ccbecd
md/raid10: Allow skipping recovery when clean arrays are assembled
Causes raid10 to skip a recovery in certain cases where it is safe to
do so. Unfortunately it also causes a reshape to be skipped which is
never safe. The result is that an attempt to reshape a RAID10 will
appear to complete instantly, but no data will have been moves so the
array will now contain garbage.
(If nothing is written, you can recovery by simple performing the
reverse reshape which will also complete instantly).
Bug was introduced in 3.10, so this is suitable for 3.10-stable.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Martin Wilck <mwilck@arcor.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some tagged for -stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUbl1mznsnt1WYoG5AQKGlQ//eixdawF+DUK5hadqZ9EDni+BAVzb7m69
+zU6ilQ7UOh7bxtAoJqrgFVykK+LG8wvYsEBwMjB9oRDLA96/YDXXiBzXHvd6mGh
g271lwMTQ9h+O8L6psLUX6qsrH3i7SJmF8ySPKi6Fe5ruT8ToOB8Ii8XQebEZdXo
VOzRz2VgSTcBdrTyKPDsBJByDQX36hsK8Gs5YSl5F3nvyV4dvGWMlyoTF1TRRt9K
YCCZ8pSk3kTXaSdl0syrJxI17pEUC8mtcA01S6JD/GV49CGO8LYAckVJ4ijWw7VV
IGGlH0DsYSMgJ7yyuLz4ifaqRnsWsAGW0WyiZYYKvjtNUiyBuBBbo2cQ1lNkR5p4
jnLhpJJVh0hLCPn6wcCWIBIdT/mFaBpXkvZPd3ks5kefGXsfpVPm0fK8r0fzkzgy
tJCZtZFZHeK1qsgaDsiS76S2ZNcFh0HQVIa84Q200/XUDgh8dYlD0+7oIsVu0UBZ
72Aop+Ak9+k4vKTvB9/hpcY+Rt0MI7zKewXBDSDK1sXhIHLQqv8rCEeNYiuPPqr/
ghRukn+C/Wtr7JYBsX+jMjxtmSzYtwBOihwLoZCH9pp3C5jTvyQk9s8n1j13V2RK
sAFtfpCVoQ8tTa7IITKRMfftzHn1WiPlPsj6VbigJ6A4N98csgv7x2rF7FyqcF0X
aoj69nQ3i/4=
=8iy3
-----END PGP SIGNATURE-----
Merge tag 'md-3.10-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"A few bugfixes for md
Some tagged for -stable"
* tag 'md-3.10-fixes' of git://neil.brown.name/md:
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
md: md_stop_writes() should always freeze recovery.
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>