android_kernel_lge_bullhead

Commit Graph

Author	SHA1	Message	Date
Keno Fischer	e64530fd1b	mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp commit 8310d48b125d19fcd9521d83b8293e63eb1646aa upstream. In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had also been set. However, a similar check in huge_memory.c was forgotten. As a result, remote memory writes to ro regions of memory backed by transparent huge pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails out immidiately because `(flags & FOLL_WRITE) && !pmd_write(pmd)` is true. While in this state the process is stil SIGKILLable, but little else works (e.g. no ptrace attach, no other signals). This is easily reproduced with the following code (assuming thp are set to always): #include <assert.h> #include <fcntl.h> #include <stdint.h> #include <stdio.h> #include <string.h> #include <sys/mman.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #define TEST_SIZE 5 1024 * 1024 int main(void) { int status; pid_t child; int fd = open("/proc/self/mem", O_RDWR); void addr = mmap(NULL, TEST_SIZE, PROT_READ, MAP_ANONYMOUS \| MAP_PRIVATE, 0, 0); assert(addr != MAP_FAILED); pid_t parent_pid = getpid(); if ((child = fork()) == 0) { void addr2 = mmap(NULL, TEST_SIZE, PROT_READ \| PROT_WRITE, MAP_ANONYMOUS \| MAP_PRIVATE, 0, 0); assert(addr2 != MAP_FAILED); memset(addr2, 'a', TEST_SIZE); pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr); return 0; } assert(child == waitpid(child, &status, 0)); assert(WIFEXITED(status) && WEXITSTATUS(status) == 0); return 0; } Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously to the update in gup.c in the original commit. The same pattern exists in follow_devmap_pmd. However, we should not be able to reach that check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we ever do. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170106015025.GA38411@juliacomputing.com Signed-off-by: Keno Fischer <keno@juliacomputing.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Drop change to follow_devmap_pmd() - pmd_dirty() is not available; check the page flags as in can_follow_write_pte() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [mhocko: This has been forward ported from the 3.2 stable tree. And fixed to return NULL.] Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Willy Tarreau <w@1wt.eu>	2017-06-08 00:47:11 +02:00
Oliver O'Halloran	414989d25f	mm/init: fix zone boundary creation commit 90cae1fe1c3540f791d5b8e025985fa5e699b2bb upstream. As a part of memory initialisation the architecture passes an array to free_area_init_nodes() which specifies the max PFN of each memory zone. This array is not necessarily monotonic (due to unused zones) so this array is parsed to build monotonic lists of the min and max PFN for each zone. ZONE_MOVABLE is special cased here as its limits are managed by the mm subsystem rather than the architecture. Unfortunately, this special casing is broken when ZONE_MOVABLE is the not the last zone in the zone list. The core of the issue is: if (i == ZONE_MOVABLE) continue; arch_zone_lowest_possible_pfn[i] = arch_zone_highest_possible_pfn[i-1]; As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone will be set to zero. This patch fixes this bug by adding explicitly tracking where the next zone should start rather than relying on the contents arch_zone_highest_possible_pfn[]. Thie is low priority. To get bitten by this you need to enable a zone that appears after ZONE_MOVABLE in the zone_type enum. As far as I can tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled, so I can't see this affecting too many people. I only noticed this because I've been fiddling with ZONE_DEVICE on powerpc and 4.6 broke my test kernel. This bug, in conjunction with the changes in Taku Izumi's kernelcore=mirror patch (d91749c1dda71) and powerpc being the odd architecture which initialises max_zone_pfn[] to ~0ul instead of 0 caused all of system memory to be placed into ZONE_DEVICE at boot, followed a panic since device memory cannot be used for kernel allocations. I've already submitted a patch to fix the powerpc specific bits, but I figured this should be fixed too. Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Willy Tarreau <w@1wt.eu>	2017-06-08 00:47:09 +02:00
Mike Kravetz	10e96e6b6d	mm/hugetlb.c: fix reservation race when freeing surplus pages commit e5bbc8a6c992901058bc09e2ce01d16c111ff047 upstream. return_unused_surplus_pages() decrements the global reservation count, and frees any unused surplus pages that were backing the reservation. Commit 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()") added a call to cond_resched_lock in the loop freeing the pages. As a result, the hugetlb_lock could be dropped, and someone else could use the pages that will be freed in subsequent iterations of the loop. This could result in inconsistent global hugetlb page state, application api failures (such as mmap) failures or application crashes. When dropping the lock in return_unused_surplus_pages, make sure that the global reservation count (resv_huge_pages) remains sufficiently large to prevent someone else from claiming pages about to be freed. Analyzed by Paul Cassella. Fixes: 7848a4bf51b3 ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()") Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Paul Cassella <cassella@cray.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>	2017-06-08 00:46:54 +02:00
zhong jiang	3419dc516e	mm,ksm: fix endless looping in allocating memory when ksm enable commit 5b398e416e880159fe55eefd93c6588fa072cd66 upstream. I hit the following hung task when runing a OOM LTP test case with 4.1 kernel. Call trace: [<ffffffc000086a88>] __switch_to+0x74/0x8c [<ffffffc000a1bae0>] __schedule+0x23c/0x7bc [<ffffffc000a1c09c>] schedule+0x3c/0x94 [<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350 [<ffffffc000a1e32c>] down_write+0x64/0x80 [<ffffffc00021f794>] __ksm_exit+0x90/0x19c [<ffffffc0000be650>] mmput+0x118/0x11c [<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74 [<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4 [<ffffffc0000d0f34>] get_signal+0x444/0x5e0 [<ffffffc000089fcc>] do_signal+0x1d8/0x450 [<ffffffc00008a35c>] do_notify_resume+0x70/0x78 The oom victim cannot terminate because it needs to take mmap_sem for write while the lock is held by ksmd for read which loops in the page allocator ksm_do_scan scan_get_next_rmap_item down_read get_next_rmap_item alloc_rmap_item #ksmd will loop permanently. There is no way forward because the oom victim cannot release any memory in 4.1 based kernel. Since 4.6 we have the oom reaper which would solve this problem because it would release the memory asynchronously. Nevertheless we can relax alloc_rmap_item requirements and use __GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan would just retry later after the lock got dropped. Such a patch would be also easy to backport to older stable kernels which do not have oom_reaper. While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed by the allocation failure. Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Hugh Dickins <hughd@google.com> Suggested-by: Michal Hocko <mhocko@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>	2017-02-10 11:04:06 +01:00
Jann Horn	f2dc0d7725	swapfile: fix memory corruption via malformed swapfile commit dd111be69114cc867f8e826284559bfbc1c40e37 upstream. When root activates a swap partition whose header has the wrong endianness, nr_badpages elements of badpages are swabbed before nr_badpages has been checked, leading to a buffer overrun of up to 8GB. This normally is not a security issue because it can only be exploited by root (more specifically, a process with CAP_SYS_ADMIN or the ability to modify a swap file/partition), and such a process can already e.g. modify swapped-out memory of any other userspace process on the system. Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>	2017-02-06 23:33:05 +01:00
Yuan Lin	8250bd666d	BACKPORT: mm: avoid setting up anonymous pages into file mapping (cherry-picked from 6b7339f4c31ad69c8e9c0b2859276e22cf72176d) Reading page fault handler code I've noticed that under right circumstances kernel would map anonymous pages into file mappings: if the VMA doesn't have vm_ops->fault() and the VMA wasn't fully populated on ->mmap(), kernel would handle page fault to not populated pte with do_anonymous_page(). Let's change page fault handler to use do_anonymous_page() only on anonymous VMA (->vm_ops == NULL) and make sure that the VMA is not shared. For file mappings without vm_ops->fault() or shred VMA without vm_ops, page fault on pte_none() entry would lead to SIGBUS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Willy Tarreau <w@1wt.eu> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 32460277 Change-Id: I48b744281eecbb6e35ed14f022846870ccf0d316 Signed-off-by: Yuan Lin <yualin@google.com>	2016-11-17 04:19:46 +00:00
Linus Torvalds	7a4db5430f	UPSTREAM: mm: remove gup_flags FOLL_WRITE games from __get_user_pages() (cherry-picked from `9691eac559`) commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream. This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit `4ceb5db975` ("Fix get_user_pages() race for write access") but that was then undone due to problems on s390 by commit `f33ea7f404` ("fix get_user_pages bug"). In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in `abf09bed3c` ("s390/mm: implement software dirty bits") which made it into v3.9. Earlier kernels will have to look at the page state itself. Also, the VM has become more scalable, and what used a purely theoretical race back then has become easier to trigger. To fix it, we introduce a new internal FOLL_COW flag to mark the "yes, we already did a COW" rather than play racy games with FOLL_WRITE that is very fundamental, and then use the pte dirty flag to validate that the FOLL_COW flag is still valid. Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Nick Piggin <npiggin@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask; s/faultin_page/__get_user_page] Signed-off-by: Willy Tarreau <w@1wt.eu> Change-Id: I42e448ecacad4781b460c4c989026307169ba1b5 Bug: 32141528	2016-10-28 20:02:52 -07:00
Linus Torvalds	9691eac559	mm: remove gup_flags FOLL_WRITE games from __get_user_pages() commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream. This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit `4ceb5db975` ("Fix get_user_pages() race for write access") but that was then undone due to problems on s390 by commit `f33ea7f404` ("fix get_user_pages bug"). In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in `abf09bed3c` ("s390/mm: implement software dirty bits") which made it into v3.9. Earlier kernels will have to look at the page state itself. Also, the VM has become more scalable, and what used a purely theoretical race back then has become easier to trigger. To fix it, we introduce a new internal FOLL_COW flag to mark the "yes, we already did a COW" rather than play racy games with FOLL_WRITE that is very fundamental, and then use the pte dirty flag to validate that the FOLL_COW flag is still valid. Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Nick Piggin <npiggin@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask; s/faultin_page/__get_user_page] Signed-off-by: Willy Tarreau <w@1wt.eu>	2016-10-20 00:46:32 +02:00
Andrea Arcangeli	afac3781d2	mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 upstream. pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were introduced to locklessy (but atomically) detect when a pmd is a regular (stable) pmd or when the pmd is unstable and can infinitely transition from pmd_none() and pmd_trans_huge() from under us, while only holding the mmap_sem for reading (for writing not). While holding the mmap_sem only for reading, MADV_DONTNEED can run from under us and so before we can assume the pmd to be a regular stable pmd we need to compare it against pmd_none() and pmd_trans_huge() in an atomic way, with pmd_trans_unstable(). The old pmd_trans_huge() left a tiny window for a race. Useful applications are unlikely to notice the difference as doing MADV_DONTNEED concurrently with a page fault would lead to undefined behavior. [js] 3.12 backport: no pmd_devmap in 3.12 yet. [akpm@linux-foundation.org: tidy up comment grammar/layout] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>	2016-10-20 00:46:31 +02:00
Willy Tarreau	f7ed93fef0	squash mm: Export migrate_page_... : also make it non-static commit ce16887b69e94a8c0305e88c918989f8bc1bd6b7 upstream. Signed-off-by: Willy Tarreau <w@1wt.eu>	2016-08-27 11:40:39 +02:00
Richard Weinberger	31f476515d	mm: Export migrate_page_move_mapping and migrate_page_copy commit 1118dce773d84f39ebd51a9fe7261f9169cb056e upstream. Export these symbols such that UBIFS can implement ->migratepage. Signed-off-by: Richard Weinberger <richard@nod.at> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [wt: also add the prototype to include/linux/migrate.h] Signed-off-by: Willy Tarreau <w@1wt.eu>	2016-08-27 11:40:23 +02:00
Hugh Dickins	24fc11a987	tmpfs: fix regression hang in fallocate undo commit 7f556567036cb7f89aabe2f0954b08566b4efb53 upstream. The well-spotted fallocate undo fix is good in most cases, but not when fallocate failed on the very first page. index 0 then passes lend -1 to shmem_undo_range(), and that has two bad effects: (a) that it will undo every fallocation throughout the file, unrestricted by the current range; but more importantly (b) it can cause the undo to hang, because lend -1 is treated as truncation, which makes it keep on retrying until every page has gone, but those already fully instantiated will never go away. Big thank you to xfstests generic/269 which demonstrates this. Fixes: b9b4bb26af01 ("tmpfs: don't undo fallocate past its last page") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>	2016-08-21 23:22:38 +02:00
Anthony Romano	6d2eb0f116	tmpfs: don't undo fallocate past its last page commit b9b4bb26af017dbe930cd4df7f9b2fc3a0497bfe upstream. When fallocate is interrupted it will undo a range that extends one byte past its range of allocated pages. This can corrupt an in-use page by zeroing out its first byte. Instead, undo using the inclusive byte range. Fixes: `1635f6a741` ("tmpfs: undo fallocation on failure") Link: http://lkml.kernel.org/r/1462713387-16724-1-git-send-email-anthony.romano@coreos.com Signed-off-by: Anthony Romano <anthony.romano@coreos.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Brandon Philips <brandon@ifup.co> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>	2016-08-21 23:22:38 +02:00
Hugh Dickins	af110cc4b2	mm: migrate dirty page without clear_page_dirty_for_io etc commit 42cb14b110a5698ccf26ce59c4441722605a3743 upstream. clear_page_dirty_for_io() has accumulated writeback and memcg subtleties since v2.6.16 first introduced page migration; and the set_page_dirty() which completed its migration of PageDirty, later had to be moderated to __set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too. No actual problems seen with this procedure recently, but if you look into what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually achieving, it turns out to be nothing more than moving the PageDirty flag, and its NR_FILE_DIRTY stat from one zone to another. It would be good to avoid a pile of irrelevant decrementations and incrementations, and improper event counting, and unnecessary descent of the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which radix_tree_replace_slot() left in place anyway). Do the NR_FILE_DIRTY movement, like the other stats movements, while interrupts still disabled in migrate_page_move_mapping(); and don't even bother if the zone is the same. Do the PageDirty movement there under tree_lock too, where old page is frozen and newpage not yet visible: bearing in mind that as soon as newpage becomes visible in radix_tree, an un-page-locked set_page_dirty() might interfere (or perhaps that's just not possible: anything doing so should already hold an additional reference to the old page, preventing its migration; but play safe). But we do still need to transfer PageDirty in migrate_page_copy(), for those who don't go the mapping route through migrate_page_move_mapping(). CVE-2016-3070 Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [ciwillia@brocade.com: backported to 3.10: adjusted context] Signed-off-by: Charles (Chas) Williams <ciwillia@brocade.com> Signed-off-by: Willy Tarreau <w@1wt.eu>	2016-08-21 23:22:37 +02:00
Tim Murray	34bcfd0679	mm: improve migration heuristic Some users were still seeing extreme unmovable page block migration over time due to unmovable allocations stealing mostly free movable blocks. Reduce the likelihood of this by only allowing unmovable allocations to aggressively steal reclaimable pageblocks. bug 26916944 Change-Id: I87fe0b0963ea967e4edf1ef60ae3fd297bf6978c	2016-03-28 21:22:52 +00:00
Martijn Coenen	12c7a54b36	mm: vmpressure: dynamic window sizing. The window size used for calculating vm pressure events was previously fixed at 512 pages. The window size has a big impact on the rate of notifications sent off to userspace, in particular when using the "low" level. On machines with a lot of memory, the current value is likely excessive. On the other hand, if the window size is too big, we might delay memory pressure events for too long, especially at critical levels of memory pressure. This patch attempts to address that problem with two changes. The first change is to calculate the window size based on the machine size, quite similar to how the vm watermarks are being calculated. This reduces the chance of false positives at any pressure level. Using the machine size only makes sense on the root cgroup though; for non-root cgroups, their hard memory limit is used to calculate the window size. If no hard memory limit is set, we fall back to the default window size that was previously used. The second change is based on an idea from Johannes Weiner, to only report medium and low pressure levels for every X windows that we scan. This reduces the frequency with which we report low/medium pressure levels, but at the same time will still report critical memory pressure immediately. Change-Id: Ieffca055b2fb6aa27ae0179e0a588e6fcb173a61 Signed-off-by: Martijn Coenen <maco@google.com>	2016-03-21 10:23:32 +00:00
Tim Murray	6082439b22	mm: adjust page migration heuristic The page allocator's heuristic to decide when to migrate page blocks to unmovable seems to have been tuned on architectures that do not have kernel drivers that would make unmovable allocations of several megabytes or greater--ie, no cameras or shared-memory GPUs. The number of allocations from these drivers may be unbounded and may occupy a significant percentage of overall system memory (>50%). As a result, every Android device has suffered to some extent from increasing fragmentation due to unmovable page block migration over time. This change adjusts the page migration heuristic to only migrate page blocks for unmovable allocations when the order of the requested allocation is order-5 or greater. This prevents migration due to GPU and ion allocations so long as kernel drivers allocate memory at runtime using order-4 or smaller pages. Experimental results running the Android longevity test suite on a Nexus 5X for 10 hours: old heuristic: 116 unmovable blocks after boot -> 281 unmovable blocks new heuristic: 105 unmovable blocks after boot -> 101 unmovable blocks bug 26916944 Change-Id: I5b7ccbbafa4049a2f47f399df4cb4779689f4c40	2016-03-10 11:09:47 -08:00
Martijn Coenen	a28c4bb399	memcg: only free spare array when readers are done commit 6611d8d76132f86faa501de9451a89bf23fb2371 upstream. A spare array holding mem cgroup threshold events is kept around to make sure we can always safely deregister an event and have an array to store the new set of events in. In the scenario where we're going from 1 to 0 registered events, the pointer to the primary array containing 1 event is copied to the spare slot, and then the spare slot is freed because no events are left. However, it is freed before calling synchronize_rcu(), which means readers may still be accessing threshold->primary after it is freed. Fixed by only freeing after synchronize_rcu(). Signed-off-by: Martijn Coenen <maco@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-02-25 11:57:49 -08:00
Andrew Banman	f8f1013f5c	mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone() commit 5f0f2887f4de9508dcf438deab28f1de8070c271 upstream. test_pages_in_a_zone() does not account for the possibility of missing sections in the given pfn range. pfn_valid_within always returns 1 when CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing sections to pass the test, leading to a kernel oops. Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check for missing sections before proceeding into the zone-check code. This also prevents a crash from offlining memory devices with missing sections. Despite this, it may be a good idea to keep the related patch '[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with missing sections' because missing sections in a memory block may lead to other problems not covered by the scope of this fix. Signed-off-by: Andrew Banman <abanman@sgi.com> Acked-by: Alex Thorlton <athorlton@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-02-25 11:57:49 -08:00
Naoya Horiguchi	523ea6a002	mm: soft-offline: check return value in second __get_any_page() call commit d96b339f453997f2f08c52da3f41423be48c978f upstream. I saw the following BUG_ON triggered in a testcase where a process calls madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that calls migratepages command repeatedly (doing ping-pong among different NUMA nodes) for the first process: Soft offlining page 0x60000 at 0x700000600000 __get_any_page: 0x60000 free buddy page page:ffffea0001800000 count:0 mapcount:-127 mapping: (null) index:0x1 flags: 0x1fffc0000000000() page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0) ------------[ cut here ]------------ kernel BUG at /src/linux-dev/include/linux/mm.h:342! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G O 4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000 RIP: 0010:[<ffffffff8118998c>] [<ffffffff8118998c>] put_page+0x5c/0x60 RSP: 0018:ffff88007c213e00 EFLAGS: 00010246 Call Trace: put_hwpoison_page+0x4e/0x80 soft_offline_page+0x501/0x520 SyS_madvise+0x6bc/0x6f0 entry_SYSCALL_64_fastpath+0x12/0x6a Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 <0f> 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47 RIP [<ffffffff8118998c>] put_page+0x5c/0x60 RSP <ffff88007c213e00> The root cause resides in get_any_page() which retries to get a refcount of the page to be soft-offlined. This function calls put_hwpoison_page(), expecting that the target page is putback to LRU list. But it can be also freed to buddy. So the second check need to care about such case. Fixes: `af8fae7c08` ("mm/memory-failure.c: clean up soft_offline_page()") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-02-25 11:57:48 -08:00
Jann Horn	414f6fbc84	ptrace: use fsuid, fsgid, effective creds for fs access checks commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream. By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks: /proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR /proc/$pid/maps - also useful for bypassing ASLR /proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario: lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar drwx------ root root /root drwxr-xr-x root root /root/foobar -rw-r--r-- root root /root/foobar/secret Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [akpm@linux-foundation.org: fix warning] Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-02-25 11:57:47 -08:00
Martijn Coenen	da92fc8598	UPSTREAM: memcg: Only free spare array when readers are done A spare array holding mem cgroup threshold events is kept around to make sure we can always safely deregister an event and have an array to store the new set of events in. In the scenario where we're going from 1 to 0 registered events, the pointer to the primary array containing 1 event is copied to the spare slot, and then the spare slot is freed because no events are left. However, it is freed before calling synchronize_rcu(), which means readers may still be accessing threshold->primary after it is freed. Fixed by only freeing after synchronize_rcu(). Change-Id: Id324c5c15cc6aaa684c6e84c1049f723c6c6d984 Signed-off-by: Martijn Coenen <maco@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 6611d8d76132f86faa501de9451a89bf23fb2371)	2016-01-25 09:13:56 +00:00
dcashman	f6abe1ab29	FROMLIST: mm: mmap: Add new /proc tunable for mmap_base ASLR. (cherry picked from commit https://lkml.org/lkml/2015/12/21/337) ASLR only uses as few as 8 bits to generate the random offset for the mmap base address on 32 bit architectures. This value was chosen to prevent a poorly chosen value from dividing the address space in such a way as to prevent large allocations. This may not be an issue on all platforms. Allow the specification of a minimum number of bits so that platforms desiring greater ASLR protection may determine where to place the trade-off. Bug: 24047224 Signed-off-by: Daniel Cashman <dcashman@android.com> Signed-off-by: Daniel Cashman <dcashman@google.com> Change-Id: I66ac01c6f4f2c8dcfc84d1f1e99490b8385b3ed4	2016-01-14 11:48:20 -08:00
Vlastimil Babka	10dba3c441	mm: more aggressive page stealing for UNMOVABLE allocations When allocation falls back to stealing free pages of another migratetype, it can decide to steal extra pages, or even the whole pageblock in order to reduce fragmentation, which could happen if further allocation fallbacks pick a different pageblock. In try_to_steal_freepages(), one of the situations where extra pages are stolen happens when we are trying to allocate a MIGRATE_RECLAIMABLE page. However, MIGRATE_UNMOVABLE allocations are not treated the same way, although spreading such allocation over multiple fallback pageblocks is arguably even worse than it is for RECLAIMABLE allocations. To minimize fragmentation, we should minimize the number of such fallbacks, and thus steal as much as is possible from each fallback pageblock. Note that in theory this might put more pressure on movable pageblocks and cause movable allocations to steal back from unmovable pageblocks. However, movable allocations are not as aggressive with stealing, and do not cause permanent fragmentation, so the tradeoff is reasonable, and evaluation seems to support the change. This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to steal extra free pages. When evaluating with stress-highalloc from mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to roughly 1/6. The number of these fallbacks stealing from MIGRATE_MOVABLE block is reduced to 1/3. There was no observation of growing number of unmovable pageblocks over time, and also not of increased movable allocation fallbacks. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-04 14:17:07 -08:00
Vlastimil Babka	9d32d05be8	mm: always steal split buddies in fallback allocations When allocation falls back to another migratetype, it will steal a page with highest available order, and (depending on this order and desired migratetype), it might also steal the rest of free pages from the same pageblock. Given the preference of highest available order, it is likely that it will be higher than the desired order, and result in the stolen buddy page being split. The remaining pages after split are currently stolen only when the rest of the free pages are stolen. This can however lead to situations where for MOVABLE allocations we split e.g. order-4 fallback UNMOVABLE page, but steal only order-0 page. Then on the next MOVABLE allocation (which may be batched to fill the pcplists) we split another order-3 or higher page, etc. By stealing all pages that we have split, we can avoid further stealing. This patch therefore adjusts the page stealing so that buddy pages created by split are always stolen. This has effect only on MOVABLE allocations, as RECLAIMABLE and UNMOVABLE allocations already always do that in addition to stealing the rest of free pages from the pageblock. The change also allows to simplify try_to_steal_freepages() and factor out CMA handling. According to Mel, it has been intended since the beginning that buddy pages after split would be stolen always, but it doesn't seem like it was ever the case until commit `47118af076` ("mm: mmzone: MIGRATE_CMA migration type added"). The commit has unintentionally introduced this behavior, but was reverted by commit 0cbef29a7821 ("mm: __rmqueue_fallback() should respect pageblock type"). Neither included evaluation. My evaluation with stress-highalloc from mmtests shows about 2.5x reduction of page stealing events for MOVABLE allocations, without affecting the page stealing events for other allocation migratetypes. Change-Id: I2c5b1a7fd01fc080efb689da07d380abd0e030ee Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-04 14:17:06 -08:00
Vlastimil Babka	895ac1ed5b	mm: when stealing freepages, also take pages created by splitting buddy page When studying page stealing, I noticed some weird looking decisions in try_to_steal_freepages(). The first I assume is a bug (Patch 1), the following two patches were driven by evaluation. Testing was done with stress-highalloc of mmtests, using the mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how often page stealing occurs for individual migratetypes, and what migratetypes are used for fallbacks. Arguably, the worst case of page stealing is when UNMOVABLE allocation steals from MOVABLE pageblock. RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal, so the goal is to minimize these two cases. The evaluation of v2 wasn't always clear win and Joonsoo questioned the results. Here I used different baseline which includes RFC compaction improvements from [1]. I found that the compaction improvements reduce variability of stress-highalloc, so there's less noise in the data. First, let's look at stress-highalloc configured to do sync compaction, and how these patches reduce page stealing events during the test. First column is after fresh reboot, other two are reiterations of test without reboot. That was all accumulater over 5 re-iterations (so the benchmark was run 5x3 times with 5 fresh restarts). Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Page alloc extfrag event 10264225 8702233 10244125 Extfrag fragmenting 10263271 8701552 10243473 Extfrag fragmenting for unmovable 13595 17616 15960 Extfrag fragmenting unmovable placed with movable 7989 12193 8447 Extfrag fragmenting for reclaimable 658 1840 1817 Extfrag fragmenting reclaimable placed with movable 558 1677 1679 Extfrag fragmenting for movable 10249018 8682096 10225696 With Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Page alloc extfrag event 11834954 9877523 9774860 Extfrag fragmenting 11833993 9876880 9774245 Extfrag fragmenting for unmovable 7342 16129 11712 Extfrag fragmenting unmovable placed with movable 4191 10547 6270 Extfrag fragmenting for reclaimable 373 1130 923 Extfrag fragmenting reclaimable placed with movable 302 906 738 Extfrag fragmenting for movable 11826278 9859621 9761610 With Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Page alloc extfrag event 4725990 3668793 3807436 Extfrag fragmenting 4725104 3668252 3806898 Extfrag fragmenting for unmovable 6678 7974 7281 Extfrag fragmenting unmovable placed with movable 2051 3829 4017 Extfrag fragmenting for reclaimable 429 1208 1278 Extfrag fragmenting reclaimable placed with movable 369 976 1034 Extfrag fragmenting for movable 4717997 3659070 3798339 With Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Page alloc extfrag event 5016183 4700142 3850633 Extfrag fragmenting 5015325 4699613 3850072 Extfrag fragmenting for unmovable 1312 3154 3088 Extfrag fragmenting unmovable placed with movable 1115 2777 2714 Extfrag fragmenting for reclaimable 437 1193 1097 Extfrag fragmenting reclaimable placed with movable 330 969 879 Extfrag fragmenting for movable 5013576 4695266 3845887 In v2 we've seen apparent regression with Patch 1 for unmovable events, this is now gone, suggesting it was indeed noise. Here, each patch improves the situation for unmovable events. Reclaimable is improved by patch 1 and then either the same modulo noise, or perhaps sligtly worse - a small price for unmovable improvements, IMHO. The number of movable allocations falling back to other migratetypes is most noisy, but it's reduced to half at Patch 2 nevertheless. These are least critical as compaction can move them around. If we look at success rates, the patches don't affect them, that didn't change. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Success 1 Min 49.00 ( 0.00%) 42.00 ( 14.29%) 41.00 ( 16.33%) Success 1 Mean 51.00 ( 0.00%) 45.00 ( 11.76%) 42.60 ( 16.47%) Success 1 Max 55.00 ( 0.00%) 51.00 ( 7.27%) 46.00 ( 16.36%) Success 2 Min 53.00 ( 0.00%) 47.00 ( 11.32%) 44.00 ( 16.98%) Success 2 Mean 59.60 ( 0.00%) 50.80 ( 14.77%) 48.20 ( 19.13%) Success 2 Max 64.00 ( 0.00%) 56.00 ( 12.50%) 52.00 ( 18.75%) Success 3 Min 84.00 ( 0.00%) 82.00 ( 2.38%) 78.00 ( 7.14%) Success 3 Mean 85.60 ( 0.00%) 82.80 ( 3.27%) 79.40 ( 7.24%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Success 1 Min 49.00 ( 0.00%) 44.00 ( 10.20%) 44.00 ( 10.20%) Success 1 Mean 51.80 ( 0.00%) 46.00 ( 11.20%) 45.80 ( 11.58%) Success 1 Max 54.00 ( 0.00%) 49.00 ( 9.26%) 49.00 ( 9.26%) Success 2 Min 58.00 ( 0.00%) 49.00 ( 15.52%) 48.00 ( 17.24%) Success 2 Mean 60.40 ( 0.00%) 51.80 ( 14.24%) 50.80 ( 15.89%) Success 2 Max 63.00 ( 0.00%) 54.00 ( 14.29%) 55.00 ( 12.70%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 81.60 ( 4.00%) 79.80 ( 6.12%) Success 3 Max 86.00 ( 0.00%) 82.00 ( 4.65%) 82.00 ( 4.65%) Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Success 1 Min 50.00 ( 0.00%) 44.00 ( 12.00%) 39.00 ( 22.00%) Success 1 Mean 52.80 ( 0.00%) 45.60 ( 13.64%) 42.40 ( 19.70%) Success 1 Max 55.00 ( 0.00%) 46.00 ( 16.36%) 47.00 ( 14.55%) Success 2 Min 52.00 ( 0.00%) 48.00 ( 7.69%) 45.00 ( 13.46%) Success 2 Mean 53.40 ( 0.00%) 49.80 ( 6.74%) 48.80 ( 8.61%) Success 2 Max 57.00 ( 0.00%) 52.00 ( 8.77%) 52.00 ( 8.77%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 82.40 ( 3.06%) 79.60 ( 6.35%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Success 1 Min 46.00 ( 0.00%) 44.00 ( 4.35%) 42.00 ( 8.70%) Success 1 Mean 50.20 ( 0.00%) 45.60 ( 9.16%) 44.00 ( 12.35%) Success 1 Max 52.00 ( 0.00%) 47.00 ( 9.62%) 47.00 ( 9.62%) Success 2 Min 53.00 ( 0.00%) 49.00 ( 7.55%) 48.00 ( 9.43%) Success 2 Mean 55.80 ( 0.00%) 50.60 ( 9.32%) 49.00 ( 12.19%) Success 2 Max 59.00 ( 0.00%) 52.00 ( 11.86%) 51.00 ( 13.56%) Success 3 Min 84.00 ( 0.00%) 80.00 ( 4.76%) 79.00 ( 5.95%) Success 3 Mean 85.40 ( 0.00%) 81.60 ( 4.45%) 80.40 ( 5.85%) Success 3 Max 87.00 ( 0.00%) 83.00 ( 4.60%) 82.00 ( 5.75%) While there's no improvement here, I consider reduced fragmentation events to be worth on its own. Patch 2 also seems to reduce scanning for free pages, and migrations in compaction, suggesting it has somewhat less work to do: Patch 1: Compaction stalls 4153 3959 3978 Compaction success 1523 1441 1446 Compaction failures 2630 2517 2531 Page migrate success 4600827 4943120 5104348 Page migrate failure 19763 16656 17806 Compaction pages isolated 9597640 10305617 10653541 Compaction migrate scanned 77828948 86533283 87137064 Compaction free scanned 517758295 521312840 521462251 Compaction cost 5503 5932 6110 Patch 2: Compaction stalls 3800 3450 3518 Compaction success 1421 1316 1317 Compaction failures 2379 2134 2201 Page migrate success 4160421 4502708 4752148 Page migrate failure 19705 14340 14911 Compaction pages isolated 8731983 9382374 9910043 Compaction migrate scanned 98362797 96349194 98609686 Compaction free scanned 496512560 469502017 480442545 Compaction cost 5173 5526 5811 As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers of unmovable and reclaimable pageblocks. Configuring the benchmark to allocate like THP page fault (i.e. no sync compaction) gives much noisier results for iterations 2 and 3 after reboot. This is not so surprising given how [1] offers lower improvements in this scenario due to less restarts after deferred compaction which would change compaction pivot. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-thp-1 5-thp-2 5-thp-3 Page alloc extfrag event 8148965 6227815 6646741 Extfrag fragmenting 8147872 6227130 6646117 Extfrag fragmenting for unmovable 10324 12942 15975 Extfrag fragmenting unmovable placed with movable 5972 8495 10907 Extfrag fragmenting for reclaimable 601 1707 2210 Extfrag fragmenting reclaimable placed with movable 520 1570 2000 Extfrag fragmenting for movable 8136947 6212481 6627932 Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-thp-1 6-thp-2 6-thp-3 Page alloc extfrag event 8345457 7574471 7020419 Extfrag fragmenting 8343546 7573777 7019718 Extfrag fragmenting for unmovable 10256 18535 30716 Extfrag fragmenting unmovable placed with movable 6893 11726 22181 Extfrag fragmenting for reclaimable 465 1208 1023 Extfrag fragmenting reclaimable placed with movable 353 996 843 Extfrag fragmenting for movable 8332825 7554034 6987979 Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-thp-1 7-thp-2 7-thp-3 Page alloc extfrag event 3512847 3020756 2891625 Extfrag fragmenting 3511940 3020185 2891059 Extfrag fragmenting for unmovable 9017 6892 6191 Extfrag fragmenting unmovable placed with movable 1524 3053 2435 Extfrag fragmenting for reclaimable 445 1081 1160 Extfrag fragmenting reclaimable placed with movable 375 918 986 Extfrag fragmenting for movable 3502478 3012212 2883708 Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-thp-1 8-thp-2 8-thp-3 Page alloc extfrag event 3181699 3082881 2674164 Extfrag fragmenting 3180812 3082303 2673611 Extfrag fragmenting for unmovable 1201 4031 4040 Extfrag fragmenting unmovable placed with movable 974 3611 3645 Extfrag fragmenting for reclaimable 478 1165 1294 Extfrag fragmenting reclaimable placed with movable 387 985 1030 Extfrag fragmenting for movable 3179133 3077107 2668277 The improvements for first iteration are clear, the rest is much noisier and can appear like regression for Patch 1. Anyway, patch 2 rectifies it. Allocation success rates are again unaffected so there's no point in making this e-mail any longer. [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2 This patch (of 3): When __rmqueue_fallback() is called to allocate a page of order X, it will find a page of order Y >= X of a fallback migratetype, which is different from the desired migratetype. With the help of try_to_steal_freepages(), it may change the migratetype (to the desired one) also of: 1) all currently free pages in the pageblock containing the fallback page 2) the fallback pageblock itself 3) buddy pages created by splitting the fallback page (when Y > X) These decisions take the order Y into account, as well as the desired migratetype, with the goal of preventing multiple fallback allocations that could e.g. distribute UNMOVABLE allocations among multiple pageblocks. Originally, decision for 1) has implied the decision for 3). Commit `47118af076` ("mm: mmzone: MIGRATE_CMA migration type added") changed that (probably unintentionally) so that the buddy pages in case 3) are always changed to the desired migratetype, except for CMA pageblocks. Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code and fix a bug") did some refactoring and added a comment that the case of 3) is intended. Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should respect pageblock type") removed the comment and tried to restore the original behavior where 1) implies 3), but due to the previous refactoring, the result is instead that only 2) implies 3) - and the conditions for 2) are less frequently met than conditions for 1). This may increase fragmentation in situations where the code decides to steal all free pages from the pageblock (case 1)), but then gives back the buddy pages produced by splitting. This patch restores the original intended logic where 1) implies 3). During testing with stress-highalloc from mmtests, this has shown to decrease the number of events where UNMOVABLE and RECLAIMABLE allocations steal from MOVABLE pageblocks, which can lead to permanent fragmentation. In some cases it has increased the number of events when MOVABLE allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are fixable by sync compaction and thus less harmful. Note that evaluation has shown that the behavior introduced by `47118af076` for buddy pages in case 3) is actually even better than the original logic, so the following patch will introduce it properly once again. For stable backports of this patch it makes thus sense to only fix versions containing 0cbef29a7821. [iamjoonsoo.kim@lge.com: tracepoint fix] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [3.13+ containing 0cbef29a7821] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-04 14:17:06 -08:00
KOSAKI Motohiro	4febd75b67	mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag() In general, every tracepoint should be zero overhead if it is disabled. However, trace_mm_page_alloc_extfrag() is one of exception. It evaluate "new_type == start_migratetype" even if tracepoint is disabled. However, the code can be moved into tracepoint's TP_fast_assign() and TP_fast_assign exist exactly such purpose. This patch does it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-04 14:17:05 -08:00
Srivatsa S. Bhat	d961ef5d89	mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag tracepoint() In the current code, the value of fallback_migratetype that is printed using the mm_page_alloc_extfrag tracepoint, is the value of the migratetype after it has been set to the preferred migratetype (if the ownership was changed). Obviously that wouldn't have been the original intent. (We already have a separate 'change_ownership' field to tell whether the ownership of the pageblock was changed from the fallback_migratetype to the preferred type.) The intent of the fallback_migratetype field is to show the migratetype from which we borrowed pages in order to satisfy the allocation request. So fix the code to print that value correctly. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-04 14:17:05 -08:00
Vlastimil Babka	21aabf49a9	mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced For the MIGRATE_RESERVE pages, it is useful when they do not get misplaced on free_list of other migratetype, otherwise they might get allocated prematurely and e.g. fragment the MIGRATE_RESEVE pageblocks. While this cannot be avoided completely when allocating new MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should prevent the misplacement where possible. Currently, it is possible for the misplacement to happen when a MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a fallback for other desired migratetype, and then later freed back through free_pcppages_bulk() without being actually used. This happens because free_pcppages_bulk() uses get_freepage_migratetype() to choose the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with the desired migratetype and not the page's original MIGRATE_RESERVE migratetype. This patch fixes the problem by moving the call to set_freepage_migratetype() from rmqueue_bulk() down to __rmqueue_smallest() and __rmqueue_fallback() where the actual page's migratetype (e.g. from which free_list the page is taken from) is used. Note that this migratetype might be different from the pageblock's migratetype due to freepage stealing decisions. This is OK, as page stealing never uses MIGRATE_RESERVE as a fallback, and also takes care to leave all MIGRATE_CMA pages on the correct freelist. Therefore, as an additional benefit, the call to get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can be removed completely. This relies on the fact that MIGRATE_CMA pageblocks are created only during system init, and the above. The related is_migrate_isolate() check is also unnecessary, as memory isolation has other ways to move pages between freelists, and drain pcp lists containing pages that should be isolated. The buffered_rmqueue() can also benefit from calling get_freepage_migratetype() instead of get_pageblock_migratetype(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Wang, Yalin" <Yalin.Wang@sonymobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-04 14:17:04 -08:00
KOSAKI Motohiro	586f4ca13a	mm: __rmqueue_fallback() should respect pageblock type When __rmqueue_fallback() doesn't find a free block with the required size it splits a larger page and puts the rest of the page onto the free list. But it has one serious mistake. When putting back, __rmqueue_fallback() always use start_migratetype if type is not CMA. However, __rmqueue_fallback() is only called when all of the start_migratetype queue is empty. That said, __rmqueue_fallback always puts back memory to the wrong queue except try_to_steal_freepages() changed pageblock type (i.e. requested size is smaller than half of page block). The end result is that the antifragmentation framework increases fragmenation instead of decreasing it. Mel's original anti fragmentation does the right thing. But commit `47118af076` ("mm: mmzone: MIGRATE_CMA migration type added") broke it. This patch restores sane and old behavior. It also removes an incorrect comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c: restructure free-page stealing code and fix a bug"). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-04 14:17:04 -08:00
Srivatsa S. Bhat	2a4e53b3c9	mm/page_allo.c: restructure free-page stealing code and fix a bug The free-page stealing code in __rmqueue_fallback() is somewhat hard to follow, and has an incredible amount of subtlety hidden inside! First off, there is a minor bug in the reporting of change-of-ownership of pageblocks. Under some conditions, we try to move upto 'pageblock_nr_pages' no. of pages to the preferred allocation list. But we change the ownership of that pageblock to the preferred type only if we manage to successfully move atleast half of that pageblock (or if page_group_by_mobility_disabled is set). However, the current code ignores the latter part and sets the 'migratetype' variable to the preferred type, irrespective of whether we actually changed the pageblock migratetype of that block or not. So, the page_alloc_extfrag tracepoint can end up printing incorrect info (i.e., 'change_ownership' might be shown as 1 when it must have been 0). So fixing this involves moving the update of the 'migratetype' variable to the right place. But looking closer, we observe that the 'migratetype' variable is used subsequently for checks such as "is_migrate_cma()". Obviously the intent there is to check if the fallback type is MIGRATE_CMA, but since we already set the 'migratetype' variable to start_migratetype, we end up checking if the preferred type is MIGRATE_CMA!! To make things more interesting, this actually doesn't cause a bug in practice, because we never change anything if the fallback type is CMA. So, restructure the code in such a way that it is trivial to understand what is going on, and also fix the above mentioned bug. And while at it, also add a comment explaining the subtlety behind the migratetype used in the call to expand(). [akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I2e84c3b2a45dc063402117dd74179585caa7234c	2016-01-04 14:17:03 -08:00
Thierry Strudel	b895aa44d8	Revert "android/lowmemorykiller: Selectively count free CMA pages" This reverts commit `06e8520b10`.	2015-12-29 13:25:55 -08:00
Thierry Strudel	fbd5d09a89	Revert "lowmemorykiller: Don't count swap cache pages twice" This reverts commit `52acbe414c`.	2015-12-29 13:25:51 -08:00
Minchan Kim	7646ebbebb	BACKPORT: mm: /proc/pid/smaps:: show proportional swap share of the mapping We want to know per-process workingset size for smart memory management on userland and we use swap(ex, zram) heavily to maximize memory efficiency so workingset includes swap as well as RSS. On such system, if there are lots of shared anonymous pages, it's really hard to figure out exactly how many each process consumes memory(ie, rss + wap) if the system has lots of shared anonymous memory(e.g, android). This patch introduces SwapPss field on /proc/<pid>/smaps so we can get more exact workingset size per process. Bongkyu tested it. Result is below. 1. 50M used swap SwapTotal: 461976 kB SwapFree: 411192 kB $ adb shell cat /proc//smaps \| grep "SwapPss:" \| awk '{sum += $2} END {print sum}'; 48236 $ adb shell cat /proc//smaps \| grep "Swap:" \| awk '{sum += $2} END {print sum}'; 141184 2. 240M used swap SwapTotal: 461976 kB SwapFree: 216808 kB $ adb shell cat /proc//smaps \| grep "SwapPss:" \| awk '{sum += $2} END {print sum}'; 230315 $ adb shell cat /proc//smaps \| grep "Swap:" \| awk '{sum += $2} END {print sum}'; 1387744 [akpm@linux-foundation.org: simplify kunmap_atomic() call] Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Bongkyu Kim <bongkyu.kim@lge.com> Tested-by: Bongkyu Kim <bongkyu.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Idf92d682fdef432bdd66e530a7e7cdff8f375db1	2015-12-16 21:32:30 +00:00
Daniel Campello	89fea44a62	Page cache miss tracing using ftrace on mm/filemap This patch includes two trace events on generic_perform_write and do_generic_file_read to check on the address_space mapping for the pages to be accessed by the request. Change-Id: Ib319b9b2c971b9e5c76645be6cfd995ef9465d77 Signed-off-by: Daniel Campello <campello@google.com> (cherry picked from commit d3952c50853166bd04562766c9603ed86ab0da75)	2015-11-19 11:03:16 -08:00
Jan Kara	07d6db9281	mm: make sendfile(2) killable commit 296291cdd1629c308114504b850dc343eabc2782 upstream. Currently a simple program below issues a sendfile(2) system call which takes about 62 days to complete in my test KVM instance. int fd; off_t off = 0; fd = open("file", O_RDWR \| O_TRUNC \| O_SYNC \| O_CREAT, 0644); ftruncate(fd, 2); lseek(fd, 0, SEEK_END); sendfile(fd, fd, &off, 0xfffffff); Now you should not ask kernel to do a stupid stuff like copying 256MB in 2-byte chunks and call fsync(2) after each chunk but if you do, sysadmin should have a way to stop you. We actually do have a check for fatal_signal_pending() in generic_perform_write() which triggers in this path however because we always succeed in writing something before the check is done, we return value > 0 from generic_perform_write() and thus the information about signal gets lost. Fix the problem by doing the signal check before writing anything. That way generic_perform_write() returns -EINTR, the error gets propagated up and the sendfile loop terminates early. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-11-09 10:12:58 -08:00
Mel Gorman	9834213fa8	mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault commit 2f84a8990ebbe235c59716896e017c6b2ca1200f upstream. SunDong reported the following on https://bugzilla.kernel.org/show_bug.cgi?id=103841 I think I find a linux bug, I have the test cases is constructed. I can stable recurring problems in fedora22(4.0.4) kernel version, arch for x86_64. I construct transparent huge page, when the parent and child process with MAP_SHARE, MAP_PRIVATE way to access the same huge page area, it has the opportunity to lead to huge page copy on write failure, and then it will munmap the child corresponding mmap area, but then the child mmap area with VM_MAYSHARE attributes, child process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags functions (vma - > vm_flags & VM_MAYSHARE). There were a number of problems with the report (e.g. it's hugetlbfs that triggers this, not transparent huge pages) but it was fundamentally correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that looks like this vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000 next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800 prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0 pgoff 0 file ffff88106bdb9800 private_data (null) flags: 0x84400fb(read\|write\|shared\|mayread\|maywrite\|mayexec\|mayshare\|dontexpand\|hugetlb) ------------ kernel BUG at mm/hugetlb.c:462! SMP Modules linked in: xt_pkttype xt_LOG xt_limit [..] CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1 Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012 set_vma_resv_flags+0x2d/0x30 The VM_BUG_ON is correct because private and shared mappings have different reservation accounting but the warning clearly shows that the VMA is shared. When a private COW fails to allocate a new page then only the process that created the VMA gets the page -- all the children unmap the page. If the children access that data in the future then they get killed. The problem is that the same file is mapped shared and private. During the COW, the allocation fails, the VMAs are traversed to unmap the other private pages but a shared VMA is found and the bug is triggered. This patch identifies such VMAs and skips them. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: SunDong <sund_sky@126.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-10-22 14:37:50 -07:00
Jaewon Kim	de047ce495	vmscan: fix increasing nr_isolated incurred by putback unevictable pages commit c54839a722a02818677bcabe57e957f0ce4f841d upstream. reclaim_clean_pages_from_list() assumes that shrink_page_list() returns number of pages removed from the candidate list. But shrink_page_list() puts back mlocked pages without passing it to caller and without counting as nr_reclaimed. This increases nr_isolated. To fix this, this patch changes shrink_page_list() to pass unevictable pages back to caller. Caller will take care those pages. Minchan said: It fixes two issues. 1. With unevictable page, cma_alloc will be successful. Exactly speaking, cma_alloc of current kernel will fail due to unevictable pages. 2. fix leaking of NR_ISOLATED counter of vmstat With it, too_many_isolated works. Otherwise, it could make hang until the process get SIGKILL. Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-10-01 12:07:32 +02:00
Minchan Kim	e944ac4a67	Rebase zram and zsmalloc from 3.15. 3.15 upstream has many patches to zram that significantly improve performance. Rebase zram and zsmalloc from that time. zsmalloc: move it under mm This patch moves zsmalloc under mm directory. Before that, description will explain why we have needed custom allocator. Zsmalloc is a new slab-based memory allocator for storing compressed pages. It is designed for low fragmentation and high allocation success rate on large object, but <= PAGE_SIZE allocations. zsmalloc differs from the kernel slab allocator in two primary ways to achieve these design goals. zsmalloc never requires high order page allocations to back slabs, or "size classes" in zsmalloc terms. Instead it allows multiple single-order pages to be stitched together into a "zspage" which backs the slab. This allows for higher allocation success rate under memory pressure. Also, zsmalloc allows objects to span page boundaries within the zspage. This allows for lower fragmentation than could be had with the kernel slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE. With the kernel slab allocator, if a page compresses to 60% of it original size, the memory savings gained through compression is lost in fragmentation because another object of the same size can't be stored in the leftover space. This ability to span pages results in zsmalloc allocations not being directly addressable by the user. The user is given an non-dereferencable handle in response to an allocation request. That handle must be mapped, using zs_map_object(), which returns a pointer to the mapped region that can be used. The mapping is necessary since the object data may reside in two different noncontigious pages. The zsmalloc fulfills the allocation needs for zram perfectly [sjenning@linux.vnet.ibm.com: borrow Seth's quote] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Nitin Gupta <ngupta@vflare.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit bcf1647d0899666f0fb90d176abf63bae22abb7c) Change-Id: Id6d44d6b743131eac94009d6765dcf2ae097501e zsmalloc: add copyright Add my copyright to the zsmalloc source code which I maintain. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 31fc00bb788ffde7d8d861d8b2bba798ab445992) zsmalloc: Fix CPU hotplug callback registration Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Instead, the correct and race-free way of performing the callback registration is: cpu_notifier_register_begin(); for_each_online_cpu(cpu) init_cpu(cpu); /* Note the use of the double underscored version of the API / __register_cpu_notifier(&foobar_cpu_notifier); cpu_notifier_register_done(); Fix the zsmalloc code by using this latter form of callback registration. Cc: Nitin Gupta <ngupta@vflare.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit f0e71fcd0fa6f3f5495cd9ad3f1e4acd94446a55) zram: remove drivers/staging/zram. Prepare for rebase to newer zram. Change-Id: I1985eb6c9abacf018e5ff597314723f68828f1e9 zram: promote zram from staging Zram has lived in staging for a LONG LONG time and have been fixed/improved by many contributors so code is clean and stable now. Of course, there are lots of product using zram in real practice. The major TV companys have used zram as swap since two years ago and recently our production team released android smart phone with zram which is used as swap, too and recently Android Kitkat start to use zram for small memory smart phone. And there was a report Google released their ChromeOS with zram, too and cyanogenmod have been used zram long time ago. And I heard some disto have used zram block device for tmpfs. In addition, I saw many report from many other peoples. For example, Lubuntu start to use it. The benefit of zram is very clear. With my experience, one of the benefit was to remove jitter of video application with backgroud memory pressure. It would be effect of efficient memory usage by compression but more issue is whether swap is there or not in the system. Recent mobile platforms have used JAVA so there are many anonymous pages. But embedded system normally are reluctant to use eMMC or SDCard as swap because there is wear-leveling and latency issues so if we do not use swap, it means we can't reclaim anoymous pages and at last, we could encounter OOM kill. :( Although we have real storage as swap, it was a problem, too. Because it sometime ends up making system very unresponsible caused by slow swap storage performance. Quote from Luigi on Google "Since Chrome OS was mentioned: the main reason why we don't use swap to a disk (rotating or SSD) is because it doesn't degrade gracefully and leads to a bad interactive experience. Generally we prefer to manage RAM at a higher level, by transparently killing and restarting processes. But we noticed that zram is fast enough to be competitive with the latter, and it lets us make more efficient use of the available RAM. " and he announced. http://www.spinics.net/lists/linux-mm/msg57717.html Other uses case is to use zram for block device. Zram is block device so anyone can format the block device and mount on it so some guys on the internet start zram as /var/tmp. http://forums.gentoo.org/viewtopic-t-838198-start-0.html Let's promote zram and enhance/maintain it instead of removing. Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Nitin Gupta <ngupta@vflare.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit cd67e10ac6997c6d1e1504e3c111b693bfdbc148) Change-Id: Ic5311ef825799d48c52fe9ea9e96b8277e7001fb zram: remove old private project comment Remove the old private compcache project address so upcoming patches should be sent to LKML because we Linux kernel community will take care. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 49061236a9c2e18b31617cef10d27ba136068bac) zram: add copyright Add my copyright to the zram source code which I maintain. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 7bfb3de8a1b3bebc2dc68d381efe27448c0584c5) zram: fix race between reset and flushing pending work Dan and Sergey reported that there is a racy between reset and flushing of pending work so that it could make oops by freeing zram->meta in reset while zram_slot_free can access zram->meta if new request is adding during the race window. This patch moves flush after taking init_lock so it prevents new request so that it closes the race. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit da4a04126baa3be03bc566d4a2ee0944c5e783d0) zram: delay pending free request in read path Sergey reported we don't need to handle pending free request every I/O so that this patch removes it in read path while we remain it in write path. Let's consider below example. Swap subsystem ask to zram "A" block free by swap_slot_free_notify but zram had been pended it without real freeing. Swap subsystem allocates "A" block for new data but request pended for a long time just handled and zram blindly free new data on the "A" block. :( That's why we couldn't remove handle pending free request right before zram-write. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9b353db16d18f87242337e3e61a948c023505a65) zram: remove unnecessary free Commit a0c516cbfc74 ("zram: don't grab mutex in zram_slot_free_noity") introduced pending zram slot free in zram's write path in case of missing slot free by memory allocation failure in zram_slot_free_notify but it is not necessary because we have already freed the slot right before overwriting. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 874e3cddc33f0c0f9cc08ad2b73fa0cbe7dfaa63) zram: use atomic operation for stat Some of fields in zram->stats are protected by zram->lock which is rather coarse-grained so let's use atomic operation without explict locking. This patch is ready for removing dependency of zram->lock in read path which is very coarse-grained rw_semaphore. Of course, this patch adds new atomic operation so it might make slow but my 12CPU test couldn't spot any regression. All gain/lose is marginal within stddev. iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0 ==Initial write ==Initial write records: 50 records: 50 avg: 412875.17 avg: 415638.23 std: 38543.12 (9.34%) std: 36601.11 (8.81%) max: 521262.03 max: 502976.72 min: 343263.13 min: 351389.12 ==Rewrite ==Rewrite records: 50 records: 50 avg: 416640.34 avg: 397914.33 std: 60798.92 (14.59%) std: 46150.42 (11.60%) max: 543057.07 max: 522669.17 min: 304071.67 min: 316588.77 ==Read ==Read records: 50 records: 50 avg: 4147338.63 avg: 4070736.51 std: 179333.25 (4.32%) std: 223499.89 (5.49%) max: 4459295.28 max: 4539514.44 min: 3753057.53 min: 3444686.31 ==Re-read ==Re-read records: 50 records: 50 avg: 4096706.71 avg: 4117218.57 std: 229735.04 (5.61%) std: 171676.25 (4.17%) max: 4430012.09 max: 4459263.94 min: 2987217.80 min: 3666904.28 ==Reverse Read ==Reverse Read records: 50 records: 50 avg: 4062763.83 avg: 4078508.32 std: 186208.46 (4.58%) std: 172684.34 (4.23%) max: 4401358.78 max: 4424757.22 min: 3381625.00 min: 3679359.94 ==Stride read ==Stride read records: 50 records: 50 avg: 4094933.49 avg: 4082170.22 std: 185710.52 (4.54%) std: 196346.68 (4.81%) max: 4478241.25 max: 4460060.97 min: 3732593.23 min: 3584125.78 ==Random read ==Random read records: 50 records: 50 avg: 4031070.04 avg: 4074847.49 std: 192065.51 (4.76%) std: 206911.33 (5.08%) max: 4356931.16 max: 4399442.56 min: 3481619.62 min: 3548372.44 ==Mixed workload ==Mixed workload records: 50 records: 50 avg: 149925.73 avg: 149675.54 std: 7701.26 (5.14%) std: 6902.09 (4.61%) max: 191301.56 max: 175162.05 min: 133566.28 min: 137762.87 ==Random write ==Random write records: 50 records: 50 avg: 404050.11 avg: 393021.47 std: 58887.57 (14.57%) std: 42813.70 (10.89%) max: 601798.09 max: 524533.43 min: 325176.99 min: 313255.34 ==Pwrite ==Pwrite records: 50 records: 50 avg: 411217.70 avg: 411237.96 std: 43114.99 (10.48%) std: 33136.29 (8.06%) max: 530766.79 max: 471899.76 min: 320786.84 min: 317906.94 ==Pread ==Pread records: 50 records: 50 avg: 4154908.65 avg: 4087121.92 std: 151272.08 (3.64%) std: 219505.04 (5.37%) max: 4459478.12 max: 4435857.38 min: 3730512.41 min: 3101101.67 Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit deb0bdeb2f3d6b81d37fc778316dae46b6daab56) zram: introduce zram->tb_lock Currently, the zram table is protected by zram->lock but it's rather coarse-grained lock and it makes hard for scalibility. Let's use own rwlock instead of depending on zram->lock. This patch adds new locking so obviously, it would make slow but this patch is just prepartion for removing coarse-grained rw_semaphore(ie, zram->lock) which is hurdle about zram scalability. Final patch in this patchset series will remove the lock from read-path and change rw_semaphore with mutex in write path. With bonus, we could drop pending slot free mess in next patch. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 92967471b67163bb1654e9b7fe99449ab70a4aaa) zram: remove workqueue for freeing removed pending slot Commit a0c516cbfc74 ("zram: don't grab mutex in zram_slot_free_noity") introduced free request pending code to avoid scheduling by mutex under spinlock and it was a mess which made code lenghty and increased overhead. Now, we don't need zram->lock any more to free slot so this patch reverts it and then, tb_lock should protect it. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit f614a9f48dedd2b80d1dc8bae8094842fcdb39dd) zram: remove zram->lock in read path and change it with mutex Finally, we separated zram->lock dependency from 32bit stat/ table handling so there is no reason to use rw_semaphore between read and write path so this patch removes the lock from read path totally and changes rw_semaphore with mutex. So, we could do old: read-read: OK read-write: NO write-write: NO Now: read-read: OK read-write: OK write-write: NO The below data proves mixed workload performs well 11 times and there is also enhance on write-write path because current rw-semaphore doesn't support SPIN_ON_OWNER. It's side effect but anyway good thing for us. Write-related tests perform better (from 61% to 1058%) but read path has good/bad(from -2.22% to 1.45%) but they are all marginal within stddev. CPU 12 iozone -t -T -l 12 -u 12 -r 16K -s 60M -I +Z -V 0 ==Initial write ==Initial write records: 10 records: 10 avg: 516189.16 avg: 839907.96 std: 22486.53 (4.36%) std: 47902.17 (5.70%) max: 546970.60 max: 909910.35 min: 481131.54 min: 751148.38 ==Rewrite ==Rewrite records: 10 records: 10 avg: 509527.98 avg: 1050156.37 std: 45799.94 (8.99%) std: 40695.44 (3.88%) max: 611574.27 max: 1111929.26 min: 443679.95 min: 980409.62 ==Read ==Read records: 10 records: 10 avg: 4408624.17 avg: 4472546.76 std: 281152.61 (6.38%) std: 163662.78 (3.66%) max: 4867888.66 max: 4727351.03 min: 4058347.69 min: 4126520.88 ==Re-read ==Re-read records: 10 records: 10 avg: 4462147.53 avg: 4363257.75 std: 283546.11 (6.35%) std: 247292.63 (5.67%) max: 4912894.44 max: 4677241.75 min: 4131386.50 min: 4035235.84 ==Reverse Read ==Reverse Read records: 10 records: 10 avg: 4565865.97 avg: 4485818.08 std: 313395.63 (6.86%) std: 248470.10 (5.54%) max: 5232749.16 max: 4789749.94 min: 4185809.62 min: 3963081.34 ==Stride read ==Stride read records: 10 records: 10 avg: 4515981.80 avg: 4418806.01 std: 211192.32 (4.68%) std: 212837.97 (4.82%) max: 4889287.28 max: 4686967.22 min: 4210362.00 min: 4083041.84 ==Random read ==Random read records: 10 records: 10 avg: 4410525.23 avg: 4387093.18 std: 236693.22 (5.37%) std: 235285.23 (5.36%) max: 4713698.47 max: 4669760.62 min: 4057163.62 min: 3952002.16 ==Mixed workload ==Mixed workload records: 10 records: 10 avg: 243234.25 avg: 2818677.27 std: 28505.07 (11.72%) std: 195569.70 (6.94%) max: 288905.23 max: 3126478.11 min: 212473.16 min: 2484150.69 ==Random write ==Random write records: 10 records: 10 avg: 555887.07 avg: 1053057.79 std: 70841.98 (12.74%) std: 35195.36 (3.34%) max: 683188.28 max: 1096125.73 min: 437299.57 min: 992481.93 ==Pwrite ==Pwrite records: 10 records: 10 avg: 501745.93 avg: 810363.09 std: 16373.54 (3.26%) std: 19245.01 (2.37%) max: 518724.52 max: 833359.70 min: 464208.73 min: 765501.87 ==Pread ==Pread records: 10 records: 10 avg: 4539894.60 avg: 4457680.58 std: 197094.66 (4.34%) std: 188965.60 (4.24%) max: 4877170.38 max: 4689905.53 min: 4226326.03 min: 4095739.72 Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit e46e33152eb82b8e2db7ffb3790a2a2653c34513) zram: avoid null access when fail to alloc meta zram_meta_alloc could fail so caller should check it. Otherwise, your system will hang. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit db5d711e2db776f18219b033e5dc4fb7e4264dd7) zram: drop `init_done' struct zram member Introduce init_done() helper function which allows us to drop `init_done' struct zram member. init_done() uses the fact that ->init_done == 1 equals to ->meta != NULL. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit be2d1d56c82d8cf20e6c77515eb499f8e86eb5be) zram: do not pass rw argument to __zram_make_request() Do not pass rw argument down the __zram_make_request() -> zram_bvec_rw() chain, decode it in zram_bvec_rw() instead. Besides, this is the place where we distinguish READ and WRITE bio data directions, so account zram RW stats here, instead of __zram_make_request(). This also allows to account a real number of zram READ/WRITE operations, not just requests (single RW request may cause a number of zram RW ops with separate locking, compression/decompression, etc). Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit be257c61306750d11c20d2ac567bf63304c696a3) zram: remove good and bad compress stats Remove `good' and `bad' compressed sub-requests stats. RW request may cause a number of RW sub-requests. zram used to account `good' compressed sub-queries (with compressed size less than 50% of original size), `bad' compressed sub-queries (with compressed size greater that 75% of original size), leaving sub-requests with compression size between 50% and 75% of original size not accounted and not reported. zram already accounts each sub-request's compression size so we can calculate real device compression ratio. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit b7cccf8b4009bf74df61f3c9d86b95fabd807c11) zram: use atomic64_t for all zram stats This is a preparation patch for stats code duplication removal. 1) use atomic64_t for `pages_zero' and `pages_stored' zram stats. 2) `compr_size' and `pages_zero' struct zram_stats members did not follow the existing device attr naming scheme: zram_stats.ATTR has ATTR_show() function. rename them: -- compr_size -> compr_data_size -- pages_zero -> zero_pages Minchan Kim's note: If we really have trouble with atomic stat operation, we could change it with percpu_counter so that it could solve atomic overhead and unnecessary memory space by introducing unsigned long instead of 64bit atomic_t. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 90a7806ea9b9f7cb4751859cc2506e2d80e36ef1) zram: remove zram stats code duplication Introduce ZRAM_ATTR_RO macro that generates device_attribute and default ATTR show() function for existing atomic64_t zram stats. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit a68eb3b65e658406d386bebef02277f4007b2f45) zram: report failed read and write stats zram accounted but did not report numbers of failed read and write queries. make these stats available as failed_reads and failed_writes attrs. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 6444724939db5de7390c90f7b4a657159b3b4465) zram: drop not used table `count' member struct table `count' member is not used. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 59fc86a4922f1a1c0f69eac758a7e2b2b138aab4) zram: move zram size warning to documentation Move zram warning about disksize and size of memory correlation to zram documentation. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit e64cd51d2fa87733176246101df871a8ac5c7c20) zram: delete zram_init_device() allocate new `zram_meta' in disksize_store() only for uninitialised zram device, saving a number of allocations and deallocations in case if disksize_store() was called on currently used device. at the same time zram_meta stack variable is not necessary, because we can set ->meta directly. there is also no need in setting QUEUE_FLAG_NONROT queue on every disksize_store(), set it once during device creation. [minchan@kernel.org: handle zram->meta alloc fail case] [minchan@kernel.org: prevent lockdep spew of init_lock] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit b67d1ec189ffb92cdad9b2bd29475fb1e0166983) zram: introduce compressing backend abstraction ZRAM performs direct LZO compression algorithm calls, making it the one and only option. While LZO is generally performs well, LZ4 algorithm tends to have a faster decompression (see http://code.google.com/p/lz4/ for full report) Name Ratio C.speed D.speed MB/s MB/s LZ4 (r101) 2.084 422 1820 LZO 2.06 2.106 414 600 Thus, users who have mostly read (decompress) usage scenarious or mixed workflow (writes with relatively high read ops number) will benefit from using LZ4 compression backend. Introduce compressing backend abstraction zcomp in order to support multiple compression algorithms with the following set of operations: .create .destroy .compress .decompress Schematically zram write() usually contains the following steps: 0) preparation (decompression of partioal IO, etc.) 1) lock buffer_lock mutex (protects meta compress buffers) 2) compress (using meta compress buffers) 3) alloc and map zs_pool object 4) copy compressed data (from meta compress buffers) to object allocated by 3) 5) free previous pool page, assign a new one 6) unlock buffer_lock mutex As we can see, compressing buffers must remain untouched from 1) to 4), because, otherwise, concurrent write() can overwrite data. At the same time, zram_meta must be aware of a) specific compression algorithm memory requirements and b) necessary locking to protect compression buffers. To remove requirement a) new struct zcomp_strm introduced, which contains a compress/decompress `buffer' and compression algorithm `private' part. While struct zcomp implements zcomp_strm stream handling and locking and removes requirement b) from zram meta. zcomp ->create() and ->destroy(), respectively, allocate and deallocate algorithm specific zcomp_strm `private' part. Every zcomp has zcomp stream and mutex to protect its compression stream. Stream usage semantics remains the same -- only one write can hold stream lock and use its buffers. zcomp_strm_find() turns caller into exclusive user of a stream (holding stream mutex until zram release stream), and zcomp_strm_release() makes zcomp stream available (unlock the stream mutex). Hence no concurrent write (compression) operations possible at the moment. iozone -t 3 -R -r 16K -s 60M -I +Z test base patched -------------------------------------------------- Initial write 597992.91 591660.58 Rewrite 609674.34 616054.97 Read 2404771.75 2452909.12 Re-read 2459216.81 2470074.44 Reverse Read 1652769.66 1589128.66 Stride read 2202441.81 2202173.31 Random read 2236311.47 2276565.31 Mixed workload 1423760.41 1709760.06 Random write 579584.08 615933.86 Pwrite 597550.02 594933.70 Pread 1703672.53 1718126.72 Fwrite 1330497.06 1461054.00 Fread 3922851.00 3957242.62 Usage examples: comp = zcomp_create(NAME) / NAME e.g. "lzo" / which initialises compressing backend if requested algorithm is supported. Compress: zstrm = zcomp_strm_find(comp) zcomp_compress(comp, zstrm, src, &dst_len) [..] / copy compressed data */ zcomp_strm_release(comp, zstrm) Decompress: zcomp_decompress(comp, src, src_len, dst); Free compessing backend and its zcomp stream: zcomp_destroy(comp) Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit e7e1ef439d18f9a21521116ea9f2b976d7230e54) zram: use zcomp compressing backends Do not perform direct LZO compress/decompress calls, initialise and use zcomp LZO backend (single compression stream) instead. [akpm@linux-foundation.org: resolve conflicts with zram-delete-zram_init_device-fix.patch] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit b7ca232ee7e85ed3b18e39eb20a7f458ee1d6047) zram: factor out single stream compression This is preparation patch to add multi stream support to zcomp. Introduce struct zcomp_strm_single and a set of functions to manage zcomp_strm stream access. zcomp_strm_single implements single compession stream, same way as current zcomp implementation. This moves zcomp_strm stream control and locking from zcomp, so compressing backend zcomp is not aware of required locking. Single and multi streams require different locking schemes. Minchan Kim reported that spinlock-based locking scheme (which is used in multi stream implementation) has demonstrated a severe perfomance regression for single compression stream case, comparing to mutex-based. see https://lkml.org/lkml/2014/2/18/16 The following set of functions added: - zcomp_strm_single_find()/zcomp_strm_single_release() find and release a compression stream, implement required locking - zcomp_strm_single_create()/zcomp_strm_single_destroy() create and destroy zcomp_strm_single New ->strm_find() and ->strm_release() callbacks added to zcomp, which are set to zcomp_strm_single_find() and zcomp_strm_single_release() during initialisation. Instead of direct locking and zcomp_strm access from zcomp_strm_find() and zcomp_strm_release(), zcomp now calls ->strm_find() and ->strm_release() correspondingly. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9cc97529a180b369fcb7e5265771b6ba7e01f05b) zram: document failed_reads, failed_writes stats Document `failed_reads' and `failed_writes' device attributes. Remove info about `discard' - there is no such zram attr. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 8dd1d3247e6c00b50ef83934ea8b22a1590015de) zram: add multi stream functionality Existing zram (zcomp) implementation has only one compression stream (buffer and algorithm private part), so in order to prevent data corruption only one write (compress operation) can use this compression stream, forcing all concurrent write operations to wait for stream lock to be released. This patch changes zcomp to keep a compression streams list of user-defined size (via sysfs device attr). Each write operation still exclusively holds compression stream, the difference is that we can have N write operations (depending on size of streams list) executing in parallel. See TEST section later in commit message for performance data. Introduce struct zcomp_strm_multi and a set of functions to manage zcomp_strm stream access. zcomp_strm_multi has a list of idle zcomp_strm structs, spinlock to protect idle list and wait queue, making it possible to perform parallel compressions. The following set of functions added: - zcomp_strm_multi_find()/zcomp_strm_multi_release() find and release a compression stream, implement required locking - zcomp_strm_multi_create()/zcomp_strm_multi_destroy() create and destroy zcomp_strm_multi zcomp ->strm_find() and ->strm_release() callbacks are set during initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release() correspondingly. Each time zcomp issues a zcomp_strm_multi_find() call, the following set of operations performed: - spin lock strm_lock - if idle list is not empty, remove zcomp_strm from idle list, spin unlock and return zcomp stream pointer to caller - if idle list is empty, current adds itself to wait queue. it will be awaken by zcomp_strm_multi_release() caller. zcomp_strm_multi_release(): - spin lock strm_lock - add zcomp stream to idle list - spin unlock, wake up sleeper Minchan Kim reported that spinlock-based locking scheme has demonstrated a severe perfomance regression for single compression stream case, comparing to mutex-based (see https://lkml.org/lkml/2014/2/18/16) base spinlock mutex ==Initial write ==Initial write ==Initial write records: 5 records: 5 records: 5 avg: 1642424.35 avg: 699610.40 avg: 1655583.71 std: 39890.95(2.43%) std: 232014.19(33.16%) std: 52293.96 max: 1690170.94 max: 1163473.45 max: 1697164.75 min: 1568669.52 min: 573429.88 min: 1553410.23 ==Rewrite ==Rewrite ==Rewrite records: 5 records: 5 records: 5 avg: 1611775.39 avg: 501406.64 avg: 1684419.11 std: 17144.58(1.06%) std: 15354.41(3.06%) std: 18367.42 max: 1641800.95 max: 531356.78 max: 1706445.84 min: 1593515.27 min: 488817.78 min: 1655335.73 When only one compression stream available, mutex with spin on owner tends to perform much better than frequent wait_event()/wake_up(). This is why single stream implemented as a special case with mutex locking. Introduce and document zram device attribute max_comp_streams. This attr shows and stores current zcomp's max number of zcomp streams (max_strm). Extend zcomp's zcomp_create() with `max_strm' parameter. `max_strm' limits the number of zcomp_strm structs in compression backend's idle list (max_comp_streams). max_comp_streams used during initialisation as follows: -- passing to zcomp_create() max_strm equals to 1 will initialise zcomp using single compression stream zcomp_strm_single (mutex-based locking). -- passing to zcomp_create() max_strm greater than 1 will initialise zcomp using multi compression stream zcomp_strm_multi (spinlock-based locking). default max_comp_streams value is 1, meaning that zram with single stream will be initialised. Later patch will introduce configuration knob to change max_comp_streams on already initialised and used zcomp. TEST iozone -t 3 -R -r 16K -s 60M -I +Z test base 1 strm (mutex) 3 strm (spinlock) ----------------------------------------------------------------------- Initial write 589286.78 583518.39 718011.05 Rewrite 604837.97 596776.38 1515125.72 Random write 584120.11 595714.58 1388850.25 Pwrite 535731.17 541117.38 739295.27 Fwrite 1418083.88 1478612.72 1484927.06 Usage example: set max_comp_streams to 4 echo 4 > /sys/block/zram0/max_comp_streams show current max_comp_streams (default value is 1). cat /sys/block/zram0/max_comp_streams Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit beca3ec71fe5490ee9237dc42400f50402baf83e) zram: add set_max_streams knob This patch allows to change max_comp_streams on initialised zcomp. Introduce zcomp set_max_streams() knob, zcomp_strm_multi_set_max_streams() and zcomp_strm_single_set_max_streams() callbacks to change streams limit for zcomp_strm_multi and zcomp_strm_single, accordingly. set_max_streams for single steam zcomp does nothing. If user has lowered the limit, then zcomp_strm_multi_set_max_streams() attempts to immediately free extra streams (as much as it can, depending on idle streams availability). Note, this patch does not allow to change stream 'policy' from single to multi stream (or vice versa) on already initialised compression backend. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit fe8eb122c82b2049c460fc6df6e8583a2f935cff) zram: make compression algorithm selection possible Add and document `comp_algorithm' device attribute. This attribute allows to show supported compression and currently selected compression algorithms: cat /sys/block/zram0/comp_algorithm [lzo] lz4 and change selected compression algorithm: echo lzo > /sys/block/zram0/comp_algorithm Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit e46b8a030d76d3c94156c545c3f4c3676d813435) zram: add lz4 algorithm backend Introduce LZ4 compression backend and make it available for selection. LZ4 support is optional and requires user to set ZRAM_LZ4_COMPRESS config option. The default compression backend is LZO. TEST (x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G, ext4 file system, 3 compression streams) iozone -t 3 -R -r 16K -s 60M -I +Z Test LZO LZ4 ---------------------------------------------- Initial write 1642744.62 1317005.09 Rewrite 2498980.88 1800645.16 Read 3957026.38 5877043.75 Re-read 3950997.38 5861847.00 Reverse Read 2937114.56 5047384.00 Stride read 2948163.19 4929587.38 Random read 3292692.69 4880793.62 Mixed workload 1545602.62 3502940.38 Random write 2448039.75 1758786.25 Pwrite 1670051.03 1338329.69 Pread 2530682.00 5097177.62 Fwrite 3232085.62 3275942.56 Fread 6306880.25 6645271.12 So on my system LZ4 is slower in write-only tests, while it performs better in read-only and mixed (reads + writes) tests. Official LZ4 benchmarks available here http://code.google.com/p/lz4/ (linux kernel uses revision r90). Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 6e76668e415adf799839f0ab205142ad7002d260) zram: move comp allocation out of init_lock While fixing lockdep spew of ->init_lock reported by Sasha Levin [1], Minchan Kim noted [2] that it's better to move compression backend allocation (using GPF_KERNEL) out of the ->init_lock lock, same way as with zram_meta_alloc(), in order to prevent the same lockdep spew. [1] https://lkml.org/lkml/2014/2/27/337 [2] https://lkml.org/lkml/2014/3/3/32 Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit d61f98c70e8b0d324e8e83be2ed546d6295e63f3) zram: return error-valued pointer from zcomp_create() Instead of returning just NULL, return ERR_PTR from zcomp_create() if compressing backend creation has failed. ERR_PTR(-EINVAL) for unsupported compression algorithm request, ERR_PTR(-ENOMEM) for allocation (zcomp or compression stream) error. Perform IS_ERR() check of returned from zcomp_create() value in disksize_store() and set return code to PTR_ERR(). Change suggested by Jerome Marchand. [akpm@linux-foundation.org: clean up error recovery flow] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Jerome Marchand <jmarchan@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit fcfa8d95cacf5cbbe6dee6b8d229fe86142266e0) zram: propagate error to user When we initialized zcomp with single, we couldn't change max_comp_streams without zram reset but current interface doesn't show any error to user and even it changes max_comp_streams's value without any effect so it would make user very confusing. This patch prevents max_comp_streams's change when zcomp was initialized as single zcomp and emit the error to user(ex, echo). [akpm@linux-foundation.org: don't return with the lock held, per Sergey] [fengguang.wu@intel.com: fix coccinelle warnings] Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchan@redhat.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 60a726e33375a1096e85399cfa1327081b4c38be) zram: use scnprintf() in attrs show() methods sysfs.txt documentation lists the following requirements: - The buffer will always be PAGE_SIZE bytes in length. On i386, this is 4096. - show() methods should return the number of bytes printed into the buffer. This is the return value of scnprintf(). - show() should always use scnprintf(). Use scnprintf() in show() functions. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 56b4e8cb85827a2ccc4752a2a7148e56b62b7e96) zram: support REQ_DISCARD zram is ram based block device and can be used by backend of filesystem. When filesystem deletes a file, it normally doesn't do anything on data block of that file. It just marks on metadata of that file. This behavior has no problem on disk based block device, but has problems on ram based block device, since we can't free memory used for data block. To overcome this disadvantage, there is REQ_DISCARD functionality. If block device support REQ_DISCARD and filesystem is mounted with discard option, filesystem sends REQ_DISCARD to block device whenever some data blocks are discarded. All we have to do is to handle this request. This patch implements to flag up QUEUE_FLAG_DISCARD and handle this REQ_DISCARD request. With it, we can free memory used by zram if it isn't used. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit f4659d8e620d08bd1a84a8aec5d2f5294a242764)	2015-09-13 12:15:49 -07:00
Wanpeng Li	50deac6cf7	mm/hwpoison: fix page refcount of unknown non LRU page commit 4f32be677b124a49459e2603321c7a5605ceb9f8 upstream. After trying to drain pages from pagevec/pageset, we try to get reference count of the page again, however, the reference count of the page is not reduced if the page is still not on LRU list. Fix it by adding the put_page() to drop the page reference which is from __get_any_page(). Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-09-13 09:07:59 -07:00
Veena Sambasivan	52acbe414c	lowmemorykiller: Don't count swap cache pages twice The lowmem_shrink function discounts all the swap cache pages from the file cache count. The zone aware code also discounts all file cache pages from a certain zone. This results in some swap cache pages being discounted twice, which can result in the low memory killer being unnecessarily aggressive. Fix the low memory killer to only discount the swap cache pages once. Change-Id: I650bbfbf0fbbabd01d82bdb3502b57ff59c3e14f Signed-off-by: Veena Sambasivan <veenas@codeaurora.org>	2015-09-12 22:21:27 +00:00
Naveen Ramaraj	66b3b83bc7	mm: Fix mem_init_print_info() for UML UML uses _end instead of __bss_stop to represent the boundary Bug: 21631098 Change-Id: I9ad92d99d3de2ca89497b1a14d98322c43ea99fa Signed-off-by: Naveen Ramaraj <nramaraj@codeaurora.org>	2015-08-22 00:02:45 -07:00
Michal Hocko	022d35a6db	mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations commit ecf5fc6e9654cd7a268c782a523f072b2f1959f9 upstream. Nikolay has reported a hang when a memcg reclaim got stuck with the following backtrace: PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync" #0 __schedule at ffffffff815ab152 #1 schedule at ffffffff815ab76e #2 schedule_timeout at ffffffff815ae5e5 #3 io_schedule_timeout at ffffffff815aad6a #4 bit_wait_io at ffffffff815abfc6 #5 __wait_on_bit at ffffffff815abda5 #6 wait_on_page_bit at ffffffff8111fd4f #7 shrink_page_list at ffffffff81135445 #8 shrink_inactive_list at ffffffff81135845 #9 shrink_lruvec at ffffffff81135ead #10 shrink_zone at ffffffff811360c3 #11 shrink_zones at ffffffff81136eff #12 do_try_to_free_pages at ffffffff8113712f #13 try_to_free_mem_cgroup_pages at ffffffff811372be #14 try_charge at ffffffff81189423 #15 mem_cgroup_try_charge at ffffffff8118c6f5 #16 __add_to_page_cache_locked at ffffffff8112137d #17 add_to_page_cache_lru at ffffffff81121618 #18 pagecache_get_page at ffffffff8112170b #19 grow_dev_page at ffffffff811c8297 #20 __getblk_slow at ffffffff811c91d6 #21 __getblk_gfp at ffffffff811c92c1 #22 ext4_ext_grow_indepth at ffffffff8124565c #23 ext4_ext_create_new_leaf at ffffffff81246ca8 #24 ext4_ext_insert_extent at ffffffff81246f09 #25 ext4_ext_map_blocks at ffffffff8124a848 #26 ext4_map_blocks at ffffffff8121a5b7 #27 mpage_map_one_extent at ffffffff8121b1fa #28 mpage_map_and_submit_extent at ffffffff8121f07b #29 ext4_writepages at ffffffff8121f6d5 #30 do_writepages at ffffffff8112c490 #31 __filemap_fdatawrite_range at ffffffff81120199 #32 filemap_flush at ffffffff8112041c #33 ext4_alloc_da_blocks at ffffffff81219da1 #34 ext4_rename at ffffffff81229b91 #35 ext4_rename2 at ffffffff81229e32 #36 vfs_rename at ffffffff811a08a5 #37 SYSC_renameat2 at ffffffff811a3ffc #38 sys_renameat2 at ffffffff811a408e #39 sys_rename at ffffffff8119e51e #40 system_call_fastpath at ffffffff815afa89 Dave Chinner has properly pointed out that this is a deadlock in the reclaim code because ext4 doesn't submit pages which are marked by PG_writeback right away. The heuristic was introduced by commit `e62e384e9d` ("memcg: prevent OOM with too many dirty pages") and it was applied only when may_enter_fs was specified. The code has been changed by `c3b94f44fc` ("memcg: further prevent OOM with too many dirty pages") which has removed the __GFP_FS restriction with a reasoning that we do not get into the fs code. But this is not sufficient apparently because the fs doesn't necessarily submit pages marked PG_writeback for IO right away. ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily submit the bio. Instead it tries to map more pages into the bio and mpage_map_one_extent might trigger memcg charge which might end up waiting on a page which is marked PG_writeback but hasn't been submitted yet so we would end up waiting for something that never finishes. Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2) before we go to wait on the writeback. The page fault path, which is the only path that triggers memcg oom killer since 3.12, shouldn't require GFP_NOFS and so we shouldn't reintroduce the premature OOM killer issue which was originally addressed by the heuristic. As per David Chinner the xfs is doing similar thing since 2.6.15 already so ext4 is not the only affected filesystem. Moreover he notes: : For example: IO completion might require unwritten extent conversion : which executes filesystem transactions and GFP_NOFS allocations. The : writeback flag on the pages can not be cleared until unwritten : extent conversion completes. Hence memory reclaim cannot wait on : page writeback to complete in GFP_NOFS context because it is not : safe to do so, memcg reclaim or otherwise. [tytso@mit.edu: corrected the control flow] Fixes: `c3b94f44fc` ("memcg: further prevent OOM with too many dirty pages") Reported-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-08-16 20:51:43 -07:00
Kirill A. Shutemov	efcbc94afe	mm: avoid setting up anonymous pages into file mapping commit 6b7339f4c31ad69c8e9c0b2859276e22cf72176d upstream. Reading page fault handler code I've noticed that under right circumstances kernel would map anonymous pages into file mappings: if the VMA doesn't have vm_ops->fault() and the VMA wasn't fully populated on ->mmap(), kernel would handle page fault to not populated pte with do_anonymous_page(). Let's change page fault handler to use do_anonymous_page() only on anonymous VMA (->vm_ops == NULL) and make sure that the VMA is not shared. For file mappings without vm_ops->fault() or shred VMA without vm_ops, page fault on pte_none() entry would lead to SIGBUS. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Willy Tarreau <w@1wt.eu> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-08-10 12:20:29 -07:00
Gu Zheng	31c6d4e4ff	mm/memory_hotplug.c: set zone->wait_table to null after freeing it commit 85bd839983778fcd0c1c043327b14a046e979b39 upstream. Izumi found the following oops when hot re-adding a node: BUG: unable to handle kernel paging request at ffffc90008963690 IP: __wake_up_bit+0x20/0x70 Oops: 0000 [#1] SMP CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80 Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015 task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000 RIP: 0010:[<ffffffff810dff80>] [<ffffffff810dff80>] __wake_up_bit+0x20/0x70 RSP: 0018:ffff880017b97be8 EFLAGS: 00010246 RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9 RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648 RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800 R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000 FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0 Call Trace: unlock_page+0x6d/0x70 generic_write_end+0x53/0xb0 xfs_vm_write_end+0x29/0x80 [xfs] generic_perform_write+0x10a/0x1e0 xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs] xfs_file_write_iter+0x79/0x120 [xfs] __vfs_write+0xd4/0x110 vfs_write+0xac/0x1c0 SyS_write+0x58/0xd0 system_call_fastpath+0x12/0x76 Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48 RIP [<ffffffff810dff80>] __wake_up_bit+0x20/0x70 RSP <ffff880017b97be8> CR2: ffffc90008963690 Reproduce method (re-add a node):: Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic) This seems an use-after-free problem, and the root cause is zone->wait_table was not set to NULL after free it in try_offline_node. When hot re-add a node, we will reuse the pgdat of it, so does the zone struct, and when add pages to the target zone, it will init the zone first (including the wait_table) if the zone is not initialized. The judgement of zone initialized is based on zone->wait_table: static inline bool zone_is_initialized(struct zone zone) { return !!zone->wait_table; } so if we do not set the zone->wait_table to NULL* after free it, the memory hotplug routine will skip the init of new zone when hot re-add the node, and the wait_table still points to the freed memory, then we will access the invalid address when trying to wake up the waiting people after the i/o operation with the page is done, such as mentioned above. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reported-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Reviewed by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-06-22 16:55:54 -07:00
Theodore Ts'o	15d98c0c2b	bludgeon the qualcomm kernel until it builds on i386 for qemu testing Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2015-06-15 15:10:00 -07:00
Lukas Czerner	bfbf1395fa	mm: change invalidatepage prototype to accept length Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Change-Id: Id47992f86b307985b3215bcf141d56d1849d71df Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> (cherry picked from commit d47992f86b307985b3215bcf141d56d1849d71df)	2015-06-15 15:09:55 -07:00
Theodore Ts'o	3a4b50f6df	ext4: backport mm portion of: fix data integrity sync in ordered mode Commit 1c8349a17137: "ext4: fix data integrity sync in ordered mode" included changes to include/linux/page-flags.h and mm/page-writeback.c. Apply them as part of the 3.18 ext4 backport. Change-Id: I14d09741af22f50b7c50aff0f2f165a4ff8ea76d Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Theodore Ts'o <tytso@google.com>	2015-06-15 15:09:44 -07:00
Theodore Ts'o	2405316c4f	mm: add find_get_page_flags() for 3.18 ext4 backport Change-Id: I1bf562a302cae211c0b28fb3872f8ee15660c182 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Theodore Ts'o <tytso@google.com>	2015-06-15 15:09:35 -07:00
Jeff Vander Stoep	22cf851e4e	mm: reorder can_do_mlock to fix audit denial A userspace call to mmap(MAP_LOCKED) may result in the successful locking of memory while also producing a confusing audit log denial. can_do_mlock checks capable and rlimit. If either of these return positive can_do_mlock returns true. The capable check leads to an LSM hook used by apparmour and selinux which produce the audit denial. Reordering so rlimit is checked first eliminates the denial on success, only recording a denial when the lock is unsuccessful as a result of the denial. Signed-off-by: Jeff Vander Stoep <jeffv@google.com> Acked-by: Nick Kralevich <nnk@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Paul Cassella <cassella@cray.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2015-06-15 09:39:53 -07:00
Naoya Horiguchi	0073e613da	mm/memory-failure: call shake_page() when error hits thp tail page commit 09789e5de18e4e442870b2d700831f5cb802eb05 upstream. Currently memory_failure() calls shake_page() to sweep pages out from pcplists only when the victim page is 4kB LRU page or thp head page. But we should do this for a thp tail page too. Consider that a memory error hits a thp tail page whose head page is on a pcplist when memory_failure() runs. Then, the current kernel skips shake_pages() part, so hwpoison_user_mappings() returns without calling split_huge_page() nor try_to_unmap() because PageLRU of the thp head is still cleared due to the skip of shake_page(). As a result, me_huge_page() runs for the thp, which is broken behavior. One effect is a leak of the thp. And another is to fail to isolate the memory error, so later access to the error address causes another MCE, which kills the processes which used the thp. This patch fixes this problem by calling shake_page() for thp tail case. Fixes: `385de35722` ("thp: allow a hwpoisoned head page to be put back to LRU") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Dean Nelson <dnelson@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-05-17 09:51:32 -07:00
Ian Maund	faa9cc5339	This is the 3.10.73 stable release -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJVFBE+AAoJEDjbvchgkmk+oTkP/j2ipSvgXghFEipZbOJUQkqC fa8elfoF7riTKpKOuDtDU2WI1ttCGYs5gmTNpd4KaEt23eJOQgVqIpV8GhAkW5Af NVyGhjF3dXNqpBkxnyuIkk5OLrNKGRNS2xpz1U254iGObYrK+tr62IzGPxEcPAhX Y+58xPVSjLtNdTJW3YLT3DohUbnbHG6Br9geI1IHtlxg1oDiTxtnX2FmOFzzDpP5 qu8gnPIekg/+1EE46nEiq0C59AwC3aCzNxwlYe1Kd41SY3LUFF1eZMzmOnnwyI5K 3FslAzT6x/sOmGJFTYrKjFA4GKsW67xHVkB/hp/Mu768RqxiQCxV4kgmPsAFLbXb D5qbNwr3i0iQ/9AaD7h8HJkxC/KHmszMux00L/mgZ3SGdGMEIBxHg+oP8+nP8V6C WfXKSWA94dpdRyULEfWdnKnUnp2860C7kt7ASTkOl8rIgU8HgaRqeu+U/KPM2ovD ZJtXPVB5UXCRuVAhZwbvvrLOY8UMZTnv2auAaeLYG8YptcvGeN5Z398/8qdV/z7c A9kOsgebs74X+lR3rbVgSDPQaq2AEiuIvtX77SfmrWXBXGmc99i9+PikuFggRprz cJm5bCM9DaHu/3b77X9Fwl7vnpReB0zPHiwTdH/p7OPMf5m1uQt7SqegC6btLPHs iYgjLd4oW+6uiV/2X1Vx =L+mC -----END PGP SIGNATURE----- Merge commit 'v3.10.73' into LA.BF64.1.2.9 This merge brings us up to date with upstream kernel.org tag v3.10.73. As part of the conflict resolution, changes introduced by commit `72684eae7` ("arm64: Fix up /proc/cpuinfo") have been intentionally dropped, as they conflict with Android changes msm-3.10 kernel to solve the problems in a different way. Since userspace readers of this file may depend on the existing msm-3.10 implementation, it's left as-is for now. The commit may later be introduced if it is found to not impact userspaces paired with this kernel. * commit 'v3.10.73' (264 commits): Linux 3.10.73 target: Allow Write Exclusive non-reservation holders to READ target: Allow AllRegistrants to re-RESERVE existing reservation target: Fix R_HOLDER bit usage for AllRegistrants target/pscsi: Fix NULL pointer dereference in get_device_type iscsi-target: Avoid early conn_logout_comp for iser connections target: Fix reference leak in target_get_sess_cmd() error path ARM: at91: pm: fix at91rm9200 standby ipvs: rerouting to local clients is not needed anymore ipvs: add missing ip_vs_pe_put in sync code powerpc/smp: Wait until secondaries are active & online x86/vdso: Fix the build on GCC5 x86/fpu: Drop_fpu() should not assume that tsk equals current x86/fpu: Avoid math_state_restore() without used_math() in __restore_xstate_sig() crypto: aesni - fix memory usage in GCM decryption libsas: Fix Kernel Crash in smp_execute_task xen-pciback: limit guest control of command register nilfs2: fix deadlock of segment constructor during recovery regulator: core: Fix enable GPIO reference counting regulator: Only enable disabled regulators on resume ALSA: hda - Treat stereo-to-mono mix properly ALSA: hda - Add workaround for MacBook Air 5,2 built-in mic ALSA: hda - Set single_adc_amp flag for CS420x codecs ALSA: hda - Don't access stereo amps for mono channel widgets ALSA: hda - Fix built-in mic on Compaq Presario CQ60 ALSA: control: Add sanity checks for user ctl id name string spi: pl022: Fix race in giveback() leading to driver lock-up tpm/ibmvtpm: Additional LE support for tpm_ibmvtpm_send workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE can: add missing initialisations in CAN related skbuffs Change email address for 8250_pci virtio_console: init work unconditionally fuse: notify: don't move pages fuse: set stolen page uptodate drm/radeon: drop setting UPLL to sleep mode drm/radeon: do a posting read in rs600_set_irq drm/radeon: do a posting read in si_set_irq drm/radeon: do a posting read in r600_set_irq drm/radeon: do a posting read in r100_set_irq drm/radeon: do a posting read in evergreen_set_irq drm/radeon: fix DRM_IOCTL_RADEON_CS oops tcp: make connect() mem charging friendly net: compat: Update get_compat_msghdr() to match copy_msghdr_from_user() behaviour tcp: fix tcp fin memory accounting Revert "net: cx82310_eth: use common match macro" rxrpc: bogus MSG_PEEK test in rxrpc_recvmsg() caif: fix MSG_OOB test in caif_seqpkt_recvmsg() inet_diag: fix possible overflow in inet_diag_dump_one_icsk() rds: avoid potential stack overflow net: sysctl_net_core: check SNDBUF and RCVBUF for min length sparc64: Fix several bugs in memmove(). sparc: Touch NMI watchdog when walking cpus and calling printk sparc: perf: Make counting mode actually work sparc: perf: Remove redundant perf_pmu_{en\|dis}able calls sparc: semtimedop() unreachable due to comparison error sparc32: destroy_context() and switch_mm() needs to disable interrupts. Linux 3.10.72 ath5k: fix spontaneus AR5312 freezes ACPI / video: Load the module even if ACPI is disabled drm/radeon: fix 1 RB harvest config setup for TN/RL Drivers: hv: vmbus: incorrect device name is printed when child device is unregistered HID: fixup the conflicting keyboard mappings quirk HID: input: fix confusion on conflicting mappings staging: comedi: cb_pcidas64: fix incorrect AI range code handling dm snapshot: fix a possible invalid memory access on unload dm: fix a race condition in dm_get_md dm io: reject unsupported DISCARD requests with EOPNOTSUPP dm mirror: do not degrade the mirror on discard error staging: comedi: comedi_compat32.c: fix COMEDI_CMD copy back clk: sunxi: Support factor clocks with N factor starting not from 0 fixed invalid assignment of 64bit mask to host dma_boundary for scatter gather segment boundary limit. nilfs2: fix potential memory overrun on inode IB/qib: Do not write EEPROM sg: fix read() error reporting ALSA: hda - Add pin configs for ASUS mobo with IDT 92HD73XX codec ALSA: pcm: Don't leave PREPARED state after draining tty: fix up atime/mtime mess, take four sunrpc: fix braino in ->poll() procfs: fix race between symlink removals and traversals debugfs: leave freeing a symlink body until inode eviction autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation USB: serial: fix potential use-after-free after failed probe TTY: fix tty_wait_until_sent on 64-bit machines USB: serial: fix infinite wait_until_sent timeout net: irda: fix wait_until_sent poll timeout xhci: fix reporting of 0-sized URBs in control endpoint xhci: Allocate correct amount of scratchpad buffers usb: ftdi_sio: Add jtag quirk support for Cyber Cortex AV boards USB: usbfs: don't leak kernel data in siginfo USB: serial: cp210x: Adding Seletek device id's KVM: MIPS: Fix trace event to save PC directly KVM: emulate: fix CMPXCHG8B on 32-bit hosts Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. Btrfs: fix data loss in the fast fsync path btrfs: fix lost return value due to variable shadowing iio: imu: adis16400: Fix sign extension x86/asm/entry/64: Remove a bogus 'ret_from_fork' optimization PM / QoS: remove duplicate call to pm_qos_update_target target: Check for LBA + sectors wrap-around in sbc_parse_cdb mm/memory.c: actually remap enough memory mm/compaction: fix wrong order check in compact_finished() mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() mm/hugetlb: add migration entry check in __unmap_hugepage_range team: don't traverse port list using rcu in team_set_mac_address udp: only allow UFO for packets from SOCK_DGRAM sockets usb: plusb: Add support for National Instruments host-to-host cable macvtap: make sure neighbour code can push ethernet header net: compat: Ignore MSG_CMSG_COMPAT in compat_sys_{send, recv}msg team: fix possible null pointer dereference in team_handle_frame net: reject creation of netdev names with colons ematch: Fix auto-loading of ematch modules. net: phy: Fix verification of EEE support in phy_init_eee ipv4: ip_check_defrag should not assume that skb_network_offset is zero ipv4: ip_check_defrag should correctly check return value of skb_copy_bits gen_stats.c: Duplicate xstats buffer for later use rtnetlink: call ->dellink on failure when ->newlink exists ipv6: fix ipv6_cow_metrics for non DST_HOST case rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY Linux 3.10.71 libceph: fix double __remove_osd() problem libceph: change from BUG to WARN for __remove_osd() asserts libceph: assert both regular and lingering lists in __remove_osd() MIPS: Export FP functions used by lose_fpu(1) for KVM x86, mm/ASLR: Fix stack randomization on 64-bit systems blk-throttle: check stats_cpu before reading it from sysfs jffs2: fix handling of corrupted summary length md/raid1: fix read balance when a drive is write-mostly. md/raid5: Fix livelock when array is both resyncing and degraded. metag: Fix KSTK_EIP() and KSTK_ESP() macros gpio: tps65912: fix wrong container_of arguments arm64: compat Fix siginfo_t -> compat_siginfo_t conversion on big endian hx4700: regulator: declare full constraints KVM: x86: update masterclock values on TSC writes KVM: MIPS: Don't leak FPU/DSP to guest ARC: fix page address calculation if PAGE_OFFSET != LINUX_LINK_BASE ntp: Fixup adjtimex freq validation on 32-bit systems kdb: fix incorrect counts in KDB summary command output ARM: pxa: add regulator_has_full_constraints to poodle board file ARM: pxa: add regulator_has_full_constraints to corgi board file vt: provide notifications on selection changes usb: core: buffer: smallest buffer should start at ARCH_DMA_MINALIGN USB: fix use-after-free bug in usb_hcd_unlink_urb() USB: cp210x: add ID for RUGGEDCOM USB Serial Console tty: Prevent untrappable signals from malicious program axonram: Fix bug in direct_access cfq-iosched: fix incorrect filing of rt async cfqq cfq-iosched: handle failure of cfq group allocation iscsi-target: Drop problematic active_ts_list usage NFSv4.1: Fix a kfree() of uninitialised pointers in decode_cb_sequence_args Added Little Endian support to vtpm module tpm/tpm_i2c_stm_st33: Fix potential bug in tpm_stm_i2c_send tpm: Fix NULL return in tpm_ibmvtpm_get_desired_dma tpm_tis: verify interrupt during init ARM: 8284/1: sa1100: clear RCSR_SMR on resume tracing: Fix unmapping loop in tracing_mark_write MIPS: KVM: Deliver guest interrupts after local_irq_disable() nfs: don't call blocking operations while !TASK_RUNNING mmc: sdhci-pxav3: fix setting of pdata->clk_delay_cycles power_supply: 88pm860x: Fix leaked power supply on probe fail ALSA: hdspm - Constrain periods to 2 on older cards ALSA: off by one bug in snd_riptide_joystick_probe() lmedm04: Fix usb_submit_urb BOGUS urb xfer, pipe 1 != type 3 in interrupt urb cpufreq: speedstep-smi: enable interrupts when waiting PCI: Fix infinite loop with ROM image of size 0 PCI: Generate uppercase hex for modalias var in uevent HID: i2c-hid: Limit reads to wMaxInputLength bytes for input events iwlwifi: mvm: always use mac color zero iwlwifi: mvm: fix failure path when power_update fails in add_interface iwlwifi: mvm: validate tid and sta_id in ba_notif iwlwifi: pcie: disable the SCD_BASE_ADDR when we resume from WoWLAN fsnotify: fix handling of renames in audit xfs: set superblock buffer type correctly xfs: inode unlink does not set AGI buffer type xfs: ensure buffer types are set correctly Bluetooth: ath3k: workaround the compatibility issue with xHCI controller Linux 3.10.70 rbd: drop an unsafe assertion media/rc: Send sync space information on the lirc device net: sctp: fix passing wrong parameter header to param_type2af in sctp_process_param ppp: deflate: never return len larger than output buffer ipv4: tcp: get rid of ugly unicast_sock tcp: ipv4: initialize unicast_sock sk_pacing_rate bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too ping: Fix race in free in receive path udp_diag: Fix socket skipping within chain ipv4: try to cache dst_entries which would cause a redirect net: sctp: fix slab corruption from use after free on INIT collisions netxen: fix netxen_nic_poll() logic ipv6: stop sending PTB packets for MTU < 1280 net: rps: fix cpu unplug ip: zero sockaddr returned on error queue Linux 3.10.69 crypto: crc32c - add missing crypto module alias x86,kvm,vmx: Preserve CR4 across VM entry kvm: vmx: handle invvpid vm exit gracefully smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread() ALSA: ak411x: Fix stall in work callback ASoC: sgtl5000: add delay before first I2C access ASoC: atmel_ssc_dai: fix start event for I2S mode lib/checksum.c: fix build for generic csum_tcpudp_nofold ext4: prevent bugon on race between write/fcntl arm64: Fix up /proc/cpuinfo nilfs2: fix deadlock of segment constructor over I_SYNC flag lib/checksum.c: fix carry in csum_tcpudp_nofold mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range MIPS: Fix kernel lockup or crash after CPU offline/online MIPS: IRQ: Fix disable_irq on CPU IRQs PCI: Add NEC variants to Stratus ftServer PCIe DMI check gpio: sysfs: fix memory leak in gpiod_sysfs_set_active_low gpio: sysfs: fix memory leak in gpiod_export_link Linux 3.10.68 target: Drop arbitrary maximum I/O size limit iser-target: Fix implicit termination of connections iser-target: Handle ADDR_CHANGE event for listener cm_id iser-target: Fix connected_handler + teardown flow race iser-target: Parallelize CM connection establishment iser-target: Fix flush + disconnect completion handling iscsi,iser-target: Initiate termination only once vhost-scsi: Add missing virtio-scsi -> TCM attribute conversion tcm_loop: Fix wrong I_T nexus association vhost-scsi: Take configfs group dependency during VHOST_SCSI_SET_ENDPOINT ib_isert: Add max_send_sge=2 minimum for control PDU responses IB/isert: Adjust CQ size to HW limits workqueue: fix subtle pool management issue which can stall whole worker_pool gpio: squelch a compiler warning efi-pstore: Make efi-pstore return a unique id pstore/ram: avoid atomic accesses for ioremapped regions pstore: Fix NULL pointer fault if get NULL prz in ramoops_get_next_prz pstore: skip zero size persistent ram buffer in traverse pstore: clarify clearing of _read_cnt in ramoops_context pstore: d_alloc_name() doesn't return an ERR_PTR pstore: Fail to unlink if a driver has not defined pstore_erase ARM: 8109/1: mm: Modify pte_write and pmd_write logic for LPAE ARM: 8108/1: mm: Introduce {pte,pmd}_isset and {pte,pmd}_isclear ARM: DMA: ensure that old section mappings are flushed from the TLB ARM: 7931/1: Correct virt_addr_valid ARM: fix asm/memory.h build error ARM: 7867/1: include: asm: use 'int' instead of 'unsigned long' for 'oldval' in atomic_cmpxchg(). ARM: 7866/1: include: asm: use 'long long' instead of 'u64' within atomic.h ARM: lpae: fix definition of PTE_HWTABLE_PTRS ARM: fix type of PHYS_PFN_OFFSET to unsigned long ARM: LPAE: use phys_addr_t in alloc_init_pud() ARM: LPAE: use signed arithmetic for mask definitions ARM: mm: correct pte_same behaviour for LPAE. ARM: 7829/1: Add ".text.unlikely" and ".text.hot" to arm unwind tables drivers: net: cpsw: discard dual emac default vlan configuration regulator: core: fix race condition in regulator_put() spi/pxa2xx: Clear cur_chip pointer before starting next message dm cache: fix missing ERR_PTR returns and handling dm thin: don't allow messages to be sent to a pool target in READ_ONLY or FAIL mode nl80211: fix per-station group key get/del and memory leak NFSv4.1: Fix an Oops in nfs41_walk_client_list nfs: fix dio deadlock when O_DIRECT flag is flipped Input: i8042 - add noloop quirk for Medion Akoya E7225 (MD98857) ALSA: seq-dummy: remove deadlock-causing events on close powerpc/xmon: Fix another endiannes issue in RTAS call from xmon can: kvaser_usb: Fix state handling upon BUS_ERROR events can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT can: kvaser_usb: Send correct context to URB completion can: kvaser_usb: Do not sleep in atomic context ASoC: wm8960: Fix capture sample rate from 11250 to 11025 spi: dw-mid: fix FIFO size Signed-off-by: Ian Maund <imaund@codeaurora.org>	2015-05-01 13:49:45 -07:00
Ian Maund	807a44d01a	This is the 3.10.67 stable release -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJUyuGRAAoJEDjbvchgkmk+7EwQALYPOeh+AManQFB1MQvFuOgZ /4ulpjhGXw/RPTKHMeyHo8vRfUhMOx8UPF62uql+g1l9b/Zt2bs6qXu4QcxRRsQc trSTUpi+U14y1hkgqOVOcFYP2ZaTjNEBQgLJ4eGn46CliLqme+rfoyRYm2GXzcR4 6cbSAr3mufdFIpi9/8Dn62Gv0aws5lIv3qkHJXznyuux3tisPT5y6Ux2KJoivPn/ SqADtRpwo+7lTjl15fE++9AqNsGMorV6toT2OO/7nXP+824psInKLmREAT2qC99b BG61vcYdxOuHtzmwrvCf1jSRjxhvZT0j2xhBr/vCKcxy08AT0vDv68zrV1r6TIuu U7/CKXtFBY95cjfnkTLJuswBSuIA/+sQHV6DaddH0V8fcZ6rQMLrblQ9ZcFFFkmT 2SG6lmlXqZvcEKYGMnL/Dcow1rkRhB5stiGgTkYxjiRSRpzAHISRJ/GGpsT+rRqK HpBs5p9JshvRl7RWKwAu+DNGaEK1X/WYxc4/jw6dZFWX7lEWSMIPlr9zXgZCZ39y V6lV1VVlT9/CSs1swKHUyhHHehlFsnIlQ6Fkiycr/KkuqBLs92Hyb7WhpVa819yX osXdxSm6J54skiOLKYpBWHpnY09Tc+p28VEfMpErTExgp2oE8F34K7kdhoQPQb97 2mHiXNa+J4CLUNQ+sRmw =HDBo -----END PGP SIGNATURE----- Merge commit 'v3.10.67' into LA.BF64.1.2.9 This merge brings us up to date with upstream kernel.org tag v3.10.67. It also contains changes to allow forbidden warnings introduced in the commit 'core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors'. Once upstream has corrected these warnings, the changes to scripts/gcc-wrapper.py, in this commit, can be reverted. * 'v3.10.67' (915 commits): Linux 3.10.67 md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants. ext4: fix warning in ext4_da_update_reserve_space() quota: provide interface for readding allocated space into reserved space crypto: add missing crypto module aliases crypto: include crypto- module prefix in template crypto: prefix module autoloading with "crypto-" drbd: merge_bvec_fn: properly remap bvm->bi_bdev Revert "swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single" ipvs: uninitialized data with IP_VS_IPV6 KEYS: close race between key lookup and freeing sata_dwc_460ex: fix resource leak on error path x86/asm/traps: Disable tracing and kprobes in fixup_bad_iret and sync_regs x86, tls: Interpret an all-zero struct user_desc as "no segment" x86, tls, ldt: Stop checking lm in LDT_empty x86/tsc: Change Fast TSC calibration failed from error to info x86, hyperv: Mark the Hyper-V clocksource as being continuous clocksource: exynos_mct: Fix bitmask regression for exynos4_mct_write can: dev: fix crtlmode_supported check bus: mvebu-mbus: fix support of MBus window 13 ARM: dts: imx25: Fix PWM "per" clocks time: adjtimex: Validate the ADJ_FREQUENCY values time: settimeofday: Validate the values of tv from user dm cache: share cache-metadata object across inactive and active DM tables ipr: wait for aborted command responses drm/i915: Fix mutex->owner inspection race under DEBUG_MUTEXES scripts/recordmcount.pl: There is no -m32 gcc option on Super-H anymore ALSA: usb-audio: Add mic volume fix quirk for Logitech Webcam C210 libata: prevent HSM state change race between ISR and PIO pinctrl: Fix two deadlocks gpio: sysfs: fix gpio device-attribute leak gpio: sysfs: fix gpio-chip device-attribute leak Linux 3.10.66 s390/3215: fix tty output containing tabs s390/3215: fix hanging console issue fsnotify: next_i is freed during fsnotify_unmount_inodes. netfilter: ipset: small potential read beyond the end of buffer mmc: sdhci: Fix sleep in atomic after inserting SD card LOCKD: Fix a race when initialising nlmsvc_timeout x86, um: actually mark system call tables readonly um: Skip futex_atomic_cmpxchg_inatomic() test decompress_bunzip2: off by one in get_next_block() ARM: shmobile: sh73a0 legacy: Set .control_parent for all irqpin instances ARM: omap5/dra7xx: Fix frequency typos ARM: clk-imx6q: fix video divider for rev T0 1.0 ARM: imx6q: drop unnecessary semicolon ARM: dts: imx25: Fix the SPI1 clocks Input: I8042 - add Acer Aspire 7738 to the nomux list Input: i8042 - reset keyboard to fix Elantech touchpad detection can: kvaser_usb: Don't send a RESET_CHIP for non-existing channels can: kvaser_usb: Reset all URB tx contexts upon channel close can: kvaser_usb: Don't free packets when tight on URBs USB: keyspan: fix null-deref at probe USB: cp210x: add IDs for CEL USB sticks and MeshWorks devices USB: cp210x: fix ID for production CEL MeshConnect USB Stick usb: dwc3: gadget: Stop TRB preparation after limit is reached usb: dwc3: gadget: Fix TRB preparation during SG OHCI: add a quirk for ULi M5237 blocking on reset gpiolib: of: Correct error handling in of_get_named_gpiod_flags NFSv4.1: Fix client id trunking on Linux ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing vfio-pci: Fix the check on pci device type in vfio_pci_probe() uvcvideo: Fix destruction order in uvc_delete() smiapp: Take mutex during PLL update in sensor initialisation af9005: fix kernel panic on init if compiled without IR smiapp-pll: Correct clock debug prints video/logo: prevent use of logos after they have been freed storvsc: ring buffer failures may result in I/O freeze iscsi-target: Fail connection on short sendmsg writes hp_accel: Add support for HP ZBook 15 cfg80211: Fix 160 MHz channels with 80+80 and 160 MHz drivers ARC: [nsimosci] move peripherals to match model to FPGA drm/i915: Force the CS stall for invalidate flushes drm/i915: Invalidate media caches on gen7 drm/radeon: properly filter DP1.2 4k modes on non-DP1.2 hw drm/radeon: check the right ring in radeon_evict_flags() drm/vmwgfx: Fix fence event code enic: fix rx skb checksum alx: fix alx_poll() tcp: Do not apply TSO segment limit to non-TSO packets tg3: tg3_disable_ints using uninitialized mailbox value to disable interrupts netlink: Don't reorder loads/stores before marking mmap netlink frame as available netlink: Always copy on mmap TX. Linux 3.10.65 mm: Don't count the stack guard page towards RLIMIT_STACK mm: propagate error from stack expansion even for guard page mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed perf session: Do not fail on processing out of order event perf: Fix events installation during moving group perf/x86/intel/uncore: Make sure only uncore events are collected Btrfs: don't delay inode ref updates during log replay ARM: mvebu: disable I/O coherency on non-SMP situations on Armada 370/375/38x/XP scripts/kernel-doc: don't eat struct members with __aligned nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races nfsd4: fix xdr4 inclusion of escaped char fs: nfsd: Fix signedness bug in compare_blob serial: samsung: wait for transfer completion before clock disable writeback: fix a subtle race condition in I_DIRTY clearing cdc-acm: memory leak in error case genhd: check for int overflow in disk_expand_part_tbl() USB: cdc-acm: check for valid interfaces ALSA: hda - Fix wrong gpio_dir & gpio_mask hint setups for IDT/STAC codecs ALSA: hda - using uninitialized data ALSA: usb-audio: extend KEF X300A FU 10 tweak to Arcam rPAC driver core: Fix unbalanced device reference in drivers_probe x86, vdso: Use asm volatile in __getcpu x86_64, vdso: Fix the vdso address randomization algorithm HID: Add a new id 0x501a for Genius MousePen i608X HID: add battery quirk for USB_DEVICE_ID_APPLE_ALU_WIRELESS_2011_ISO keyboard HID: roccat: potential out of bounds in pyra_sysfs_write_settings() HID: i2c-hid: prevent buffer overflow in early IRQ HID: i2c-hid: fix race condition reading reports iommu/vt-d: Fix an off-by-one bug in __domain_mapping() UBI: Fix double free after do_sync_erase() UBI: Fix invalid vfree() pstore-ram: Allow optional mapping with pgprot_noncached pstore-ram: Fix hangs by using write-combine mappings PCI: Restore detection of read-only BARs ASoC: dwc: Ensure FIFOs are flushed to prevent channel swap ASoC: max98090: Fix ill-defined sidetone route ASoC: sigmadsp: Refuse to load firmware files with a non-supported version ath5k: fix hardware queue index assignment swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single can: peak_usb: fix memset() usage can: peak_usb: fix cleanup sequence order in case of error during init ath9k: fix BE/BK queue order ath9k_hw: fix hardware queue allocation ocfs2: fix journal commit deadlock Linux 3.10.64 Btrfs: fix fs corruption on transaction abort if device supports discard Btrfs: do not move em to modified list when unpinning eCryptfs: Remove buggy and unnecessary write in file name decode routine eCryptfs: Force RO mount when encrypted view is enabled udf: Verify symlink size before loading it exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting ncpfs: return proper error from NCP_IOC_SETROOT ioctl crypto: af_alg - fix backlog handling userns: Unbreak the unprivileged remount tests userns: Allow setting gid_maps without privilege when setgroups is disabled userns: Add a knob to disable setgroups on a per user namespace basis userns: Rename id_map_mutex to userns_state_mutex userns: Only allow the creator of the userns unprivileged mappings userns: Check euid no fsuid when establishing an unprivileged uid mapping userns: Don't allow unprivileged creation of gid mappings userns: Don't allow setgroups until a gid mapping has been setablished userns: Document what the invariant required for safe unprivileged mappings. groups: Consolidate the setgroups permission checks umount: Disallow unprivileged mount force mnt: Update unprivileged remount test mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount mac80211: free management frame keys when removing station mac80211: fix multicast LED blinking and counter KEYS: Fix stale key registration at error path isofs: Fix unchecked printing of ER records x86/tls: Don't validate lm in set_thread_area() after all dm space map metadata: fix sm_bootstrap_get_nr_blocks() dm bufio: fix memleak when using a dm_buffer's inline bio nfs41: fix nfs4_proc_layoutget error handling megaraid_sas: corrected return of wait_event from abort frame path mmc: block: add newline to sysfs display of force_ro mfd: tc6393xb: Fail ohci suspend if full state restore is required md/bitmap: always wait for writes on unplug. x86, kvm: Clear paravirt_enabled on KVM guests for espfix32's benefit x86_64, switch_to(): Load TLS descriptors before switching DS and ES x86/tls: Disallow unusual TLS segments x86/tls: Validate TLS entries to protect espfix isofs: Fix infinite looping over CE entries Linux 3.10.63 ALSA: usb-audio: Don't resubmit pending URBs at MIDI error recovery powerpc: 32 bit getcpu VDSO function uses 64 bit instructions ARM: sched_clock: Load cycle count after epoch stabilizes igb: bring link up when PHY is powered up ext2: Fix oops in ext2_get_block() called from ext2_quota_write() nEPT: Nested INVEPT net: sctp: use MAX_HEADER for headroom reserve in output path net: mvneta: fix Tx interrupt delay rtnetlink: release net refcnt on error in do_setlink() net/mlx4_core: Limit count field to 24 bits in qp_alloc_res tg3: fix ring init when there are more TX than RX channels ipv6: gre: fix wrong skb->protocol in WCCP sata_fsl: fix error handling of irq_of_parse_and_map ahci: disable MSI on SAMSUNG 0xa800 SSD AHCI: Add DeviceIDs for Sunrise Point-LP SATA controller media: smiapp: Only some selection targets are settable drm/i915: Unlock panel even when LVDS is disabled drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6 i2c: davinci: generate STP always when NACK is received i2c: omap: fix i207 errata handling i2c: omap: fix NACK and Arbitration Lost irq handling xen-netfront: Remove BUGs on paged skb data which crosses a page boundary mm: fix swapoff hang after page migration and fork mm: frontswap: invalidate expired data on a dup-store failure Linux 3.10.62 nfsd: Fix ACL null pointer deref powerpc/powernv: Honor the generic "no_64bit_msi" flag bnx2fc: do not add shared skbs to the fcoe_rx_list nfsd4: fix leak of inode reference on delegation failure nfsd: Fix slot wake up race in the nfsv4.1 callback code rt2x00: do not align payload on modern H/W can: dev: avoid calling kfree_skb() from interrupt context spi: dw: Fix dynamic speed change. iser-target: Handle DEVICE_REMOVAL event on network portal listener correctly target: Don't call TFO->write_pending if data_length == 0 srp-target: Retry when QP creation fails with ENOMEM Input: xpad - use proper endpoint type ARM: 8222/1: mvebu: enable strex backoff delay ARM: 8216/1: xscale: correct auxiliary register in suspend/resume ALSA: usb-audio: Add ctrl message delay quirk for Marantz/Denon devices can: esd_usb2: fix memory leak on disconnect USB: xhci: don't start a halted endpoint before its new dequeue is set usb-quirks: Add reset-resume quirk for MS Wireless Laser Mouse 6000 usb: serial: ftdi_sio: add PIDs for Matrix Orbital products USB: serial: cp210x: add IDs for CEL MeshConnect USB Stick USB: keyspan: fix tty line-status reporting USB: keyspan: fix overrun-error reporting USB: ssu100: fix overrun-error reporting iio: Fix IIO_EVENT_CODE_EXTRACT_DIR bit mask powerpc/pseries: Fix endiannes issue in RTAS call from xmon powerpc/pseries: Honor the generic "no_64bit_msi" flag of/base: Fix PowerPC address parsing hack ASoC: wm_adsp: Avoid attempt to free buffers that might still be in use ASoC: sgtl5000: Fix SMALL_POP bit definition PCI/MSI: Add device flag indicating that 64-bit MSIs don't work ipx: fix locking regression in ipx_sendmsg and ipx_recvmsg pptp: fix stack info leak in pptp_getname() qmi_wwan: Add support for HP lt4112 LTE/HSPA+ Gobi 4G Modem ieee802154: fix error handling in ieee802154fake_probe() ipv4: Fix incorrect error code when adding an unreachable route inetdevice: fixed signed integer overflow sparc64: Fix constraints on swab helpers. uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME x86, mm: Set NX across entire PMD at boot x86: Require exact match for 'noxsave' command line option x86_64, traps: Rework bad_iret x86_64, traps: Stop using IST for #SS x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C MIPS: Loongson: Make platform serial setup always built-in. MIPS: oprofile: Fix backtrace on 64-bit kernel Linux 3.10.61 mm: memcg: handle non-error OOM situations more gracefully mm: memcg: do not trap chargers with full callstack on OOM mm: memcg: rework and document OOM waiting and wakeup mm: memcg: enable memcg OOM killer only for user faults x86: finish user fault error path with fatal signal arch: mm: pass userspace fault flag to generic fault handler arch: mm: do not invoke OOM killer on kernel fault OOM arch: mm: remove obsolete init OOM protection mm: invoke oom-killer from remaining unconverted page fault handlers net: sctp: fix skb_over_panic when receiving malformed ASCONF chunks net: sctp: fix panic on duplicate ASCONF chunks net: sctp: fix remote memory pressure from excessive queueing KVM: x86: Don't report guest userspace emulation error to userspace SCSI: hpsa: fix a race in cmd_free/scsi_done net/mlx4_en: Fix BlueFlame race ARM: Correct BUG() assembly to ensure it is endian-agnostic perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge mei: bus: fix possible boundaries violation perf: Handle compat ioctl MIPS: Fix forgotten preempt_enable() when CPU has inclusive pcaches dell-wmi: Fix access out of memory ARM: probes: fix instruction fetch order with <asm/opcodes.h> br: fix use of ->rx_handler_data in code executed on non-rx_handler path netfilter: nf_nat: fix oops on netns removal netfilter: xt_bpf: add mising opaque struct sk_filter definition netfilter: nf_log: release skbuff on nlmsg put failure netfilter: nfnetlink_log: fix maximum packet length logged to userspace netfilter: nf_log: account for size of NLMSG_DONE attribute ipc: always handle a new value of auto_msgmni clocksource: Remove "weak" from clocksource_default_clock() declaration kgdb: Remove "weak" from kgdb_arch_pc() declaration media: ttusb-dec: buffer overflow in ioctl NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return nfs: Fix use of uninitialized variable in nfs_getattr() NFS: Don't try to reclaim delegation open state if recovery failed NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired Input: alps - allow up to 2 invalid packets without resetting device Input: alps - ignore potential bare packets when device is out of sync dm raid: ensure superblock's size matches device's logical block size dm btree: fix a recursion depth bug in btree walking code block: Fix computation of merged request priority parisc: Use compat layer for msgctl, shmat, shmctl and semtimedop syscalls scsi: only re-lock door after EH on devices that were reset nfs: fix pnfs direct write memory leak firewire: cdev: prevent kernel stack leaking into ioctl arguments arm64: __clear_user: handle exceptions on strb ARM: 8198/1: make kuser helpers depend on MMU drm/radeon: add missing crtc unlock when setting up the MC mac80211: fix use-after-free in defragmentation macvtap: Fix csum_start when VLAN tags are present iwlwifi: configure the LTR libceph: do not crash on large auth tickets xtensa: re-wire umount syscall to sys_oldumount ALSA: usb-audio: Fix memory leak in FTU quirk ahci: disable MSI instead of NCQ on Samsung pci-e SSDs on macbooks ahci: Add Device IDs for Intel Sunrise Point PCH audit: keep inode pinned x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit sparc32: Implement xchg and atomic_xchg using ATOMIC_HASH locks sparc64: Do irq_{enter,exit}() around generic_smp_call_function*(). sparc64: Fix crashes in schizo_pcierr_intr_other(). sunvdc: don't call VD_OP_GET_VTOC vio: fix reuse of vio_dring slot sunvdc: limit each sg segment to a page sunvdc: compute vdisk geometry from capacity sunvdc: add cdrom and v1.1 protocol support net: sctp: fix memory leak in auth key management net: sctp: fix NULL pointer dereference in af->from_addr_param on malformed packet gre6: Move the setting of dev->iflink into the ndo_init functions. ip6_tunnel: Use ip6_tnl_dev_init as the ndo_init function. Linux 3.10.60 libceph: ceph-msgr workqueue needs a resque worker Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup of: Fix overflow bug in string property parsing functions sysfs: driver core: Fix glue dir race condition by gdp_mutex i2c: at91: don't account as iowait acer-wmi: Add acpi_backlight=video quirk for the Acer KAV80 rbd: Fix error recovery in rbd_obj_read_sync() drm/radeon: remove invalid pci id usb: gadget: udc: core: fix kernel oops with soft-connect usb: gadget: function: acm: make f_acm pass USB20CV Chapter9 usb: dwc3: gadget: fix set_halt() bug with pending transfers crypto: algif - avoid excessive use of socket buffer in skcipher mm: Remove false WARN_ON from pagecache_isize_extended() x86, apic: Handle a bad TSC more gracefully posix-timers: Fix stack info leak in timer_create() mac80211: fix typo in starting baserate for rts_cts_rate_idx PM / Sleep: fix recovery during resuming from hibernation tty: Fix high cpu load if tty is unreleaseable quota: Properly return errors from dquot_writeback_dquots() ext3: Don't check quota format when there are no quota files nfsd4: fix crash on unknown operation number cpc925_edac: Report UE events properly e7xxx_edac: Report CE events properly i3200_edac: Report CE events properly i82860_edac: Report CE events properly scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND lib/bitmap.c: fix undefined shift in __bitmap_shift_{left\|right}() cgroup/kmemleak: add kmemleak_free() for cgroup deallocations. usb: Do not allow usb_alloc_streams on unconfigured devices USB: opticon: fix non-atomic allocation in write path usb-storage: handle a skipped data phase spi: pxa2xx: toggle clocks on suspend if not disabled by runtime PM spi: pl022: Fix incorrect dma_unmap_sg usb: dwc3: gadget: Properly initialize LINK TRB wireless: rt2x00: add new rt2800usb device USB: option: add Haier CE81B CDMA modem usb: option: add support for Telit LE910 USB: cdc-acm: only raise DTR on transitions from B0 USB: cdc-acm: add device id for GW Instek AFG-2225 usb: serial: ftdi_sio: add "bricked" FTDI device PID usb: serial: ftdi_sio: add Awinda Station and Dongle products USB: serial: cp210x: add Silicon Labs 358x VID and PID serial: Fix divide-by-zero fault in uart_get_divisor() staging:iio:ade7758: Remove "raw" from channel name staging:iio:ade7758: Fix check if channels are enabled in prenable staging:iio:ade7758: Fix NULL pointer deref when enabling buffer staging:iio:ad5933: Drop "raw" from channel names staging:iio:ad5933: Fix NULL pointer deref when enabling buffer OOM, PM: OOM killed task shouldn't escape PM suspend freezer: Do not freeze tasks killed by OOM killer ext4: fix oops when loading block bitmap failed cpufreq: intel_pstate: Fix setting max_perf_pct in performance policy ext4: fix overflow when updating superblock backups after resize ext4: check s_chksum_driver when looking for bg csum presence ext4: fix reservation overflow in ext4_da_write_begin ext4: add ext4_iget_normal() which is to be used for dir tree lookups ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT ext4: don't check quota format when there are no quota files ext4: check EA value offset when loading jbd2: free bh when descriptor block checksum fails MIPS: tlbex: Properly fix HUGE TLB Refill exception handler target: Fix APTPL metadata handling for dynamic MappedLUNs target: Fix queue full status NULL pointer for SCF_TRANSPORT_TASK_SENSE qla_target: don't delete changed nacls ARC: Update order of registers in KGDB to match GDB 7.5 ARC: [nsimosci] Allow "headless" models to boot KVM: x86: Emulator fixes for eip canonical checks on near branches KVM: x86: Fix wrong masking on relative jump/call kvm: x86: don't kill guest on unknown exit reason KVM: x86: Check non-canonical addresses upon WRMSR KVM: x86: Improve thread safety in pit KVM: x86: Prevent host from panicking on shared MSR writes. kvm: fix excessive pages un-pinning in kvm_iommu_map error path. media: tda7432: Fix setting TDA7432_MUTE bit for TDA7432_RF register media: ds3000: fix LNB supply voltage on Tevii S480 on initialization media: em28xx-v4l: give back all active video buffers to the vb2 core properly on streaming stop media: v4l2-common: fix overflow in v4l_bound_align_image() drm/nouveau/bios: memset dcb struct to zero before parsing drm/tilcdc: Fix the error path in tilcdc_load() drm/ast: Fix HW cursor image Input: i8042 - quirks for Fujitsu Lifebook A544 and Lifebook AH544 Input: i8042 - add noloop quirk for Asus X750LN framebuffer: fix border color modules, lock around setting of MODULE_STATE_UNFORMED dm log userspace: fix memory leak in dm_ulog_tfr_init failure path block: fix alignment_offset math that assumes io_min is a power-of-2 drbd: compute the end before rb_insert_augmented() dm bufio: update last_accessed when relinking a buffer virtio_pci: fix virtio spec compliance on restore selinux: fix inode security list corruption pstore: Fix duplicate {console,ftrace}-efi entries mfd: rtsx_pcr: Fix MSI enable error handling mnt: Prevent pivot_root from creating a loop in the mount tree UBI: add missing kmem_cache_free() in process_pool_aeb error path random: add and use memzero_explicit() for clearing data crypto: more robust crypto_memneq fix misuses of f_count() in ppp and netlink kill wbuf_queued/wbuf_dwork_lock ALSA: pcm: Zero-clear reserved fields of PCM status ioctl in compat mode evm: check xattr value length and type in evm_inode_setxattr() x86, pageattr: Prevent overflow in slow_virt_to_phys() for X86_PAE x86_64, entry: Fix out of bounds read on sysenter x86_64, entry: Filter RFLAGS.NT on entry from userspace x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED x86, fpu: shift drop_init_fpu() from save_xstate_sig() to handle_signal() x86, fpu: __restore_xstate_sig()->math_state_restore() needs preempt_disable() x86: Reject x32 executables if x32 ABI not supported vfs: fix data corruption when blocksize < pagesize for mmaped data UBIFS: fix free log space calculation UBIFS: fix a race condition UBIFS: remove mst_mutex fs: Fix theoretical division by 0 in super_cache_scan(). fs: make cont_expand_zero interruptible mmc: rtsx_pci_sdmmc: fix incorrect last byte in R2 response libata-sff: Fix controllers with no ctl port pata_serverworks: disable 64-KB DMA transfers on Broadcom OSB4 IDE Controller Revert "percpu: free percpu allocation info for uniprocessor system" lockd: Try to reconnect if statd has moved drivers/net: macvtap and tun depend on INET ipv4: dst_entry leak in ip_send_unicast_reply() ax88179_178a: fix bonding failure ipv4: fix nexthop attlen check in fib_nh_match tracing/syscalls: Ignore numbers outside NR_syscalls' range Linux 3.10.59 ecryptfs: avoid to access NULL pointer when write metadata in xattr ARM: at91/PMC: don't forget to write PMC_PCDR register to disable clocks ALSA: usb-audio: Add support for Steinberg UR22 USB interface ALSA: emu10k1: Fix deadlock in synth voice lookup ALSA: pcm: use the same dma mmap codepath both for arm and arm64 arm64: compat: fix compat types affecting struct compat_elf_prpsinfo spi: dw-mid: terminate ongoing transfers at exit kernel: add support for gcc 5 fanotify: enable close-on-exec on events' fd when requested in fanotify_init() mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set Bluetooth: Fix issue with USB suspend in btusb driver Bluetooth: Fix HCI H5 corrupted ack value rt2800: correct BBP1_TX_POWER_CTRL mask PCI: Generate uppercase hex for modalias interface class PCI: Increase IBM ipr SAS Crocodile BARs to at least system page size iwlwifi: Add missing PCI IDs for the 7260 series NFSv4.1: Fix an NFSv4.1 state renewal regression NFSv4: fix open/lock state recovery error handling NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails lzo: check for length overrun in variable length encoding. Revert "lzo: properly check for overruns" Documentation: lzo: document part of the encoding m68k: Disable/restore interrupts in hwreg_present()/hwreg_write() Drivers: hv: vmbus: Fix a bug in vmbus_open() Drivers: hv: vmbus: Cleanup vmbus_establish_gpadl() Drivers: hv: vmbus: Cleanup vmbus_teardown_gpadl() Drivers: hv: vmbus: Cleanup vmbus_post_msg() firmware_class: make sure fw requests contain a name qla2xxx: Use correct offset to req-q-out for reserve calculation mptfusion: enable no_write_same for vmware scsi disks be2iscsi: check ip buffer before copying regmap: fix NULL pointer dereference in _regmap_write/read regmap: debugfs: fix possbile NULL pointer dereference spi: dw-mid: check that DMA was inited before exit spi: dw-mid: respect 8 bit mode x86/intel/quark: Switch off CR4.PGE so TLB flush uses CR3 instead kvm: don't take vcpu mutex for obviously invalid vcpu ioctls KVM: s390: unintended fallthrough for external call kvm: x86: fix stale mmio cache bug fs: Add a missing permission check to do_umount Btrfs: fix race in WAIT_SYNC ioctl Btrfs: fix build_backref_tree issue with multiple shared blocks Btrfs: try not to ENOSPC on log replay Linux 3.10.58 USB: cp210x: add support for Seluxit USB dongle USB: serial: cp210x: added Ketra N1 wireless interface support USB: Add device quirk for ASUS T100 Base Station keyboard ipv6: reallocate addrconf router for ipv6 address when lo device up tcp: fixing TLP's FIN recovery sctp: handle association restarts when the socket is closed. ip6_gre: fix flowi6_proto value in xmit path hyperv: Fix a bug in netvsc_start_xmit() tg3: Allow for recieve of full-size 8021AD frames tg3: Work around HW/FW limitations with vlan encapsulated frames l2tp: fix race while getting PMTU on PPP pseudo-wire openvswitch: fix panic with multiple vlan headers packet: handle too big packets for PACKET_V3 tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced() sit: Fix ipip6_tunnel_lookup device matching criteria myri10ge: check for DMA mapping errors Linux 3.10.57 cpufreq: ondemand: Change the calculation of target frequency cpufreq: Fix wrong time unit conversion nl80211: clear skb cb before passing to netlink drbd: fix regression 'out of mem, failed to invoke fence-peer helper' jiffies: Fix timeval conversion to jiffies md/raid5: disable 'DISCARD' by default due to safety concerns. media: vb2: fix VBI/poll regression mm: numa: Do not mark PTEs pte_numa when splitting huge pages mm, thp: move invariant bug check out of loop in __split_huge_page_map ring-buffer: Fix infinite spin in reading buffer init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu perf: fix perf bug in fork() udf: Avoid infinite loop when processing indirect ICBs Linux 3.10.56 vm_is_stack: use for_each_thread() rather then buggy while_each_thread() oom_kill: add rcu_read_lock() into find_lock_task_mm() oom_kill: has_intersects_mems_allowed() needs rcu_read_lock() oom_kill: change oom_kill.c to use for_each_thread() introduce for_each_thread() to replace the buggy while_each_thread() kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code arm: multi_v7_defconfig: Enable Zynq UART driver ext2: Fix fs corruption in ext2_get_xip_mem() serial: 8250_dma: check the result of TX buffer mapping ARM: 7748/1: oabi: handle faults when loading swi instruction from userspace netfilter: nf_conntrack: avoid large timeout for mid-stream pickup PM / sleep: Use valid_state() for platform-dependent sleep states only PM / sleep: Add state field to pm_states[] entries ipvs: fix ipv6 hook registration for local replies ipvs: Maintain all DSCP and ECN bits for ipv6 tun forwarding ipvs: avoid netns exit crash on ip_vs_conn_drop_conntrack md/raid1: fix_read_error should act on all non-faulty devices. media: cx18: fix kernel oops with tda8290 tuner Fix nasty 32-bit overflow bug in buffer i/o code. perf kmem: Make it work again on non NUMA machines perf: Fix a race condition in perf_remove_from_context() alarmtimer: Lock k_itimer during timer callback alarmtimer: Do not signal SIGEV_NONE timers parisc: Only use -mfast-indirect-calls option for 32-bit kernel builds powerpc/perf: Fix ABIv2 kernel backtraces sched: Fix unreleased llc_shared_mask bit during CPU hotplug ocfs2/dlm: do not get resource spinlock if lockres is new nilfs2: fix data loss with mmap() fs/notify: don't show f_handle if exportfs_encode_inode_fh failed fsnotify/fdinfo: use named constants instead of hardcoded values kcmp: fix standard comparison bug Revert "mac80211: disable uAPSD if all ACs are under ACM" usb: dwc3: core: fix ordering for PHY suspend usb: dwc3: core: fix order of PM runtime calls usb: host: xhci: fix compliance mode workaround genhd: fix leftover might_sleep() in blk_free_devt() lockd: fix rpcbind crash on lockd startup failure rtlwifi: rtl8192cu: Add new ID percpu: perform tlb flush after pcpu_map_pages() failure percpu: fix pcpu_alloc_pages() failure path percpu: free percpu allocation info for uniprocessor system ata_piix: Add Device IDs for Intel 9 Series PCH Input: i8042 - add nomux quirk for Avatar AVIU-145A6 Input: i8042 - add Fujitsu U574 to no_timeout dmi table Input: atkbd - do not try 'deactivate' keyboard on any LG laptops Input: elantech - fix detection of touchpad on ASUS s301l Input: synaptics - add support for ForcePads Input: serport - add compat handling for SPIOCSTYPE ioctl dm crypt: fix access beyond the end of allocated space block: Fix dev_t minor allocation lifetime workqueue: apply __WQ_ORDERED to create_singlethread_workqueue() Revert "iwlwifi: dvm: don't enable CTS to self" SCSI: libiscsi: fix potential buffer overrun in __iscsi_conn_send_pdu NFC: microread: Potential overflows in microread_target_discovered() iscsi-target: Fix memory corruption in iscsit_logout_post_handler_diffcid iscsi-target: avoid NULL pointer in iscsi_copy_param_list failure Target/iser: Don't put isert_conn inside disconnected handler Target/iser: Get isert_conn reference once got to connected_handler iio:inkern: fix overwritten -EPROBE_DEFER in of_iio_channel_get_by_name iio:magnetometer: bugfix magnetometers gain values iio: adc: ad_sigma_delta: Fix indio_dev->trig assignment iio: st_sensors: Fix indio_dev->trig assignment iio: meter: ade7758: Fix indio_dev->trig assignment iio: inv_mpu6050: Fix indio_dev->trig assignment iio: gyro: itg3200: Fix indio_dev->trig assignment iio:trigger: modify return value for iio_trigger_get CIFS: Fix SMB2 readdir error handling CIFS: Fix directory rename error ASoC: davinci-mcasp: Correct rx format unit configuration shmem: fix nlink for rename overwrite directory x86 early_ioremap: Increase FIX_BTMAPS_SLOTS to 8 KVM: x86: handle idiv overflow at kvm_write_tsc regmap: Fix handling of volatile registers for format_write() chips ACPICA: Update to GPIO region handler interface. MIPS: mcount: Adjust stack pointer for static trace in MIPS32 MIPS: ZBOOT: add missing <linux/string.h> include ARM: 8165/1: alignment: don't break misaligned NEON load/store ARM: 7897/1: kexec: Use the right ISA for relocate_new_kernel ARM: 8133/1: use irq_set_affinity with force=false when migrating irqs ARM: 8128/1: abort: don't clear the exclusive monitors NFSv4: Fix another bug in the close/open_downgrade code NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists() usb:hub set hub->change_bits when over-current happens usb: dwc3: omap: fix ordering for runtime pm calls USB: EHCI: unlink QHs even after the controller has stopped USB: storage: Add quirks for Entrega/Xircom USB to SCSI converters USB: storage: Add quirk for Ariston Technologies iConnect USB to SCSI adapter USB: storage: Add quirk for Adaptec USBConnect 2000 USB-to-SCSI Adapter storage: Add single-LUN quirk for Jaz USB Adapter usb: hub: take hub->hdev reference when processing from eventlist xhci: fix oops when xhci resumes from hibernate with hw lpm capable devices xhci: Fix null pointer dereference if xhci initialization fails USB: zte_ev: fix removed PIDs USB: ftdi_sio: add support for NOVITUS Bono E thermal printer USB: sierra: add 1199:68AA device ID USB: sierra: avoid CDC class functions on "68A3" devices USB: zte_ev: remove duplicate Qualcom PID USB: zte_ev: remove duplicate Gobi PID Revert "USB: option,zte_ev: move most ZTE CDMA devices to zte_ev" USB: option: add VIA Telecom CDS7 chipset device id USB: option: reduce interrupt-urb logging verbosity USB: serial: fix potential heap buffer overflow USB: sisusb: add device id for Magic Control USB video USB: serial: fix potential stack buffer overflow USB: serial: pl2303: add device id for ztek device xtensa: fix a6 and a7 handling in fast_syscall_xtensa xtensa: fix TLBTEMP_BASE_2 region handling in fast_second_level_miss xtensa: fix access to THREAD_RA/THREAD_SP/THREAD_DS xtensa: fix address checks in dma_{alloc,free}_coherent xtensa: replace IOCTL code definitions with constants drm/radeon: add connector quirk for fujitsu board drm/vmwgfx: Fix a potential infinite spin waiting for fifo idle drm/ast: AST2000 cannot be detected correctly drm/i915: Wait for vblank before enabling the TV encoder drm/i915: Remove bogus __init annotation from DMI callbacks HID: logitech-dj: prevent false errors to be shown HID: magicmouse: sanity check report size in raw_event() callback HID: picolcd: sanity check report size in raw_event() callback cfq-iosched: Fix wrong children_weight calculation ALSA: pcm: fix fifo_size frame calculation ALSA: hda - Fix invalid pin powermap without jack detection ALSA: hda - Fix COEF setups for ALC1150 codec ALSA: core: fix buffer overflow in snd_info_get_line() arm64: ptrace: fix compat hardware watchpoint reporting trace: Fix epoll hang when we race with new entries i2c: at91: Fix a race condition during signal handling in at91_do_twi_xfer. i2c: at91: add bound checking on SMBus block length bytes arm64: flush TLS registers during exec ibmveth: Fix endian issues with rx_no_buffer statistic ahci: add pcid for Marvel 0x9182 controller ahci: Add Device IDs for Intel 9 Series PCH pata_scc: propagate return value of scc_wait_after_reset drm/i915: read HEAD register back in init_ring_common() to enforce ordering drm/radeon: load the lm63 driver for an lm64 thermal chip. drm/ttm: Choose a pool to shrink correctly in ttm_dma_pool_shrink_scan(). drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan(). drm/tilcdc: fix double kfree drm/tilcdc: fix release order on exit drm/tilcdc: panel: fix leak when unloading the module drm/tilcdc: tfp410: fix dangling sysfs connector node drm/tilcdc: slave: fix dangling sysfs connector node drm/tilcdc: panel: fix dangling sysfs connector node carl9170: fix sending URBs with wrong type when using full-speed Linux 3.10.55 libceph: gracefully handle large reply messages from the mon libceph: rename ceph_msg::front_max to front_alloc_len tpm: Provide a generic means to override the chip returned timeouts vfs: fix bad hashing of dentries dcache.c: get rid of pointless macros IB/srp: Fix deadlock between host removal and multipathd blkcg: don't call into policy draining if root_blkg is already gone mtd: nand: omap: Fix 1-bit Hamming code scheme, omap_calculate_ecc() mtd/ftl: fix the double free of the buffers allocated in build_maps() CIFS: Fix wrong restart readdir for SMB1 CIFS: Fix wrong filename length for SMB2 CIFS: Fix wrong directory attributes after rename CIFS: Possible null ptr deref in SMB2_tcon CIFS: Fix async reading on reconnects CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2 libceph: do not hard code max auth ticket len libceph: add process_one_ticket() helper libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly md/raid1,raid10: always abort recover on write error. xfs: don't zero partial page cache pages during O_DIRECT writes xfs: don't zero partial page cache pages during O_DIRECT writes xfs: don't dirty buffers beyond EOF xfs: quotacheck leaves dquot buffers without verifiers RDMA/iwcm: Use a default listen backlog if needed md/raid10: Fix memory leak when raid10 reshape completes. md/raid10: fix memory leak when reshaping a RAID10. md/raid6: avoid data corruption during recovery of double-degraded RAID6 Bluetooth: Avoid use of session socket after the session gets freed Bluetooth: never linger on process exit mnt: Add tests for unprivileged remount cases that have found to be faulty mnt: Change the default remount atime from relatime to the existing value mnt: Correct permission checks in do_remount mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount mnt: Only change user settable mount flags in remount ring-buffer: Up rb_iter_peek() loop count to 3 ring-buffer: Always reset iterator to reader page ACPI / cpuidle: fix deadlock between cpuidle_lock and cpu_hotplug.lock ACPI: Run fixed event device notifications in process context ACPICA: Utilities: Fix memory leak in acpi_ut_copy_iobject_to_iobject bfa: Fix undefined bit shift on big-endian architectures with 32-bit DMA address ASoC: pxa-ssp: drop SNDRV_PCM_FMTBIT_S24_LE ASoC: max98090: Fix missing free_irq ASoC: samsung: Correct I2S DAI suspend/resume ops ASoC: wm_adsp: Add missing MODULE_LICENSE ASoC: pcm: fix dpcm_path_put in dpcm runtime update openrisc: Rework signal handling MIPS: Fix accessing to per-cpu data when flushing the cache MIPS: OCTEON: make get_system_type() thread-safe MIPS: asm: thread_info: Add _TIF_SECCOMP flag MIPS: Cleanup flags in syscall flags handlers. MIPS: asm/reg.h: Make 32- and 64-bit definitions available at the same time MIPS: Remove BUG_ON(!is_fpu_owner()) in do_ade() MIPS: tlbex: Fix a missing statement for HUGETLB MIPS: Prevent user from setting FCSR cause bits MIPS: GIC: Prevent array overrun drivers: scsi: storvsc: Correctly handle TEST_UNIT_READY failure Drivers: scsi: storvsc: Implement a eh_timed_out handler powerpc/pseries: Failure on removing device node powerpc/mm: Use read barrier when creating real_pte powerpc/mm/numa: Fix break placement regulator: arizona-ldo1: remove bypass functionality mfd: omap-usb-host: Fix improper mask use. kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path CAPABILITIES: remove undefined caps from all processes tpm: missing tpm_chip_put in tpm_get_random() firmware: Do not use WARN_ON(!spin_is_locked()) spi: omap2-mcspi: Configure hardware when slave driver changes mode spi: orion: fix incorrect handling of cell-index DT property iommu/amd: Fix cleanup_domain for mass device removal media: media-device: Remove duplicated memset() in media_enum_entities() media: au0828: Only alt setting logic when needed media: xc4000: Fix get_frequency() media: xc5000: Fix get_frequency() Linux 3.10.54 USB: fix build error with CONFIG_PM_RUNTIME disabled NFSv4: Fix problems with close in the presence of a delegation NFSv3: Fix another acl regression svcrdma: Select NFSv4.1 backchannel transport based on forward channel NFSD: Decrease nfsd_users in nfsd_startup_generic fail usb: hub: Prevent hub autosuspend if usbcore.autosuspend is -1 USB: whiteheat: Added bounds checking for bulk command response USB: ftdi_sio: Added PID for new ekey device USB: ftdi_sio: add Basic Micro ATOM Nano USB2Serial PID ARM: OMAP2+: hwmod: Rearm wake-up interrupts for DT when MUSB is idled usb: xhci: amd chipset also needs short TX quirk xhci: Treat not finding the event_seg on COMP_STOP the same as COMP_STOP_INVAL Staging: speakup: Update __speakup_paste_selection() tty (ab)usage to match vt jbd2: fix infinite loop when recovering corrupt journal blocks mei: nfc: fix memory leak in error path mei: reset client state on queued connect request Btrfs: fix csum tree corruption, duplicate and outdated checksums hpsa: fix bad -ENOMEM return value in hpsa_big_passthru_ioctl x86/efi: Enforce CONFIG_RELOCATABLE for EFI boot stub x86_64/vsyscall: Fix warn_bad_vsyscall log output x86: don't exclude low BIOS area when allocating address space for non-PCI cards drm/radeon: add additional SI pci ids ext4: fix BUG_ON in mb_free_blocks() kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601) Revert "KVM: x86: Increase the number of fixed MTRR regs to 10" KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table KVM: x86: Inter-privilege level ret emulation is not implemeneted crypto: ux500 - make interrupt mode plausible serial: core: Preserve termios c_cflag for console resume ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct drivers/i2c/busses: use correct type for dma_map/unmap hwmon: (dme1737) Prevent overflow problem when writing large limits hwmon: (ads1015) Fix out-of-bounds array access hwmon: (lm85) Fix various errors on attribute writes hwmon: (ads1015) Fix off-by-one for valid channel index checking hwmon: (gpio-fan) Prevent overflow problem when writing large limits hwmon: (lm78) Fix overflow problems seen when writing large temperature limits hwmon: (sis5595) Prevent overflow problem when writing large limits drm: omapdrm: fix compiler errors ARM: OMAP3: Fix choice of omap3_restore_es function in OMAP34XX rev3.1.2 case. mei: start disconnect request timer consistently ALSA: hda/realtek - Avoid setting wrong COEF on ALC269 & co ALSA: hda/ca0132 - Don't try loading firmware at resume when already failed ALSA: virtuoso: add Xonar Essence STX II support ALSA: hda - fix an external mic jack problem on a HP machine USB: Fix persist resume of some SS USB devices USB: ehci-pci: USB host controller support for Intel Quark X1000 USB: serial: ftdi_sio: Add support for new Xsens devices USB: serial: ftdi_sio: Annotate the current Xsens PID assignments USB: OHCI: don't lose track of EDs when a controller dies isofs: Fix unbounded recursion when processing relocated directories HID: fix a couple of off-by-ones HID: logitech: perform bounds checking on device_id early enough stable_kernel_rules: Add pointer to netdev-FAQ for network patches Linux 3.10.53 arch/sparc/math-emu/math_32.c: drop stray break operator sparc64: ldc_connect() should not return EINVAL when handshake is in progress. sunsab: Fix detection of BREAK on sunsab serial console bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000 sparc64: Guard against flushing openfirmware mappings. sparc64: Do not insert non-valid PTEs into the TSB hash table. sparc64: Add membar to Niagara2 memcpy code. sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus. sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses. sparc64: Fix top-level fault handling bugs. sparc64: Handle 32-bit tasks properly in compute_effective_address(). sparc64: Make itc_sync_lock raw sparc64: Fix argument sign extension for compat_sys_futex(). sctp: fix possible seqlock seadlock in sctp_packet_transmit() iovec: make sure the caller actually wants anything in memcpy_fromiovecend net: Correctly set segment mac_len in skb_segment(). macvlan: Initialize vlan_features to turn on offload support. net: sctp: inherit auth_capable on INIT collisions tcp: Fix integer-overflow in TCP vegas tcp: Fix integer-overflows in TCP veno net: sendmsg: fix NULL pointer dereference ip: make IP identifiers less predictable inetpeer: get rid of ip_id_count bnx2x: fix crash during TSO tunneling Linux 3.10.52 x86/espfix/xen: Fix allocation of pages for paravirt page tables lib/btree.c: fix leak of whole btree nodes net/l2tp: don't fall back on UDP [get\|set]sockopt net: mvneta: replace Tx timer with a real interrupt net: mvneta: add missing bit descriptions for interrupt masks and causes net: mvneta: do not schedule in mvneta_tx_timeout net: mvneta: use per_cpu stats to fix an SMP lock up net: mvneta: increase the 64-bit rx/tx stats out of the hot path Revert "mac80211: move "bufferable MMPDU" check to fix AP mode scan" staging: vt6655: Fix Warning on boot handle_irq_event_percpu. x86_64/entry/xen: Do not invoke espfix64 on Xen x86, espfix: Make it possible to disable 16-bit support x86, espfix: Make espfix64 a Kconfig option, fix UML x86, espfix: Fix broken header guard x86, espfix: Move espfix definitions into a separate header file x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option" timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks printk: rename printk_sched to printk_deferred iio: buffer: Fix demux table creation staging: vt6655: Fix disassociated messages every 10 seconds mm, thp: do not allow thp faults to avoid cpuset restrictions scsi: handle flush errors properly rapidio/tsi721_dma: fix failure to obtain transaction descriptor cfg80211: fix mic_failure tracing ARM: 8115/1: LPAE: reduce damage caused by idmap to virtual memory layout crypto: af_alg - properly label AF_ALG socket Linux 3.10.51 core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors x86/efi: Include a .bss section within the PE/COFF headers s390/ptrace: fix PSW mask check Fix gcc-4.9.0 miscompilation of load_balance() in scheduler mm: hugetlb: fix copy_hugetlb_page_range() x86_32, entry: Store badsys error code in %eax hwmon: (smsc47m192) Fix temperature limit and vrm write operations parisc: Remove SA_RESTORER define coredump: fix the setting of PF_DUMPCORE Input: fix defuzzing logic slab_common: fix the check for duplicate slab names slab_common: Do not check for duplicate slab names tracing: Fix wraparound problems in "uptime" trace clock blkcg: don't call into policy draining if root_blkg is already gone ahci: add support for the Promise FastTrak TX8660 SATA HBA (ahci mode) libata: introduce ata_host->n_tags to avoid oops on SAS controllers libata: support the ata host which implements a queue depth less than 32 block: don't assume last put of shared tags is for the host block: provide compat ioctl for BLKZEROOUT media: tda10071: force modulation to QPSK on DVB-S media: hdpvr: fix two audio bugs Linux 3.10.50 ARC: Implement ptrace(PTRACE_GET_THREAD_AREA) sched: Fix possible divide by zero in avg_atom() calculation locking/mutex: Disable optimistic spinning on some architectures PM / sleep: Fix request_firmware() error at resume dm cache metadata: do not allow the data block size to change dm thin metadata: do not allow the data block size to change alarmtimer: Fix bug where relative alarm timers were treated as absolute drm/radeon: avoid leaking edid data drm/qxl: return IRQ_NONE if it was not our irq drm/radeon: set default bl level to something reasonable irqchip: gic: Fix core ID calculation when topology is read from DT irqchip: gic: Add support for cortex a7 compatible string ring-buffer: Fix polling on trace_pipe mwifiex: fix Tx timeout issue perf/x86/intel: ignore CondChgd bit to avoid false NMI handling ipv4: fix buffer overflow in ip_options_compile() dns_resolver: Null-terminate the right string dns_resolver: assure that dns_query() result is null-terminated sunvnet: clean up objects created in vnet_new() on vnet_exit() net: pppoe: use correct channel MTU when using Multilink PPP net: sctp: fix information leaks in ulpevent layer tipc: clear 'next'-pointer of message fragments before reassembly be2net: set EQ DB clear-intr bit in be_open() netlink: Fix handling of error from netlink_dump(). net: mvneta: Fix big endian issue in mvneta_txq_desc_csum() net: mvneta: fix operation in 10 Mbit/s mode appletalk: Fix socket referencing in skb tcp: fix false undo corner cases igmp: fix the problem when mc leave group net: qmi_wwan: add two Sierra Wireless/Netgear devices net: qmi_wwan: Add ID for Telewell TW-LTE 4G v2 ipv4: icmp: Fix pMTU handling for rare case tcp: Fix divide by zero when pushing during tcp-repair bnx2x: fix possible panic under memory stress net: fix sparse warning in sk_dst_set() ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix ipv4: fix dst race in sk_dst_get() 8021q: fix a potential memory leak net: sctp: check proc_dointvec result in proc_sctp_do_auth tcp: fix tcp_match_skb_to_sack() for unaligned SACK at end of an skb ip_tunnel: fix ip_tunnel_lookup shmem: fix splicing from a hole while it's punched shmem: fix faulting into a hole, not taking i_mutex shmem: fix faulting into a hole while it's punched iwlwifi: dvm: don't enable CTS to self igb: do a reset on SR-IOV re-init if device is down hwmon: (adt7470) Fix writes to temperature limit registers hwmon: (da9052) Don't use dash in the name attribute hwmon: (da9055) Don't use dash in the name attribute tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs tracing: Fix graph tracer with stack tracer on other archs fuse: handle large user and group ID Bluetooth: Ignore H5 non-link packets in non-active state Drivers: hv: util: Fix a bug in the KVP code media: gspca_pac7302: Add new usb-id for Genius i-Look 317 usb: Check if port status is equal to RxDetect Signed-off-by: Ian Maund <imaund@codeaurora.org>	2015-05-01 13:34:57 -07:00
Linus Torvalds	1f74b26b0f	vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUS commit 9c145c56d0c8a0b62e48c8d71e055ad0fb2012ba upstream. The stack guard page error case has long incorrectly caused a SIGBUS rather than a SIGSEGV, but nobody actually noticed until commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page") because that error case was never actually triggered in any normal situations. Now that we actually report the error, people noticed the wrong signal that resulted. So far, only the test suite of libsigsegv seems to have actually cared, but there are real applications that use libsigsegv, so let's not wait for any of those to break. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-04-29 10:34:01 +02:00
Linus Torvalds	0c42d1fbb3	vm: add VM_FAULT_SIGSEGV handling support commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream. The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [shengyong: Backport to 3.10 - adjust context - ignore modification for arch nios2, because 3.10 does not support it - ignore modification for driver lustre, because 3.10 does not support it - ignore VM_FAULT_FALLBACK in VM_FAULT_ERROR, becase 3.10 does not support this flag - add SIGSEGV handling to powerpc/cell spu_fault.c, because 3.10 does not separate it to copro_fault.c - add SIGSEGV handling in mm/memory.c, because 3.10 does not separate it to gup.c ] Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-04-29 10:34:00 +02:00
Tejun Heo	e58126f570	writeback: fix possible underflow in write bandwidth calculation commit c72efb658f7c8b27ca3d0efb5cfd5ded9fcac89e upstream. From 1ebf33901ecc75d9496862dceb1ef0377980587c Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Mon, 23 Mar 2015 00:08:19 -0400 `2f800fbd77` ("writeback: fix dirtied pages accounting on redirty") introduced account_page_redirty() which reverts stat updates for a redirtied page, making BDI_DIRTIED no longer monotonically increasing. bdi_update_write_bandwidth() uses the delta in BDI_DIRTIED as the basis for bandwidth calculation. While unlikely, since the above patch, the newer value may be lower than the recorded past value and underflow the bandwidth calculation leading to a wild result. Fix it by subtracing min of the old and new values when calculating delta. AFAIK, there hasn't been any report of it happening but the resulting erratic behavior would be non-critical and temporary, so it's possible that the issue is happening without being reported. The risk of the fix is very low, so tagged for -stable. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Fixes: `2f800fbd77` ("writeback: fix dirtied pages accounting on redirty") Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-04-19 10:10:48 +02:00
Tejun Heo	f16678367d	writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth() commit 7d70e15480c0450d2bfafaad338a32e884fc215e upstream. global_update_bandwidth() uses static variable update_time as the timestamp for the last update but forgets to initialize it to INITIALIZE_JIFFIES. This means that global_dirty_limit will be 5 mins into the future on 32bit and some large amount jiffies into the past on 64bit. This isn't critical as the only effect is that global_dirty_limit won't be updated for the first 5 mins after booting on 32bit machines, especially given the auxiliary nature of global_dirty_limit's role - protecting against global dirty threshold's sudden dips; however, it does lead to unintended suboptimal behavior. Fix it. Fixes: `c42843f2f0` ("writeback: introduce smoothed global dirty limit") Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-04-19 10:10:47 +02:00
Gu Zheng	dfb06c8557	mm/memory hotplug: postpone the reset of obsolete pgdat commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08 upstream. Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under stress condition: BUG: unable to handle kernel paging request at 0000000000025f60 IP: next_online_pgdat+0x1/0x50 PGD 0 Oops: 0000 [#1] SMP ACPI: Device does not support D3cold Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf] CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1 Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015 Workqueue: events vmstat_update task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000 RIP: 0010: next_online_pgdat+0x1/0x50 RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286 RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082 RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000 RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96 R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440 R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800 FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: refresh_cpu_vm_stats+0xd0/0x140 vmstat_update+0x11/0x50 process_one_work+0x194/0x3d0 worker_thread+0x12b/0x410 kthread+0xc6/0xd0 ret_from_fork+0x7c/0xb0 The cause is the "memset(pgdat, 0, sizeof(pgdat))" at the end of try_offline_node, which will reset all the content of pgdat to 0, as the pgdat is accessed lock-free, so that the users still using the pgdat will panic, such as the vmstat_update routine. process A: offline node XX: vmstat_updat() refresh_cpu_vm_stats() for_each_populated_zone() find online node XX cond_resched() offline cpu and memory, then try_offline_node() node_set_offline(nid), and memset(pgdat, 0, sizeof(pgdat)) zone = next_zone(zone) pg_data_t *pgdat = zone->zone_pgdat; // here pgdat is NULL now next_online_pgdat(pgdat) next_online_node(pgdat->node_id); // NULL pointer access So the solution here is postponing the reset of obsolete pgdat from try_offline_node() to hotadd_new_pgdat(), and just resetting pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset 0 to avoid breaking pointer information in pgdat. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reported-by: Xishi Qiu <qiuxishi@huawei.com> Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-04-19 10:10:47 +02:00
Laura Abbott	433083e309	mm/page_alloc: Call kernel_map_pages in unset_migrateype_isolate Commit `d1037ba0b8` (mm/page_alloc: restrict max order of merging on isolated pageblock) changed the logic of unset_migratetype_isolate to check the buddy allocator and explicitly call __free_pages to merge. The page that is being freed in this path never had prep_new_page called so set_page_refcounted is called explicitly but there is no call to kernel_map_pages. With the default kernel_map_pages this is mostly harmless but if kernel_map_pages does any manipulation of the page tables (unmapping or setting pages to read only) this may trigger a fault: alloc_contig_range test_pages_isolated(ceb00, ced00) failed Unable to handle kernel paging request at virtual address ffffffc0cec00000 pgd = ffffffc045fc4000 [ffffffc0cec00000] *pgd=0000000000000000 Internal error: Oops: 9600004f [#1] PREEMPT SMP Modules linked in: exfatfs CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1 task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000 PC is at memset+0xc8/0x1c0 LR is at kernel_map_pages+0x1ec/0x244 Fix this by calling kernel_map_pages to ensure the page is set in the page table properly Change-Id: Ie0c7f38fce24683b6ddebf95874be662ef25021b Signed-off-by: Laura Abbott <lauraa@codeaurora.org>	2015-04-09 09:12:28 -07:00
Grazvydas Ignotas	9113c468b6	mm/memory.c: actually remap enough memory commit 9cb12d7b4ccaa976f97ce0c5fd0f1b6a83bc2a75 upstream. For whatever reason, generic_access_phys() only remaps one page, but actually allows to access arbitrary size. It's quite easy to trigger large reads, like printing out large structure with gdb, which leads to a crash. Fix it by remapping correct size. Fixes: `28b2ee20c7` ("access_process_vm device memory infrastructure") Signed-off-by: Grazvydas Ignotas <notasas@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-03-18 13:22:28 +01:00
Joonsoo Kim	2295074e44	mm/compaction: fix wrong order check in compact_finished() commit 372549c2a3778fd3df445819811c944ad54609ca upstream. What we want to check here is whether there is highorder freepage in buddy list of other migratetype in order to steal it without fragmentation. But, current code just checks cc->order which means allocation request order. So, this is wrong. Without this fix, non-movable synchronous compaction below pageblock order would not stopped until compaction is complete, because migratetype of most pageblocks are movable and high order freepage made by compaction is usually on movable type buddy list. There is some report related to this bug. See below link. http://www.spinics.net/lists/linux-mm/msg81666.html Although the issued system still has load spike comes from compaction, this makes that system completely stable and responsive according to his report. stress-highalloc test in mmtests with non movable order 7 allocation doesn't show any notable difference in allocation success rate, but, it shows more compaction success rate. Compaction success rate (Compaction success * 100 / Compaction stalls, %) 18.47 : 28.94 Fixes: `1fb3f8ca0e` ("mm: compaction: capture a suitable high-order page immediately when it is made available") Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-03-18 13:22:28 +01:00
Roman Gushchin	ae9c2f1fe9	mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() commit 8138a67a5557ffea3a21dfd6f037842d4e748513 upstream. I noticed that "allowed" can easily overflow by falling below 0, because (total_vm / 32) can be larger than "allowed". The problem occurs in OVERCOMMIT_NONE mode. In this case, a huge allocation can success and overcommit the system (despite OVERCOMMIT_NONE mode). All subsequent allocations will fall (system-wide), so system become unusable. The problem was masked out by commit `c9b1d0981f` ("mm: limit growth of 3% hardcoded other user reserve"), but it's easy to reproduce it on older kernels: 1) set overcommit_memory sysctl to 2 2) mmap() large file multiple times (with VM_SHARED flag) 3) try to malloc() large amount of memory It also can be reproduced on newer kernels, but miss-configured sysctl_user_reserve_kbytes is required. Fix this issue by switching to signed arithmetic here. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-03-18 13:22:28 +01:00
Roman Gushchin	992f1caea7	mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() commit 5703b087dc8eaf47bfb399d6cf512d471beff405 upstream. I noticed, that "allowed" can easily overflow by falling below 0, because (total_vm / 32) can be larger than "allowed". The problem occurs in OVERCOMMIT_NONE mode. In this case, a huge allocation can success and overcommit the system (despite OVERCOMMIT_NONE mode). All subsequent allocations will fall (system-wide), so system become unusable. The problem was masked out by commit `c9b1d0981f` ("mm: limit growth of 3% hardcoded other user reserve"), but it's easy to reproduce it on older kernels: 1) set overcommit_memory sysctl to 2 2) mmap() large file multiple times (with VM_SHARED flag) 3) try to malloc() large amount of memory It also can be reproduced on newer kernels, but miss-configured sysctl_user_reserve_kbytes is required. Fix this issue by switching to signed arithmetic here. [akpm@linux-foundation.org: use min_t] Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-03-18 13:22:27 +01:00
Naoya Horiguchi	1a25fb791a	mm/hugetlb: add migration entry check in __unmap_hugepage_range commit 9fbc1f635fd0bd28cb32550211bf095753ac637a upstream. If __unmap_hugepage_range() tries to unmap the address range over which hugepage migration is on the way, we get the wrong page because pte_page() doesn't work for migration entries. This patch simply clears the pte for migration entries as we do for hwpoison entries. Fixes: `290408d4a2` ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-03-18 13:22:27 +01:00
Shiraz Hashim	48f5cffe36	mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range commit 23aaed6659df9adfabe9c583e67a36b54e21df46 upstream. walk_page_range() silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). Userspace applications get the wrong data, so the effect is like just confusing users (if the applications just display the data) or sometimes killing the processes (if the applications do something with misunderstanding virtual addresses due to the wrong data.) For example for pagemap_read, when no callbacks are called against VM_PFNMAP vma, pagemap_read may prepare pagemap data for next virtual address range at wrong index. Eventually userspace may get wrong pagemap data for a task. Corresponding to a VM_PFNMAP marked vma region, kernel may report mappings from subsequent vma regions. User space in turn may account more pages (than really are) to the task. In my case I was using procmem, procrack (Android utility) which uses pagemap interface to account RSS pages of a task. Due to this bug it was giving a wrong picture for vmas (with VM_PFNMAP set). Fixes: `a9ff785e44` ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas") Signed-off-by: Shiraz Hashim <shashim@codeaurora.org> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-02-11 14:48:16 +08:00
Vignesh Radhakrishnan	7f15dd8a75	kmemleak : Make kmemleak_stack_scan optional using config Currently we have kmemleak_stack_scan enabled by default. This can hog the cpu with pre-emption disabled for a long time starving other tasks. Make this optional at compile time, since if required we can always write to sysfs entry and enable this option. Change-Id: Ie30447861c942337c7ff25ac269b6025a527e8eb Signed-off-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>	2015-02-04 18:38:40 +05:30
Shiraz Hashim	44ad70f1f0	mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range walk_page_range silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). For example for pagemap_read, when no callbacks are called against VM_PFNMAP vma, pagemap_read may prepare pagemap data at wrong index. Change-Id: I057b5c8ede1ae4bb9e3f8639e10bd4fcbf23da7e Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>	2015-01-20 16:10:02 +05:30
Linus Torvalds	7d702b4b2b	mm: Don't count the stack guard page towards RLIMIT_STACK commit 690eac53daff34169a4d74fc7bfbd388c4896abb upstream. Commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page") made sure that we return the error properly for stack growth conditions. It also theorized that counting the guard page towards the stack limit might break something, but also said "Let's see if anybody notices". Somebody did notice. Apparently android-x86 sets the stack limit very close to the limit indeed, and including the guard page in the rlimit check causes the android 'zygote' process problems. So this adds the (fairly trivial) code to make the stack rlimit check be against the actual real stack size, rather than the size of the vma that includes the guard page. Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org> Cc: Jay Foad <jay.foad@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-01-16 06:59:03 -08:00
Linus Torvalds	88b5d12c64	mm: propagate error from stack expansion even for guard page commit fee7e49d45149fba60156f5b59014f764d3e3728 upstream. Jay Foad reports that the address sanitizer test (asan) sometimes gets confused by a stack pointer that ends up being outside the stack vma that is reported by /proc/maps. This happens due to an interaction between RLIMIT_STACK and the guard page: when we do the guard page check, we ignore the potential error from the stack expansion, which effectively results in a missing guard page, since the expected stack expansion won't have been done. And since /proc/maps explicitly ignores the guard page (commit d7824370e263: "mm: fix up some user-visible effects of the stack guard page"), the stack pointer ends up being outside the reported stack area. This is the minimal patch: it just propagates the error. It also effectively makes the guard page part of the stack limit, which in turn measn that the actual real stack is one page less than the stack limit. Let's see if anybody notices. We could teach acct_stack_growth() to allow an extra page for a grow-up/grow-down stack in the rlimit test, but I don't want to add more complexity if it isn't needed. Reported-and-tested-by: Jay Foad <jay.foad@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-01-16 06:59:03 -08:00
Vlastimil Babka	6bb148fb1e	mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed commit 9e5e3661727eaf960d3480213f8e87c8d67b6956 upstream. Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing to sleep. Their analysis found the cause to be a combination of several factors: 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait 2. The process has been killed (by OOM in this case), but has not yet been scheduled to remove itself from the waitqueue and die. 3. kswapd checks for throttled processes in prepare_kswapd_sleep(): if (waitqueue_active(&pgdat->pfmemalloc_wait)) { wake_up(&pgdat->pfmemalloc_wait); return false; // kswapd will not go to sleep } However, for a process that was already killed, wake_up() does not remove the process from the waitqueue, since try_to_wake_up() checks its state first and returns false when the process is no longer waiting. 4. kswapd is running on the same CPU as the only CPU that the process is allowed to run on (through cpus_allowed, or possibly single-cpu system). 5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd encounters no voluntary preemption points and repeatedly fails prepare_kswapd_sleep(), blocking the process from running and removing itself from the waitqueue, which would let kswapd sleep. So, the source of the problem is that we prevent kswapd from going to sleep until there are processes waiting on the pfmemalloc_wait queue, and a process waiting on a queue is guaranteed to be removed from the queue only when it gets scheduled. This was done to make sure that no process is left sleeping on pfmemalloc_wait when kswapd itself goes to sleep. However, it isn't necessary to postpone kswapd sleep until the pfmemalloc_wait queue actually empties. To prevent processes from being left sleeping, it's actually enough to guarantee that all processes waiting on pfmemalloc_wait queue have been woken up by the time we put kswapd to sleep. This patch therefore fixes this issue by substituting 'wake_up' with 'wake_up_all' and removing 'return false' in the code snippet from prepare_kswapd_sleep() above. Note that if any process puts itself in the queue after this waitqueue_active() check, or after the wake up itself, it means that the process will also wake up kswapd - and since we are under prepare_to_wait(), the wake up won't be missed. Also we update the comment prepare_kswapd_sleep() to hopefully more clearly describe the races it is preventing. Fixes: `5515061d22` ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-01-16 06:59:03 -08:00
Linux Build Service Account	c7585e1f1c	Merge "mm: vmscan: fix the page state calculation in too_many_isolated"	2014-12-29 04:16:04 -08:00
Vinayak Menon	c8f5b9926f	mm: vmscan: fix the page state calculation in too_many_isolated It is observed that sometimes multiple tasks get blocked in the congestion_wait loop below, in shrink_inactive_list. (__schedule) from [<c0a03328>] (schedule_timeout) from [<c0a04940>] (io_schedule_timeout) from [<c01d585c>] (congestion_wait) from [<c01cc9d8>] (shrink_inactive_list) from [<c01cd034>] (shrink_zone) from [<c01cdd08>] (try_to_free_pages) from [<c01c442c>] (__alloc_pages_nodemask) from [<c01f1884>] (new_slab) from [<c09fcf60>] (__slab_alloc) from [<c01f1a6c>] In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned 14, zone_page_state(zone, NR_INACTIVE_FILE) returned 92, and the gfp_flag was GFP_KERNEL which resulted in too_many_isolated to return true. But one of the CPU pageset vmstat diff had NR_ISOLATED_FILE as -14. As there weren't any more update to per cpu pageset, the threshold wasn't met, and the tasks were blocked in the congestion wait. This patch uses zone_page_state_snapshot instead, but restricts its usage to avoid performance penalty. Change-Id: Iec767a548e524729c7ed79a92fe4718cdd08ce69 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-26 21:12:18 +05:30
Linux Build Service Account	ac6c649990	Merge "mm/page_alloc: restrict max order of merging on isolated pageblock Current pageblock isolation logic could isolate each pageblock individually. This causes freepage accounting problem if freepage with pageblock order on isolate pageblock is merged with other freepage on normal pageblock. We can prevent merging by restricting max order of merging to pageblock order if freepage is on isolate pageblock."	2014-12-18 14:23:16 -08:00
Linux Build Service Account	348e3a3df5	Merge "memcg: Allow non-root users permission to control memory"	2014-12-18 10:29:47 -08:00
Joonsoo Kim	d1037ba0b8	mm/page_alloc: restrict max order of merging on isolated pageblock Current pageblock isolation logic could isolate each pageblock individually. This causes freepage accounting problem if freepage with pageblock order on isolate pageblock is merged with other freepage on normal pageblock. We can prevent merging by restricting max order of merging to pageblock order if freepage is on isolate pageblock. A side-effect of this change is that there could be non-merged buddy freepage even if finishing pageblock isolation, because undoing pageblock isolation is just to move freepage from isolate buddy list to normal buddy list rather than to consider merging. So, the patch also makes undoing pageblock isolation consider freepage merge. When un-isolation, freepage with more than pageblock order and it's buddy are checked. If they are on normal pageblock, instead of just moving, we isolate the freepage and free it in order to get merged. CRs-fixed: 771472 Change-Id: I50d132eeea59de58e68e82f797edf85334512468 Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Heesub Shin <heesub.shin@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Ritesh Harjani <ritesh.list@gmail.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 3c605096d3158216ba9326a16266f6ba128c2c8d Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [lmark@codeaurora.org: fix merge conflicts] Signed-off-by: Liam Mark <lmark@codeaurora.org>	2014-12-17 11:51:09 -08:00
Hugh Dickins	4542246879	mm: fix swapoff hang after page migration and fork commit 2022b4d18a491a578218ce7a4eca8666db895a73 upstream. I've been seeing swapoff hangs in recent testing: it's cycling around trying unsuccessfully to find an mm for some remaining pages of swap. I have been exercising swap and page migration more heavily recently, and now notice a long-standing error in copy_one_pte(): it's trying to add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing so even when it's a migration entry or an hwpoison entry. Which wouldn't matter much, except it adds dst_mm next to src_mm, assuming src_mm is already on the mmlist: which may not be so. Then if pages are later swapped out from dst_mm, swapoff won't be able to find where to replace them. There's already a !non_swap_entry() test for stats: move that up before the swap_duplicate() and the addition to mmlist. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-12-16 09:09:42 -08:00
Weijie Yang	22fff28376	mm: frontswap: invalidate expired data on a dup-store failure commit fb993fa1a2f669215fa03a09eed7848f2663e336 upstream. If a frontswap dup-store failed, it should invalidate the expired page in the backend, or it could trigger some data corruption issue. Such as: 1. use zswap as the frontswap backend with writeback feature 2. store a swap page(version_1) to entry A, success 3. dup-store a newer page(version_2) to the same entry A, fail 4. use __swap_writepage() write version_2 page to swapfile, success 5. zswap do shrink, writeback version_1 page to swapfile 6. version_2 page is overwrited by version_1, data corrupt. This patch fixes this issue by invalidating expired data immediately when meet a dup-store failure. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-12-16 09:09:41 -08:00
Chintan Pandya	d68f06d491	memcg: Allow non-root users permission to control memory In a system like Android, a process with SYS_ADMIN rights controls the system for things like moving process from one cgroup to another. The native cgroup capabilities are only allowed to execute by root user and not system. While adding a new cgroup sub-system, one may override and relax the permission so that 'system' can also control cgroup. Here, memcg is one such cgroup sub system which requires system level control for that. Allow non-root processes to add arbitrary into 'memory' cgroups if it has 'CAP_SYS_ADMIN' capability set. Change-Id: I43d4468186f142c176cb5b5f060751bb1b160344 Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>	2014-12-15 13:13:02 +05:30
Linus Torvalds	86779d367b	Don't trigger congestion wait on dirty-but-not-writeout pages shrink_inactive_list() used to wait 0.1s to avoid congestion when all the pages that were isolated from the inactive list were dirty but not under active writeback. That makes no real sense, and apparently causes major interactivity issues under some loads since 3.11. The ostensible reason for it was to wait for kswapd to start writing pages, but that seems questionable as well, since the congestion wait code seems to trigger for kswapd itself as well. Also, the logic behind delaying anything when we haven't actually started writeback is not clear - it only delays actually starting that writeback. We'll still trigger the congestion waiting if (a) the process is kswapd, and we hit pages flagged for immediate reclaim (b) the process is not kswapd, and the zone backing dev writeback is actually congested. This probably needs to be revisited, but as it is this fixes a reported regression. Reported-by: Felipe Contreras <felipe.contreras@gmail.com> Pinpointed-by: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: b738d764652dc5aab1c8939f637112981fce9e0e Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I4fbcbb10d7ba242caf80da06bd8ed11770571cff [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-11 15:15:13 +05:30
Mel Gorman	0783bb8b2a	mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY Commit "mm: vmscan: obey proportional scanning requirements for kswapd" ensured that file/anon lists were scanned proportionally for reclaim from kswapd but ignored it for direct reclaim. The intent was to minimse direct reclaim latency but Yuanhan Liu pointer out that it substitutes one long stall for many small stalls and distorts aging for normal workloads like streaming readers/writers. Hugh Dickins pointed out that a side-effect of the same commit was that when one LRU list dropped to zero that the entirety of the other list was shrunk leading to excessive reclaim in memcgs. This patch scans the file/anon lists proportionally for direct reclaim to similarly age page whether reclaimed by kswapd or direct reclaim but takes care to abort reclaim if one LRU drops to zero after reclaiming the requested number of pages. Based on ext4 and using the Intel VM scalability test 3.15.0-rc5 3.15.0-rc5 shrinker proportion Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%) Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%) Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%) Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%) Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%) Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%) The test cases are running multiple dd instances reading sparse files. The results are within the noise for the small test machine. The impact of the patch is more noticable from the vmstats 3.15.0-rc5 3.15.0-rc5 shrinker proportion Minor Faults 35154 36784 Major Faults 611 1305 Swap Ins 394 1651 Swap Outs 4394 5891 Allocation stalls 118616 44781 Direct pages scanned 4935171 4602313 Kswapd pages scanned 15921292 16258483 Kswapd pages reclaimed 15913301 16248305 Direct pages reclaimed 4933368 4601133 Kswapd efficiency 99% 99% Kswapd velocity 670088.047 682555.961 Direct efficiency 99% 99% Direct velocity 207709.217 193212.133 Percentage direct scans 23% 22% Page writes by reclaim 4858.000 6232.000 Page writes file 464 341 Page writes anon 4394 5891 Note that there are fewer allocation stalls even though the amount of direct reclaim scanning is very approximately the same. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I93acb1ea93d90afca35f3db2a350f2e6589e7c64 [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-11 15:14:57 +05:30
Johannes Weiner	ca6b845aba	mm/page-writeback.c: do not count anon pages as dirtyable memory The VM is currently heavily tuned to avoid swapping. Whether that is good or bad is a separate discussion, but as long as the VM won't swap to make room for dirty cache, we can not consider anonymous pages when calculating the amount of dirtyable memory, the baseline to which dirty_background_ratio and dirty_ratio are applied. A simple workload that occupies a significant size (40+%, depending on memory layout, storage speeds etc.) of memory with anon/tmpfs pages and uses the remainder for a streaming writer demonstrates this problem. In that case, the actual cache pages are a small fraction of what is considered dirtyable overall, which results in an relatively large portion of the cache pages to be dirtied. As kswapd starts rotating these, random tasks enter direct reclaim and stall on IO. Only consider free pages and file pages dirtyable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: a1c3bfb2f67ef766de03f1f56bdfff9c8595ab14 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I35ae9cfbcccbf3329e6f15158cc7bb72905cb7ce [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-11 15:14:45 +05:30
Mel Gorman	83ea991a3e	mm: vmscan: do not scale writeback pages when deciding whether to set ZONE_WRITEBACK After the patch "mm: vmscan: Flatten kswapd priority loop" was merged the scanning priority of kswapd changed. The priority now rises until it is scanning enough pages to meet the high watermark. shrink_inactive_list sets ZONE_WRITEBACK if a number of pages were encountered under writeback but this value is scaled based on the priority. As kswapd frequently scans with a higher priority now it is relatively easy to set ZONE_WRITEBACK. This patch removes the scaling and treates writeback pages similar to how it treats unqueued dirty pages and congested pages. The user-visible effect should be that kswapd will writeback fewer pages from reclaim context. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 918fc718c5922520c499ad60f61b8df86b998ae9 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I5f75351d845ab0de4ca1c22ffba10e06ea45d111 [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-11 15:14:28 +05:30
Mel Gorman	a0901266c3	mm: vmscan: do not continue scanning if reclaim was aborted for compaction Direct reclaim is not aborting to allow compaction to go ahead properly. do_try_to_free_pages is told to abort reclaim which is happily ignores and instead increases priority instead until it reaches 0 and starts shrinking file/anon equally. This patch corrects the situation by aborting reclaim when requested instead of raising priority. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 5a1c9cbc1550f93335d7c03eb6c271e642deff04 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I1e3fc6b2fea5d5a06edf5c682caffa3a7907a7ad [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-11 15:14:07 +05:30
Mel Gorman	0c08500607	mm: vmscan: take page buffers dirty and locked state into account Page reclaim keeps track of dirty and under writeback pages and uses it to determine if wait_iff_congested() should stall or if kswapd should begin writing back pages. This fails to account for buffer pages that can be under writeback but not PageWriteback which is the case for filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages can have all the buffers clean and writepage does no IO so it should not be accounted as congested. This patch adds an address_space operation that filesystems may optionally use to check if a page is really dirty or really under writeback. An implementation is provided for for buffer_heads is added and used for block operations and ext3 in ordered mode. By default the page flags are obeyed. Credit goes to Jan Kara for identifying that the page flags alone are not sufficient for ext3 and sanity checking a number of ideas on how the problem could be addressed. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: b45972265f823ed01eae0867a176320071665787 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: Idabea6f388eddcf5acf4725975d51119169da211 [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-11 15:12:10 +05:30
Mel Gorman	ec0304daef	mm: vmscan: treat pages marked for immediate reclaim as zone congestion Currently a zone will only be marked congested if the underlying BDI is congested but if dirty pages are spread across zones it is possible that an individual zone is full of dirty pages without being congested. The impact is that zone gets scanned very quickly potentially reclaiming really clean pages. This patch treats pages marked for immediate reclaim as congested for the purposes of marking a zone ZONE_CONGESTED and stalling in wait_iff_congested. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: d04e8acd03e5c3421ef18e3da7bc88d56179ca42 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I321615bb32c4efe5889df9ce6482c825d7a816e6 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:48 +05:30
Mel Gorman	f2d250b88f	mm: vmscan: move direct reclaim wait_iff_congested into shrink_list shrink_inactive_list makes decisions on whether to stall based on the number of dirty pages encountered. The wait_iff_congested() call in shrink_page_list does no such thing and it's arbitrary. This patch moves the decision on whether to set ZONE_CONGESTED and the wait_iff_congested call into shrink_page_list. This keeps all the decisions on whether to stall or not in the one place. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 8e950282804558e4605401b9c79c1d34f0d73507 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: Ie73206306ff0589877cab6d1a4ec510d88088403 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:48 +05:30
Mel Gorman	97b639a7d5	mm: vmscan: set zone flags before blocking In shrink_page_list a decision may be made to stall and flag a zone as ZONE_WRITEBACK so that if a large number of unqueued dirty pages are encountered later then the reclaimer will stall. Set ZONE_WRITEBACK before potentially going to sleep so it is noticed sooner. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: f7ab8db791a8692f5ed4201dbae25722c1732a8d Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I32b015f56fb76c2c2f15163659eda478f63e4b5e Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:48 +05:30
Mel Gorman	ddff702505	mm: vmscan: stall page reclaim after a list of pages have been processed Commit "mm: vmscan: Block kswapd if it is encountering pages under writeback" blocks page reclaim if it encounters pages under writeback marked for immediate reclaim. It blocks while pages are still isolated from the LRU which is unnecessary. This patch defers the blocking until after the isolated pages have been processed and tidies up some of the comments. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: b1a6f21e3b2315d46ae8af88a8f4eb8ea2763107 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: Ia6da0949d7bf81cd7c8d3951a7f9c723131b9037 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:47 +05:30
Mel Gorman	8a1c1c901a	mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered Further testing of the "Reduce system disruption due to kswapd" discovered a few problems. First and foremost, it's possible for pages under writeback to be freed which will lead to badness. Second, as pages were not being swapped the file LRU was being scanned faster and clean file pages were being reclaimed. In some cases this results in increased read IO to re-read data from disk. Third, more pages were being written from kswapd context which can adversly affect IO performance. Lastly, it was observed that PageDirty pages are not necessarily dirty on all filesystems (buffers can be clean while PageDirty is set and ->writepage generates no IO) and not all filesystems set PageWriteback when the page is being written (e.g. ext3). This disconnect confuses the reclaim stalling logic. This follow-up series is aimed at these problems. The tests were based on three kernels vanilla: kernel 3.9 as that is what the current mmotm uses as a baseline mmotm-20130522 is mmotm as of 22nd May with "Reduce system disruption due to kswapd" applied on top as per what should be in Andrew's tree right now lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel The first test used memcached+memcachetest while some background IO was in progress as implemented by the parallel IO tests implement in MM Tests. memcachetest benchmarks how many operations/second memcached can service. It starts with no background IO on a freshly created ext4 filesystem and then re-runs the test with larger amounts of IO in the background to roughly simulate a large copy in progress. The expectation is that the IO should have little or no impact on memcachetest which is running entirely in memory. parallelio 3.9.0 3.9.0 3.9.0 vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10 Ops memcachetest-0M 23117.00 ( 0.00%) 22780.00 ( -1.46%) 22763.00 ( -1.53%) Ops memcachetest-715M 23774.00 ( 0.00%) 23299.00 ( -2.00%) 22934.00 ( -3.53%) Ops memcachetest-2385M 4208.00 ( 0.00%) 24154.00 (474.00%) 23765.00 (464.76%) Ops memcachetest-4055M 4104.00 ( 0.00%) 25130.00 (512.33%) 24614.00 (499.76%) Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) 6.00 ( 50.00%) Ops io-duration-2385M 116.00 ( 0.00%) 21.00 ( 81.90%) 21.00 ( 81.90%) Ops io-duration-4055M 160.00 ( 0.00%) 36.00 ( 77.50%) 35.00 ( 78.12%) Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-715M 140138.00 ( 0.00%) 18.00 ( 99.99%) 18.00 ( 99.99%) Ops swaptotal-2385M 385682.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-4055M 418029.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-715M 144.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-2385M 134227.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-4055M 125618.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops minorfaults-0M 1536429.00 ( 0.00%) 1531632.00 ( 0.31%) 1533541.00 ( 0.19%) Ops minorfaults-715M 1786996.00 ( 0.00%) 1612148.00 ( 9.78%) 1608832.00 ( 9.97%) Ops minorfaults-2385M 1757952.00 ( 0.00%) 1614874.00 ( 8.14%) 1613541.00 ( 8.21%) Ops minorfaults-4055M 1774460.00 ( 0.00%) 1633400.00 ( 7.95%) 1630881.00 ( 8.09%) Ops majorfaults-0M 1.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops majorfaults-715M 184.00 ( 0.00%) 167.00 ( 9.24%) 166.00 ( 9.78%) Ops majorfaults-2385M 24444.00 ( 0.00%) 155.00 ( 99.37%) 93.00 ( 99.62%) Ops majorfaults-4055M 21357.00 ( 0.00%) 147.00 ( 99.31%) 134.00 ( 99.37%) memcachetest is the transactions/second reported by memcachetest. In the vanilla kernel note that performance drops from around 23K/sec to just over 4K/second when there is 2385M of IO going on in the background. With current mmotm, there is no collapse in performance and with this follow-up series there is little change. swaptotal is the total amount of swap traffic. With mmotm and the follow-up series, the total amount of swapping is much reduced. 3.9.0 3.9.0 3.9.0 vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10 Minor Faults 11160152 10706748 10622316 Major Faults 46305 755 678 Swap Ins 260249 0 0 Swap Outs 683860 18 18 Direct pages scanned 0 678 2520 Kswapd pages scanned 6046108 8814900 1639279 Kswapd pages reclaimed 1081954 `1172267` 1094635 Direct pages reclaimed 0 566 2304 Kswapd efficiency 17% 13% 66% Kswapd velocity 5217.560 7618.953 1414.879 Direct efficiency 100% 83% 91% Direct velocity 0.000 0.586 2.175 Percentage direct scans 0% 0% 0% Zone normal velocity 5105.086 6824.681 671.158 Zone dma32 velocity 112.473 794.858 745.896 Zone dma velocity 0.000 0.000 0.000 Page writes by reclaim 1929612.000 6861768.000 32821.000 Page writes file 1245752 6861750 32803 Page writes anon 683860 18 18 Page reclaim immediate 7484 40 239 Sector Reads 1130320 93996 86900 Sector Writes 13508052 10823500 11804436 Page rescued immediate 0 0 0 Slabs scanned 33536 27136 18560 Direct inode steals 0 0 0 Kswapd inode steals 8641 1035 0 Kswapd skipped wait 0 0 0 THP fault alloc 8 37 33 THP collapse alloc 508 552 515 THP splits 24 1 1 THP fault fallback 0 0 0 THP collapse fail 0 0 0 There are a number of observations to make here 1. Swap outs are almost eliminated. Swap ins are 0 indicating that the pages swapped were really unused anonymous pages. Related to that, major faults are much reduced. 2. kswapd efficiency was impacted by the initial series but with these follow-up patches, the efficiency is now at 66% indicating that far fewer pages were skipped during scanning due to dirty or writeback pages. 3. kswapd velocity is reduced indicating that fewer pages are being scanned with the follow-up series as kswapd now stalls when the tail of the LRU queue is full of unqueued dirty pages. The stall gives flushers a chance to catch-up so kswapd can reclaim clean pages when it wakes 4. In light of Zlatko's recent reports about zone scanning imbalances, mmtests now reports scanning velocity on a per-zone basis. With mainline, you can see that the scanning activity is dominated by the Normal zone with over 45 times more scanning in Normal than the DMA32 zone. With the series currently in mmotm, the ratio is slightly better but it is still the case that the bulk of scanning is in the highest zone. With this follow-up series, the ratio of scanning between the Normal and DMA32 zone is roughly equal. 5. As Dave Chinner observed, the current patches in mmotm increased the number of pages written from kswapd context which is expected to adversly impact IO performance. With the follow-up patches, far fewer pages are written from kswapd context than the mainline kernel 6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With the follow-up series, there is less slab shrinking activity and no inodes were reclaimed. 7. Note that "Sectors Read" is drastically reduced implying that the source data being used for the IO is not being aggressively discarded due to page reclaim skipping over dirty pages and reclaiming clean pages. Note that the reducion in reads could also be due to inode data not being re-read from disk after a slab shrink. 3.9.0 3.9.0 3.9.0 vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10 Mean sda-avgqz 166.99 32.09 33.44 Mean sda-await 853.64 192.76 185.43 Mean sda-r_await 6.31 9.24 5.97 Mean sda-w_await 2992.81 202.65 192.43 Max sda-avgqz 1409.91 718.75 698.98 Max sda-await 6665.74 3538.00 3124.23 Max sda-r_await 58.96 111.95 58.00 Max sda-w_await 28458.94 3977.29 3148.61 In light of the changes in writes from reclaim context, the number of reads and Dave Chinner's concerns about IO performance I took a closer look at the IO stats for the test disk. Few observations 1. The average queue size is reduced by the initial series and roughly the same with this follow up. 2. Average wait times for writes are reduced and as the IO is completing faster it at least implies that the gain is because flushers are writing the files efficiently instead of page reclaim getting in the way. 3. The reduction in maximum write latency is staggering. 28 seconds down to 3 seconds. Jan Kara asked how NFS is affected by all of this. Unstable pages can be taken into account as one of the patches in the series shows but it is still the case that filesystems with unusual handling of dirty or writeback could still be treated better. Tests like postmark, fsmark and largedd showed up nothing useful. On my test setup, pages are simply not being written back from reclaim context with or without the patches and there are no changes in performance. My test setup probably is just not strong enough network-wise to be really interesting. I ran a longer-lived memcached test with IO going to NFS instead of a local disk parallelio 3.9.0 3.9.0 3.9.0 vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10 Ops memcachetest-0M 23323.00 ( 0.00%) 23241.00 ( -0.35%) 23321.00 ( -0.01%) Ops memcachetest-715M 25526.00 ( 0.00%) 24763.00 ( -2.99%) 23242.00 ( -8.95%) Ops memcachetest-2385M 8814.00 ( 0.00%) 26924.00 (205.47%) 23521.00 (166.86%) Ops memcachetest-4055M 5835.00 ( 0.00%) 26827.00 (359.76%) 25560.00 (338.05%) Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops io-duration-715M 65.00 ( 0.00%) 71.00 ( -9.23%) 11.00 ( 83.08%) Ops io-duration-2385M 129.00 ( 0.00%) 94.00 ( 27.13%) 53.00 ( 58.91%) Ops io-duration-4055M 301.00 ( 0.00%) 100.00 ( 66.78%) 108.00 ( 64.12%) Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-715M 14394.00 ( 0.00%) 949.00 ( 93.41%) 63.00 ( 99.56%) Ops swaptotal-2385M 401483.00 ( 0.00%) 24437.00 ( 93.91%) 30118.00 ( 92.50%) Ops swaptotal-4055M 554123.00 ( 0.00%) 35688.00 ( 93.56%) 63082.00 ( 88.62%) Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-715M 4522.00 ( 0.00%) 560.00 ( 87.62%) 63.00 ( 98.61%) Ops swapin-2385M 169861.00 ( 0.00%) 5026.00 ( 97.04%) 13917.00 ( 91.81%) Ops swapin-4055M 192374.00 ( 0.00%) 10056.00 ( 94.77%) 25729.00 ( 86.63%) Ops minorfaults-0M 1445969.00 ( 0.00%) 1520878.00 ( -5.18%) 1454024.00 ( -0.56%) Ops minorfaults-715M 1557288.00 ( 0.00%) 1528482.00 ( 1.85%) 1535776.00 ( 1.38%) Ops minorfaults-2385M 1692896.00 ( 0.00%) 1570523.00 ( 7.23%) 1559622.00 ( 7.87%) Ops minorfaults-4055M 1654985.00 ( 0.00%) 1581456.00 ( 4.44%) 1596713.00 ( 3.52%) Ops majorfaults-0M 0.00 ( 0.00%) 1.00 (-99.00%) 0.00 ( 0.00%) Ops majorfaults-715M 763.00 ( 0.00%) 265.00 ( 65.27%) 75.00 ( 90.17%) Ops majorfaults-2385M 23861.00 ( 0.00%) 894.00 ( 96.25%) 2189.00 ( 90.83%) Ops majorfaults-4055M 27210.00 ( 0.00%) 1569.00 ( 94.23%) 4088.00 ( 84.98%) 1. Performance does not collapse due to IO which is good. IO is also completing faster. Note with mmotm, IO completes in a third of the time and faster again with this series applied 2. Swapping is reduced, although not eliminated. The figures for the follow-up look bad but it does vary a bit as the stalling is not perfect for nfs or filesystems like ext3 with unusual handling of dirty and writeback pages 3. There are swapins, particularly with larger amounts of IO indicating that active pages are being reclaimed. However, the number of much reduced. 3.9.0 3.9.0 3.9.0 vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10 Minor Faults 36339175 35025445 35219699 Major Faults 310964 27108 51887 Swap Ins 2176399 173069 333316 Swap Outs 3344050 357228 504824 Direct pages scanned 8972 77283 43242 Kswapd pages scanned 20899983 8939566 14772851 Kswapd pages reclaimed 6193156 5172605 5231026 Direct pages reclaimed 8450 73802 39514 Kswapd efficiency 29% 57% 35% Kswapd velocity 3929.743 1847.499 3058.840 Direct efficiency 94% 95% 91% Direct velocity 1.687 15.972 8.954 Percentage direct scans 0% 0% 0% Zone normal velocity 3721.907 939.103 2185.142 Zone dma32 velocity 209.522 924.368 882.651 Zone dma velocity 0.000 0.000 0.000 Page writes by reclaim 4082185.000 526319.000 537114.000 Page writes file 738135 169091 32290 Page writes anon 3344050 357228 504824 Page reclaim immediate 9524 170 5595843 Sector Reads 8909900 861192 1483680 Sector Writes 13428980 1488744 2076800 Page rescued immediate 0 0 0 Slabs scanned 38016 31744 28672 Direct inode steals 0 0 0 Kswapd inode steals 424 0 0 Kswapd skipped wait 0 0 0 THP fault alloc 14 15 119 THP collapse alloc 1767 1569 1618 THP splits 30 29 25 THP fault fallback 0 0 0 THP collapse fail 8 5 0 Compaction stalls 17 41 100 Compaction success 7 31 95 Compaction failures 10 10 5 Page migrate success 7083 22157 62217 Page migrate failure 0 0 0 Compaction pages isolated 14847 48758 135830 Compaction migrate scanned 18328 48398 138929 Compaction free scanned 2000255 355827 1720269 Compaction cost 7 24 68 I guess the main takeaway again is the much reduced page writes from reclaim context and reduced reads. 3.9.0 3.9.0 3.9.0 vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10 Mean sda-avgqz 23.58 0.35 0.44 Mean sda-await 133.47 15.72 15.46 Mean sda-r_await 4.72 4.69 3.95 Mean sda-w_await 507.69 28.40 33.68 Max sda-avgqz 680.60 12.25 23.14 Max sda-await 3958.89 221.83 286.22 Max sda-r_await 63.86 61.23 67.29 Max sda-w_await 11710.38 883.57 1767.28 And as before, write wait times are much reduced. This patch: The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages encountered, not priority" decides whether to writeback pages from reclaim context based on the number of dirty pages encountered. This situation is flagged too easily and flushers are not given the chance to catch up resulting in more pages being written from reclaim context and potentially impacting IO performance. The check for PageWriteback is also misplaced as it happens within a PageDirty check which is nonsense as the dirty may have been cleared for IO. The accounting is updated very late and pages that are already under writeback, were reactivated, could not unmapped or could not be released are all missed. Similarly, a page is considered congested for reasons other than being congested and pages that cannot be written out in the correct context are skipped. Finally, it considers stalling and writing back filesystem pages due to encountering dirty anonymous pages at the tail of the LRU which is dumb. This patch causes kswapd to begin writing filesystem pages from reclaim context only if page reclaim found that all filesystem pages at the tail of the LRU were unqueued dirty pages. Before it starts writing filesystem pages, it will stall to give flushers a chance to catch up. The decision on whether wait_iff_congested is also now determined by dirty filesystem pages only. Congested pages are based on whether the underlying BDI is congested regardless of the context of the reclaiming process. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: e2be15f6c3eecedfbe1550cca8d72c5057abbbd2 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I2c8aee00da5e3e9562984e792d16f9e11bd4a435 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:47 +05:30
Mel Gorman	6fe90c0c0c	mm: vmscan: move logic from balance_pgdat() to kswapd_shrink_zone() balance_pgdat() is very long and some of the logic can and should be internal to kswapd_shrink_zone(). Move it so the flow of balance_pgdat() is marginally easier to follow. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 7c954f6de6b630de30f265a079aad359f159ebe9 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I6c4e76e6e132c5982c228863c99195d7ad7768bc [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:47 +05:30
Mel Gorman	bc6bd99ed4	mm: vmscan: check if kswapd should writepage once per pgdat scan Currently kswapd checks if it should start writepage as it shrinks each zone without taking into consideration if the zone is balanced or not. This is not wrong as such but it does not make much sense either. This patch checks once per pgdat scan if kswapd should be writing pages. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: b7ea3c417b6c2e74ca1cb051568f60377908928d Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [vinmenon@codeaurora.org: resolve trivial merge conflicts] Change-Id: I7cb0fb685f8346f07d0fc4810f6c593334cd1590 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:47 +05:30
Mel Gorman	488cb10a86	mm: vmscan: block kswapd if it is encountering pages under writeback Historically, kswapd used to congestion_wait() at higher priorities if it was not making forward progress. This made no sense as the failure to make progress could be completely independent of IO. It was later replaced by wait_iff_congested() and removed entirely by commit `258401a6` (mm: don't wait on congested zones in balance_pgdat()) as it was duplicating logic in shrink_inactive_list(). This is problematic. If kswapd encounters many pages under writeback and it continues to scan until it reaches the high watermark then it will quickly skip over the pages under writeback and reclaim clean young pages or push applications out to swap. The use of wait_iff_congested() is not suited to kswapd as it will only stall if the underlying BDI is really congested or a direct reclaimer was unable to write to the underlying BDI. kswapd bypasses the BDI congestion as it sets PF_SWAPWRITE but even if this was taken into account then it would cause direct reclaimers to stall on writeback which is not desirable. This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is encountering too many pages under writeback. If this flag is set and kswapd encounters a PageReclaim page under writeback then it'll assume that the LRU lists are being recycled too quickly before IO can complete and block waiting for some IO to complete. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 283aba9f9e0e4882bf09bd37a2983379a6fae805 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: Ib34f1959c0e5265242152f98cc52c62ab7015993 [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:47 +05:30
Mel Gorman	ba3a73862f	mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority Currently kswapd queues dirty pages for writeback if scanning at an elevated priority but the priority kswapd scans at is not related to the number of unqueued dirty encountered. Since commit "mm: vmscan: Flatten kswapd priority loop", the priority is related to the size of the LRU and the zone watermark which is no indication as to whether kswapd should write pages or not. This patch tracks if an excessive number of unqueued dirty pages are being encountered at the end of the LRU. If so, it indicates that dirty pages are being recycled before flusher threads can clean them and flags the zone so that kswapd will start writing pages until the zone is balanced. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: d43006d503ac921c7df4f94d13c17db6f13c9d26 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I565caf3aef9f3e5f59cda1adc70207412719a2ed Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:47 +05:30
Mel Gorman	64fe14e549	mm: vmscan: do not allow kswapd to scan at maximum priority Page reclaim at priority 0 will scan the entire LRU as priority 0 is considered to be a near OOM condition. Kswapd can reach priority 0 quite easily if it is encountering a large number of pages it cannot reclaim such as pages under writeback. When this happens, kswapd reclaims very aggressively even though there may be no real risk of allocation failure or OOM. This patch prevents kswapd reaching priority 0 and trying to reclaim the world. Direct reclaimers will still reach priority 0 in the event of an OOM situation. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 9aa41348a8d11427feec350b21dcdd4330fd20c4 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I6bd5891e9f2b670b3c495cfad26d69af92e6d856 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:47 +05:30
Mel Gorman	89e36de3c4	mm: vmscan: decide whether to compact the pgdat based on reclaim progress In the past, kswapd makes a decision on whether to compact memory after the pgdat was considered balanced. This more or less worked but it is late to make such a decision and does not fit well now that kswapd makes a decision whether to exit the zone scanning loop depending on reclaim progress. This patch will compact a pgdat if at least the requested number of pages were reclaimed from unbalanced zones for a given priority. If any zone is currently balanced, kswapd will not call compaction as it is expected the necessary pages are already available. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 2ab44f434586b8ccb11f781b4c2730492e6628f5 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: Ie490e6df9576de1de1bc0c3c1b634618394dcf8e [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:46 +05:30
Mel Gorman	54715ebd2e	mm: vmscan: flatten kswapd priority loop kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX pages have been reclaimed or the pgdat is considered balanced. It then rechecks if it needs to restart at DEF_PRIORITY and whether high-order reclaim needs to be reset. This is not wrong per-se but it is confusing to follow and forcing kswapd to stay at DEF_PRIORITY may require several restarts before it has scanned enough pages to meet the high watermark even at 100% efficiency. This patch irons out the logic a bit by controlling when priority is raised and removing the "goto loop_again". This patch has kswapd raise the scanning priority until it is scanning enough pages that it could meet the high watermark in one shrink of the LRU lists if it is able to reclaim at 100% efficiency. It will not raise the scanning prioirty higher unless it is failing to reclaim any pages. To avoid infinite looping for high-order allocation requests kswapd will not reclaim for high-order allocations when it has reclaimed at least twice the number of pages as the allocation request. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: b8e83b942a16eb73e63406592d3178207a4f07a1 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I93ee675006800f2805408f2865150182bfd4b22b Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:46 +05:30
Mel Gorman	ece8cc6ba1	mm: vmscan: obey proportional scanning requirements for kswapd Simplistically, the anon and file LRU lists are scanned proportionally depending on the value of vm.swappiness although there are other factors taken into account by get_scan_count(). The patch "mm: vmscan: Limit the number of pages kswapd reclaims" limits the number of pages kswapd reclaims but it breaks this proportional scanning and may evenly shrink anon/file LRUs regardless of vm.swappiness. This patch preserves the proportional scanning and reclaim. It does mean that kswapd will reclaim more than requested but the number of pages will be related to the high watermark. [mhocko@suse.cz: Correct proportional reclaim for memcg and simplify] [kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target] [hannes@cmpxchg.org: Account for already scanned pages properly] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: e82e0561dae9f3ae5a21fc2d3d3ccbe69d90be46 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: I9dc9b73c0d73c27cda72181b4eb3f625e491f114 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-09 13:25:46 +05:30
Linux Build Service Account	cff9aa5e25	Merge "mm: vmscan: limit the number of pages kswapd reclaims at each priority"	2014-12-08 16:41:53 -08:00
Mel Gorman	5c4bdf3902	mm: vmscan: limit the number of pages kswapd reclaims at each priority This series does not fix all the current known problems with reclaim but it addresses one important swapping bug when there is background IO. Changelog since V3 - Drop the slab shrink changes in light of Glaubers series and discussions highlighted that there were a number of potential problems with the patch. (mel) - Rebased to 3.10-rc1 Changelog since V2 - Preserve ratio properly for proportional scanning (kamezawa) Changelog since V1 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi) - Reformat comment in shrink_page_list (andi) - Clarify some comments (dhillf) - Rework how the proportional scanning is preserved - Add PageReclaim check before kswapd starts writeback - Reset sc.nr_reclaimed on every full zone scan Kswapd and page reclaim behaviour has been screwy in one way or the other for a long time. Very broadly speaking it worked in the far past because machines were limited in memory so it did not have that many pages to scan and it stalled congestion_wait() frequently to prevent it going completely nuts. In recent times it has behaved very unsatisfactorily with some of the problems compounded by the removal of stall logic and the introduction of transparent hugepage support with high-order reclaims. There are many variations of bugs that are rooted in this area. One example is reports of a large copy operations or backup causing the machine to grind to a halt or applications pushed to swap. Sometimes in low memory situations a large percentage of memory suddenly gets reclaimed. In other cases an application starts and kswapd hits 100% CPU usage for prolonged periods of time and so on. There is now talk of introducing features like an extra free kbytes tunable to work around aspects of the problem instead of trying to deal with it. It's compounded by the problem that it can be very workload and machine specific. This series aims at addressing some of the worst of these problems without attempting to fundmentally alter how page reclaim works. Patches 1-2 limits the number of pages kswapd reclaims while still obeying the anon/file proportion of the LRUs it should be scanning. Patches 3-4 control how and when kswapd raises its scanning priority and deletes the scanning restart logic which is tricky to follow. Patch 5 notes that it is too easy for kswapd to reach priority 0 when scanning and then reclaim the world. Down with that sort of thing. Patch 6 notes that kswapd starts writeback based on scanning priority which is not necessarily related to dirty pages. It will have kswapd writeback pages if a number of unqueued dirty pages have been recently encountered at the tail of the LRU. Patch 7 notes that sometimes kswapd should stall waiting on IO to complete to reduce LRU churn and the likelihood that it'll reclaim young clean pages or push applications to swap. It will cause kswapd to block on IO if it detects that pages being reclaimed under writeback are recycling through the LRU before the IO completes. Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they are applied. This was tested using memcached+memcachetest while some background IO was in progress as implemented by the parallel IO tests implement in MM Tests. memcachetest benchmarks how many operations/second memcached can service and it is run multiple times. It starts with no background IO and then re-runs the test with larger amounts of IO in the background to roughly simulate a large copy in progress. The expectation is that the IO should have little or no impact on memcachetest which is running entirely in memory. 3.10.0-rc1 3.10.0-rc1 vanilla lessdisrupt-v4 Ops memcachetest-0M 22155.00 ( 0.00%) 22180.00 ( 0.11%) Ops memcachetest-715M 22720.00 ( 0.00%) 22355.00 ( -1.61%) Ops memcachetest-2385M 3939.00 ( 0.00%) 23450.00 (495.33%) Ops memcachetest-4055M 3628.00 ( 0.00%) 24341.00 (570.92%) Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) Ops io-duration-2385M 118.00 ( 0.00%) 21.00 ( 82.20%) Ops io-duration-4055M 162.00 ( 0.00%) 36.00 ( 77.78%) Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-715M 140134.00 ( 0.00%) 18.00 ( 99.99%) Ops swaptotal-2385M 392438.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-4055M 449037.00 ( 0.00%) 27864.00 ( 93.79%) Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-715M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-2385M 148031.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-4055M 135109.00 ( 0.00%) 0.00 ( 0.00%) Ops minorfaults-0M 1529984.00 ( 0.00%) 1530235.00 ( -0.02%) Ops minorfaults-715M 1794168.00 ( 0.00%) 1613750.00 ( 10.06%) Ops minorfaults-2385M 1739813.00 ( 0.00%) 1609396.00 ( 7.50%) Ops minorfaults-4055M 1754460.00 ( 0.00%) 1614810.00 ( 7.96%) Ops majorfaults-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops majorfaults-715M 185.00 ( 0.00%) 180.00 ( 2.70%) Ops majorfaults-2385M 24472.00 ( 0.00%) 101.00 ( 99.59%) Ops majorfaults-4055M 22302.00 ( 0.00%) 229.00 ( 98.97%) Note how the vanilla kernels performance collapses when there is enough IO taking place in the background. This drop in performance is part of what users complain of when they start backups. Note how the swapin and major fault figures indicate that processes were being pushed to swap prematurely. With the series applied, there is no noticable performance drop and while there is still some swap activity, it's tiny. 20 iterations of this test were run in total and averaged. Every 5 iterations, additional IO was generated in the background using dd to measure how the workload was impacted. The 0M, 715M, 2385M and 4055M subblock refer to the amount of IO going on in the background at each iteration. So memcachetest-2385M is reporting how many transactions/second memcachetest recorded on average over 5 iterations while there was 2385M of IO going on in the ground. There are six blocks of information reported here memcachetest is the transactions/second reported by memcachetest. In the vanilla kernel note that performance drops from around 22K/sec to just under 4K/second when there is 2385M of IO going on in the background. This is one type of performance collapse users complain about if a large cp or backup starts in the background io-duration refers to how long it takes for the background IO to complete. It's showing that with the patched kernel that the IO completes faster while not interfering with the memcache workload swaptotal is the total amount of swap traffic. With the patched kernel, the total amount of swapping is much reduced although it is still not zero. swapin in this case is an indication as to whether we are swap trashing. The closer the swapin/swapout ratio is to 1, the worse the trashing is. Note with the patched kernel that there is no swapin activity indicating that all the pages swapped were really inactive unused pages. minorfaults are just minor faults. An increased number of minor faults can indicate that page reclaim is unmapping the pages but not swapping them out before they are faulted back in. With the patched kernel, there is only a small change in minor faults majorfaults are just major faults in the target workload and a high number can indicate that a workload is being prematurely swapped. With the patched kernel, major faults are much reduced. As there are no swapin's recorded so it's not being swapped. The likely explanation is that that libraries or configuration files used by the workload during startup get paged out by the background IO. Overall with the series applied, there is no noticable performance drop due to background IO and while there is still some swap activity, it's tiny and the lack of swapins imply that the swapped pages were inactive and unused. 3.10.0-rc1 3.10.0-rc1 vanilla lessdisrupt-v4 Page Ins `1234608` 101892 Page Outs 12446272 11810468 Swap Ins 283406 0 Swap Outs 698469 27882 Direct pages scanned 0 136480 Kswapd pages scanned 6266537 5369364 Kswapd pages reclaimed 1088989 930832 Direct pages reclaimed 0 120901 Kswapd efficiency 17% 17% Kswapd velocity 5398.371 4635.115 Direct efficiency 100% 88% Direct velocity 0.000 117.817 Percentage direct scans 0% 2% Page writes by reclaim 1655843 4009929 Page writes file 957374 3982047 Page writes anon 698469 27882 Page reclaim immediate 5245 1745 Page rescued immediate 0 0 Slabs scanned 33664 25216 Direct inode steals 0 0 Kswapd inode steals 19409 778 Kswapd skipped wait 0 0 THP fault alloc 35 30 THP collapse alloc 472 401 THP splits 27 22 THP fault fallback 0 0 THP collapse fail 0 1 Compaction stalls 0 4 Compaction success 0 0 Compaction failures 0 4 Page migrate success 0 0 Page migrate failure 0 0 Compaction pages isolated 0 0 Compaction migrate scanned 0 0 Compaction free scanned 0 0 Compaction cost 0 0 NUMA PTE updates 0 0 NUMA hint faults 0 0 NUMA hint local faults 0 0 NUMA pages migrated 0 0 AutoNUMA cost 0 0 Unfortunately, note that there is a small amount of direct reclaim due to kswapd no longer reclaiming the world. ftrace indicates that the direct reclaim stalls are mostly harmless with the vast bulk of the stalls incurred by dd 23 tclsh-3367 38 memcachetest-13733 49 memcachetest-12443 57 tee-3368 1541 dd-13826 1981 dd-12539 A consequence of the direct reclaim for dd is that the processes for the IO workload may show a higher system CPU usage. There is also a risk that kswapd not reclaiming the world may mean that it stays awake balancing zones, does not stall on the appropriate events and continually scans pages it cannot reclaim consuming CPU. This will be visible as continued high CPU usage but in my own tests I only saw a single spike lasting less than a second and I did not observe any problems related to reclaim while running the series on my desktop. This patch: The number of pages kswapd can reclaim is bound by the number of pages it scans which is related to the size of the zone and the scanning priority. In many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large number of pages it cannot reclaim, it will raise the priority and potentially discard a large percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible effect is a reclaim "spike" where a large percentage of memory is suddenly freed. It would be bad enough if this was just unused memory but because of how anon/file pages are balanced it is possible that applications get pushed to swap unnecessarily. This patch limits the number of pages kswapd will reclaim to the high watermark. Reclaim will still overshoot due to it not being a hard limit as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it prevents kswapd reclaiming the world at higher priorities. The number of pages it reclaims is not adjusted for high-order allocations as kswapd will reclaim excessively if it is to balance zones for high-order allocations. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 75485363ce8552698bfb9970d901f755d5713cca Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Change-Id: Idfce2d7ebe6a809f47ce88344a4954a634e9470e [vinmenon@codeaurora.org: resolve trivial merge conflicts] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2014-12-03 19:20:51 +05:30
Linux Build Service Account	898b59d2d9	Merge "mm: vmscan: support setting of kswapd cpu affinity"	2014-11-26 22:28:13 -08:00

1 2 3 4 5 ...

7463 Commits