Commit Graph

98 Commits

Nathan Chancellor a626beca4c This is the 3.10.105 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYnZNdAAoJEE44bZycYXAvJv0P/jpPc+jKb+D0FVUOiYDkY5Rw
 jxsZ3oeruTIeSFAAIzusVMLm9moBJA6DThuTHU5Kt68mRaKB2lgmqwkkvQAPTSYh
 tnDQwlrF7dOSVmPczJFHalpaLpRdXdQP9r8y+38PibaFZPssKdnZr3BBfdOdi5DT
 lj029AGKfG7co6Hb/iAhsxuAFfPmvGHY4QNwJ2FRbU1m6MDtmCTbXzF0fc6X5AW1
 qrtaWwPulJtZ/5MPk7aFyNpuCpNvIaTEqNaQsZbuz3bHfzDQVLerWze98vgHC0QM
 2YOTP6TnEiHhxHGMb9SywUgSV1ylx0X542YDfxmcfyxBWRr0khlxQh1gpX+waqE3
 pqdSlvN7AFzifw6kubbG2/XjkNvFtJcDTgrL3qco4utIezSijXmoOsDpKNnJuzk/
 kSD5WYd+Q1CSHOkqZX29QPw1Dl/7Ftm7GPfxu7Pis1OBuPByqtRkEfmn9DpiKSs5
 Aja0ljZYiQ3jy3fH+WlEzo6PVSxx0ZxKg0fOShlpgjj8KjMUdGfl9cB1OZxyWnNH
 UiQ9iIWd3tJci7WbsBOfawsQpq3EIJxZKjyUmLYpBht5/YenYxOBDCr/CLJDQBGI
 IQUPAs/E1JGDxGTUY3AmsaMVrcX2yOfhLzjrsVJGqSdote0um+2PdTLZHE4MMiz2
 Dh6CbUVYWS1KNgmQ8T8L
 =k5mW
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEJDfLduVEy2qz2d/TmXOSYMtstxYFAlpqeiwACgkQmXOSYMts
 txaBAQ/+KqZh90YZI+gRHGdczbo3XnlryHMdpp+DTIFtN3zU+2LM352oP+haoJfr
 YhNsixcMhW5TX0is5fg4SkIc0B3ooGKZLVKOPIRw+1NLBAVG5yVuYxW7I1faJgk6
 F37+4rvq7KAOPCNMjAEXRt7GqZ4WZjgvgKy+u5wzKh3k5kUylqDwlP2qdgx2L5Rc
 IxyxgOuaVGV6dZTyAyRlRMild5Tlz+SMY4pWoMe0sulDDXhd5/5PnGNVIgh+XqB6
 m0AGkIIzPVe+wmg6n1iYs93dQO0Jmu6DL47Zv4f3ASZNL/XVSLvU9ie63FyWGZXG
 e52qAPtztXInEOo15vPQSAAq7McZHDTzhHhsU/ZtkBT+LeSUU+rsxXddJ2EO5UgC
 O3cVm11x1FWMzbBtFNFtkqeri2Y2OxvU4O81mfNP1oOUQBTMeSHTzQ8psbCdXeEr
 ktSOtI+nakPmDE3aq4YSaz7BwSgt2tU/vZehkrTxtAQJxt0b88r2xFfThy5WScT1
 v6muoqxlprjjvFld7v99P8cXxJq4QrxKUxXtEBTdB79Q5xtCC29OAcTelpPFDCED
 /KpgZflubzH/Z872AW9Ru8OL9PYty6hBNDOP4aHLSFWfCu3KQxL6BMEeqi5qBjBX
 mJ8JT0dCQYP6xONIWq6a3fICroNMazhNFxdpPSfsQFRhujhjGPg=
 =zhKv
 -----END PGP SIGNATURE-----

Merge 3.10.105 into android-msm-bullhead-3.10-oreo-m5

Changes in 3.10.105: (315 commits)
        sched/core: Fix a race between try_to_wake_up() and a woken up task
        sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()
        crypto: algif_skcipher - Require setkey before accept(2)
        crypto: af_alg - Disallow bind/setkey/... after accept(2)
        crypto: af_alg - Add nokey compatibility path
        crypto: algif_skcipher - Add nokey compatibility path
        crypto: hash - Add crypto_ahash_has_setkey
        crypto: shash - Fix has_key setting
        crypto: algif_hash - Require setkey before accept(2)
        crypto: skcipher - Add crypto_skcipher_has_setkey
        crypto: algif_skcipher - Add key check exception for cipher_null
        crypto: af_alg - Allow af_alg_release_parent to be called on nokey path
        crypto: algif_hash - Remove custom release parent function
        crypto: algif_skcipher - Remove custom release parent function
        crypto: af_alg - Forbid bind(2) when nokey child sockets are present
        crypto: algif_hash - Fix race condition in hash_check_key
        crypto: algif_skcipher - Fix race condition in skcipher_check_key
        crypto: algif_skcipher - Load TX SG list after waiting
        crypto: cryptd - initialize child shash_desc on import
        crypto: skcipher - Fix blkcipher walk OOM crash
        crypto: gcm - Fix IV buffer size in crypto_gcm_setkey
        MIPS: KVM: Fix unused variable build warning
        KVM: MIPS: Precalculate MMIO load resume PC
        KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
        KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write
        KVM: MIPS: Make ERET handle ERL before EXL
        KVM: x86: fix wbinvd_dirty_mask use-after-free
        KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
        KVM: Disable irq while unregistering user notifier
        PM / devfreq: Fix incorrect type issue.
        ppp: defer netns reference release for ppp channel
        x86/mm/xen: Suppress hugetlbfs in PV guests
        xen: Add RING_COPY_REQUEST()
        xen-netback: don't use last request to determine minimum Tx credit
        xen-netback: use RING_COPY_REQUEST() throughout
        xen-blkback: only read request operation from shared ring once
        xen/pciback: Save xen_pci_op commands before processing it
        xen/pciback: Save the number of MSI-X entries to be copied later.
        xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
        xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
        xen/pciback: Do not install an IRQ handler for MSI interrupts.
        xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
        xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
        xen-pciback: Add name prefix to global 'permissive' variable
        x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
        x86/traps: Ignore high word of regs->cs in early_idt_handler_common
        x86/mm: Disable preemption during CR3 read+write
        x86/apic: Do not init irq remapping if ioapic is disabled
        x86/mm/pat, /dev/mem: Remove superfluous error message
        x86/paravirt: Do not trace _paravirt_ident_*() functions
        x86/build: Build compressed x86 kernels as PIE
        x86/um: reuse asm-generic/barrier.h
        iommu/amd: Update Alias-DTE in update_device_table()
        iommu/amd: Free domain id when free a domain of struct dma_ops_domain
        ARM: 8616/1: dt: Respect property size when parsing CPUs
        ARM: 8618/1: decompressor: reset ttbcr fields to use TTBR0 on ARMv7
        ARM: sa1100: clear reset status prior to reboot
        ARM: sa1111: fix pcmcia suspend/resume
        arm64: avoid returning from bad_mode
        arm64: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO
        arm64: spinlocks: implement smp_mb__before_spinlock() as smp_mb()
        arm64: debug: avoid resetting stepping state machine when TIF_SINGLESTEP
        MIPS: Malta: Fix IOCU disable switch read for MIPS64
        MIPS: ptrace: Fix regs_return_value for kernel context
        powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET
        powerpc/vdso64: Use double word compare on pointers
        powerpc/powernv: Use CPU-endian PEST in pnv_pci_dump_p7ioc_diag_data()
        powerpc/64: Fix incorrect return value from __copy_tofrom_user
        powerpc/nvram: Fix an incorrect partition merge
        avr32: fix copy_from_user()
        avr32: fix 'undefined reference to `___copy_from_user'
        avr32: off by one in at32_init_pio()
        s390/dasd: fix hanging device after clear subchannel
        parisc: Ensure consistent state when switching to kernel stack at syscall entry
        microblaze: fix __get_user()
        microblaze: fix copy_from_user()
        mn10300: failing __get_user() and get_user() should zero
        m32r: fix __get_user()
        sh64: failing __get_user() should zero
        score: fix __get_user/get_user
        s390: get_user() should zero on failure
        ARC: uaccess: get_user to zero out dest in cause of fault
        asm-generic: make get_user() clear the destination on errors
        frv: fix clear_user()
        cris: buggered copy_from_user/copy_to_user/clear_user
        blackfin: fix copy_from_user()
        score: fix copy_from_user() and friends
        sh: fix copy_from_user()
        hexagon: fix strncpy_from_user() error return
        mips: copy_from_user() must zero the destination on access_ok() failure
        asm-generic: make copy_from_user() zero the destination properly
        alpha: fix copy_from_user()
        metag: copy_from_user() should zero the destination on access_ok() failure
        parisc: fix copy_from_user()
        openrisc: fix copy_from_user()
        openrisc: fix the fix of copy_from_user()
        mn10300: copy_from_user() should zero on access_ok() failure...
        sparc32: fix copy_from_user()
        ppc32: fix copy_from_user()
        ia64: copy_from_user() should zero the destination on access_ok() failure
        fix fault_in_multipages_...() on architectures with no-op access_ok()
        fix memory leaks in tracing_buffers_splice_read()
        arc: don't leak bits of kernel stack into coredump
        Fix potential infoleak in older kernels
        swapfile: fix memory corruption via malformed swapfile
        coredump: fix unfreezable coredumping task
        usb: dwc3: gadget: increment request->actual once
        USB: validate wMaxPacketValue entries in endpoint descriptors
        USB: fix typo in wMaxPacketSize validation
        usb: xhci: Fix panic if disconnect
        USB: serial: fix memleak in driver-registration error path
        USB: kobil_sct: fix non-atomic allocation in write path
        USB: serial: mos7720: fix non-atomic allocation in write path
        USB: serial: mos7840: fix non-atomic allocation in write path
        usb: renesas_usbhs: fix clearing the {BRDY,BEMP}STS condition
        USB: change bInterval default to 10 ms
        usb: gadget: fsl_qe_udc: signedness bug in qe_get_frame()
        USB: serial: cp210x: fix hardware flow-control disable
        usb: misc: legousbtower: Fix NULL pointer deference
        usb: gadget: function: u_ether: don't starve tx request queue
        USB: serial: cp210x: fix tiocmget error handling
        usb: gadget: u_ether: remove interrupt throttling
        usb: chipidea: move the lock initialization to core file
        Fix USB CB/CBI storage devices with CONFIG_VMAP_STACK=y
        ALSA: rawmidi: Fix possible deadlock with virmidi registration
        ALSA: timer: fix NULL pointer dereference in read()/ioctl() race
        ALSA: timer: fix division by zero after SNDRV_TIMER_IOCTL_CONTINUE
        ALSA: timer: fix NULL pointer dereference on memory allocation failure
        ALSA: ali5451: Fix out-of-bound position reporting
        ALSA: pcm : Call kill_fasync() in stream lock
        zfcp: fix fc_host port_type with NPIV
        zfcp: fix ELS/GS request&response length for hardware data router
        zfcp: close window with unblocked rport during rport gone
        zfcp: retain trace level for SCSI and HBA FSF response records
        zfcp: restore: Dont use 0 to indicate invalid LUN in rec trace
        zfcp: trace on request for open and close of WKA port
        zfcp: restore tracing of handle for port and LUN with HBA records
        zfcp: fix D_ID field with actual value on tracing SAN responses
        zfcp: fix payload trace length for SAN request&response
        zfcp: trace full payload of all SAN records (req,resp,iels)
        scsi: zfcp: spin_lock_irqsave() is not nestable
        scsi: mpt3sas: Fix secure erase premature termination
        scsi: mpt3sas: Unblock device after controller reset
        scsi: mpt3sas: fix hang on ata passthrough commands
        mpt2sas: Fix secure erase premature termination
        scsi: megaraid_sas: Fix data integrity failure for JBOD (passthrough) devices
        scsi: megaraid_sas: fix macro MEGASAS_IS_LOGICAL to avoid regression
        scsi: ibmvfc: Fix I/O hang when port is not mapped
        scsi: Fix use-after-free
        scsi: arcmsr: Buffer overflow in arcmsr_iop_message_xfer()
        scsi: scsi_debug: Fix memory leak if LBP enabled and module is unloaded
        scsi: arcmsr: Send SYNCHRONIZE_CACHE command to firmware
        ext4: validate that metadata blocks do not overlap superblock
        ext4: avoid modifying checksum fields directly during checksum verification
        ext4: use __GFP_NOFAIL in ext4_free_blocks()
        ext4: reinforce check of i_dtime when clearing high fields of uid and gid
        ext4: allow DAX writeback for hole punch
        ext4: sanity check the block and cluster size at mount time
        reiserfs: fix "new_insert_key may be used uninitialized ..."
        reiserfs: Unlock superblock before calling reiserfs_quota_on_mount()
        xfs: fix superblock inprogress check
        libxfs: clean up _calc_dquots_per_chunk
        btrfs: ensure that file descriptor used with subvol ioctls is a dir
        ocfs2/dlm: fix race between convert and migration
        ocfs2: fix start offset to ocfs2_zero_range_for_truncate()
        ubifs: Fix assertion in layout_in_gaps()
        ubifs: Fix xattr_names length in exit paths
        UBIFS: Fix possible memory leak in ubifs_readdir()
        ubifs: Abort readdir upon error
        ubifs: Fix regression in ubifs_readdir()
        UBI: fastmap: scrub PEB when bitflips are detected in a free PEB EC header
        NFSv4.x: Fix a refcount leak in nfs_callback_up_net
        NFSD: Using free_conn free connection
        NFS: Don't drop CB requests with invalid principals
        NFSv4: Open state recovery must account for file permission changes
        fs/seq_file: fix out-of-bounds read
        fs/super.c: fix race between freeze_super() and thaw_super()
        isofs: Do not return EACCES for unknown filesystems
        hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common()
        driver core: Delete an unnecessary check before the function call "put_device"
        driver core: fix race between creating/querying glue dir and its cleanup
        drm/radeon: fix radeon_move_blit on 32bit systems
        drm: Reject page_flip for !DRIVER_MODESET
        drm/radeon: Ensure vblank interrupt is enabled on DPMS transition to on
        qxl: check for kmap failures
        Input: i8042 - break load dependency between atkbd/psmouse and i8042
        Input: i8042 - set up shared ps2_cmd_mutex for AUX ports
        Input: ili210x - fix permissions on "calibrate" attribute
        hwrng: exynos - Disable runtime PM on probe failure
        hwrng: omap - Fix assumption that runtime_get_sync will always succeed
        hwrng: omap - Only fail if pm_runtime_get_sync returns < 0
        i2c-eg20t: fix race between i2c init and interrupt enable
        em28xx-i2c: rt_mutex_trylock() returns zero on failure
        i2c: core: fix NULL pointer dereference under race condition
        i2c: at91: fix write transfers by clearing pending interrupt first
        iio: accel: kxsd9: Fix raw read return
        iio: accel: kxsd9: Fix scaling bug
        thermal: hwmon: Properly report critical temperature in sysfs
        cdc-acm: fix wrong pipe type on rx interrupt xfers
        timers: Use proper base migration in add_timer_on()
        EDAC: Increment correct counter in edac_inc_ue_error()
        IB/ipoib: Fix memory corruption in ipoib cm mode connect flow
        IB/core: Fix use after free in send_leave function
        IB/ipoib: Don't allow MC joins during light MC flush
        IB/mlx4: Fix incorrect MC join state bit-masking on SR-IOV
        IB/mlx4: Fix create CQ error flow
        IB/uverbs: Fix leak of XRC target QPs
        IB/cm: Mark stale CM id's whenever the mad agent was unregistered
        mtd: blkdevs: fix potential deadlock + lockdep warnings
        mtd: pmcmsp-flash: Allocating too much in init_msp_flash()
        mtd: nand: davinci: Reinitialize the HW ECC engine in 4bit hwctl
        perf symbols: Fixup symbol sizes before picking best ones
        perf: Tighten (and fix) the grouping condition
        tty: Prevent ldisc drivers from re-using stale tty fields
        tty: limit terminal size to 4M chars
        tty: vt, fix bogus division in csi_J
        vt: clear selection before resizing
        drivers/vfio: Rework offsetofend()
        include/stddef.h: Move offsetofend() from vfio.h to a generic kernel header
        stddef.h: move offsetofend inside #ifndef/#endif guard, neaten
        ipv6: don't call fib6_run_gc() until routing is ready
        ipv6: split duplicate address detection and router solicitation timer
        ipv6: move DAD and addrconf_verify processing to workqueue
        ipv6: addrconf: fix dev refcont leak when DAD failed
        ipv6: fix rtnl locking in setsockopt for anycast and multicast
        ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()
        ipv6: correctly add local routes when lo goes up
        ipv6: dccp: fix out of bound access in dccp_v6_err()
        ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped
        ip6_tunnel: Clear IP6CB in ip6tunnel_xmit()
        ip6_tunnel: disable caching when the traffic class is inherited
        net/irda: handle iriap_register_lsap() allocation failure
        tcp: fix use after free in tcp_xmit_retransmit_queue()
        tcp: properly scale window in tcp_v[46]_reqsk_send_ack()
        tcp: fix overflow in __tcp_retransmit_skb()
        tcp: fix wrong checksum calculation on MTU probing
        tcp: take care of truncations done by sk_filter()
        bonding: Fix bonding crash
        net: ratelimit warnings about dst entry refcount underflow or overflow
        mISDN: Support DR6 indication in mISDNipac driver
        mISDN: Fixing missing validation in base_sock_bind()
        net: disable fragment reassembly if high_thresh is set to zero
        ipvs: count pre-established TCP states as active
        iwlwifi: pcie: fix access to scratch buffer
        svc: Avoid garbage replies when pc_func() returns rpc_drop_reply
        brcmsmac: Free packet if dma_mapping_error() fails in dma_rxfill
        brcmsmac: Initialize power in brcms_c_stf_ss_algo_channel_get()
        brcmfmac: avoid potential stack overflow in brcmf_cfg80211_start_ap()
        pstore: Fix buffer overflow while write offset equal to buffer size
        net/mlx4_core: Allow resetting VF admin mac to zero
        firewire: net: guard against rx buffer overflows
        firewire: net: fix fragmented datagram_size off-by-one
        netfilter: fix namespace handling in nf_log_proc_dostring
        can: bcm: fix warning in bcm_connect/proc_register
        net: fix sk_mem_reclaim_partial()
        net: avoid sk_forward_alloc overflows
        ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route
        packet: call fanout_release, while UNREGISTERING a netdev
        net: sctp, forbid negative length
        sctp: validate chunk len before actually using it
        net: clear sk_err_soft in sk_clone_lock()
        net: mangle zero checksum in skb_checksum_help()
        dccp: do not send reset to already closed sockets
        dccp: fix out of bound access in dccp_v4_err()
        sctp: assign assoc_id earlier in __sctp_connect
        neigh: check error pointer instead of NULL for ipv4_neigh_lookup()
        ipv4: use new_gw for redirect neigh lookup
        mac80211: fix purging multicast PS buffer queue
        mac80211: discard multicast and 4-addr A-MSDUs
        cfg80211: limit scan results cache size
        mwifiex: printk() overflow with 32-byte SSIDs
        ipv4: Set skb->protocol properly for local output
        net: sky2: Fix shutdown crash
        kaweth: fix firmware download
        tracing: Move mutex to protect against resetting of seq data
        kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
        Revert "ipc/sem.c: optimize sem_lock()"
        cfq: fix starvation of asynchronous writes
        drbd: Fix kernel_sendmsg() usage - potential NULL deref
        lib/genalloc.c: start search from start of chunk
        tools/vm/slabinfo: fix an unintentional printf
        rcu: Fix soft lockup for rcu_nocb_kthread
        ratelimit: fix bug in time interval by resetting right begin time
        mfd: core: Fix device reference leak in mfd_clone_cell
        PM / sleep: fix device reference leak in test_suspend
        mmc: mxs: Initialize the spinlock prior to using it
        mmc: block: don't use CMD23 with very old MMC cards
        pstore/core: drop cmpxchg based updates
        pstore/ram: Use memcpy_toio instead of memcpy
        pstore/ram: Use memcpy_fromio() to save old buffer
        mb86a20s: fix the locking logic
        mb86a20s: fix demod settings
        cx231xx: don't return error on success
        cx231xx: fix GPIOs for Pixelview SBTVD hybrid
        gpio: mpc8xxx: Correct irq handler function
        uio: fix dmem_region_start computation
        KEYS: Fix short sprintf buffer in /proc/keys show function
        hv: do not lose pending heartbeat vmbus packets
        staging: iio: ad5933: avoid uninitialized variable in error case
        mei: bus: fix received data size check in NFC fixup
        ACPI / APEI: Fix incorrect return value of ghes_proc()
        PCI: Handle read-only BARs on AMD CS553x devices
        tile: avoid using clocksource_cyc2ns with absolute cycle count
        dm flakey: fix reads to be issued if drop_writes configured
        mm,ksm: fix endless looping in allocating memory when ksm enable
        can: dev: fix deadlock reported after bus-off
        hwmon: (adt7411) set bit 3 in CFG1 register
        mpi: Fix NULL ptr dereference in mpi_powm() [ver #3]
        mfd: 88pm80x: Double shifting bug in suspend/resume
        ASoC: omap-mcpdm: Fix irq resource handling
        regulator: tps65910: Work around silicon erratum SWCZ010
        dm: mark request_queue dead before destroying the DM device
        fbdev/efifb: Fix 16 color palette entry calculation
        metag: Only define atomic_dec_if_positive conditionally
        Linux 3.10.105

Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>

Conflicts:
	arch/arm/mach-sa1100/generic.c
	arch/arm64/kernel/traps.c
	crypto/blkcipher.c
	drivers/devfreq/devfreq.c
	drivers/usb/dwc3/gadget.c
	drivers/usb/gadget/u_ether.c
	fs/ubifs/dir.c
	include/net/if_inet6.h
	lib/genalloc.c
	net/ipv6/addrconf.c
	net/ipv6/tcp_ipv6.c
	net/wireless/scan.c
	sound/core/timer.c
2018-01-25 17:45:32 -07:00
Nathan Chancellor 32224079fd This is the 3.10.76 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVQJeUAAoJEDjbvchgkmk+NlgP/jf4vyIubRhZNnEveDjCqCam
 OWVT/Q8fRLru9tEp8/b+gdWDDTZON7jJh5ERrwzzRud/LQwQudK+OhVlm7kKPXmY
 8uz6xdlcieKnMQleAQ/UiYqMv6VvjlAhNzUqcn60EeeAmMTix9WDFU0DFU2tvyFt
 qUR17px8vJDVI34Vv2d/Ihgt5lxMa8Bue4jZIQmPxdDHNiW9c8IUqr6vMDer4Ih0
 KuiES3FkQa61b5rEisNOWqEv/w+BH5Hn1XiN3gjBm5YznOhLHWJ5kR/2ewqCWbef
 DRdegojueSyN+ktzEDUnEWpC8zLhk3L4lBXILDSLBHvnoeEc17G5wMYKIe8DSk0B
 +tjSMq/IZMOaj68fYznOWH+UH3iBbQTSGriQShjR1zNqIz0XMMljkJNwnvzN4N3x
 0wNxr8mniIWvX7sCMYS6AWPFJCTNB6xi4mD7SJAeXzsD9+y/wUQkVymY0VkI7wxS
 OenmRy7GF7s/HLMUDTllEZ787dUdZSkrGLYooIii45dwWK87EB+4LkQT/9q1aPry
 JzAQYUxIK9vco/Xhy+CHfImoaovqzKHdxqgt5PbTHuCwj+3uKYChmBECQ4raeuU8
 JIqsr33wxlzhmzXhiwD4IcJvtDd6UiDeTpMtQOvBBce1gsKpL2+MS6s+8OiasO5F
 6TQV7+NtW4CS74ExjBjt
 =KO7Y
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEJDfLduVEy2qz2d/TmXOSYMtstxYFAlpqavQACgkQmXOSYMts
 txas9Q//d0gcyDkHLn+viInrU5Mm4QV5GmMRPL3LnsNIiiJEjoLxKCX48dmPv1/1
 Qk8IOMQRTfBn6l4lNc5AoEP2Q2SYqCKJLKBooRBoDPw9FlsrlJ2BdGz/Bm8SXSxQ
 fM3IQ76GpeW2YOXF7bVNrmOdlye5/JcmhkOT0augXFWKyTLVpVrqO/V4Nm07UAUe
 6iw82ZNxLN9Rcm6H0l92VqI0//gcxYGZi0dfpDoWrvoND10Sz2URMRxzcns8t5Ri
 1YmT+Y9cPO+3DCqNbTmwct1zQ2TlnJotm5S5DGqTznoGzNmv2PMTEIAWaWN63P0M
 T+cJgQxLDJp8ycr9C1oyfwmH2nvS3SQUGS8eDKXN0eNAbAaVrDN80Kl5pzQUfvba
 sUfK3IPU/8brZRcKYtvlc9TlEUlStxQSaUtcTXLpQ4bkoMst+9W/SFEKLScOW0No
 PBi3EhTpAjBN+EmDn6QHG0virHPn9M18s0Z1ZlSzdAyCA4mLY//JyabIEofCTz58
 mIXmfPix9iNoBL/bekGpmqF/4B9nymUrWHQDvUflXbFwf0DrrLnEEGMNmpjJk4bx
 oxDuRb0/gzD7zs2pbTaWeysnl5UsmVDiQM9xLQBowNo+I4NoPv2+u9klHii4qEHW
 JlWsWJg4lFy6u7a9rQYjLEZWVZSNX6Y6e3er5m3qhyCbQwlKIQU=
 =U1aW
 -----END PGP SIGNATURE-----

Merge 3.10.76 into android-msm-bullhead-3.10-oreo-m5

Changes in 3.10.76: (34 commits)
        conditionally define U32_MAX
        remove extra definitions of U32_MAX
        tcp: prevent fetching dst twice in early demux code
        ipv6: Don't reduce hop limit for an interface
        tcp: fix FRTO undo on cumulative ACK of SACKed range
        tcp: tcp_make_synack() should clear skb->tstamp
        8139cp: Call dev_kfree_skby_any instead of kfree_skb.
        8139too: Call dev_kfree_skby_any instead of dev_kfree_skb.
        r8169: Call dev_kfree_skby_any instead of dev_kfree_skb.
        bnx2: Call dev_kfree_skby_any instead of dev_kfree_skb.
        tg3: Call dev_kfree_skby_any instead of dev_kfree_skb.
        ixgb: Call dev_kfree_skby_any instead of dev_kfree_skb.
        benet: Call dev_kfree_skby_any instead of kfree_skb.
        serial: 8250_dw: Fix deadlock in LCR workaround
        jfs: fix readdir regression
        splice: Apply generic position and size checks to each write
        mm: Fix NULL pointer dereference in madvise(MADV_WILLNEED) support
        Bluetooth: Enable Atheros 0cf3:311e for firmware upload
        Bluetooth: Add firmware update for Atheros 0cf3:311f
        Bluetooth: btusb: Add IMC Networks (Broadcom based)
        Bluetooth: Add support for Intel bootloader devices
        Bluetooth: Ignore isochronous endpoints for Intel USB bootloader
        netfilter: conntrack: disable generic tracking for known protocols
        KVM: x86: SYSENTER emulation is broken
        kconfig: Fix warning "‘jump’ may be used uninitialized"
        move d_rcu from overlapping d_child to overlapping d_alias
        deal with deadlock in d_walk()
        vm: add VM_FAULT_SIGSEGV handling support
        vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUS
        x86: mm: move mmap_sem unlock from mm_fault_error() to caller
        sb_edac: avoid INTERNAL ERROR message in EDAC with unspecified channel
        arc: mm: Fix build failure
        dcache: Fix locking bugs in backported "deal with deadlock in d_walk()"
        Linux 3.10.76

Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
2018-01-25 16:40:36 -07:00
zhong jiang 3419dc516e mm,ksm: fix endless looping in allocating memory when ksm enable
commit 5b398e416e880159fe55eefd93c6588fa072cd66 upstream.

I hit the following hung task when running an OOM LTP test case with a
4.1 kernel.

Call trace:
[<ffffffc000086a88>] __switch_to+0x74/0x8c
[<ffffffc000a1bae0>] __schedule+0x23c/0x7bc
[<ffffffc000a1c09c>] schedule+0x3c/0x94
[<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350
[<ffffffc000a1e32c>] down_write+0x64/0x80
[<ffffffc00021f794>] __ksm_exit+0x90/0x19c
[<ffffffc0000be650>] mmput+0x118/0x11c
[<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74
[<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4
[<ffffffc0000d0f34>] get_signal+0x444/0x5e0
[<ffffffc000089fcc>] do_signal+0x1d8/0x450
[<ffffffc00008a35c>] do_notify_resume+0x70/0x78

The OOM victim cannot terminate because it needs to take mmap_sem for
write while the lock is held by ksmd for read, which loops in the page
allocator:

ksm_do_scan
	scan_get_next_rmap_item
		down_read
		get_next_rmap_item
			alloc_rmap_item   #ksmd will loop permanently.

There is no way forward because the OOM victim cannot release any memory
in a 4.1-based kernel.  Since 4.6 we have the OOM reaper, which would solve
this problem because it would release the memory asynchronously.
Nevertheless we can relax alloc_rmap_item requirements and use
__GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
would just retry later after the lock got dropped.

Such a patch would be also easy to backport to older stable kernels which
do not have oom_reaper.

While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed
by the allocation failure.
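
A minimal sketch of the relaxed allocation described above, assuming the
mm/ksm.c helper named in the call chain (the exact hunk in the backport may
differ):

        static inline struct rmap_item *alloc_rmap_item(void)
        {
                struct rmap_item *rmap_item;

                /*
                 * __GFP_NORETRY: do not loop in the allocator while ksmd
                 * holds mmap_sem for read; __GFP_NOWARN: failure is expected
                 * and ksm_do_scan() simply retries on a later pass.
                 */
                rmap_item = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL |
                                              __GFP_NORETRY | __GFP_NOWARN);
                if (rmap_item)
                        ksm_rmap_items++;
                return rmap_item;
        }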

Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Suggested-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-02-10 11:04:06 +01:00
Linus Torvalds 0c42d1fbb3 vm: add VM_FAULT_SIGSEGV handling support
commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream.

The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.

That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works.  However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV.  And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.

However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space.  And user space really
expected SIGSEGV, not SIGBUS.

To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it.  They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.

This is the mindless minimal patch to do this.  A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.

Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.
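
For a sense of what each architecture has to do, here is a condensed, hedged
sketch modelled on the x86 fault-error path (helper names are x86's,
do_sigbus() and bad_area_nosemaphore(); other architectures use their own
equivalents, and the real branch also distinguishes kernel-mode faults):

        /* in the architecture's fault-error path (x86: mm_fault_error()) */
        if (fault & VM_FAULT_OOM) {
                /* existing out-of-memory handling, unchanged */
                up_read(&current->mm->mmap_sem);
                pagefault_out_of_memory();
        } else if (fault & (VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)) {
                do_sigbus(regs, error_code, address, fault);
        } else if (fault & VM_FAULT_SIGSEGV) {
                /* new case: the VM asked for SIGSEGV rather than SIGBUS */
                bad_area_nosemaphore(regs, error_code, address);
        } else {
                BUG();
        }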

Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[shengyong: Backport to 3.10
 - adjust context
 - ignore modification for arch nios2, because 3.10 does not support it
 - ignore modification for driver lustre, because 3.10 does not support it
 - ignore VM_FAULT_FALLBACK in VM_FAULT_ERROR, because 3.10 does not support
   this flag
 - add SIGSEGV handling to powerpc/cell spu_fault.c, because 3.10 does not
   separate it to copro_fault.c
 - add SIGSEGV handling in mm/memory.c, because 3.10 does not separate it
   to gup.c
]
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29 10:34:00 +02:00
Laura Abbott 28851c135c ksm: Add showmem notifier
KSM is yet another framework which may obfuscate some memory
problems. Use the showmem notifier to show how KSM is being
used to give some insight into potential issues or non-issues.

Change-Id: If82405dc33f212d085e6847f7c511fd4d0a32a10
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
2014-10-06 09:53:55 -07:00
Abhimanyu Garg af91c53405 KSM: Start KSM by default
Start running KSM by default at device bootup.

Change-Id: I7926c529ea42675f4279bffaf149a0cf1080d61b
Signed-off-by: Abhimanyu Garg <agarg@codeaurora.org>
2014-06-12 06:52:48 -07:00
Ian Maund 356fb13538 Merge upstream linux-stable v3.10.36 into msm-3.10
* commit 'v3.10.36': (494 commits)
  Linux 3.10.36
  netfilter: nf_conntrack_dccp: fix skb_header_pointer API usages
  mm: close PageTail race
  net: mvneta: rename MVNETA_GMAC2_PSC_ENABLE to MVNETA_GMAC2_PCS_ENABLE
  x86: fix boot on uniprocessor systems
  Input: cypress_ps2 - don't report as a button pads
  Input: synaptics - add manual min/max quirk for ThinkPad X240
  Input: synaptics - add manual min/max quirk
  Input: mousedev - fix race when creating mixed device
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  Linux 3.10.35
  sched/autogroup: Fix race with task_groups list
  e100: Fix "disabling already-disabled device" warning
  xhci: Fix resume issues on Renesas chips in Samsung laptops
  Input: wacom - make sure touch_max is set for touch devices
  KVM: VMX: fix use after free of vmx->loaded_vmcs
  KVM: x86: handle invalid root_hpa everywhere
  KVM: MMU: handle invalid root_hpa at __direct_map
  Input: elantech - improve clickpad detection
  ARM: highbank: avoid L2 cache smc calls when PL310 is not present
  ...

Change-Id: Ib68f565291702c53df09e914e637930c5d3e5310
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-04-23 16:23:49 -07:00
David Rientjes def52acc90 mm: close PageTail race
commit 668f9abbd4334e6c29fa8acd71635c4f9101caa7 upstream.

Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page).  Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation.  This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.  The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.
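
A hedged sketch of the reader side after this change (essentially what
compound_head() is described to become; the pairing store barrier sits in
prep_compound_page(), and mainline details may differ slightly):

        static inline struct page *compound_head(struct page *page)
        {
                if (unlikely(PageTail(page))) {
                        struct page *head = page->first_page;

                        /*
                         * Pairs with the store barrier added to
                         * prep_compound_page(): only trust 'head' if the page
                         * still looks like a tail after first_page was read.
                         */
                        smp_rmb();
                        if (likely(PageTail(page)))
                                return head;
                }
                return page;
        }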

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-03 12:01:05 -07:00
Chintan Pandya 913b3be967 ksm: Provide support to use deferred timers for scanner thread
The KSM thread that scans pages is scheduled on a fixed timeout.
That wakes the CPU from its idle state and hence may affect power
consumption. Provide optional support for using a deferred timer,
which suits low-power use cases.

To enable deferred timers,
$ echo 1 > /sys/kernel/mm/ksm/deferred_timer

Change-Id: I07fe199f97fe1f72f9a9e1b0b757a3ac533719e8
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
2014-03-17 18:43:12 +05:30
Hugh Dickins d8fc16a825 ksm: fix m68k build: only NUMA needs pfn_to_nid
A CONFIG_DISCONTIGMEM=y m68k config gave

  mm/ksm.c: In function `get_kpfn_nid':
  mm/ksm.c:492: error: implicit declaration of function `pfn_to_nid'

linux/mmzone.h declares it for CONFIG_SPARSEMEM and CONFIG_FLATMEM, but
expects the arch's asm/mmzone.h to declare it for CONFIG_DISCONTIGMEM
(see arch/mips/include/asm/mmzone.h for example).

Or perhaps it is only expected when CONFIG_NUMA=y: too much of a maze,
and m68k got away without it so far, so fix the build in mm/ksm.c.
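
The build fix boils down to keeping the pfn_to_nid() reference out of
!CONFIG_NUMA builds; a hedged sketch of the resulting shape of
get_kpfn_nid() (using the NUMA() wrapper, which discards its argument when
NUMA is off; the actual hunk may differ):

        static inline int get_kpfn_nid(unsigned long kpfn)
        {
                /*
                 * NUMA(x) expands to (0) when !CONFIG_NUMA, so pfn_to_nid()
                 * is never referenced on configurations that do not provide it.
                 */
                return ksm_merge_across_nodes ? 0 : NUMA(pfn_to_nid(kpfn));
        }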

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-08 15:05:34 -08:00
Sasha Levin b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for-each-entry iterators were conceived
differently from the list ones, which look like

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
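
Concretely, the conversion just drops the extra cursor, so an hlist walk
reads like a list walk; an illustrative before/after (struct foo and
do_something() are stand-ins, not names from the patch):

        /* before: an extra struct hlist_node cursor had to be threaded through */
        struct hlist_node *pos;
        struct foo *tpos;

        hlist_for_each_entry(tpos, pos, head, member)
                do_something(tpos);

        /* after: only the typed entry cursor remains, like list_for_each_entry() */
        hlist_for_each_entry(tpos, head, member)
                do_something(tpos);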

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Hugh Dickins ef53d16cde ksm: allocate roots when needed
It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
allocated, particularly when very few users will ever actually tune
merge_across_nodes 0 to use more than 1+1 of those trees.  Not a big
deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.

Start off with 1+1 statically allocated, then if merge_across_nodes is
ever tuned, allocate for nr_node_ids+nr_node_ids.  Do not attempt to
free up the extra if it's tuned back, that would be a waste of effort.
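
A rough sketch of the "1+1 static, allocate more only if tuned" arrangement
(names follow mm/ksm.c; the allocation path shown is illustrative, not the
verbatim sysfs store handler):

        static struct rb_root one_stable_tree[1] = { RB_ROOT };
        static struct rb_root one_unstable_tree[1] = { RB_ROOT };
        static struct rb_root *root_stable_tree = one_stable_tree;
        static struct rb_root *root_unstable_tree = one_unstable_tree;

        /* on first write of merge_across_nodes=0, switch to per-node roots */
        struct rb_root *buf;

        buf = kcalloc(nr_node_ids + nr_node_ids, sizeof(*buf), GFP_KERNEL);
        if (buf) {
                root_stable_tree = buf;
                root_unstable_tree = buf + nr_node_ids;
                /* keep the larger arrays even if the knob is tuned back */
        }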

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:24 -08:00
Hugh Dickins 5117b3b835 mm,ksm: FOLL_MIGRATION do migration_entry_wait
In "ksm: remove old stable nodes more thoroughly" I said that I'd never
seen its WARN_ON_ONCE(page_mapped(page)).  True at the time of writing,
but it soon appeared once I tried fuller tests on the whole series.

It turned out to be due to the KSM page migration itself: unmerge_and_
remove_all_rmap_items() failed to locate and replace all the KSM pages,
because of that hiatus in page migration when old pte has been replaced
by migration entry, but not yet by new pte.  follow_page() finds no page
at that instant, but a KSM page reappears shortly after, without a
fault.

Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
for KSM's break_cow().  I'd have preferred to avoid another flag, and do
it every time, in case someone else makes the same easy mistake; but did
not find another transgressor (the common get_user_pages() is of course
safe), and cannot be sure that every follow_page() caller is prepared to
sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
already sleep there, since anon_vma locking was changed to mutex, but
maybe that's somehow excluded.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Hugh Dickins bc56620b49 ksm: shrink 32-bit rmap_item back to 32 bytes
Think of struct rmap_item as an extension of struct page (restricted to
MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
small, especially on 32-bit architectures of limited lowmem.

Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
making no change to its 64-byte struct rmap_item; but bloats the 32-bit
struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
rounds up to 40 bytes once allocated from slab.  We'd better avoid that.

Hey, I only just remembered that the anon_vma pointer in struct
rmap_item has no purpose until the rmap_item is hung from a stable tree
node (which has its own nid field); and rmap_item's nid field has no purpose
other than to say which tree root to tell rb_erase() when unlinking from an
unstable tree.

Double them up in a union.  There's just one place where we set anon_vma
early (when we already hold mmap_sem): now we must remove tree_rmap_item
from its unstable tree there, before overwriting nid.  No need to
spatter BUG()s around: we'd be seeing oopses if this were wrong.
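
A sketch of the resulting layout, assuming the mm/ksm.c structure (comments
paraphrase the roles described above; surrounding fields abbreviated):

        struct rmap_item {
                struct rmap_item *rmap_list;
                union {
                        struct anon_vma *anon_vma; /* once hung from a stable tree */
        #ifdef CONFIG_NUMA
                        int nid;                   /* only while in an unstable tree */
        #endif
                };
                struct mm_struct *mm;
                unsigned long address;             /* low bits doubled up as flags */
                unsigned int oldchecksum;          /* when unstable */
                union {
                        struct rb_node node;       /* when node of unstable tree */
                        struct {                   /* when listed from stable tree */
                                struct stable_node *head;
                                struct hlist_node hlist;
                        };
                };
        };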

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Hugh Dickins b599cbdf1c ksm: treat unstable nid like in stable tree
An inconsistency emerged in reviewing the NUMA node changes to KSM: when
meeting a page from the wrong NUMA node in a stable tree, we say that
it's okay for comparisons, but not as a leaf for merging; whereas when
meeting a page from the wrong NUMA node in an unstable tree, we bail out
immediately.

Now, it might be that a wrong NUMA node in an unstable tree is more
likely to correlate with instability (different content, with rbnode
now misplaced) than page migration; but even so, we are accustomed to
instability in the unstable tree.

Without strong evidence for which strategy is generally better, I'd
rather be consistent with what's done in the stable tree: accept a page
from the wrong NUMA node for comparison, but not as a leaf for merging.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Hugh Dickins 8fdb3dbf02 ksm: add some comments
Added slightly more detail to the Documentation of merge_across_nodes, a
few comments in areas indicated by review, and renamed get_ksm_page()'s
argument from "locked" to "lock_it".  No functional change.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Hugh Dickins ef4d43a807 ksm: stop hotremove lockdep warning
Complaints are rare, but lockdep still does not understand the way
ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds
it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a
problem because notifier callbacks are made under down_read of
blocking_notifier_head->rwsem (so first the mutex is taken while holding
the rwsem, then later the rwsem is taken while still holding the mutex);
but is not in fact a problem because mem_hotplug_mutex is held
throughout the dance.

There was an attempt to fix this with mutex_lock_nested(); but if that
happened to fool lockdep two years ago, apparently it does so no longer.

I had hoped to eradicate this issue in extending KSM page migration not
to need the ksm_thread_mutex.  But then realized that although the page
migration itself is safe, we do still need to lock out ksmd and other
users of get_ksm_page() while offlining memory - at some point between
MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
vanish, and get_ksm_page()'s accesses to them become a violation.

So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
checks, to achieve the same lockout without being caught by lockdep.
This is less elegant for KSM, but it's more important to keep lockdep
useful to other users - and I apologize for how long it took to fix.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:20 -08:00
Hugh Dickins 4146d2d673 ksm: make !merge_across_nodes migration safe
The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
set to non-default 0: if a KSM page is migrated to a different NUMA node,
how do we migrate its stable node to the right tree?  And what if that
collides with an existing stable node?

ksm_migrate_page() can do no more than it's already doing, updating
stable_node->kpfn: the stable tree itself cannot be manipulated without
holding ksm_thread_mutex.  So accept that a stable tree may temporarily
indicate a page belonging to the wrong NUMA node, leave updating until the
next pass of ksmd, just be careful not to merge other pages on to a
misplaced page.  Note nid of holding tree in stable_node, and recognize
that it will not always match nid of kpfn.

A misplaced KSM page is discovered, either when ksm_do_scan() next comes
around to one of its rmap_items (we now have to go to cmp_and_merge_page
even on pages in a stable tree), or when stable_tree_search() arrives at a
matching node for another page, and this node page is found misplaced.

In each case, move the misplaced stable_node to a list of migrate_nodes
(and use the address of migrate_nodes as magic by which to identify them):
we don't need them in a tree.  If stable_tree_search() finds no match for
a page, but it's currently exiled to this list, then slot its stable_node
right there into the tree, bringing all of its mappings with it; otherwise
they get migrated one by one to the original page of the colliding node.
stable_tree_search() is now modelled more like stable_tree_insert(), in
order to handle these insertions of migrated nodes.

remove_node_from_stable_tree(), remove_all_stable_nodes() and
ksm_check_stable_tree() have to handle the migrate_nodes list as well as
the stable tree itself.  Less obviously, we do need to prune the list of
stale entries from time to time (scan_get_next_rmap_item() does it once
each full scan): whereas stale nodes in the stable tree get naturally
pruned as searches try to brush past them, these migrate_nodes may get
forgotten and accumulate.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
Hugh Dickins c8d6553b95 ksm: make KSM page migration possible
KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.

But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree?  And what if
that collides with an existing stable node?

So far there's no provision for that, and this patch does not attempt to
deal with it either.  But how will I test a solution, when I don't know
how to hotremove memory?  The best answer is to enable KSM page migration
in all cases now, and test more common cases.  With THP and compaction
added since KSM came in, page migration is now mainstream, and it's a
shame that a KSM page can frustrate freeing a page block.

Without worrying about merge_across_nodes 0 for now, this patch gets KSM
page migration working reliably for default merge_across_nodes 1 (but
leave the patch enabling it until near the end of the series).

It's much simpler than I'd originally imagined, and does not require an
additional tier of locking: page migration relies on the page lock, KSM
page reclaim relies on the page lock, the page lock is enough for KSM page
migration too.

Almost all the care has to be in get_ksm_page(): that's the function which
worries about when a stable node is stale and should be freed, now it also
has to worry about the KSM page being migrated.

The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here.  That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
swapcache); but this works well, I've no urge to pull it apart now.

(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)

You might worry about mremap, and whether page migration's rmap_walk to
remove migration entries will find all the KSM locations where it inserted
earlier: that should already be handled, by the satisfyingly heavy hammer
of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
Hugh Dickins cbf86cfe04 ksm: remove old stable nodes more thoroughly
Switching merge_across_nodes after running KSM is liable to oops on stale
nodes still left over from the previous stable tree.  It's not something
that people will often want to do, but it would be lame to demand a reboot
when they're trying to determine which merge_across_nodes setting is best.

How can this happen?  We only permit switching merge_across_nodes when
pages_shared is 0, and usually set run 2 to force that beforehand, which
ought to unmerge everything: yet oopses still occur when you then run 1.

Three causes:

1. The old stable tree (built according to the inverse
   merge_across_nodes) has not been fully torn down.  A stable node
   lingers until get_ksm_page() notices that the page it references no
   longer references it: but the page is not necessarily freed as soon as
   expected, particularly when swapcache.

   Fix this with a pass through the old stable tree, applying
   get_ksm_page() to each of the remaining nodes (most found stale and
   removed immediately), with forced removal of any left over.  Unless the
   page is still mapped: I've not seen that case, it shouldn't occur, but
   better to WARN_ON_ONCE and EBUSY than BUG.

2. __ksm_enter() has a nice little optimization, to insert the new mm
   just behind ksmd's cursor, so there's a full pass for it to stabilize
   (or be removed) before ksmd addresses it.  Nice when ksmd is running,
   but not so nice when we're trying to unmerge all mms: we were missing
   those mms forked and inserted behind the unmerge cursor.  Easily fixed
   by inserting at the end when KSM_RUN_UNMERGE.

3.  It is possible for a KSM page to be faulted back from swapcache
   into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
   it.  Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
   private to ksm.c, so dissolve the distinction between
   ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
   the one call into ksm.c.

A long outstanding, unrelated bugfix sneaks in with that third fix:
ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
error when read in from swap) to a page which it then marks Uptodate.  Fix
this case by not copying, letting do_swap_page() discover the error.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
Hugh Dickins 8aafa6a485 ksm: get_ksm_page locked
In some places where get_ksm_page() is used, we need the page to be locked.

When KSM migration is fully enabled, we shall want that to make sure that
the page just acquired cannot be migrated beneath us (raised page count is
only effective when there is serialization to make sure migration
notices).  Whereas when navigating through the stable tree, we certainly
do not want to lock each node (raised page count is enough to guarantee
the memcmps, even if page is migrated to another node).

Since we're about to add another use case, add the locked argument to
get_ksm_page() now.
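
Illustrative call sites for the two cases described above (hedged sketch;
the prototype takes the stable_node and a boolean "locked" argument):

        /* tree descent: the raised page count is enough to memcmp against the page */
        page = get_ksm_page(stable_node, false);

        /* caller must prevent migration underneath it: take the page lock as well */
        page = get_ksm_page(stable_node, true);
        if (page) {
                /* ... operate on the locked page ... */
                unlock_page(page);
                put_page(page);
        }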

Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
really got the wrong end of the stick on that!  There's a configuration in
which page_cache_get_speculative() can do something cheaper than
get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
disabled preemption for it.  There's no need for rcu_read_lock() around
get_page_unless_zero() (and mapping checks) here.  Cut out that silliness
before making this any harder to understand.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
Hugh Dickins ee0ea59cf9 ksm: reorganize ksm_check_stable_tree
Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
(restarting whenever it finds a stale node to remove), but rearrange so
that at least it does not needlessly restart from nid 0 each time.  And
add a couple of comments: here is why we keep pfn instead of page.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
Hugh Dickins e850dcf530 ksm: trivial tidyups
Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef
CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by nid
when not NUMA).  Add comment, remove "unsigned" from rmap_item->nid, as
"int nid" elsewhere.  Define ksm_merge_across_nodes 1U when #ifndef NUMA
to help the compiler optimize it out.  Use ?: in get_kpfn_nid().  Adjust
a few comments noticed in ongoing work.
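
The macros amount to no more than this (illustrative sketch):

#ifdef CONFIG_NUMA
#define NUMA(x)		(x)
#define DO_NUMA(x)	do { (x); } while (0)
#else
#define NUMA(x)		(0)
#define DO_NUMA(x)	do { } while (0)
#endif

so NUMA(x) reads a nid (or 0 when not configured) and DO_NUMA(x)
compiles away entirely without CONFIG_NUMA.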

Leave stable_tree_insert()'s rb_linkage until after the node has been
set up, as unstable_tree_search_insert() does: ksm_thread_mutex and page
lock make either way safe, but we're going to copy and I prefer this
precedent.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
Petr Holasek 90bd6fd31c ksm: allow trees per NUMA node
Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
we had with that, fully enabling KSM page migration on the way.

(A different kind of KSM/NUMA issue which I've certainly not begun to
address here: when KSM pages are unmerged, there's usually no sense in
preferring to allocate the new pages local to the caller's node.)

This patch:

Introduces a new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
which controls merging of pages across different NUMA nodes.  When it is
set to zero, only pages from the same node are merged; otherwise pages
from all nodes can be merged together (the default behavior).

A typical use case is a lot of KVM guests on a NUMA machine, where cpus
on more distant nodes would see a significant increase in access latency
to the merged ksm page.  A sysfs knob was chosen for flexibility, since
some users still prefer a higher amount of saved physical memory
regardless of access latency.

Every NUMA node has its own stable & unstable trees, for faster searching
and inserting.  Changing the merge_across_nodes value is possible only
when there are no ksm shared pages in the system.
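
Conceptually (simplified sketch, not the verbatim patch), the trees
become per-node arrays and the node index is taken from the page only
when merging across nodes is disabled:

static struct rb_root root_stable_tree[MAX_NUMNODES];
static struct rb_root root_unstable_tree[MAX_NUMNODES];

static inline int get_kpfn_nid(unsigned long kpfn)
{
	return ksm_merge_across_nodes ? 0 : pfn_to_nid(kpfn);
}

so e.g. a stable tree search starts from
&root_stable_tree[get_kpfn_nid(page_to_pfn(page))].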

I've tested this patch on numa machines with 2, 4 and 8 nodes and
measured speed of memory access inside of KVM guests with memory pinned
to one of nodes with this benchmark:

  http://pholasek.fedorapeople.org/alloc_pg.c

Population standard deviations of access times in percentage of average
were following:

merge_across_nodes=1
2 nodes	1.4%
4 nodes	1.6%
8 nodes	1.7%

merge_across_nodes=0
2 nodes	1%
4 nodes	0.32%
8 nodes	0.018%

RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225

Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:

1) switching merge_across_nodes after running KSM is liable to oops
   on stale nodes still left over from the previous stable tree;

2) memory hotremove may migrate KSM pages, but there is no provision
   here for !merge_across_nodes to migrate nodes to the proper tree.

Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
Sasha Levin 4ca3a69bcb mm/ksm.c: use new hashtable implementation
Switch ksm to use the new hashtable implementation.  This reduces the
amount of generic unrelated code in the ksm module.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:10 -08:00
Johannes Weiner af34770e55 mm: reduce rmap overhead for ex-KSM page copies created on swap faults
When ex-KSM pages are faulted from swap cache, the fault handler is not
capable of re-establishing anon_vma-spanning KSM pages.  In this case, a
copy of the page is created instead, just like during a COW break.

These freshly made copies are known to be exclusive to the faulting VMA
and there is no reason to go look for this page in parent and sibling
processes during rmap operations.

Use page_add_new_anon_rmap() for these copies.  This also puts them on
the proper LRU lists and marks them SwapBacked, so we can get rid of
doing this ad-hoc in the KSM copy code.
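
In the swap fault path this amounts to something like the following
(simplified sketch; the local names swapcache and exclusive are
illustrative, not necessarily those in do_swap_page()):

	if (page != swapcache)	/* ksm made a fresh, exclusive copy */
		page_add_new_anon_rmap(page, vma, address);
	else
		do_page_add_anon_rmap(page, vma, address, exclusive);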

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:09 -08:00
Hugh Dickins b6b19f25f6 ksm: make rmap walks more scalable
The rmap walks in ksm.c are like those in rmap.c: they can safely be
done with anon_vma_lock_read().

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 07:06:56 -08:00
Linus Torvalds 3d59eebc5e Automatic NUMA Balancing V11
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJQx0kQAAoJEHzG/DNEskfi4fQP/R5PRovayroZALBMLnVJDaLD
 Ttr9p40VNXbiJ+MfRgatJjSSJZ4Jl+fC3NEqBhcwVZhckZZb9R2s0WtrSQo5+ZbB
 vdRfiuKoCaKM4cSZ08C12uTvsF6xjhjd27CTUlMkyOcDoKxMEFKelv0hocSxe4Wo
 xqlv3eF+VsY7kE1BNbgBP06SX4tDpIHRxXfqJPMHaSKQmre+cU0xG2GcEu3QGbHT
 DEDTI788YSaWLmBfMC+kWoaQl1+bV/FYvavIAS8/o4K9IKvgR42VzrXmaFaqrbgb
 72ksa6xfAi57yTmZHqyGmts06qYeBbPpKI+yIhCMInxA9CY3lPbvHppRf0RQOyzj
 YOi4hovGEMJKE+BCILukhJcZ9jCTtS3zut6v1rdvR88f4y7uhR9RfmRfsxuW7PNj
 3Rmh191+n0lVWDmhOs2psXuCLJr3LEiA0dFffN1z8REUTtTAZMsj8Rz+SvBNAZDR
 hsJhERVeXB6X5uQ5rkLDzbn1Zic60LjVw7LIp6SF2OYf/YKaF8vhyWOA8dyCEu8W
 CGo7AoG0BO8tIIr8+LvFe8CweypysZImx4AjCfIs4u9pu/v11zmBvO9NO5yfuObF
 BreEERYgTes/UITxn1qdIW4/q+Nr0iKO3CTqsmu6L1GfCz3/XzPGs3U26fUhllqi
 Ka0JKgnWvsa6ez6FSzKI
 =ivQa
 -----END PGP SIGNATURE-----

Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma

Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
 "There are three implementations for NUMA balancing, this tree
  (balancenuma), numacore which has been developed in tip/master and
  autonuma which is in aa.git.

  In almost all respects balancenuma is the dumbest of the three because
  its main impact is on the VM side with no attempt to be smart about
  scheduling.  In the interest of getting the ball rolling, it would be
  desirable to see this much merged for 3.8 with the view to building
  scheduler smarts on top and adapting the VM where required for 3.9.

  The most recent set of comparisons available from different people are

    mel:    https://lkml.org/lkml/2012/12/9/108
    mingo:  https://lkml.org/lkml/2012/12/7/331
    tglx:   https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

  The results are a mixed bag.  In my own tests, balancenuma does
  reasonably well.  It's dumb as rocks and does not regress against
  mainline.  On the other hand, Ingo's tests show that balancenuma is
  incapable of converging for these workloads driven by perf, which is bad
  but is potentially explained by the lack of scheduler smarts.  Thomas'
  results show balancenuma improves on mainline but falls far short of
  numacore or autonuma.  Srikar's results indicate we all suffer on a
  large machine with imbalanced node sizes.

  My own testing showed that recent numacore results have improved
  dramatically, particularly in the last week but not universally.
  We've butted heads heavily on system CPU usage and high levels of
  migration even when it shows that overall performance is better.
  There are also cases where it regresses.  Of interest is that for
  specjbb in some configurations it will regress for lower numbers of
  warehouses and show gains for higher numbers which is not reported by
  the tool by default and sometimes missed in reports.  Recently I
  reported for numacore that the JVM was crashing with
  NullPointerExceptions but currently it's unclear what the source of
  this problem is.  Initially I thought it was in how numacore batch
  handles PTEs, but I no longer think this is the case.  It's possible
  numacore is just able to trigger it due to higher rates of migration.

  These reports were quite late in the cycle so I/we would like to start
  with this tree as it contains much of the code we can agree on and has
  not changed significantly over the last 2-3 weeks."

* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
  mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  mm: migrate: Account a transhuge page properly when rate limiting
  mm: numa: Account for failed allocations and isolations as migration failures
  mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
  mm: numa: Add THP migration for the NUMA working set scanning fault case.
  mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
  mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
  mm: sched: numa: Control enabling and disabling of NUMA balancing
  mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
  mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  mm: numa: migrate: Set last_nid on newly allocated page
  mm: numa: split_huge_page: Transfer last_nid on tail page
  mm: numa: Introduce last_nid to the page frame
  sched: numa: Slowly increase the scanning period as NUMA faults are handled
  mm: numa: Rate limit setting of pte_numa if node is saturated
  mm: numa: Rate limit the amount of memory that is migrated between nodes
  mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  mm: numa: Migrate pages handled during a pmd_numa hinting fault
  mm: numa: Migrate on reference policy
  ...
2012-12-16 15:18:08 -08:00
David Rientjes e1e12d2f31 mm, oom: fix race when specifying a thread as the oom origin
test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
specify that current should be killed first if an oom condition occurs in
between the two calls.

The usage is

	short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
	...
	compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);

to store the thread's oom_score_adj, temporarily change it to the maximum
score possible, and then restore the old value if it is still the same.

This happens to still be racy, however, if the user writes
OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
The compare_swap_oom_score_adj() will then incorrectly reset the old value
prior to the write of OOM_SCORE_ADJ_MAX.

To fix this, introduce a new oom_flags_t member in struct signal_struct
that will be used for per-thread oom killer flags.  KSM and swapoff can
now use a bit in this member to specify that threads should be killed
first in oom conditions without playing around with oom_score_adj.

This also allows the correct oom_score_adj to always be shown when reading
/proc/pid/oom_score.
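
The bit is driven by a few small helpers (simplified sketch;
OOM_FLAG_ORIGIN names the bit introduced here):

static inline void set_current_oom_origin(void)
{
	current->signal->oom_flags |= OOM_FLAG_ORIGIN;
}

static inline void clear_current_oom_origin(void)
{
	current->signal->oom_flags &= ~OOM_FLAG_ORIGIN;
}

static inline bool oom_task_origin(const struct task_struct *p)
{
	return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
}

so KSM's unmerge and swapoff simply bracket their work with
set_current_oom_origin()/clear_current_oom_origin() instead of saving
and restoring oom_score_adj.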

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:27 -08:00
David Rientjes a9c58b907d mm, oom: change type of oom_score_adj to short
The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
so this range can be represented by the signed short type with no
functional change.  The extra space this frees up in struct signal_struct
will be used for per-thread oom kill flags in the next patch.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:27 -08:00
Bob Liu 6219049ae1 mm: introduce mm_find_pmd()
Several places need to find the pmd by (mm_struct, address), so introduce a
function to simplify it.
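
The helper is essentially a plain page-table walk (simplified sketch):

static pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd = NULL;

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		goto out;

	pud = pud_offset(pgd, address);
	if (!pud_present(*pud))
		goto out;

	pmd = pmd_offset(pud, address);
	if (!pmd_present(*pmd))
		pmd = NULL;
out:
	return pmd;
}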

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Ni zhan Chen <nizhan.chen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:22 -08:00
Ingo Molnar 4fc3f1d66b mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
rmap_walk_anon() and try_to_unmap_anon() appear to be too
careful about locking the anon vma: while they need protection
against anon vma list modifications, they do not need exclusive
access to the list itself.

Transforming this exclusive lock to a read-locked rwsem removes
a global lock from the hot path of page-migration intense
threaded workloads which can cause pathological performance like
this:

    96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                  |
                  --- perf_trace_sched_switch
                      __schedule
                      schedule
                      schedule_preempt_disabled
                      __mutex_lock_common.isra.6
                      __mutex_lock_slowpath
                      mutex_lock
                     |
                     |--50.61%-- rmap_walk
                     |          move_to_new_page
                     |          migrate_pages
                     |          migrate_misplaced_page
                     |          __do_numa_page.isra.69
                     |          handle_pte_fault
                     |          handle_mm_fault
                     |          __do_page_fault
                     |          do_page_fault
                     |          page_fault
                     |          __memset_sse2
                     |          |
                     |           --100.00%-- worker_thread
                     |                     |
                     |                      --100.00%-- start_thread
                     |
                      --49.39%-- page_lock_anon_vma
                                try_to_unmap_anon
                                try_to_unmap
                                migrate_pages
                                migrate_misplaced_page
                                __do_numa_page.isra.69
                                handle_pte_fault
                                handle_mm_fault
                                __do_page_fault
                                do_page_fault
                                page_fault
                                __memset_sse2
                                |
                                 --100.00%-- worker_thread
                                           start_thread

With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.

Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
to make it clearer that it's an exclusive write-lock in
that case - suggested by Rik van Riel.
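
After the conversion the anon_vma locking interface is roughly
(simplified sketch; the root anon_vma carries the rwsem):

static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
{
	down_write(&anon_vma->root->rwsem);
}

static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
{
	up_write(&anon_vma->root->rwsem);
}

static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
{
	down_read(&anon_vma->root->rwsem);
}

static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
{
	up_read(&anon_vma->root->rwsem);
}

Places that modify the list keep the write side; the rmap walkers take
only the read side and no longer serialize against each other.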

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11 14:43:00 +00:00
Haggai Eran 6bdb913f0a mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end
In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling when holding the PT lock.  In addition to its direct
calls, invalidate_page can also be called as a substitute for a change_pte
call, in case the notifier client hasn't implemented change_pte.

This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.

Note that change_pte still cannot sleep after this patch, and that clients
implementing change_pte should not take action on it in case the number of
outstanding invalidate_range_start calls is larger than one, otherwise
they might miss a later invalidation.
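
A wrapped call site then follows the usual notifier pattern (simplified
sketch, e.g. from ksm's replace_page()):

	mmun_start = addr;
	mmun_end   = addr + PAGE_SIZE;
	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	/* ... validate the old pte, then install the new one ... */
	set_pte_at_notify(mm, addr, ptep, newpte);
	pte_unmap_unlock(ptep, ptl);

	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

The start/end pair is issued outside the page table lock, so a notifier
client may sleep there, while change_pte itself stays atomic.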

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Cc: Andrea Arcangeli <andrea@qumranet.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Liran Liss <liranl@mellanox.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:58 +09:00
Hugh Dickins 39b5f29ac1 mm: remove vma arg from page_evictable
page_evictable(page, vma) is an irritant: almost all its callers pass
NULL for vma.  Remove the vma arg and use mlocked_vma_newpage(vma, page)
explicitly in the couple of places it's needed.  But in those places we
don't even need page_evictable() itself!  They're dealing with a freshly
allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:55 +09:00
Michel Lespinasse bf181b9f9d mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list.  This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevant vmas before finding the
one a given anon page might fall into.

By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.

While the change is large, all of its pieces are fairly simple.

Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead.  The exception here is ksm, where the
page's index is not known.  It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
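
For instance, an rmap walker that knows the index now does (sketch):

	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	struct anon_vma_chain *avc;

	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
		struct vm_area_struct *vma = avc->vma;
		/* ... process the page in this vma ... */
	}

whereas ksm passes the whole range (0, ULONG_MAX) to the same iterator
and so still visits every vma on the anon_vma.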

When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree.  This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:41 +09:00
Konstantin Khlebnikov 314e51b985 mm: kill vma flag VM_RESERVED and mm->reserved_vm counter
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes the reserved_vm counter from mm_struct.  Nobody seems
to care about it: it is not exported to userspace directly, it only
reduces the total_vm shown in proc.

Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:19 +09:00
Konstantin Khlebnikov 4b6e1e3702 mm: kill vma flag VM_INSERTPAGE
Merge VM_INSERTPAGE into VM_MIXEDMAP.  VM_MIXEDMAP VMA can mix pure-pfn
ptes, special ptes and normal ptes.

Now copy_page_range() always copies a VM_MIXEDMAP VMA on fork, like
VM_PFNMAP.  If a driver populates the whole VMA at mmap() time, it
probably does not expect page faults.

This patch removes the special check from vma_wants_writenotify() which
disables page write tracking for VMAs populated via vm_insert_page().
The BDI below the mapped file should not use dirty accounting; moreover,
do_wp_page() can handle this.

vm_insert_page() still marks the vma after first usage.  Usually it is
called from the f_op->mmap() handler under the mm->mmap_sem write-lock,
so it is able to change vma->vm_flags.  The caller must set VM_MIXEDMAP
at mmap time if it wants to call this function from other places, for
example from a page-fault handler.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:17 +09:00
Konstantin Khlebnikov cc2383ec06 mm: introduce arch-specific vma flag VM_ARCH_1
Combine several arch-specific vma flags into one.

before patch:

        0x00000200      0x01000000      0x20000000      0x40000000
x86     VM_NOHUGEPAGE   VM_HUGEPAGE     -               VM_PAT
powerpc -               -               VM_SAO          -
parisc  VM_GROWSUP      -               -               -
ia64    VM_GROWSUP      -               -               -
nommu   -               VM_MAPPED_COPY  -               -
others  -               -               -               -

after patch:

        0x00000200      0x01000000      0x20000000      0x40000000
x86     -               VM_PAT          VM_HUGEPAGE     VM_NOHUGEPAGE
powerpc -               VM_SAO          -               -
parisc  -               VM_GROWSUP      -               -
ia64    -               VM_GROWSUP      -               -
nommu   -               VM_MAPPED_COPY  -               -
others  -               VM_ARCH_1       -               -

And voila! One completely free bit.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:16 +09:00
Bob Liu ef6942224a ksm: cleanup: introduce find_mergeable_vma()
There are multiple places which perform the same check.  Add a new
find_mergeable_vma() to handle this.
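
The extracted helper is essentially (simplified sketch):

static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
						 unsigned long addr)
{
	struct vm_area_struct *vma;

	if (ksm_test_exit(mm))
		return NULL;
	vma = find_vma(mm, addr);
	if (!vma || vma->vm_start > addr)
		return NULL;
	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
		return NULL;
	return vma;
}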

Signed-off-by: Bob Liu <lliubbo@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:59 -07:00
Cong Wang 9b04c5fec4 mm: remove the second argument of k[un]map_atomic()
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:27 +08:00
Hugh Dickins 7512102cf6 memcg: fix GPF when cgroup removal races with last exit
When moving tasks from old memcg (with move_charge_at_immigrate on new
memcg), followed by removal of old memcg, hit General Protection Fault in
mem_cgroup_lru_del_list() (called from release_pages called from
free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
exit_mmap from mmput from exit_mm from do_exit).

Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
still trying to update its stats, and take page off lru before freeing.

A task, or a charge, or a page on lru: each secures a memcg against
removal.  In this case, the last task has been moved out of the old memcg,
and it is exiting: anonymous pages are uncharged one by one from the
memcg, as they are zapped from its pagetables, so the charge gets down to
0; but the pages themselves are queued in an mmu_gather for freeing.

Most of those pages will be on lru (and force_empty is careful to
lru_add_drain_all, to add pages from pagevec to lru first), but not
necessarily all: perhaps some have been isolated for page reclaim, perhaps
some isolated for other reasons.  So, force_empty may find no task, no
charge and no page on lru, and let the removal proceed.

There would still be no problem if these pages were immediately freed; but
typically (and the put_page_testzero protocol demands it) they have to be
added back to lru before they are found freeable, then removed from lru
and freed.  We don't see the issue when adding, because the
mem_cgroup_iter() loops keep their own reference to the memcg being
scanned; but there is no such protection when it comes to
mem_cgroup_lru_del_list().

I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
38c5d72f3e ("memcg: simplify LRU handling by new rule") mercifully
removed those convolutions, but left this General Protection Fault.

But it's surprisingly easy to restore the old behaviour: just check
PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
to add), and reset pc to root_mem_cgroup if the page is uncharged.  A
risky change?  It is just going back to how it worked before; testing,
and an audit of uses of pc->mem_cgroup, show no problem.
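
The check amounts to something like this in mem_cgroup_lru_add_list()
(simplified sketch, fragment only):

	pc = lookup_page_cgroup(page);
	memcg = pc->mem_cgroup;

	/* an uncharged page must be accounted on the root lru */
	if (!PageCgroupUsed(pc) && memcg != root_mem_cgroup)
		pc->mem_cgroup = memcg = root_mem_cgroup;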

And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
longer has any purpose, and we can safely revert 4e5f01c2b9 ("memcg:
clear pc->mem_cgroup if necessary").

Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
is not strictly necessary: the lru_lock there, with RCU before memcg
structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
without that; but it seems cleaner to rely on one dependency less.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:43 -08:00
KAMEZAWA Hiroyuki 4e5f01c2b9 memcg: clear pc->mem_cgroup if necessary.
This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
and reducing atomic ops/complexity in memcg LRU handling.

In some cases, pages are added to the lru before being charged to a
memcg, so they are not classified to a memory cgroup at lru addition.
Now, the lru where the page should be added is determined by a bit in
page_cgroup->flags and by pc->mem_cgroup.  I'd like to remove the check
of the flag.

In that case, pc->mem_cgroup may contain stale pointers if pages are
added to the LRU before classification.  To handle it, this patch resets
pc->mem_cgroup to root_mem_cgroup before lru additions.

[akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
[hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
[akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
[hughd@google.com: stop oops in mem_cgroup_reset_owner()]
[hughd@google.com: fix page migration to reset_owner]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:07 -08:00
David Rientjes 43362a4977 oom: fix race while temporarily setting current's oom_score_adj
test_set_oom_score_adj() was introduced in 72788c3856 ("oom: replace
PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
current's oom_score_adj for ksm and swapoff without requiring an
additional per-process flag.

Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
then reinstate the previous value is racy since it's possible that
userspace can set the value to something else itself before the old value
is reinstated.  That results in userspace setting current's oom_score_adj
to a different value and then the kernel immediately setting it back to
its previous value without notification.

To fix this, a new compare_swap_oom_score_adj() function is introduced
with the same semantics as the compare and swap CAS instruction, or
CMPXCHG on x86.  It is used to reinstate the previous value of
oom_score_adj if and only if the present value is the same as the old
value.
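
The helper is essentially (simplified sketch):

void compare_swap_oom_score_adj(int old_val, int new_val)
{
	struct sighand_struct *sighand = current->sighand;

	spin_lock_irq(&sighand->siglock);
	if (current->signal->oom_score_adj == old_val)
		current->signal->oom_score_adj = new_val;
	spin_unlock_irq(&sighand->siglock);
}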

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:45 -07:00
Hugh Dickins 2b472611a3 ksm: fix NULL pointer dereference in scan_get_next_rmap_item()
Andrea Righi reported a case where an exiting task can race against
ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
triggering a NULL pointer dereference in ksmd.

ksm_scan.mm_slot == &ksm_mm_head with only one registered mm

CPU 1 (__ksm_exit)		CPU 2 (scan_get_next_rmap_item)
 				list_empty() is false
lock				slot == &ksm_mm_head
list_del(slot->mm_list)
(list now empty)
unlock
				lock
				slot = list_entry(slot->mm_list.next)
				(list is empty, so slot is still ksm_mm_head)
				unlock
				slot->mm == NULL ... Oops

Close this race by revalidating that the new slot is not simply the list
head again.

Andrea's test case:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define BUFSIZE getpagesize()

int main(int argc, char **argv)
{
	void *ptr;

	if (posix_memalign(&ptr, getpagesize(), BUFSIZE) != 0) {
		perror("posix_memalign");
		exit(1);
	}
	if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
		perror("madvise");
		exit(1);
	}
	*(char *)NULL = 0;

	return 0;
}

Reported-by: Andrea Righi <andrea@betterlinux.com>
Tested-by: Andrea Righi <andrea@betterlinux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15 20:04:02 -07:00
David Rientjes 72788c3856 oom: replace PF_OOM_ORIGIN with toggling oom_score_adj
There's a kernel-wide shortage of per-process flags, so it's always
helpful to trim one when possible without incurring a significant penalty.
It's even more important when you're planning on adding a per-process
flag yourself, which I plan to do shortly for transparent hugepages.

PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
tendency to allocate large amounts of memory and should be preferred for
killing over other tasks.  We'd rather immediately kill the task making
the errant syscall rather than penalizing an innocent task.

This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

The process's old oom_score_adj is stored and then set to
OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN.  The old
value is then reinstated when the process should no longer be considered a
high priority for oom killing.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:10 -07:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Peter Zijlstra 9e60109f12 mm: rename drop_anon_vma() to put_anon_vma()
The normal code pattern used in the kernel is: get/put.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:03 -07:00
Hugh Dickins 2919bfd075 ksm: drain pagevecs to lru
It was hard to explain the page counts which were causing new LTP tests
of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:49 -08:00
Andrea Arcangeli 22e5c47ee2 thp: add compound_trans_head() helper
Clean up some code with a common compound_trans_head() helper.
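
The helper grabs the head of a possibly-transparent-huge tail page while
guarding against a concurrent split (simplified sketch):

static inline struct page *compound_trans_head(struct page *page)
{
	if (PageTail(page)) {
		struct page *head = page->first_page;

		/* re-check PageTail after reading first_page */
		smp_rmb();
		if (PageTail(page))
			return head;
	}
	return page;
}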

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00
Andrea Arcangeli 29ad768cfc thp: KSM on THP
This makes KSM fully operational with THP pages.  Subpages are scanned
while the hugepage is still in place and delivering max cpu performance,
and only if there's a match and we're going to deduplicate memory is the
single hugepage containing the matching subpage split.

There will be no false sharing between ksmd and khugepaged.  khugepaged
won't collapse 2m virtual regions with KSM pages inside.  ksmd should
also only split pages when the checksum matches, so we're only likely to
split a hugepage for some long-living ksm page (the usual ksm heuristic
to avoid sharing pages that get de-cowed).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00