Commit Graph

64 Commits

Author SHA1 Message Date
Nathan Chancellor a626beca4c This is the 3.10.105 stable release

Merge 3.10.105 into android-msm-bullhead-3.10-oreo-m5

Changes in 3.10.105: (315 commits)
        sched/core: Fix a race between try_to_wake_up() and a woken up task
        sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()
        crypto: algif_skcipher - Require setkey before accept(2)
        crypto: af_alg - Disallow bind/setkey/... after accept(2)
        crypto: af_alg - Add nokey compatibility path
        crypto: algif_skcipher - Add nokey compatibility path
        crypto: hash - Add crypto_ahash_has_setkey
        crypto: shash - Fix has_key setting
        crypto: algif_hash - Require setkey before accept(2)
        crypto: skcipher - Add crypto_skcipher_has_setkey
        crypto: algif_skcipher - Add key check exception for cipher_null
        crypto: af_alg - Allow af_alg_release_parent to be called on nokey path
        crypto: algif_hash - Remove custom release parent function
        crypto: algif_skcipher - Remove custom release parent function
        crypto: af_alg - Forbid bind(2) when nokey child sockets are present
        crypto: algif_hash - Fix race condition in hash_check_key
        crypto: algif_skcipher - Fix race condition in skcipher_check_key
        crypto: algif_skcipher - Load TX SG list after waiting
        crypto: cryptd - initialize child shash_desc on import
        crypto: skcipher - Fix blkcipher walk OOM crash
        crypto: gcm - Fix IV buffer size in crypto_gcm_setkey
        MIPS: KVM: Fix unused variable build warning
        KVM: MIPS: Precalculate MMIO load resume PC
        KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
        KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write
        KVM: MIPS: Make ERET handle ERL before EXL
        KVM: x86: fix wbinvd_dirty_mask use-after-free
        KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
        KVM: Disable irq while unregistering user notifier
        PM / devfreq: Fix incorrect type issue.
        ppp: defer netns reference release for ppp channel
        x86/mm/xen: Suppress hugetlbfs in PV guests
        xen: Add RING_COPY_REQUEST()
        xen-netback: don't use last request to determine minimum Tx credit
        xen-netback: use RING_COPY_REQUEST() throughout
        xen-blkback: only read request operation from shared ring once
        xen/pciback: Save xen_pci_op commands before processing it
        xen/pciback: Save the number of MSI-X entries to be copied later.
        xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
        xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
        xen/pciback: Do not install an IRQ handler for MSI interrupts.
        xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
        xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
        xen-pciback: Add name prefix to global 'permissive' variable
        x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
        x86/traps: Ignore high word of regs->cs in early_idt_handler_common
        x86/mm: Disable preemption during CR3 read+write
        x86/apic: Do not init irq remapping if ioapic is disabled
        x86/mm/pat, /dev/mem: Remove superfluous error message
        x86/paravirt: Do not trace _paravirt_ident_*() functions
        x86/build: Build compressed x86 kernels as PIE
        x86/um: reuse asm-generic/barrier.h
        iommu/amd: Update Alias-DTE in update_device_table()
        iommu/amd: Free domain id when free a domain of struct dma_ops_domain
        ARM: 8616/1: dt: Respect property size when parsing CPUs
        ARM: 8618/1: decompressor: reset ttbcr fields to use TTBR0 on ARMv7
        ARM: sa1100: clear reset status prior to reboot
        ARM: sa1111: fix pcmcia suspend/resume
        arm64: avoid returning from bad_mode
        arm64: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO
        arm64: spinlocks: implement smp_mb__before_spinlock() as smp_mb()
        arm64: debug: avoid resetting stepping state machine when TIF_SINGLESTEP
        MIPS: Malta: Fix IOCU disable switch read for MIPS64
        MIPS: ptrace: Fix regs_return_value for kernel context
        powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET
        powerpc/vdso64: Use double word compare on pointers
        powerpc/powernv: Use CPU-endian PEST in pnv_pci_dump_p7ioc_diag_data()
        powerpc/64: Fix incorrect return value from __copy_tofrom_user
        powerpc/nvram: Fix an incorrect partition merge
        avr32: fix copy_from_user()
        avr32: fix 'undefined reference to `___copy_from_user'
        avr32: off by one in at32_init_pio()
        s390/dasd: fix hanging device after clear subchannel
        parisc: Ensure consistent state when switching to kernel stack at syscall entry
        microblaze: fix __get_user()
        microblaze: fix copy_from_user()
        mn10300: failing __get_user() and get_user() should zero
        m32r: fix __get_user()
        sh64: failing __get_user() should zero
        score: fix __get_user/get_user
        s390: get_user() should zero on failure
        ARC: uaccess: get_user to zero out dest in cause of fault
        asm-generic: make get_user() clear the destination on errors
        frv: fix clear_user()
        cris: buggered copy_from_user/copy_to_user/clear_user
        blackfin: fix copy_from_user()
        score: fix copy_from_user() and friends
        sh: fix copy_from_user()
        hexagon: fix strncpy_from_user() error return
        mips: copy_from_user() must zero the destination on access_ok() failure
        asm-generic: make copy_from_user() zero the destination properly
        alpha: fix copy_from_user()
        metag: copy_from_user() should zero the destination on access_ok() failure
        parisc: fix copy_from_user()
        openrisc: fix copy_from_user()
        openrisc: fix the fix of copy_from_user()
        mn10300: copy_from_user() should zero on access_ok() failure...
        sparc32: fix copy_from_user()
        ppc32: fix copy_from_user()
        ia64: copy_from_user() should zero the destination on access_ok() failure
        fix fault_in_multipages_...() on architectures with no-op access_ok()
        fix memory leaks in tracing_buffers_splice_read()
        arc: don't leak bits of kernel stack into coredump
        Fix potential infoleak in older kernels
        swapfile: fix memory corruption via malformed swapfile
        coredump: fix unfreezable coredumping task
        usb: dwc3: gadget: increment request->actual once
        USB: validate wMaxPacketValue entries in endpoint descriptors
        USB: fix typo in wMaxPacketSize validation
        usb: xhci: Fix panic if disconnect
        USB: serial: fix memleak in driver-registration error path
        USB: kobil_sct: fix non-atomic allocation in write path
        USB: serial: mos7720: fix non-atomic allocation in write path
        USB: serial: mos7840: fix non-atomic allocation in write path
        usb: renesas_usbhs: fix clearing the {BRDY,BEMP}STS condition
        USB: change bInterval default to 10 ms
        usb: gadget: fsl_qe_udc: signedness bug in qe_get_frame()
        USB: serial: cp210x: fix hardware flow-control disable
        usb: misc: legousbtower: Fix NULL pointer deference
        usb: gadget: function: u_ether: don't starve tx request queue
        USB: serial: cp210x: fix tiocmget error handling
        usb: gadget: u_ether: remove interrupt throttling
        usb: chipidea: move the lock initialization to core file
        Fix USB CB/CBI storage devices with CONFIG_VMAP_STACK=y
        ALSA: rawmidi: Fix possible deadlock with virmidi registration
        ALSA: timer: fix NULL pointer dereference in read()/ioctl() race
        ALSA: timer: fix division by zero after SNDRV_TIMER_IOCTL_CONTINUE
        ALSA: timer: fix NULL pointer dereference on memory allocation failure
        ALSA: ali5451: Fix out-of-bound position reporting
        ALSA: pcm : Call kill_fasync() in stream lock
        zfcp: fix fc_host port_type with NPIV
        zfcp: fix ELS/GS request&response length for hardware data router
        zfcp: close window with unblocked rport during rport gone
        zfcp: retain trace level for SCSI and HBA FSF response records
        zfcp: restore: Dont use 0 to indicate invalid LUN in rec trace
        zfcp: trace on request for open and close of WKA port
        zfcp: restore tracing of handle for port and LUN with HBA records
        zfcp: fix D_ID field with actual value on tracing SAN responses
        zfcp: fix payload trace length for SAN request&response
        zfcp: trace full payload of all SAN records (req,resp,iels)
        scsi: zfcp: spin_lock_irqsave() is not nestable
        scsi: mpt3sas: Fix secure erase premature termination
        scsi: mpt3sas: Unblock device after controller reset
        scsi: mpt3sas: fix hang on ata passthrough commands
        mpt2sas: Fix secure erase premature termination
        scsi: megaraid_sas: Fix data integrity failure for JBOD (passthrough) devices
        scsi: megaraid_sas: fix macro MEGASAS_IS_LOGICAL to avoid regression
        scsi: ibmvfc: Fix I/O hang when port is not mapped
        scsi: Fix use-after-free
        scsi: arcmsr: Buffer overflow in arcmsr_iop_message_xfer()
        scsi: scsi_debug: Fix memory leak if LBP enabled and module is unloaded
        scsi: arcmsr: Send SYNCHRONIZE_CACHE command to firmware
        ext4: validate that metadata blocks do not overlap superblock
        ext4: avoid modifying checksum fields directly during checksum verification
        ext4: use __GFP_NOFAIL in ext4_free_blocks()
        ext4: reinforce check of i_dtime when clearing high fields of uid and gid
        ext4: allow DAX writeback for hole punch
        ext4: sanity check the block and cluster size at mount time
        reiserfs: fix "new_insert_key may be used uninitialized ..."
        reiserfs: Unlock superblock before calling reiserfs_quota_on_mount()
        xfs: fix superblock inprogress check
        libxfs: clean up _calc_dquots_per_chunk
        btrfs: ensure that file descriptor used with subvol ioctls is a dir
        ocfs2/dlm: fix race between convert and migration
        ocfs2: fix start offset to ocfs2_zero_range_for_truncate()
        ubifs: Fix assertion in layout_in_gaps()
        ubifs: Fix xattr_names length in exit paths
        UBIFS: Fix possible memory leak in ubifs_readdir()
        ubifs: Abort readdir upon error
        ubifs: Fix regression in ubifs_readdir()
        UBI: fastmap: scrub PEB when bitflips are detected in a free PEB EC header
        NFSv4.x: Fix a refcount leak in nfs_callback_up_net
        NFSD: Using free_conn free connection
        NFS: Don't drop CB requests with invalid principals
        NFSv4: Open state recovery must account for file permission changes
        fs/seq_file: fix out-of-bounds read
        fs/super.c: fix race between freeze_super() and thaw_super()
        isofs: Do not return EACCES for unknown filesystems
        hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common()
        driver core: Delete an unnecessary check before the function call "put_device"
        driver core: fix race between creating/querying glue dir and its cleanup
        drm/radeon: fix radeon_move_blit on 32bit systems
        drm: Reject page_flip for !DRIVER_MODESET
        drm/radeon: Ensure vblank interrupt is enabled on DPMS transition to on
        qxl: check for kmap failures
        Input: i8042 - break load dependency between atkbd/psmouse and i8042
        Input: i8042 - set up shared ps2_cmd_mutex for AUX ports
        Input: ili210x - fix permissions on "calibrate" attribute
        hwrng: exynos - Disable runtime PM on probe failure
        hwrng: omap - Fix assumption that runtime_get_sync will always succeed
        hwrng: omap - Only fail if pm_runtime_get_sync returns < 0
        i2c-eg20t: fix race between i2c init and interrupt enable
        em28xx-i2c: rt_mutex_trylock() returns zero on failure
        i2c: core: fix NULL pointer dereference under race condition
        i2c: at91: fix write transfers by clearing pending interrupt first
        iio: accel: kxsd9: Fix raw read return
        iio: accel: kxsd9: Fix scaling bug
        thermal: hwmon: Properly report critical temperature in sysfs
        cdc-acm: fix wrong pipe type on rx interrupt xfers
        timers: Use proper base migration in add_timer_on()
        EDAC: Increment correct counter in edac_inc_ue_error()
        IB/ipoib: Fix memory corruption in ipoib cm mode connect flow
        IB/core: Fix use after free in send_leave function
        IB/ipoib: Don't allow MC joins during light MC flush
        IB/mlx4: Fix incorrect MC join state bit-masking on SR-IOV
        IB/mlx4: Fix create CQ error flow
        IB/uverbs: Fix leak of XRC target QPs
        IB/cm: Mark stale CM id's whenever the mad agent was unregistered
        mtd: blkdevs: fix potential deadlock + lockdep warnings
        mtd: pmcmsp-flash: Allocating too much in init_msp_flash()
        mtd: nand: davinci: Reinitialize the HW ECC engine in 4bit hwctl
        perf symbols: Fixup symbol sizes before picking best ones
        perf: Tighten (and fix) the grouping condition
        tty: Prevent ldisc drivers from re-using stale tty fields
        tty: limit terminal size to 4M chars
        tty: vt, fix bogus division in csi_J
        vt: clear selection before resizing
        drivers/vfio: Rework offsetofend()
        include/stddef.h: Move offsetofend() from vfio.h to a generic kernel header
        stddef.h: move offsetofend inside #ifndef/#endif guard, neaten
        ipv6: don't call fib6_run_gc() until routing is ready
        ipv6: split duplicate address detection and router solicitation timer
        ipv6: move DAD and addrconf_verify processing to workqueue
        ipv6: addrconf: fix dev refcont leak when DAD failed
        ipv6: fix rtnl locking in setsockopt for anycast and multicast
        ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()
        ipv6: correctly add local routes when lo goes up
        ipv6: dccp: fix out of bound access in dccp_v6_err()
        ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped
        ip6_tunnel: Clear IP6CB in ip6tunnel_xmit()
        ip6_tunnel: disable caching when the traffic class is inherited
        net/irda: handle iriap_register_lsap() allocation failure
        tcp: fix use after free in tcp_xmit_retransmit_queue()
        tcp: properly scale window in tcp_v[46]_reqsk_send_ack()
        tcp: fix overflow in __tcp_retransmit_skb()
        tcp: fix wrong checksum calculation on MTU probing
        tcp: take care of truncations done by sk_filter()
        bonding: Fix bonding crash
        net: ratelimit warnings about dst entry refcount underflow or overflow
        mISDN: Support DR6 indication in mISDNipac driver
        mISDN: Fixing missing validation in base_sock_bind()
        net: disable fragment reassembly if high_thresh is set to zero
        ipvs: count pre-established TCP states as active
        iwlwifi: pcie: fix access to scratch buffer
        svc: Avoid garbage replies when pc_func() returns rpc_drop_reply
        brcmsmac: Free packet if dma_mapping_error() fails in dma_rxfill
        brcmsmac: Initialize power in brcms_c_stf_ss_algo_channel_get()
        brcmfmac: avoid potential stack overflow in brcmf_cfg80211_start_ap()
        pstore: Fix buffer overflow while write offset equal to buffer size
        net/mlx4_core: Allow resetting VF admin mac to zero
        firewire: net: guard against rx buffer overflows
        firewire: net: fix fragmented datagram_size off-by-one
        netfilter: fix namespace handling in nf_log_proc_dostring
        can: bcm: fix warning in bcm_connect/proc_register
        net: fix sk_mem_reclaim_partial()
        net: avoid sk_forward_alloc overflows
        ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route
        packet: call fanout_release, while UNREGISTERING a netdev
        net: sctp, forbid negative length
        sctp: validate chunk len before actually using it
        net: clear sk_err_soft in sk_clone_lock()
        net: mangle zero checksum in skb_checksum_help()
        dccp: do not send reset to already closed sockets
        dccp: fix out of bound access in dccp_v4_err()
        sctp: assign assoc_id earlier in __sctp_connect
        neigh: check error pointer instead of NULL for ipv4_neigh_lookup()
        ipv4: use new_gw for redirect neigh lookup
        mac80211: fix purging multicast PS buffer queue
        mac80211: discard multicast and 4-addr A-MSDUs
        cfg80211: limit scan results cache size
        mwifiex: printk() overflow with 32-byte SSIDs
        ipv4: Set skb->protocol properly for local output
        net: sky2: Fix shutdown crash
        kaweth: fix firmware download
        tracing: Move mutex to protect against resetting of seq data
        kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
        Revert "ipc/sem.c: optimize sem_lock()"
        cfq: fix starvation of asynchronous writes
        drbd: Fix kernel_sendmsg() usage - potential NULL deref
        lib/genalloc.c: start search from start of chunk
        tools/vm/slabinfo: fix an unintentional printf
        rcu: Fix soft lockup for rcu_nocb_kthread
        ratelimit: fix bug in time interval by resetting right begin time
        mfd: core: Fix device reference leak in mfd_clone_cell
        PM / sleep: fix device reference leak in test_suspend
        mmc: mxs: Initialize the spinlock prior to using it
        mmc: block: don't use CMD23 with very old MMC cards
        pstore/core: drop cmpxchg based updates
        pstore/ram: Use memcpy_toio instead of memcpy
        pstore/ram: Use memcpy_fromio() to save old buffer
        mb86a20s: fix the locking logic
        mb86a20s: fix demod settings
        cx231xx: don't return error on success
        cx231xx: fix GPIOs for Pixelview SBTVD hybrid
        gpio: mpc8xxx: Correct irq handler function
        uio: fix dmem_region_start computation
        KEYS: Fix short sprintf buffer in /proc/keys show function
        hv: do not lose pending heartbeat vmbus packets
        staging: iio: ad5933: avoid uninitialized variable in error case
        mei: bus: fix received data size check in NFC fixup
        ACPI / APEI: Fix incorrect return value of ghes_proc()
        PCI: Handle read-only BARs on AMD CS553x devices
        tile: avoid using clocksource_cyc2ns with absolute cycle count
        dm flakey: fix reads to be issued if drop_writes configured
        mm,ksm: fix endless looping in allocating memory when ksm enable
        can: dev: fix deadlock reported after bus-off
        hwmon: (adt7411) set bit 3 in CFG1 register
        mpi: Fix NULL ptr dereference in mpi_powm() [ver #3]
        mfd: 88pm80x: Double shifting bug in suspend/resume
        ASoC: omap-mcpdm: Fix irq resource handling
        regulator: tps65910: Work around silicon erratum SWCZ010
        dm: mark request_queue dead before destroying the DM device
        fbdev/efifb: Fix 16 color palette entry calculation
        metag: Only define atomic_dec_if_positive conditionally
        Linux 3.10.105

Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>

Conflicts:
	arch/arm/mach-sa1100/generic.c
	arch/arm64/kernel/traps.c
	crypto/blkcipher.c
	drivers/devfreq/devfreq.c
	drivers/usb/dwc3/gadget.c
	drivers/usb/gadget/u_ether.c
	fs/ubifs/dir.c
	include/net/if_inet6.h
	lib/genalloc.c
	net/ipv6/addrconf.c
	net/ipv6/tcp_ipv6.c
	net/wireless/scan.c
	sound/core/timer.c
2018-01-25 17:45:32 -07:00
Al Viro 138c012118 fix fault_in_multipages_...() on architectures with no-op access_ok()
commit e23d4159b109167126e5bcd7f3775c95de7fee47 upstream.

Switching iov_iter fault-in to multipages variants has exposed an old
bug in underlying fault_in_multipages_...(); they break if the range
passed to them wraps around.  Normally access_ok() done by callers will
prevent such (and it's a guaranteed EFAULT - ERR_PTR() values fall into
such a range and they should not point to any valid objects).

However, on architectures where userland and kernel live in different
MMU contexts (e.g. s390) access_ok() is a no-op and on those a range
with a wraparound can reach fault_in_multipages_...().

Since any wraparound means EFAULT there, the fix is trivial - turn
those

    while (uaddr <= end)
	    ...
into

    if (unlikely(uaddr > end))
	    return -EFAULT;
    do
	    ...
    while (uaddr <= end);
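
For illustration, a rough sketch of the fixed helper after this change,
modelled on fault_in_multipages_readable() (treat the exact body as an
approximation of the patch, not a quote of it):

    static inline int fault_in_multipages_readable(const char __user *uaddr,
                                                   int size)
    {
        volatile char c;
        const char __user *end = uaddr + size - 1;
        int ret;

        if (unlikely(size == 0))
            return 0;

        /* a wrapping range can only come from a no-op access_ok(),
         * and always means EFAULT */
        if (unlikely(uaddr > end))
            return -EFAULT;

        do {
            ret = __get_user(c, uaddr);
            if (ret)
                return ret;
            uaddr += PAGE_SIZE;
        } while (uaddr <= end);

        /* fault in the page holding the last byte as well */
        if (((unsigned long)uaddr & PAGE_MASK) ==
            ((unsigned long)end & PAGE_MASK))
            return __get_user(c, end);

        return 0;
    }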

Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-02-06 23:33:04 +01:00
Daniel Campello 89fea44a62 Page cache miss tracing using ftrace on mm/filemap
This patch adds two trace events, in generic_perform_write() and
do_generic_file_read(), that record the address_space mapping of the
pages accessed by each request.
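
Purely for illustration, such a trace event could be declared along these
lines; the event name and fields here are hypothetical, not taken from the
patch:

    #include <linux/tracepoint.h>

    /* usual TRACE_SYSTEM / define_trace.h boilerplate omitted */
    TRACE_EVENT(filemap_pagecache_access,
        TP_PROTO(struct address_space *mapping, pgoff_t index),
        TP_ARGS(mapping, index),
        TP_STRUCT__entry(
            __field(unsigned long, ino)
            __field(unsigned long, index)
        ),
        TP_fast_assign(
            __entry->ino   = mapping->host->i_ino;
            __entry->index = index;
        ),
        TP_printk("ino=%lu index=%lu", __entry->ino, __entry->index)
    );

The read and write paths would then call
trace_filemap_pagecache_access(mapping, index) at the points of interest.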

Change-Id: Ib319b9b2c971b9e5c76645be6cfd995ef9465d77
Signed-off-by: Daniel Campello <campello@google.com>
(cherry picked from commit d3952c50853166bd04562766c9603ed86ab0da75)
2015-11-19 11:03:16 -08:00
Theodore Ts'o 2405316c4f mm: add find_get_page_flags() for 3.18 ext4 backport
Change-Id: I1bf562a302cae211c0b28fb3872f8ee15660c182
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@google.com>
2015-06-15 15:09:35 -07:00
Laura Abbott 7b5104b579 mm: Don't use CMA pages for page cache
All layers of the page cache may take extra references to
pages to avoid migration. This is fine for general movable
pages but renders CMA pages useless as they cannot be
allocated for contiguous mode. Always omitting movable
pages completely wastes memory, so attempt the allocation
with the given GFP flags first and, if a CMA page is allocated,
free it and get a new non-movable page. This does have a minor
performance impact, but ultimately it's a tradeoff between
never using movable pages and taking the extra hit of a
free/realloc.
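
A rough sketch of that retry, where __page_cache_alloc_nocma() and
is_cma_page() are hypothetical names standing in for whatever the patch
actually uses (e.g. a migratetype check):

    static struct page *__page_cache_alloc_nocma(gfp_t gfp)
    {
        struct page *page = alloc_pages(gfp, 0);

        /* is_cma_page() is a stand-in for the real CMA detection */
        if (page && is_cma_page(page)) {
            __free_pages(page, 0);
            page = alloc_pages(gfp & ~__GFP_MOVABLE, 0);
        }
        return page;
    }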

Change-Id: I4e9d7fc1c65306d6356492fd1a99935aa62ed2ec
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
2013-08-22 18:09:38 -07:00
Paul E. McKenney 8375ad98cc vm: adjust ifdef for TINY_RCU
There is an ifdef in page_cache_get_speculative() that checks for !SMP
and TREE_RCU, which has been an impossible combination since the advent
of TINY_RCU.  The ifdef enables a fastpath that is valid when preemption
is disabled by rcu_read_lock() in UP systems, which is the case when
TINY_RCU is enabled.  This commit therefore adjusts the ifdef to
generate the fastpath when TINY_RCU is enabled.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:28 -07:00
Darrick J. Wong 1d1d1a7672 mm: only enforce stable page writes if the backing device requires it
Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait.  Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function.  This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.
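
A minimal sketch of such a helper, assuming a backing_dev_info capability
flag along the lines of bdi_cap_stable_pages_required():

    void wait_for_stable_page(struct page *page)
    {
        struct address_space *mapping = page_mapping(page);
        struct backing_dev_info *bdi = mapping->backing_dev_info;

        /* only wait when the backing device actually needs stable pages */
        if (!bdi_cap_stable_pages_required(bdi))
            return;

        wait_on_page_writeback(page);
    }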

Before this patchset, all filesystems would block, regardless of whether
or not it was necessary.  ext3 would wait, but still generate occasional
checksum errors.  The network filesystems were left to do their own
thing, so they'd wait too.

After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it.  ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait.  Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.

Here's the result of using dbench to test latency on ext2:

3.8.0-rc3:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        109347     0.028    59.817
 ReadX         347180     0.004     3.391
 Flush          15514    29.828   287.283

Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

3.8.0-rc3 + patches:
 WriteX        105556     0.029     4.273
 ReadX         335004     0.005     4.112
 Flush          14982    30.540   298.634

Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, the maximum write latency drops considerably with this
patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:19 -08:00
Rafael Aquini 18468d93e5 mm: introduce a common interface for balloon pages mobility
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces a common interface to help a balloon driver make
its page set movable for compaction, and thus allow the system to better
leverage the compaction efforts on memory defragmentation.

[akpm@linux-foundation.org: use PAGE_FLAGS_CHECK_AT_PREP, s/__balloon_page_flags/page_flags_cleared/, small cleanups]
[rientjes@google.com: allow balloon compaction for any system with memory compaction enabled, which is the defconfig]
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:26 -08:00
Mel Gorman f981c5950f mm: methods for teaching filesystems about PG_swapcache pages
In order to teach filesystems to handle swap cache pages, three new page
functions are introduced:

  pgoff_t page_file_index(struct page *);
  loff_t page_file_offset(struct page *);
  struct address_space *page_file_mapping(struct page *);

page_file_index() - gives the offset of this page in the file in
PAGE_CACHE_SIZE blocks.  Like page->index is for mapped pages, this
function also gives the correct index for PG_swapcache pages.

page_file_offset() - uses page_file_index(), so that it will give the
expected result, even for PG_swapcache pages.

page_file_mapping() - gives the mapping backing the actual page; that is
for swap cache pages it will give swap_file->f_mapping.
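
A rough sketch of how the latter two are built on page_file_index() and
PageSwapCache(), with __page_file_mapping() being the swap-cache-aware
helper (treat as an approximation):

    static inline struct address_space *page_file_mapping(struct page *page)
    {
        if (unlikely(PageSwapCache(page)))
            return __page_file_mapping(page);   /* swap_file->f_mapping */

        return page->mapping;
    }

    static inline loff_t page_file_offset(struct page *page)
    {
        return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
    }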

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:47 -07:00
Paul Gortmaker af2e840971 pagemap.h: fix warning about possibly used before init var
Commit f56f821feb ("mm: extend prefault helpers to fault in more than
PAGE_SIZE") added in the new functions: fault_in_multipages_writeable()
and fault_in_multipages_readable().

However, we currently see:

  include/linux/pagemap.h:492: warning: 'ret' may be used uninitialized in this function
  include/linux/pagemap.h:492: note: 'ret' was declared here

Unlike a lot of gcc nags, this one appears somewhat legit.  i.e.  passing
in an invalid negative value of "size" does make it look like all the
conditionals in there would be bypassed and the uninitialized value would
be returned.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:18 -07:00
Daniel Vetter 9923777dff mm: fixup compilation error due to an asm write through a const pointer
This regression has been introduced in

commit f56f821feb
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sun Mar 25 19:47:41 2012 +0200

    mm: extend prefault helpers to fault in more than PAGE_SIZE

I have failed to notice this because x86 asm seems to happily compile
things as-is.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-04-23 17:41:18 +01:00
Daniel Vetter f56f821feb mm: extend prefault helpers to fault in more than PAGE_SIZE
drm/i915 wants to read/write more than one page in its fastpath
and hence needs to prefault more than PAGE_SIZE bytes.

Add new functions in filemap.h to make that possible.

Also kill a copy&pasted spurious space in both functions while at it.

v2: As suggested by Andrew Morton, add a multipage parameter to both
functions to avoid the additional branch for the pagemap.c hotpath.
My gcc 4.6 here seems to dtrt and indeed reap these branches where not
needed.

v3: Because I couldn't find a way around adding a uaddr += PAGE_SIZE to
the filemap.c hotpaths (that the compiler couldn't remove again),
let's go with separate new functions for the multipage use-case.

v4: Adjust comment to CodingStyle and fix spelling.

Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2012-03-27 13:36:30 +02:00
Hugh Dickins 5e5358e7cf mm: cleanup descriptions of filler arg
The often-NULL data arg to read_cache_page() and read_mapping_page()
functions is misdescribed as "destination for read data": no, it's the
first arg to the filler function, often struct file * to ->readpage().

Satisfy checkpatch.pl on those filler prototypes, and tidy up the
declarations in linux/pagemap.h.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25 20:57:10 -07:00
Frederic Weisbecker bdd4e85dc3 sched: Isolate preempt counting in its own config option
Create a new CONFIG_PREEMPT_COUNT that handles the inc/dec
of the preempt count offset independently, so that the offset
can be updated by preempt_disable() and preempt_enable()
even without CONFIG_PREEMPT being set.

This prepares for making CONFIG_DEBUG_SPINLOCK_SLEEP work
with !CONFIG_PREEMPT, where it currently doesn't detect
code that sleeps inside explicitly preemption-disabled
sections.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
2011-06-10 15:15:40 +02:00
Wu Fengguang 7b1de5868b readahead: readahead page allocations are OK to fail
Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.

readahead page allocations are completely optional.  They are OK to fail
and in particular shall not trigger OOM on themselves.
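
A sketch of a pagecache allocation helper carrying those flags, along the
lines of what this change adds to pagemap.h:

    static inline struct page *page_cache_alloc_readahead(struct address_space *x)
    {
        /* readahead pages are optional: don't retry hard, don't warn */
        return __page_cache_alloc(mapping_gfp_mask(x) |
                                  __GFP_NORETRY | __GFP_NOWARN);
    }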

Reported-by: Dave Young <hidave.darkstar@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:25 -07:00
KOSAKI Motohiro f62e00cc3a mm: introduce wait_on_page_locked_killable()
commit 2687a356 ("Add lock_page_killable") introduced a killable
lock_page().  Similarly, this patch introduces a killable
wait_on_page_locked().
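
A minimal sketch of the new helper, mirroring wait_on_page_locked() and
assuming a killable page-bit wait primitive introduced alongside it:

    static inline int wait_on_page_locked_killable(struct page *page)
    {
        if (PageLocked(page))
            return wait_on_page_bit_killable(page, PG_locked);
        return 0;
    }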

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:08 -07:00
Linus Torvalds 6c51038900 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
  Documentation/iostats.txt: bit-size reference etc.
  cfq-iosched: removing unnecessary think time checking
  cfq-iosched: Don't clear queue stats when preempt.
  blk-throttle: Reset group slice when limits are changed
  blk-cgroup: Only give unaccounted_time under debug
  cfq-iosched: Don't set active queue in preempt
  block: fix non-atomic access to genhd inflight structures
  block: attempt to merge with existing requests on plug flush
  block: NULL dereference on error path in __blkdev_get()
  cfq-iosched: Don't update group weights when on service tree
  fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
  block: Require subsystems to explicitly allocate bio_set integrity mempool
  jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  fs: make fsync_buffers_list() plug
  mm: make generic_writepages() use plugging
  blk-cgroup: Add unaccounted time to timeslice_used.
  block: fixup plugging stubs for !CONFIG_BLOCK
  block: remove obsolete comments for blkdev_issue_zeroout.
  blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
  ...

Fix up conflicts in fs/{aio.c,super.c}
2011-03-24 10:16:26 -07:00
Minchan Kim e64a782fec mm: change __remove_from_page_cache()
Now that remove_from_page_cache() has been renamed to
delete_from_page_cache(), rename the internal page cache handling
function as well, for consistency with __remove_from_swap_cache() and
remove_from_swap_cache().

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Minchan Kim 702cfbf93a mm: goodbye remove_from_page_cache()
Now delete_from_page_cache() replaces remove_from_page_cache(), so
remove remove_from_page_cache() entirely; filesystems or other
out-of-mainline code will notice at compile time and can be fixed.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Minchan Kim 97cecb5a25 mm: introduce delete_from_page_cache()
Presently we increase the page refcount in add_to_page_cache() but don't
decrease it in remove_from_page_cache().  Such asymmetry adds confusion:
callers have to notice it, and a comment is needed to explain why they
release a page reference.  It's not a good API.

A long time ago, Hugh tried it (http://lkml.org/lkml/2004/10/24/140) but
gave up because reiser4's drop_page() had to unlock the page between
removing it from page cache and doing the page_cache_release().  But now
the situation has changed.  I think at least nothing in current mainline
has such obstacles.  The problem is for out-of-mainline filesystems: if
they have done such things as reiser4, this patch could be a problem, but
they will discover this at compile time since we remove
remove_from_page_cache().

This patch:

This function works as just a wrapper around remove_from_page_cache().
The difference is that it decreases the page reference itself, so callers
have to make sure they hold a page reference before calling it.

This patch is ready for removing remove_from_page_cache().
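
In other words, roughly the following (a sketch that ignores locking
details and the ->freepage callback):

    void delete_from_page_cache(struct page *page)
    {
        remove_from_page_cache(page);
        /* unlike remove_from_page_cache(), also drop the cache's reference */
        page_cache_release(page);
    }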

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Miklos Szeredi ef6a3c6311 mm: add replace_page_cache_page() function
This function basically does:

     remove_from_page_cache(old);
     page_cache_release(old);
     add_to_page_cache_locked(new);

Except it does this atomically, so there's no possibility for the "add" to
fail because of a race.

If memory cgroups are enabled, then the memory cgroup charge is also moved
from the old page to the new.

This function is currently used by fuse to move pages into the page cache
on read, instead of copying the page contents.

[minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Jens Axboe 7eaceaccab block: remove per-queue plugging
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So let's kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:07 +01:00
Steven Rostedt 088e54658f mm: remove likely() from mapping_unevictable()
The mapping_unevictable() has a likely() around the mapping parameter.
This mapping parameter comes from page_mapping() which has an unlikely()
that the page will be set as PAGE_MAPPING_ANON, and if so, it will return
NULL.  One would think that this unlikely() means that the mapping
returned by page_mapping() would not be NULL, but where page_mapping() is
used just above mapping_unevictable(), that unlikely() is incorrect most
of the time.  This means that the "likely(mapping)" in
mapping_unevictable() is incorrect most of the time.

Running the annotated branch profiler on my main box which runs firefox,
evolution, xchat and is part of my distcc farm, I had this:

 correct incorrect  %        Function                  File              Line
 ------- ---------  -        --------                  ----              ----
12872836 1269443893  98 mapping_unevictable            pagemap.h            51
35935762 1270265395  97 page_mapping                   mm.h                 659
1306198001   143659   0 page_mapping                   mm.h                 657
203131478   121586   0 page_mapping                   mm.h                 657
 5415491     1116   0 page_mapping                   mm.h                 657
74899487     1116   0 page_mapping                   mm.h                 657
203132845      224   0 page_mapping                   mm.h                 659
 5415464       27   0 page_mapping                   mm.h                 659
   13552        0   0 page_mapping                   mm.h                 657
   13552        0   0 page_mapping                   mm.h                 659
  242630        0   0 page_mapping                   mm.h                 657
  242630        0   0 page_mapping                   mm.h                 659
74899487        0   0 page_mapping                   mm.h                 659

The page_mapping() is a static inline, which is why it shows up multiple
times.  The mapping_unevictable() is also a static inline but seems to be
used only once in my setup.

The unlikely in page_mapping() was correct a total of 1909540379 times and
incorrect 1270533123 times, with 39% being incorrect.  Perhaps this is
enough to remove the unlikely from page_mapping() as well.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Nick Piggin <npiggin@kernel.dk>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:36 -08:00
Michel Lespinasse d065bd810b mm: retry page fault when blocking on disk transfer
This change reduces mmap_sem hold times that are caused by waiting for
disk transfers when accessing file mapped VMAs.

It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
site wants mmap_sem to be released if blocking on a pending disk transfer.
In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
do_page_fault() will then re-acquire mmap_sem and retry the page fault.

It is expected that the retry will hit the same page which will now be
cached, and thus it will complete with a low mmap_sem hold time.
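
A hedged sketch of the retry protocol, loosely modelled on an arch page
fault handler; fault_with_retry() is an illustrative name and error/range
checks are omitted:

    static void fault_with_retry(struct mm_struct *mm, unsigned long address)
    {
        struct vm_area_struct *vma;
        unsigned int flags = FAULT_FLAG_ALLOW_RETRY;
        int fault;

    retry:
        down_read(&mm->mmap_sem);
        vma = find_vma(mm, address);
        if (!vma) {
            up_read(&mm->mmap_sem);
            return;
        }

        fault = handle_mm_fault(mm, vma, address, flags);
        if (fault & VM_FAULT_RETRY) {
            /* the fault path already dropped mmap_sem while waiting for
             * the page; retry once, without ALLOW_RETRY so we cannot
             * loop forever */
            flags &= ~FAULT_FLAG_ALLOW_RETRY;
            goto retry;
        }

        up_read(&mm->mmap_sem);
    }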

Tests:

- microbenchmark: thread A mmaps a large file and does random read accesses
  to the mmaped area - achieves about 55 iterations/s. Thread B does
  mmap/munmap in a loop at a separate location - achieves 55 iterations/s
  before, 15000 iterations/s after.

- We are seeing related effects in some applications in house, which show
  significant performance regressions when running without this change.

[akpm@linux-foundation.org: fix warning & crash]
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:09 -07:00
Linus Torvalds 1021a64534 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
  hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
  hwpoison: rename CONFIG
  HWPOISON, hugetlb: support hwpoison injection for hugepage
  HWPOISON, hugetlb: detect hwpoison in hugetlb code
  HWPOISON, hugetlb: isolate corrupted hugepage
  HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
  HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
  HWPOISON, hugetlb: enable error handling path for hugepage
  hugetlb, rmap: add reverse mapping for hugepage
  hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h

Fix up trivial conflicts in mm/memory-failure.c
2010-08-12 10:15:10 -07:00
Naoya Horiguchi 0fe6e20b9c hugetlb, rmap: add reverse mapping for hugepage
This patch adds a reverse mapping feature for hugepages by introducing
a mapcount for shared/private-mapped hugepages and an anon_vma for
private-mapped hugepages.

While hugepages are not currently swappable, reverse mapping can be
useful for the memory error handler.

Without this patch, the memory error handler can neither identify the
processes using the bad hugepage nor unmap it from them. That is:
- for shared hugepage:
  we can collect processes using a hugepage through pagecache,
  but can not unmap the hugepage because of the lack of mapcount.
- for privately mapped hugepage:
  we can neither collect processes nor unmap the hugepage.
This patch solves these problems.

This patch includes the bug fix given by commit 23be7468e8, so it reverts that commit.

Dependency:
  "hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h"

ChangeLog since May 24.
- create hugetlb_inline.h and move is_vm_hugetlb_page() into it.
- move functions setting up anon_vma for hugepage into mm/rmap.c.

ChangeLog since May 13.
- rebased to 2.6.34
- fix logic error (in case that private mapping and shared mapping coexist)
- move is_vm_hugetlb_page() into include/linux/mm.h to use this function
  from linear_page_index()
- define and use linear_hugepage_index() instead of compound_order()
- use page_move_anon_rmap() in hugetlb_cow()
- copy exclusive switch of __set_page_anon_rmap() into hugepage counterpart.
- revert commit 24be7468 completely

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-08-11 09:21:15 +02:00
Naoya Horiguchi 8edf344c66 hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h
is_vm_hugetlb_page() is a widely used inline function to insert hooks
into hugetlb code.
But we can't use it in pagemap.h because of a circular dependency
between the header files. This patch removes that limitation.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-08-11 09:20:41 +02:00
Andi Kleen 627295e492 gcc-4.6: pagemap: avoid unused-but-set variable
Avoid quite a lot of warnings in header files in gcc 4.6 -Wall builds

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:44:58 -07:00
Linus Torvalds 0531b2aac5 mm: add new 'read_cache_page_gfp()' helper function
It's a simplified 'read_cache_page()' which takes a page allocation
flag, so that different paths can control how aggressive the memory
allocations are that populate an address space.

In particular, the intel GPU object mapping code wants to be able to do
a certain amount of its own internal memory management by automatically
shrinking the address space when memory starts getting tight.  This
allows it to dynamically use different memory allocation policies on a
per-allocation basis, rather than depend on the (static) address space
gfp policy.

The actual new function is a one-liner, but re-organizing the helper
functions to the point where you can do this with a single line of code
is what most of the patch is all about.
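
The one-liner ends up looking roughly like this, where do_read_cache_page()
and wait_on_page_read() stand for the reorganized internal helpers
mentioned above (a sketch, not a verbatim quote):

    struct page *read_cache_page_gfp(struct address_space *mapping,
                                     pgoff_t index, gfp_t gfp)
    {
        filler_t *filler = (filler_t *)mapping->a_ops->readpage;

        return wait_on_page_read(do_read_cache_page(mapping, index,
                                                    filler, NULL, gfp));
    }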

Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-27 09:20:03 -08:00
Paul E. McKenney b560d8ad85 rcu: Expunge lingering references to CONFIG_CLASSIC_RCU, optimize on !SMP
A couple of references to CONFIG_CLASSIC_RCU have survived.
Although these are harmless, it is past time for them to go.
The one in hardirq.h is strictly a readability problem.

The two in pagemap.h appear to disable a !SMP performance
optimization (which this patch re-enables).

This does raise the issue as to whether pagemap.h should really
be referring to the CPU implementation.  Long term, I intend to
make the RCU implementation driven by CONFIG_PREEMPT, at which
point these should change from defined(CONFIG_TREE_RCU) to
!defined(CONFIG_PREEMPT). In the meantime, is there something
else that could be done in pagemap.h?

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <20090822050851.GA8414@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-22 13:07:09 +02:00
KOSAKI Motohiro 6837765963 mm: remove CONFIG_UNEVICTABLE_LRU config option
Currently, nobody wants to turn UNEVICTABLE_LRU off.  Thus this
configurability is unnecessary.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:42 -07:00
David Howells 385e1ca5f2 CacheFiles: Permit the page lock state to be monitored
Add a function to install a monitor on the page lock waitqueue for a
particular page, thus allowing the unlocking of that page to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.
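
A sketch of such a monitor-installation helper; page_waitqueue() is the
same internal waitqueue lookup that lock_page()/unlock_page() use:

    void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
    {
        wait_queue_head_t *q = page_waitqueue(page);
        unsigned long flags;

        spin_lock_irqsave(&q->lock, flags);
        __add_wait_queue(q, waiter);
        spin_unlock_irqrestore(&q->lock, flags);
    }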

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03 16:42:39 +01:00
Lee Schermerhorn 9a896c9a48 mm: define a UNIQUE value for AS_UNEVICTABLE flag
A new "address_space flag"--AS_MM_ALL_LOCKS--was defined to use the next
available AS flag while the Unevictable LRU was under development.  The
Unevictable LRU was using the same flag and "no one" noticed.  Current
mainline, since 2.6.28, has the same value for two symbolic flag names.

So, define a unique flag value for AS_UNEVICTABLE--up close to the other
flags, [at the cost of an additional #ifdef] so we'll notice next time.
Note that #ifdef is not actually required, if we don't mind having the
unused flag value defined.

Replace #defines with an enum.
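
Roughly, the resulting enum (a sketch; the values share bit space with the
mapping's gfp mask):

    enum mapping_flags {
        AS_EIO          = __GFP_BITS_SHIFT + 0, /* IO error on async write */
        AS_ENOSPC       = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */
        AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
    #ifdef CONFIG_UNEVICTABLE_LRU
        AS_UNEVICTABLE  = __GFP_BITS_SHIFT + 3, /* e.g. ramdisk, SHM_LOCK */
    #endif
    };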

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: <stable@kernel.org>		[2.6.28.x, 2.6.29.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:49 -07:00
Nick Piggin 54566b2c15 fs: symlink write_begin allocation context fix
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened.  They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim.  This bug could
cause filesystem deadlocks.

The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock.  The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.

Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
this flag in their write_begin function.  Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).

This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in its write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg.  ocfs2_alloc_write_ctxt, for a
random example).
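
A simplified sketch of how the flag translates into the allocation mask
inside grab_cache_page_write_begin(), omitting the find_lock_page() fast
path and the -EEXIST retry:

    struct page *grab_cache_page_write_begin(struct address_space *mapping,
                                             pgoff_t index, unsigned flags)
    {
        gfp_t gfp_notmask = 0;
        struct page *page;

        /* __GFP_FS is the only context bit write_begin callers care about */
        if (flags & AOP_FLAG_NOFS)
            gfp_notmask = __GFP_FS;

        page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
        if (!page)
            return NULL;

        if (add_to_page_cache_lru(page, mapping, index,
                                  GFP_KERNEL & ~gfp_notmask)) {
            page_cache_release(page);
            return NULL;
        }
        return page;    /* returned locked, as write_begin expects */
    }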

[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>		[2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
  untouched to the grab_cache_page_write_begin() function.  That
  just simplifies everybody, and may even allow future expansion of the
  logic.   - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-04 13:33:20 -08:00
Nick Piggin 8413ac9d8c mm: page lock use lock bitops
trylock_page, unlock_page open and close a critical section. Hence,
we can use the lock bitops to get the desired memory ordering.

Also, mark trylock as likely to succeed (and remove the annotation from
callers).
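
For example, trylock_page() and unlock_page() can then be expressed with
the lock bitops (sketch):

    static inline int trylock_page(struct page *page)
    {
        return likely(!test_and_set_bit_lock(PG_locked, &page->flags));
    }

    void unlock_page(struct page *page)
    {
        VM_BUG_ON(!PageLocked(page));
        /* clear_bit_unlock() provides the release ordering we need */
        clear_bit_unlock(PG_locked, &page->flags);
        smp_mb__after_clear_bit();
        wake_up_page(page, PG_locked);
    }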

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:32 -07:00
Nick Piggin f45840b5c1 mm: pagecache insertion fewer atomics
Setting and clearing the page's locked bit when inserting it into the
swapcache / pagecache, while it has no other references, can use
non-atomic page flags operations, because no other CPU may be operating
on it at this time.

This saves one atomic operation when inserting a page into pagecache.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:52:31 -07:00
Lee Schermerhorn 89e004ea55 SHM_LOCKED pages are unevictable
Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
kept on the normal LRU, since scanning them is a waste of time and might
throw off kswapd's balancing algorithms.  Place them on the unevictable
LRU list instead.

Use the AS_UNEVICTABLE flag to mark the address_space of SHM_LOCKed shared
memory regions as unevictable.  These pages will then be culled off the
normal LRU lists during vmscan.

Add a new wrapper function to clear the mapping's unevictable state when/if
the shared memory segment is unlocked.

Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
the shmem segment's mapping [struct address_space] for evictability now
that they're no longer locked.  Any pages found to be evictable are moved
back to the appropriate zone lru list.

Changes depend on [CONFIG_]UNEVICTABLE_LRU.
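
A sketch of the shmctl() side, assuming the wrapper and scan helper names
this series adds (mapping_set_unevictable() and friends):

    #include <linux/pagemap.h>
    #include <linux/swap.h>

    static void shm_set_mapping_locked_sketch(struct address_space *mapping,
                                              int lock)
    {
        if (lock) {
            /* New faults will be culled to the unevictable LRU by
             * page_evictable(). */
            mapping_set_unevictable(mapping);
        } else {
            mapping_clear_unevictable(mapping);
            /* Pages already parked on the unevictable list must be
             * rescanned and returned to the normal LRUs. */
            scan_mapping_unevictable_pages(mapping);
        }
    }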

[kosaki.motohiro@jp.fujitsu.com: revert shm change]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:26 -07:00
Lee Schermerhorn ba9ddf4939 Ramfs and Ram Disk pages are unevictable
Christoph Lameter pointed out that ram disk pages also clutter the LRU
lists.  When vmscan finds them dirty and tries to clean them, the ram disk
writeback function just redirties the page so that it goes back onto the
active list.  Round and round she goes...

With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
longer the case, as ram disk pages are no longer maintained on the lru.
[This makes them unmigratable for defrag or memory hot remove, but that
can be addressed by a separate patch series.] However, the ramfs pages
behave like ram disk pages used to, so:

Define a new address_space flag [it shares the address_space flags member
with the mapping's gfp mask] to indicate that the address space contains
only unevictable pages.  This will provide for efficient testing of ramfs
pages in page_evictable().

Also provide wrapper functions to set/test the unevictable state, to
minimize #ifdefs in the ramfs driver and any other users of this facility.

Set the unevictable state on address_space structures for new ramfs
inodes.  Test the unevictable state in page_evictable() to cull
unevictable pages.

These changes depend on [CONFIG_]UNEVICTABLE_LRU.
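
A sketch of the wrapper idea -- the flag bit lives in the same mapping->flags
word as the gfp mask, and the #ifdef is confined to the header:

    #include <linux/bitops.h>
    #include <linux/pagemap.h>

    #ifdef CONFIG_UNEVICTABLE_LRU
    static inline void mapping_set_unevictable_sketch(struct address_space *mapping)
    {
        set_bit(AS_UNEVICTABLE, &mapping->flags);
    }

    static inline int mapping_unevictable_sketch(struct address_space *mapping)
    {
        return mapping && test_bit(AS_UNEVICTABLE, &mapping->flags);
    }
    #else
    static inline void mapping_set_unevictable_sketch(struct address_space *mapping) { }
    static inline int mapping_unevictable_sketch(struct address_space *mapping)
    {
        return 0;
    }
    #endif

ramfs inode setup then needs only the single set call, and page_evictable()
gets a cheap mapping test.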

[riel@redhat.com: undo the brd.c part]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20 08:50:26 -07:00
Nick Piggin 529ae9aaa0 mm: rename page trylock
Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

This also facilitates lockdeping of page lock.
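
An illustrative call-site conversion (not taken from any particular file):

    #include <linux/pagemap.h>

    static int poke_page_sketch(struct page *page)
    {
        /* old spelling, roughly:  if (!TestSetPageLocked(page)) { ... } */
        if (trylock_page(page)) {
            /* PG_locked is held here */
            unlock_page(page);
            return 1;
        }
        return 0;
    }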

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-08-04 21:31:34 -07:00
Nick Piggin ce0ad7f095 powerpc/mm: Lockless get_user_pages_fast() for 64-bit (v3)
Implement lockless get_user_pages_fast for 64-bit powerpc.

Page table existence is guaranteed with RCU, and speculative page references
are used to take a reference to the pages without having a prior existence
guarantee on them.
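
Conceptually, the per-pte step is the sketch below (simplified and generic;
the real powerpc code walks the actual page-table format, handles huge
pages, and guarantees page-table existence via RCU as described above):

    #include <linux/mm.h>

    static int gup_one_pte_sketch(pte_t pte, pte_t *ptep, int write,
                                  struct page **pages, int *nr)
    {
        struct page *page;

        if (!pte_present(pte) || (write && !pte_write(pte)))
            return 0;

        page = pte_page(pte);

        /* Speculative reference: succeeds only if the page still has users. */
        if (!get_page_unless_zero(page))
            return 0;

        /* The pte may have changed while we took the reference; if so,
         * drop it and let the caller fall back to the slow path. */
        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
            put_page(page);
            return 0;
        }

        pages[(*nr)++] = page;
        return 1;
    }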

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2008-07-30 15:26:54 +10:00
Andrea Arcangeli 7906d00cd1 mmu-notifiers: add mm_take_all_locks() operation
mm_take_all_locks holds off reclaim from an entire mm_struct.  This allows
mmu notifiers to register into the mm at any time with the guarantee that
no mmu operation is in progress on the mm.

This operation locks against the VM for all pte/vma/mm related operations
that could ever happen on a certain mm.  This includes vmtruncate,
try_to_unmap, and all page faults.

The caller must take the mmap_sem in write mode before calling
mm_take_all_locks().  The caller isn't allowed to release the mmap_sem
until mm_drop_all_locks() returns.

mmap_sem in write mode is required in order to block all operations that
could modify pagetables and free pages without need of altering the vma
layout (for example populate_range() with nonlinear vmas).  It's also
needed in write mode to prevent new anon_vmas from being associated with
existing vmas.

A single task can't take more than one mm_take_all_locks() in a row or it
would deadlock.

mm_take_all_locks() and mm_drop_all_locks() are expensive operations that
may have to take thousands of locks.

mm_take_all_locks() can fail if it's interrupted by signals.

When mmu_notifier_register returns, we must be sure that the driver is
notified if some task is in the middle of a vmtruncate for the 'mm' where
the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
is run around the vmtruncation but mmu_notifier_register can run after
mmu_notifier_invalidate_range_start and before
mmu_notifier_invalidate_range_end).  The same problem exists for the rmap
paths.  We also have to remove page pinning to avoid replicating the
tlb_gather logic inside KVM (and GRU doesn't work well with page pinning
regardless of needing tlb_gather); so, without mm_take_all_locks(), when
vmtruncate frees a page, KVM would have no way to notice that a page it had
mapped into sptes is going onto the freelist, with no chance of any further
mmu_notifier notification.
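
The calling convention boils down to the following shape (a simplified
sketch of a registration-style caller, not the exact mmu_notifier_register()
code):

    #include <linux/mm.h>
    #include <linux/rwsem.h>

    static int do_work_under_all_locks_sketch(struct mm_struct *mm)
    {
        int ret;

        down_write(&mm->mmap_sem);      /* mandatory before taking all locks */

        ret = mm_take_all_locks(mm);    /* 0 on success, -EINTR on a signal */
        if (ret)
            goto out;

        /*
         * No pte/vma/mm modification (vmtruncate, try_to_unmap, page
         * faults, ...) can be running against this mm here, so a notifier
         * can be hooked in without missing events.
         */

        mm_drop_all_locks(mm);
    out:
        up_write(&mm->mmap_sem);        /* only after mm_drop_all_locks() */
        return ret;
    }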

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28 16:30:21 -07:00
Nick Piggin e286781d5f mm: speculative page references
If we can be sure that elevating the page_count on a pagecache page will
pin it, we can speculatively run this operation, and subsequently check to
see if we hit the right page rather than relying on holding a lock or
otherwise pinning a reference to the page.

This can be done if get_page/put_page behaves consistently throughout the
whole tree (i.e. if we "get" the page after it has been used for something
else, we must be able to free it with a put_page).

Actually, there is a period where the count behaves differently: when the
page is free or if it is a constituent page of a compound page.  We need
an atomic_inc_not_zero operation to ensure we don't try to grab the page
in either case.

This patch introduces the core locking protocol to the pagecache (i.e. it
adds page_cache_get_speculative, and tweaks some update-side code to make
it work).

Thanks to Hugh for pointing out an improvement to the algorithm: setting
page_count to zero when we have control of all references, in order to
hold off speculative getters.
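
A sketch of the lookup-side protocol this enables (simplified; the in-tree
lockless lookup works on radix-tree slots and has more retry cases):

    #include <linux/pagemap.h>
    #include <linux/radix-tree.h>
    #include <linux/rcupdate.h>

    static struct page *speculative_find_get_sketch(struct address_space *mapping,
                                                    pgoff_t index)
    {
        struct page *page;

        rcu_read_lock();
    repeat:
        page = radix_tree_lookup(&mapping->page_tree, index);
        if (page) {
            /* Fails when the count is zero, i.e. the page is being freed. */
            if (!page_cache_get_speculative(page))
                goto repeat;

            /* We may have grabbed a page that was freed and reused for
             * another offset meanwhile -- verify and retry if so. */
            if (page != radix_tree_lookup(&mapping->page_tree, index)) {
                page_cache_release(page);
                goto repeat;
            }
        }
        rcu_read_unlock();
        return page;
    }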

[kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
[hugh@veritas.com: fix add_to_page_cache]
[akpm@linux-foundation.org: repair a comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:06 -07:00
Andrew Morton 2185e69f68 mapping_set_error: add unlikely()
This is called on a per-page basis and in the vast majority of cases
`error' is zero.
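
The helper is tiny; roughly (a sketch, close to but not claimed to be the
exact in-tree version):

    #include <linux/errno.h>
    #include <linux/pagemap.h>

    static inline void mapping_set_error_sketch(struct address_space *mapping,
                                                int error)
    {
        if (unlikely(error)) {          /* the common case is error == 0 */
            if (error == -ENOSPC)
                set_bit(AS_ENOSPC, &mapping->flags);
            else
                set_bit(AS_EIO, &mapping->flags);
        }
    }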

Cc: Guillaume Chazarain <guichaz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:15 -07:00
Harvey Harrison b3c9752868 include/linux: Remove all users of FASTCALL() macro
FASTCALL() is always expanded to empty, remove it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 16:21:18 -08:00
Matthew Wilcox 2687a3569e Add lock_page_killable
This routine is like lock_page(), but it can be interrupted by a fatal
signal.
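
Typical intended use (sketch):

    #include <linux/errno.h>
    #include <linux/pagemap.h>

    static int with_locked_page_sketch(struct page *page)
    {
        /* Sleeps like lock_page(), but a fatal signal aborts the wait;
         * a nonzero return means we do NOT hold the page lock. */
        if (lock_page_killable(page))
            return -EINTR;

        /* ... operate on the locked page ... */

        unlock_page(page);
        return 0;
    }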

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2007-12-06 17:35:41 -05:00
Nick Piggin afddba49d1 fs: introduce write_begin, write_end, and perform_write aops
These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).
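
The new hooks have roughly the following shape (a sketch of the prototypes;
see the in-tree filesystem documentation for the authoritative definitions):

    #include <linux/fs.h>

    struct aops_shape_sketch {
        int (*write_begin)(struct file *file, struct address_space *mapping,
                           loff_t pos, unsigned len, unsigned flags,
                           struct page **pagep, void **fsdata);
        int (*write_end)(struct file *file, struct address_space *mapping,
                         loff_t pos, unsigned len, unsigned copied,
                         struct page *page, void *fsdata);
    };

write_begin hands back a locked (possibly !uptodate) page plus an opaque
fsdata cookie; write_end is told how many bytes were actually copied, which
is what lets the core tolerate short atomic usercopies.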

[mark.fasheh@oracle.com: API design contributions, code review and fixes]
[akpm@linux-foundation.org: various fixes]
[dmonakhov@sw.ru: new aop block_write_begin fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:55 -07:00
Nick Piggin 08291429cf mm: fix pagecache write deadlocks
Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

1.  generic_buffered_write
2.   lock_page
3.    prepare_write
4.     unlock_page+vmtruncate
5.     copy_from_user
6.      mmap_sem(r)
7.       handle_mm_fault
8.        lock_page (filemap_nopage)
9.    commit_write
10.  unlock_page

a. sys_munmap / sys_mlock / others
b.  mmap_sem(w)
c.   make_pages_present
d.    get_user_pages
e.     handle_mm_fault
f.      lock_page (filemap_nopage)

2,8	- recursive deadlock if page is same
2,8;2,8	- ABBA deadlock if page is different
2,6;b,f	- ABBA deadlock if page is same

The solution is as follows:
1.  If we find the destination page is uptodate, continue as normal, but use
    atomic usercopies which do not take pagefaults and do not zero the uncopied
    tail of the destination. The destination is already uptodate, so we can
    commit_write the full length even if there was a partial copy: it does not
    matter that the tail was not modified, because if it is dirtied and written
    back to disk it will not cause any problems (uptodate *means* that the
    destination page is as new or newer than the copy on disk).

1a. The above requires that fault_in_pages_readable correctly returns access
    information, because atomic usercopies cannot distinguish a non-present
    page in a readable mapping from the lack of a readable mapping.

2.  If we find the destination page is not uptodate, unlock it (this could be
    made slightly more optimal), then allocate a temporary page to copy the
    source data into. Relock the destination page and continue with the copy.
    However, instead of a usercopy (which might take a fault), copy the data
    from the pinned temporary page via the kernel address space.

(also, rename maxlen to seglen, because it was confusing)

This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.
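
The key ingredient of step 1 is the atomic usercopy done while PG_locked is
held; conceptually (a heavily simplified sketch, not the real generic write
path):

    #include <linux/highmem.h>
    #include <linux/pagemap.h>
    #include <linux/uaccess.h>

    /* Caller has already done fault_in_pages_readable(buf, bytes) *before*
     * taking the page lock, and holds the page lock here. */
    static size_t copy_into_locked_page_sketch(struct page *page, unsigned offset,
                                               const char __user *buf,
                                               unsigned bytes)
    {
        char *kaddr;
        size_t left;

        kaddr = kmap_atomic(page, KM_USER0);
        pagefault_disable();    /* shown explicitly; kmap_atomic disables too */
        /* Cannot fault here, so none of the lock chains above can form. */
        left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
        pagefault_enable();
        kunmap_atomic(kaddr, KM_USER0);

        /* A short copy means the source was not resident; the caller
         * unlocks, faults it in, and retries (or uses the temporary-page
         * fallback for !uptodate destinations). */
        return bytes - left;
    }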

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:54 -07:00
Fengguang Wu 57f6b96c09 filemap: convert some unsigned long to pgoff_t
Convert some 'unsigned long' to pgoff_t.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Guillaume Chazarain 3e9f45bd18 Factor outstanding I/O error handling
Cleanup: setting an outstanding error on a mapping was open-coded in too
many places.  Factor it out into mapping_set_error().

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:57 -07:00
Nick Piggin 6fe6900e1e mm: make read_cache_page synchronous
Ensure pages are uptodate after returning from read_cache_page, which allows
us to cut out most of the filesystem-internal PageUptodate calls.

I didn't take a thorough look down the call chains, but this appears to fix
7 possible use-before-uptodate bugs in hfs, 2 in hfsplus, 1 in jfs, a few in
ecryptfs, 1 in jffs2, and a possible case of cleared data being overwritten
by readpage in block2mtd -- all depending on whether the filler is async
and/or can return with a !uptodate page.
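
A sketch of the call-site simplification this allows (hypothetical helper):

    #include <linux/err.h>
    #include <linux/pagemap.h>

    static struct page *read_meta_page_sketch(struct address_space *mapping,
                                              pgoff_t index)
    {
        struct page *page = read_mapping_page(mapping, index, NULL);

        if (IS_ERR(page))
            return page;

        /* No PageUptodate()/wait_on_page_locked() dance is needed any more:
         * a successful return now guarantees an uptodate page. */
        return page;
    }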

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00