UBSAN reports signed integer overflow in kernel/futex.c:
UBSAN: Undefined behaviour in kernel/futex.c:2041:18
signed integer overflow:
0 - -2147483648 cannot be represented in type 'int'
Add a sanity check to catch negative values of nr_wake and nr_requeue.
Signed-off-by: Li Jinyue <lijinyue@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Erick Reyes <erickreyes@google.com>
Signed-off-by: Oleg Matcovschi <omatcovschi@google.com>
Cc: peterz@infradead.org
Cc: dvhart@infradead.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1513242294-31786-1-git-send-email-lijinyue@huawei.com
Cherry-picked from fbe0e839d1e22d88810f3ee3e2f1479be4c0aa4a
Bug: 76106267
Change-Id: I954cc2848678318b60ec3f103d0c15f87b4605a4
commit 65d8fc777f6dcfee12785c057a6b57f679641c90 upstream.
When dealing with key handling for shared futexes, we can drastically reduce
the usage/need of the page lock. 1) For anonymous pages, the associated futex
object is the mm_struct which does not require the page lock. 2) For inode
based, keys, we can check under RCU read lock if the page mapping is still
valid and take reference to the inode. This just leaves one rare race that
requires the page lock in the slow path when examining the swapcache.
Additionally realtime users currently have a problem with the page lock being
contended for unbounded periods of time during futex operations.
Task A
get_futex_key()
lock_page()
---> preempted
Now any other task trying to lock that page will have to wait until
task A gets scheduled back in, which is an unbound time.
With this patch, we pretty much have a lockless futex_get_key().
Experiments show that this patch can boost/speedup the hashing of shared
futexes with the perf futex benchmarks (which is good for measuring such
change) by up to 45% when there are high (> 100) thread counts on a 60 core
Westmere. Lower counts are pretty much in the noise range or less than 10%,
but mid range can be seen at over 30% overall throughput (hash ops/sec).
This makes anon-mem shared futexes much closer to its private counterpart.
Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Ported on top of thp refcount rework, changelog, comments, fixes. ]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Mason <clm@fb.com>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1455045314-8305-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 74250718
Change-Id: I80e300aa740a553fbbbf84143d6a4165f31c8a90
Signed-off-by: David Lin <dtwlin@google.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZUiosAAoJEE44bZycYXAvcHYP/1OKMYQB/3G7GfEhMXdlpV31
VjdzUg5X1JOE60anYNopvWQJgDFXMy9mTceUI3axDkfYb5iDFUpRBFEh70ggDL04
bGB/J4n2Linjkj35u+S5P3fK6qBfg9+VDpTfUYPZGB5YjOjmaD06E8InBF8iUuC3
6pkMtQKOptmKOc2hw84PsB3qm9ER2MMa92Lrs1rtcOihEqQMyKjkI/kzogs8XGje
5gMt31VweScZed3d7i1r9tl/DTmzGcpEyVpz/x8gI7Xwi69FeeLy6cWbhK0VOsLA
u7ul9mDa77bUC/jpBzJmIkS8fhzaTyUw8NQbtol9RSSIfzb+mvXyx9Vr7o4LYK2B
P6AekC16x6R8KUED1hfxKdagguRACDfKf91bMAxDCN/PXqITVbk3RxxxH6wHAvOx
Ihf4G5h800/ks6X1oMBYZcbFFbNCUHZjyL7V1M/iy1TrKuRhEtou4Ft3X+gOauLS
CG8VR9Jo1/BAvMaJmy5Hg9RPNoxEMstDi6x3ugD0wH57XHSZ5QmFMBzCbuWR6hWM
q1DvBK/I54BXlsdYU9WySn1hm2gKCNPZ+zGzLTo1l426vme+YjhC5911V7Tv+WHm
lc5FTXWtXGhoAZuNSIGDrlv3Dyq44iMNrqXrhlPmJjWD3Hx4hFGGp2GyHOpK+5+7
7egPk9m1WrhUKzA9m1/M
=InCr
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEJDfLduVEy2qz2d/TmXOSYMtstxYFAlpqfQUACgkQmXOSYMts
txZNghAApD/SW4fTOx6RZFCPVjAP70FfXvZsQYf3Zfp44Ytm2Kax3GIABPuknlI+
IZRAPnXb6KP8DNDdCyGcJ0avI5uw96sXyeZWlDZyeS1WHHizJq3+BLB09zzdegSk
K1dJrobXCYNESmcQMT5diGwqLYkdOs3hh7Ehqut29njwCzVzNG3n43H9F15o9cUZ
6lAM8/Zb6ai+0KgVgwC40QJneVltDEFfXVr6wo/IJXnYNaRCPKQM5lsG09pxxopG
NVSsmUyeJI5bPWEm5vbuBL2JVhaCcMtTfAPHflqbtykE8eSVEWdTeCWPuGWcATB+
2sGp3cVR2W7+4CHpbcnrXolmP/OI3jXHbG1LvyRqg4Iw1jgtZ8wwjCEkdsPz3fED
g2+EtSYl/NLW7N8P4KQV9jzihYIfELBj9HQsEs5aPOstyjyxl12RxJvjw835v5ts
oa7qKQAHIwZsuaB34qK+DjI5coNeKRvDMy5mm0GL3TqmLLFEzSVpaTceGpdvNLi0
6k3RkuJzU0TwAoTShWyYu6AbV+8aHniBQbjzYs5sufRgDy9pjnfWzDqtUM+chTsm
WaxwhpHdpOomwAfZr8/Zaf0xIxP/M99SFKevntE04Ft93P8dKuLqFcNAjQkMdibY
UHrJ67nBllmDtlH8yGO9j4FD89O0QaBX4J3qGyIu5eE73/iibvo=
=J7vi
-----END PGP SIGNATURE-----
Merge 3.10.107 into android-msm-bullhead-3.10-oreo-m5
Changes in 3.10.107: (270 commits)
Revert "Btrfs: don't delay inode ref updates during log, replay"
Btrfs: fix memory leak in reading btree blocks
ext4: use more strict checks for inodes_per_block on mount
ext4: fix in-superblock mount options processing
ext4: add sanity checking to count_overhead()
ext4: validate s_first_meta_bg at mount time
jbd2: don't leak modified metadata buffers on an aborted journal
ext4: fix fencepost in s_first_meta_bg validation
ext4: trim allocation requests to group size
ext4: preserve the needs_recovery flag when the journal is aborted
ext4: return EROFS if device is r/o and journal replay is needed
ext4: fix inode checksum calculation problem if i_extra_size is small
block: fix use-after-free in sys_ioprio_get()
block: allow WRITE_SAME commands with the SG_IO ioctl
block: fix del_gendisk() vs blkdev_ioctl crash
dm crypt: mark key as invalid until properly loaded
dm space map metadata: fix 'struct sm_metadata' leak on failed create
md/raid5: limit request size according to implementation limits
md:raid1: fix a dead loop when read from a WriteMostly disk
md linear: fix a race between linear_add() and linear_congested()
CIFS: Fix a possible memory corruption during reconnect
CIFS: Fix missing nls unload in smb2_reconnect()
CIFS: Fix a possible memory corruption in push locks
CIFS: remove bad_network_name flag
fs/cifs: make share unaccessible at root level mountable
cifs: Do not send echoes before Negotiate is complete
ocfs2: fix crash caused by stale lvb with fsdlm plugin
ocfs2: fix BUG_ON() in ocfs2_ci_checkpointed()
can: raw: raw_setsockopt: limit number of can_filter that can be set
can: peak: fix bad memory access and free sequence
can: c_can_pci: fix null-pointer-deref in c_can_start() - set device pointer
can: ti_hecc: add missing prepare and unprepare of the clock
can: bcm: fix hrtimer/tasklet termination in bcm op removal
can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
ALSA: hda - Fix up GPIO for ASUS ROG Ranger
ALSA: seq: Fix race at creating a queue
ALSA: seq: Don't handle loop timeout at snd_seq_pool_done()
ALSA: timer: Reject user params with too small ticks
ALSA: seq: Fix link corruption by event error handling
ALSA: seq: Fix racy cell insertions during snd_seq_pool_done()
ALSA: seq: Fix race during FIFO resize
ALSA: seq: Don't break snd_use_lock_sync() loop by timeout
ALSA: usb-audio: Add QuickCam Communicate Deluxe/S7500 to volume_control_quirks
usb: gadgetfs: restrict upper bound on device configuration size
USB: gadgetfs: fix unbounded memory allocation bug
USB: gadgetfs: fix use-after-free bug
USB: gadgetfs: fix checks of wTotalLength in config descriptors
xhci: free xhci virtual devices with leaf nodes first
USB: serial: io_ti: bind to interface after fw download
usb: gadget: composite: always set ep->mult to a sensible value
USB: cdc-acm: fix double usb_autopm_put_interface() in acm_port_activate()
USB: cdc-acm: fix open and suspend race
USB: cdc-acm: fix failed open not being detected
usb: dwc3: gadget: make Set Endpoint Configuration macros safe
usb: host: xhci-plat: Fix timeout on removal of hot pluggable xhci controllers
usb: dwc3: gadget: delay unmap of bounced requests
usb: hub: Wait for connection to be reestablished after port reset
usb: gadget: composite: correctly initialize ep->maxpacket
USB: UHCI: report non-PME wakeup signalling for Intel hardware
arm/xen: Use alloc_percpu rather than __alloc_percpu
xfs: set AGI buffer type in xlog_recover_clear_agi_bucket
xfs: clear _XBF_PAGES from buffers when readahead page
ssb: Fix error routine when fallback SPROM fails
drivers/gpu/drm/ast: Fix infinite loop if read fails
scsi: avoid a permanent stop of the scsi device's request queue
scsi: move the nr_phys_segments assert into scsi_init_io
scsi: don't BUG_ON() empty DMA transfers
scsi: storvsc: properly handle SRB_ERROR when sense message is present
scsi: storvsc: properly set residual data length on errors
target/pscsi: Fix TYPE_TAPE + TYPE_MEDIMUM_CHANGER export
scsi: lpfc: Add shutdown method for kexec
scsi: sr: Sanity check returned mode data
scsi: sd: Fix capacity calculation with 32-bit sector_t
s390/vmlogrdr: fix IUCV buffer allocation
libceph: verify authorize reply on connect
nfs_write_end(): fix handling of short copies
powerpc/ps3: Fix system hang with GCC 5 builds
sg_write()/bsg_write() is not fit to be called under KERNEL_DS
ftrace/x86: Set ftrace_stub to weak to prevent gcc from using short jumps to it
cred/userns: define current_user_ns() as a function
net: ti: cpmac: Fix compiler warning due to type confusion
tick/broadcast: Prevent NULL pointer dereference
netvsc: reduce maximum GSO size
drop_monitor: add missing call to genlmsg_end
drop_monitor: consider inserted data in genlmsg_end
igmp: Make igmp group member RFC 3376 compliant
HID: hid-cypress: validate length of report
Input: xpad - use correct product id for x360w controllers
Input: i8042 - add noloop quirk for Dell Embedded Box PC 3000
Input: iforce - validate number of endpoints before using them
Input: kbtab - validate number of endpoints before using them
Input: joydev - do not report stale values on first open
Input: tca8418 - use the interrupt trigger from the device tree
Input: mpr121 - handle multiple bits change of status register
Input: mpr121 - set missing event capability
Input: i8042 - add Clevo P650RS to the i8042 reset list
i2c: fix kernel memory disclosure in dev interface
vme: Fix wrong pointer utilization in ca91cx42_slave_get
sysrq: attach sysrq handler correctly for 32-bit kernel
pinctrl: sh-pfc: Do not unconditionally support PIN_CONFIG_BIAS_DISABLE
x86/PCI: Ignore _CRS on Supermicro X8DTH-i/6/iF/6F
qla2xxx: Fix crash due to null pointer access
ARM: 8634/1: hw_breakpoint: blacklist Scorpion CPUs
ARM: dts: da850-evm: fix read access to SPI flash
NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENT
vmxnet3: Wake queue from reset work
Fix memory leaks in cifs_do_mount()
Compare prepaths when comparing superblocks
Move check for prefix path to within cifs_get_root()
Fix regression which breaks DFS mounting
apparmor: fix uninitialized lsm_audit member
apparmor: exec should not be returning ENOENT when it denies
apparmor: fix disconnected bind mnts reconnection
apparmor: internal paths should be treated as disconnected
apparmor: check that xindex is in trans_table bounds
apparmor: add missing id bounds check on dfa verification
apparmor: don't check for vmalloc_addr if kvzalloc() failed
apparmor: fix oops in profile_unpack() when policy_db is not present
apparmor: fix module parameters can be changed after policy is locked
apparmor: do not expose kernel stack
vfio/pci: Fix integer overflows, bitmask check
bna: Add synchronization for tx ring.
sg: Fix double-free when drives detach during SG_IO
move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)
serial: 8250_pci: Detach low-level driver during PCI error recovery
bnx2x: Correct ringparam estimate when DOWN
tile/ptrace: Preserve previous registers for short regset write
sysctl: fix proc_doulongvec_ms_jiffies_minmax()
ISDN: eicon: silence misleading array-bounds warning
ARC: [arcompact] handle unaligned access delay slot corner case
parisc: Don't use BITS_PER_LONG in userspace-exported swab.h header
nfs: Don't increment lock sequence ID after NFS4ERR_MOVED
ipv6: addrconf: Avoid addrconf_disable_change() using RCU read-side lock
af_unix: move unix_mknod() out of bindlock
drm/nouveau/nv1a,nv1f/disp: fix memory clock rate retrieval
crypto: api - Clear CRYPTO_ALG_DEAD bit before registering an alg
ata: sata_mv:- Handle return value of devm_ioremap.
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
mm, fs: check for fatal signals in do_generic_file_read()
ARC: [arcompact] brown paper bag bug in unaligned access delay slot fixup
sched/debug: Don't dump sched debug info in SysRq-W
tcp: fix 0 divide in __tcp_select_window()
macvtap: read vnet_hdr_size once
packet: round up linear to header len
vfs: fix uninitialized flags in splice_to_pipe()
siano: make it work again with CONFIG_VMAP_STACK
futex: Move futex_init() to core_initcall
rtc: interface: ignore expired timers when enqueuing new timers
irda: Fix lockdep annotations in hashbin_delete().
tty: serial: msm: Fix module autoload
rtlwifi: rtl_usb: Fix for URB leaking when doing ifconfig up/down
af_packet: remove a stray tab in packet_set_ring()
MIPS: Fix special case in 64 bit IP checksumming.
mm: vmpressure: fix sending wrong events on underflow
ipc/shm: Fix shmat mmap nil-page protection
sd: get disk reference in sd_check_events()
samples/seccomp: fix 64-bit comparison macros
ath5k: drop bogus warning on drv_set_key with unsupported cipher
rdma_cm: fail iwarp accepts w/o connection params
NFSv4: fix getacl ERANGE for some ACL buffer sizes
bcma: use (get|put)_device when probing/removing device driver
powerpc/xmon: Fix data-breakpoint
KVM: VMX: use correct vmcs_read/write for guest segment selector/base
KVM: PPC: Book3S PR: Fix illegal opcode emulation
KVM: s390: fix task size check
s390: TASK_SIZE for kernel threads
xtensa: move parse_tag_fdt out of #ifdef CONFIG_BLK_DEV_INITRD
mac80211: flush delayed work when entering suspend
drm/ast: Fix test for VGA enabled
drm/ttm: Make sure BOs being swapped out are cacheable
fat: fix using uninitialized fields of fat_inode/fsinfo_inode
drivers: hv: Turn off write permission on the hypercall page
xhci: fix 10 second timeout on removal of PCI hotpluggable xhci controllers
crypto: improve gcc optimization flags for serpent and wp512
mtd: pmcmsp: use kstrndup instead of kmalloc+strncpy
cpmac: remove hopeless #warning
mvsas: fix misleading indentation
l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv
net: don't call strlen() on the user buffer in packet_bind_spkt()
dccp: Unlock sock before calling sk_free()
tcp: fix various issues for sockets morphing to listen state
uapi: fix linux/packet_diag.h userspace compilation error
ipv6: avoid write to a possibly cloned skb
dccp: fix memory leak during tear-down of unsuccessful connection request
futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
futex: Add missing error handling to FUTEX_REQUEUE_PI
give up on gcc ilog2() constant optimizations
cancel the setfilesize transation when io error happen
crypto: ghash-clmulni - Fix load failure
crypto: cryptd - Assign statesize properly
ACPI / video: skip evaluating _DOD when it does not exist
Drivers: hv: balloon: don't crash when memory is added in non-sorted order
s390/pci: fix use after free in dma_init
cpufreq: Fix and clean up show_cpuinfo_cur_freq()
igb: Workaround for igb i210 firmware issue
igb: add i211 to i210 PHY workaround
ipv4: provide stronger user input validation in nl_fib_input()
tcp: initialize icsk_ack.lrcvtime at session start time
ACM gadget: fix endianness in notifications
mmc: sdhci: Do not disable interrupts while waiting for clock
uvcvideo: uvc_scan_fallback() for webcams with broken chain
fbcon: Fix vc attr at deinit
crypto: algif_hash - avoid zero-sized array
virtio_balloon: init 1st buffer in stats vq
c6x/ptrace: Remove useless PTRACE_SETREGSET implementation
sparc/ptrace: Preserve previous registers for short regset write
metag/ptrace: Preserve previous registers for short regset write
metag/ptrace: Provide default TXSTATUS for short NT_PRSTATUS
metag/ptrace: Reject partial NT_METAG_RPIPE writes
libceph: force GFP_NOIO for socket allocations
ACPI: Fix incompatibility with mcount-based function graph tracing
ACPI / power: Avoid maybe-uninitialized warning
rtc: s35390a: make sure all members in the output are set
rtc: s35390a: implement reset routine as suggested by the reference
rtc: s35390a: improve irq handling
padata: avoid race in reordering
HID: hid-lg: Fix immediate disconnection of Logitech Rumblepad 2
HID: i2c-hid: Add sleep between POWER ON and RESET
drm/vmwgfx: NULL pointer dereference in vmw_surface_define_ioctl()
drm/vmwgfx: avoid calling vzalloc with a 0 size in vmw_get_cap_3d_ioctl()
drm/vmwgfx: Remove getparam error message
drm/vmwgfx: fix integer overflow in vmw_surface_define_ioctl()
Reset TreeId to zero on SMB2 TREE_CONNECT
metag/usercopy: Drop unused macros
metag/usercopy: Zero rest of buffer from copy_from_user
powerpc: Don't try to fix up misaligned load-with-reservation instructions
mm/mempolicy.c: fix error handling in set_mempolicy and mbind.
mtd: bcm47xxpart: fix parsing first block after aligned TRX
net/packet: fix overflow in check for priv area size
x86/vdso: Plug race between mapping and ELF header setup
iscsi-target: Fix TMR reference leak during session shutdown
iscsi-target: Drop work-around for legacy GlobalSAN initiator
xen, fbfront: fix connecting to backend
char: lack of bool string made CONFIG_DEVPORT always on
platform/x86: acer-wmi: setup accelerometer when machine has appropriate notify event
platform/x86: acer-wmi: setup accelerometer when ACPI device was found
mm: Tighten x86 /dev/mem with zeroing reads
virtio-console: avoid DMA from stack
catc: Combine failure cleanup code in catc_probe()
catc: Use heap buffer for memory size test
net: ipv6: check route protocol when deleting routes
Drivers: hv: don't leak memory in vmbus_establish_gpadl()
Drivers: hv: get rid of timeout in vmbus_open()
ubi/upd: Always flush after prepared for an update
x86/mce/AMD: Give a name to MCA bank 3 when accessed with legacy MSRs
powerpc: Reject binutils 2.24 when building little endian
net/packet: fix overflow in check for tp_frame_nr
net/packet: fix overflow in check for tp_reserve
tty: nozomi: avoid a harmless gcc warning
hostap: avoid uninitialized variable use in hfa384x_get_rid
gfs2: avoid uninitialized variable warning
net: neigh: guard against NULL solicit() method
sctp: listen on the sock only when it's state is listening or closed
ip6mr: fix notification device destruction
MIPS: Fix crash registers on non-crashing CPUs
RDS: Fix the atomicity for congestion map update
xen/x86: don't lose event interrupts
p9_client_readdir() fix
nfsd: check for oversized NFSv2/v3 arguments
ftrace/x86: Fix triple fault with graph tracing and suspend-to-ram
kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF)
tun: read vnet_hdr_sz once
printk: use rcuidle console tracepoint
ipv6: check raw payload size correctly in ioctl
x86: standardize mmap_rnd() usage
x86/mm/32: Enable full randomization on i386 and X86_32
mm: larger stack guard gap, between vmas
mm: fix new crash in unmapped_area_topdown()
Allow stack to grow up to address space limit
Linux 3.10.107
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Conflicts:
arch/x86/mm/mmap.c
drivers/mmc/host/sdhci.c
drivers/usb/host/xhci-plat.c
fs/ext4/super.c
kernel/sched/core.c
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWz1zgAAoJEDjbvchgkmk+yU8P/10DITNzrhCfz5wbhvvn9Uvo
7H1DziOora3u9h8/rz6xqgFEz2/9cZ03KoLcpGha7kEFBsvgVhN3uSI0YFpVV2mT
8/oh1ADdkky3Pld0f7gDGydDvrmgqx83/69SQ8hDQ8Mr2QTaKNvK05QGC2/EO9kI
OcUAXjdAGglmf5rfhNhXodG/F2DtsA55uCzeyuBhcPE3bM7d4/48pwr1b2tW2CR8
hsprRvSz+kGgHXQy8jYdxKEI66OC/i22xVnxEc8PZmPZ0fFfmszzc9nzhcseWfpe
0JGgfwAtM8Va+bX4kfvqPpc2qR0r8Z2iEKNnAHnGutOvSWvow0l1OEedsb/+s1J6
/AYlPIkgTxwLDAwBIymPgowkEMOPVZzPL0tkoZI8wjB+eqUxxLlIa2dNByCyUs/U
1xTy+0UDMMDXG911mJl+yZFvd4R7lQUavIEStmMQ+A/Go2KrATaqIM8WETBlm7oH
s3hZ3E+RBWmfD/6JQwsJNkwv6yWeaRXNE+bj8C1r/uBdPyGqX9T22OaIOlio+I71
XBNEM5mrTlNeNVIUIKW29qmLBxBrH2LLwpv/dRyfOfzfhi1B+dl9+3sJauvrSmWi
jrR1khGmmaZcfOT2DVmpwlDQCQcyMcy8S8RTTAHhhuNmWtSjdc3TcfRlHXvP0sOu
ruXBufxernb94E7sqsvF
=LW9r
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEJDfLduVEy2qz2d/TmXOSYMtstxYFAlpqdMoACgkQmXOSYMts
txYJKxAAkVgmXLjjtJbCUkYLzohjXabtfF9ekfy7UPRdBU+PPRC2c8tHcR6LCqXd
v+hEiI80h72BqEVE4y3ztFZlhbpSonIcmRrG+/gWsWcWmY9S0owilHwhmrl3uvmC
Fvso6+5oWVvVXuM8I4Ul/3bXmScVhv/rh22iN2hhOS7WgEVdqlhmYHC/KIpRK+rD
dyUQ2eONgr14FyGswgK0zLaFKXvKhQfEjvAu4KXJek0sIPIUEVdZ5xgS2v4eLigN
W0+ewi4DCTESCU8GCnZwwU1OIbe2De09sPIVwBM644bOIJRxOJxnL0a11IjwOaye
P9ne98G3M1vTruiM+/dA40eGh7kFiKKlIqCO1mf1IqrQSYq+sNEuDSmD9XY+huRZ
ktDue8NcUmFgJzJxeRYfdatCNF/esfdIzuzbFnw+Jr+EPACn6FiOXFgkJkUpo204
wvv+nOhiYlSJQT81jqmVTn3iGyvZIJd15uCEryguNt8LmLafGlztYBZ5dSUkejcu
nAipexnYGyrufD5XhshZlcBt1S1FCQZd3lUBETmqLzP+hiZG76ti96i2ro2hnyM5
TWva2zmC1Cp89l0dWJjtNSohD4S6226Jc6ebHTDO/67gpsj3dlbH3IR7rDqKXgof
AFltzPMYnfMPYuDmANTu7vqlJGI5974xrDA1hRAUN49YVxD5YKk=
=fJ2P
-----END PGP SIGNATURE-----
Merge 3.10.98 into android-msm-bullhead-3.10-oreo-m5
Changes in 3.10.98: (55 commits)
ALSA: seq: Fix double port list deletion
wan/x25: Fix use-after-free in x25_asy_open_tty()
staging/speakup: Use tty_ldisc_ref() for paste kworker
pty: fix possible use after free of tty->driver_data
pty: make sure super_block is still valid in final /dev/tty close
AIO: properly check iovec sizes
ext4: fix potential integer overflow
Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl
perf: Fix inherited events vs. tracepoint filters
ptrace: use fsuid, fsgid, effective creds for fs access checks
tools lib traceevent: Fix output of %llu for 64 bit values read on 32 bit machines
tracing: Fix freak link error caused by branch tracer
klist: fix starting point removed bug in klist iterators
scsi: restart list search after unlock in scsi_remove_target
scsi_sysfs: Fix queue_ramp_up_period return code
iscsi-target: Fix rx_login_comp hang after login failure
Fix a memory leak in scsi_host_dev_release()
SCSI: Fix NULL pointer dereference in runtime PM
iscsi-target: Fix potential dead-lock during node acl delete
SCSI: fix crashes in sd and sr runtime PM
drivers/scsi/sg.c: mark VMA as VM_IO to prevent migration
scsi_dh_rdac: always retry MODE SELECT on command lock violation
scsi: fix soft lockup in scsi_remove_target() on module removal
iio:ad7793: Fix ad7785 product ID
iio: lpc32xx_adc: fix warnings caused by enabling unprepared clock
iio:ad5064: Make sure ad5064_i2c_write() returns 0 on success
iio: adis_buffer: Fix out-of-bounds memory access
iio: dac: mcp4725: set iio name property in sysfs
cifs: fix erroneous return value
nfs: Fix race in __update_open_stateid()
udf: limit the maximum number of indirect extents in a row
udf: Prevent buffer overrun with multi-byte characters
udf: Check output buffer length when converting name to CS0
ARM: 8519/1: ICST: try other dividends than 1
ARM: 8517/1: ICST: avoid arithmetic overflow in icst_hz()
fuse: break infinite loop in fuse_fill_write_pages()
mm: soft-offline: check return value in second __get_any_page() call
Input: elantech - add Fujitsu Lifebook U745 to force crc_enabled
Input: elantech - mark protocols v2 and v3 as semi-mt
Input: i8042 - add Fujitsu Lifebook U745 to the nomux list
iommu/vt-d: Fix 64-bit accesses to 32-bit DMAR_GSTS_REG
mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone()
xhci: Fix list corruption in urb dequeue at host removal
m32r: fix m32104ut_defconfig build fail
dma-debug: switch check from _text to _stext
scripts/bloat-o-meter: fix python3 syntax error
memcg: only free spare array when readers are done
radix-tree: fix race in gang lookup
radix-tree: fix oops after radix_tree_iter_retry
intel_scu_ipcutil: underflow in scu_reg_access()
x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers
futex: Drop refcount if requeue_pi() acquired the rtmutex
ip6mr: call del_timer_sync() in ip6mr_free_table()
module: wrapper for symbol name.
Linux 3.10.98
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
commit c236c8e95a3d395b0494e7108f0d41cf36ec107c upstream.
While working on the futex code, I stumbled over this potential
use-after-free scenario. Dmitry triggered it later with syzkaller.
pi_mutex is a pointer into pi_state, which we drop the reference on in
unqueue_me_pi(). So any access to that pointer after that is bad.
Since other sites already do rt_mutex_unlock() with hb->lock held, see
for example futex_lock_pi(), simply move the unlock before
unqueue_me_pi().
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Willy Tarreau <w@1wt.eu>
commit 25f71d1c3e98ef0e52371746220d66458eac75bc upstream.
The UEVENT user mode helper is enabled before the initcalls are executed
and is available when the root filesystem has been mounted.
The user mode helper is triggered by device init calls and the executable
might use the futex syscall.
futex_init() is marked __initcall which maps to device_initcall, but there
is no guarantee that futex_init() is invoked _before_ the first device init
call which triggers the UEVENT user mode helper.
If the user mode helper uses the futex syscall before futex_init() then the
syscall crashes with a NULL pointer dereference because the futex subsystem
has not been initialized yet.
Move futex_init() to core_initcall so futexes are initialized before the
root filesystem is mounted and the usermode helper becomes available.
[ tglx: Rewrote changelog ]
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: jiang.biao2@zte.com.cn
Cc: jiang.zhengxiong@zte.com.cn
Cc: zhong.weidong@zte.com.cn
Cc: deng.huali@zte.com.cn
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1483085875-6130-1-git-send-email-yang.yang29@zte.com.cn
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Willy Tarreau <w@1wt.eu>
commit fb75a4282d0d9a3c7c44d940582c2d226cf3acfb upstream.
If the proxy lock in the requeue loop acquires the rtmutex for a
waiter then it acquired also refcount on the pi_state related to the
futex, but the waiter side does not drop the reference count.
Add the missing free_pi_state() call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Bhuvanesh_Surachari@mentor.com
Cc: Andy Lowe <Andy_Lowe@mentor.com>
Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream.
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* commit 'v3.10.49': (529 commits)
Linux 3.10.49
ACPI / battery: Retry to get battery information if failed during probing
x86, ioremap: Speed up check for RAM pages
Score: Modify the Makefile of Score, remove -mlong-calls for compiling
Score: The commit is for compiling successfully.
Score: Implement the function csum_ipv6_magic
score: normalize global variables exported by vmlinux.lds
rtmutex: Plug slow unlock race
rtmutex: Handle deadlock detection smarter
rtmutex: Detect changes in the pi lock chain
rtmutex: Fix deadlock detector for real
ring-buffer: Check if buffer exists before polling
drm/radeon: stop poisoning the GART TLB
drm/radeon: fix typo in golden register setup on evergreen
ext4: disable synchronous transaction batching if max_batch_time==0
ext4: clarify error count warning messages
ext4: fix unjournalled bg descriptor while initializing inode bitmap
dm io: fix a race condition in the wake up code for sync_io
Drivers: hv: vmbus: Fix a bug in the channel callback dispatch code
clk: spear3xx: Use proper control register offset
...
In addition to bringing in upstream commits, this merge also makes minor
changes to mainitain compatibility with upstream:
The definition of list_next_entry in qcrypto.c and ipa_dp.c has been
removed, as upstream has moved the definition to list.h. The implementation
of list_next_entry was identical between the two.
irq.c, for both arm and arm64 architecture, has had its calls to
__irq_set_affinity_locked updated to reflect changes to the API upstream.
Finally, as we have removed the sleep_length member variable of the
tick_sched struct, all changes made by upstream commit ec804bd do not
apply to our tree and have been removed from this merge. Only
kernel/time/tick-sched.c is impacted.
Change-Id: I63b7e0c1354812921c94804e1f3b33d1ad6ee3f1
Signed-off-by: Ian Maund <imaund@codeaurora.org>
The current implementation of lookup_pi_state has ambigous handling of
the TID value 0 in the user space futex. We can get into the kernel
even if the TID value is 0, because either there is a stale waiters
bit or the owner died bit is set or we are called from the requeue_pi
path or from user space just for fun.
The current code avoids an explicit sanity check for pid = 0 in case
that kernel internal state (waiters) are found for the user space
address. This can lead to state leakage and worse under some
circumstances.
Handle the cases explicit:
Waiter | pi_state | pi->owner | uTID | uODIED | ?
[1] NULL | --- | --- | 0 | 0/1 | Valid
[2] NULL | --- | --- | >0 | 0/1 | Valid
[3] Found | NULL | -- | Any | 0/1 | Invalid
[4] Found | Found | NULL | 0 | 1 | Valid
[5] Found | Found | NULL | >0 | 1 | Invalid
[6] Found | Found | task | 0 | 1 | Valid
[7] Found | Found | NULL | Any | 0 | Invalid
[8] Found | Found | task | ==taskTID | 0/1 | Valid
[9] Found | Found | task | 0 | 0 | Invalid
[10] Found | Found | task | !=taskTID | 0/1 | Invalid
[1] Indicates that the kernel can acquire the futex atomically. We
came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
[2] Valid, if TID does not belong to a kernel thread. If no matching
thread is found then it indicates that the owner TID has died.
[3] Invalid. The waiter is queued on a non PI futex
[4] Valid state after exit_robust_list(), which sets the user space
value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
[5] The user space value got manipulated between exit_robust_list()
and exit_pi_state_list()
[6] Valid state after exit_pi_state_list() which sets the new owner in
the pi_state but cannot access the user space value.
[7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
[8] Owner and user space value match
[9] There is no transient state which sets the user space TID to 0
except exit_robust_list(), but this is indicated by the
FUTEX_OWNER_DIED bit. See [4]
[10] There is no transient state which leaves owner and user space
TID out of sync.
Backport to 3.13
conflicts: kernel/futex.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
Git-commit: b50939f89dae691f8da9e0fd22e0eae0db1c0ae4
Git-repo: https://android.googlesource.com/kernel/common.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
If the owner died bit is set at futex_unlock_pi, we currently do not
cleanup the user space futex. So the owner TID of the current owner
(the unlocker) persists. That's observable inconsistant state,
especially when the ownership of the pi state got transferred.
Clean it up unconditionally.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
Git-commit: a2ec8e3dcdc6c93f574a0e22039b791cc5e14fa6
Git-repo: https://android.googlesource.com/kernel/common.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
We need to protect the atomic acquisition in the kernel against rogue
user space which sets the user space futex to 0, so the kernel side
acquisition succeeds while there is existing state in the kernel
associated to the real owner.
Verify whether the futex has waiters associated with kernel state. If
it has, return -EINVAL. The state is corrupted already, so no point in
cleaning it up. Subsequent calls will fail as well. Not our problem.
[ tglx: Use futex_top_waiter() and explain why we do not need to try
restoring the already corrupted user space state. ]
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Git-commit: 550c7910f0e2fd4f130fec2f17541f3614fdfaf9
Git-repo: https://android.googlesource.com/kernel/common.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
If uaddr == uaddr2, then we have broken the rule of only requeueing
from a non-pi futex to a pi futex with this call. If we attempt this,
then dangling pointers may be left for rt_waiter resulting in an
exploitable condition.
This change brings futex_requeue() into line with
futex_wait_requeue_pi() which performs the same check as per commit
6f7b0a2a5 (futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi())
[ tglx: Compare the resulting keys as well, as uaddrs might be
different depending on the mapping ]
Fixes CVE-2014-3153.
Reported-by: Pinkie Pie
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Git-commit: 5b3bbe42f5874081123fa3aa4dd85d5e5c806f54
Git-repo: https://android.googlesource.com/kernel/common.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
* commit 'v3.10.40': (203 commits)
Linux 3.10.40
ARC: !PREEMPT: Ensure Return to kernel mode is IRQ safe
drm: cirrus: add power management support
Input: synaptics - add min/max quirk for ThinkPad Edge E431
Input: synaptics - add min/max quirk for ThinkPad T431s, L440, L540, S1 Yoga and X1
lockd: ensure we tear down any live sockets when socket creation fails during lockd_up
dm thin: fix dangling bio in process_deferred_bios error path
dm transaction manager: fix corruption due to non-atomic transaction commit
Skip intel_crt_init for Dell XPS 8700
mtd: sm_ftl: heap corruption in sm_create_sysfs_attributes()
mtd: nuc900_nand: NULL dereference in nuc900_nand_enable()
mtd: atmel_nand: Disable subpage NAND write when using Atmel PMECC
tgafb: fix data copying
gpio: mxs: Allow for recursive enable_irq_wake() call
rtlwifi: rtl8188ee: initialize packet_beacon
rtlwifi: rtl8192se: Fix regression due to commit 1bf4bbb
rtlwifi: rtl8192se: Fix too long disable of IRQs
rtlwifi: rtl8192cu: Fix too long disable of IRQs
rtlwifi: rtl8188ee: Fix too long disable of IRQs
rtlwifi: rtl8723ae: Fix too long disable of IRQs
...
Change-Id: If5388cf980cb123e35e1b29275ba288c89c5aa18
Signed-off-by: Ian Maund <imaund@codeaurora.org>
commit 54a217887a7b658e2650c3feff22756ab80c7339 upstream.
The current implementation of lookup_pi_state has ambigous handling of
the TID value 0 in the user space futex. We can get into the kernel
even if the TID value is 0, because either there is a stale waiters bit
or the owner died bit is set or we are called from the requeue_pi path
or from user space just for fun.
The current code avoids an explicit sanity check for pid = 0 in case
that kernel internal state (waiters) are found for the user space
address. This can lead to state leakage and worse under some
circumstances.
Handle the cases explicit:
Waiter | pi_state | pi->owner | uTID | uODIED | ?
[1] NULL | --- | --- | 0 | 0/1 | Valid
[2] NULL | --- | --- | >0 | 0/1 | Valid
[3] Found | NULL | -- | Any | 0/1 | Invalid
[4] Found | Found | NULL | 0 | 1 | Valid
[5] Found | Found | NULL | >0 | 1 | Invalid
[6] Found | Found | task | 0 | 1 | Valid
[7] Found | Found | NULL | Any | 0 | Invalid
[8] Found | Found | task | ==taskTID | 0/1 | Valid
[9] Found | Found | task | 0 | 0 | Invalid
[10] Found | Found | task | !=taskTID | 0/1 | Invalid
[1] Indicates that the kernel can acquire the futex atomically. We
came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
[2] Valid, if TID does not belong to a kernel thread. If no matching
thread is found then it indicates that the owner TID has died.
[3] Invalid. The waiter is queued on a non PI futex
[4] Valid state after exit_robust_list(), which sets the user space
value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
[5] The user space value got manipulated between exit_robust_list()
and exit_pi_state_list()
[6] Valid state after exit_pi_state_list() which sets the new owner in
the pi_state but cannot access the user space value.
[7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
[8] Owner and user space value match
[9] There is no transient state which sets the user space TID to 0
except exit_robust_list(), but this is indicated by the
FUTEX_OWNER_DIED bit. See [4]
[10] There is no transient state which leaves owner and user space
TID out of sync.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 13fbca4c6ecd96ec1a1cfa2e4f2ce191fe928a5e upstream.
If the owner died bit is set at futex_unlock_pi, we currently do not
cleanup the user space futex. So the owner TID of the current owner
(the unlocker) persists. That's observable inconsistant state,
especially when the ownership of the pi state got transferred.
Clean it up unconditionally.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b3eaa9fc5cd0a4d74b18f6b8dc617aeaf1873270 upstream.
We need to protect the atomic acquisition in the kernel against rogue
user space which sets the user space futex to 0, so the kernel side
acquisition succeeds while there is existing state in the kernel
associated to the real owner.
Verify whether the futex has waiters associated with kernel state. If
it has, return -EINVAL. The state is corrupted already, so no point in
cleaning it up. Subsequent calls will fail as well. Not our problem.
[ tglx: Use futex_top_waiter() and explain why we do not need to try
restoring the already corrupted user space state. ]
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e9c243a5a6de0be8e584c604d353412584b592f8 upstream.
If uaddr == uaddr2, then we have broken the rule of only requeueing from
a non-pi futex to a pi futex with this call. If we attempt this, then
dangling pointers may be left for rt_waiter resulting in an exploitable
condition.
This change brings futex_requeue() in line with futex_wait_requeue_pi()
which performs the same check as per commit 6f7b0a2a5c ("futex: Forbid
uaddr == uaddr2 in futex_wait_requeue_pi()")
[ tglx: Compare the resulting keys as well, as uaddrs might be
different depending on the mapping ]
Fixes CVE-2014-3153.
Reported-by: Pinkie Pie
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f0d71b3dcb8332f7971b5f2363632573e6d9486a upstream.
We happily allow userspace to declare a random kernel thread to be the
owner of a user space PI futex.
Found while analysing the fallout of Dave Jones syscall fuzzer.
We also should validate the thread group for private futexes and find
some fast way to validate whether the "alleged" owner has RW access on
the file which backs the SHM, but that's a separate issue.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Carlos ODonell <carlos@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 866293ee54227584ffcb4a42f69c1f365974ba7f upstream.
Dave Jones trinity syscall fuzzer exposed an issue in the deadlock
detection code of rtmutex:
http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com
That underlying issue has been fixed with a patch to the rtmutex code,
but the futex code must not call into rtmutex in that case because
- it can detect that issue early
- it avoids a different and more complex fixup for backing out
If the user space variable got manipulated to 0x80000000 which means
no lock holder, but the waiters bit set and an active pi_state in the
kernel is found we can figure out the recursive locking issue by
looking at the pi_state owner. If that is the current task, then we
can safely return -EDEADLK.
The check should have been added in commit 59fa62451 (futex: Handle
futex_pi OWNER_DIED take over correctly) already, but I did not see
the above issue caused by user space manipulation back then.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Carlos ODonell <carlos@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 03b8c7b623c80af264c4c8d6111e5c6289933666 upstream.
If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there
is no runtime check necessary, allow to skip the test within futex_init().
This allows to get rid of some code which would always give the same result,
and also allows the compiler to optimize a couple of if statements away.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[geert: Backported to v3.10..v3.13]
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The following commits have been reverted from this merge, as they are
known to introduce new bugs and are currently incompatible with our
audio implementation. Investigation of these commits is ongoing, and
they are expected to be brought in at a later time:
86e6de7 ALSA: compress: fix drain calls blocking other compress functions (v6)
16442d4 ALSA: compress: fix drain calls blocking other compress functions
This merge commit also includes a change in block, necessary for
compilation. Upstream has modified elevator_init_fn to prevent race
conditions, requring updates to row_init_queue and test_init_queue.
* commit 'v3.10.28': (1964 commits)
Linux 3.10.28
ARM: 7938/1: OMAP4/highbank: Flush L2 cache before disabling
drm/i915: Don't grab crtc mutexes in intel_modeset_gem_init()
serial: amba-pl011: use port lock to guard control register access
mm: Make {,set}page_address() static inline if WANT_PAGE_VIRTUAL
md/raid5: Fix possible confusion when multiple write errors occur.
md/raid10: fix two bugs in handling of known-bad-blocks.
md/raid10: fix bug when raid10 recovery fails to recover a block.
md: fix problem when adding device to read-only array with bitmap.
drm/i915: fix DDI PLLs HW state readout code
nilfs2: fix segctor bug that causes file system corruption
thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only
ftrace/x86: Load ftrace_ops in parameter not the variable holding it
SELinux: Fix possible NULL pointer dereference in selinux_inode_permission()
writeback: Fix data corruption on NFS
hwmon: (coretemp) Fix truncated name of alarm attributes
vfs: In d_path don't call d_dname on a mount point
staging: comedi: adl_pci9111: fix incorrect irq passed to request_irq()
staging: comedi: addi_apci_1032: fix subdevice type/flags bug
mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
GFS2: Increase i_writecount during gfs2_setattr_chown
perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h
perf scripting perl: Fix build error on Fedora 12
ARM: 7815/1: kexec: offline non panic CPUs on Kdump panic
Linux 3.10.27
sched: Guarantee new group-entities always have weight
sched: Fix hrtimer_cancel()/rq->lock deadlock
sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
sched: Fix race on toggling cfs_bandwidth_used
x86, fpu, amd: Clear exceptions in AMD FXSAVE workaround
netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
SCSI: sd: Reduce buffer size for vpd request
intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.
mac80211: move "bufferable MMPDU" check to fix AP mode scan
ACPI / Battery: Add a _BIX quirk for NEC LZ750/LS
ACPI / TPM: fix memory leak when walking ACPI namespace
mfd: rtsx_pcr: Disable interrupts before cancelling delayed works
clk: exynos5250: fix sysmmu_mfc{l,r} gate clocks
clk: samsung: exynos5250: Add CLK_IGNORE_UNUSED flag for the sysreg clock
clk: samsung: exynos4: Correct SRC_MFC register
clk: clk-divider: fix divisor > 255 bug
ahci: add PCI ID for Marvell 88SE9170 SATA controller
parisc: Ensure full cache coherency for kmap/kunmap
drm/nouveau/bios: make jump conditional
ARM: shmobile: mackerel: Fix coherent DMA mask
ARM: shmobile: armadillo: Fix coherent DMA mask
ARM: shmobile: kzm9g: Fix coherent DMA mask
ARM: dts: exynos5250: Fix MDMA0 clock number
ARM: fix "bad mode in ... handler" message for undefined instructions
ARM: fix footbridge clockevent device
net: Loosen constraints for recalculating checksum in skb_segment()
bridge: use spin_lock_bh() in br_multicast_set_hash_max
netpoll: Fix missing TXQ unlock and and OOPS.
net: llc: fix use after free in llc_ui_recvmsg
virtio-net: fix refill races during restore
virtio_net: don't leak memory or block when too many frags
virtio-net: make all RX paths handle errors consistently
virtio_net: fix error handling for mergeable buffers
vlan: Fix header ops passthru when doing TX VLAN offload.
net: rose: restore old recvmsg behavior
rds: prevent dereference of a NULL device
ipv6: always set the new created dst's from in ip6_rt_copy
net: fec: fix potential use after free
hamradio/yam: fix info leak in ioctl
drivers/net/hamradio: Integer overflow in hdlcdrv_ioctl()
net: inet_diag: zero out uninitialized idiag_{src,dst} fields
ip_gre: fix msg_name parsing for recvfrom/recvmsg
net: unix: allow bind to fail on mutex lock
ipv6: fix illegal mac_header comparison on 32bit
netvsc: don't flush peers notifying work during setting mtu
tg3: Initialize REG_BASE_ADDR at PCI config offset 120 to 0
net: unix: allow set_peek_off to fail
net: drop_monitor: fix the value of maxattr
ipv6: don't count addrconf generated routes against gc limit
packet: fix send path when running with proto == 0
virtio: delete napi structures from netdev before releasing memory
macvtap: signal truncated packets
tun: update file current position
macvtap: update file current position
macvtap: Do not double-count received packets
rds: prevent BUG_ON triggered on congestion update to loopback
net: do not pretend FRAGLIST support
IPv6: Fixed support for blackhole and prohibit routes
HID: Revert "Revert "HID: Fix logitech-dj: missing Unifying device issue""
gpio-rcar: R-Car GPIO IRQ share interrupt
clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
irqchip: renesas-irqc: Fix irqc_probe error handling
Linux 3.10.26
sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
ext4: fix bigalloc regression
arm64: Use Normal NonCacheable memory for writecombine
arm64: Do not flush the D-cache for anonymous pages
arm64: Avoid cache flushing in flush_dcache_page()
ARM: KVM: arch_timers: zero CNTVOFF upon return to host
ARM: hyp: initialize CNTVOFF to zero
clocksource: arch_timer: use virtual counters
arm64: Remove unused cpu_name ascii in arch/arm64/mm/proc.S
arm64: dts: Reserve the memory used for secondary CPU release address
arm64: check for number of arguments in syscall_get/set_arguments()
arm64: fix possible invalid FPSIMD initialization state
...
Change-Id: Ia0e5d71b536ab49ec3a1179d59238c05bdd03106
Signed-off-by: Ian Maund <imaund@codeaurora.org>
commit f12d5bfceb7e1f9051563381ec047f7f13956c3c upstream.
The hugepage code had the exact same bug that regular pages had in
commit 7485d0d375 ("futexes: Remove rw parameter from
get_futex_key()").
The regular page case was fixed by commit 9ea71503a8 ("futex: Fix
regression with read only mappings"), but the transparent hugepage case
(added in a5b338f2b0b1: "thp: update futex compound knowledge") case
remained broken.
Found by Dave Jones and his trinity tool.
Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 13d60f4b6ab5b702dc8d2ee20999f98a93728aec upstream.
The futex_keys of process shared futexes are generated from the page
offset, the mapping host and the mapping index of the futex user space
address. This should result in an unique identifier for each futex.
Though this is not true when futexes are located in different subpages
of an hugepage. The reason is, that the mapping index for all those
futexes evaluates to the index of the base page of the hugetlbfs
mapping. So a futex at offset 0 of the hugepage mapping and another
one at offset PAGE_SIZE of the same hugepage mapping have identical
futex_keys. This happens because the futex code blindly uses
page->index.
Steps to reproduce the bug:
1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
mapping.
The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
their keys solely depend on the user space address.
2. Lock mutex1 and mutex2
3. Create thread1 and in the thread function lock mutex1, which
results in thread1 blocking on the locked mutex1.
4. Create thread2 and in the thread function lock mutex2, which
results in thread2 blocking on the locked mutex2.
5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
still blocks on mutex2 because the futex_key points to mutex1.
To solve this issue we need to take the normal page index of the page
which contains the futex into account, if the futex is in an hugetlbfs
mapping. In other words, we calculate the normal page mapping index of
the subpage in the hugetlbfs mapping.
Mappings which are not based on hugetlbfs are not affected and still
use page->index.
Thanks to Mel Gorman who provided a patch for adding proper evaluation
functions to the hugetlbfs code to avoid exposing hugetlbfs specific
details to the futex code.
[ tglx: Massaged changelog ]
Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn>
Reviewed-by: 'Mel Gorman' <mgorman@suse.de>
Acked-by: 'Darren Hart' <dvhart@linux.intel.com>
Cc: 'Peter Zijlstra' <peterz@infradead.org>
Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Avoid waking up every thread sleeping in a futex_wait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Change-Id: I9ccab9c2d201adb66c85432801cdcf43fc91e94f
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix kernel-doc warning in futex.c and convert 'Returns' to the new Return:
kernel-doc notation format.
Warning(kernel/futex.c:2286): Excess function parameter 'clockrt' description in 'futex_wait_requeue_pi'
Fix one spello.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core locking changes from Ingo Molnar:
"The biggest change is the rwsem lock-steal improvements, both to the
assembly optimized and the spinlock based variants.
The other notable change is the clean up of the seqlock implementation
to be based on the seqcount infrastructure.
The rest is assorted smaller debuggability, cleanup and continued -rt
locking changes."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rwsem-spinlock: Implement writer lock-stealing for better scalability
futex: Revert "futex: Mark get_robust_list as deprecated"
generic: Use raw local irq variant for generic cmpxchg
lockdep: Selftest: convert spinlock to raw spinlock
seqlock: Use seqcount infrastructure
seqlock: Remove unused functions
ntp: Make ntp_lock raw
intel_idle: Convert i7300_idle_lock to raw_spinlock
locking: Various static lock initializer fixes
lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
rwsem: Implement writer lock-stealing for better scalability
lockdep: Silence warning if CONFIG_LOCKDEP isn't set
watchdog: Use local_clock for get_timestamp()
lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
locking/stat: Fix a typo
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dave Jones reported a bug with futex_lock_pi() that his trinity test
exposed. Sometime between queue_me() and taking the q.lock_ptr, the
lock_ptr became NULL, resulting in a crash.
While futex_wake() is careful to not call wake_futex() on futex_q's with
a pi_state or an rt_waiter (which are either waiting for a
futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and
futex_requeue() do not perform the same test.
Update futex_wake_op() and futex_requeue() to test for q.pi_state and
q.rt_waiter and abort with -EINVAL if detected. To ensure any future
breakage is caught, add a WARN() to wake_futex() if the same condition
is true.
This fix has seen 3 hours of testing with "trinity -c futex" on an
x86_64 VM with 4 CPUS.
[akpm@linux-foundation.org: tidy up the WARN()]
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Reported-by: Dave Jones <davej@redat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Siddhesh analyzed a failure in the take over of pi futexes in case the
owner died and provided a workaround.
See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076
The detailed problem analysis shows:
Futex F is initialized with PTHREAD_PRIO_INHERIT and
PTHREAD_MUTEX_ROBUST_NP attributes.
T1 lock_futex_pi(F);
T2 lock_futex_pi(F);
--> T2 blocks on the futex and creates pi_state which is associated
to T1.
T1 exits
--> exit_robust_list() runs
--> Futex F userspace value TID field is set to 0 and
FUTEX_OWNER_DIED bit is set.
T3 lock_futex_pi(F);
--> Succeeds due to the check for F's userspace TID field == 0
--> Claims ownership of the futex and sets its own TID into the
userspace TID field of futex F
--> returns to user space
T1 --> exit_pi_state_list()
--> Transfers pi_state to waiter T2 and wakes T2 via
rt_mutex_unlock(&pi_state->mutex)
T2 --> acquires pi_state->mutex and gains real ownership of the
pi_state
--> Claims ownership of the futex and sets its own TID into the
userspace TID field of futex F
--> returns to user space
T3 --> observes inconsistent state
This problem is independent of UP/SMP, preemptible/non preemptible
kernels, or process shared vs. private. The only difference is that
certain configurations are more likely to expose it.
So as Siddhesh correctly analyzed the following check in
futex_lock_pi_atomic() is the culprit:
if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
We check the userspace value for a TID value of 0 and take over the
futex unconditionally if that's true.
AFAICT this check is there as it is correct for a different corner
case of futexes: the WAITERS bit became stale.
Now the proposed change
- if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
+ if (unlikely(ownerdied ||
+ !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {
solves the problem, but it's not obvious why and it wreckages the
"stale WAITERS bit" case.
What happens is, that due to the WAITERS bit being set (T2 is blocked
on that futex) it enforces T3 to go through lookup_pi_state(), which
in the above case returns an existing pi_state and therefor forces T3
to legitimately fight with T2 over the ownership of the pi_state (via
pi_state->mutex). Probelm solved!
Though that does not work for the "WAITERS bit is stale" problem
because if lookup_pi_state() does not find existing pi_state it
returns -ERSCH (due to TID == 0) which causes futex_lock_pi() to
return -ESRCH to user space because the OWNER_DIED bit is not set.
Now there is a different solution to that problem. Do not look at the
user space value at all and enforce a lookup of possibly available
pi_state. If pi_state can be found, then the new incoming locker T3
blocks on that pi_state and legitimately races with T2 to acquire the
rt_mutex and the pi_state and therefor the proper ownership of the
user space futex.
lookup_pi_state() has the correct order of checks. It first tries to
find a pi_state associated with the user space futex and only if that
fails it checks for futex TID value = 0. If no pi_state is available
nothing can create new state at that point because this happens with
the hash bucket lock held.
So the above scenario changes to:
T1 lock_futex_pi(F);
T2 lock_futex_pi(F);
--> T2 blocks on the futex and creates pi_state which is associated
to T1.
T1 exits
--> exit_robust_list() runs
--> Futex F userspace value TID field is set to 0 and
FUTEX_OWNER_DIED bit is set.
T3 lock_futex_pi(F);
--> Finds pi_state and blocks on pi_state->rt_mutex
T1 --> exit_pi_state_list()
--> Transfers pi_state to waiter T2 and wakes it via
rt_mutex_unlock(&pi_state->mutex)
T2 --> acquires pi_state->mutex and gains ownership of the pi_state
--> Claims ownership of the futex and sets its own TID into the
userspace TID field of futex F
--> returns to user space
This covers all gazillion points on which T3 might come in between
T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
also solves the "WAITERS bit stale" problem by forcing the take over.
Another benefit of changing the code this way is that it makes it less
dependent on untrusted user space values and therefor minimizes the
possible wreckage which might be inflicted.
As usual after staring for too long at the futex code my brain hurts
so much that I really want to ditch that whole optimization of
avoiding the syscall for the non contended case for PI futexes and rip
out the maze of corner case handling code. Unfortunately we can't as
user space relies on that existing behaviour, but at least thinking
about it helps me to preserve my mental sanity. Maybe we should
nevertheless :)
Reported-and-tested-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
Acked-by: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If uaddr == uaddr2, then we have broken the rule of only requeueing
from a non-pi futex to a pi futex with this call. If we attempt this,
as the trinity test suite manages to do, we miss early wakeups as
q.key is equal to key2 (because they are the same uaddr). We will then
attempt to dereference the pi_mutex (which would exist had the futex_q
been properly requeued to a pi futex) and trigger a NULL pointer
dereference.
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/ad82bfe7f7d130247fbe2b5b4275654807774227.1342809673.git.dvhart@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Notify get_robust_list users that the syscall is going away.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: kernel-hardening@lists.openwall.com
Cc: spender@grsecurity.net
Link: http://lkml.kernel.org/r/20120323190855.GA27213@www.outflux.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It was possible to extract the robust list head address from a setuid
process if it had used set_robust_list(), allowing an ASLR info leak. This
changes the permission checks to be the same as those used for similar
info that comes out of /proc.
Running a setuid program that uses robust futexes would have had:
cred->euid != pcred->euid
cred->euid == pcred->uid
so the old permissions check would allow it. I'm not aware of any setuid
programs that use robust futexes, so this is just a preventative measure.
(This patch is based on changes from grsecurity.)
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: kernel-hardening@lists.openwall.com
Cc: spender@grsecurity.net
Link: http://lkml.kernel.org/r/20120319231253.GA20893@www.outflux.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull core/locking changes for v3.4 from Ingo Molnar
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Simplify return logic
futex: Cover all PI opcodes with cmpxchg enabled check
No need to assign ret in each case and break. Simply return the result
of the handler function directly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Some of the newer futex PI opcodes do not check the cmpxchg enabled
variable and call unconditionally into the handling functions. Cover
all PI opcodes in a separate check.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@linux.intel.com>
It was found (by Sasha) that if you use a futex located in the gate
area we get stuck in an uninterruptible infinite loop, much like the
ZERO_PAGE issue.
While looking at this problem, PeterZ realized you'll get into similar
trouble when hitting any install_special_pages() mapping. And are there
still drivers setting up their own special mmaps without page->mapping,
and without special VM or pte flags to make get_user_pages fail?
In most cases, if page->mapping is NULL, we do not need to retry at all:
Linus points out that even /proc/sys/vm/drop_caches poses no problem,
because it ends up using remove_mapping(), which takes care not to
interfere when the page reference count is raised.
But there is still one case which does need a retry: if memory pressure
called shmem_writepage in between get_user_pages_fast dropping page
table lock and our acquiring page lock, then the page gets switched from
filecache to swapcache (and ->mapping set to NULL) whatever the refcount.
Fault it back in to get the page->mapping needed for key->shared.inode.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The changed files were only including linux/module.h for the
EXPORT_SYMBOL infrastructure, and nothing else. Revector them
onto the isolated export header for faster compile times.
Nothing to see here but a whole lot of instances of:
-#include <linux/module.h>
+#include <linux/export.h>
This commit is only changing the kernel dir; next targets
will probably be mm, fs, the arch dirs, etc.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Change a single occurrence of "unlcoked" into "unlocked".
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The variables here are really not used uninitialized.
kernel/futex.c: In function 'fixup_pi_state_owner.clone.17':
kernel/futex.c:1582:6: warning: 'curval' may be used uninitialized in this function
kernel/futex.c: In function 'handle_futex_death':
kernel/futex.c:2486:6: warning: 'nval' may be used uninitialized in this function
kernel/futex.c: In function 'do_futex':
kernel/futex.c:863:11: warning: 'curval' may be used uninitialized in this function
kernel/futex.c:828:6: note: 'curval' was declared here
kernel/futex.c:898:5: warning: 'oldval' may be used uninitialized in this function
kernel/futex.c:890:6: note: 'oldval' was declared here
Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
commit 7485d0d375 (futexes: Remove rw
parameter from get_futex_key()) in 2.6.33 fixed two problems: First, It
prevented a loop when encountering a ZERO_PAGE. Second, it fixed RW
MAP_PRIVATE futex operations by forcing the COW to occur by
unconditionally performing a write access get_user_pages_fast() to get
the page. The commit also introduced a user-mode regression in that it
broke futex operations on read-only memory maps. For example, this
breaks workloads that have one or more reader processes doing a
FUTEX_WAIT on a futex within a read only shared file mapping, and a
writer processes that has a writable mapping issuing the FUTEX_WAKE.
This fixes the regression for valid futex operations on RO mappings by
trying a RO get_user_pages_fast() when the RW get_user_pages_fast()
fails. This change makes it necessary to also check for invalid use
cases, such as anonymous RO mappings (which can never change) and the
ZERO_PAGE which the commit referenced above was written to address.
This patch does restore the original behavior with RO MAP_PRIVATE
mappings, which have inherent user-mode usage problems and don't really
make sense. With this patch performing a FUTEX_WAIT within a RO
MAP_PRIVATE mapping will be successfully woken provided another process
updates the region of the underlying mapped file. However, the mmap()
man page states that for a MAP_PRIVATE mapping:
It is unspecified whether changes made to the file after
the mmap() call are visible in the mapped region.
So user-mode users attempting to use futex operations on RO MAP_PRIVATE
mappings are depending on unspecified behavior. Additionally a
RO MAP_PRIVATE mapping could fail to wake up in the following case.
Thread-A: call futex(FUTEX_WAIT, memory-region-A).
get_futex_key() return inode based key.
sleep on the key
Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
Thread-B: write memory-region-A.
COW happen. This process's memory-region-A become related
to new COWed private (ie PageAnon=1) page.
Thread-B: call futex(FUETX_WAKE, memory-region-A).
get_futex_key() return mm based key.
IOW, we fail to wake up Thread-A.
Once again doing something like this is just silly and users who do
something like this get what they deserve.
While RO MAP_PRIVATE mappings are nonsensical, checking for a private
mapping requires walking the vmas and was deemed too costly to avoid a
userspace hang.
This Patch is based on Peter Zijlstra's initial patch with modifications to
only allow RO mappings for futex operations that need VERIFY_READ access.
Reported-by: David Oliver <david@rgmadvisors.com>
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: peterz@infradead.org
Cc: eric.dumazet@gmail.com
Cc: zvonler@rgmadvisors.com
Cc: hughd@google.com
Link: http://lkml.kernel.org/r/1309450892-30676-1-git-send-email-sbohrer@rgmadvisors.com
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I haven't reproduced it myself but the fail scenario is that on such
machines (notably ARM and some embedded powerpc), if you manage to hit
that futex path on a writable page whose dirty bit has gone from the PTE,
you'll livelock inside the kernel from what I can tell.
It will go in a loop of trying the atomic access, failing, trying gup to
"fix it up", getting succcess from gup, go back to the atomic access,
failing again because dirty wasn't fixed etc...
So I think you essentially hang in the kernel.
The scenario is probably rare'ish because affected architecture are
embedded and tend to not swap much (if at all) so we probably rarely hit
the case where dirty is missing or young is missing, but I think Shan has
a piece of SW that can reliably reproduce it using a shared writable
mapping & fork or something like that.
On archs who use SW tracking of dirty & young, a page without dirty is
effectively mapped read-only and a page without young unaccessible in the
PTE.
Additionally, some architectures might lazily flush the TLB when relaxing
write protection (by doing only a local flush), and expect a fault to
invalidate the stale entry if it's still present on another processor.
The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
"fix it up" by causing get_user_pages() which would then be equivalent to
taking the fault.
However that isn't the case. get_user_pages() will not call
handle_mm_fault() in the case where the PTE seems to have the right
permissions, regardless of the dirty and young state. It will eventually
update those bits ... in the struct page, but not in the PTE.
Additionally, it will not handle the lazy TLB flushing that can be
required by some architectures in the fault case.
Basically, gup is the wrong interface for the job. The patch provides a
more appropriate one which boils down to just calling handle_mm_fault()
since what we are trying to do is simulate a real page fault.
The futex code currently attempts to write to user memory within a
pagefault disabled section, and if that fails, tries to fix it up using
get_user_pages().
This doesn't work on archs where the dirty and young bits are maintained
by software, since they will gate access permission in the TLB, and will
not be updated by gup().
In addition, there's an expectation on some archs that a spurious write
fault triggers a local TLB flush, and that is missing from the picture as
well.
I decided that adding those "features" to gup() would be too much for this
already too complex function, and instead added a new simpler
fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
which the futex code can call.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: Shan Hai <haishan.bai@gmail.com>
Tested-by: Shan Hai <haishan.bai@gmail.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <darren.hart@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>