Commit Graph

83 Commits

Author SHA1 Message Date
Nathan Chancellor 8eef28437c This is the 3.10.107 stable release

Merge 3.10.107 into android-msm-bullhead-3.10-oreo-m5

Changes in 3.10.107: (270 commits)
        Revert "Btrfs: don't delay inode ref updates during log, replay"
        Btrfs: fix memory leak in reading btree blocks
        ext4: use more strict checks for inodes_per_block on mount
        ext4: fix in-superblock mount options processing
        ext4: add sanity checking to count_overhead()
        ext4: validate s_first_meta_bg at mount time
        jbd2: don't leak modified metadata buffers on an aborted journal
        ext4: fix fencepost in s_first_meta_bg validation
        ext4: trim allocation requests to group size
        ext4: preserve the needs_recovery flag when the journal is aborted
        ext4: return EROFS if device is r/o and journal replay is needed
        ext4: fix inode checksum calculation problem if i_extra_size is small
        block: fix use-after-free in sys_ioprio_get()
        block: allow WRITE_SAME commands with the SG_IO ioctl
        block: fix del_gendisk() vs blkdev_ioctl crash
        dm crypt: mark key as invalid until properly loaded
        dm space map metadata: fix 'struct sm_metadata' leak on failed create
        md/raid5: limit request size according to implementation limits
        md:raid1: fix a dead loop when read from a WriteMostly disk
        md linear: fix a race between linear_add() and linear_congested()
        CIFS: Fix a possible memory corruption during reconnect
        CIFS: Fix missing nls unload in smb2_reconnect()
        CIFS: Fix a possible memory corruption in push locks
        CIFS: remove bad_network_name flag
        fs/cifs: make share unaccessible at root level mountable
        cifs: Do not send echoes before Negotiate is complete
        ocfs2: fix crash caused by stale lvb with fsdlm plugin
        ocfs2: fix BUG_ON() in ocfs2_ci_checkpointed()
        can: raw: raw_setsockopt: limit number of can_filter that can be set
        can: peak: fix bad memory access and free sequence
        can: c_can_pci: fix null-pointer-deref in c_can_start() - set device pointer
        can: ti_hecc: add missing prepare and unprepare of the clock
        can: bcm: fix hrtimer/tasklet termination in bcm op removal
        can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
        ALSA: hda - Fix up GPIO for ASUS ROG Ranger
        ALSA: seq: Fix race at creating a queue
        ALSA: seq: Don't handle loop timeout at snd_seq_pool_done()
        ALSA: timer: Reject user params with too small ticks
        ALSA: seq: Fix link corruption by event error handling
        ALSA: seq: Fix racy cell insertions during snd_seq_pool_done()
        ALSA: seq: Fix race during FIFO resize
        ALSA: seq: Don't break snd_use_lock_sync() loop by timeout
        ALSA: usb-audio: Add QuickCam Communicate Deluxe/S7500 to volume_control_quirks
        usb: gadgetfs: restrict upper bound on device configuration size
        USB: gadgetfs: fix unbounded memory allocation bug
        USB: gadgetfs: fix use-after-free bug
        USB: gadgetfs: fix checks of wTotalLength in config descriptors
        xhci: free xhci virtual devices with leaf nodes first
        USB: serial: io_ti: bind to interface after fw download
        usb: gadget: composite: always set ep->mult to a sensible value
        USB: cdc-acm: fix double usb_autopm_put_interface() in acm_port_activate()
        USB: cdc-acm: fix open and suspend race
        USB: cdc-acm: fix failed open not being detected
        usb: dwc3: gadget: make Set Endpoint Configuration macros safe
        usb: host: xhci-plat: Fix timeout on removal of hot pluggable xhci controllers
        usb: dwc3: gadget: delay unmap of bounced requests
        usb: hub: Wait for connection to be reestablished after port reset
        usb: gadget: composite: correctly initialize ep->maxpacket
        USB: UHCI: report non-PME wakeup signalling for Intel hardware
        arm/xen: Use alloc_percpu rather than __alloc_percpu
        xfs: set AGI buffer type in xlog_recover_clear_agi_bucket
        xfs: clear _XBF_PAGES from buffers when readahead page
        ssb: Fix error routine when fallback SPROM fails
        drivers/gpu/drm/ast: Fix infinite loop if read fails
        scsi: avoid a permanent stop of the scsi device's request queue
        scsi: move the nr_phys_segments assert into scsi_init_io
        scsi: don't BUG_ON() empty DMA transfers
        scsi: storvsc: properly handle SRB_ERROR when sense message is present
        scsi: storvsc: properly set residual data length on errors
        target/pscsi: Fix TYPE_TAPE + TYPE_MEDIMUM_CHANGER export
        scsi: lpfc: Add shutdown method for kexec
        scsi: sr: Sanity check returned mode data
        scsi: sd: Fix capacity calculation with 32-bit sector_t
        s390/vmlogrdr: fix IUCV buffer allocation
        libceph: verify authorize reply on connect
        nfs_write_end(): fix handling of short copies
        powerpc/ps3: Fix system hang with GCC 5 builds
        sg_write()/bsg_write() is not fit to be called under KERNEL_DS
        ftrace/x86: Set ftrace_stub to weak to prevent gcc from using short jumps to it
        cred/userns: define current_user_ns() as a function
        net: ti: cpmac: Fix compiler warning due to type confusion
        tick/broadcast: Prevent NULL pointer dereference
        netvsc: reduce maximum GSO size
        drop_monitor: add missing call to genlmsg_end
        drop_monitor: consider inserted data in genlmsg_end
        igmp: Make igmp group member RFC 3376 compliant
        HID: hid-cypress: validate length of report
        Input: xpad - use correct product id for x360w controllers
        Input: i8042 - add noloop quirk for Dell Embedded Box PC 3000
        Input: iforce - validate number of endpoints before using them
        Input: kbtab - validate number of endpoints before using them
        Input: joydev - do not report stale values on first open
        Input: tca8418 - use the interrupt trigger from the device tree
        Input: mpr121 - handle multiple bits change of status register
        Input: mpr121 - set missing event capability
        Input: i8042 - add Clevo P650RS to the i8042 reset list
        i2c: fix kernel memory disclosure in dev interface
        vme: Fix wrong pointer utilization in ca91cx42_slave_get
        sysrq: attach sysrq handler correctly for 32-bit kernel
        pinctrl: sh-pfc: Do not unconditionally support PIN_CONFIG_BIAS_DISABLE
        x86/PCI: Ignore _CRS on Supermicro X8DTH-i/6/iF/6F
        qla2xxx: Fix crash due to null pointer access
        ARM: 8634/1: hw_breakpoint: blacklist Scorpion CPUs
        ARM: dts: da850-evm: fix read access to SPI flash
        NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENT
        vmxnet3: Wake queue from reset work
        Fix memory leaks in cifs_do_mount()
        Compare prepaths when comparing superblocks
        Move check for prefix path to within cifs_get_root()
        Fix regression which breaks DFS mounting
        apparmor: fix uninitialized lsm_audit member
        apparmor: exec should not be returning ENOENT when it denies
        apparmor: fix disconnected bind mnts reconnection
        apparmor: internal paths should be treated as disconnected
        apparmor: check that xindex is in trans_table bounds
        apparmor: add missing id bounds check on dfa verification
        apparmor: don't check for vmalloc_addr if kvzalloc() failed
        apparmor: fix oops in profile_unpack() when policy_db is not present
        apparmor: fix module parameters can be changed after policy is locked
        apparmor: do not expose kernel stack
        vfio/pci: Fix integer overflows, bitmask check
        bna: Add synchronization for tx ring.
        sg: Fix double-free when drives detach during SG_IO
        move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)
        serial: 8250_pci: Detach low-level driver during PCI error recovery
        bnx2x: Correct ringparam estimate when DOWN
        tile/ptrace: Preserve previous registers for short regset write
        sysctl: fix proc_doulongvec_ms_jiffies_minmax()
        ISDN: eicon: silence misleading array-bounds warning
        ARC: [arcompact] handle unaligned access delay slot corner case
        parisc: Don't use BITS_PER_LONG in userspace-exported swab.h header
        nfs: Don't increment lock sequence ID after NFS4ERR_MOVED
        ipv6: addrconf: Avoid addrconf_disable_change() using RCU read-side lock
        af_unix: move unix_mknod() out of bindlock
        drm/nouveau/nv1a,nv1f/disp: fix memory clock rate retrieval
        crypto: api - Clear CRYPTO_ALG_DEAD bit before registering an alg
        ata: sata_mv:- Handle return value of devm_ioremap.
        mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
        mm, fs: check for fatal signals in do_generic_file_read()
        ARC: [arcompact] brown paper bag bug in unaligned access delay slot fixup
        sched/debug: Don't dump sched debug info in SysRq-W
        tcp: fix 0 divide in __tcp_select_window()
        macvtap: read vnet_hdr_size once
        packet: round up linear to header len
        vfs: fix uninitialized flags in splice_to_pipe()
        siano: make it work again with CONFIG_VMAP_STACK
        futex: Move futex_init() to core_initcall
        rtc: interface: ignore expired timers when enqueuing new timers
        irda: Fix lockdep annotations in hashbin_delete().
        tty: serial: msm: Fix module autoload
        rtlwifi: rtl_usb: Fix for URB leaking when doing ifconfig up/down
        af_packet: remove a stray tab in packet_set_ring()
        MIPS: Fix special case in 64 bit IP checksumming.
        mm: vmpressure: fix sending wrong events on underflow
        ipc/shm: Fix shmat mmap nil-page protection
        sd: get disk reference in sd_check_events()
        samples/seccomp: fix 64-bit comparison macros
        ath5k: drop bogus warning on drv_set_key with unsupported cipher
        rdma_cm: fail iwarp accepts w/o connection params
        NFSv4: fix getacl ERANGE for some ACL buffer sizes
        bcma: use (get|put)_device when probing/removing device driver
        powerpc/xmon: Fix data-breakpoint
        KVM: VMX: use correct vmcs_read/write for guest segment selector/base
        KVM: PPC: Book3S PR: Fix illegal opcode emulation
        KVM: s390: fix task size check
        s390: TASK_SIZE for kernel threads
        xtensa: move parse_tag_fdt out of #ifdef CONFIG_BLK_DEV_INITRD
        mac80211: flush delayed work when entering suspend
        drm/ast: Fix test for VGA enabled
        drm/ttm: Make sure BOs being swapped out are cacheable
        fat: fix using uninitialized fields of fat_inode/fsinfo_inode
        drivers: hv: Turn off write permission on the hypercall page
        xhci: fix 10 second timeout on removal of PCI hotpluggable xhci controllers
        crypto: improve gcc optimization flags for serpent and wp512
        mtd: pmcmsp: use kstrndup instead of kmalloc+strncpy
        cpmac: remove hopeless #warning
        mvsas: fix misleading indentation
        l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv
        net: don't call strlen() on the user buffer in packet_bind_spkt()
        dccp: Unlock sock before calling sk_free()
        tcp: fix various issues for sockets morphing to listen state
        uapi: fix linux/packet_diag.h userspace compilation error
        ipv6: avoid write to a possibly cloned skb
        dccp: fix memory leak during tear-down of unsuccessful connection request
        futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
        futex: Add missing error handling to FUTEX_REQUEUE_PI
        give up on gcc ilog2() constant optimizations
        cancel the setfilesize transation when io error happen
        crypto: ghash-clmulni - Fix load failure
        crypto: cryptd - Assign statesize properly
        ACPI / video: skip evaluating _DOD when it does not exist
        Drivers: hv: balloon: don't crash when memory is added in non-sorted order
        s390/pci: fix use after free in dma_init
        cpufreq: Fix and clean up show_cpuinfo_cur_freq()
        igb: Workaround for igb i210 firmware issue
        igb: add i211 to i210 PHY workaround
        ipv4: provide stronger user input validation in nl_fib_input()
        tcp: initialize icsk_ack.lrcvtime at session start time
        ACM gadget: fix endianness in notifications
        mmc: sdhci: Do not disable interrupts while waiting for clock
        uvcvideo: uvc_scan_fallback() for webcams with broken chain
        fbcon: Fix vc attr at deinit
        crypto: algif_hash - avoid zero-sized array
        virtio_balloon: init 1st buffer in stats vq
        c6x/ptrace: Remove useless PTRACE_SETREGSET implementation
        sparc/ptrace: Preserve previous registers for short regset write
        metag/ptrace: Preserve previous registers for short regset write
        metag/ptrace: Provide default TXSTATUS for short NT_PRSTATUS
        metag/ptrace: Reject partial NT_METAG_RPIPE writes
        libceph: force GFP_NOIO for socket allocations
        ACPI: Fix incompatibility with mcount-based function graph tracing
        ACPI / power: Avoid maybe-uninitialized warning
        rtc: s35390a: make sure all members in the output are set
        rtc: s35390a: implement reset routine as suggested by the reference
        rtc: s35390a: improve irq handling
        padata: avoid race in reordering
        HID: hid-lg: Fix immediate disconnection of Logitech Rumblepad 2
        HID: i2c-hid: Add sleep between POWER ON and RESET
        drm/vmwgfx: NULL pointer dereference in vmw_surface_define_ioctl()
        drm/vmwgfx: avoid calling vzalloc with a 0 size in vmw_get_cap_3d_ioctl()
        drm/vmwgfx: Remove getparam error message
        drm/vmwgfx: fix integer overflow in vmw_surface_define_ioctl()
        Reset TreeId to zero on SMB2 TREE_CONNECT
        metag/usercopy: Drop unused macros
        metag/usercopy: Zero rest of buffer from copy_from_user
        powerpc: Don't try to fix up misaligned load-with-reservation instructions
        mm/mempolicy.c: fix error handling in set_mempolicy and mbind.
        mtd: bcm47xxpart: fix parsing first block after aligned TRX
        net/packet: fix overflow in check for priv area size
        x86/vdso: Plug race between mapping and ELF header setup
        iscsi-target: Fix TMR reference leak during session shutdown
        iscsi-target: Drop work-around for legacy GlobalSAN initiator
        xen, fbfront: fix connecting to backend
        char: lack of bool string made CONFIG_DEVPORT always on
        platform/x86: acer-wmi: setup accelerometer when machine has appropriate notify event
        platform/x86: acer-wmi: setup accelerometer when ACPI device was found
        mm: Tighten x86 /dev/mem with zeroing reads
        virtio-console: avoid DMA from stack
        catc: Combine failure cleanup code in catc_probe()
        catc: Use heap buffer for memory size test
        net: ipv6: check route protocol when deleting routes
        Drivers: hv: don't leak memory in vmbus_establish_gpadl()
        Drivers: hv: get rid of timeout in vmbus_open()
        ubi/upd: Always flush after prepared for an update
        x86/mce/AMD: Give a name to MCA bank 3 when accessed with legacy MSRs
        powerpc: Reject binutils 2.24 when building little endian
        net/packet: fix overflow in check for tp_frame_nr
        net/packet: fix overflow in check for tp_reserve
        tty: nozomi: avoid a harmless gcc warning
        hostap: avoid uninitialized variable use in hfa384x_get_rid
        gfs2: avoid uninitialized variable warning
        net: neigh: guard against NULL solicit() method
        sctp: listen on the sock only when it's state is listening or closed
        ip6mr: fix notification device destruction
        MIPS: Fix crash registers on non-crashing CPUs
        RDS: Fix the atomicity for congestion map update
        xen/x86: don't lose event interrupts
        p9_client_readdir() fix
        nfsd: check for oversized NFSv2/v3 arguments
        ftrace/x86: Fix triple fault with graph tracing and suspend-to-ram
        kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF)
        tun: read vnet_hdr_sz once
        printk: use rcuidle console tracepoint
        ipv6: check raw payload size correctly in ioctl
        x86: standardize mmap_rnd() usage
        x86/mm/32: Enable full randomization on i386 and X86_32
        mm: larger stack guard gap, between vmas
        mm: fix new crash in unmapped_area_topdown()
        Allow stack to grow up to address space limit
        Linux 3.10.107

Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>

Conflicts:
	arch/x86/mm/mmap.c
	drivers/mmc/host/sdhci.c
	drivers/usb/host/xhci-plat.c
	fs/ext4/super.c
	kernel/sched/core.c
2018-01-25 17:57:41 -07:00
Eric Dumazet 9b4c2e72f8 tcp: fix various issues for sockets morphing to listen state
commit 02b2faaf0af1d85585f6d6980e286d53612acfc2 upstream.

Dmitry Vyukov reported a divide by 0 triggered by syzkaller, exploiting
tcp_disconnect() path that was never really considered and/or used
before syzkaller ;)

I was not able to reproduce the bug, but it seems issues here are the
three possible actions that assumed they would never trigger on a
listener.

1) tcp_write_timer_handler
2) tcp_delack_timer_handler
3) MTU reduction

Only IPv6 MTU reduction was properly testing the TCP_CLOSE and TCP_LISTEN
states, in tcp_v6_mtu_reduced().
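
The state test mentioned above can be sketched in user space as follows. This is only an illustration, not the patch itself: the ST_* names, the STF() macro and tcp_event_must_skip() are hypothetical, and the numeric values are assumed to mirror the kernel's TCP state numbering.

```c
#include <stdbool.h>

/* Assumed to mirror the kernel's TCP state numbers; names are hypothetical. */
enum {
    ST_ESTABLISHED = 1,
    ST_CLOSE       = 7,
    ST_LISTEN      = 10,
};

/* One bit per state, so several states can be tested with a single mask. */
#define STF(state) (1u << (state))

/*
 * Hypothetical helper: returns true when a write timer, delayed ACK timer
 * or MTU reduction event should be ignored because the socket is closed
 * or has morphed into a listener.
 */
static bool tcp_event_must_skip(int sk_state)
{
    return (STF(sk_state) & (STF(ST_CLOSE) | STF(ST_LISTEN))) != 0;
}
```

Each of the three actions listed above would call such a check first and return early when it is true.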

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-06-20 14:04:30 +02:00
Ravi Joshi 73cd2de534 WLAN subsystem: Sysctl support for key TCP/IP parameters
It has been observed that the default values of some key TCP/IP
parameters affect the throughput and performance of the system. Hence,
extend the configuration capabilities of the TCP/IP stack through the
sysctl interface.

Change-Id: I4287e9103769535f43e0934bac08435a524ee6a4
CRs-Fixed: 507581
Signed-off-by: Ravi Joshi <ravij@codeaurora.org>
2014-01-08 19:46:01 -08:00
Yuchung Cheng 9b44190dc1 tcp: refactor F-RTO
This patch series refactors the F-RTO feature (RFC4138/5682).

This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features.  It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).

The new code implements newer F-RTO RFC5682 using CA_Loss processing
path.  F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets the congestion control module determine how many packets to send
independently.  F-RTO only chooses what to send in order to detect
spurious retransmission. If a timeout is found spurious it invokes
existing Eifel undo algorithms like DSACK or TCP timestamp based
detection.

The first patch removes all F-RTO code except sysctl_tcp_frto, which is
left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
westwood now computes ssthresh on the regular timeout CA_EVENT_LOSS event.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-21 11:47:50 -04:00
Nandita Dukkipati 9b717a8d24 tcp: TLP loss detection.
This is the second of the TLP patch series; it augments the basic TLP
algorithm with a loss detection scheme.

This patch implements a mechanism for loss detection when a Tail
loss probe retransmission plugs a hole thereby masking packet loss
from the sender. The loss detection algorithm relies on counting
TLP dupacks as outlined in Sec. 3 of:
http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01

The basic idea is: Sender keeps track of TLP "episode" upon
retransmission of a TLP packet. An episode ends when the sender receives
an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
episode. We want to make sure that before the episode ends the sender
receives a "TLP dupack", indicating that the TLP retransmission was
unnecessary, so there was no loss/hole that needed plugging. If the
sender gets no TLP dupack before the end of the episode, then it reduces
ssthresh and the congestion window, because the TLP packet arriving at
the receiver probably plugged a hole.
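
A minimal sketch of the episode bookkeeping described above, assuming the caller has already classified each ACK as a TLP dupack or not. Only tlp_high_seq is named in the text; the struct, the other fields and the function names are hypothetical, and the real patch may structure this differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection TLP state. */
struct tlp_episode {
    bool     active;       /* a probe was retransmitted and is unresolved */
    bool     dupack_seen;  /* a TLP dupack arrived during this episode */
    uint32_t tlp_high_seq; /* SND.NXT at the time the probe was sent */
};

/* Wraparound-safe "a is after b" for 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Called when a TLP packet is retransmitted. */
static void tlp_episode_start(struct tlp_episode *ep, uint32_t snd_nxt)
{
    ep->active = true;
    ep->dupack_seen = false;
    ep->tlp_high_seq = snd_nxt;
}

/*
 * Process one ACK. Returns true when the sender should reduce ssthresh
 * and the congestion window, i.e. the episode ended without a TLP dupack.
 */
static bool tlp_episode_on_ack(struct tlp_episode *ep, uint32_t ack_seq,
                               bool is_tlp_dupack)
{
    if (!ep->active)
        return false;
    if (is_tlp_dupack)
        ep->dupack_seen = true;
    if (!seq_after(ack_seq, ep->tlp_high_seq))
        return false;        /* episode still open */
    ep->active = false;      /* ACK above tlp_high_seq ends the episode */
    return !ep->dupack_seen; /* no dupack: the probe plugged a hole */
}
```

With a probe sent at SND.NXT 1000, a dupack followed by an ACK for 1500 ends the episode without a reduction, while an ACK for 1500 with no earlier dupack asks for one.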

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 08:30:34 -04:00
Nandita Dukkipati 6ba8a3b19e tcp: Tail loss probe (TLP)
This patch series implements the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.

TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occurring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.

PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one outstanding packet.

TLP Algorithm

On transmission of new data in Open state:
  -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
  -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
  -> PTO = min(PTO, RTO)

Conditions for scheduling PTO:
  -> Connection is in Open state.
  -> Connection is either cwnd limited or no new data to send.
  -> Number of probes per tail loss episode is limited to one.
  -> Connection is SACK enabled.

When PTO fires:
  new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

  no_new_packet:
    -> retransmit the last segment.
       Its ACK triggers FACK or early retransmit based recovery.

ACK path:
  -> rearm RTO at start of ACK processing.
  -> reschedule PTO if need be.
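
The PTO arithmetic and the scheduling conditions listed above can be sketched like this. It is a user-space illustration under stated assumptions: times are in microseconds, the smoothed RTT stands in for both SRTT and RTT, packets_out is at least 1, and every identifier here is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLP_MIN_PTO_US      10000u   /* the 10ms floor */
#define TLP_DELACK_SLACK_US 200000u  /* the 200ms delayed ACK allowance */

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Probe timeout for a connection with packets_out >= 1 segments in flight. */
static uint64_t tlp_pto_us(uint64_t srtt_us, uint64_t rto_us,
                           unsigned int packets_out)
{
    uint64_t pto = max_u64(2 * srtt_us, TLP_MIN_PTO_US);

    /* One outstanding packet: leave room for the peer's delayed ACK. */
    if (packets_out == 1)
        pto = max_u64(2 * srtt_us,
                      srtt_us + srtt_us / 2 + TLP_DELACK_SLACK_US);

    return min_u64(pto, rto_us); /* PTO = min(PTO, RTO) */
}

/* Hypothetical snapshot of the connection fields the conditions refer to. */
struct tlp_conn_view {
    bool in_open_state;
    bool cwnd_limited;
    bool has_new_data;
    bool sack_enabled;
    unsigned int probes_this_episode;
};

/* All four scheduling conditions must hold before a PTO is armed. */
static bool tlp_may_schedule_pto(const struct tlp_conn_view *c)
{
    return c->in_open_state &&
           (c->cwnd_limited || !c->has_new_data) &&
           c->probes_this_episode == 0 &&
           c->sack_enabled;
}
```

For example, with an SRTT of 20ms and an RTO of 300ms, three packets in flight give a PTO of 40ms, while a single packet in flight gives 230ms.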

In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
		 ==1; enables RFC5827 ER.
		 ==2; delayed ER.
		 ==3; TLP and delayed ER. [DEFAULT]
		 ==4; TLP only.
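
The sysctl values above can be read as a small lookup table. The sketch below decodes a value into three flags; the struct and function names are hypothetical, and treating values 2 and 3 as "ER enabled and delayed" is an interpretation of the list rather than something taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical decoded view of the tcp_early_retrans sysctl. */
struct early_retrans_mode {
    bool er;         /* Early Retransmit enabled */
    bool er_delayed; /* ER uses the delayed variant */
    bool tlp;        /* Tail loss probe enabled */
};

/* Returns 0 and fills *out for values 0..4, or -1 for anything else. */
static int decode_tcp_early_retrans(int value, struct early_retrans_mode *out)
{
    static const struct early_retrans_mode table[] = {
        { false, false, false }, /* 0: TLP and ER disabled */
        { true,  false, false }, /* 1: RFC5827 ER */
        { true,  true,  false }, /* 2: delayed ER */
        { true,  true,  true  }, /* 3: TLP and delayed ER (default) */
        { false, false, true  }, /* 4: TLP only */
    };

    if (value < 0 || value > 4)
        return -1;
    *out = table[value];
    return 0;
}
```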

The TLP patch series has been extensively tested on Google Web servers.
It is most effective for short Web transactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 08:30:34 -04:00
David S. Miller d4185bbf62 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c

Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net.  Based upon a conflict resolution
patch posted by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-10 18:32:51 -05:00
Eric Dumazet e6c022a4fa tcp: better retrans tracking for defer-accept
For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

SYNACKs are not sent for TCP_DEFER_ACCEPT requests that were established
(for which we received the ACK from client). Only the last SYNACK is sent
so that we can receive again an ACK from client, to move the req into
accept queue. We plan to change this later to avoid the useless
retransmit (and potential problem as this SYNACK could be lost).

TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.

Decouple req->retrans field into two independent fields :

num_retrans : number of retransmits
num_timeout : number of timeouts

num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.

Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.
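
The split between the two counters can be sketched in user space as follows. The field names num_retrans and num_timeout come from the text; everything else (struct syn_req, the function names, the callback type and the two stand-in hooks) is hypothetical and only illustrates the counting rule, not the kernel code.

```c
#include <stdint.h>

/* Hypothetical stand-in for the request sock fields discussed above. */
struct syn_req {
    uint8_t num_retrans; /* SYNACKs actually retransmitted */
    uint8_t num_timeout; /* timer expirations, sent or not */
};

/* Transmit hook: returns 0 when a SYNACK was successfully queued. */
typedef int (*rtx_syn_ack_fn)(struct syn_req *req);

/* Counts a retransmit only if the transmit hook succeeded. */
static int req_rtx_syn_ack(struct syn_req *req, rtx_syn_ack_fn rtx)
{
    int err = rtx(req);

    if (err == 0)
        req->num_retrans++;
    return err;
}

/* Timer expiry: num_timeout always advances, num_retrans only on a send. */
static void req_on_timeout(struct syn_req *req, int send_synack,
                           rtx_syn_ack_fn rtx)
{
    if (send_synack)
        req_rtx_syn_ack(req, rtx);
    req->num_timeout++;
}

/* Exponential backoff driven by num_timeout, capped at max_ms. */
static uint32_t req_next_timeout_ms(const struct syn_req *req,
                                    uint32_t base_ms, uint32_t max_ms)
{
    unsigned int shift = req->num_timeout < 31 ? req->num_timeout : 31;
    uint64_t t = (uint64_t)base_ms << shift;

    return t > max_ms ? max_ms : (uint32_t)t;
}

/* Stand-in transmit hooks for experimenting with the counters. */
static int rtx_always_ok(struct syn_req *req) { (void)req; return 0; }
static int rtx_always_fails(struct syn_req *req) { (void)req; return -1; }
```

After one deferred timeout, one failed send and one successful send, num_timeout is 3 while num_retrans is 1, so a TCP_INFO-style report would no longer claim retransmits that never happened.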

Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-03 14:45:00 -04:00
Jerry Chu 37561f68bd tcp: Reject invalid ack_seq to Fast Open sockets
A packet with an invalid ack_seq may cause a TCP Fast Open socket to switch
to the unexpected TCP_CLOSING state, triggering a BUG_ON kernel panic.

When a FIN packet with an invalid ack_seq# arrives at a socket in
the TCP_FIN_WAIT1 state, rather than discarding the packet, the current
code will accept the FIN, causing state transition to TCP_CLOSING.

This may be a small deviation from RFC793, which seems to say that the
packet should be dropped. Unfortunately I did not expect this case for
Fast Open hence it will trigger a BUG_ON panic.

It turns out there is really nothing bad about a TFO socket going into
TCP_CLOSING state so I could just remove the BUG_ON statements. But after
some thought I think it's better to treat this case like TCP_SYN_RECV
and return a RST to the confused peer who caused the unacceptable ack_seq
to be generated in the first place.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-23 02:42:56 -04:00
Jerry Chu 8336886f78 tcp: TCP Fast Open Server - support TFO listeners
This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() so it can be used to check a TFO socket as
well as a request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.
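
From user space, the TCP_FASTOPEN socket option in item 5 is set on a listening socket, with the option value giving the maximum number of pending Fast Open requests. The sketch below is an assumption-laden illustration for Linux: open_tfo_listener() is a hypothetical helper, it binds to loopback only, and whether data in a SYN is actually accepted also depends on system configuration that is not shown here.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_FASTOPEN
#define TCP_FASTOPEN 23 /* Linux value, for older libc headers */
#endif

/*
 * Hypothetical helper: opens a loopback listener on the given port and
 * asks for a Fast Open queue of tfo_qlen pending requests.
 * Returns the listening descriptor, or -1 on any failure.
 */
static int open_tfo_listener(unsigned short port, int tfo_qlen)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    /* The option is applied once the socket is already listening. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN,
                   &tfo_qlen, sizeof(tfo_qlen)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Passing port 0 lets the kernel pick a free port, which is convenient for a quick check that the option is accepted on a given system.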

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 20:02:19 -04:00
Eric Dumazet 144d56e910 tcp: fix possible socket refcount problem
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added a bug leading to the following trace:

[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281]  <IRQ>  [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281]  [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
[ 2866.132281]  [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
[ 2866.132281]  [<ffffffff818a0839>] __sk_free+0xfd/0x114
[ 2866.132281]  [<ffffffff818a08c0>] sk_free+0x1c/0x1e
[ 2866.132281]  [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
[ 2866.132281]  [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
[ 2866.132281]  [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281]  [<ffffffff810f5831>] ? rb_commit+0x58/0x85
[ 2866.132281]  [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281]  [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
[ 2866.132281]  [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281]  [<ffffffff81a1227c>] call_softirq+0x1c/0x30
[ 2866.132281]  [<ffffffff81039f38>] do_softirq+0x4a/0xa6
[ 2866.132281]  [<ffffffff81070f2b>] irq_exit+0x51/0xad
[ 2866.132281]  [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
[ 2866.132281]  [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
[ 2866.132281]  <EOI>  [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281]  [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281]  [<ffffffff81078692>] mod_timer+0x178/0x1a9
[ 2866.132281]  [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
[ 2866.132281]  [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
[ 2866.132281]  [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281]  [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281]  [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281]  [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281]  [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281]  [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
[ 2866.132281]  [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
[ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
[ 2866.132281]  [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281]  [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
[ 2866.132281]  [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
[ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
[ 2866.132281]  [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
[ 2866.132281]  [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
[ 2866.132281]  [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281]  [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
[ 2866.132281]  [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
[ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281]  [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
[ 2866.132281]  [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
[ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281]  [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
[ 2866.132281]  [<ffffffff810835d6>] process_one_work+0x24d/0x47f
[ 2866.132281]  [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
[ 2866.132281]  [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
[ 2866.132281]  [<ffffffff81083a6d>] worker_thread+0x236/0x317
[ 2866.132281]  [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281]  [<ffffffff8108b7b8>] kthread+0x9a/0xa2
[ 2866.132281]  [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
[ 2866.132281]  [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
[ 2866.132281]  [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281]  [<ffffffff81a12180>] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]

The bug comes from the fact that the timer set in sk_reset_timer() can run
before we actually do the sock_hold(). The socket refcount reaches zero and
we free the socket too soon.

The timer handler is not allowed to reduce the socket refcnt if the socket
is owned by the user; otherwise we would need to change the
sk_reset_timer() implementation.

We should take a reference on the socket in case the TCP_WRITE_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bit is set in tsq_flags.

Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.

For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even if not fired from a timer.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21 14:42:23 -07:00
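The reference counting rule described in 144d56e910 can be illustrated with a small user-space model. This is a hedged sketch in Python, not kernel code: the `Sock` class and the functions `arm_timer`, `timer_fired` and `release_cb` are hypothetical stand-ins for `struct sock`, `sk_reset_timer()`, the timer handlers and `tcp_release_cb()`. It assumes a pending timer owns one reference that the handler drops on exit, and shows that setting a deferred bit must take an extra reference so the socket outlives the handler.

```python
# Hypothetical user-space model; the flag names mirror those in the text.
TCP_WRITE_TIMER_DEFERRED = 1 << 0
TCP_DELACK_TIMER_DEFERRED = 1 << 1


class Sock:
    """Stand-in for struct sock: a refcount, an owner flag, deferred bits."""

    def __init__(self):
        self.refcnt = 1            # reference held by the socket's owner
        self.owned_by_user = False
        self.tsq_flags = 0
        self.handled = []          # timer events that actually ran

    def hold(self):
        self.refcnt += 1

    def put(self):
        self.refcnt -= 1
        assert self.refcnt >= 0, "refcount underflow"


def arm_timer(sk):
    sk.hold()                      # a pending timer owns one reference


def timer_fired(sk, flag):
    if sk.owned_by_user:
        # Defer the work and pin the socket until release_cb() runs it.
        if not sk.tsq_flags & flag:
            sk.tsq_flags |= flag
            sk.hold()
    else:
        sk.handled.append(flag)
    sk.put()                       # drop the timer's own reference


def release_cb(sk):
    sk.owned_by_user = False
    for flag in (TCP_WRITE_TIMER_DEFERRED, TCP_DELACK_TIMER_DEFERRED):
        if sk.tsq_flags & flag:
            sk.tsq_flags &= ~flag
            sk.handled.append(flag)
            sk.put()               # drop the reference taken when deferring
```

In this model the refcount never drops below the owner's reference while a deferred bit is still set, which is the property the fix restores.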
Eric Dumazet 6f458dfb40 tcp: improve latencies of timer triggered events
The modern TCP stack depends heavily on tcp_write_timer() having a small
latency, but the current implementation doesn't exactly meet the
expectations.

When a timer fires but finds the socket is owned by the user, it rearms
itself for an additional delay, hoping the next run will be more
successful.

tcp_write_timer(), for example, uses a 50ms delay for the next try, and it
defeats many attempts to get predictable TCP behavior in terms of
latencies.

Use the recently introduced tcp_release_cb(), so that the user owning
the socket will call various handlers right before socket release.

This will permit us to post a followup patch to address the
tcp_tso_should_defer() syndrome (some deferred packets have to wait for
the RTO timer to be transmitted, while cwnd should allow us to send them
sooner).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20 10:59:41 -07:00
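The latency argument in 6f458dfb40 can be made concrete with a little arithmetic. The following is a hedged sketch, not kernel code: both function names are hypothetical, times are in milliseconds, and the only figure taken from the text is the 50 ms retry delay of tcp_write_timer(). It compares when a timer event is finally handled under the old rearm-and-retry scheme and under the scheme where the socket owner runs the deferred handler at release time.

```python
RETRY_DELAY_MS = 50  # retry delay of tcp_write_timer(), as stated in the text


def handled_at_with_rearm(fire_ms, release_ms, retry_ms=RETRY_DELAY_MS):
    """Old scheme: retry every retry_ms until the socket is no longer owned."""
    t = fire_ms
    while t < release_ms:          # socket still owned by the user at time t
        t += retry_ms
    return t


def handled_at_with_release_cb(fire_ms, release_ms):
    """New scheme: the owner runs the handler right before releasing."""
    return max(fire_ms, release_ms)


# The user releases the socket 1 ms after the timer fired.
print(handled_at_with_rearm(1000, 1001))        # 1050
print(handled_at_with_release_cb(1000, 1001))   # 1001
```

In this model a socket that is owned for only 1 ms still delays the old scheme by a full 50 ms, while the release callback handles the event as soon as the owner lets go.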
Yuchung Cheng 750ea2bafa tcp: early retransmit: delayed fast retransmit
Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
Delays the fast retransmit by an interval of RTT/4. We borrow the
RTO timer to implement the delay. If we receive another ACK or send
a new packet, the timer is cancelled and restored to original RTO
value offset by time elapsed.  When the delayed-ER timer fires,
we enter fast recovery and perform fast retransmit.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02 20:56:10 -04:00
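The timer arithmetic described in 750ea2bafa can be sketched as follows. This is an illustrative model only: the function names are hypothetical, times are integer milliseconds, and the behaviour when the elapsed time already exceeds the RTO (falling back to the full RTO) is an assumption of the sketch rather than something the commit message states.

```python
def delayed_er_timeout_ms(rtt_ms):
    """Delay applied before the fast retransmit: a quarter of the RTT."""
    return rtt_ms // 4


def restored_rto_ms(rto_ms, elapsed_ms):
    """RTO to re-arm after the delayed-ER timer is cancelled.

    The original RTO is offset by the time that has already elapsed.
    Assumption: if nothing would remain, keep the full RTO.
    """
    remaining = rto_ms - elapsed_ms
    return remaining if remaining > 0 else rto_ms


print(delayed_er_timeout_ms(100))   # 25
print(restored_rto_ms(1000, 25))    # 975
```

With a 100 ms RTT the fast retransmit is held back for 25 ms; if an ACK arrives before the timer fires, the borrowed RTO timer goes back to the time still remaining of the original RTO.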
Joe Perches afd465030a net: ipv4: Standardize prefixes for message logging
Add #define pr_fmt(fmt) as appropriate.

Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-12 17:05:21 -07:00
Arun Sharma efcdbf24fd net: Disambiguate kernel message
Some of our machines were reporting:

TCP: too many of orphaned sockets

even when the number of orphaned sockets was well below the
limit.

We print a different message depending on whether we're out
of TCP memory or there are too many orphaned sockets.

Also move the check out of line and cleanup the messages
that were printed.

Signed-off-by: Arun Sharma <asharma@fb.com>
Suggested-by: Mohan Srinivasan <mohan@fb.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: David Miller <davem@davemloft.net>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-01 14:41:50 -05:00
Rusty Russell 3db1cd5c05 net: fix assignment of 0/1 to bool variables.
DaveM said:
   Please, this kind of stuff rots forever and not using bool properly
   drives me crazy.

Joe Perches <joe@perches.com> gave me the spatch script:

	@@
	bool b;
	@@
	-b = 0
	+b = false
	@@
	bool b;
	@@
	-b = 1
	+b = true

I merely installed coccinelle, read the documentation and took credit.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-19 22:27:29 -05:00
Glauber Costa 180d8cd942 foundations of per-cgroup memory pressure controlling.
This patch replaces all uses of struct sock fields' memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem with accessor
macros. Those macros can either receive a socket argument, or a mem_cgroup
argument, depending on the context they live in.

Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12 19:04:10 -05:00
Eric Dumazet dfd56b8b38 net: use IS_ENABLED(CONFIG_IPV6)
Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-11 18:25:16 -05:00
Flavio Leitner 78d81d15b7 TCP: remove TCP_DEBUG
It was enabled by default and the messages guarded
by the define are useful.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24 17:36:08 -04:00
Shan Wei 089c34827e tcp: Remove debug macro of TCP_CHECK_TIMER
TCP_CHECK_TIMER is no longer used for debugging; it does nothing.
It has been that way for several years, maybe 6 years.

Remove it to keep code clearer.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-20 11:10:14 -08:00
Ilpo Järvinen c60ce4e265 tcp: use correct counters in CA_CWR state too
As CWR is stronger than CA_Disorder state, we can miscount
SACK/Reno failure into other timeouts. Not a bad problem as
it can happen only due to ECN, FRTO detecting spurious RTO
or xmit error which are the only callers of tcp_enter_cwr.
And even then losses and RTO must still follow thereafter
to actually end up into the relevant code paths.

Compile tested.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-17 13:46:33 -07:00
David S. Miller 21a180cda0 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/ipv4/Kconfig
	net/ipv4/tcp_timer.c
2010-10-04 11:56:38 -07:00
Damian Lukowski 4d22f7d372 net-2.6: SYN retransmits: Add new parameter to retransmits_timed_out()
Fixes kernel Bugzilla Bug 18952

This patch adds a syn_set parameter to the retransmits_timed_out()
routine and updates its callers. If not set, TCP_RTO_MIN is taken
as the calculation basis as before. If set, TCP_TIMEOUT_INIT is
used instead, so that sysctl_syn_retries represents the actual
amount of SYN retransmissions in case no SYNACKs are received when
establishing a new connection.

Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-28 13:08:32 -07:00
David S. Miller e548833df8 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/mac80211/main.c
2010-09-09 22:27:33 -07:00
Jerry Chu dca43c75e7 tcp: Add TCP_USER_TIMEOUT socket option.
This patch provides "user timeout" support as described in RFC793. The
socket option is also needed for the local half of RFC5482 "TCP User
Timeout Option".

TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
when > 0, to specify the maximum amount of time in ms that transmitted
data may remain unacknowledged before TCP will forcefully close the
corresponding connection and return ETIMEDOUT to the application. If
0 is given, TCP will continue to use the system default.

Increasing the user timeouts allows a TCP connection to survive extended
periods without end-to-end connectivity. Decreasing the user timeouts
allows applications to "fail fast" if so desired. Otherwise it may take
up to 20 minutes with the current system defaults in a normal WAN
environment.

The socket option can be made during any state of a TCP connection, but
is only effective during the synchronized states of a connection
(ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
connection due to keepalive failure.

The option does not change in any way when TCP retransmits a packet, nor
when a keepalive probe will be sent.

This option, like many others, will be inherited by an acceptor from its
listener.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-30 13:23:33 -07:00
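From user space, the option introduced in dca43c75e7 is set like any other TCP-level socket option. The snippet below is a minimal sketch assuming a Linux host: the helper name `set_user_timeout` is hypothetical, and the fallback value 18 is the Linux definition of TCP_USER_TIMEOUT, used only when Python's `socket` module does not export the constant.

```python
import socket

# Linux value of TCP_USER_TIMEOUT; only used if the socket module on this
# Python build does not export the constant.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_user_timeout(sock, timeout_ms):
    """Set the user timeout in milliseconds and return the value read back.

    A value of 0 asks TCP to keep using the system default.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Give up after 30 s of unacknowledged data.
        print(set_user_timeout(s, 30_000))
```

The option can be set before the connection is established, since the text notes it may be set in any state and only takes effect in the synchronized states.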
David S. Miller ad1af0fedb tcp: Combat per-cpu skew in orphan tests.
As reported by Anton Blanchard when we use
percpu_counter_read_positive() to make our orphan socket limit checks,
the check can be off by up to num_cpus_online() * batch (which is 32
by default) which on a 128 cpu machine can be as large as the default
orphan limit itself.

Fix this by doing the full expensive sum check if the optimized check
triggers.

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-08-25 02:27:49 -07:00
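The skew described in ad1af0fedb, and the two-step check that fixes it, can be modelled in a few lines. This is a hedged user-space sketch, not the kernel's percpu_counter: the class `PerCpuCounter` and the function `too_many_orphans` are hypothetical, and only the batch size of 32 and the 128-CPU example come from the text.

```python
class PerCpuCounter:
    """Toy per-CPU counter: cheap approximate read, expensive exact sum."""

    def __init__(self, num_cpus, batch=32):
        self.batch = batch
        self.count = 0                 # shared approximation
        self.local = [0] * num_cpus    # per-CPU deltas not yet folded in

    def add(self, cpu, amount):
        self.local[cpu] += amount
        if abs(self.local[cpu]) >= self.batch:
            self.count += self.local[cpu]   # fold into the shared count
            self.local[cpu] = 0

    def read_positive(self):
        return max(self.count, 0)

    def sum_positive(self):
        return max(self.count + sum(self.local), 0)


def too_many_orphans(counter, limit):
    """Do the expensive exact sum only if the cheap check triggers."""
    if counter.read_positive() > limit:
        return counter.sum_positive() > limit
    return False


counter = PerCpuCounter(num_cpus=128)
for cpu in range(128):
    counter.add(cpu, 32)    # folded into the shared count
    counter.add(cpu, -31)   # stays in the per-CPU delta
print(counter.read_positive(), counter.sum_positive())  # 4096 128
```

Here the cheap read overstates the true value by almost 128 * 32 = 4096, so a limit check based on it alone would fire spuriously; the exact sum prevents that.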
Eric Dumazet 4bc2f18ba4 net/ipv4: EXPORT_SYMBOL cleanups
CodingStyle cleanups

EXPORT_SYMBOL should immediately follow the symbol declaration.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12 12:57:54 -07:00
Flavio Leitner 6c37e5de45 TCP: avoid to send keepalive probes if receiving data
RFC 1122 says the following:
...
  Keep-alive packets MUST only be sent when no data or
  acknowledgement packets have been received for the
  connection within an interval.
...

The acknowledgement packet is resetting the keepalive
timer but the data packet isn't. This patch fixes it by
checking the timestamp of the last received data packet
too when the keepalive timer expires.

Signed-off-by: Flavio Leitner <fleitner@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-27 12:53:25 -07:00
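The decision described in 6c37e5de45 amounts to measuring idle time from whichever packet arrived most recently. The sketch below is illustrative only: the function and parameter names are hypothetical, timestamps are in milliseconds, and the keepalive interval is passed in rather than assumed.

```python
def keepalive_idle_ms(now_ms, last_ack_ms, last_data_ms):
    """Idle time since the most recent ACK or data packet was received."""
    return now_ms - max(last_ack_ms, last_data_ms)


def should_send_probe(now_ms, last_ack_ms, last_data_ms, keepalive_time_ms):
    """Probe only if neither an ACK nor data arrived within the interval."""
    idle = keepalive_idle_ms(now_ms, last_ack_ms, last_data_ms)
    return idle >= keepalive_time_ms


# Last ACK is old, but data arrived 1000 s ago: no probe with a 7200 s interval.
print(should_send_probe(8_000_000, 0, 7_000_000, 7_200_000))  # False
```

Before the fix, only the ACK timestamp was consulted, so the example above would have triggered a probe on a connection that was still receiving data.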
Eric Dumazet b6c6712a42 net: sk_dst_cache RCUification
With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
work.

sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

This rwlock is readlocked for a very small amount of time, and dst
entries are already freed after RCU grace period. This calls for RCU
again :)

This patch converts sk_dst_lock to a spinlock, and use RCU for readers.

__sk_dst_get() is supposed to be called with rcu_read_lock() or if
socket locked by user, so use appropriate rcu_dereference_check()
condition (rcu_read_lock_held() || sock_owned_by_user(sk))

This patch avoids two atomic ops per tx packet on UDP connected sockets,
for example, and permits sk_dst_lock to be much less dirtied.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 01:41:33 -07:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and tries to put the new include such that its order conforms
  to its surroundings.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   widely available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Jiri Kosina 318ae2edc3 Merge branch 'for-next' into for-linus
Conflicts:
	Documentation/filesystems/proc.txt
	arch/arm/mach-u300/include/mach/debug-macro.S
	drivers/net/qlge/qlge_ethtool.c
	drivers/net/qlge/qlge_main.c
	drivers/net/typhoon.c
2010-03-08 16:55:37 +01:00
Andreas Petlund 36e31b0af5 net: TCP thin linear timeouts
This patch will make TCP use only linear timeouts if the
stream is thin. This will help to avoid the very high latencies
that thin streams suffer because of exponential backoff. This
mechanism is only active if enabled by iocontrol or syscontrol
and the stream is identified as thin. A maximum of 6 linear
timeouts is tried before exponential backoff is resumed.

Signed-off-by: Andreas Petlund <apetlund@simula.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-18 15:43:08 -08:00
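The effect of 36e31b0af5 on the retransmission schedule can be sketched as a list of successive timeout intervals. This is a simplified model, not the kernel logic: the function name and parameters are hypothetical, the 120 s cap is an assumed maximum RTO, and the only figure taken from the text is the limit of 6 linear timeouts.

```python
def rto_schedule(base_ms, count, thin, linear_retries=6, max_ms=120_000):
    """Return the first `count` timeout intervals in milliseconds.

    Thin streams keep the base RTO for `linear_retries` timeouts before
    exponential backoff resumes; other streams double from the start.
    """
    intervals = []
    rto = base_ms
    for attempt in range(1, count + 1):
        intervals.append(rto)
        if thin and attempt < linear_retries:
            continue                   # stay linear for this attempt
        rto = min(rto * 2, max_ms)     # exponential backoff
    return intervals


print(rto_schedule(200, 8, thin=True))
# [200, 200, 200, 200, 200, 200, 400, 800]
print(rto_schedule(200, 8, thin=False))
# [200, 400, 800, 1600, 3200, 6400, 12800, 25600]
```

After eight timeouts the thin stream in this model has waited 2.4 s in total, against 51 s for the exponentially backed-off stream.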
Daniel Mack 3ad2f3fbb9 tree-wide: Assorted spelling fixes
In particular, several occurrences of funny versions of 'success',
'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
'beginning', 'desirable', 'separate' and 'necessary' are fixed.

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-09 11:13:56 +01:00
Octavian Purdila 72659ecce6 tcp: account SYN-ACK timeouts & retransmissions
Currently we don't increment SYN-ACK timeouts & retransmissions
although we do increment the same stats for SYN. We seem to have lost
the SYN-ACK accounting with the introduction of tcp_syn_recv_timer
(commit 2248761e in the netdev-vger-cvs tree).

This patch fixes this issue. In the process we also rename the v4/v6
syn/ack retransmit functions for clarity. We also add a new
request_socket operations (syn_ack_timeout) so we can keep code in
inet_connection_sock.c protocol agnostic.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-01-17 19:09:39 -08:00
Damian Lukowski 2f7de5710a tcp: Stalling connections: Move timeout calculation routine
This patch moves retransmits_timed_out() from include/net/tcp.h
to tcp_timer.c, where it is used.

Reported-by: Frederic Leroy <fredo@starox.org>
Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-08 20:56:11 -08:00
Krishna Kumar ea94ff3b55 net: Fix for dst_negative_advice
dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.

(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-20 18:55:46 -07:00
Eric Dumazet c720c7e838 inet: rename some inet_sock fields
In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

The goal is to transfer fields used at lookup time into the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path).

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-18 18:52:53 -07:00
Damian Lukowski 6fa12c8503 Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value.
RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
which may represent a number of allowed retransmissions or a timeout value.
Currently linux uses sysctl_tcp_retries{1,2} to specify the thresholds
in number of allowed retransmissions.

For any desired threshold R2 (by means of time) one can specify tcp_retries2
(by means of number of retransmissions) such that TCP will not time out
earlier than R2. This is the case, because the RTO schedule follows a fixed
pattern, namely exponential backoff.

However, the RTO behaviour is not predictable any more if RTO backoffs can be
reverted, as is the case in the draft
"Make TCP more Robust to Long Connectivity Disruptions"
(http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).

In the worst case TCP would time out a connection after 3.2 seconds, if the
initial RTO equaled MIN_RTO and each backoff has been reverted.

This patch introduces a function retransmits_timed_out(N),
which calculates the timeout of a TCP connection, assuming an initial
RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.

Whenever timeout decisions are made by comparing the retransmission counter
to some value N, this function can be used, instead.

The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
can occur than the value indicates. However, it yields a timeout which is
similar to the one of an unpatched, exponentially backing off TCP in the same
scenario. As no application could rely on an RTO greater than MIN_RTO, there
should be no risk of a regression.

Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-01 02:45:47 -07:00
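The time threshold described in 6fa12c8503 is the sum of an exponentially backed-off RTO series. The following is a hedged sketch of that arithmetic, not the kernel's retransmits_timed_out(): the function names are hypothetical, and the values of 200 ms for the minimum RTO, 120 s for the maximum RTO and 15 for tcp_retries2 are assumptions not stated in the commit message.

```python
TCP_RTO_MIN_MS = 200       # assumed minimum RTO
TCP_RTO_MAX_MS = 120_000   # assumed maximum RTO


def backoff_timeout_ms(retries, rto_min_ms=TCP_RTO_MIN_MS,
                       rto_max_ms=TCP_RTO_MAX_MS):
    """Total time for an initial RTO plus `retries` backed-off retransmissions."""
    total = 0
    rto = rto_min_ms
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max_ms)   # double, capped at the maximum
    return total


def fully_reverted_timeout_ms(retries, rto_min_ms=TCP_RTO_MIN_MS):
    """Worst case when every backoff is reverted: the RTO never grows."""
    return (retries + 1) * rto_min_ms


print(backoff_timeout_ms(15))         # 924600
print(fully_reverted_timeout_ms(15))  # 3200
```

Under these assumptions a count-based threshold of 15 would let a connection die after 3.2 seconds if every backoff were reverted, which matches the worst case quoted above, whereas the time-based threshold keeps the limit at about 924.6 seconds.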
Damian Lukowski f1ecd5d9e7 Revert Backoff [v3]: Revert RTO on ICMP destination unreachable
Here, an ICMP host/network unreachable message, whose payload fits to
TCP's SND.UNA, is taken as an indication that the RTO retransmission has
not been lost due to congestion, but because of a route failure
somewhere along the path.
With true congestion, a router won't trigger such a message and the
patched TCP will operate as standard TCP.

This patch reverts one RTO backoff, if an ICMP host/network unreachable
message, whose payload fits to TCP's SND.UNA, arrives.
Based on the new RTO, the retransmission timer is reset to reflect the
remaining time, or - if the revert clocked out the timer - a retransmission
is sent out immediately.
Backoffs are only reverted, if TCP is in RTO loss recovery, i.e. if
there have been retransmissions and reversible backoffs, already.

Changes from v2:
1) Renaming of skb in tcp_v4_err() moved to another patch.
2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
3) Fixed code comments.

Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-01 02:45:42 -07:00
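The revert step described in f1ecd5d9e7 can be sketched as a pure function. This is an illustrative model under stated assumptions, not the code in tcp_v4_err(): the function name and return shape are hypothetical, the RTO is modelled as a base value shifted left by the backoff count, and the 120 s cap is an assumed maximum RTO.

```python
def revert_one_backoff(base_rto_ms, backoff, elapsed_ms, rto_max_ms=120_000):
    """Undo one RTO backoff after a matching ICMP unreachable message.

    Returns (new_backoff, new_rto_ms, remaining_ms), or None when there is
    no backoff left to revert. A remaining time of 0 means the revert
    clocked out the timer and the retransmission should go out immediately.
    """
    if backoff <= 0:
        return None
    new_backoff = backoff - 1
    new_rto = min(base_rto_ms << new_backoff, rto_max_ms)
    remaining = max(new_rto - elapsed_ms, 0)
    return new_backoff, new_rto, remaining


# RTO was 200 ms << 3 = 1600 ms; 500 ms have passed since the last send.
print(revert_one_backoff(200, 3, 500))   # (2, 800, 300)
# 900 ms have passed: the shorter RTO has already expired.
print(revert_one_backoff(200, 3, 900))   # (2, 800, 0)
```

In the first call the timer would be reset to fire in 300 ms; in the second the retransmission is due at once.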
Eric Dumazet df19a62677 tcp: keepalive cleanups
Introduce keepalive_probes(tp) helper, and use it, like 
keepalive_time_when(tp) and keepalive_intvl_when(tp)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-08-28 23:48:54 -07:00
Ilpo Järvinen bc079e9ede tcp: cleanup ca_state mess in tcp_timer
Redundant checks made indentation impossible to follow.
However, it might be useful to make this ca_state+is_sack
indexed array.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-02 03:00:13 -08:00
Matt Mackall 6086ebca13 tcp: Stop scaring users with "treason uncloaked!"
The original message was unhelpful and extremely alarming to our poor
users, despite its charm. Make it less frightening.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-18 22:27:42 -08:00
Eric Dumazet dd24c00191 net: Use a percpu_counter for orphan_count
Instead of using one atomic_t per protocol, use a percpu_counter
for "orphan_count", to reduce cache line contention on
heavy duty network servers. 

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 21:17:14 -08:00
Jianjun Kong fd3f8c4cb6 net: clean up net/ipv4/ip_fragment.c tcp_timer.c ip_input.c
Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03 02:47:38 -08:00
Harvey Harrison 673d57e723 net: replace NIPQUAD() in net/ipv4/ net/ipv6/
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31 00:53:57 -07:00
Harvey Harrison 5b095d9892 net: replace %p6 with %pI6
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 12:52:50 -07:00
Harvey Harrison 0c6ce78abf net: replace uses of NIP6_FMT with %p6
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-28 23:02:31 -07:00
Peter Zijlstra c57943a1c9 net: wrap sk->sk_backlog_rcv()
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 14:18:42 -07:00
Ilpo Järvinen 547b792cac net: convert BUG_TRAP to generic WARN_ON
Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuitively, BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON(), though some might actually be
promoted to BUG_ON(); I left that to the future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-25 21:43:18 -07:00
Pavel Emelyanov de0744af1f mib: add net to NET_INC_STATS_BH
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-16 20:31:16 -07:00