Commit Graph

137 Commits

Author SHA1 Message Date
Wanpeng Li 693f42f492 sched/cputime: Fix prev steal time accounting during CPU hotplug
Commit:

  e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")

... set rq->prev_* to 0 after a hotplugged CPU comes back, in order to
fix the case where (after CPU hotplug) steal time is smaller than
rq->prev_steal_time.

However, this should never happen: steal time was only smaller because of
the KVM-specific bug fixed by the previous patch. Worse, commit
e9532e69b8d1 itself triggers a bug on a CPU hot-unplug/plug operation:
because rq->prev_steal_time is cleared, all of the CPU's past steal time
is accounted again on hot-plug.
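
To illustrate the double accounting (numbers hypothetical):

	/* before unplug: steal clock ~500ms, rq->prev_steal_time ~500ms;
	 * the reverted commit clears rq->prev_steal_time to 0 on hot-plug */
	u64 steal = paravirt_steal_clock(smp_processor_id()); /* still ~500ms */
	steal -= this_rq()->prev_steal_time; /* all 500ms accounted as new steal */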

Since the root cause has been fixed, we can just revert commit e9532e69b8d1.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")
Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-26 16:02:24 +02:00
Nathan Chancellor 459f05e480 This is the 3.10.102 stable release
Merge 3.10.102 into android-msm-bullhead-3.10-oreo-m5

Changes in 3.10.102: (144 commits)
        pipe: Fix buffer offset after partially failed read
        x86/iopl/64: Properly context-switch IOPL on Xen PV
        ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
        compiler-gcc: integrate the various compiler-gcc[345].h files
        x86: LLVMLinux: Fix "incomplete type const struct x86cpu_device_id"
        KVM: i8254: change PIT discard tick policy
        KVM: fix spin_lock_init order on x86
        EDAC, amd64_edac: Shift wrapping issue in f1x_get_norm_dct_addr()
        PCI: Disable IO/MEM decoding for devices with non-compliant BARs
        linux/const.h: Add _BITUL() and _BITULL()
        x86: Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE
        x86, processor-flags: Fix the datatypes and add bit number defines
        x86/iopl: Fix iopl capability check on Xen PV
        sg: fix dxferp in from_to case
        aacraid: Fix memory leak in aac_fib_map_free
        be2iscsi: set the boot_kset pointer to NULL in case of failure
        usb: retry reset if a device times out
        USB: cdc-acm: more sanity checking
        USB: iowarrior: fix oops with malicious USB descriptors
        USB: usb_driver_claim_interface: add sanity checking
        USB: mct_u232: add sanity checking in probe
        USB: digi_acceleport: do sanity checking for the number of ports
        USB: cypress_m8: add endpoint sanity check
        USB: serial: cp210x: Adding GE Healthcare Device ID
        USB: option: add "D-Link DWM-221 B1" device id
        pwc: Add USB id for Philips Spc880nc webcam
        Input: powermate - fix oops with malicious USB descriptors
        net: irda: Fix use-after-free in irtty_open()
        8250: use callbacks to access UART_DLL/UART_DLM
        bttv: Width must be a multiple of 16 when capturing planar formats
        media: v4l2-compat-ioctl32: fix missing length copy in put_v4l2_buffer32
        ALSA: intel8x0: Add clock quirk entry for AD1981B on IBM ThinkPad X41.
        jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
        bcache: fix cache_set_flush() NULL pointer dereference on OOM
        watchdog: rc32434_wdt: fix ioctl error handling
        splice: handle zero nr_pages in splice_to_pipe()
        xtensa: ISS: don't hang if stdin EOF is reached
        xtensa: clear all DBREAKC registers on start
        md/raid5: Compare apples to apples (or sectors to sectors)
        rapidio/rionet: fix deadlock on SMP
        ipr: Fix out-of-bounds null overwrite
        ipr: Fix regression when loading firmware
        drm/radeon: Don't drop DP 2.7 Ghz link setup on some cards.
        tracing: Have preempt(irqs)off trace preempt disabled functions
        tracing: Fix crash from reading trace_pipe with sendfile
        tracing: Fix trace_printk() to print when not using bprintk()
        scripts/coccinelle: modernize &
        Input: ims-pcu - sanity check against missing interfaces
        Input: ati_remote2 - fix crashes on detecting device with invalid descriptor
        ocfs2/dlm: fix race between convert and recovery
        ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
        mtd: onenand: fix deadlock in onenand_block_markbad
        sched/cputime: Fix steal time accounting vs. CPU hotplug
        perf/x86/intel: Fix PEBS data source interpretation on Nehalem/Westmere
        hwmon: (max1111) Return -ENODEV from max1111_read_channel if not instantiated
        parisc: Avoid function pointers for kernel exception routines
        parisc: Fix kernel crash with reversed copy_from_user()
        ALSA: timer: Use mod_timer() for rearming the system timer
        net: jme: fix suspend/resume on JMC260
        sctp: lack the check for ports in sctp_v6_cmp_addr
        ipv6: re-enable fragment header matching in ipv6_find_hdr
        cdc_ncm: toggle altsetting to force reset before setup
        usbnet: cleanup after bind() in probe()
        udp6: fix UDP/IPv6 encap resubmit path
        sh_eth: fix NULL pointer dereference in sh_eth_ring_format()
        net: Fix use after free in the recvmmsg exit path
        farsync: fix off-by-one bug in fst_add_one
        ath9k: fix buffer overrun for ar9287
        qlge: Fix receive packets drop.
        ppp: take reference on channels netns
        qmi_wwan: add "D-Link DWM-221 B1" device id
        ipv4: l2tp: fix a potential issue in l2tp_ip_recv
        ipv6: l2tp: fix a potential issue in l2tp_ip6_recv
        ip6_tunnel: set rtnl_link_ops before calling register_netdevice
        usb: renesas_usbhs: avoid NULL pointer derefernce in usbhsf_pkt_handler()
        usb: renesas_usbhs: disable TX IRQ before starting TX DMAC transfer
        ext4: add lockdep annotations for i_data_sem
        HID: usbhid: fix inconsistent reset/resume/reset-resume behavior
        drm/radeon: hold reference to fences in radeon_sa_bo_new (3.17 and older)
        usbvision-video: fix memory leak of alt_max_pkt_size
        usbvision: fix leak of usb_dev on failure paths in usbvision_probe()
        usbvision: fix crash on detecting device with invalid configuration
        usb: xhci: fix wild pointers in xhci_mem_cleanup
        usb: hcd: out of bounds access in for_each_companion
        crypto: gcm - Fix rfc4543 decryption crash
        nl80211: check netlink protocol in socket release notification
        Input: gtco - fix crash on detecting device without endpoints
        i2c: cpm: Fix build break due to incompatible pointer types
        EDAC: i7core, sb_edac: Don't return NOTIFY_BAD from mce_decoder callback
        ASoC: s3c24xx: use const snd_soc_component_driver pointer
        efi: Fix out-of-bounds read in variable_matches()
        workqueue: fix ghost PENDING flag while doing MQ IO
        USB: usbip: fix potential out-of-bounds write
        paride: make 'verbose' parameter an 'int' again
        fbdev: da8xx-fb: fix videomodes of lcd panels
        misc/bmp085: Enable building as a module
        rtc: vr41xx: Wire up alarm_irq_enable
        drivers/misc/ad525x_dpot: AD5274 fix RDAC read back errors
        include/linux/poison.h: fix LIST_POISON{1,2} offset
        Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors
        perf stat: Document --detailed option
        ARM: OMAP3: Add cpuidle parameters table for omap3430
        compiler-gcc: disable -ftracer for __noclone functions
        ipvs: correct initial offset of Call-ID header search in SIP persistence engine
        nbd: ratelimit error msgs after socket close
        clk: versatile: sp810: support reentrance
        lpfc: fix misleading indentation
        ARM: SoCFPGA: Fix secondary CPU startup in thumb2 kernel
        proc: prevent accessing /proc/<PID>/environ until it's ready
        batman-adv: Fix broadcast/ogm queue limit on a removed interface
        MAINTAINERS: Remove asterisk from EFI directory names
        ACPICA: Dispatcher: Update thread ID for recursive method calls
        USB: serial: cp210x: add ID for Link ECU
        USB: serial: cp210x: add Straizona Focusers device ids
        Input: ads7846 - correct the value got from SPI
        powerpc: scan_features() updates incorrect bits for REAL_LE
        crypto: hash - Fix page length clamping in hash walk
        get_rock_ridge_filename(): handle malformed NM entries
        Input: max8997-haptic - fix NULL pointer dereference
        asmlinkage, pnp: Make variables used from assembler code visible
        ARM: OMAP3: Fix booting with thumb2 kernel
        decnet: Do not build routes to devices without decnet private data.
        route: do not cache fib route info on local routes with oif
        packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interface
        atl2: Disable unimplemented scatter/gather feature
        net: fix infoleak in llc
        net: fix infoleak in rtnetlink
        VSOCK: do not disconnect socket when peer has shutdown SEND only
        net: bridge: fix old ioctl unlocked net device walk
        net: fix a kernel infoleak in x25 module
        fs/cifs: correctly to anonymous authentication via NTLMSSP
        ring-buffer: Use long for nr_pages to avoid overflow failures
        ring-buffer: Prevent overflow of size in ring_buffer_resize()
        mfd: omap-usb-tll: Fix scheduling while atomic BUG
        mmc: mmc: Fix partition switch timeout for some eMMCs
        mmc: longer timeout for long read time quirk
        Bluetooth: vhci: purge unhandled skbs
        USB: serial: keyspan: fix use-after-free in probe error path
        USB: serial: quatech2: fix use-after-free in probe error path
        USB: serial: io_edgeport: fix memory leaks in probe error path
        USB: serial: option: add support for Cinterion PH8 and AHxx
        tty: vt, return error when con_startup fails
        serial: samsung: Reorder the sequence of clock control when call s3c24xx_serial_set_termios()
        Linux 3.10.102

Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>

Conflicts:
	drivers/media/v4l2-core/v4l2-compat-ioctl32.c
	fs/pipe.c
	kernel/trace/trace_printk.c
	net/core/rtnetlink.c
	net/socket.c
2018-01-25 17:24:10 -07:00
Thomas Gleixner 0579a12791 sched/cputime: Fix steal time accounting vs. CPU hotplug
commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream.

On CPU hotplug the steal time accounting can keep a stale
rq->prev_steal_time value over CPU down and up. So after the CPU comes up
again, the delta calculation in steal_account_process_tick() wrecks itself
due to the unsigned math:

	 u64 steal = paravirt_steal_clock(smp_processor_id());

	 steal -= this_rq()->prev_steal_time;

So if steal is smaller than rq->prev_steal_time, we end up with an
insanely large value which then gets added to rq->prev_steal_time,
permanently wrecking the accounting. As a consequence the per-CPU stats
in /proc/stat become stale.

A nice trick to tell the world how idle the system is (100%) while the
CPU is 100% busy running tasks, but we prefer realistic numbers.

None of the accounting values which use a previous value to account for
fractions are reset at CPU hotplug time. update_rq_clock_task() has a
sanity check for prev_irq_time and prev_steal_time_rq, but that check
solely deals with clock warps and merely limits the wreckage visible in
/proc/stat. The prev_* time values are still wrong.

Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.
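
A sketch of the reset, covering the prev_* fields named above (a helper
along these lines; the exact hook in the hotplug path is omitted):

	static inline void account_reset_rq(struct rq *rq)
	{
	#ifdef CONFIG_IRQ_TIME_ACCOUNTING
		rq->prev_irq_time = 0;
	#endif
	#ifdef CONFIG_PARAVIRT
		rq->prev_steal_time = 0;
	#endif
	#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
		rq->prev_steal_time_rq = 0;
	#endif
	}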

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 095c0aa83e ("sched: adjust scheduler cpu power for stolen time")
Fixes: aa48380851 ("sched: Remove irq time from available CPU power")
Fixes: e6e6685acc ("KVM guest: Steal time accounting")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2016-06-07 10:42:48 +02:00
Joonwoo Park a6bd7ae282 sched: inline function scale_load_to_cpu()
Inline the relatively small and frequently used function scale_load_to_cpu().

CRs-fixed: 849655
Change-Id: Id5f60595c394959d78e6da4cc4c18c338fec285b
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2015-08-24 17:45:31 -07:00
Syed Rameez Mustafa 6f717a4bf8 sched: Optimize select_best_cpu() to reduce execution time
select_best_cpu() is a crucial wakeup routine that determines the
time taken by the scheduler to wake up a task. Optimize this routine
to get higher performance. The following changes have been made as
part of the optimization listed in order of how they built on top of
one another:

* Several routines called by select_best_cpu() recalculate task load
  and CPU load even though these are already known quantities. For
  example mostly_idle_cpu_sync() calculates CPU load; task_will_fit()
  calculates task load before spill_threshold_crossed() recalculates
  both. Remove these redundant calculations by moving the task load
  and CPU load computations to the select_best_cpu() 'for' loop and
  passing them to any functions that need the information (see the
  sketch after this list).

* Rewrite best_small_task_cpu() to avoid the existing two pass
  approach. The two pass approach was only in place to find the
  minimum power cluster for small task placement. This information
  can easily be established by looking at runqueue capacities: the
  cluster that does not have the highest capacity constitutes the
  minimum power cluster. A special CPU mask, mpc_mask, is required to
  safeguard against undue side effects on SMP systems. Also terminate
  the function early if the previous CPU is found to be mostly_idle.

* Reorganize code to ensure that no unnecessary computations or
  variable assignments are done. For example there is no need to
  compute CPU load if that information does not end up getting used
  in any iteration of the 'for' loop.

* The tick logic for EA migrations unnecessarily checks for the power
  of all CPUs only for skip_cpu() to throw away the result later.
  Ensure that for EA we only check CPUs within the same cluster
  and avoid running select_best_cpu() whenever possible.
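
A sketch of the hoisting described in the first bullet (loop structure
simplified; task_load(), cpu_load_sync() and the argument lists are
illustrative):

	for_each_cpu(i, &search_cpus) {
		/* compute task and CPU load once per candidate cpu */
		u64 tload = scale_load_to_cpu(task_load(p), i);
		u64 cload = cpu_load_sync(i, sync);

		if (!task_will_fit(p, i, tload) ||
		    spill_threshold_crossed(tload, cload, cpu_rq(i)))
			continue;
		/* ... power and C-state comparisons reuse tload/cload ... */
	}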

CRs-fixed: 849655
Change-Id: I4e722912fcf3fe4e365a826d4d92a4dd45c05ef3
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed cpufreq_notifier_policy() to set mpc_mask.
 added a comment about prerequisite of lower_power_cpu_available().
 s/struct rq * rq/struct rq *rq/.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2015-08-24 17:45:09 -07:00
Ramakrishnan Ganesh 8c9b8b6409 Revert "sched: Use only partial wait time as task demand"
This reverts commit 2a7c208b28d7558a5e296cec662c691590b0652a.

Change-Id: Ie02645574fd93f4149aa6a3f355c71eb412c9bac
Signed-off-by: Ramakrishnan Ganesh <ramakris@codeaurora.org>
2015-05-28 12:17:00 -07:00
Syed Rameez Mustafa a77f89168a sched: Use only partial wait time as task demand
The scheduler currently either considers a task's entire wait time as
task demand or completely ignores wait time, based on the tunable
sched_account_wait_time. Both approaches have their limitations,
however. The former artificially boosts a task's demand when that may
not actually be justified. With the latter, the scheduler runs the risk
of never being able to recognize true load (consider two CPU hogs on
a single little CPU). To achieve a compromise between these two
extremes, change the load tracking algorithm to only consider part of
a task's wait time as its demand. The portion of wait time accounted
as demand is determined by each task's percent load, i.e. a task that
waits for 10ms and has 60% task load only has 6ms of the wait
contribute to task demand. This approach is fairer as the scheduler
now tries to determine how much of its wait time a task would actually
have spent using the CPU had it been able to execute. It ensures that
tasks with high demand continue to see most of the benefits of
accounting wait time as busy time; however, lower demand tasks don't
experience a disproportionately high boost to demand triggering
unjustified big CPU usage. Note that this new approach is only
applicable to wait time being considered as task demand and not wait
time considered as CPU busy time.

To achieve the above effect, ensure that anytime a task is waiting, its
runtime in every relevant window segment is appropriately adjusted using
its pct load.
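
The scaling amounts to (helper and variable names illustrative):

	/* e.g. 10ms waited at 60% task load contributes 6ms of demand */
	u64 delta_demand = div64_u64((u64)wait_time * task_pct_load, 100);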

Change-Id: I6a698d6cb1adeca49113c3499029b422daf7871f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2015-05-28 12:16:58 -07:00
Syed Rameez Mustafa d191fa671d sched: Update max_capacity when an entire cluster is hotplugged
When an entire cluster is hotplugged, the scheduler's notion of
max_capacity can get outdated. This introduces the following
inefficiencies in behavior:

* task_will_fit() does not return true on all tasks. Consequently
  all big tasks go through fallback CPU selection logic skipping
  C-state and power checks in select_best_cpu().

* During boost, migration_needed() return true unnecessarily
  causing an avoidable rerun of select_best_cpu().

* An unnecessary kick is sent to all little CPUs when boost is set.

* An opportunity for early bailout from nohz_kick_needed() is lost.

Start handling CPUFREQ_REMOVE_POLICY in the policy notifier callback
which indicates the last CPU in a cluster being hotplugged out. Also
modify update_min_max_capacity() to only iterate through online CPUs
instead of possible CPUs. While we can't guarantee the integrity of
the cpu_online_mask in the notifier callback, the scheduler will fix
up all state soon after any changes to the online mask.

The change does have one side effect; early termination from the
notifier callback when min_max_freq or max_possible_freq remain
unchanged is no longer possible. This is because when the last CPU
in a cluster is hot removed, only max_capacity is updated without
affecting min_max_freq or max_possible_freq. Therefore, when the
first CPU in the same cluster gets hot added at a later point
max_capacity must once again be recomputed despite there being no
change in min_max_freq or max_possible_freq.

Change-Id: I9a1256b5c2cd6fcddd85b069faf5e2ace177e122
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2015-05-01 19:40:50 -07:00
Srivatsa Vaddagiri cbf9e59b96 sched: Add cgroup-based criteria for upmigration
It may be desirable to discourage upmigration of tasks belonging to
some cgroups. Add a per-cgroup flag (upmigrate_discourage) that
discourages upmigration of tasks of a cgroup. Tasks of the cgroup are
allowed to upmigrate only under overcommitted scenario.

Change-Id: I1780e420af1b6865c5332fb55ee1ee408b74d8ce
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2015-03-05 15:50:56 -08:00
Srivatsa Vaddagiri 8f9ba192b6 sched: Keep track of average nr_big_tasks
Extend sched_get_nr_running_avg() API to return average nr_big_tasks,
in addition to average nr_running and average nr_io_wait tasks. Also
add a new trace point to record values returned by
sched_get_nr_running_avg() API.
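
The extended signature then returns three averages (parameter names
assumed):

	void sched_get_nr_running_avg(int *avg, int *iowait_avg, int *big_avg);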

Change-Id: Id3591e6d04da8db484b4d1cb9d95dba075f5ab9a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2015-02-26 10:37:26 -08:00
Srivatsa Vaddagiri 3f116af70d sched: Fix bug in average nr_running and nr_iowait calculation
sched_get_nr_running_avg() returns average nr_running and nr_iowait
task count since it was last invoked. Fix several bugs in their
calculation.

* sched_update_nr_prod() needs to consider that nr_running count can
  change by more than 1 when CFS_BANDWIDTH feature is used

* sched_get_nr_running_avg() needs to sum up nr_iowait count across
  all cpus, rather than just one

* sched_get_nr_running_avg() could race with sched_update_nr_prod(),
  as a result of which it could use curr_time which is behind a cpu's
  'last_time' value. That would lead to erroneous calculation of
  average nr_running or nr_iowait.

While at it, also fix a bug in the BUG_ON() check in the
sched_update_nr_prod() function and remove the unnecessary nr_running
argument to sched_update_nr_prod().

Change-Id: I46737614737292fae0d7204c4648fb9b862f65b2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2015-02-26 10:33:52 -08:00
Srivatsa Vaddagiri 2385d33016 sched: Support CFS_BANDWIDTH feature in HMP scheduler
The CFS_BANDWIDTH feature is not currently well supported by the HMP
scheduler. Issues encountered include a kernel panic when the
rq->nr_big_tasks count becomes negative. This patch fixes the HMP
scheduler code to better handle the CFS_BANDWIDTH feature. The most
prominent change introduced is maintenance of HMP stats (nr_big_tasks,
nr_small_tasks, cumulative_runnable_avg) per 'struct cfs_rq' in
addition to being maintained in each 'struct rq'. This allows HMP
stats to be updated easily when a group is throttled on a cpu.

Change-Id: Iad9f378b79ab5d9d76f86d1775913cc1941e266a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2015-01-28 14:13:19 +05:30
Srivatsa Vaddagiri bbef4c5e1b sched: Consolidate hmp stats into their own struct
Key hmp stats (nr_big_tasks, nr_small_tasks and
cumulative_runnable_average) are currently maintained per-cpu in
'struct rq'. Merge those stats in their own structure (struct
hmp_sched_stats) and modify impacted functions to deal with the newly
introduced structure. This cleanup is required for a subsequent patch
which fixes various issues with use of CFS_BANDWIDTH feature in HMP
scheduler.
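
A sketch of the consolidated structure, using the fields named above
(types assumed):

	struct hmp_sched_stats {
		int nr_big_tasks;
		int nr_small_tasks;
		u64 cumulative_runnable_avg;
	};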

Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2015-01-28 14:13:14 +05:30
Linux Build Service Account 380cadc7f3 Merge "sched: Per-cpu prefer_idle flag" 2014-12-29 17:31:47 -08:00
Srivatsa Vaddagiri 599bfc7503 sched: Per-cpu prefer_idle flag
Remove the global sysctl_sched_prefer_idle flag and replace it with a
per-cpu prefer_idle flag. The per-cpu flag is expected to be the same for
all cpus in a cluster. It thus provides a convenient means to disable
packing in one cluster while allowing packing in another cluster.

Change-Id: Ie4cc73bb1a55b4eac5697be38e558546161faca1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-12-23 09:52:43 +05:30
Olav Haugan 7e13b27b8b sched: Add sysctl to enable power aware scheduling
Add a sysctl to enable energy awareness at runtime. This is useful for
performance/power tuning/measurements and debugging. In addition this
will match up with the Documentation/scheduler/sched-hmp.txt documentation.

Change-Id: I0a9185498640d66917b38bf5d55f6c59fc60ad5c
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2014-12-22 14:37:33 -08:00
Joonwoo Park fc994a4b9e sched: take account of irq preemption when calculating irqload delta
If irq raises while sched_irqload() is calculating irqload delta,
sched_account_irqtime() can update rq's irqload_ts which can be greater
than the jiffies stored in sched_irqload()'s context so delta can be
negative.  This negative delta means there was recent irq occurence.
So remove improper BUG_ON().
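
A sketch of the corrected check (threshold and field names illustrative):

	u64 sched_irqload(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		s64 delta = get_jiffies_64() - rq->irqload_ts;

		/*
		 * An irq can bump rq->irqload_ts between our jiffies read
		 * and the subtraction, making delta negative; that only
		 * means very recent irq activity, so report the load
		 * instead of hitting a BUG_ON().
		 */
		if (delta < RECENT_IRQ_WINDOW)
			return rq->avg_irqload;
		return 0;
	}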

CRs-fixed: 771894
Change-Id: I5bb01b50ec84c14bf9f26dd9c95de82ec2cd19b5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-12-16 16:56:50 -08:00
Olav Haugan 2c320f2ffa sched: Add temperature to cpu_load trace point
Add the current CPU temperature to the sched_cpu_load trace point.
This will allow us to track the CPU temperature.

CRs-Fixed: 764788
Change-Id: Ib2e3559bbbe3fe07a6b7c8115db606828bc36254
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2014-12-13 06:43:48 -08:00
Steve Muckle 75d1c94217 sched: make sched_cpu_high_irqload a runtime tunable
It may be desirable to be able to alter the sched_cpu_high_irqload
setting easily, so make it a runtime tunable value.

Change-Id: I832030eec2aafa101f0f435a4fd2d401d447880d
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-12-10 23:53:53 -08:00
Steve Muckle 51f0d7663b sched: avoid CPUs with high irq activity
CPUs with significant IRQ activity will not be able to serve tasks
quickly. Avoid them if possible by disqualifying such CPUs from
being recognized as mostly idle.

Change-Id: I2c09272a4f259f0283b272455147d288fce11982
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-12-10 23:53:47 -08:00
Steve Muckle 5fdc1d3aaa sched: track soft/hard irqload per-RQ with decaying avg
The scheduler currently ignores irq activity when deciding which
CPUs to place tasks on. If a CPU is getting hammered with IRQ activity
but has no tasks it will look attractive to the scheduler as it will
not be in a low power mode.

Track irqload with a decaying average. This quantity can be used
in the task placement logic to avoid CPUs which are under high
irqload. The decay factor is 3/4. Note that with this algorithm the
tracked irqload quantity will be higher than the actual irq time
observed in any single window. Some sample outcomes with steady
irqloads per 10ms window and the 3/4 decay factor (irqload of 10 is
used as a threshold in a subsequent patch):

irqload per window        load value asymptote      # windows to > 10
2ms			  8			    n/a
3ms			  12			    7
4ms			  16			    4
5ms			  20			    3

Of course irqload will not be constant in each window; these are just
simple examples.
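
A sketch of the per-window update implied by the 3/4 decay factor (field
names illustrative):

	/* at each 10ms window rollover; a steady per-window irq time x
	 * converges to an asymptote of 4x, matching the table above */
	rq->avg_irqload = rq->avg_irqload * 3 / 4 + rq->cur_irqload;
	rq->cur_irqload = 0;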

Change-Id: I9dba049f5dfdcecc04339f727c8dd4ff554e01a5
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-12-10 19:50:45 -08:00
Linux Build Service Account b4229d736e Merge "sched: Make RT tasks eligible for boost" 2014-12-05 00:05:48 -08:00
Syed Rameez Mustafa fce95c9a12 sched: Make RT tasks eligible for boost
During sched boost RT tasks currently end up going to the lowest
power cluster. This can be a performance bottleneck especially if
the frequency and IPC differences between clusters are high.
Furthermore, when RT tasks go over to the little cluster during
boost, the load balancer keeps attempting to pull work over to the
big cluster. This results in pre-emption of the executing RT task
causing more delays. Finally, containing more work on a single
cluster during boost might help save some power if the little
cluster can then enter deeper low power modes.

Change-Id: I177b2e81be5657c23e7ac43889472561ce9993a9
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-12-03 19:50:25 -08:00
Srivatsa Vaddagiri 57da62614c sched: Packing support until a frequency threshold
Add another dimension for task packing based on frequency. This patch
adds a per-cpu tunable, rq->mostly_idle_freq, which when set will
result in tasks being packed on a single cpu in the cluster as long as
the cluster frequency is less than the set threshold.

Change-Id: I318e9af6c8788ddf5dfcda407d621449ea5343c0
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-12-02 11:48:30 +05:30
Srivatsa Vaddagiri ed7d7749e9 sched: per-cpu mostly_idle threshold
sched_mostly_idle_load and sched_mostly_idle_nr_run knobs help pack
tasks on cpus to some extent. In some cases, it may be desirable to
have different packing limits for different cpus. For example, pack to
a higher limit on high-performance cpus compared to power-efficient
cpus.

This patch removes the global mostly_idle tunables and makes them
per-cpu, thus letting task packing behavior be controlled in a
fine-grained manner.

Change-Id: Ifc254cda34b928eae9d6c342ce4c0f64e531e6c2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-11-06 15:27:00 +05:30
Srivatsa Vaddagiri f3386c7cfb sched: update governor notification logic
Make the criteria for notifying the governor per-cpu. The governor is
notified of any large change in a cpu's busy time statistics
(rq->prev_runnable_sum) since the last reported value.

Change-Id: I727354d994d909b166d093b94d3dade7c7dddc0d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-10-15 14:57:18 -07:00
Srivatsa Vaddagiri 2568673dd6 sched: window-stats: Enhance cpu busy time accounting
The rq->curr/prev_runnable_sum counters represent cpu demand from various
tasks that have run on a cpu. Any task that runs on a cpu is represented
in rq->curr_runnable_sum via its partial_demand value. Since
partial_demand is derived from historical load samples for a task,
rq->curr_runnable_sum can represent "inflated/unrealistic" cpu usage. As
an example, let's say a task with a partial_demand of 10ms runs for only
1ms on a cpu. What is included in rq->curr_runnable_sum is 10ms (and not
the actual execution time of 1ms). This leads to cpu busy time being
over-reported, causing frequency to stay higher than necessary.

This patch fixes the cpu busy time accounting scheme to strictly
represent actual usage. It also provides for conditional fixup of busy
time upon migration and upon heavy-task wakeup.
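
Illustratively, the accounting changes from demand-based to
execution-based (names assumed):

	/*
	 * before: historical demand was added, so a task with 10ms
	 * partial_demand that ran for only 1ms inflated the sum by 10ms:
	 *	rq->curr_runnable_sum += p->ravg.partial_demand;
	 * after: account what actually executed in the window:
	 */
	rq->curr_runnable_sum += delta_exec;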

CRs-Fixed: 691443
Change-Id: Ic4092627668053934049af4dfef65d9b6b901e6b
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-10-03 14:03:51 -07:00
Srivatsa Vaddagiri 86df733742 sched: improve logic for alerting governor
Currently we send notifications to the governor without taking note of
cpus that are synchronized with regard to their frequency. As a result,
the scheduler can send pointless notifications (notification spam!).

Avoid this by considering synchronized cpus and alerting the governor
only when the highest demand of any cpu within the cluster far exceeds
or falls behind the current frequency.

Change-Id: I74908b5a212404ca56b38eb94548f9b1fbcca33d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-10-03 13:46:18 -07:00
Linux Build Service Account 0dbd5f1b7b Merge "sched: window-stats: add a new AVG policy" 2014-09-09 04:47:32 -07:00
Linux Build Service Account 672d3eb95f Merge "sched: fix wrong load_scale_factor/capacity/nr_big/small_tasks" 2014-09-09 00:57:10 -07:00
Srivatsa Vaddagiri 9e37153f17 sched: fix wrong load_scale_factor/capacity/nr_big/small_tasks
A couple of bugs exist with the incorrect use of cpu_online_mask in the
pre/post_big_small_task() functions, leading to potentially incorrect
computation of load_scale_factor, capacity, and nr_big/small_tasks.

pre/post_big_small_task_count_change() use cpu_online_mask in an
unreliable manner. While local_irq_disable() in
pre_big_small_task_count_change() ensures a cpu won't go away in
cpu_online_mask, nothing prevents a cpu from coming online
concurrently. As a result, the cpu_online_mask used in
pre_big_small_task_count_change() can be inconsistent with that used
in post_big_small_task_count_change(), which can lead to an attempt to
unlock an rq->lock that was never taken.

Secondly, when either max_possible_freq or min_max_freq is changing,
it needs to trigger recomputation of load_scale_factor and capacity
for *all* cpus, even if some are offline. Otherwise, an offline cpu
could later come online with incorrect load_scale_factor/capacity.

While it should be sufficient to scan online cpus for
updating their nr_big/small_tasks in
post_big_small_task_count_change(), unfortunately it sounds pretty
hard to provide a stable cpu_online_mask when its called from
cpufreq_notifier_policy(). cpufreq framework can trigger a
CPUFREQ_NOTIFY notification in multiple contexts, some in cpu-hotplug
paths, which makes it pretty hard to guess whether get_online_cpus()
can be taken without causing deadlocks or not. To workaround the
insufficient information we have about the hotplug-safety context when
CPUFREQ_NOTIFY is issued, have post_big_small_task_count_change()
traverse all possible cpus in updating nr_big/small_task_count.

CRs-Fixed: 717134
Change-Id: Ife8f3f7cdfd77d5a21eee63627d7a3465930aed5
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-09-08 17:18:24 -07:00
Syed Rameez Mustafa bf3e6c0e55 sched: window-stats: add a new AVG policy
The current WINDOW_STATS_AVG policy is actually a misnomer since it
uses the maximum of the runtime in the recent window and the average of
the past ravg_hist_size windows. Add a policy that only uses the
average, and give it the WINDOW_STATS_AVG name. Rename all the other
policies to make them shorter and unambiguous.

Change-Id: I080a4ea072a84a88858ca9da59a4151dfbdbe62c
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-09-08 11:07:41 -07:00
Syed Rameez Mustafa 9c37494817 sched: fix bail condition in bail_inter_cluster_balance()
Following commit efcad25cbfb (revert "sched: influence cpu_power based
on max_freq and efficiency"), all CPUs in the system have the same
cpu_power and consequently the same group capacity. Therefore, the
check in bail_inter_cluster_balance() can now no longer be used to
distinguish a higher performance cluster from one with lower
performance. The check is currently broken and always returns true for
every load balancing attempt. Fix this by using runqueue capacity
instead which can still be used as a good measure of cluster
capabilities.

Also, the logic for distinguishing between idle environments and using
a different sched group capacity in update_sd_pick_busiest() is
redundant. sgs->group_capacity would now always be equal to the number
of CPUs in the group. Use sgs->group_capacity directly in conditional
checks in that function.

Change-Id: Idecfd1ed221d27d4324b20539e5224a92bf8b751
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-09-03 19:23:40 -07:00
Srivatsa Vaddagiri 4f93bebd20 sched: window-stats: use policy_mutex in sched_set_window()
Several configuration variable changes will result in
reset_all_window_stats() being called. All of them, except
sched_set_window(), are serialized via policy_mutex. Take
policy_mutex in sched_set_window() as well to serialize use of the
reset_all_window_stats() function.

Change-Id: Iada7ff8ac85caa1517e2adcf6394c5b050e3968a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:45:17 -07:00
Srivatsa Vaddagiri b432b691fa sched: window_stats: Add "disable" mode support
"disabled" mode (sched_disble_window_stats = 1) disables all
window-stats related activity. This is useful when changing key
configuration variables associated with window-stats feature (like
policy or window size).

Change-Id: I9e55c9eb7f7e3b1b646079c3aa338db6259a9cfe
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:45:15 -07:00
Srivatsa Vaddagiri 85ed6be992 sched: window-stats: legacy mode
Support legacy mode, which results in busy time seen by the governor
being close to what it would have seen via the existing APIs, i.e.
get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
get_cpu_idle_time_jiffy(). In particular, legacy mode means that only
task execution time is counted in rq->curr_runnable_sum and
rq->prev_runnable_sum. Also task migration does not result in
adjustment of those counters.

Change-Id: If374ccc084aa73f77374b6b3ab4cd0a4ca7b8c90
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:14 -07:00
Srivatsa Vaddagiri dafe791457 sched: window-stats: Code cleanup
Remove code duplication associated with updating the various
window-stats-related sysctl tunables.

Change-Id: I64e29ac065172464ba371a03758937999c42a71f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:12 -07:00
Olav Haugan df91ad278c sched: Make RAVG_HIST_SIZE tunable
Make RAVG_HIST_SIZE available from /proc/sys/kernel/sched_ravg_hist_size
to allow tuning of the size of the history that is used in the
computation of task demand.

CRs-fixed: 706138
Change-Id: Id54c1e4b6e974a62d787070a0af1b4e8ce3b4be6
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2014-08-12 11:15:20 -07:00
Srivatsa Vaddagiri 098d8371ad sched: window-stats: 64-bit type for curr/prev_runnable_sum
Expand rq->curr_runnable_sum and rq->prev_runnable_sum to be 64-bit
counters, as otherwise they can easily overflow when a cpu has many
tasks (a 32-bit counter of nanoseconds wraps after only ~4.3 seconds of
accumulated time).

Change-Id: I68ab2658ac6a3174ddb395888ecd6bf70ca70473
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:31 -07:00
Srivatsa Vaddagiri 5e8f14fbbc sched: window-stats: Allow acct_wait_time to be tuned
Add a sysctl interface to tune the sched_acct_wait_time variable at runtime.

Change-Id: I38339cdb388a507019e429709a7c28e80b5b3585
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:30 -07:00
Srivatsa Vaddagiri 4da7e167b3 sched: window-stats: Account interrupt handling time as busy time
Account cycles spent by an idle cpu handling interrupts (irq or softirq)
towards its busy time.

Change-Id: I84cc084ced67502e1cfa7037594f29ed2305b2b1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:30 -07:00
Srivatsa Vaddagiri fa15bd9937 sched: Fix herding issue
check_for_migration() could run concurrently on multiple cpus,
resulting in multiple tasks wanting to migrate to the same cpu. This
could cause cpus to be underutilized and lead to increased scheduling
latencies for tasks. Fix this by serializing select_best_cpu() calls
from cpus running check_for_migration(), and by marking selected
cpus as reserved so that subsequent calls to select_best_cpu() from
check_for_migration() skip reserved cpus.

Change-Id: I73a22cacab32dee3c14267a98b700f572aa3900c
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:29 -07:00
Srivatsa Vaddagiri ea5020bcd2 sched: trigger immediate migration of tasks upon boost
Currently turning on boost does not immediately trigger migration of
tasks from lower capacity cpus. Tasks could incur a migration latency
of up to one timer tick (when check_for_migration() is run).

Fix this by triggering a migration check on cpus with lower capacity
as soon as boost is turned on for the first time.
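
A sketch of the immediate trigger (accessors illustrative):

	/* on the first boost enable, nudge lower-capacity cpus so that
	 * check_for_migration() runs now rather than at the next tick */
	for_each_online_cpu(cpu)
		if (cpu_capacity(cpu) < max_capacity)
			smp_send_reschedule(cpu);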

Change-Id: I244649f9cb6608862d87631325967b887b7f4b7e
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-06 15:36:59 +05:30
Srivatsa Vaddagiri a57fe9b6df sched: window-stats: Handle policy change properly
sched_window_stat_policy influences task demand and thus various
statistics maintained per-cpu, like curr_runnable_sum. Changing the
policy non-atomically would lead to improper accounting. For example,
when a task is enqueued on a cpu's runqueue, the demand added to
rq->cumulative_runnable_avg could be based on the AVG policy, and when
it is dequeued, the demand removed could be based on MAX, leading to
erroneous accounting.

This change makes the policy change "atomic", i.e. all cpus' rq->locks
are held and all tasks' window-stats are reset before the policy is
changed.

Change-Id: I6a3e4fb7bc299dfc5c367693b5717a1ef518c32d
CRs-Fixed: 687409
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-06 15:36:59 +05:30
Steve Muckle 1aa9b6992a sched: fixes for compilation without CONFIG_SCHED_HMP
These fixes are necessary to compile without CONFIG_SCHED_HMP
enabled.

Change-Id: Iabbde3c22a81288242ed3a44fdfdb2a16db8b072
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-22 16:08:02 -07:00
Srivatsa Vaddagiri fefafa08b7 sched: remove sysctl control for HMP and power-aware task placement
There is no real need to control HMP and power-aware task placement at
runtime after the kernel has booted. Boot-time control should be
sufficient. Not allowing for runtime (sysctl) support simplifies the
code quite a bit.

Also rename sysctl_sched_enable_hmp_task_placement to be shorter.

Change-Id: I60cae51a173c6f73b79cbf90c50ddd41a27604aa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 16:07:58 -07:00
Srivatsa Vaddagiri 87df0beb43 sched: support legacy mode better
It should be possible to bypass all HMP scheduler changes at runtime
by setting sysctl_sched_enable_hmp_task_placement and
sysctl_sched_enable_power_aware to 0.  Fix various code paths to honor
this requirement.

Change-Id: I74254e68582b3f9f1b84661baf7dae14f981c025
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 16:07:05 -07:00
Syed Rameez Mustafa 8f7e5b8ee8 sched: Add a per rq max_possible_capacity for use in power calculations
In the absence of a power driver providing real power values, the scheduler
currently defaults to using the capacity of a CPU as a measure of power. This,
however, is not a good measure since the capacity of a CPU can change due
to thermal conditions and/or other hardware restrictions. These frequency
restrictions have no effect on the power efficiency of those CPUs.
Introduce the max possible capacity of a CPU to track an absolute measure of
capacity, which translates into a good absolute measure of power efficiency.
Max possible capacity takes the max possible frequency of CPUs into account
instead of the max frequency.

Change-Id: Ia970b853e43a90eb8cc6fd990b5c47fca7e50db8
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:04 -07:00
Syed Rameez Mustafa 3d972d3af1 sched: Make task and CPU load calculations safe from truncation
Load calculations have been modified to accept and return 64-bit values.
Fix up all the places where we make such calculations to store the result
in 64-bit variables. This is necessary to avoid issues caused by
truncation of values.

While at it, rename scale_task_load() to scale_load_to_cpu(). This is
because the API is used to scale the load of both individual tasks and
the cumulative load of CPUs, so the old name was a misnomer. Also
clean up power_cost() to use max_task_load().

Change-Id: I51e683e1592a5ea3c4e4b2b06d7a7339a49cce9c
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:03 -07:00
Syed Rameez Mustafa 8eebaa1826 sched/fair: Introduce C-state aware task placement for small tasks
Small tasks execute for small durations. This means that the power
cost of taking CPUs out of a low power mode outweighs any performance
advantage of using an idle core or any power advantage of using the most
power efficient CPU. Introduce C-state aware task placement for small
tasks. This requires a two-pass approach where we first determine the
most power efficient CPU and establish a band of CPUs offering a
similar power cost for the task. The order of preference then is as
follows:

1) Any mostly idle CPU in active C-state in the same power band.
2) A CPU with the shallowest C-state in the same power band.
3) A CPU with the least load in the same power band.
4) Lowest power CPU in a higher power band.
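
A sketch of that ranking, approximated as a single pass with C-state
depth first and load as a tiebreaker (helpers like cpu_cstate(),
mostly_idle_cpu() and cpu_load() are illustrative):

	int best = -1, best_cstate = INT_MAX;
	u64 best_load = ~0ULL;

	for_each_cpu(i, &power_band) {	/* CPUs of similar power cost */
		int cstate = cpu_cstate(i);

		if (cstate == 0 && mostly_idle_cpu(i))
			return i;	/* 1) active and mostly idle */
		if (cstate < best_cstate ||
		    (cstate == best_cstate && cpu_load(i) < best_load)) {
			best = i;	/* 2) shallowest C-state, 3) least load */
			best_cstate = cstate;
			best_load = cpu_load(i);
		}
	}
	return best >= 0 ? best : lowest_power_cpu(higher_band);	/* 4) */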

The patch also modifies the definition of a small task. Small tasks
are now determined relative to the minimum capacity CPUs in the system
and not the task's CPU.

Change-Id: Ia09840a5972881cad7ba7bea8fe34c45f909725e
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:02 -07:00