Commit:
e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")
... set rq->prev_* to 0 after a CPU hotplug comes back, in order to
fix the case where (after CPU hotplug) steal time is smaller than
rq->prev_steal_time.
However, this should never happen. Steal time was only smaller because of the
KVM-specific bug fixed by the previous patch. Worse, the previous patch
triggers a bug on CPU hot-unplug/plug operation: because
rq->prev_steal_time is cleared, all of the CPU's past steal time will be
accounted again on hot-plug.
Since the root cause has been fixed, we can just revert commit e9532e69b8d1.
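The double accounting described above can be sketched in userspace C. This
is only a model under stated assumptions: struct rq_model, account_steal()
and hotplug_reset() are hypothetical names, not kernel code, and the
hypervisor steal clock is assumed to keep counting across the hot-unplug.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-runqueue steal time bookkeeping. */
struct rq_model {
    uint64_t prev_steal_time;   /* steal clock value already accounted */
    uint64_t accounted;         /* total steal time charged so far */
};

/* Charge only the steal time accumulated since the previous call. */
static void account_steal(struct rq_model *rq, uint64_t steal_clock)
{
    uint64_t delta = steal_clock - rq->prev_steal_time;

    rq->accounted += delta;
    rq->prev_steal_time = steal_clock;
}

/* Models the behavior being reverted: clear the snapshot on hot-plug. */
static void hotplug_reset(struct rq_model *rq)
{
    rq->prev_steal_time = 0;
}
```

With a steal clock that reads 1000 before the unplug and 1100 after the
plug, the reset makes the second call charge 1100 instead of 100, so the
first 1000 units are counted twice.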
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 'commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")'
Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge 3.10.102 into android-msm-bullhead-3.10-oreo-m5
Changes in 3.10.102: (144 commits)
pipe: Fix buffer offset after partially failed read
x86/iopl/64: Properly context-switch IOPL on Xen PV
ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
compiler-gcc: integrate the various compiler-gcc[345].h files
x86: LLVMLinux: Fix "incomplete type const struct x86cpu_device_id"
KVM: i8254: change PIT discard tick policy
KVM: fix spin_lock_init order on x86
EDAC, amd64_edac: Shift wrapping issue in f1x_get_norm_dct_addr()
PCI: Disable IO/MEM decoding for devices with non-compliant BARs
linux/const.h: Add _BITUL() and _BITULL()
x86: Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE
x86, processor-flags: Fix the datatypes and add bit number defines
x86/iopl: Fix iopl capability check on Xen PV
sg: fix dxferp in from_to case
aacraid: Fix memory leak in aac_fib_map_free
be2iscsi: set the boot_kset pointer to NULL in case of failure
usb: retry reset if a device times out
USB: cdc-acm: more sanity checking
USB: iowarrior: fix oops with malicious USB descriptors
USB: usb_driver_claim_interface: add sanity checking
USB: mct_u232: add sanity checking in probe
USB: digi_acceleport: do sanity checking for the number of ports
USB: cypress_m8: add endpoint sanity check
USB: serial: cp210x: Adding GE Healthcare Device ID
USB: option: add "D-Link DWM-221 B1" device id
pwc: Add USB id for Philips Spc880nc webcam
Input: powermate - fix oops with malicious USB descriptors
net: irda: Fix use-after-free in irtty_open()
8250: use callbacks to access UART_DLL/UART_DLM
bttv: Width must be a multiple of 16 when capturing planar formats
media: v4l2-compat-ioctl32: fix missing length copy in put_v4l2_buffer32
ALSA: intel8x0: Add clock quirk entry for AD1981B on IBM ThinkPad X41.
jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
bcache: fix cache_set_flush() NULL pointer dereference on OOM
watchdog: rc32434_wdt: fix ioctl error handling
splice: handle zero nr_pages in splice_to_pipe()
xtensa: ISS: don't hang if stdin EOF is reached
xtensa: clear all DBREAKC registers on start
md/raid5: Compare apples to apples (or sectors to sectors)
rapidio/rionet: fix deadlock on SMP
ipr: Fix out-of-bounds null overwrite
ipr: Fix regression when loading firmware
drm/radeon: Don't drop DP 2.7 Ghz link setup on some cards.
tracing: Have preempt(irqs)off trace preempt disabled functions
tracing: Fix crash from reading trace_pipe with sendfile
tracing: Fix trace_printk() to print when not using bprintk()
scripts/coccinelle: modernize &
Input: ims-pcu - sanity check against missing interfaces
Input: ati_remote2 - fix crashes on detecting device with invalid descriptor
ocfs2/dlm: fix race between convert and recovery
ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
mtd: onenand: fix deadlock in onenand_block_markbad
sched/cputime: Fix steal time accounting vs. CPU hotplug
perf/x86/intel: Fix PEBS data source interpretation on Nehalem/Westmere
hwmon: (max1111) Return -ENODEV from max1111_read_channel if not instantiated
parisc: Avoid function pointers for kernel exception routines
parisc: Fix kernel crash with reversed copy_from_user()
ALSA: timer: Use mod_timer() for rearming the system timer
net: jme: fix suspend/resume on JMC260
sctp: lack the check for ports in sctp_v6_cmp_addr
ipv6: re-enable fragment header matching in ipv6_find_hdr
cdc_ncm: toggle altsetting to force reset before setup
usbnet: cleanup after bind() in probe()
udp6: fix UDP/IPv6 encap resubmit path
sh_eth: fix NULL pointer dereference in sh_eth_ring_format()
net: Fix use after free in the recvmmsg exit path
farsync: fix off-by-one bug in fst_add_one
ath9k: fix buffer overrun for ar9287
qlge: Fix receive packets drop.
ppp: take reference on channels netns
qmi_wwan: add "D-Link DWM-221 B1" device id
ipv4: l2tp: fix a potential issue in l2tp_ip_recv
ipv6: l2tp: fix a potential issue in l2tp_ip6_recv
ip6_tunnel: set rtnl_link_ops before calling register_netdevice
usb: renesas_usbhs: avoid NULL pointer derefernce in usbhsf_pkt_handler()
usb: renesas_usbhs: disable TX IRQ before starting TX DMAC transfer
ext4: add lockdep annotations for i_data_sem
HID: usbhid: fix inconsistent reset/resume/reset-resume behavior
drm/radeon: hold reference to fences in radeon_sa_bo_new (3.17 and older)
usbvision-video: fix memory leak of alt_max_pkt_size
usbvision: fix leak of usb_dev on failure paths in usbvision_probe()
usbvision: fix crash on detecting device with invalid configuration
usb: xhci: fix wild pointers in xhci_mem_cleanup
usb: hcd: out of bounds access in for_each_companion
crypto: gcm - Fix rfc4543 decryption crash
nl80211: check netlink protocol in socket release notification
Input: gtco - fix crash on detecting device without endpoints
i2c: cpm: Fix build break due to incompatible pointer types
EDAC: i7core, sb_edac: Don't return NOTIFY_BAD from mce_decoder callback
ASoC: s3c24xx: use const snd_soc_component_driver pointer
efi: Fix out-of-bounds read in variable_matches()
workqueue: fix ghost PENDING flag while doing MQ IO
USB: usbip: fix potential out-of-bounds write
paride: make 'verbose' parameter an 'int' again
fbdev: da8xx-fb: fix videomodes of lcd panels
misc/bmp085: Enable building as a module
rtc: vr41xx: Wire up alarm_irq_enable
drivers/misc/ad525x_dpot: AD5274 fix RDAC read back errors
include/linux/poison.h: fix LIST_POISON{1,2} offset
Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors
perf stat: Document --detailed option
ARM: OMAP3: Add cpuidle parameters table for omap3430
compiler-gcc: disable -ftracer for __noclone functions
ipvs: correct initial offset of Call-ID header search in SIP persistence engine
nbd: ratelimit error msgs after socket close
clk: versatile: sp810: support reentrance
lpfc: fix misleading indentation
ARM: SoCFPGA: Fix secondary CPU startup in thumb2 kernel
proc: prevent accessing /proc/<PID>/environ until it's ready
batman-adv: Fix broadcast/ogm queue limit on a removed interface
MAINTAINERS: Remove asterisk from EFI directory names
ACPICA: Dispatcher: Update thread ID for recursive method calls
USB: serial: cp210x: add ID for Link ECU
USB: serial: cp210x: add Straizona Focusers device ids
Input: ads7846 - correct the value got from SPI
powerpc: scan_features() updates incorrect bits for REAL_LE
crypto: hash - Fix page length clamping in hash walk
get_rock_ridge_filename(): handle malformed NM entries
Input: max8997-haptic - fix NULL pointer dereference
asmlinkage, pnp: Make variables used from assembler code visible
ARM: OMAP3: Fix booting with thumb2 kernel
decnet: Do not build routes to devices without decnet private data.
route: do not cache fib route info on local routes with oif
packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interface
atl2: Disable unimplemented scatter/gather feature
net: fix infoleak in llc
net: fix infoleak in rtnetlink
VSOCK: do not disconnect socket when peer has shutdown SEND only
net: bridge: fix old ioctl unlocked net device walk
net: fix a kernel infoleak in x25 module
fs/cifs: correctly to anonymous authentication via NTLMSSP
ring-buffer: Use long for nr_pages to avoid overflow failures
ring-buffer: Prevent overflow of size in ring_buffer_resize()
mfd: omap-usb-tll: Fix scheduling while atomic BUG
mmc: mmc: Fix partition switch timeout for some eMMCs
mmc: longer timeout for long read time quirk
Bluetooth: vhci: purge unhandled skbs
USB: serial: keyspan: fix use-after-free in probe error path
USB: serial: quatech2: fix use-after-free in probe error path
USB: serial: io_edgeport: fix memory leaks in probe error path
USB: serial: option: add support for Cinterion PH8 and AHxx
tty: vt, return error when con_startup fails
serial: samsung: Reorder the sequence of clock control when call s3c24xx_serial_set_termios()
Linux 3.10.102
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Conflicts:
drivers/media/v4l2-core/v4l2-compat-ioctl32.c
fs/pipe.c
kernel/trace/trace_printk.c
net/core/rtnetlink.c
net/socket.c
commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream.
On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
value over CPU down and up. So after the CPU comes up again the delta
calculation in steal_account_process_tick() wreckages itself due to the
unsigned math:
    u64 steal = paravirt_steal_clock(smp_processor_id());
    steal -= this_rq()->prev_steal_time;
So if steal is smaller than rq->prev_steal_time we end up with an insane large
value which then gets added to rq->prev_steal_time, resulting in a permanent
wreckage of the accounting. As a consequence the per CPU stats in /proc/stat
become stale.
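A minimal userspace sketch of the unsigned wrap, for illustration only.
steal_delta() is a hypothetical helper that mirrors the subtraction quoted
above; it is not the kernel function.

```c
#include <stdint.h>

/*
 * Mirrors "steal -= prev_steal_time" with 64-bit unsigned math.
 * If steal_clock is smaller than the stale snapshot, the result wraps
 * around to a value close to 2^64 instead of going negative.
 */
static uint64_t steal_delta(uint64_t steal_clock, uint64_t prev_steal_time)
{
    return steal_clock - prev_steal_time;
}
```

For example, a steal clock of 100 against a stale snapshot of 500 yields
2^64 - 400 rather than -400.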
Nice trick to tell the world how idle the system is (100%) while the CPU is
100% busy running tasks. Though we prefer realistic numbers.
None of the accounting values which use a previous value to account for
fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity
check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
deals with clock warps and limits the /proc/stat visible wreckage. The
prev_time values are still wrong.
Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: commit 095c0aa83e "sched: adjust scheduler cpu power for stolen time"
Fixes: commit aa48380851 "sched: Remove irq time from available CPU power"
Fixes: commit e6e6685acc "KVM guest: Steal time accounting"
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Inline relatively small and frequently used function scale_load_to_cpu().
CRs-fixed: 849655
Change-Id: Id5f60595c394959d78e6da4cc4c18c338fec285b
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
select_best_cpu() is a crucial wakeup routine that determines the
time taken by the scheduler to wake up a task. Optimize this routine
to get higher performance. The following changes have been made as
part of the optimization listed in order of how they built on top of
one another:
* Several routines called by select_best_cpu() recalculate task load
and CPU load even though these are already known quantities. For
example mostly_idle_cpu_sync() calculates CPU load; task_will_fit()
calculates task load before spill_threshold_crossed() recalculates
both. Remove these redundant calculations by moving the task load
and CPU load computations to the select_best_cpu() 'for' loop and
passing to any functions that need the information.
* Rewrite best_small_task_cpu() to avoid the existing two pass
approach. The two pass approach was only in place to find the
minimum power cluster for small task placement. This information
can easily be established by looking at runqueue capacities. The
cluster that does not have the highest capacity constitutes the minimum
power cluster. A special CPU mask, called the mpc_mask, is required to
safeguard against undue side effects on SMP systems. Also terminate the function
early if the previous CPU is found to be mostly_idle.
* Reorganize code to ensure that no unnecessary computations or
variable assignments are done. For example there is no need to
compute CPU load if that information does not end up getting used
in any iteration of the 'for' loop.
* The tick logic for EA migrations unnecessarily checks for the power
of all CPUs only for skip_cpu() to throw away the result later.
Ensure that for EA we only check CPUs within the same cluster
and avoid running select_best_cpu() whenever possible.
CRs-fixed: 849655
Change-Id: I4e722912fcf3fe4e365a826d4d92a4dd45c05ef3
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[joonwoop@codeaurora.org: fixed cpufreq_notifier_policy() to set mpc_mask.
added a comment about prerequisite of lower_power_cpu_available().
s/struct rq * rq/struct rq *rq/.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
The scheduler currently either considers a task's entire wait time as
task demand or completely ignores wait time based on the tunable
sched_account_wait_time. Both approaches have their limitations,
however. The former artificially boosts a task's demand when it may not
actually be justified. With the latter, the scheduler runs the risk
of never being able to recognize true load (consider two CPU hogs on
a single little CPU). To achieve a compromise between these two
extremes, change the load tracking algorithm to only consider part of
a task's wait time as its demand. The portion of wait time accounted
as demand is determined by each task's percent load. For example, if a
task waits for 10 ms and has 60% task load, only 6 ms of the wait will
contribute to task demand. This approach is fairer, as the scheduler
now tries to estimate how much of its wait time a task would actually
have spent using the CPU if it had been executing. It ensures that tasks
with high demand continue to see most of the benefits of accounting
wait time as busy time, while lower demand tasks don't experience a
disproportionately high boost to demand triggering unjustified big CPU
usage. Note that this new approach is only applicable to wait time
being considered as task demand and not wait time considered as CPU
busy time.
To achieve the above effect, ensure that anytime a task is waiting, its
runtime in every relevant window segment is appropriately adjusted using
its pct load.
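The scaling can be sketched as follows. scale_wait_time() is a hypothetical
helper, and nanosecond units with a 0..100 pct load are assumptions made
for the example, not taken from the patch.

```c
#include <stdint.h>

/*
 * Return the part of a task's wait time that counts toward its demand.
 * wait_ns:  time spent waiting on the runqueue (assumed nanoseconds)
 * pct_load: the task's percent load, clamped to 0..100
 */
static uint64_t scale_wait_time(uint64_t wait_ns, unsigned int pct_load)
{
    if (pct_load > 100)
        pct_load = 100;

    return wait_ns * pct_load / 100;
}
```

A 10 ms wait at 60% load contributes 6 ms, matching the example above.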
Change-Id: I6a698d6cb1adeca49113c3499029b422daf7871f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
When an entire cluster is hotplugged, the scheduler's notion of
max_capacity can get outdated. This introduces the following
inefficiencies in behavior:
* task_will_fit() does not return true on all tasks. Consequently
all big tasks go through fallback CPU selection logic skipping
C-state and power checks in select_best_cpu().
* During boost, migration_needed() returns true unnecessarily
causing an avoidable rerun of select_best_cpu().
* An unnecessary kick is sent to all little CPUs when boost is set.
* An opportunity for early bailout from nohz_kick_needed() is lost.
Start handling CPUFREQ_REMOVE_POLICY in the policy notifier callback
which indicates the last CPU in a cluster being hotplugged out. Also
modify update_min_max_capacity() to only iterate through online CPUs
instead of possible CPUs. While we can't guarantee the integrity of
the cpu_online_mask in the notifier callback, the scheduler will fix
up all state soon after any changes to the online mask.
The change does have one side effect; early termination from the
notifier callback when min_max_freq or max_possible_freq remain
unchanged is no longer possible. This is because when the last CPU
in a cluster is hot removed, only max_capacity is updated without
affecting min_max_freq or max_possible_freq. Therefore, when the
first CPU in the same cluster gets hot added at a later point
max_capacity must once again be recomputed despite there being no
change in min_max_freq or max_possible_freq.
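A rough sketch of why the set of CPUs iterated matters. max_capacity_of()
is a hypothetical helper, and the capacities and 32-bit masks below are
made-up values for a 2+2 system, not the scheduler's data structures.

```c
#include <stdint.h>

/*
 * Return the largest capacity among the CPUs set in cpu_mask.
 * Passing the possible mask reproduces the stale value; passing the
 * online mask gives the value that reflects a hotplugged-out cluster.
 */
static unsigned int max_capacity_of(const unsigned int *capacity,
                                    int nr_cpus, uint32_t cpu_mask)
{
    unsigned int max = 0;

    for (int cpu = 0; cpu < nr_cpus && cpu < 32; cpu++) {
        if (!(cpu_mask & (UINT32_C(1) << cpu)))
            continue;   /* CPU not in the mask, ignore it */
        if (capacity[cpu] > max)
            max = capacity[cpu];
    }
    return max;
}
```

With capacities {512, 512, 1024, 1024} and only CPUs 0-1 online, the
online mask gives 512 while the possible mask still gives 1024.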
Change-Id: I9a1256b5c2cd6fcddd85b069faf5e2ace177e122
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
It may be desirable to discourage upmigration of tasks belonging to
some cgroups. Add a per-cgroup flag (upmigrate_discourage) that
discourages upmigration of tasks of a cgroup. Tasks of the cgroup are
allowed to upmigrate only in an overcommitted scenario.
Change-Id: I1780e420af1b6865c5332fb55ee1ee408b74d8ce
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Extend sched_get_nr_running_avg() API to return average nr_big_tasks,
in addition to average nr_running and average nr_io_wait tasks. Also
add a new trace point to record values returned by
sched_get_nr_running_avg() API.
Change-Id: Id3591e6d04da8db484b4d1cb9d95dba075f5ab9a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
sched_get_nr_running_avg() returns average nr_running and nr_iowait
task count since it was last invoked. Fix several bugs in their
calculation.
* sched_update_nr_prod() needs to consider that nr_running count can
change by more than 1 when CFS_BANDWIDTH feature is used
* sched_get_nr_running_avg() needs to sum up nr_iowait count across
all cpus, rather than just one
* sched_get_nr_running_avg() could race with sched_update_nr_prod(),
as a result of which it could use curr_time which is behind a cpu's
'last_time' value. That would lead to erroneous calculation of
average nr_running or nr_iowait.
While at it, fix also a bug in BUG_ON() check in
sched_update_nr_prod() function and remove unnecessary nr_running
argument to sched_update_nr_prod() function.
Change-Id: I46737614737292fae0d7204c4648fb9b862f65b2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
CFS_BANDWIDTH feature is not currently well-supported by HMP
scheduler. Issues encountered include a kernel panic when
rq->nr_big_tasks count becomes negative. This patch fixes HMP
scheduler code to better handle CFS_BANDWIDTH feature. The most
prominent change introduced is maintenance of HMP stats (nr_big_tasks,
nr_small_tasks, cumulative_runnable_avg) per 'struct cfs_rq' in
addition to being maintained in each 'struct rq'. This allows HMP
stats to be updated easily when a group is throttled on a cpu.
Change-Id: Iad9f378b79ab5d9d76f86d1775913cc1941e266a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Key hmp stats (nr_big_tasks, nr_small_tasks and
cumulative_runnable_average) are currently maintained per-cpu in
'struct rq'. Merge those stats in their own structure (struct
hmp_sched_stats) and modify impacted functions to deal with the newly
introduced structure. This cleanup is required for a subsequent patch
which fixes various issues with use of CFS_BANDWIDTH feature in HMP
scheduler.
Change-Id: Ieffc10a3b82a102f561331bc385d042c15a33998
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Remove the global sysctl_sched_prefer_idle flag and replace it with a
per-cpu prefer_idle flag. The per-cpu flag is expected to be the same for all
cpus in a cluster. It thus provides convenient means to disable
packing in one cluster while allowing packing in another cluster.
Change-Id: Ie4cc73bb1a55b4eac5697be38e558546161faca1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Add sysctl to enable energy awareness at runtime. This is useful for
performance/power tuning/measurements and debugging. In addition this
will match up with the Documentation/scheduler/sched-hmp.txt documentation.
Change-Id: I0a9185498640d66917b38bf5d55f6c59fc60ad5c
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
If an irq is raised while sched_irqload() is calculating the irqload delta,
sched_account_irqtime() can update the rq's irqload_ts to a value greater
than the jiffies stored in sched_irqload()'s context, so the delta can be
negative. A negative delta means there was a recent irq occurrence, so
remove the improper BUG_ON().
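A simplified model of the race, assuming the delta is evaluated as a
signed 64-bit value. irqload_model() and HIGH_IRQ_TIMEOUT are hypothetical
names and values chosen for the example, not the kernel definitions.

```c
#include <stdint.h>

/* Hypothetical timeout, in jiffies, after which irqload is ignored. */
#define HIGH_IRQ_TIMEOUT 3

/*
 * now_snapshot: jiffies value read before the irq fired
 * irqload_ts:   timestamp of the last irq, possibly newer than the snapshot
 * Returns avg_irqload if the last irq is recent, 0 otherwise.
 */
static uint64_t irqload_model(uint64_t now_snapshot, uint64_t irqload_ts,
                              uint64_t avg_irqload)
{
    int64_t delta = (int64_t)(now_snapshot - irqload_ts);

    /* A negative delta means the irq happened after the snapshot. */
    if (delta < HIGH_IRQ_TIMEOUT)
        return avg_irqload;

    return 0;
}
```

A snapshot of 100 with a timestamp of 101 gives a delta of -1, which is
treated as recent irq activity instead of a bug.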
CRs-fixed: 771894
Change-Id: I5bb01b50ec84c14bf9f26dd9c95de82ec2cd19b5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Add the current CPU temperature to the sched_cpu_load trace point.
This will allow us to track the CPU temperature.
CRs-Fixed: 764788
Change-Id: Ib2e3559bbbe3fe07a6b7c8115db606828bc36254
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
It may be desirable to be able to alter the sched_cpu_high_irqload
setting easily, so make it a runtime tunable value.
Change-Id: I832030eec2aafa101f0f435a4fd2d401d447880d
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
CPUs with significant IRQ activity will not be able to serve tasks
quickly. Avoid them if possible by disqualifying such CPUs from
being recognized as mostly idle.
Change-Id: I2c09272a4f259f0283b272455147d288fce11982
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
The scheduler currently ignores irq activity when deciding which
CPUs to place tasks on. If a CPU is getting hammered with IRQ activity
but has no tasks it will look attractive to the scheduler as it will
not be in a low power mode.
Track irqload with a decaying average. This quantity can be used
in the task placement logic to avoid CPUs which are under high
irqload. The decay factor is 3/4. Note that with this algorithm the
tracked irqload quantity will be higher than the actual irq time
observed in any single window. Some sample outcomes with steady
irqloads per 10ms window and the 3/4 decay factor (irqload of 10 is
used as a threshold in a subsequent patch):
irqload per window    load value asymptote    # windows to > 10
2ms                   8                       n/a
3ms                   12                      7
4ms                   16                      4
5ms                   20                      3
Of course irqload will not be constant in each window, these are just
given as simple examples.
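The table values can be reproduced with a small model. It assumes the
tracked value is updated once per window as load = load * 3/4 + irqload,
starting from zero. The helper names are hypothetical and floating point
is used only to keep the example short.

```c
/* Steady-state value for a constant per-window irqload x: x / (1 - 3/4). */
static double irqload_asymptote(double per_window_ms)
{
    return per_window_ms * 4.0;
}

/*
 * Count how many windows it takes for the decayed sum to exceed
 * threshold. Returns -1 if that does not happen within max_windows.
 */
static int windows_to_exceed(double per_window_ms, double threshold,
                             int max_windows)
{
    double load = 0.0;

    for (int n = 1; n <= max_windows; n++) {
        load = load * 3.0 / 4.0 + per_window_ms;
        if (load > threshold)
            return n;
    }
    return -1;
}
```

With a threshold of 10, a steady 3ms per window crosses it in the 7th
window, while 2ms per window converges to 8 and never crosses it.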
Change-Id: I9dba049f5dfdcecc04339f727c8dd4ff554e01a5
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
During sched boost RT tasks currently end up going to the lowest
power cluster. This can be a performance bottleneck especially if
the frequency and IPC differences between clusters are high.
Furthermore, when RT tasks go over to the little cluster during
boost, the load balancer keeps attempting to pull work over to the
big cluster. This results in pre-emption of the executing RT task
causing more delays. Finally, containing more work on a single
cluster during boost might help save some power if the little
cluster can then enter deeper low power modes.
Change-Id: I177b2e81be5657c23e7ac43889472561ce9993a9
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Add another dimension for task packing based on frequency. This patch
adds a per-cpu tunable, rq->mostly_idle_freq, which when set will
result in tasks being packed on a single cpu in a cluster as long as
the cluster frequency is less than the set threshold.
Change-Id: I318e9af6c8788ddf5dfcda407d621449ea5343c0
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
sched_mostly_idle_load and sched_mostly_idle_nr_run knobs help pack
tasks on cpus to some extent. In some cases, it may be desirable to
have different packing limits for different cpus. For example, pack to
a higher limit on high-performance cpus compared to power-efficient
cpus.
This patch removes the global mostly_idle tunables and makes them
per-cpu, thus letting task packing behavior be controlled in a
fine-grained manner.
Change-Id: Ifc254cda34b928eae9d6c342ce4c0f64e531e6c2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Make criteria for notifying governor to be per-cpu. Governor is
notified of any large change in cpu's busy time statistics
(rq->prev_runnable_sum) since the last reported value.
Change-Id: I727354d994d909b166d093b94d3dade7c7dddc0d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
rq->curr/prev_runnable_sum counters represent cpu demand from various
tasks that have run on a cpu. Any task that runs on a cpu will have a
representation in rq->curr_runnable_sum. Their partial_demand value
will be included in rq->curr_runnable_sum. Since partial_demand is
derived from historical load samples for a task, rq->curr_runnable_sum
could represent "inflated/un-realistic" cpu usage. As an example, let's
say that a task with a partial_demand of 10ms runs for only 1ms on a cpu.
What is included in rq->curr_runnable_sum is 10ms (and not the actual
execution time of 1ms). This leads to cpu busy time being reported on
the upside causing frequency to stay higher than necessary.
This patch fixes cpu busy accounting scheme to strictly represent
actual usage. It also provides for conditional fixup of busy time upon
migration and upon heavy-task wakeup.
CRs-Fixed: 691443
Change-Id: Ic4092627668053934049af4dfef65d9b6b901e6b
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Currently we send notification to governor not taking note of cpus
that are synchronized with regard to their frequency. As a result,
scheduler could send pointless notifications (notification spam!).
Avoid this by considering synchronized cpus and alerting governor only
when the highest demand of any cpu within cluster far exceeds or falls
behind current frequency.
Change-Id: I74908b5a212404ca56b38eb94548f9b1fbcca33d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
A couple of bugs exist with incorrect use of cpu_online_mask in
pre/post_big_small_task() functions, leading to potentially incorrect
computation of load_scale_factor/capacity/nr_big/small_tasks.
pre/post_big_small_task_count_change() use cpu_online_mask in an
unreliable manner. While local_irq_disable() in
pre_big_small_task_count_change() ensures a cpu won't go away in
cpu_online_mask, nothing prevents a cpu from coming online
concurrently. As a result, cpu_online_mask used in
pre_big_small_task_count_change() can be inconsistent with that used
in post_big_small_task_count_change() which can lead to an attempt to
unlock rq->lock which was not taken before.
Secondly, when either max_possible_freq or min_max_freq is changing,
it needs to trigger recomputation of load_scale_factor and capacity
for *all* cpus, even if some are offline. Otherwise, an offline cpu
could later come online with incorrect load_scale_factor/capacity.
While it should be sufficient to scan online cpus for
updating their nr_big/small_tasks in
post_big_small_task_count_change(), unfortunately it sounds pretty
hard to provide a stable cpu_online_mask when it's called from
cpufreq_notifier_policy(). cpufreq framework can trigger a
CPUFREQ_NOTIFY notification in multiple contexts, some in cpu-hotplug
paths, which makes it pretty hard to guess whether get_online_cpus()
can be taken without causing deadlocks or not. To work around the
insufficient information we have about the hotplug-safety context when
CPUFREQ_NOTIFY is issued, have post_big_small_task_count_change()
traverse all possible cpus in updating nr_big/small_task_count.
CRs-Fixed: 717134
Change-Id: Ife8f3f7cdfd77d5a21eee63627d7a3465930aed5
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
The current WINDOW_STATS_AVG policy is actually a misnomer since it
uses the maximum value of the runtime in the recent window and the
average of the past ravg_hist_size windows. Add a policy that only
uses the average and call it WINDOW_STATS_AVG policy. Rename all the
other policies to make them shorter and unambiguous.
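The difference between the two behaviors can be sketched as below. The
function names are hypothetical, and hist[0] is assumed to hold the most
recent window's runtime. This is not the scheduler's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain average of the window history (what the new AVG policy uses). */
static uint32_t demand_avg(const uint32_t *hist, size_t n)
{
    uint64_t sum = 0;

    if (n == 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        sum += hist[i];
    return (uint32_t)(sum / n);
}

/* Behavior of the old, misnamed policy: max(recent window, average). */
static uint32_t demand_max_recent_avg(const uint32_t *hist, size_t n)
{
    uint32_t avg = demand_avg(hist, n);

    if (n == 0)
        return 0;
    return hist[0] > avg ? hist[0] : avg;
}
```

For a history of {8, 2, 2, 4} the average is 4, but the old behavior
reports 8 because the most recent window dominates.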
Change-Id: I080a4ea072a84a88858ca9da59a4151dfbdbe62c
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Following commit efcad25cbfb (revert "sched: influence cpu_power based
on max_freq and efficiency"), all CPUs in the system have the same
cpu_power and consequently the same group capacity. Therefore, the
check in bail_inter_cluster_balance() can now no longer be used to
distinguish a higher performance cluster from one with lower
performance. The check is currently broken and always returns true for
every load balancing attempt. Fix this by using runqueue capacity
instead which can still be used as a good measure of cluster
capabilities.
Also the logic for distinguishing between idle environments and using
a different sched group capacity in update_sd_pick_busiest() is
redundant. sgs->group_capacity would now always be equal to the number
of CPUs in the group. Use sgs->group_capacity directly in conditional
checks in that function.
Change-Id: Idecfd1ed221d27d4324b20539e5224a92bf8b751
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Several configuration variable changes will result in
reset_all_window_stats() being called. All of them, except
sched_set_window(), are serialized via policy_mutex. Take
policy_mutex in sched_set_window() as well to serialize use of
the reset_all_window_stats() function.
Change-Id: Iada7ff8ac85caa1517e2adcf6394c5b050e3968a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
"disabled" mode (sched_disable_window_stats = 1) disables all
window-stats related activity. This is useful when changing key
configuration variables associated with window-stats feature (like
policy or window size).
Change-Id: I9e55c9eb7f7e3b1b646079c3aa338db6259a9cfe
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Support legacy mode, which results in busy time being seen by governor
that is close to what it would have seen via existing APIs, i.e.
get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
get_cpu_idle_time_jiffy(). In particular, legacy mode means that only
task execution time is counted in rq->curr_runnable_sum and
rq->prev_runnable_sum. Also task migration does not result in
adjustment of those counters.
Change-Id: If374ccc084aa73f77374b6b3ab4cd0a4ca7b8c90
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Remove code duplication associated with update of various window-stats
related sysctl tunables.
Change-Id: I64e29ac065172464ba371a03758937999c42a71f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Make RAVG_HIST_SIZE available from /proc/sys/kernel/sched_ravg_hist_size
to allow tuning of the size of the history that is used in computation
of task demand.
CRs-fixed: 706138
Change-Id: Id54c1e4b6e974a62d787070a0af1b4e8ce3b4be6
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
Expand rq->curr_runnable_sum and rq->prev_runnable_sum to be 64-bit
counters as otherwise they can easily overflow when a cpu has many
tasks.
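To illustrate the overflow, here is a small sketch that sums per-task runnable time into a 32-bit and a 64-bit counter. The function names, the nanosecond unit and the numbers used below are assumptions chosen for illustration, not values from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: accumulate per-task runnable time (ns) in a 32-bit counter. */
uint32_t sum_runnable_32(const uint32_t *task_ns, size_t ntasks)
{
	uint32_t sum = 0;

	assert(task_ns != NULL);
	for (size_t i = 0; i < ntasks; i++)
		sum += task_ns[i];	/* wraps modulo 2^32 */
	return sum;
}

/* Same accumulation with a 64-bit counter, which does not wrap here. */
uint64_t sum_runnable_64(const uint32_t *task_ns, size_t ntasks)
{
	uint64_t sum = 0;

	assert(task_ns != NULL);
	for (size_t i = 0; i < ntasks; i++)
		sum += task_ns[i];
	return sum;
}
```

With 500 tasks that each contribute 10,000,000 ns, the true sum is 5,000,000,000, which exceeds 2^32 (4,294,967,296), so the 32-bit counter wraps while the 64-bit counter holds the full value.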
Change-Id: I68ab2658ac6a3174ddb395888ecd6bf70ca70473
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Account cycles spent by an idle cpu handling interrupts (irq or softirq)
towards its busy time.
Change-Id: I84cc084ced67502e1cfa7037594f29ed2305b2b1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
check_for_migration() could run concurrently on multiple cpus,
resulting in multiple tasks wanting to migrate to the same cpu. This could
cause cpus to be underutilized and lead to increased scheduling
latencies for tasks. Fix this by serializing select_best_cpu() calls
made from check_for_migration() and marking the selected
cpus as reserved, so that subsequent calls to select_best_cpu() from
check_for_migration() skip reserved cpus.
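The reservation idea can be sketched in userspace as follows. This is not the kernel implementation: the lock type, the "least loaded" selection criterion, the array size and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4	/* hypothetical system size */

/* Serializes selection so two callers cannot pick the same cpu. */
static pthread_mutex_t migration_lock = PTHREAD_MUTEX_INITIALIZER;
static int cpu_reserved[NR_CPUS_SKETCH];

/*
 * Pick the least-loaded cpu that is not already reserved and mark it
 * reserved. Returns the cpu index, or -1 if every cpu is reserved.
 */
int select_and_reserve_cpu(const int *load, int ncpus)
{
	int best = -1;

	assert(load != NULL && ncpus > 0 && ncpus <= NR_CPUS_SKETCH);

	pthread_mutex_lock(&migration_lock);
	for (int i = 0; i < ncpus; i++) {
		if (cpu_reserved[i])
			continue;	/* already claimed by another caller */
		if (best < 0 || load[i] < load[best])
			best = i;
	}
	if (best >= 0)
		cpu_reserved[best] = 1;
	pthread_mutex_unlock(&migration_lock);

	return best;
}

/* Drop the reservation once the migrating task has arrived (assumption). */
void clear_reservation(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	pthread_mutex_lock(&migration_lock);
	cpu_reserved[cpu] = 0;
	pthread_mutex_unlock(&migration_lock);
}
```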
Change-Id: I73a22cacab32dee3c14267a98b700f572aa3900c
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Currently, turning on boost does not immediately trigger migration of
tasks from lower capacity cpus. Tasks could incur migration latency
of up to one timer tick (when check_for_migration() is run).
Fix this by triggering a migration check on cpus with lower capacity
as soon as boost is turned on for the first time.
Change-Id: I244649f9cb6608862d87631325967b887b7f4b7e
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
sched_window_stat_policy influences task demand and thus various
statistics maintained per-cpu, like curr_runnable_sum. Changing the policy
non-atomically would lead to improper accounting. For example, when a
task is enqueued on a cpu's runqueue, the demand that is added to
rq->cumulative_runnable_avg could be based on the AVG policy, and when it is
dequeued the demand that is removed could be based on MAX, leading to
erroneous accounting.
This change makes a policy change "atomic", i.e., every cpu's rq->lock
is held and all tasks' window-stats are reset before the policy is changed.
Change-Id: I6a3e4fb7bc299dfc5c367693b5717a1ef518c32d
CRs-Fixed: 687409
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
These fixes are necessary to compile without CONFIG_SCHED_HMP
enabled.
Change-Id: Iabbde3c22a81288242ed3a44fdfdb2a16db8b072
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
There is no real need to control HMP and power-aware task placement at
runtime after kernel has booted. Boot-time control should be
sufficient. Not allowing for runtime (sysctl) support simplifies the
code quite a bit.
Also rename sysctl_sched_enable_hmp_task_placement to be shorter.
Change-Id: I60cae51a173c6f73b79cbf90c50ddd41a27604aa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
It should be possible to bypass all HMP scheduler changes at runtime
by setting sysctl_sched_enable_hmp_task_placement and
sysctl_sched_enable_power_aware to 0. Fix various code paths to honor
this requirement.
Change-Id: I74254e68582b3f9f1b84661baf7dae14f981c025
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
In the absence of a power driver providing real power values, the scheduler
currently defaults to using capacity of a CPU as a measure of power. This,
however, is not a good measure since the capacity of a CPU can change due
to thermal conditions and/or other hardware restrictions. These frequency
restrictions have no effect on the power efficiency of those CPUs.
Introduce max possible capacity of a CPU to track an absolute measure of
capacity which translates into a good absolute measure of power efficiency.
Max possible capacity takes the max possible frequency of CPUs into account
instead of max frequency.
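As a rough illustration of the difference, assume capacity scales linearly with frequency on a 1024-based scale. The linear formula, the scale constant and the function name are assumptions for this sketch, not the scheduler's actual computation.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SCALE_SKETCH 1024u	/* hypothetical fixed-point scale */

/*
 * Capacity of a cpu running at freq_khz, relative to a reference
 * frequency (assumed linear scaling).
 */
uint32_t scaled_capacity(uint32_t freq_khz, uint32_t reference_freq_khz)
{
	assert(reference_freq_khz > 0);
	return (uint32_t)((uint64_t)CAPACITY_SCALE_SKETCH * freq_khz /
			  reference_freq_khz);
}
```

If a thermal limit lowers a cpu's max frequency from 2,000,000 kHz to 1,500,000 kHz, the capacity derived from max frequency drops to 768, while the value derived from max possible frequency stays at 1024 and so remains a stable input for power comparisons.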
Change-Id: Ia970b853e43a90eb8cc6fd990b5c47fca7e50db8
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Load calculations have been modified to accept and return 64 bit values.
Fix up all the places where we make such calculations to store the result
in 64 bit variables. This is necessary to avoid issues caused by
truncation of values.
While at it, rename scale_task_load() to scale_load_to_cpu(). This is
because the API is used to scale load of both individual tasks as well as
the cumulative load of CPUs. In this sense the name was a misnomer. Also
clean up power_cost() to use max_task_load().
Change-Id: I51e683e1592a5ea3c4e4b2b06d7a7339a49cce9c
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Small tasks execute for small durations. This means that the power
cost of taking CPUs out of a low power mode outweighs any performance
advantage of using an idle core or power advantage of using the most
power efficient CPU. Introduce C-state aware task placement for small
tasks. This requires a two pass approach where we first determine the
most power efficient CPU and establish a band of CPUs offering a
similar power cost for the task. The order of preference then is as
follows:
1) Any mostly idle CPU in active C-state in the same power band.
2) A CPU with the shallowest C-state in the same power band.
3) A CPU with the least load in the same power band.
4) Lowest power CPU in a higher power band.
The patch also modifies the definition of a small task. Small tasks
are now determined relative to minimum capacity CPUs in the system
and not the task CPU.
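The two pass approach and the preference order above can be sketched roughly as below. The struct layout, the percentage-based band, the `fits` flag used to decide when to fall back to a higher power band, and all names are assumptions for illustration; the patch itself may differ in every one of these details.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct cpu_info_sketch {
	int power_cost;		/* estimated cost of running the task here */
	int cstate;		/* 0 = active, larger = deeper idle state */
	int mostly_idle;	/* nonzero if the cpu is mostly idle */
	int load;		/* current load */
	int fits;		/* hypothetical: nonzero if the task may run here */
};

/* Returns the chosen cpu index, or -1 if no cpu can take the task. */
int pick_cpu_for_small_task(const struct cpu_info_sketch *c, size_t n,
			    int band_pct)
{
	int min_cost = INT_MAX;
	int best_active = -1, best_idle = -1, best_loaded = -1, fallback = -1;

	assert(c != NULL && n > 0 && band_pct >= 0);

	/* Pass 1: find the most power efficient cpu and derive the band. */
	for (size_t i = 0; i < n; i++)
		if (c[i].power_cost < min_cost)
			min_cost = c[i].power_cost;
	int band_limit = min_cost + min_cost * band_pct / 100;

	/* Pass 2: classify candidates by the preference order. */
	for (size_t i = 0; i < n; i++) {
		if (!c[i].fits)
			continue;
		if (c[i].power_cost > band_limit) {
			/* 4) lowest power cpu in a higher power band */
			if (fallback < 0 ||
			    c[i].power_cost < c[fallback].power_cost)
				fallback = (int)i;
		} else if (c[i].cstate == 0 && c[i].mostly_idle) {
			/* 1) mostly idle cpu in active C-state */
			if (best_active < 0)
				best_active = (int)i;
		} else if (c[i].cstate > 0) {
			/* 2) shallowest C-state among idle cpus */
			if (best_idle < 0 || c[i].cstate < c[best_idle].cstate)
				best_idle = (int)i;
		} else {
			/* 3) least loaded busy cpu */
			if (best_loaded < 0 || c[i].load < c[best_loaded].load)
				best_loaded = (int)i;
		}
	}

	if (best_active >= 0)
		return best_active;
	if (best_idle >= 0)
		return best_idle;
	if (best_loaded >= 0)
		return best_loaded;
	return fallback;
}
```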
Change-Id: Ia09840a5972881cad7ba7bea8fe34c45f909725e
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>