Commit Graph

710 Commits

Author SHA1 Message Date
Linux Build Service Account e621b2a191 Merge "sched: window-stats: code cleanup" 2014-08-24 20:01:40 -07:00
Linux Build Service Account 5b1594145e Merge "sched: window-stats: legacy mode" 2014-08-24 20:01:39 -07:00
Linux Build Service Account 755da8b25c Merge "sched: window-stats: Code cleanup" 2014-08-24 20:01:38 -07:00
Linux Build Service Account 273f377789 Merge "sched: window-stats: Code cleanup" 2014-08-24 20:01:37 -07:00
Linux Build Service Account 0e3780151f Merge "sched: window-stats: Code cleanup" 2014-08-24 20:01:37 -07:00
Linux Build Service Account 7b3f011d4e Merge "sched: window-stats: Remove unused prev_window variable" 2014-08-24 20:01:36 -07:00
Srivatsa Vaddagiri 4f93bebd20 sched: window-stats: use policy_mutex in sched_set_window()
Several configuration variable changes result in
reset_all_window_stats() being called. All of them, except
sched_set_window(), are serialized via policy_mutex. Take
policy_mutex in sched_set_window() as well to serialize use of
the reset_all_window_stats() function.

Change-Id: Iada7ff8ac85caa1517e2adcf6394c5b050e3968a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:45:17 -07:00
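A minimal sketch of the serialization described above; the signature of sched_set_window() and the arguments to reset_all_window_stats() are assumptions, not the tree's exact code:

static DEFINE_MUTEX(policy_mutex);

int sched_set_window(u64 window_start, unsigned int window_size)
{
	int ret;

	/* same mutex the other tunables hold around reset_all_window_stats() */
	mutex_lock(&policy_mutex);
	ret = reset_all_window_stats(window_start, window_size);
	mutex_unlock(&policy_mutex);

	return ret;
}
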
Srivatsa Vaddagiri e4ff6c07c5 sched: window-stats: Avoid taking all cpu's rq->lock for long
reset_all_window_stats() walks task-list with all cpu's rq->lock held,
which can cause spinlock timeouts if task-list is huge (and hence lead
to a spinlock bug report). Avoid this by walking task-list without
cpu's rq->lock held.

Change-Id: Id09afd8b730fa32c76cd3bff5da7c0cd7aeb8dfb
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:45:16 -07:00
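A minimal sketch of the reworked walk described above, assuming a reset_task_stats() helper: the task list is traversed under tasklist_lock only, so no cpu's rq->lock is held for the duration of the walk.

	struct task_struct *g, *p;

	read_lock(&tasklist_lock);
	do_each_thread(g, p) {
		reset_task_stats(p);	/* per-task reset; rq->lock not held */
	} while_each_thread(g, p);
	read_unlock(&tasklist_lock);
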
Srivatsa Vaddagiri b432b691fa sched: window_stats: Add "disable" mode support
"disabled" mode (sched_disble_window_stats = 1) disables all
window-stats related activity. This is useful when changing key
configuration variables associated with window-stats feature (like
policy or window size).

Change-Id: I9e55c9eb7f7e3b1b646079c3aa338db6259a9cfe
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:45:15 -07:00
Srivatsa Vaddagiri e81a1d6ece sched: window-stats: Fix exit race
Exiting tasks are removed from tasklist and hence at some point will
become invisible to do_each_thread/for_each_thread task iterators.
This breaks the functionality of reset_all_window_stats() which *has*
to reset stats for *all* tasks.

This patch causes exiting tasks' stats to be reset *before* they are
removed from the tasklist. The DONT_ACCOUNT bit in an exiting task's
ravg.flags is also set so that their remaining execution time is not
accounted in cpu busy time counters (rq->curr/prev_runnable_sum).
reset_all_window_stats() is thus guaranteed to return with all tasks'
stats reset to 0.

Change-Id: I5f101156a4f958c1b3f31eb0db8cd06e621b75e9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:45:11 -07:00
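A hedged sketch of the exit-path hook this describes; the flag and helper names come from the message, but their exact shapes are assumptions:

void sched_exit(struct task_struct *p)
{
	unsigned long flags;
	struct rq *rq = task_rq_lock(p, &flags);

	reset_task_stats(p);		/* reset while still on the tasklist */
	p->ravg.flags |= DONT_ACCOUNT;	/* remaining exec time must not hit
					 * rq->curr/prev_runnable_sum */

	task_rq_unlock(rq, p, &flags);
}
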
Srivatsa Vaddagiri f90ea88fa6 sched: window-stats: code cleanup
Provide a wrapper function to reset a task's window statistics. This will be
reused by a subsequent patch.

Change-Id: Ied7d32325854088c91285d8fee55d5a5e8a954b3
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:50 -07:00
Srivatsa Vaddagiri 85ed6be992 sched: window-stats: legacy mode
Support legacy mode, which results in busy time being seen by the
governor that is close to what it would have seen via existing APIs,
i.e. get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
get_cpu_idle_time_jiffy(). In particular, legacy mode means that only
task execution time is counted in rq->curr_runnable_sum and
rq->prev_runnable_sum. Also, task migration does not result in
adjustment of those counters.

Change-Id: If374ccc084aa73f77374b6b3ab4cd0a4ca7b8c90
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:14 -07:00
Srivatsa Vaddagiri da60007442 sched: window-stats: Code cleanup
Collapse duplicated comments about keeping a few of the sysctl knobs
initialized to the same value as their non-sysctl copies.

Change-Id: Idc8261d86b9f36e5f2f2ab845213bae268ae9028
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:13 -07:00
Srivatsa Vaddagiri dafe791457 sched: window-stats: Code cleanup
Remove code duplication associated with the update of various
window-stats related sysctl tunables.

Change-Id: I64e29ac065172464ba371a03758937999c42a71f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:12 -07:00
Srivatsa Vaddagiri 25d5c94d24 sched: window-stats: Code cleanup
add_task_demand() and the 'long_sleep' calculation in it are not strictly
required. rq_freq_margin() checks for the need to change frequency, which
removes the need for the long_sleep calculation. Once that is removed, the
need for add_task_demand() vanishes.

Change-Id: I936540c06072eb8238fc18754aba88789ee3c9f5
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:12 -07:00
Srivatsa Vaddagiri e1ea811d7a sched: window-stats: Remove unused prev_window variable
Remove unused prev_window variable in 'struct ravg'

Change-Id: I22ec040bae6fa5810f9f8771aa1cb873a2183746
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:11 -07:00
Ian Maund 6440f462f9 Merge upstream tag 'v3.10.49' into msm-3.10
* commit 'v3.10.49': (529 commits)
  Linux 3.10.49
  ACPI / battery: Retry to get battery information if failed during probing
  x86, ioremap: Speed up check for RAM pages
  Score: Modify the Makefile of Score, remove -mlong-calls for compiling
  Score: The commit is for compiling successfully.
  Score: Implement the function csum_ipv6_magic
  score: normalize global variables exported by vmlinux.lds
  rtmutex: Plug slow unlock race
  rtmutex: Handle deadlock detection smarter
  rtmutex: Detect changes in the pi lock chain
  rtmutex: Fix deadlock detector for real
  ring-buffer: Check if buffer exists before polling
  drm/radeon: stop poisoning the GART TLB
  drm/radeon: fix typo in golden register setup on evergreen
  ext4: disable synchronous transaction batching if max_batch_time==0
  ext4: clarify error count warning messages
  ext4: fix unjournalled bg descriptor while initializing inode bitmap
  dm io: fix a race condition in the wake up code for sync_io
  Drivers: hv: vmbus: Fix a bug in the channel callback dispatch code
  clk: spear3xx: Use proper control register offset
  ...

In addition to bringing in upstream commits, this merge also makes minor
changes to maintain compatibility with upstream:

The definition of list_next_entry in qcrypto.c and ipa_dp.c has been
removed, as upstream has moved the definition to list.h. The implementation
of list_next_entry was identical between the two.

irq.c, for both arm and arm64 architecture, has had its calls to
__irq_set_affinity_locked updated to reflect changes to the API upstream.

Finally, as we have removed the sleep_length member variable of the
tick_sched struct, all changes made by upstream commit ec804bd do not
apply to our tree and have been removed from this merge. Only
kernel/time/tick-sched.c is impacted.

Change-Id: I63b7e0c1354812921c94804e1f3b33d1ad6ee3f1
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-08-20 13:23:09 -07:00
Peter Zijlstra c5ac12693f arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 4e857c58efeb99393cba5a5d0d8ec7117183137c
[joonwoop@codeaurora.org: fixed trivial merge conflict.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-08-15 11:45:28 -07:00
Peter Zijlstra 7d9e69c77f arch: Prepare for smp_mb__{before,after}_atomic()
Since the smp_mb__{before,after}*() ops are fundamentally dependent on
how an arch can implement atomics it doesn't make sense to have 3
variants of them. They must all be the same.

Furthermore, the 3 variants suggest they're only valid for those 3
atomic ops, while we have many more where they could be applied.

So move away from
smp_mb__{before,after}_{atomic,clear}_{dec,inc,bit}() and reduce the
interface to just the two: smp_mb__{before,after}_atomic().

This patch prepares the way by introducing default implementations in
asm-generic/barrier.h that default to a full barrier and providing
__deprecated inlines for the previous 6 barriers if they're not
provided by the arch.

This should allow for a mostly painless transition (lots of deprecated
warns in the interim).

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-wr59327qdyi9mbzn6x937s4e@git.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Chen, Gong" <gong.chen@linux.intel.com>
Cc: John Sullivan <jsrhbz@kanargh.force9.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mauro Carvalho Chehab <m.chehab@samsung.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: febdbfe8a91ce0d11939d4940b592eb0dba8d663
[joonwoop@codeaurora.org: fixed trivial merge conflict.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-08-15 11:45:27 -07:00
Steve Muckle 0a0adbb0b1 sched: disable frequency notifications by default
The frequency notifications from the scheduler do not currently respect
synchronous topologies. If demand on CPU 0 is driving frequency high and
CPU 1 is in the same frequency domain, and demand on CPU 1 is low,
frequency notifiers will be continuously sent by CPU 1 in an attempt to
have its frequency lowered.

Until the notifiers are fixed, disable them by default. They can still
be re-enabled at runtime.

Change-Id: Ic8a927af2236d8fe83b4f4a633b20a8ddcfba359
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-08-12 11:15:34 -07:00
Steve Muckle 7ebd479ae3 sched: fix misalignment between requested and actual windows
When set_window_start() is first executed sched_clock() has not yet
stabilized. Refresh the sched_init_jiffy and sched_clock_at_init_jiffy
values until it is known that sched_clock has stabilized - this will
be the case by the time a client calls the sched_set_window() API.

Change-Id: Icd057707ff44c3b240e5e7e96891b23c95733daa
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-08-12 11:15:33 -07:00
Syed Rameez Mustafa efcad24cbf Revert "sched: Influence cpu_power based on max_freq and efficiency"
This reverts commit 0951ec0ff1 ("sched:
Influence cpu_power based on max_freq and efficiency") to let all cpus
be seen at same 'cpu_power' from load balance perspective. Without
this revert, some cpus will be seen to have more 'cpu_power' than
others, causing tasks to incur wait-time despite availability of idle
cpus. This happens because a cpu with low 'cpu_power' can fail to see
imbalance with another cpu having higher 'cpu_power' and thus can go
idle without pulling any work.

Change-Id: Iccb34319c527d5b45f29c2d12d2ebc7acdd9d07e
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-08-12 11:15:33 -07:00
Olav Haugan df91ad278c sched: Make RAVG_HIST_SIZE tunable
Make RAVG_HIST_SIZE available from /proc/sys/kernel/sched_ravg_hist_size
to allow tuning of the size of the history that is used in computation
of task demand.

CRs-fixed: 706138
Change-Id: Id54c1e4b6e974a62d787070a0af1b4e8ce3b4be6
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2014-08-12 11:15:20 -07:00
Srivatsa Vaddagiri 7b76c244c2 sched: Fix possibility of "stuck" reserved flag
check_for_migration() could mark a thread for migration (in
rq->push_task) and invoke active_load_balance_cpu_stop(). However, that
thread could get migrated to another cpu by the time
active_load_balance_cpu_stop() runs, which could then fail to clear the
reserved flag for a cpu and to drop the task_struct reference when the
cpu has only one task (the stopper thread running
active_load_balance_cpu_stop()). This would leave the cpu's reserved
bit stuck, preventing it from being used effectively.

Fix this by having active_load_balance_cpu_stop() drop reserved bit
always.

Change-Id: I2464a46b4ddb52376a95518bcc95dd9768e891f9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:31 -07:00
Srivatsa Vaddagiri 0699a566d3 sched: initialize env->flags variable to 0
The env->flags and env->new_dst_cpu fields are not initialized in the
load_balance() function. As a result, load_balance() could wrongly see
the LBF_SOME_PINNED flag set and access a (bogus) new_dst_cpu's runqueue,
leading to an invalid memory reference. Fix this by initializing the
env->flags field to 0. While we are at it, fix a similar issue in the
active_load_balance_cpu_stop() function, although the uninitialized
env->flags variable currently causes no harm there.

Change-Id: Ied470b0abd65bf2ecfa33fa991ba554a5393f649
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:31 -07:00
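The fix in sketch form, using load_balance()'s designated-initializer style; fields other than .flags are illustrative:

	struct lb_env env = {
		.sd		= sd,
		.dst_cpu	= this_cpu,
		.dst_rq		= this_rq,
		.idle		= idle,
		.flags		= 0,	/* previously uninitialized: a stale
					 * LBF_SOME_PINNED could dereference a
					 * bogus new_dst_cpu runqueue */
	};
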
Srivatsa Vaddagiri 098d8371ad sched: window-stats: 64-bit type for curr/prev_runnable_sum
Expand rq->curr_runnable_sum and rq->prev_runnable_sum to be 64-bit
counters as otherwise they can easily overflow when a cpu has many
tasks.

Change-Id: I68ab2658ac6a3174ddb395888ecd6bf70ca70473
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:31 -07:00
Srivatsa Vaddagiri 5e8f14fbbc sched: window-stats: Allow acct_wait_time to be tuned
Add a sysctl interface to tune the sched_acct_wait_time variable at runtime.

Change-Id: I38339cdb388a507019e429709a7c28e80b5b3585
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:30 -07:00
Srivatsa Vaddagiri 4da7e167b3 sched: window-stats: Account interrupt handling time as busy time
Account cycles spent by idle cpu handling interrupts (irq or softirq)
towards its busy time.

Change-Id: I84cc084ced67502e1cfa7037594f29ed2305b2b1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:30 -07:00
Srivatsa Vaddagiri a3c1ecd80a sched: window-stats: Account idle time as busy time
Provide a knob to consider idle time as busy time, when the cpu becomes
idle as a result of an io_schedule() call. This will let the governor
parameter 'io_is_busy' be appropriately honored.

Change-Id: Id9fb4fe448e8e4909696aa8a3be5a165ad7529d3
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:30 -07:00
Srivatsa Vaddagiri c55cc8b64a sched: window-stats: Account wait time
Extend window-based task load accounting mechanism to include
wait-time as part of task demand. A subsequent patch will make this
feature configurable at runtime.

Change-Id: I8e79337c30a19921d5c5527a79ac0133b385f8a9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:29 -07:00
Srivatsa Vaddagiri dca58d0666 sched: window-stats: update task demand on tick
A task can execute on a cpu for a long time without being preempted
or migrated. In such a case, its demand can become outdated for a long
time. Prevent that from happening by updating demand of currently
running task during scheduler tick.

Change-Id: I321917b4590635c0a612560e3a1baf1e6921e792
CRs-Fixed: 698662
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:29 -07:00
Srivatsa Vaddagiri fa15bd9937 sched: Fix herding issue
check_for_migration() could run concurrently on multiple cpus,
resulting in multiple tasks wanting to migrate to the same cpu. This
could cause cpus to be underutilized and lead to increased scheduling
latencies for tasks. Fix this by serializing select_best_cpu() calls
from cpus running check_for_migration() and marking selected cpus as
reserved, so that subsequent calls to select_best_cpu() from
check_for_migration() will skip reserved cpus.

Change-Id: I73a22cacab32dee3c14267a98b700f572aa3900c
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:29 -07:00
Srivatsa Vaddagiri 02cc889604 sched: window-stats: print window size in /proc/sched_debug
Printing window size in /proc/sched_debug would provide useful
information to debug scheduler issues.

Change-Id: Ia12ab2cb544f41a61c8a1d87bf821b85a19e09fd
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:29 -07:00
Srivatsa Vaddagiri 1f12e6698c sched: Extend ftrace event to record boost and reason code
Add a new ftrace event to record changes to boost setting. Also extend
sched_task_load() ftrace event to record boost setting and reason code
passed to select_best_cpu(). This will be useful for debug purpose.

Change-Id: Idac72f86d954472abe9f88a8db184343b7730287
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:28 -07:00
Srivatsa Vaddagiri 232b0fe6f4 sched: Avoid needless migration
Restrict check_for_migration() to operate on fair_sched class tasks
only.

Also, check_for_migration() can result in a call to select_best_cpu()
to look for a better cpu for the task currently running on a cpu.
However select_best_cpu() can end up suggesting a cpu that is not
necessarily better than the cpu on which the task is currently running.
This will result in unnecessary migration. Prevent that from happening.

Change-Id: I391cdda0d7285671d5f79aa2da12eaaa6cae42d7
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:28 -07:00
John Stultz 3984bb13c8 printk: rename printk_sched to printk_deferred
commit aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream.

After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.

This only changes the function name. No logic changes.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07 14:30:26 -07:00
Srivatsa Vaddagiri c6d6e960df sched: Drop active balance request upon cpu going offline
A cpu could mark its currently running task to be migrated to another
cpu (via rq->push_task/rq->push_cpu) and could go offline before
active load balance handles the request. In such case, clear the
active load balance request.

Change-Id: Ia3e668e34edbeb91d8559c1abb4cbffa25b1830b
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-06 15:36:59 +05:30
Srivatsa Vaddagiri ea5020bcd2 sched: trigger immediate migration of tasks upon boost
Currently turning on boost does not immediately trigger migration of
tasks from lower capacity cpus. Tasks could incur migration latency
of up to one timer tick (when check_for_migration() is run).

Fix this by triggering a migration check on cpus with lower capacity
as soon as boost is turned on for the first time.

Change-Id: I244649f9cb6608862d87631325967b887b7f4b7e
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-06 15:36:59 +05:30
Srivatsa Vaddagiri 6cd6b83b50 sched: Extend boost benefit for small and low-prio tasks
Allow small and low-prio tasks to benefit from boost, which is
expected to last for a short duration. Any task that wishes to run
during that short period is allowed boost benefit.

Change-Id: I02979a0c5feeba0f1256b7ee3d73f6b283fcfafa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-06 15:36:59 +05:30
Srivatsa Vaddagiri a57fe9b6df sched: window-stats: Handle policy change properly
sched_window_stat_policy influences task demand and thus various
statistics maintained per-cpu like curr_runnable_sum. Changing the
policy non-atomically would lead to improper accounting. For example,
when a task is enqueued on a cpu's runqueue, the demand added to
rq->cumulative_runnable_avg could be based on the AVG policy, and when
it's dequeued the demand removed can be based on MAX, leading to
erroneous accounting.

This change makes the policy change "atomic", i.e. all cpus' rq->locks
are held and all tasks' window-stats are reset before the policy is
changed.

Change-Id: I6a3e4fb7bc299dfc5c367693b5717a1ef518c32d
CRs-Fixed: 687409
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-06 15:36:59 +05:30
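A minimal sketch of the "atomic" switch described above, with helper and variable names taken from the message; the real locking order and reset arguments may differ:

static void set_window_stats_policy(u32 policy)
{
	int cpu;

	local_irq_disable();
	for_each_possible_cpu(cpu)		/* freeze every runqueue */
		raw_spin_lock(&cpu_rq(cpu)->lock);

	reset_all_window_stats(0, 0);		/* zero all tasks' window stats */
	sched_window_stat_policy = policy;	/* publish under the locks */

	for_each_possible_cpu(cpu)
		raw_spin_unlock(&cpu_rq(cpu)->lock);
	local_irq_enable();
}
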
Srivatsa Vaddagiri 7b5f42a8e1 sched: window-stats: Reset all window stats
Currently, a few of the window statistics for tasks are not reset when
the window size changes. Fix this to completely reset all window
statistics for tasks and cpus. Move the reset code to a function,
which can be reused by a subsequent patch that resets the same
statistics upon policy change.

Change-Id: Ic626260245b89007c4d70b9a07ebd577e217f283
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-06 15:36:58 +05:30
Srivatsa Vaddagiri 578851abaa sched: window-stats: Additional error checking in sched_set_window()
Check for an invalid window size passed as argument to sched_set_window().
Also move up the local_irq_disable() call to avoid the thread being
preempted during the calculation of window_start and its comparison
against sched_clock(). Use the right macro to evaluate whether the
window_start argument is ahead in time or not.

Change-Id: Idc0d3ab17ede08471ae63b72a2d55e7f84868fd6
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-06 15:36:58 +05:30
Srivatsa Vaddagiri f2a21ce199 sched: window-stats: Fix incorrect calculation of partial_demand
When using MAX_POLICY, partial_demand is calculated incorrectly as 0.
Fix this by picking the maximum of the previous 4 windows and the most
recent sample.

Change-Id: I27850a510746a63b5382c84761920fc021b876c5
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-28 12:51:07 -07:00
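A sketch of the corrected MAX_POLICY computation, with history field and constant names assumed: the maximum must cover the most recent (possibly partial) sample as well as the saved windows, otherwise a task with an empty history reports 0.

static u32 max_policy_demand(struct ravg *ravg, u32 recent)
{
	u32 max = recent;	/* include the most recent sample */
	int i;

	for (i = 0; i < RAVG_HIST_SIZE; i++)
		max = max_t(u32, max, ravg->sum_history[i]);

	return max;
}
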
Srivatsa Vaddagiri 5c351b809f sched: window-stats: Fix potential wrong use of rq
'rq' reference to a cpu where a waking task last ran can be
potentially incorrect leading to incorrect accounting. This happens
when task_cpu() changes between points A & B in try_to_wake_up()
listed below:

try_to_wake_up()
{

cpu = src_cpu = task_cpu(p);
rq = cpu_rq(src_cpu);		-> Point A

..

while (p->on_cpu)
	cpu_relax();

smp_rmb();

raw_spin_lock(&rq->lock);	-> Point B

Fix this by initializing 'rq' variable after task has slept (its
on_cpu field becomes 0).

Also avoid adding task demand to its old cpu runqueue
(prev_runnable_sum) in case it's gone offline.

Change-Id: I9e5d3beeca01796d944137b5416805b983a6e06e
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-28 12:51:07 -07:00
Mateusz Guzik 4aba6e3634 sched: Fix possible divide by zero in avg_atom() calculation
commit b0ab99e7736af88b8ac1b7ae50ea287fffa2badc upstream.

proc_sched_show_task() does:

  if (nr_switches)
	do_div(avg_atom, nr_switches);

nr_switches is unsigned long and do_div truncates it to 32 bits, which
means it can test non-zero on e.g. x86-64 and be truncated to zero for
division.

Fix the problem by using div64_ul() instead.

As a side effect calculations of avg_atom for big nr_switches are now correct.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28 08:00:07 -07:00
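The truncation and its fix side by side; do_div() takes a 32-bit divisor, while div64_ul() from <linux/math64.h> divides a u64 by a full unsigned long:

	/* before: (u32)nr_switches can truncate to 0 despite the check */
	if (nr_switches)
		do_div(avg_atom, nr_switches);

	/* after: no truncation; also correct for large nr_switches */
	if (nr_switches)
		avg_atom = div64_ul(avg_atom, nr_switches);
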
Linux Build Service Account 0c445a72ec Merge "sched: set initial task load to just above a small task" 2014-07-26 12:41:28 -07:00
Steve Muckle 1eaec37bfd sched: set initial task load to just above a small task
To maximize power savings, set the initial load of newly created
tasks to just above a small task. Setting it below the small
task threshold would cause new tasks to be packed which is
very likely too aggressive.

Change-Id: Idace26cc0252e31a5472c73534d2f5277a1e3fa4
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-25 10:55:56 -07:00
Olav Haugan 47c59a6b72 sched/fair: Check whether any CPUs are available
There is a possibility that there are no allowed CPUs online when we try
to select the best cpu for a small task. Add a check to ensure we don't
continue if there are no CPUs available.

CRs-fixed: 692505
Change-Id: Iff955fb0d0b07e758a893539f7bc8ea8aa09d9c4
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2014-07-25 08:29:24 -07:00
Steve Muckle 1aa9b6992a sched: fixes for compilation without CONFIG_SCHED_HMP
These fixes are necessary to compile without CONFIG_SCHED_HMP
enabled.

Change-Id: Iabbde3c22a81288242ed3a44fdfdb2a16db8b072
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-22 16:08:02 -07:00
Steve Muckle 483fa0ade3 sched: enable hmp, power aware scheduling for targets with > 4 CPUs
Enabling and disabling hmp/power-aware scheduling is meant to be done
via kernel command line options. Until that is fully supported however,
take advantage of the fact that current targets with more than 4 CPUs
will need these features.

Change-Id: I4916805881d58eeb54747e4b972816ffc96caae7
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-22 16:08:01 -07:00
Srivatsa Vaddagiri fefafa08b7 sched: remove sysctl control for HMP and power-aware task placement
There is no real need to control HMP and power-aware task placement at
runtime after kernel has booted. Boot-time control should be
sufficient. Not allowing for runtime (sysctl) support simplifies the
code quite a bit.

Also rename sysctl_sched_enable_hmp_task_placement to be shorter.

Change-Id: I60cae51a173c6f73b79cbf90c50ddd41a27604aa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 16:07:58 -07:00
Srivatsa Vaddagiri 87df0beb43 sched: support legacy mode better
It should be possible to bypass all HMP scheduler changes at runtime
by setting sysctl_sched_enable_hmp_task_placement and
sysctl_sched_enable_power_aware to 0.  Fix various code paths to honor
this requirement.

Change-Id: I74254e68582b3f9f1b84661baf7dae14f981c025
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 16:07:05 -07:00
Srivatsa Vaddagiri 1ef9206d1c sched: code cleanup
Avoid the long if() block of code in set_task_cpu(). Move that code to
its own function

Change-Id: Ia80a99867ff9c23a614635e366777759abaccee4
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 16:05:48 -07:00
Srivatsa Vaddagiri cf0d1f54be sched: Add BUG_ON when task_cpu() is incorrect
It would be fatal if the task_cpu() information for a task did not
accurately represent the cpu on which it's running. All sorts of weird
issues can arise if that were to happen! Add a BUG_ON() in the context
switch path to detect such cases.

Change-Id: I4eb2c96c850e2247e22f773bbb6eedb8ccafa49c
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:23:05 -07:00
Srivatsa Vaddagiri cea49887c4 sched: avoid active migration of tasks not in TASK_RUNNING state
Avoid wasting effort in migrating tasks that are about to sleep.

Change-Id: Icf9520b1c8fa48d3e071cb9fa1c5526b3b36ff16
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:23:05 -07:00
Srivatsa Vaddagiri 955b16f3ce sched: fix up task load during migration
Fix the hack to set task's on_rq to 0 during task migration. Task's
load is temporarily added back to its runqueue so that
update_task_ravg() can fixup task's load when its demand is changing.
Task's load is removed immediately afterwards.

Temporarily setting p->on_rq to 0 introduces a race condition with
try_to_wake_up(). Another task (task A) may be attempting to wake
up the migrating task (task B). As long as task A sees task B's
p->on_rq as 1, the wake up will not continue. Changing p->on_rq to
0, then back to 1, allows task A to continue "waking" task B, at
which point we have both try_to_wake_up and the migration code
attempting to set the cpu of task B at the same time.

CRs-Fixed: 695071
Change-Id: I525745f144da4ffeba1d539890b4d46720ec3ef1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:23:05 -07:00
Prasad Sodagudi 6075c8a6f7 sched: avoid pushing tasks to an offline CPU
Currently active_load_balance_cpu_stop is run by the cpu stopper and it
pushes running tasks off the busiest CPU onto an idle target CPU. But
there is no check as to whether the target cpu is offline before
pushing the tasks. With the introduction of active migration in the
scheduler tick path (see check_for_migration()) there have been
instances of attempts to migrate tasks to offline CPUs.

Add a check as to whether the target cpu is online or not to prevent
scheduling on offline CPUs.

Change-Id: Ib8ac7f8aeabd3ca7365f3eae977075952dab4f21
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
2014-07-22 14:23:04 -07:00
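A minimal sketch of the guard, assuming the surrounding names in active_load_balance_cpu_stop():

	/* early in active_load_balance_cpu_stop() */
	if (!cpu_online(target_cpu)) {
		raw_spin_unlock_irq(&busiest_rq->lock);
		return 0;	/* never push tasks to an offline CPU */
	}
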
Syed Rameez Mustafa 8f7e5b8ee8 sched: Add a per rq max_possible_capacity for use in power calculations
In the absence of a power driver providing real power values, the scheduler
currently defaults to using capacity of a CPU as a measure of power. This,
however, is not a good measure since the capacity of a CPU can change due
to thermal conditions and/or other hardware restrictions. These frequency
restrictions have no effect on the power efficiency of those CPUs.
Introduce max possible capacity of a CPU to track an absolute measure of
capacity which translates into a good absolute measure of power efficiency.
Max possible capacity takes the max possible frequency of CPUs into account
instead of max frequency.

Change-Id: Ia970b853e43a90eb8cc6fd990b5c47fca7e50db8
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:04 -07:00
Syed Rameez Mustafa 16fa06671f sched: Disable interrupts when holding the rq lock in sched_get_busy()
Interrupts can end up waking processes on the same cpu as the one for
which sched_get_busy() is called. Since sched_get_busy() takes the rq
lock this can result in a deadlock as the same rq lock is required to
enqueue the waking up task. Fix the deadlock by disabling interrupts
when taking the rq lock.

Change-Id: I46e14a14789c2fb0ead42363fbaaa0a303a5818f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:03 -07:00
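The deadlock avoidance in sketch form: take rq->lock with interrupts disabled so an interrupt on the same cpu cannot recurse into the lock while enqueueing a waking task. The signature and the field read are assumptions:

unsigned long sched_get_busy(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags;
	u64 busy;

	raw_spin_lock_irqsave(&rq->lock, flags);	/* was raw_spin_lock() */
	busy = rq->prev_runnable_sum;
	raw_spin_unlock_irqrestore(&rq->lock, flags);

	return (unsigned long)busy;
}
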
Srivatsa Vaddagiri 9361844015 sched: Make wallclock more accurate
update_task_ravg() in context switch uses wallclock that is updated
before running put_prev_task() and pick_next_task(), both of which can
take some time. It's better to update the wallclock after those routines,
leading to more accurate accounting.

Change-Id: I882b1f0e8eddd2cc17d42ca2ab8f7a2841b8d89a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:23:03 -07:00
Syed Rameez Mustafa cd0c28ddcc sched: Resolve some 64 bit compilation issues
On 64 bit architectures a pointer is no longer the same size as an
int. Therefore any place that does a conversion from int to a pointer
type gives a compilation error. Resolve these by type casting to long
first which is guaranteed to be the same size as a pointer.

Change-Id: I518ac3c562bd3f85893f91ad6dbcd2f0c7bf081b
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:03 -07:00
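The cast-through-long idiom this refers to, as a standalone example; on LP64 targets long has pointer width, so the round trip is lossless:

#include <stdio.h>

int main(void)
{
	int cpu = 3;
	void *arg = (void *)(long)cpu;	/* int -> long -> pointer: no error */
	int back = (int)(long)arg;	/* pointer -> long -> int */

	printf("%d\n", back);		/* prints 3 */
	return 0;
}
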
Syed Rameez Mustafa 3d972d3af1 sched: Make task and CPU load calculations safe from truncation
Load calculations have been modified to accept and return 64 bit values.
Fix up all the places where we make such calculations to store the result
in 64 bit variables. This is necessary to avoid issues caused by
truncation of values.

While at it update scale_task_load() to scale_load_to_cpu(). This is
because the API is used to scale load of both individual tasks as well as
the cumulative load of CPUs. In this sense the name was a misnomer. Also
clean up power_cost() to use max_task_load().

Change-Id: I51e683e1592a5ea3c4e4b2b06d7a7339a49cce9c
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:03 -07:00
Syed Rameez Mustafa 8eebaa1826 sched/fair: Introduce C-state aware task placement for small tasks
Small tasks execute for small durations. This means that the power
cost of taking CPUs out of a low power mode outweighs any performance
advantage of using an idle core or power advantage of using the most
power efficient CPU. Introduce C-state aware task placement for small
tasks. This requires a two-pass approach where we first determine the
most power efficient CPU and establish a band of CPUs offering a
similar power cost for the task. The order of preference then is as
follows:

1) Any mostly idle CPU in active C-state in the same power band.
2) A CPU with the shallowest C-state in the same power band.
3) A CPU with the least load in the same power band.
4) Lowest power CPU in a higher power band.

The patch also modifies the definition of a small task. Small tasks
are now determined relative to minimum capacity CPUs in the system
and not the task CPU.

Change-Id: Ia09840a5972881cad7ba7bea8fe34c45f909725e
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:02 -07:00
Srivatsa Vaddagiri 34e72241a9 sched: Make the scheduler aware of C-state for cpus
C-state represents a power-state of a cpu. A cpu could have one or
more C-states associated with it. C-state transitions are based on
various factors (expected sleep time for example). "Deeper" C-states
imply longer wakeup latencies.

Scheduler needs to know wakeup latency associated with various C-states.
Having this information allows the scheduler to make better decisions
during task placement. For example:

- Prefer an idle cpu that is in the shallowest C-state
- Avoid waking up small tasks on an idle cpu unless it is in the
  shallowest C-state

This patch introduces APIs in the scheduler that can be used by the
architecture specific power-management driver to inform the scheduler
about C-states for cpus.

Change-Id: I39c5ae6dbace4f8bd96e88f75cd2d72620436dd1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:02 -07:00
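One plausible shape for such an API, purely illustrative since the message does not name the functions:

/* hypothetical: called by the platform idle driver on C-state entry/exit */
void sched_set_cpu_cstate(int cpu, int cstate, int wakeup_latency_us)
{
	struct rq *rq = cpu_rq(cpu);

	rq->cstate = cstate;			/* 0 => cpu is active */
	rq->wakeup_latency = wakeup_latency_us;	/* consulted at task placement */
}
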
Syed Rameez Mustafa 65eab4a6f5 sched/fair: Introduce scheduler boost for low latency workloads
Certain low latency bursty workloads require immediate use of highest
capacity CPUs in HMP systems. Existing load tracking mechanisms may be
unable to respond to the sudden surge in the system load within the
latency requirements. Introduce the scheduler boost feature for such
workloads. While boost is in effect the scheduler bypasses regular load
based task placement and prefers highest capacity CPUs in the system
for all non-small fair sched class tasks. Provide both a kernel and
userspace API for software that may have a priori knowledge about the
system workload.

Change-Id: I783f585d1f8c97219e629d9c54f712318821922f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:02 -07:00
Srivatsa Vaddagiri a6fa50d177 sched: Move call to trace_sched_cpu_load()
select_best_cpu() invokes trace_sched_cpu_load() for all online cpus
in a loop, before it enters the loop for core selection. Moving
the invocation of trace_sched_cpu_load() into the inner core loop is potentially
more efficient.

Change-Id: Iae1c58b26632edf9ec5f5da905c31356eb95c925
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:01 -07:00
Srivatsa Vaddagiri 73bf6e4e2d sched: fair: Reset balance_interval before sending NOHZ kick
balance_interval needs to be reset for any cpu being kicked. Otherwise it
can end up ignoring the kick (i.e. not doing load balance for itself).

Also bypass the check for existence of idle cpus in tickless state for
!CONFIG_SCHED_HMP to allow for more aggressive load balance.

Change-Id: I52365ee7c2997ec09bd93c4e9ae0293a954e39a8
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:01 -07:00
Srivatsa Vaddagiri 5b97613e54 sched: Avoid active migration of small tasks
We currently check the need to migrate the currently running task in
scheduler_tick(). Skip that check for small tasks, as it's not worth
the effort!

Change-Id: Ic205cc6452f42fde6be6b85c3bf06a8542a73eba
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:01 -07:00
Srivatsa Vaddagiri e87706488c sched: Account for cpu's current frequency when calculating its power cost
In estimating the cost of running a task on a given cpu, the cost of the
cpu at its current frequency needs to override the cost at the frequency
demanded by the task, when cur_freq exceeds the task's required
frequency. This is because placing a task on a cpu can only result in an
increase of the cpu's frequency.

Change-Id: I021a3bbaf179bf1ec2c7f4556870256936797eb9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:01 -07:00
Srivatsa Vaddagiri 8e4ffa0d07 sched: make sched_set_window() return failure when PELT is in use
Window-based load tracking is a pre-requisite for the scheduler to
feed cpu load information to the governor. When PELT is in use, return
failure when governor attempts to set window-size. This will let
governor fall back to other APIs for retrieving cpu load statistics.

Change-Id: I0e11188594c1a54b3b7ff55447d30bfed1a01115
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:01 -07:00
Srivatsa Vaddagiri 7d8d1bd095 sched: debug: Print additional information in /proc/sched_debug
Provide information in /proc/sched_debug on min_capacity, max_capacity
and whether pelt or window-based task load statistics are in use.

Change-Id: Ie4e9450652f4c83110dda75be3ead8aa5bb355c3
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:23:00 -07:00
Srivatsa Vaddagiri df5fd251ed sched: Move around code
Move a chunk of code up so it is defined earlier. This helps a
subsequent patch that needs update_min_max_capacity().

Change-Id: I9403c7b4dcc74ba4ef1034327241c81df97b01ea
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:00 -07:00
Srivatsa Vaddagiri af9a2812eb sched: Update capacity of all online cpus when min_max_freq changes
During bootup, it's possible for min_max_freq to change as frequency
information for additional clusters is processed. That would need to
trigger recalculation of capacity/load_scale_factor for all (online)
cpus, as they strongly depend on min_max_freq variable. Not doing so
would imply some cpus will have their capacity/load_scale_factor
computed wrongly.

Change-Id: Iea5a0a517a2d71be24c2c71cdd805c0733ce37f8
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:23:00 -07:00
Srivatsa Vaddagiri 2baa89cd4c sched: update task statistics when CPU frequency changes
A CPU may have its frequency changed by a different CPU. Because of
this, it is not guaranteed that we will update task statistics at
approximately the same time that the frequency change occurs. To
guard against accruing time to a task at the wrong frequency, update
the task's window-based statistics if the CPU it is running on
changes frequency.

Change-Id: I333c3f8aa82676bd2831797b55fd7af9c4225555
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:00 -07:00
Srivatsa Vaddagiri 11262c23d6 sched: Add new trace events
Add trace events for update_task_ravg(), update_history(), and
set_task_cpu(). These tracepoints are useful for monitoring the
per-task and per-runqueue demand statistics.

Change-Id: Ibec9f945074ff31d1fc1a76ae37c40c8fea8cda9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:59 -07:00
Steve Muckle fd3c9f6c53 sched: do not balance on exec if SCHED_HMP
Rebalancing at exec time will currently undo any beneficial placement
that has been done during fork time, since select_best_cpu() will not
discount the currently running task.

For now just skip re-evaluating task placement at exec.

Change-Id: I1e5e0fcc329b7b53c338c8c73795ebd5e85a118b
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-22 14:22:59 -07:00
Srivatsa Vaddagiri 00faa770e7 sched: Use historical load for freq governor input
Historical load maintained per task can be used to influence cpu
frequency better. For example, when a heavy demand task wakes up after
prolonged sleep, we could use the historical load information to alert
cpufreq governor about the need to raise cpu frequency. This patch
changes CPU busy statistics to be aggregation of historical task
demand. Also task's historical load (as defined by
sysctl_sched_window_stats_policy) is add to cpu's busy statistics
(rq->curr_runnable_sum) whenever it executes on a cpu.

Change-Id: I2b66136f138b147ba19083b9b044c4feb20d9b57
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:59 -07:00
Srivatsa Vaddagiri 0352c87d18 sched: window-stats: apply scaling to full elapsed windows
In the event that a full window (or multiple full windows) have
elapsed when updating a task's window-based stats, the runtime of
those windows needs to be scaled based on the CPU frequency. This
is currently missing, causing full windows to be accounted as having
elapsed at maximum frequency, erroneously inflating task demand.

Change-Id: I356b4279d44d4f39c8aea881c04327b70ed66183
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:22:58 -07:00
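A sketch of the frequency scaling involved, with field names assumed: runtime accrued at the current frequency is normalized to the maximum frequency, and the fix is to apply this to fully elapsed windows too, not only the partial one.

static u64 scale_exec_time(u64 delta, struct rq *rq)
{
	/* demand is kept in max-frequency units */
	return div64_u64(delta * rq->cur_freq, rq->max_possible_freq);
}
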
Steve Muckle 80753c7e5e sched: notify cpufreq on over/underprovisioned CPUs
After a migration occurs the source and destination CPUs may
not be running at frequencies which match the new task load on
those CPUs.

Previously, the scheduler was notifying cpufreq anytime a task
greater than a certain size migrates. This is suboptimal however
since this does not take into account the CPU's current
frequency and other task activity that may be present.

Change-Id: I5092bda3a517e1343f97e5a455957c25ee19b549
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-22 14:22:58 -07:00
Syed Rameez Mustafa dbd2db2471 sched: Introduce spill threshold tunables to manage overcommitment
When the number of tasks intended for a cluster exceeds the number of
mostly idle CPUs in that cluster, the scheduler currently freely uses
CPUs in other clusters if possible. While this is optimal for
performance the power trade off can be quite significant. Introduce
spill threshold tunables that govern the extent to which the scheduler
should attempt to contain tasks within a cluster.

Change-Id: I797e6c6b2aa0c3a376dad93758abe1d587663624
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:22:58 -07:00
Steve Muckle dd76a2e00a sched: add affinity, task load information to sched tracepoints
Knowing the affinity mask and CPU usage of a task is helpful
in understanding the behavior of the system. Affinity information
has been added to the enq_deq trace event, and the migration
tracepoint now reports the load of the task migrated.

Change-Id: I29d8a610292b4dfeeb8fe16174e9d4dc196649b7
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:22:58 -07:00
Steve Muckle bb3f4aae22 sched: add migration load change notifier for frequency guidance
When a task moves between CPUs in two different frequency domains
the cpufreq governor may wish to immediately modify the frequency
of both the source and destination CPUs of the migrating task.

A tunable is provided to establish what size task is considered
"significant" enough to warrant notifying cpufreq.

Also fix a bug that would cause load to not be accounted properly
during wakeup migrations.

Change-Id: Ie8f6b1cc4d43a602840dac18590b42a81327c95a
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-22 14:22:57 -07:00
Syed Rameez Mustafa 8968469989 sched/fair: Limit MAX_PINNED_INTERVAL for more frequent load balancing
Should the system get stuck in a state where load balancing is failing
due to all tasks being pinned, deferring load balancing for up to half
a second may cause further performance problems. Eventually all tasks
will not be pinned and load balancing should not be deferred for a great
length of time.

Change-Id: I06f93b5448353b5871645b9274ce4419dc9fae0f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:57 -07:00
Syed Rameez Mustafa 6317c544d8 sched/fair: Help out higher capacity CPUs when they are overcommitted
This comprises of two parts:

If we have a task to schedule, we currently don't consider CPUs where
it will not fit even if they are idle. Instead we choose the previous
CPU which is sub-optimal for performance if an idle CPU is
present. This change introduces tracking of any idle CPUs irrespective
of whether the task fits on them or not. If we don't have a good place
to put the task, prefer the lowest power idle CPU.

The other part involves the load balancer which was unable to move
tasks despite the above mentioned task placement to balance out the
load. The reason is that the load balancer checks the big cluster's
group capacity and determines that it can take twice the amount of
workload as the little cluster. Hence the big cluster does not get
marked as busy. While this behavior is intended under heavily loaded
systems where we want to push more work towards the higher capacity
CPUs, it is sub optimal when we have idle CPUs. Add the ability to
differentiate between the two scenarios when marking a group as
busy. If load_balance is called from a CPU_NOT_IDLE environment use
the group capacity to determine whether the group is busy or
not. For everything else use number of CPUs in the group.

Change-Id: I4e8290639ad1602541a44a80ba4b2804068cac0f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:57 -07:00
Syed Rameez Mustafa 5349ec0ee8 sched/rt: Introduce power aware scheduling for real time tasks
Real Time task scheduling has historically been geared towards performance
with a significant attempt to keep higher priority tasks on the same CPU.
This is not optimal for power since the task CPU may not be the most
power efficient CPU.

Also task movement via select_lowest_rq() gives CPU priority the primary
consideration before looking at CPU topologies to find a CPU closest to
the task CPU in terms of topology. This again is not optimal for power
since the closest CPU may be significantly worse for power than CPUs
further away.

This patch removes any bias for the task CPU. When the lowest priority
CPUs in the system are found we give no consideration to the CPU topology.
Instead we find the lowest power CPU within local_cpu_mask. This takes care
of select_task_rq_rt() and push_task(). The pull model remains unaffected
since we have no room for power optimization there.

Change-Id: I4162ebe2f74be14240e62476f231f9e4a18bd9e8
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:57 -07:00
Steve Muckle ea296243e2 sched: balance power inefficient CPUs with one task
Normally the load balancer does not pay attention to CPUs with one
task since it is not possible to subdivide that load any further
to achieve better balance. With power aware scheduling however it
may be desirable to relocate that one task if the CPU it is currently
executing on is less power efficient than other CPUs in the system.

Change-Id: Idf3f5e22b88048184323513f0052827b884526b6
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:56 -07:00
Steve Muckle 6e6c17d05c sched: check for power inefficient task placement in tick
Although tasks are routed to the most power-efficient CPUs during
task wakeup, a CPU-bound task will not go through this decision point.
Load balancing can help if it is modified to dislodge a single task
from an inefficient CPU. The situation can be further improved if
during the tick, the task placement is checked to see if it is
optimal.

This sort of checking is already being done to ensure proper
task placement in heterogeneous CPU topologies, so checking for
power efficient task placement fits pretty well.

Change-Id: I71e56d406d314702bc26dee1438c0eeda7699027
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:56 -07:00
Steve Muckle 7edb149465 sched: do nohz load balancing in order of power efficiency
The nohz load balancer CPU does load balancing on behalf of all
idle tickless CPUs. In the interest of power efficiency though, we
should do load balancing on the most power efficient idle tickless
CPU first, and then work our way towards the least power efficient
idle tickless CPU. This will help load find its way to the most
power efficient CPUs in the system.

Since when selecting the CPU to balance next it is unknown what
task load would be pulled, a frequency must be assumed in order
to do a comparison of CPU power consumption. The maximum
frequency supported by all CPUs is used for this.

Change-Id: I96c7f4300fde2c677c068dc10fc0e57f763eb9b2
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:56 -07:00
Steve Muckle ac1defc4d3 sched: run idle_balance() on most power-efficient CPU
When a CPU goes idle, it checks to see whether it can pull any load
from other busy CPUs. The CPU going idle may not be the most
power-efficient idle CPU in the system however.

This patch causes the CPU going idle to check to see whether
there is a more power-efficient idle CPU within the same
lowest sched domain. If there is, then it runs the load balancer
on behalf of that CPU instead of itself.

Since it is unknown at this point what task load would be pulled,
a frequency must be assumed for this in order to do a comparison
of CPU power consumption. The maximum frequency supported by all
CPUs is used for this.

Change-Id: I5eedddc1f7d10df58ecd358f37dba563eeecf4fc
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:56 -07:00
Steve Muckle 5ef89ee90e sched: add hook for platform-specific CPU power information
To enable power-aware scheduling, provide a hook/infrastructure
for platforms to communicate CPU power requirements for each
supported CPU frequency. This information is then used to estimate
the cost of running a task on a given CPU.

Currently, an assumption is made that the task will be running
by itself on the CPU. Given the current policy tries to spread
tasks as much as possible this assumption should not be too
far off.

Change-Id: I19f1fa760a0d43222d2880f8aec0508c468b39bb
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:56 -07:00
Steve Muckle bf1afbbbcd sched: add power aware scheduling sysctl
The sched_enable_power_aware sysctl will control whether
or not scheduling decisions are influenced by the power
consumption of individual CPUs.

Change-Id: I312f892cf76a3fccc4ecc8aa6703908b205267f0
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:55 -07:00
Srivatsa Vaddagiri a6e9741047 sched: Extend update_task_ravg() to accept wallclock as argument
This will make it easier to account interrupt time on a cpu,
introduced in a subsequent patch.

Change-Id: I0e1fb5255c280ca374fd255e7fc19d5de9f8b045
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:55 -07:00
Srivatsa Vaddagiri 61490dcfb4 sched: add sched_get_busy, sched_set_window APIs
sched_get_busy() returns the busy time of a cpu during the most
recent completed window.
sched_set_window() will set the window size and align windows across
all CPUs.

Change-Id: Ic53e27f43fd4600109b7b6db979e1c52c7aca103
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:55 -07:00
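A hedged sketch of governor-side use of the two APIs; the signatures, units and helper names (use_idle_time_stats, update_freq_from_load) are assumptions:

	/* align windows across cpus, then sample busy time once per window */
	if (sched_set_window(window_start_ns, window_size_ns)) {
		/* PELT in use or invalid size: fall back to idle-time APIs */
		use_idle_time_stats(policy);
	} else {
		u64 busy_ns = sched_get_busy(cpu);	/* last completed window */
		update_freq_from_load(policy, cpu, busy_ns);
	}
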
Steve Muckle d2cc14b9e2 sched: window-stats: adjust RQ curr, prev sums on task migration
Adjust cpu's busy time in its recent and previous window upon task
migration. This would enable scheduler to provide better inputs to
cpufreq governor on a cpu's busy time in a given window.

Change-Id: Idec2ca459382e9f46d882da3af53148412d631c6
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:54 -07:00
Steve Muckle d7b56a170f sched: window-stats: Add aggregated runqueue windowed stats
Add per-cpu counters to track each cpu's busy time in the latest window
and the one previous to that. This is needed to track accurate busy time
per-cpu that accounts for migrations. Basically, once a task migrates,
its execution time in the current window is migrated as well to the new
cpu.

The idle task's runtime is not accounted since it should not count
towards runqueue busy time.

Change-Id: I4014dd686f95dbbfaa4274269bc36ed716573421
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:54 -07:00
Srivatsa Vaddagiri af4d1578b9 sched: window-stats: add prev_window counter per-task
Currently windows where tasks had no execution time are ignored.
However accurate accounting of cpu busy time that factors in migration
would need to know actual utilization of a task in the window previous
to the latest one. This would help scheduler guide cpufreq governor on
busy time per-cpu that is not subject to migration induced errors.

Change-Id: I5841b1732c83e83d69002139de3bdb93333ce347
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:54 -07:00
Srivatsa Vaddagiri 975dbc9783 sched: window-stats: synchronize windows across cpus
Synchronizing windows across cpus for task load measurements
simplifies cpu busy time accounting during migrations. For task
migrations, its usage in current window can be carried over to its new
cpu. This lets cpufreq governor see a correct picture of cpu busy time
that is not affected by migrations.

This patch lines up windows across cpus. One of the cpus, sync_cpu,
serves as a reference for all others. During bootup sync_cpu would
initialize its window_start (from its sched_clock()). Other cpus will
synchronize their window_start in reference to sync_cpu. This patch
assumes synchronous sched_clock() across cpus and may need some change
to address architectures which do not provide such synchronized
sched_clock().

Change-Id: I13381389a72f5f9f85cc2446401d493a55c78ab7
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:54 -07:00
Srivatsa Vaddagiri bf90a4be22 sched: window-stats: Do not account wait time
Task load statistics are used for two purposes : cpu frequency
management and placement. Task's load can't be accurately judged by
its wait time. For ex: a task could have waited for 10ms and when given
opportunity to run, could just execute for 1ms. Accounting for 11ms as
task's demand could be over-stating its needs in this example. For
now, remove wait time from task demand and instead let task load be
derived from its actual exec time. This may need to become a tunable
feature.

Change-Id: I47e94c444c6b44c3b0810347287d50f1ee685038
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:53 -07:00
Srivatsa Vaddagiri bd020d066f sched: window-stats: update during migration and earlier at wakeup
During migrations accounting needs to be done in set_task_cpu() to
subtract the task activity from the source CPU and add it to the
destination CPU. This accounting will require that the task's window
based load statistics be up to date.

Unfortunately, the window-based statistics cannot always be updated in
set_task_cpu() because they are already being updated in the wakeup
path. We cannot update the statistics solely in the wakeup path
because not all wakeups are migrations. Those non-migrating wakeups
will not enter set_task_cpu().

To ensure the window-based stats are always updated for both wakeup
migrations and regular migrations, they are updated earlier in the
wakeup path, and also updated in set_task_cpu if the task is already
runnable (this ensures it is not a wakeup migration, but a regular
migration).

Change-Id: Ib246028741d0be9bb38ce93679d6e6ba25b10756
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:53 -07:00
Srivatsa Vaddagiri f0ad6a880a sched: move definition of update_task_ravg()
set_task_cpu() will need to call update_task_ravg(). Move its
definition up to make that possible.

Change-Id: I95c1c9e009bd1805f28708e8d6fd3b7b2166410e
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:35 -07:00
Srivatsa Vaddagiri c497209dd6 sched: Switch to windows based load stats by default
Make window-based load stats the default mechanism for classifying
tasks (as big/small) and for driving per-task frequency demand. The
sched_ravg_window kernel parameter can be used to revert this default
and use the PELT (per-entity load tracking) scheme instead.

Change-Id: I626110daa0bb2b53172bedea829d31877255ceaa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:35 -07:00
Srivatsa Vaddagiri 876ec3885e sched: Provide tunable to switch between PELT and window-based stats
Provide a runtime tunable to switch between using PELT-based load
stats and window-based load stats. This will be needed for runtime
analysis of the two load tracking schemes.

Change-Id: I018f6a90b49844bf2c4e5666912621d87acc7217
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:35 -07:00
Srivatsa Vaddagiri 8d825e797f sched: Reset decay_count when task is enqueued
A non-zero, positive decay_count indicates the time when a task went
to sleep and thus was removed from its cfs_rq.
cfs_rq->blocked_load_avg tracks the load of such "blocked" tasks.
cfs_rq->blocked_load_avg is decayed over time and in turn signifies
the decay of the blocked tasks' load. cfs_rq->decay_counter represents
the time when blocked_load_avg was last decayed. It is derived from
rq->clock_task, which can differ between cpus. When tasks go to sleep,
their decay_count is set to cfs_rq->decay_counter.

When a task wakes up from sleep, its (newly decayed) load_avg needs to
be removed from cfs_rq->blocked_load_avg (as the task is no longer
blocked). The amount of decay for the task's load_avg is defined by
its sleep time, roughly derived as (cfs_rq->decay_counter -
se->decay_count). This is accomplished in __synchronize_entity_decay().

Once the task's load_avg has been decayed and subtracted from
cfs_rq->blocked_load_avg, decay_count should be reset to 0 to indicate
that the task is no longer sleeping and that its load_avg has been
synchronized with the decay of blocked_load_avg. A zero decay_count
thus signifies a task that is on the runqueue, with its load_avg
decayed and synchronized with cfs_rq->blocked_load_avg.

A negative decay_count, on the other hand, signifies a task that is
being migrated across cpus during wakeup. Let's say a task went to
sleep on CPU0 and is waking up on CPU1. In this case, the task's
load_avg needs to be decayed first (over its sleep time, derived as
cfs_rq0->decay_counter - se->decay_count), then subtracted from
cfs_rq0->blocked_load_avg, and finally the task's load metrics
(runnable_avg_sum) need to be decayed over its sleep time. Since the
task's sleep time is deduced from (rq->clock_task -
se->avg.last_runnable_update), and se->avg.last_runnable_update is in
reference to CPU0's clock_task, it would be inappropriate to deduce
the task's sleep time from CPU1's rq->clock_task. Thus, when a task is
migrated to a different cpu at wakeup time, its decay_count is set to
the negative sleep time, derived as
-(CPU0 cfs_rq->decay_counter - se->avg.decay_count). This information
is used during enqueue of the task on CPU1 to adjust its
se->avg.last_runnable_update to (CPU1 rq->clock_task -
(-se->avg.decay_count)). This lets the task's runnable_avg_sum be
decayed correctly over its sleep time by referencing CPU1's
rq->clock_task and the task's se->avg.last_runnable_update.

The bug that currently exists is when a task wakes up from a "short"
sleep (a couple of ms) on the same cpu where it last ran and is
subsequently migrated.

t0 -> task A went to sleep on cpu0.
	 A->se.avg.decay_count = cpu0 cfs_rq->decay_counter = t0

t1 -> task A woke up from sleep. cpu0's cfs_rq->decay_counter is still
      t0. Because of this, __synchronize_entity_decay() does nothing.
      It also returns *without* resetting the task's decay_count.

t2 -> CPU0's blocked_load_avg is decayed. cfs_rq->decay_counter = t2

t3 -> Task A is migrated from CPU0 to CPU1. migrate_task_rq_fair()
      assumes this is a migration during wakeup, as A's decay_count is
      non-zero and positive. It then deduces the task's sleep time as
      (t2-t0) and decays its load_avg over that interval, setting the
      task's decay_count to -(t2-t0). When the task is later enqueued
      on CPU1, its load metrics (runnable_avg_sum) are decayed to
      account for a "sleep" interval of (t2-t0), which is *wrong* and
      results in inaccurate load information for the task.

The fix is to have __synchronize_entity_decay() reset decay_count even
when it deduces zero sleep time for the task.
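
A sketch of the fix, assuming the 3.10-era shape of
__synchronize_entity_decay(); the key point is clearing decay_count
before the early return:

  static inline u64 __synchronize_entity_decay(struct sched_entity *se)
  {
          struct cfs_rq *cfs_rq = cfs_rq_of(se);
          u64 decays = atomic64_read(&cfs_rq->decay_counter);

          decays -= se->avg.decay_count;
          se->avg.decay_count = 0;   /* reset even for zero sleep time */
          if (!decays)
                  return 0;

          se->avg.load_avg_contrib =
                  decay_load(se->avg.load_avg_contrib, decays);
          return decays;
  }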

Change-Id: I1016ecb148d62ff15ed698a5cca1a06afb73151f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:34 -07:00
Srivatsa Vaddagiri 13ffc9114f sched: Provide scaled load information for tasks in /proc
Extend "sched" file in /proc for every task to provide information on
scaled load statistics and percentage-scaled based load (load_avg) for
a task. This will be valuable debug aid.

Change-Id: I6ee0394b409c77c7f79f5b9ac560da03dc879758
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:34 -07:00
Srivatsa Vaddagiri e26b419bbb sched: Add additional ftrace events
This patch adds two ftrace events:

sched_task_load -> records information of a task, such as scaled demand
sched_cpu_load  -> records information of a cpu, such as nr_running,
		   nr_big_tasks etc

This will be useful for debugging HMP-related task placement decisions
by the scheduler.

Change-Id: If91587149bcd9bed157b5d2bfdecc3c3bf6652ff
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:34 -07:00
Srivatsa Vaddagiri 6f72d3d5a5 sched: Extend /proc/sched_debug with additional information
Provide additional information in /proc/sched_debug for every cpu.
This will be a valuable debug aid.

Change-Id: If22ee530e880cd21719242be7bc2c41308ad4186
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:33 -07:00
Srivatsa Vaddagiri 091aa72394 sched: Disable ARCH_POWER feature
Now that the scheduler can recognize differences between cpus with
regard to their instructions-per-cycle and/or maximum-frequency
capabilities, it can calculate the cpu_power of cpus taking those
differences into account. In other words, we no longer need to rely on
the ARCH_POWER hook to update cpu_power.

Change-Id: Ie810b6ecc1a746b2ab5d498d6d026d1eb88f959a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:33 -07:00
Srivatsa Vaddagiri fb2a169cd5 sched: Tighten controls for tasks spillover to idle cluster
Several conditions can cause an idle cluster to pick up load from a
busy cluster. One such condition is when the busy cluster has more
tasks than its capacity (its number of cpus). This patch extends that
condition to consider the small and big tasks on a cluster. Too many
"small" tasks should not cause them to spill over to another idle
cluster. Likewise, the presence of big tasks should be considered by a
cluster when deciding to pick up load from another cluster with lower
capacity.

Change-Id: I0545bf2989c37217d84ed18756c6f5c8946d5ae5
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:33 -07:00
Srivatsa Vaddagiri f68ee0c4ca sched: Track number of big and small tasks on a cpu
This patch adds 'nr_big_tasks' and 'nr_small_tasks' per-cpu counters
that track the number of big and small tasks on a cpu, respectively.
These will be used in load balance decisions introduced in a
subsequent patch.

Change-Id: Ia174904140f81dd6d1946286889a50be3f16ea83
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:33 -07:00
Srivatsa Vaddagiri 0ef740001d sched: Handle cpu-bound tasks stuck on wrong cpu
CPU-bound tasks that don't sleep for long intervals can stay stuck on
the wrong cpu, as the selection of the "ideal" cpu for a task largely
happens at task wakeup time. This patch adds a check in the scheduler
tick for a task/cpu mismatch (big task on little cpu OR little task on
big cpu) and forces migration of such tasks to their ideal cpu (via
select_best_cpu()).
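
A sketch of the tick-time check; select_best_cpu() is from this
series, while the other names here are illustrative assumptions:

  static void check_for_migration(struct rq *rq, struct task_struct *p)
  {
          int cpu = cpu_of(rq);
          int best_cpu = select_best_cpu(p, cpu);

          if (best_cpu != cpu)
                  /* push p to its ideal cpu via active balance */
                  force_task_migration(rq, p, best_cpu);
  }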

Change-Id: Icac3485b6aa4b558c4ed9df23c2e81fb8f4bb9d9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:32 -07:00
Srivatsa Vaddagiri 3774d2d6ef sched: Extend active balance to accept 'push_task' argument
Active balance currently picks one task to migrate from a busy cpu to
a chosen cpu (push_cpu). This patch extends active load balance to
recognize a particular task ('push_task') that needs to be migrated to
'push_cpu'. This capability will be leveraged by HMP-aware task
placement in a subsequent patch.

Change-Id: If31320111e6cc7044e617b5c3fd6d8e0c0e16952
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:32 -07:00
Srivatsa Vaddagiri 1904058254 sched: Send NOHZ kick to idle cpu in same cluster
A busy cpu will kick (via IPI) one of the idle cpus in tickless state
to run load balance and help move tasks off itself. The cpu chosen to
receive the kick is simply the "first" idle cpu found in
nohz.idle_cpus_mask. This can cause unnecessary wakeups of a cluster.
A better choice is to look for an idle cpu in the same cluster as the
busy cpu, which should minimize cluster wakeups.
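
A sketch of the preferred selection, assuming the cluster can be
approximated by the core-sibling topology mask (names illustrative):

  static inline int find_new_ilb(int call_cpu)
  {
          int ilb;

          /* prefer an idle tickless cpu in the caller's own cluster */
          for_each_cpu_and(ilb, nohz.idle_cpus_mask,
                           topology_core_cpumask(call_cpu))
                  if (idle_cpu(ilb))
                          return ilb;

          ilb = cpumask_first(nohz.idle_cpus_mask);
          if (ilb < nr_cpu_ids && idle_cpu(ilb))
                  return ilb;

          return nr_cpu_ids;
  }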

Change-Id: Ia63038d7c34b416b53c8feef3c3b31dab5200e42
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:32 -07:00
Srivatsa Vaddagiri 259de62b7f sched: Basic task placement support for HMP systems
HMP systems have cpus with different power and performance
characteristics. Some cpus offer better power at the cost of lower
performance, while others offer better performance at the cost of
higher power. As a result, the bandwidth consumed by a task to do a
"fixed" amount of work can vary across cpus.

Optimal task placement on HMP would involve placing a task on a cpu
where it can meet its performance goals at the lowest power cost.
Since the kernel has little to no awareness of the performance goals
of applications, we guesstimate whether a task is meeting its
performance goals by looking at its cpu bandwidth consumption. High
bandwidth consumption could imply that a task's performance can
improve by running on cpus with better capacity/performance
characteristics.

This patch makes the basic changes to support HMP. It provides a
configurable threshold and any task consuming bandwidth in excess of
threshold will be placed on a cpu with better capacity.

Change-Id: I3fd98edd430f73342fbef06411e8b2d1cf2f56fa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:32 -07:00
Srivatsa Vaddagiri 0951ec0ff1 sched: Influence cpu_power based on max_freq and efficiency
Update a cpu's cpu_power metric by accounting for its efficiency and
max_freq factors. cpu_power is defined such that the "least"
performing cpu (the one with the lowest efficiency and max_freq
factor) gets a cpu_power of 1024. Note that no single CPU may have
both the lowest efficiency and the lowest max_freq; all CPUs still get
cpu_power values relative to that combination, which is assigned 1024.

cpu_power differs from a cpu's capacity metric in that it also
accounts for real-time task activity.
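
Roughly, as a sketch (min_possible_efficiency is an assumed global
tracking the lowest efficiency across cpus):

  /* the weakest cpu (lowest efficiency * max_freq) maps to 1024 */
  static unsigned long compute_cpu_power(struct rq *rq)
  {
          u64 power = 1024;

          power *= rq->efficiency;
          do_div(power, min_possible_efficiency);

          power *= rq->max_freq;
          do_div(power, min_max_freq);

          return power;
  }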

Change-Id: I255d3ad80d8bc8237b9ffb8f6e7c0dc4c44ec10f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:31 -07:00
Srivatsa Vaddagiri 03163c31f2 sched: Use rq->efficiency in scaling load stats
Extend the task load scaling function to account for the cpu
efficiency factor. Task load is scaled in reference to the "most"
efficient cpu.

Change-Id: I7bf829211a6e1293076e8ba0f93b4f6abcf20d92
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:31 -07:00
Srivatsa Vaddagiri 28445566f5 sched: Introduce efficiency, load_scale_factor and capacity
Efficiency reflects instructions per cycle capability of a cpu.

load_scale_factor reflects the magnification factor applied to task
load when estimating the bandwidth it will consume on a cpu. It
accounts for the fact that task load is scaled in reference to the
"best" cpu, the one with the best efficiency factor and the best
possible max_freq. Note that there may be no single CPU in the system
that has both, but that is still the combination against which all
task load in the system is scaled.

capacity reflects the max_freq and efficiency metrics of a cpu. It is
defined such that the "least" performing cpu (the one with the lowest
efficiency factor and max_freq) gets a capacity of 1024. Again, there
may be no CPU in the system that has both the lowest efficiency and
the lowest max_freq; that combination is still assigned a capacity of
1024, and other CPU capacities are relative to it.
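
As a sketch, with the min_/max_ globals assumed to track the extremes
across cpus:

  static void update_cpu_metrics(struct rq *rq)
  {
          /* weakest (efficiency, max_freq) combination maps to 1024 */
          rq->capacity =
                  div64_u64((u64)1024 * rq->efficiency * rq->max_freq,
                            (u64)min_possible_efficiency * min_max_freq);

          /* magnify task load relative to the best possible cpu */
          rq->load_scale_factor =
                  div64_u64((u64)1024 * max_possible_efficiency *
                                        max_possible_freq,
                            (u64)rq->efficiency * rq->max_freq);
  }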

Change-Id: I4a853f1f0f90020721d2a4ee8b10db3d226b287c
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:31 -07:00
Srivatsa Vaddagiri c1f998027e sched: Add CONFIG_SCHED_HMP Kconfig option
Add a compile-time flag to enable or disable scheduler features for
HMP (heterogeneous multi-processor) systems. The main feature deals
with optimizing task placement for the best power/performance
tradeoff.

Also extend features currently dependent on CONFIG_SCHED_FREQ_INPUT so
that they are enabled for CONFIG_SCHED_HMP as well.

Change-Id: I03b3942709a80cc19f7b934a8089e1d84c14d72d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:31 -07:00
Srivatsa Vaddagiri 94fd14203c sched: Add scaled task load statistics
Scheduler-guided frequency selection as well as task placement on
heterogeneous systems require scaled task load statistics. This patch
adds a per-task 'runnable_avg_sum_scaled' metric that is a scaled
derivative of 'runnable_avg_sum'. Load is scaled in reference to the
"best" cpu, i.e. the one with the best possible max_freq.

Change-Id: Ie8ae450d0b02753e9927fb769aee734c6d33190f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:30 -07:00
Srivatsa Vaddagiri dc2e1a4383 sched: Introduce CONFIG_SCHED_FREQ_INPUT
Introduce a compile time flag to enable scheduler guidance of
frequency selection. This flag is also used to turn on or off
window-based load stats feature.

Having a compile time flag will let some platforms avoid any
overhead that may be present with this scheduler feature.

Change-Id: Id8dec9839f90dcac82f58ef7e2bd0ccd0b6bd16c
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:30 -07:00
Srivatsa Vaddagiri b97dbfcb20 sched: window-based load stats improvements
The following cleanups and improvements are made to the window-based
load stats feature:

* Add a sysctl to pick the max, avg or most recent samples as a
  task's demand.

* Fix an overflow possibility in the calculation of the sum for the
  average policy.

* Use unscaled statistics when a task is running on a CPU that is
  thermally throttled.
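
For the overflow fix, a sketch (names assumed): accumulate the history
in a 64-bit sum before dividing.

  static u32 avg_task_demand(struct task_struct *p)
  {
          u64 sum = 0;    /* 64-bit accumulator cannot overflow here */
          int i;

          for (i = 0; i < RAVG_HIST_SIZE; i++)
                  sum += p->ravg.sum_history[i];

          return div64_u64(sum, RAVG_HIST_SIZE);
  }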

Change-Id: I8293565ca0c2a785dadf8adb6c67f579a445ed29
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:30 -07:00
Srivatsa Vaddagiri dc5a92d784 sched: Add min_max_freq and rq->max_possible_freq
rq->max_possible_freq represents the maximum frequency a cpu is
capable of attaining, while rq->max_freq represents the maximum
frequency a cpu can attain at a given instant. rq->max_freq includes
constraints imposed by the user or the thermal driver, so
rq->max_freq <= rq->max_possible_freq.

max_possible_freq is derived as max(rq->max_possible_freq) over all
cpus and represents the "best" cpu, the one that can attain the best
possible frequency.

min_max_freq is derived as min(rq->max_possible_freq) over all cpus.
For homogeneous systems, max_possible_freq and min_max_freq will be
the same, while they can differ on heterogeneous systems.

Change-Id: Iec485fde35cfd33f55ebf2c2dce4864faa2083c5
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:29 -07:00
Steve Muckle ce707ce509 sched: move task load based functions
The task load based functions will need to make use of LOAD_AVG_MAX
in a subsequent patch, so move them below the definition of that
macro.

Change-Id: I02f18ba069b81033e611f8f8bba6dccd7cd81252
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-22 14:20:29 -07:00
Steve Muckle a0cdd5ba65 sched: fix race between try_to_wake_up() and move_task()
Until a task's state has been seen as interruptible/uninterruptible
and it is no longer on_cpu, it is possible that the task may move
to another CPU (load balancing may cause this). Here is an example
where the race condition results in incorrect operation:

- cpu 0 calls put_prev_task on task A, task A's state is TASK_RUNNING
- cpu 0 runs task B, which attempts to wake up A
- cpu 0 begins try_to_wake_up(), recording src_cpu for task A as cpu 0
- cpu 1 then pulls task A (perhaps due to idle balance)
- cpu 1 runs task A, which then sleeps, becoming INTERRUPTIBLE
- cpu 0 continues in try_to_wake_up(), thinking task A's previous
  cpu is 0, where it is actually 1
- if select_task_rq returns cpu 0, task A will be woken up on cpu 0
  without properly updating its cpu to 0 in set_task_cpu()

CRs-Fixed: 665958
Change-Id: Icee004cb320bd8edfc772d9f74e670a9d4978a99
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-16 16:34:08 -07:00
Matt Wagantall 55896a97a2 sched/rt: Add Kconfig option to enable panicking for RT throttling
This may be useful for detecting and debugging RT throttling issues.

Change-Id: I5807a897d11997d76421c1fcaa2918aad988c6c9
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
2014-07-14 09:39:02 -07:00
Matt Wagantall 2cf27b65b9 sched/rt: print RT tasks when RT throttling is activated
Existing debug prints do not provide any clues about which tasks
may have triggered RT throttling. Print the names and PIDs of
all tasks on the throttled rt_rq to help narrow down the source
of the problem.

Change-Id: I180534c8a647254ed38e89d0c981a8f8bccd741c
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
2014-07-14 09:37:47 -07:00
Lai Jiangshan b9bf68a30e sched: Fix hotplug vs. set_cpus_allowed_ptr()
Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b55 ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.

CRs-Fixed: 680496
Change-Id: I89ac9b6829acf200072975bc7d028a469167f083
Fixes: 5fbd036b55 ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 24d52daafc
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
2014-06-25 17:21:11 -07:00
Neil Zhang 22c2abbe55 sched: Remove redundant update_runtime notifier
migration_call() will do all the things that update_runtime() does.
So let's remove it.

Furthermore, there is a potential risk that the current code will hit
the BUG_ON at line 689 of rt.c when doing cpu hotplug while realtime
threads are running, because runtime is enabled twice while rt_runtime
may have already changed.

Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
Signed-off-by: Neil Zhang <zhangwm@marvell.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1365685499-26515-1-git-send-email-zhangwm@marvell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
2014-06-23 16:06:28 -07:00
Lai Jiangshan 24d52daafc sched: Fix hotplug vs. set_cpus_allowed_ptr()
commit 6acbfb96976fc3350e30d964acb1dbbdf876d55e upstream.

Lai found that:

  WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
  ...
  migration_cpu_stop+0x1d/0x22

was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.

This isn't true since 5fbd036b55 ("sched: Cleanup cpu_active madness").

So set active and online at the same time to avoid this particular
problem.

Fixes: 5fbd036b55 ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11 12:03:24 -07:00
Thomas Gleixner ecaf19a768 sched: Sanitize irq accounting madness
commit 2d513868e2a33e1d5315490ef4c861ee65babd65 upstream.

Russell reported, that irqtime_account_idle_ticks() takes ages due to:

       for (i = 0; i < ticks; i++)
               irqtime_account_process_tick(current, 0, rq);

It's sad that this code was written way _AFTER_ the NOHZ idle
functionality was available. I charge myself guilty for not paying
attention when that crap got merged with commit abb74cefa ("sched:
Export ns irqtimes through /proc/stat").

So instead of looping nr_ticks times just apply the whole thing at
once.
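
Roughly, irqtime_account_process_tick() gains a ticks argument and
scales the accounted cputime by it, so the idle path becomes:

  static void irqtime_account_idle_ticks(int ticks)
  {
          struct rq *rq = this_rq();

          /* apply all ticks in one call instead of looping */
          irqtime_account_process_tick(current, 0, rq, ticks);
  }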

As a side note: The whole cputime_t vs. u64 business in that context
wants to be cleaned up as well. There is no point in having all these
back and forth conversions. Lets standardise on u64 nsec for all
kernel internal accounting and be done with it. Everything else does
not make sense at all for fine grained accounting. Frederic, can you
please take care of that?

Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Shaun Ruffell <sruffell@digium.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1405022307000.6261@ionos.tec.linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11 12:03:21 -07:00
Steven Rostedt (Red Hat) e58d78f941 sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check
commit 6227cb00cc120f9a43ce8313bb0475ddabcb7d01 upstream.

The check at the beginning of cpupri_find() makes sure that the task_pri
variable does not exceed the cp->pri_to_cpu array length. But that length
is CPUPRI_NR_PRIORITIES not MAX_RT_PRIO, where it will miss the last two
priorities in that array.

As task_pri is computed from convert_prio() which should never be bigger
than CPUPRI_NR_PRIORITIES, if the check should cause a panic if it is
hit.
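
The resulting check, roughly:

  int task_pri = convert_prio(p->prio);

  /* convert_prio() never returns >= CPUPRI_NR_PRIORITIES */
  BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);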

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1397015410.5212.13.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11 12:03:21 -07:00
Linux Build Service Account 8af337cd92 Merge "Merge upstream arm64 scheduler bug fixes into msm-3.10" 2014-06-06 06:05:42 -07:00
Peter Zijlstra ca454d54c4 sched: Fix up scheduler syscall LTP fails
Wu reported LTP failures:

  > ltp.sched_setparam02.1.TFAIL
  > ltp.sched_setparam02.2.TFAIL
  > ltp.sched_setparam02.3.TFAIL
  > ltp.sched_setparam03.1.TFAIL

There were two things wrong: firstly, __setscheduler() failed on
sched_setparam()'s policy = -1; fix that by reading from p->policy in
that case.

Secondly, getparam() (and getattr()) would still report a non-zero
sched_priority for tasks that are no longer FIFO/RR. So unconditionally
set p->rt_priority.
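
A sketch of the fixed __setscheduler(), simplified down to the two
changes described above:

  static void __setscheduler(struct rq *rq, struct task_struct *p,
                             const struct sched_attr *attr)
  {
          int policy = attr->sched_policy;

          if (policy == -1)   /* sched_setparam(): keep current policy */
                  policy = p->policy;

          p->policy = policy;

          /*
           * Unconditional, so getparam()/getattr() report 0 for
           * tasks that are no longer FIFO/RR.
           */
          p->rt_priority = attr->sched_priority;
  }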

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Fixes: d50dde5a10f3 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
Link: http://lkml.kernel.org/r/20140115153320.GH31570@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 39fd8fd22b3224ec6819d33b3e34ae4da6a35f05
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[imaund@codeaurora.org: Resolve merge conflicts. We have not yet pulled
  DEADLINE support.]
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-05-28 14:51:32 -07:00
Peter Zijlstra c9c3cf0d37 sched: Add 'flags' argument to sched_{set,get}attr() syscalls
Because of a recent syscall design debate, it is deemed appropriate
for each syscall to have a flags argument for future extension,
without immediately requiring new syscalls.

Cc: juri.lelli@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Git-commit: 6d35ab48090b10c5ea5604ed5d6e91f302dc6060
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-05-28 14:51:31 -07:00
Dario Faggioli a6daf19224 sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).

In general, this makes it possible to specify a periodic/sporadic
task that executes for a given amount of runtime at each instance and
is scheduled according to the urgency of its own timing constraints,
i.e.:

 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:

 - defines the new struct sched_attr, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;

 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setattr() and sched_getattr().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.
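
For reference, the new structure looks roughly like this:

  struct sched_attr {
          u32 size;          /* for forward/backward ABI compatibility */

          u32 sched_policy;
          u64 sched_flags;

          /* SCHED_NORMAL, SCHED_BATCH */
          s32 sched_nice;

          /* SCHED_FIFO, SCHED_RR */
          u32 sched_priority;

          /* remaining fields are for SCHED_DEADLINE-style policies */
          u64 sched_runtime;
          u64 sched_deadline;
          u64 sched_period;
  };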

Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: d50dde5a10f305253cbc3855307f608f8a3c5f73
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-05-28 14:51:30 -07:00
Srivatsa Vaddagiri 49de610858 sched: Skip load update for idle task
Load statistics for idle tasks are not useful in any manner. Skip the
load update for such idle tasks.

CRs-Fixed: 665706
Change-Id: If3a908bad7fbb42dcb3d0a1d073a3750cf32fcf9
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-05-21 11:33:58 -07:00
Srivatsa Vaddagiri 1b0127500f sched: window-stats: Fix overflow bug
A multiplication overflow is possible in update_task_ravg() when
updating a task's window_start, which would lead to incorrect
accounting of task load. Fix the issue by using 64-bit arithmetic.

CRs-Fixed: 665706
Change-Id: I92651c41efa6121bb8fe102e495ae956127b237a
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-05-21 11:33:53 -07:00
Linux Build Service Account 87988370c7 Merge "sched: Remove extra put_online_cpus() inside sched_setaffinity()" 2014-05-13 20:27:30 -07:00
Linux Build Service Account 31a1467d9f Merge "sched: Remove get_online_cpus() usage" 2014-05-13 20:27:29 -07:00
Linux Build Service Account d0f48d9389 Merge "sched: Window-based load stat improvements" 2014-05-09 21:13:24 -07:00
Srivatsa Vaddagiri e407865cae sched: Window-based load stat improvements
Some tasks can have a sporadic load pattern such that they suddenly
start running for longer intervals after running for shorter
durations. To recognize such a sharp increase in a task's demand, the
max of the average of 5 window load samples and the most recent sample
is chosen as the task demand.

Make the window size (sched_ravg_window) configurable at boot time. To
prevent users from setting inappropriate values for the window size,
min and max limits are defined. As the 'ravg' struct tracks load for
both real-time and non-real-time tasks, it is moved out of the
sched_entity struct.

In order to avoid changing the function signatures of move_tasks() and
move_one_task(), per-cpu variables are defined to track the total load
moved. In case multiple tasks are selected to migrate in one load
balance operation, loads > 100 could be sent through migration
notifiers. Prevent this scenario by capping mnd.load at 100 in such
cases.

Define wrapper functions to compute cpu demands for tasks and to change
rq->cumulative_runnable_avg.
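
The demand choice itself, as a sketch (helper names assumed):

  /* max of the history average and the latest window lets a sudden
     burst raise a task's demand immediately */
  p->ravg.demand = max(window_avg_load(p), window_recent_load(p));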

Change-Id: I9abfbf3b5fe23ae615a6acd3db9580cfdeb515b4
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
2014-05-07 09:24:45 -07:00
Kaushal Kumar 7c38f3c80a sched: Remove extra put_online_cpus() inside sched_setaffinity()
Commit 6acce3ef8:

	sched: Remove get_online_cpus() usage

has left one extra put_online_cpus() inside sched_setaffinity(),
remove it to fix the WARN:

   ------------[ cut here ]------------
   WARNING: CPU: 0 PID: 3166 at kernel/cpu.c:84 put_online_cpus+0x43/0x70()
   ...
   [<ffffffff810c3fef>] put_online_cpus+0x43/0x70 [
   [<ffffffff810efd59>] sched_setaffinity+0x7d/0x1f9 [
   ...

CRs-fixed: 647141
Change-Id: I33f799f30a963db3e9459832832e9c786931c8c2
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/526DD0EE.1090309@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: ac9ff7997b6f2b31949dcd2495ac671fd9ddc990
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Kaushal Kumar <kaushalk@codeaurora.org>
2014-05-07 14:37:06 +05:30
Kaushal Kumar c4575f83b9 sched: Remove get_online_cpus() usage
Remove get_online_cpus() usage from the scheduler; there are 4 sites
that use it:

 - sched_init_smp(); where its completely superfluous since we're in
   'early' boot and there simply cannot be any hotplugging.

 - sched_getaffinity(); we already take a raw spinlock to protect the
   task cpus_allowed mask, this disables preemption and therefore
   also stabilizes cpu_online_mask as that's modified using
   stop_machine. However switch to active mask for symmetry with
   sched_setaffinity()/set_cpus_allowed_ptr(). We guarantee active
   mask stability by inserting sync_rcu/sched() into _cpu_down.

 - sched_setaffinity(); we don't appear to need get_online_cpus()
   either, there's two sites where hotplug appears relevant:
    * cpuset_cpus_allowed(); for the !cpuset case we use possible_mask,
      for the cpuset case we hold task_lock, which is a spinlock and
      thus for mainline disables preemption (might cause pain on RT).
    * set_cpus_allowed_ptr(); Holds all scheduler locks and thus has
      preemption properly disabled; also it already deals with hotplug
      races explicitly where it releases them.

 - migrate_swap(); we can make stop_two_cpus() do the heavy lifting for
   us with a little trickery. By adding a sync_sched/rcu() after the
   CPU_DOWN_PREPARE notifier we can provide preempt/rcu guarantees for
   cpu_active_mask. Use these to validate that both our cpus are active
   when queueing the stop work before we queue the stop_machine works
   for take_cpu_down().

CRs-fixed: 647141
Change-Id: Id41e66659574f716de0e7c29f477e56a86db9404
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20131011123820.GV3081@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 6acce3ef84520537f8a09a12c9ddbe814a584dd2
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[kaushalk@codeaurora.org: get_online_cpus has only 3 sites of usage in
 kernel/sched/core.c of msm-3.10 so migrate_swap changes are not
 applicable here. stop_two_cpus related change is not applicable to
 msm-3.10, so skip it.]
Signed-off-by: Kaushal Kumar <kaushalk@codeaurora.org>
2014-05-07 14:33:26 +05:30
Ian Maund 7f6f84a1a9 Revert "sched: Add new scheduler syscalls to support an extended scheduling parameters ABI"
This reverts commit 50f8b04cda. This commit depends on a commit which
is causing audio regressions. Until its dependency can be brought in,
it should be removed.

Change-Id: Ia6c00e7744a5d4e562299ddb4cabe4fbab65e9bc
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-05-02 12:20:45 -07:00
Ian Maund 4586947944 Revert "sched: Add 'flags' argument to sched_{set,get}attr() syscalls"
This reverts commit 088af0e974. Including this commit causes crashes
during audio playback. Investigation is underway as to the root cause,
but until the issue is fully understood this commit should be
excluded.

Change-Id: I3f6289d6249de5f7080f3fefba6a17a5f569eeac
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-05-01 17:02:43 -07:00
Ian Maund 356fb13538 Merge upstream linux-stable v3.10.36 into msm-3.10
* commit 'v3.10.36': (494 commits)
  Linux 3.10.36
  netfilter: nf_conntrack_dccp: fix skb_header_pointer API usages
  mm: close PageTail race
  net: mvneta: rename MVNETA_GMAC2_PSC_ENABLE to MVNETA_GMAC2_PCS_ENABLE
  x86: fix boot on uniprocessor systems
  Input: cypress_ps2 - don't report as a button pads
  Input: synaptics - add manual min/max quirk for ThinkPad X240
  Input: synaptics - add manual min/max quirk
  Input: mousedev - fix race when creating mixed device
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  Linux 3.10.35
  sched/autogroup: Fix race with task_groups list
  e100: Fix "disabling already-disabled device" warning
  xhci: Fix resume issues on Renesas chips in Samsung laptops
  Input: wacom - make sure touch_max is set for touch devices
  KVM: VMX: fix use after free of vmx->loaded_vmcs
  KVM: x86: handle invalid root_hpa everywhere
  KVM: MMU: handle invalid root_hpa at __direct_map
  Input: elantech - improve clickpad detection
  ARM: highbank: avoid L2 cache smc calls when PL310 is not present
  ...

Change-Id: Ib68f565291702c53df09e914e637930c5d3e5310
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-04-23 16:23:49 -07:00
Peter Zijlstra 088af0e974 sched: Add 'flags' argument to sched_{set,get}attr() syscalls
Because of a recent syscall design debate, it is deemed appropriate
for each syscall to have a flags argument for future extension,
without immediately requiring new syscalls.

Cc: juri.lelli@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Git-commit: 6d35ab48090b10c5ea5604ed5d6e91f302dc6060
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-04-17 17:17:05 -07:00
Dario Faggioli 50f8b04cda sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).

In general, this makes it possible to specify a periodic/sporadic
task that executes for a given amount of runtime at each instance and
is scheduled according to the urgency of its own timing constraints,
i.e.:

 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:

 - defines the new struct sched_attr, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;

 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setattr() and sched_getattr().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: d50dde5a10f305253cbc3855307f608f8a3c5f73
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-04-17 17:17:04 -07:00
Rohit Gupta 98ccf8a72d sched: Disable wakeup hints for foreground tasks by default
By default, sched_wakeup_load_threshold is set to 60 and wakeup hints
are therefore sent out for tasks whose load is higher than that value.
This might cause unnecessary wakeup boosts when load-based syncing is
turned ON for cpu-boost. Disable the wakeup hints by setting
sched_wakeup_load_threshold to a value higher than 100 so that the
wakeup boost doesn't happen unless it is explicitly turned ON from the
adb shell.

Change-Id: I9b8a594c2bfdf2e092cc645e50c0c21efc514c2f
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
2014-04-15 19:58:28 -07:00
Gerald Schaefer ccdb5fa37f sched/autogroup: Fix race with task_groups list
commit 41261b6a832ea0e788627f6a8707854423f9ff49 upstream.

In autogroup_create(), a tg is allocated and added to the task_groups
list. If CONFIG_RT_GROUP_SCHED is set, this tg is then modified while on
the list, without locking. This can race with someone walking the list,
like __enable_runtime() during CPU unplug, and result in a use-after-free
bug.

To fix this, move sched_online_group(), which adds the tg to the list,
to the end of the autogroup_create() function after the modification.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1369411669-46971-2-git-send-email-gerald.schaefer@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-31 09:58:14 -07:00
Linux Build Service Account b198cb9af1 Merge "Merge upstream linux-stable v3.10.28 into msm-3.10" 2014-03-26 23:36:07 -07:00
Ian Maund f1b32d4e47 Merge upstream linux-stable v3.10.28 into msm-3.10
The following commits have been reverted from this merge, as they are
known to introduce new bugs and are currently incompatible with our
audio implementation. Investigation of these commits is ongoing, and
they are expected to be brought in at a later time:

86e6de7 ALSA: compress: fix drain calls blocking other compress functions (v6)
16442d4 ALSA: compress: fix drain calls blocking other compress functions

This merge commit also includes a change in block, necessary for
compilation. Upstream has modified elevator_init_fn to prevent race
conditions, requiring updates to row_init_queue and test_init_queue.

* commit 'v3.10.28': (1964 commits)
  Linux 3.10.28
  ARM: 7938/1: OMAP4/highbank: Flush L2 cache before disabling
  drm/i915: Don't grab crtc mutexes in intel_modeset_gem_init()
  serial: amba-pl011: use port lock to guard control register access
  mm: Make {,set}page_address() static inline if WANT_PAGE_VIRTUAL
  md/raid5: Fix possible confusion when multiple write errors occur.
  md/raid10: fix two bugs in handling of known-bad-blocks.
  md/raid10: fix bug when raid10 recovery fails to recover a block.
  md: fix problem when adding device to read-only array with bitmap.
  drm/i915: fix DDI PLLs HW state readout code
  nilfs2: fix segctor bug that causes file system corruption
  thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only
  ftrace/x86: Load ftrace_ops in parameter not the variable holding it
  SELinux: Fix possible NULL pointer dereference in selinux_inode_permission()
  writeback: Fix data corruption on NFS
  hwmon: (coretemp) Fix truncated name of alarm attributes
  vfs: In d_path don't call d_dname on a mount point
  staging: comedi: adl_pci9111: fix incorrect irq passed to request_irq()
  staging: comedi: addi_apci_1032: fix subdevice type/flags bug
  mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
  GFS2: Increase i_writecount during gfs2_setattr_chown
  perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h
  perf scripting perl: Fix build error on Fedora 12
  ARM: 7815/1: kexec: offline non panic CPUs on Kdump panic
  Linux 3.10.27
  sched: Guarantee new group-entities always have weight
  sched: Fix hrtimer_cancel()/rq->lock deadlock
  sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
  sched: Fix race on toggling cfs_bandwidth_used
  x86, fpu, amd: Clear exceptions in AMD FXSAVE workaround
  netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
  SCSI: sd: Reduce buffer size for vpd request
  intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.
  mac80211: move "bufferable MMPDU" check to fix AP mode scan
  ACPI / Battery: Add a _BIX quirk for NEC LZ750/LS
  ACPI / TPM: fix memory leak when walking ACPI namespace
  mfd: rtsx_pcr: Disable interrupts before cancelling delayed works
  clk: exynos5250: fix sysmmu_mfc{l,r} gate clocks
  clk: samsung: exynos5250: Add CLK_IGNORE_UNUSED flag for the sysreg clock
  clk: samsung: exynos4: Correct SRC_MFC register
  clk: clk-divider: fix divisor > 255 bug
  ahci: add PCI ID for Marvell 88SE9170 SATA controller
  parisc: Ensure full cache coherency for kmap/kunmap
  drm/nouveau/bios: make jump conditional
  ARM: shmobile: mackerel: Fix coherent DMA mask
  ARM: shmobile: armadillo: Fix coherent DMA mask
  ARM: shmobile: kzm9g: Fix coherent DMA mask
  ARM: dts: exynos5250: Fix MDMA0 clock number
  ARM: fix "bad mode in ... handler" message for undefined instructions
  ARM: fix footbridge clockevent device
  net: Loosen constraints for recalculating checksum in skb_segment()
  bridge: use spin_lock_bh() in br_multicast_set_hash_max
  netpoll: Fix missing TXQ unlock and and OOPS.
  net: llc: fix use after free in llc_ui_recvmsg
  virtio-net: fix refill races during restore
  virtio_net: don't leak memory or block when too many frags
  virtio-net: make all RX paths handle errors consistently
  virtio_net: fix error handling for mergeable buffers
  vlan: Fix header ops passthru when doing TX VLAN offload.
  net: rose: restore old recvmsg behavior
  rds: prevent dereference of a NULL device
  ipv6: always set the new created dst's from in ip6_rt_copy
  net: fec: fix potential use after free
  hamradio/yam: fix info leak in ioctl
  drivers/net/hamradio: Integer overflow in hdlcdrv_ioctl()
  net: inet_diag: zero out uninitialized idiag_{src,dst} fields
  ip_gre: fix msg_name parsing for recvfrom/recvmsg
  net: unix: allow bind to fail on mutex lock
  ipv6: fix illegal mac_header comparison on 32bit
  netvsc: don't flush peers notifying work during setting mtu
  tg3: Initialize REG_BASE_ADDR at PCI config offset 120 to 0
  net: unix: allow set_peek_off to fail
  net: drop_monitor: fix the value of maxattr
  ipv6: don't count addrconf generated routes against gc limit
  packet: fix send path when running with proto == 0
  virtio: delete napi structures from netdev before releasing memory
  macvtap: signal truncated packets
  tun: update file current position
  macvtap: update file current position
  macvtap: Do not double-count received packets
  rds: prevent BUG_ON triggered on congestion update to loopback
  net: do not pretend FRAGLIST support
  IPv6: Fixed support for blackhole and prohibit routes
  HID: Revert "Revert "HID: Fix logitech-dj: missing Unifying device issue""
  gpio-rcar: R-Car GPIO IRQ share interrupt
  clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
  irqchip: renesas-irqc: Fix irqc_probe error handling
  Linux 3.10.26
  sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
  ext4: fix bigalloc regression
  arm64: Use Normal NonCacheable memory for writecombine
  arm64: Do not flush the D-cache for anonymous pages
  arm64: Avoid cache flushing in flush_dcache_page()
  ARM: KVM: arch_timers: zero CNTVOFF upon return to host
  ARM: hyp: initialize CNTVOFF to zero
  clocksource: arch_timer: use virtual counters
  arm64: Remove unused cpu_name ascii in arch/arm64/mm/proc.S
  arm64: dts: Reserve the memory used for secondary CPU release address
  arm64: check for number of arguments in syscall_get/set_arguments()
  arm64: fix possible invalid FPSIMD initialization state
  ...

Change-Id: Ia0e5d71b536ab49ec3a1179d59238c05bdd03106
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-03-24 14:28:34 -07:00
George McCollister 84bb5b645e sched: Fix double normalization of vruntime
commit 791c9e0292671a3bfa95286bb5c08129d8605618 upstream.

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0, which
appears to guarantee that the !se->on_rq condition is met. If the task
has done set_current_state(TASK_INTERRUPTIBLE) without schedule(), the
second condition will be met and vruntime will be incorrectly adjusted
twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at a time to a serial tty (16550
UART) in a tight loop. I'm also able to verify that making this change
corrects the problem on that platform and kernel version.
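
A sketch of the corrected check, assuming the 3.10-era
switched_from_fair():

  static void switched_from_fair(struct rq *rq, struct task_struct *p)
  {
          struct sched_entity *se = &p->se;
          struct cfs_rq *cfs_rq = cfs_rq_of(se);

          /*
           * !p->on_rq, not !se->on_rq: dequeue_entity() clears
           * se->on_rq, so the old test also matched tasks whose
           * vruntime had already been normalized once.
           */
          if (!p->on_rq && p->state != TASK_RUNNING) {
                  place_entity(cfs_rq, se, 0);
                  se->vruntime -= cfs_rq->min_vruntime;
          }
  }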

Signed-off-by: George McCollister <george.mccollister@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23 21:38:09 -07:00
Rohit Gupta f91de12382 sched: Call the notify_on_migrate notifier chain for wakeups as well
Add a change to send notify_on_migrate hints on wakeups of foreground
tasks from the scheduler if their load is above wakeup_load_thresholds
(default value is 60). These hints can be used to choose an
appropriate CPU frequency corresponding to the load of the task being
woken up.

Change-Id: Ieca413c1a8bd2b14a15a7591e8e15d22925c42ca
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
2014-03-20 16:30:40 -07:00
Rohit Gupta 1f25d291e3 cpufreq: cpu-boost: Introduce scheduler assisted load based syncs
Previously, on getting a migration notification, cpu-boost changed
the scaling min of the destination cpu's frequency to match the source
frequency or the sync_threshold, whichever was lower.

If the scheduler migration notification is extended with task load
(cpu demand) information, the cpu-boost driver can use this load to
compute a suitable frequency for the migrating task. The required
frequency for the task is calculated by taking the load percentage of
the max frequency, and no sync is performed if the load is less than a
particular value (migration_load_threshold). This change is beneficial
for both perf and power, as the demand of a task is taken into
consideration while making cpufreq decisions, and unnecessary syncs
for lightweight tasks are avoided.

The task load information provided by the scheduler comes from a
window-based load collection mechanism, which also normalizes the load
collected by the scheduler to the max possible frequency across all
CPUs.

Change-Id: Id2ba91cc4139c90602557f9b3801fb06b3c38992
Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
2014-03-20 16:30:34 -07:00
Srivatsa Vaddagiri 263a242b10 sched: window-based load stats for tasks
Provide a per-task metric that specifies how cpu-bound a task is. Task
execution is monitored over several time windows, and the fraction of
the window for which the task was found to be executing or wanting to
run is recorded as the task's demand. Windows over which the task was
sleeping are ignored. We track the last 5 recent windows for every
task, and the maximum demand seen in any of those 5 windows (where the
task had some activity) drives the frequency demand for the task.

A per-cpu metric (rq->cumulative_runnable_avg) is also provided; it is
an aggregation of the cpu demand of all tasks currently enqueued on
the cpu. rq->cumulative_runnable_avg is useful for knowing whether the
cpu frequency needs to change to match task demand.
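
A sketch of how the per-cpu aggregate can be maintained on enqueue and
dequeue (helper names assumed):

  static inline void inc_cumulative_runnable_avg(struct rq *rq,
                                                 struct task_struct *p)
  {
          rq->cumulative_runnable_avg += p->ravg.demand;
  }

  static inline void dec_cumulative_runnable_avg(struct rq *rq,
                                                 struct task_struct *p)
  {
          rq->cumulative_runnable_avg -= p->ravg.demand;
          BUG_ON((s64)rq->cumulative_runnable_avg < 0);
  }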

Change-Id: Ib83207b9ba8683cd3304ee8a2290695c34f08fe2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-03-20 16:30:13 -07:00
Srivatsa Vaddagiri cb4939d690 sched: Make scheduler aware of cpu frequency state
The capacity of a cpu (how much performance it can deliver) is partly
determined by its frequency (P) state, both its current frequency and
the max frequency it can reach. Knowing the frequency state of cpus
will help the scheduler optimize various functions, such as tracking
every task's cpu demand and placing tasks on various cpus.

This patch has scheduler registering for cpufreq notifications to
become aware of cpu's frequency state. Subsequent patches will make
use of derived information for various purposes, such as task's scaled
load (cpu demand) accounting and task placement.

Change-Id: I376dffa1e7f3f47d0496cd7e6ef8b5642ab79016
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-03-20 16:28:34 -07:00
Syed Rameez Mustafa 5541aa2fa0 sched: Make task load information available in /proc
The scheduler maintains load information for every non-realtime task.
This represents how cpu bound a task is. Expose this information in
/proc for debug purposes.

Change-Id: If0ad8ab896b8a0deaa6391ec295b503d60d74dc6
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-03-06 16:25:13 -08:00
Abhimanyu Kapur c6113eee8a sched: Fix compiler warnings
Gcc-4.8 issues some pointer type cast warnings.
Fix them.

kernel/kernel/sched/core.c: In function 'try_to_wake_up':
kernel/kernel/sched/core.c:1548:14: warning: cast to pointer from integer of
different size [-Wint-to-pointer-cast]
error, forbidden warning: core.c:1548
make[4]: *** [kernel/sched/core.o] Error 1
make[3]: *** [kernel/sched] Error 2
make[3]: *** Waiting for unfinished jobs....

Change-Id: I1808b435a0e6fdeacac7b8842a80452825205303
Signed-off-by: Abhimanyu Kapur <abhimany@codeaurora.org>
2014-02-05 09:33:38 -08:00
Syed Rameez Mustafa d2e952b8f6 sched: convert WARN_ON() to printk_sched() in try_to_wake_up_local()
try_to_wake_up_local() is called with the rq lock held. Printing to
console in this context can result in a deadlock if klogd needs to
be woken up. Print to the kernel log buffer via printk_sched()
instead which avoids the wakeup.

Change-Id: Ia07baea3cb7e0b158545207fdbbb866203256d3c
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-01-30 18:05:48 -08:00
Paul Turner 5ba4542368 sched: Guarantee new group-entities always have weight
commit 0ac9b1c21874d2490331233b3242085f8151e166 upstream.

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted
in early use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_ra(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15 15:28:54 -08:00
Ben Segall 9ca715c462 sched: Fix hrtimer_cancel()/rq->lock deadlock
commit 927b54fccbf04207ec92f669dce6806848cbec7d upstream.

__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock.

Fix this by ensuring that cfs_b->timer_active is cleared only if the
_latest_ call to do_sched_cfs_period_timer is returning as idle. Then
__start_cfs_bandwidth can just call hrtimer_try_to_cancel and wait for
that to succeed or timer_active == 1.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181622.22647.16643.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15 15:28:54 -08:00
Ben Segall 373e0a593b sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
commit db06e78cc13d70f10877e0557becc88ab3ad2be8 upstream.

hrtimer_expires_remaining does not take internal hrtimer locks and thus
must be guarded against concurrent __hrtimer_start_range_ns (but
returning HRTIMER_RESTART is safe). Use cfs_b->lock to make it safe.
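
The rule, as a sketch (any reader of the expiry must hold cfs_b->lock):

  raw_spin_lock(&cfs_b->lock);
  /* cfs_b->lock excludes a concurrent __hrtimer_start_range_ns() */
  remaining = ktime_to_ns(hrtimer_expires_remaining(&cfs_b->period_timer));
  raw_spin_unlock(&cfs_b->lock);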

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181617.22647.73829.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15 15:28:54 -08:00
Ben Segall 9d80092f8d sched: Fix race on toggling cfs_bandwidth_used
commit 1ee14e6c8cddeeb8a490d7b54cd9016e4bb900b4 upstream.

When we transition cfs_bandwidth_used to false, any currently
throttled groups will incorrectly return false from cfs_rq_throttled.
While tg_set_cfs_bandwidth will unthrottle them eventually, currently
running code (including at least dequeue_task_fair and
distribute_cfs_runtime) will cause errors.

Fix this by turning off cfs_bandwidth_used only after unthrottling all
cfs_rqs.
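
A sketch of the resulting order in tg_set_cfs_bandwidth(), with
cfs_bandwidth_usage_inc()/cfs_bandwidth_usage_dec() assumed as the
static-key wrappers:

  if (runtime_enabled && !runtime_was_enabled)
          cfs_bandwidth_usage_inc();
  /* ... unthrottle every throttled cfs_rq in the group ... */
  if (runtime_was_enabled && !runtime_enabled)
          cfs_bandwidth_usage_dec();      /* only now may the key go off */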

Tested: toggle bandwidth back and forth on a loaded cgroup. Caused
crashes in minutes without the patch, hasn't crashed with it.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-15 15:28:54 -08:00
Oleg Nesterov 57f74b6ece sched: fix the theoretical signal_wake_up() vs schedule() race
commit e0acd0a68ec7dbf6b7a81a87a867ebd7ac9b76c4 upstream.

This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock the code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is the special case of wait-for-condition:
it relies on the try_to_wake_up/schedule interaction and thus it does
not need mb() between __set_current_state() and if(signal_pending).

However, this __set_current_state() can move into the critical
section protected by rq->lock; now that try_to_wake_up() takes
another lock, we need to ensure that it can't be reordered with
the "if (signal_pending(current))" check inside that section.

The patch is actually a one-liner: it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does, for the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.
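
The default helper is just the old barrier under a name architectures
can override; the call site in schedule() then reads:

  #ifndef smp_mb__before_spinlock
  #define smp_mb__before_spinlock()       smp_wmb()
  #endif

  smp_mb__before_spinlock();
  raw_spin_lock_irq(&rq->lock);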

While at it, kill smp_mb__after_lock(); it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09 12:24:23 -08:00
Kirill Tkhai 42e7b42b1c sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities
commit 757dfcaa41844595964f1220f1d33182dae49976 upstream.

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's
priority, while the rt_rq passed to them may not be the top-level rt_rq.
This is wrong, because changing the priority on a child level does not
guarantee that the priority is the highest across the whole rq. So, this
leak makes RT balancing unusable.

A short example: the task with the highest priority among all of the
rq's RT tasks (no other task has the same priority) wakes up on a
throttled rt_rq.  The rq's cpupri is set to the task's priority
equivalent, but the real rq->rt.highest_prio.curr is lower.

The patch below fixes the problem.
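
A sketch of the guard as it lands in inc_rt_prio_smp()
(dec_rt_prio_smp() gets the same check):

  static void
  inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
  {
          struct rq *rq = rq_of_rt_rq(rt_rq);

  #ifdef CONFIG_RT_GROUP_SCHED
          /* change the rq's cpupri only if rt_rq is the top queue */
          if (&rq->rt != rt_rq)
                  return;
  #endif
          if (rq->online && prio < prev_prio)
                  cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
  }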

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09 12:24:21 -08:00
Mel Gorman 13ea54872a sched: numa: skip inaccessible VMAs
commit 3c67f474558748b604e247d92b55dfe89654c81d upstream.

Inaccessible VMAs should not be trapping NUMA hint faults. Skip them.
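
A sketch of the check, presumably in the scan loop of task_numa_work():

  /* inaccessible pages are never migrated; don't trap hint faults */
  if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
          continue;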

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09 12:24:21 -08:00
Ben Segall 5232a71945 sched: Avoid throttle_cfs_rq() racing with period_timer stopping
commit f9f9ffc237dd924f048204e8799da74f9ecf40cf upstream.

throttle_cfs_rq() doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.

Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
inactive.

(Also add some sched_debug lines for cfs_bandwidth.)
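
A sketch of the main change in throttle_cfs_rq(), under cfs_b->lock:

  raw_spin_lock(&cfs_b->lock);
  list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
  /* restart a period_timer that may have cancelled itself */
  if (!cfs_b->timer_active)
          __start_cfs_bandwidth(cfs_b);
  raw_spin_unlock(&cfs_b->lock);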

Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented out
this fails, with the full patch it passes.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20 07:45:11 -08:00
Matt Wagantall 700c727f34 sched/debug: Make sysrq prints of sched debug data optional
Calls to sysrq_sched_debug_show() can yield rather verbose output
which contributes to log spew and, under heavy load, may increase
the chances of a watchdog bark.

Make printing of this data optional with the introduction of a
new Kconfig, CONFIG_SYSRQ_SCHED_DEBUG.
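
The guard at the print site is then a plain conditional compile, e.g.:

  #ifdef CONFIG_SYSRQ_SCHED_DEBUG
          sysrq_sched_debug_show();
  #endif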

Change-Id: I5f54d901d0dea403109f7ac33b8881d967a899ed
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
2013-12-10 11:07:58 -08:00
Steve Muckle 26126dd5dc tracing/sched: add load balancer tracepoint
When doing performance analysis it can be useful to see exactly
what is going on with the load balancer - when it runs and why
exactly it may not be redistributing load.

This additional tracepoint will show the idle context of the
load balance operation (idle, not idle, newly idle), various
values from the load balancing operation, the final result,
and the new balance interval.

Change-Id: I1538c411c5f9d17d7d37d84ead6210756be2d884
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2013-11-20 15:23:19 -08:00
Daisuke Nishimura 51f5294797 sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones
commit 6c9a27f5da9609fca46cb2b183724531b48f71ad upstream.

There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it's the new cgroup's refcount that is incremented at (*1),
so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
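
A sketch of the fix, assuming it re-resolves the child's pointers under
RCU in task_fork_fair() (with rq->lock held):

  rcu_read_lock();
  __set_task_cpu(p, this_cpu);    /* refreshes child's se.parent, cfs_rq */
  rcu_read_unlock();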

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-01 09:17:45 -07:00
Stanislaw Gruszka 013b14c306 sched/cputime: Do not scale when utime == 0
commit 5a8e01f8fa51f5cbce8f37acc050eb2319d12956 upstream.

scale_stime() silently assumes that stime < rtime; otherwise,
when stime == rtime and both values are big enough (operations
on them do not fit in 32 bits), the resulting scaled stime can
be bigger than rtime. In consequence, utime = rtime - stime
results in a negative value.

User space visible symptoms of the bug are overflowed TIME
values on ps/top, for example:

 $ ps aux | grep rcu
 root         8  0.0  0.0      0     0 ?        S    12:42   0:00 [rcuc/0]
 root         9  0.0  0.0      0     0 ?        S    12:42   0:00 [rcub/0]
 root        10 62422329  0.0  0     0 ?        R    12:42 21114581:37 [rcu_preempt]
 root        11  0.1  0.0      0     0 ?        S    12:42   0:02 [rcuop/0]
 root        12 62422329  0.0  0     0 ?        S    12:42 21114581:35 [rcuop/1]
 root        10 62422329  0.0  0     0 ?        R    12:42 21114581:37 [rcu_preempt]

or overflowed utime values read directly from /proc/$PID/stat

Reference:

  https://lkml.org/lkml/2013/8/20/259
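
A sketch of the special-casing, presumably in cputime_adjust():

  if (utime == 0) {
          /* rtime is entirely system time */
          stime = rtime;
  } else if (stime == 0) {
          /* rtime is entirely user time */
          utime = rtime;
  } else {
          stime = scale_stime((__force u64)stime, (__force u64)rtime,
                              (__force u64)(stime + utime));
          utime = rtime - stime;
  }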

Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: stable@vger.kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-01 09:17:45 -07:00
Stephen Boyd 38d8910730 Merge branch 'qandroid-3.10' into msm-3.10
* qandroid-3.10: (636 commits)
  netfilter: xt_qtaguid: Protect iface list access with necessary lock
  HID: magicmouse: Fix build warning
  USB: gadget: mtp: Fix OUT endpoint request length usage in read
  USB: gadget: f_mtp: Fix using tx buffer pointer
  msm: Fix race condition in domain lookup
  msm: Add null-pointer checks for domains
  base: sync: increase size of sync_timeline name
  USB: gadget: mtp: Add module parameters for Tx transfer length
  msm: iommu: Lock the genpool allocation
  gpu: ion: fix page offset in dma_buf_kmap()
  gpu: ion: Fix bug in ion_system_heap map_user
  gpu: ion: Only map as much of the vma as the user requested
  gpu: ion: use vmalloc to allocate page array to map kernel
  gpu: ion: Remove dead comments
  gpu: ion: Minimize allocation fallback delay
  mmc: sd: Set the card removed if card detect fails
  gpu: ion: don't fault in individual pages for the CP heap
  gpu: ion: do not ask for compound pages in system heap
  gpu: ion: Modify the system heap to try to allocate large/huge pages
  gpu: ion: Set the dma_address of the sg list at alloc time
  ...

Conflicts:
	arch/arm/Kconfig
	arch/arm/include/asm/hardware/cache-l2x0.h
	arch/arm/mm/cache-l2x0.c
	drivers/mmc/card/block.c
	drivers/usb/gadget/udc-core.c
2013-09-04 14:46:18 -07:00
Steve Muckle 45a073bba9 sched: change WARN_ON_ONCE to WARN_ON in try_to_wake_up_local()
The WARN_ON_ONCE() calls at the beginning of try_to_wake_up_local()
were recently converted from BUG_ON() calls. If these hit, it indicates
something is wrong and may contribute to other system instability.
To eliminate the risk of an instance of one of these errors going
unnoticed because an earlier instance occurred long ago, change to
WARN_ON(). If there is ever a flood of these, there are bigger
problems.

Change-Id: I392832e2b6ec24b3569b001b1af9ecd4ed6828e7
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2013-08-22 18:09:13 -07:00
Arun Bharadwaj 241bfbb3ba tracing/sched: Track per-cpu rt and non-rt cpu_load.
Add a new tracepoint trace_sched_enq_deq_task to track
per-cpu rt and non-rt cpu_load during task enqueue
and dequeue.

This is useful to visualize and compare the load on
different cpus and also to understand how balanced
the load is at any point in time.

Note: We only print cpu_load[0] because we only care about
the most recent load history for tracking load balancer
effectiveness.

Change-Id: I46f0bb84e81652099ed5edf8c2686c70c8b8330c
Signed-off-by: Arun Bharadwaj <abharadw@codeaurora.org>
2013-08-22 18:08:44 -07:00
Srivatsa Vaddagiri d0feb11030 sched: re-calculate a cpu's next_balance point upon sched domain changes
Commit 55ddeb0f (sched: Reset rq->next_interval before going idle) reset
a cpu's rq->next_balance when pulled_task = 0, which will be true when
the cpu failed to pull any task, causing it to go idle. However that patch
relied on next_balance being calculated as a result of traversing cpu's
sched domain hierarchy.

A cpu that is the only online cpu will however not be attached to any
sched domain hierarchy. When such a cpu calls into idle_balance(), we
will end up initializing next_balance to be 1sec away! Such a CPU will
defer load balance check for another 1sec, even though we may bring up
more cpus in the meantime requiring it to check for load imbalance more
frequently. This could then lead to increased scheduling latency for
some tasks.

This patch results in a cpu's next_balance being re-calculated when it
attaches to a new sched domain hierarchy.  This should let cpus run
load balance checks at the time we expect them to!

Change-Id: I855cff8da5ca28d278596c3bb0163b839d4704bc
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2013-08-22 18:08:13 -07:00
Steve Muckle e853a6d6c0 sched: remove migration notification from RT class
Commit 88a7e37d265 (sched: provide per cpu-cgroup option to
notify on migrations) added a notifier call when a task is moved
to a different CPU. Unfortunately the two call sites in the RT
sched class where this occurs happens with a runqueue lock held.
This can result in a deadlock if the notifier call attempts to do
something like wake up a task.

Fortunately the benefit of 88a7e37d265 comes mainly from notifying
on migration of non-RT tasks, so we can simply ignore the movements
of RT tasks.

CRs-Fixed: 491370
Change-Id: I8849d826bf1eeaf85a6f6ad872acb475247c5926
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2013-08-22 18:08:04 -07:00
Steve Muckle 3f7fe9e850 sched: provide per cpu-cgroup option to notify on migrations
On systems where CPUs may run asynchronously, task migrations
between CPUs running at grossly different speeds can cause
problems.

This change provides a mechanism to notify a subsystem
in the kernel if a task in a particular cgroup migrates to a
different CPU. Other subsystems (such as cpufreq) may then
register for this notifier to take appropriate action when
such a task is migrated.

The cgroup attribute to set for this behavior is
"notify_on_migrate" .

Change-Id: Ie1868249e53ef901b89c837fdc33b0ad0c0a4590
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2013-08-22 18:08:00 -07:00
Srivatsa Vaddagiri c3774eeb6d sched: fix reference to wrong cfs_rq
Commit 7db16c8c (sched: Fix SCHED_HRTICK bug leading to late preemption
of tasks) introduced a bug in sched_slice() calculation by using wrong
cfs_rq for tasks. rq->cfs was incorrectly used as task's cfs_rq, rather
than the correct one to which they belonged.

Fix the bug by using the correct cfs_rq for tasks.
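
The one-line shape of the fix (sched_slice() must see the cfs_rq the
entity is actually queued on):

  struct cfs_rq *cfs_rq = cfs_rq_of(se);  /* not &rq->cfs */

  slice = sched_slice(cfs_rq, se);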

Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2013-08-22 18:07:59 -07:00
Jordan Crouse 8ec8f3e0b4 sched: Mark schedule_io_timeout() with EXPORT_SYMBOL
Make schedule_io_timeout() visible to modules.
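
The change is a single line after the function's definition (the
signature shown here is assumed, mirroring io_schedule_timeout()):

  long schedule_io_timeout(long timeout)
  {
          ...
  }
  EXPORT_SYMBOL(schedule_io_timeout);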

Change-Id: Ic0dedbad8a591a9a721f0d2e8f6c372ec75bc4b2
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
2013-08-22 18:07:53 -07:00
Srivatsa Vaddagiri 0af4fd87b1 sched: Fix SCHED_HRTICK bug leading to late preemption of tasks
SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence that hrtimer should be armed
based on the total number of SCHED_FAIR tasks a cpu has across its
various cfs_rqs, rather than on the number of tasks in a particular
cfs_rq (as implemented currently). As a result, with the current code,
it's possible for a running task (which is the sole task in its cfs_rq)
to be preempted well after its ideal_runtime has elapsed, resulting in
increased latency for tasks in other cfs_rqs on the same cpu.

Fix this by arming the sched hrtimer based on the total number of
SCHED_FAIR tasks a CPU has across its various cfs_rqs.
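
A sketch of the fix in the hrtick arming path (hrtick_start_fair() is
assumed as the site; h_nr_running counts all fair tasks on the cpu):

  static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
  {
          struct sched_entity *se = &p->se;
          struct cfs_rq *cfs_rq = cfs_rq_of(se);

          /* all SCHED_FAIR tasks on this cpu, across every cfs_rq */
          if (rq->cfs.h_nr_running > 1) {
                  u64 slice = sched_slice(cfs_rq, se);
                  u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
                  s64 delta = slice - ran;

                  if (delta < 0) {
                          if (rq->curr == p)
                                  resched_task(p);
                          return;
                  }
                  hrtick_start(rq, delta);
          }
  }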

Change-Id: I1f23680a64872f8ce0f451ac4bcae28e8967918f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2013-08-22 18:07:52 -07:00
Srivatsa Vaddagiri d35555b3f6 sched: Reset rq->next_interval before going idle
next_balance, the point in jiffy time scale when a cpu will next load
balance, could have been calculated when the cpu was busy. A busy cpu
will apply its sched domain's busy_factor (usually > 1) in computing
next_balance for that sched domain, which causes the (busy) cpu to load
balance less frequently in its sched domains. However when the same cpu
is going idle, its next_balance needs to be reset without consideration
of busy_factor. Failure to do so would leave the nohz idle balancer
untriggered on that cpu for an unnecessarily long time (introducing
additional scheduling latencies for tasks). Fix the bug in the scheduler
code that aims to reset next_balance before a cpu goes idle (as per the
existing comment) but is clearly not doing so.
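
A sketch of the intended reset at the tail of idle_balance():

  if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
          /*
           * We are going idle: next_balance may have been computed
           * with a busy_factor applied, so pull it back in.
           */
          this_rq->next_balance = next_balance;
  }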

Change-Id: I7e027a51686528c4092d770c7d33c874d38f5df4
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2013-08-22 18:07:52 -07:00
Peter Zijlstra dead45bd05 sched: Ensure update_cfs_shares() is called for parents of continuously-running tasks
commit bf0bd948d1682e3996adc093b43021ed391983e6 upstream.

We typically update a task_group's shares within the dequeue/enqueue
path.  However, continuously running tasks sharing a CPU are not
subject to these updates as they are only put/picked.  Unfortunately,
when we reverted f269ae046 (in 17bc14b7), we lost the augmenting
periodic update that was supposed to account for this; resulting in a
potential loss of fairness.

To fix this, re-introduce the explicit update in
update_cfs_rq_blocked_load() [called via entity_tick()].

Reported-by: Max Hailperin <max@gustavus.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/n/tip-9545m3apw5d93ubyrotrj31y@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-20 08:43:02 -07:00
Srivatsa Vaddagiri 39c8206a1d sched: fix rq->lock recursion
Enabling SCHED_HRTICK currently results in rq->lock recursion and a hard
hang at bootup.  Essentially try_to_wake_up() grabs rq->lock and tries
arming an hrtimer via hrtimer_restart(), which deep down tries waking up
ksoftirqd, which leads to a recursive call to try_to_wake_up() and thus
an attempt to take rq->lock recursively!!

This is fixed by having scheduler queue hrtimer via
__hrtimer_start_range_ns() which avoids waking up ksoftirqd.
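
A sketch of the queueing call in hrtick_start(); the final argument
(wakeup = 0) is what keeps ksoftirqd out of the picture:

  __hrtimer_start_range_ns(&rq->hrtick_timer, ns_to_ktime(delay), 0,
                           HRTIMER_MODE_REL_PINNED, 0);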

Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Change-Id: I11a13be1d9db3a749614ccf3d4f5fb7bf6f18fa1
(cherry picked from commit 4ca1d04ea0bdc225cc7db302172f3375a63f44de)
2013-07-08 05:52:38 -07:00
Steve Muckle 8e628bd3f9 kernel: reduce sleep duration in wait_task_inactive
Sleeping for an entire tick adds unnecessary latency to
hotplugging a cpu (cpu_up).

Change-Id: Iab323a79f4048bc9101ecfd368e0f275827ed4ab
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2013-07-08 05:51:41 -07:00
Steve Muckle f79da145a4 sched: add sysctl for controlling task migrations on wake
The PF_WAKE_UP_IDLE per-task flag made it impossible to enable
the old behavior of SD_SHARE_PKG_RESOURCES, where every task
migrates to an idle CPU on wakeup.

The sched_wake_to_idle sysctl value, when made nonzero, will cause
all tasks to migrate to an idle CPU if one is available when the
task is woken up. This is regardless of how PF_WAKE_UP_IDLE is
configured for tasks in the system. Similar to PF_WAKE_UP_IDLE,
the SD_SHARE_PKG_RESOURCES scheduler domain flag must be enabled
for the sysctl value to have an effect.
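
A sketch of the wake-up path check, with sysctl_sched_wake_to_idle
assumed as the backing variable:

  if (sysctl_sched_wake_to_idle ||
      (current->flags & PF_WAKE_UP_IDLE) ||
      (p->flags & PF_WAKE_UP_IDLE))
          target = select_idle_sibling(p, target);  /* prefer an idle cpu */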

Change-Id: I23bed846d26502c7aed600bfcf1c13053a7e5f61
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
(cherry picked from commit 9d5b38dc0025d19df5b756b16024b4269e73f282)

Conflicts:

	kernel/sched/fair.c
2013-07-08 05:51:38 -07:00
Steve Muckle d282c8eea2 sched: add PF_WAKE_UP_IDLE
Certain workloads may benefit from the SD_SHARE_PKG_RESOURCES behavior
of waking their tasks up on idle CPUs. However, the feature has too
much of a negative impact on other workloads to apply globally. The
PF_WAKE_UP_IDLE flag tells the scheduler to wake up tasks that have this
flag set, or tasks woken by tasks with this flag set, on an idle CPU
if one is available.
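
Opting a task in is a plain flag operation, e.g.:

  current->flags |= PF_WAKE_UP_IDLE;  /* this task (and its wakees) wake on idle cpus */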

Change-Id: I20b28faf35029f9395e9d9f5ddd57ce2de795039
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2013-07-08 05:51:38 -07:00
Jeff Ohlstein e14d767ab7 sched_avg: add run queue averaging
Add code to calculate the run queue depth of a cpu and the iowait
depth of the cpu.

The scheduler calls into sched_update_nr_prod() whenever there
is a runqueue change. This function maintains the runqueue average
and the iowait of that cpu in that time interval.

Whoever wants to know the runqueue average is expected to call
sched_get_nr_running_avg periodically to get the accumulated
runqueue and iowait averages for all the cpus.
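
A sketch of a periodic consumer, assuming the entry points take output
pointers:

  int avg, iowait_avg;

  sched_get_nr_running_avg(&avg, &iowait_avg);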

Change-Id: Id8cb2ecf0ed479f090a83ccb72dd59c53fa73e0c
Signed-off-by: Jeff Ohlstein <johlstei@codeaurora.org>
(cherry picked from commit 0299fcaaad80e2c0ac9aa583c95107f6edc27750)
2013-07-08 05:51:38 -07:00
Colin Cross bebadf46e9 cgroup: Add generic cgroup subsystem permission checks
Rather than using explicit euid == 0 checks when trying to move
tasks into a cgroup via CFS, move permission checks into each
specific cgroup subsystem. If a subsystem does not specify an
'allow_attach' handler, then we fall back to doing our checks
the old way.

Use the 'allow_attach' handler for the 'cpu' cgroup to allow
non-root processes to add arbitrary processes to a 'cpu' cgroup
if the caller has the CAP_SYS_NICE capability set.

This version of the patch adds an 'allow_attach' handler instead
of reusing the 'can_attach' handler.  If the 'can_attach' handler
is reused, a new cgroup that implements 'can_attach' but not
the permission checks could end up with no permission checks
at all.
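
A sketch of the cpu cgroup's handler (the prototype of the new
'allow_attach' hook is assumed):

  static int cpu_cgroup_allow_attach(struct cgroup *cgrp,
                                     struct cgroup_taskset *tset)
  {
          const struct cred *cred = current_cred(), *tcred;
          struct task_struct *task;

          cgroup_taskset_for_each(task, cgrp, tset) {
                  tcred = __task_cred(task);

                  if (current != task && !capable(CAP_SYS_NICE) &&
                      !uid_eq(cred->euid, tcred->uid) &&
                      !uid_eq(cred->euid, tcred->suid))
                          return -EACCES;
          }

          return 0;
  }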

Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c
Original-Author: San Mehat <san@google.com>
Signed-off-by: Colin Cross <ccross@android.com>
2013-07-01 13:38:49 -07:00
Arve Hjønnevåg aca50f226d sched: Enable might_sleep before initializing drivers.
This allows detection of init bugs in built-in drivers.

Signed-off-by: Arve Hjønnevåg <arve@android.com>
2013-07-01 13:34:55 -07:00
Linus Torvalds a3d5c3460a Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two smaller fixes - plus a context tracking tracing fix that is a bit
  bigger"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracing/context-tracking: Add preempt_schedule_context() for tracing
  sched: Fix clear NOHZ_BALANCE_KICK
  sched/x86: Construct all sibling maps if smt
2013-06-20 08:18:35 -10:00
Vincent Guittot 873b4c65b5 sched: Fix clear NOHZ_BALANCE_KICK
I have faced a sequence where the Idle Load Balance was sometimes not
triggered for a while on my platform, in the following scenario:

 CPU 0 and CPU 1 are running tasks and CPU 2 is idle

 CPU 1 kicks the Idle Load Balance
 CPU 1 selects CPU 2 as the new Idle Load Balancer
 CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
 CPU 1 sends a reschedule IPI to CPU 2

 While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2

 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
       task A quickly goes back to sleep (before a tick occurs on CPU 2)
 CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 is selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
performed.

We must wait for the sched softirq to be raised on CPU 2 by some other
part of the kernel before NOHZ_BALANCE_KICK gets cleared.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if
we can't raise the sched_softirq for the Idle Load Balance.

Change since V1:

- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
  can't run on this CPU (as suggested by Peter)
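
The resulting helper looks roughly like:

  static inline bool got_nohz_idle_kick(void)
  {
          int cpu = smp_processor_id();

          if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
                  return false;

          if (idle_cpu(cpu) && !need_resched())
                  return true;

          /* we can't run the ILB on this cpu now, so cancel the kick */
          clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
          return false;
  }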

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1370419991-13870-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19 12:55:09 +02:00
Frederic Weisbecker 45eacc6927 vtime: Use consistent clocks among nohz accounting
While computing the cputime delta of dynticks CPUs,
we are mixing up clocks of different natures:

* local_clock() which takes care of unstable clock
sources and fixes them if needed.

* sched_clock() which is the weaker version of
local_clock(). It doesn't compute any fixup in case
of an unstable source.

If the clock source is stable, those two clocks are the
same and we can safely compute the difference against
two random points.

Otherwise it results in random deltas as sched_clock()
can randomly drift away, back or forward, from local_clock().

As a consequence, some strange behaviour with an unstable tsc
has been observed, such as non-progressing, constant-zero cputime
(the 'top' command showing no load).

Fix this by only using local_clock(), or its irq safe/remote
equivalent, in vtime code.
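
A sketch of the rule, assuming a vtime_delta()-style helper and the
task's vtime_snap timestamp:

  static u64 vtime_delta(struct task_struct *tsk)
  {
          /* local_clock() fixes up unstable sources; sched_clock() doesn't */
          return local_clock() - tsk->vtime_snap;
  }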

Reported-by: Mike Galbraith <efault@gmx.de>
Suggested-by: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-31 11:31:50 +02:00
Linus Torvalds 534c97b095 Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull 'full dynticks' support from Ingo Molnar:
 "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
  kernel feature to the timer and scheduler subsystems: 'full dynticks',
  or CONFIG_NO_HZ_FULL=y.

  This feature extends the nohz variable-size timer tick feature from
  idle to busy CPUs (running at most one task) as well, potentially
  reducing the number of timer interrupts significantly.

  This feature got motivated by real-time folks and the -rt tree, but
  the general utility and motivation of full-dynticks runs wider than
  that:

   - HPC workloads get faster: CPUs running a single task should be able
     to utilize a maximum amount of CPU power.  A periodic timer tick at
     HZ=1000 can cause a constant overhead of up to 1.0%.  This feature
     removes that overhead - and speeds up the system by 0.5%-1.0% on
     typical distro configs even on modern systems.

   - Real-time workload latency reduction: CPUs running critical tasks
     should experience as little jitter as possible.  The last remaining
     source of kernel-related jitter was the periodic timer tick.

   - A single task executing on a CPU is a pretty common situation,
     especially with an increasing number of cores/CPUs, so this feature
     helps desktop and mobile workloads as well.

  The cost of the feature is mainly related to increased timer
  reprogramming overhead when a CPU switches its tick period, and thus
  slightly longer to-idle and from-idle latency.

  Configuration-wise a third mode of operation is added to the existing
  two NOHZ kconfig modes:

   - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
     as a config option.  This is the traditional Linux periodic tick
     design: there's a HZ tick going on all the time, regardless of
     whether a CPU is idle or not.

   - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
     periodic tick when a CPU enters idle mode.

   - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
     tick when a CPU is idle, also slows the tick down to 1 Hz (one
     timer interrupt per second) when only a single task is running on a
     CPU.

  The .config behavior is compatible: existing !CONFIG_NO_HZ and
  CONFIG_NO_HZ=y settings get translated to the new values, without the
  user having to configure anything.  CONFIG_NO_HZ_FULL is turned off by
  default.

  This feature is based on a lot of infrastructure work that has been
  steadily going upstream in the last 2-3 cycles: related RCU support
  and non-periodic cputime support in particular is upstream already.

  This tree adds the final pieces and activates the feature.  The pull
  request is marked RFC because:

   - it's marked 64-bit only at the moment - the 32-bit support patch is
     small but did not get ready in time.

   - it has a number of fresh commits that came in after the merge
     window.  The overwhelming majority of commits are from before the
     merge window, but still some aspects of the tree are fresh and so I
     marked it RFC.

   - it's a pretty wide-reaching feature with lots of effects - and
     while the components have been in testing for some time, the full
     combination is still not very widely used.  That it's default-off
     should reduce its regression abilities and obviously there are no
     known regressions with CONFIG_NO_HZ_FULL=y enabled either.

   - the feature is not completely idempotent: there is no 100%
     equivalent replacement for a periodic scheduler/timer tick.  In
     particular there's ongoing work to map out and reduce its effects
     on scheduler load-balancing and statistics.  This should not impact
     correctness though, there are no known regressions related to this
     feature at this point.

   - it's a pretty ambitious feature that with time will likely be
     enabled by most Linux distros, and we'd like you to make input on
     its design/implementation, if you dislike some aspect we missed.
     Without flaming us to crisp! :-)

  Future plans:

   - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
     the periodic tick altogether when there's a single busy task on a
     CPU.  We'd first like 1 Hz to be exposed more widely before we go
     for the 0 Hz target though.

   - once we reach 0 Hz we can remove the periodic tick assumption from
     nr_running>=2 as well, by essentially interrupting busy tasks only
     as frequently as the sched_latency constraints require us to do -
     once every 4-40 msecs, depending on nr_running.

  I am personally leaning towards biting the bullet and doing this in
  v3.10, like the -rt tree this effort has been going on for too long -
  but the final word is up to you as usual.

  More technical details can be found in Documentation/timers/NO_HZ.txt"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  sched: Keep at least 1 tick per second for active dynticks tasks
  rcu: Fix full dynticks' dependency on wide RCU nocb mode
  nohz: Protect smp_processor_id() in tick_nohz_task_switch()
  nohz_full: Add documentation.
  cputime_nsecs: use math64.h for nsec resolution conversion helpers
  nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
  nohz: Reduce overhead under high-freq idling patterns
  nohz: Remove full dynticks' superfluous dependency on RCU tree
  nohz: Fix unavailable tick_stop tracepoint in dynticks idle
  nohz: Add basic tracing
  nohz: Select wide RCU nocb for full dynticks
  nohz: Disable the tick when irq resume in full dynticks CPU
  nohz: Re-evaluate the tick for the new task after a context switch
  nohz: Prepare to stop the tick on irq exit
  nohz: Implement full dynticks kick
  nohz: Re-evaluate the tick from the scheduler IPI
  sched: New helper to prevent from stopping the tick in full dynticks
  sched: Kick full dynticks CPU that have more than one task enqueued.
  perf: New helper to prevent full dynticks CPUs from stopping tick
  perf: Kick full dynticks CPU if events rotation is needed
  ...
2013-05-05 13:23:27 -07:00
Frederic Weisbecker 265f22a975 sched: Keep at least 1 tick per second for active dynticks tasks
The scheduler doesn't yet fully support environments
with a single task running without a periodic tick.

In order to ensure we still maintain the duties of scheduler_tick(),
keep at least 1 tick per second.

This makes sure that we keep the progression of various scheduler
accounting and background maintenance even with a very low granularity.
Examples include cpu load, sched average, CFS entity vruntime,
avenrun and events such as load balancing, amongst other details
handled in sched_class::task_tick().

This limitation will be removed in the future once we get
these individual items to work in full dynticks CPUs.
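
The cap works by letting the dynticks code ask the scheduler how long
the tick may be deferred; a helper along these lines (names assumed):

  u64 scheduler_tick_max_deferment(void)
  {
          struct rq *rq = this_rq();
          unsigned long next, now = ACCESS_ONCE(jiffies);

          /* at most one second past the last scheduler tick */
          next = rq->last_sched_tick + HZ;

          if (time_before_eq(next, now))
                  return 0;

          return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
  }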

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-05-04 08:32:02 +02:00
Linus Torvalds 0279b3c0ad Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "This fixes the cputime scaling overflow problems for good without
  having bad 32-bit overhead, and gets rid of the div64_u64_rem() helper
  as well."

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "math64: New div64_u64_rem helper"
  sched: Avoid prev->stime underflow
  sched: Do not account bogus utime
  sched: Avoid cputime scaling overflow
2013-05-02 14:56:31 -07:00
Frederic Weisbecker c032862fba Merge commit '8700c95adb03' into timers/nohz
The full dynticks tree needs the latest RCU and sched
upstream updates in order to fix some dependencies.

Merge a common upstream merge point that has these
updates.

Conflicts:
	include/linux/perf_event.h
	kernel/rcutree.h
	kernel/rcutree_plugin.h

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-05-02 17:54:19 +02:00
Linus Torvalds 20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
Tejun Heo 3d1cb2059d workqueue: include workqueue info when printing debug dump of a worker task
One of the problems that arises when converting a dedicated custom
threadpool to workqueue is that the shared worker pool used by workqueue
anonymizes each worker, making it more difficult to identify what the
worker was doing on which target from the output of sysrq-t or a debug
dump from oops, BUG() and friends.

This patch implements set_worker_desc() which can be called from any
workqueue work function to set its description.  When the worker task is
dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
and so on - the description will be printed out together with the
workqueue name and the worker function pointer.
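
Usage is a single call from the work function; the writeback conversion
mentioned below plausibly reads:

  /* in bdi_writeback_workfn() */
  set_worker_desc("flush-%s", dev_name(bdi->dev));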

The printing side is implemented by print_worker_info() which is called
from functions in task dump paths - sched_show_task() and
dump_stack_print_info().  print_worker_info() can be safely called on
any task in any state as long as the task struct itself is accessible.
It uses probe_*() functions to access worker fields.  It may print
garbage if something went very wrong, but it wouldn't cause (another)
oops.

The description is currently limited to 24 bytes including the
terminating \0.  worker->desc_valid and worker->desc[] are added, and
the 64-byte marker, which was already incorrect before adding the new
fields, is moved to the correct position.

Here's an example dump with writeback updated to set the bdi name as
worker desc.

 Hardware name: Bochs
 Modules linked in:
 Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
 Workqueue: writeback bdi_writeback_workfn (flush-8:0)
  ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
  ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
  ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
 Call Trace:
  [<ffffffff81c61845>] dump_stack+0x19/0x1b
  [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
  [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0
 ...

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Stanislaw Gruszka 68aa8efcd1 sched: Avoid prev->stime underflow
Dave Hansen reported strange utime/stime values on his system:
https://lkml.org/lkml/2013/4/4/435

This happens because prev->stime value is bigger than rtime
value. Root of the problem are non-monotonic rtime values (i.e.
current rtime is smaller than previous rtime) and that should be
debugged and fixed.

But since the problem did not manifest itself before commit
62188451f0 "cputime: Avoid
multiplication overflow on utime scaling", it should be treated
as a regression, which we can easily fix in the cputime_adjust()
function.

For now, let's apply this fix, but further work is needed to fix
root of the problem.
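
A sketch of the fix in cputime_adjust(): clamp stime first, then derive
utime from the clamped value so the subtraction cannot underflow:

  if (stime < prev->stime)
          stime = prev->stime;
  utime = rtime - stime;

  prev->stime = stime;
  prev->utime = utime;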

Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1367314507-9728-3-git-send-email-sgruszka@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-30 19:13:05 +02:00
Stanislaw Gruszka 772c808a25 sched: Do not account bogus utime
Due to rounding in scale_stime(), for big numbers, scaled stime
values will grow in chunks. Since rtime grows in jiffies and we
calculate utime like below:

	prev->stime = max(prev->stime, stime);
	prev->utime = max(prev->utime, rtime - prev->stime);

we could erroneously account stime values as utime. To prevent
that, only update prev->{u,s}time values when they are smaller
than the current rtime.
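
A sketch of the guard in cputime_adjust():

  /*
   * Only update user-visible values if rtime actually advanced
   * past what was already exported.
   */
  if (prev->stime + prev->utime >= rtime)
          goto out;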

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1367314507-9728-2-git-send-email-sgruszka@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-30 19:13:04 +02:00