Commit Graph

154 Commits

Author SHA1 Message Date
Ian Maund f1b32d4e47 Merge upstream linux-stable v3.10.28 into msm-3.10
The following commits have been reverted from this merge, as they are
known to introduce new bugs and are currently incompatible with our
audio implementation. Investigation of these commits is ongoing, and
they are expected to be brought in at a later time:

86e6de7 ALSA: compress: fix drain calls blocking other compress functions (v6)
16442d4 ALSA: compress: fix drain calls blocking other compress functions

This merge commit also includes a change in block, necessary for
compilation. Upstream has modified elevator_init_fn to prevent race
conditions, requring updates to row_init_queue and test_init_queue.

* commit 'v3.10.28': (1964 commits)
  Linux 3.10.28
  ARM: 7938/1: OMAP4/highbank: Flush L2 cache before disabling
  drm/i915: Don't grab crtc mutexes in intel_modeset_gem_init()
  serial: amba-pl011: use port lock to guard control register access
  mm: Make {,set}page_address() static inline if WANT_PAGE_VIRTUAL
  md/raid5: Fix possible confusion when multiple write errors occur.
  md/raid10: fix two bugs in handling of known-bad-blocks.
  md/raid10: fix bug when raid10 recovery fails to recover a block.
  md: fix problem when adding device to read-only array with bitmap.
  drm/i915: fix DDI PLLs HW state readout code
  nilfs2: fix segctor bug that causes file system corruption
  thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only
  ftrace/x86: Load ftrace_ops in parameter not the variable holding it
  SELinux: Fix possible NULL pointer dereference in selinux_inode_permission()
  writeback: Fix data corruption on NFS
  hwmon: (coretemp) Fix truncated name of alarm attributes
  vfs: In d_path don't call d_dname on a mount point
  staging: comedi: adl_pci9111: fix incorrect irq passed to request_irq()
  staging: comedi: addi_apci_1032: fix subdevice type/flags bug
  mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
  GFS2: Increase i_writecount during gfs2_setattr_chown
  perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h
  perf scripting perl: Fix build error on Fedora 12
  ARM: 7815/1: kexec: offline non panic CPUs on Kdump panic
  Linux 3.10.27
  sched: Guarantee new group-entities always have weight
  sched: Fix hrtimer_cancel()/rq->lock deadlock
  sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
  sched: Fix race on toggling cfs_bandwidth_used
  x86, fpu, amd: Clear exceptions in AMD FXSAVE workaround
  netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
  SCSI: sd: Reduce buffer size for vpd request
  intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.
  mac80211: move "bufferable MMPDU" check to fix AP mode scan
  ACPI / Battery: Add a _BIX quirk for NEC LZ750/LS
  ACPI / TPM: fix memory leak when walking ACPI namespace
  mfd: rtsx_pcr: Disable interrupts before cancelling delayed works
  clk: exynos5250: fix sysmmu_mfc{l,r} gate clocks
  clk: samsung: exynos5250: Add CLK_IGNORE_UNUSED flag for the sysreg clock
  clk: samsung: exynos4: Correct SRC_MFC register
  clk: clk-divider: fix divisor > 255 bug
  ahci: add PCI ID for Marvell 88SE9170 SATA controller
  parisc: Ensure full cache coherency for kmap/kunmap
  drm/nouveau/bios: make jump conditional
  ARM: shmobile: mackerel: Fix coherent DMA mask
  ARM: shmobile: armadillo: Fix coherent DMA mask
  ARM: shmobile: kzm9g: Fix coherent DMA mask
  ARM: dts: exynos5250: Fix MDMA0 clock number
  ARM: fix "bad mode in ... handler" message for undefined instructions
  ARM: fix footbridge clockevent device
  net: Loosen constraints for recalculating checksum in skb_segment()
  bridge: use spin_lock_bh() in br_multicast_set_hash_max
  netpoll: Fix missing TXQ unlock and and OOPS.
  net: llc: fix use after free in llc_ui_recvmsg
  virtio-net: fix refill races during restore
  virtio_net: don't leak memory or block when too many frags
  virtio-net: make all RX paths handle errors consistently
  virtio_net: fix error handling for mergeable buffers
  vlan: Fix header ops passthru when doing TX VLAN offload.
  net: rose: restore old recvmsg behavior
  rds: prevent dereference of a NULL device
  ipv6: always set the new created dst's from in ip6_rt_copy
  net: fec: fix potential use after free
  hamradio/yam: fix info leak in ioctl
  drivers/net/hamradio: Integer overflow in hdlcdrv_ioctl()
  net: inet_diag: zero out uninitialized idiag_{src,dst} fields
  ip_gre: fix msg_name parsing for recvfrom/recvmsg
  net: unix: allow bind to fail on mutex lock
  ipv6: fix illegal mac_header comparison on 32bit
  netvsc: don't flush peers notifying work during setting mtu
  tg3: Initialize REG_BASE_ADDR at PCI config offset 120 to 0
  net: unix: allow set_peek_off to fail
  net: drop_monitor: fix the value of maxattr
  ipv6: don't count addrconf generated routes against gc limit
  packet: fix send path when running with proto == 0
  virtio: delete napi structures from netdev before releasing memory
  macvtap: signal truncated packets
  tun: update file current position
  macvtap: update file current position
  macvtap: Do not double-count received packets
  rds: prevent BUG_ON triggered on congestion update to loopback
  net: do not pretend FRAGLIST support
  IPv6: Fixed support for blackhole and prohibit routes
  HID: Revert "Revert "HID: Fix logitech-dj: missing Unifying device issue""
  gpio-rcar: R-Car GPIO IRQ share interrupt
  clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
  irqchip: renesas-irqc: Fix irqc_probe error handling
  Linux 3.10.26
  sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
  ext4: fix bigalloc regression
  arm64: Use Normal NonCacheable memory for writecombine
  arm64: Do not flush the D-cache for anonymous pages
  arm64: Avoid cache flushing in flush_dcache_page()
  ARM: KVM: arch_timers: zero CNTVOFF upon return to host
  ARM: hyp: initialize CNTVOFF to zero
  clocksource: arch_timer: use virtual counters
  arm64: Remove unused cpu_name ascii in arch/arm64/mm/proc.S
  arm64: dts: Reserve the memory used for secondary CPU release address
  arm64: check for number of arguments in syscall_get/set_arguments()
  arm64: fix possible invalid FPSIMD initialization state
  ...

Change-Id: Ia0e5d71b536ab49ec3a1179d59238c05bdd03106
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-03-24 14:28:34 -07:00
Tejun Heo b6891ed4e6 cgroup: fix RCU accesses to task->cgroups
commit 14611e51a57df10240817d8ada510842faf0ec51 upstream.

task->cgroups is a RCU pointer pointing to struct css_set.  A task
switches to a different css_set on cgroup migration but a css_set
doesn't change once created and its pointers to cgroup_subsys_states
aren't RCU protected.

task_subsys_state[_check]() is the macro to acquire css given a task
and subsys_id pair.  It RCU-dereferences task->cgroups->subsys[] not
task->cgroups, so the RCU pointer task->cgroups ends up being
dereferenced without read_barrier_depends() after it.  It's broken.

Fix it by introducing task_css_set[_check]() which does
RCU-dereference on task->cgroups.  task_subsys_state[_check]() is
reimplemented to directly dereference ->subsys[] of the css_set
returned from task_css_set[_check]().

This removes some of sparse RCU warnings in cgroup.

v2: Fixed unbalanced parenthsis and there's no need to use
    rcu_dereference_raw() when !CONFIG_PROVE_RCU.  Both spotted by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-21 18:21:25 -07:00
Colin Cross bebadf46e9 cgroup: Add generic cgroup subsystem permission checks
Rather than using explicit euid == 0 checks when trying to move
tasks into a cgroup via CFS, move permission checks into each
specific cgroup subsystem. If a subsystem does not specify a
'allow_attach' handler, then we fall back to doing our checks
the old way.

Use the 'allow_attach' handler for the 'cpu' cgroup to allow
non-root processes to add arbitrary processes to a 'cpu' cgroup
if it has the CAP_SYS_NICE capability set.

This version of the patch adds a 'allow_attach' handler instead
of reusing the 'can_attach' handler.  If the 'can_attach' handler
is reused, a new cgroup that implements 'can_attach' but not
the permission checks could end up with no permission checks
at all.

Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c
Original-Author: San Mehat <san@google.com>
Signed-off-by: Colin Cross <ccross@android.com>
2013-07-01 13:38:49 -07:00
Tejun Heo 7805d000db cgroup: fix a subtle bug in descendant pre-order walk
When cgroup_next_descendant_pre() initiates a walk, it checks whether
the subtree root doesn't have any children and if not returns NULL.
Later code assumes that the subtree isn't empty.  This is broken
because the subtree may become empty inbetween, which can lead to the
traversal escaping the subtree by walking to the sibling of the
subtree root.

There's no reason to have the early exit path.  Remove it along with
the later assumption that the subtree isn't empty.  This simplifies
the code a bit and fixes the subtle bug.

While at it, fix the comment of cgroup_for_each_descendant_pre() which
was incorrectly referring to ->css_offline() instead of
->css_online().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: stable@vger.kernel.org
2013-05-24 10:50:24 +09:00
Kent Overstreet a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Linus Torvalds 20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
Al Viro 8d8b97ba49 take cgroup_open() and cpuset_open() to fs/proc/base.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:46 -04:00
Linus Torvalds 16fa94b532 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this development cycle were:

   - full dynticks preparatory work by Frederic Weisbecker

   - factor out the cpu time accounting code better, by Li Zefan

   - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

   - various smaller fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  sched: Fix init NOHZ_IDLE flag
  sched: Prevent to re-select dst-cpu in load_balance()
  sched: Rename load_balance_tmpmask to load_balance_mask
  sched: Move up affinity check to mitigate useless redoing overhead
  sched: Don't consider other cpus in our group in case of NEWLY_IDLE
  sched: Explicitly cpu_idle_type checking in rebalance_domains()
  sched: Change position of resched_cpu() in load_balance()
  sched: Fix wrong rq's runnable_avg update with rt tasks
  sched: Document task_struct::personality field
  sched/cpuacct/UML: Fix header file dependency bug on the UML build
  cgroup: Kill subsys.active flag
  sched/cpuacct: No need to check subsys active state
  sched/cpuacct: Initialize cpuacct subsystem earlier
  sched/cpuacct: Initialize root cpuacct earlier
  sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
  sched/cpuacct: Clean up cpuacct.h
  sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
  sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
  sched/cpuacct: Add cpuacct_acount_field()
  sched/cpuacct: Add cpuacct_init()
  ...
2013-04-30 07:43:28 -07:00
Linus Torvalds 191a712090 Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Fixes and a lot of cleanups.  Locking cleanup is finally complete.
   cgroup_mutex is no longer exposed to individual controlelrs which
   used to cause nasty deadlock issues.  Li fixed and cleaned up quite a
   bit including long standing ones like racy cgroup_path().

 - device cgroup now supports proper hierarchy thanks to Aristeu.

 - perf_event cgroup now supports proper hierarchy.

 - A new mount option "__DEVEL__sane_behavior" is added.  As indicated
   by the name, this option is to be used for development only at this
   point and generates a warning message when used.  Unfortunately,
   cgroup interface currently has too many brekages and inconsistencies
   to implement a consistent and unified hierarchy on top.  The new flag
   is used to collect the behavior changes which are necessary to
   implement consistent unified hierarchy.  It's likely that this flag
   won't be used verbatim when it becomes ready but will be enabled
   implicitly along with unified hierarchy.

   The option currently disables some of broken behaviors in cgroup core
   and also .use_hierarchy switch in memcg (will be routed through -mm),
   which can be used to make very unusual hierarchy where nesting is
   partially honored.  It will also be used to implement hierarchy
   support for blk-throttle which would be impossible otherwise without
   introducing a full separate set of control knobs.

   This is essentially versioning of interface which isn't very nice but
   at this point I can't see any other options which would allow keeping
   the interface the same while moving towards hierarchy behavior which
   is at least somewhat sane.  The planned unified hierarchy is likely
   to require some level of adaptation from userland anyway, so I think
   it'd be best to take the chance and update the interface such that
   it's supportable in the long term.

   Maintaining the existing interface does complicate cgroup core but
   shouldn't put too much strain on individual controllers and I think
   it'd be manageable for the foreseeable future.  Maybe we'll be able
   to drop it in a decade.

Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
  cpuset: fix compile warning when CONFIG_SMP=n
  cpuset: fix cpu hotplug vs rebuild_sched_domains() race
  cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
  cgroup: restore the call to eventfd->poll()
  cgroup: fix use-after-free when umounting cgroupfs
  cgroup: fix broken file xattrs
  devcg: remove parent_cgroup.
  memcg: force use_hierarchy if sane_behavior
  cgroup: remove cgrp->top_cgroup
  cgroup: introduce sane_behavior mount option
  move cgroupfs_root to include/linux/cgroup.h
  cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
  cgroup: make cgroup_path() not print double slashes
  Revert "cgroup: remove bind() method from cgroup_subsys."
  perf: make perf_event cgroup hierarchical
  cgroup: implement cgroup_is_descendant()
  cgroup: make sure parent won't be destroyed before its children
  cgroup: remove bind() method from cgroup_subsys.
  devcg: remove broken_hierarchy tag
  cgroup: remove cgroup_lock_is_held()
  ...
2013-04-29 19:14:20 -07:00
Michal Hocko 6d2488f64a cgroup: remove css_get_next
Now that we have generic and well ordered cgroup tree walkers there is
no need to keep css_get_next in the place.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Li Zefan 712317ad97 cgroup: fix broken file xattrs
We should store file xattrs in struct cfent instead of struct cftype,
because cftype is a type while cfent is object instance of cftype.

For example each cgroup has a tasks file, and each tasks file is
associated with a uniq cfent, but all those files share the same
struct cftype.

Alexey Kodanev reported a crash, which can be reproduced:

  # mount -t cgroup -o xattr /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/test
  # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks
  # rmdir /sys/fs/cgroup/test
  # umount /sys/fs/cgroup
  oops!

In this case, simple_xattrs_free() will free the same struct simple_xattrs
twice.

tj: Dropped unused local variable @cft from cgroup_diput().

Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-18 23:11:40 -07:00
Tejun Heo f00baae7ad memcg: force use_hierarchy if sane_behavior
Turn on use_hierarchy by default if sane_behavior is specified and
don't create .use_hierarchy file.

It is debatable whether to remove .use_hierarchy file or make it ro as
the former could make transition easier in certain cases; however, the
behavior changes which will be gated by sane_behavior are intensive
including changing basic meaning of certain control knobs in a few
controllers and I don't really think keeping this piece would make
things easier in any noticeable way, so let's remove it.

v2: Explain that mem_cgroup_bind() doesn't have to worry about
    children as suggested by Michal Hocko.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2013-04-15 13:46:27 -07:00
Li Zefan 05fb22ec54 cgroup: remove cgrp->top_cgroup
It's not used, and it can be retrieved via cgrp->root->top_cgroup.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-14 23:26:10 -07:00
Tejun Heo 873fe09ea5 cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.

As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility.  This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors.  As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.

Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default.  Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.

This patch introduces the mount option and changes the following
behaviors in cgroup core.

* Mount options "noprefix" and "clone_children" are disallowed.  Also,
  cgroupfs file cgroup.clone_children is not created.

* When mounting an existing superblock, mount options should match.
  This is currently pretty crazy.  If one mounts a cgroup, creates a
  subdirectory, unmounts it and then mount it again with different
  option, it looks like the new options are applied but they aren't.

* Remount is disallowed.

The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.

v2: Dropped unnecessary explicit file permission setting sane_behavior
    cftype entry as suggested by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
2013-04-14 20:15:26 -07:00
Tejun Heo 25a7e6848d move cgroupfs_root to include/linux/cgroup.h
While controllers shouldn't be accessing cgroupfs_root directly, it
being hidden inside kern/cgroup.c makes somethings pretty silly.  This
makes routing hierarchy-wide settings which need to be visible to
controllers cumbersome.

We're gonna add another hierarchy-wide setting which needs to be
accessed from controllers.  Move cgroupfs_root and its flags to the
header file so that we can access root settings with inline helpers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-14 20:15:25 -07:00
Tejun Heo 26d5bbe5ba Revert "cgroup: remove bind() method from cgroup_subsys."
This reverts commit 84cfb6ab48.  There
are scheduled changes which make use of the removed callback.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rami Rosen <ramirose@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
2013-04-12 10:29:04 -07:00
Li Zefan 78574cf981 cgroup: implement cgroup_is_descendant()
A couple controllers want to determine whether two cgroups are in
ancestor/descendant relationship.  As it's more likely that the
descendant is the primary subject of interest and there are other
operations focusing on the descendants, let's ask is_descendent rather
than is_ancestor.

Implementation is trivial as the previous patch guarantees that all
ancestors of a cgroup stay accessible as long as the cgroup is
accessible.

tj: Removed depth optimization, renamed from cgroup_is_ancestor(),
    rewrote descriptions.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-10 11:07:08 -07:00
Rami Rosen 84cfb6ab48 cgroup: remove bind() method from cgroup_subsys.
The bind() method of cgroup_subsys is not used in any of the
controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio,
devices, perf, hugetlb, cpu and cpuacct)

tj: Removed the entry on ->bind() from
    Documentation/cgroups/cgroups.txt.  Also updated a couple
    paragraphs which were suggesting that dynamic re-binding may be
    implemented.  It's not gonna.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-10 10:46:59 -07:00
Li Zefan 479f614110 cgroup: Kill subsys.active flag
The only user was cpuacct.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155385A.4040207@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 13:54:22 +02:00
Tejun Heo 2219449a65 cgroup: remove cgroup_lock_is_held()
We don't want controllers to assume that the information is officially
available and do funky things with it.

The only user is task_subsys_state_check() which uses it to verify RCU
access context.  We can move cgroup_lock_is_held() inside
CONFIG_PROVE_RCU but that doesn't add meaningful protection compared
to conditionally exposing cgroup_mutex.

Remove cgroup_lock_is_held(), export cgroup_mutex iff CONFIG_PROVE_RCU
and use lockdep_is_held() directly on the mutex in
task_subsys_state_check().

While at it, add parentheses around macro arguments in
task_subsys_state_check().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07 09:29:51 -07:00
Tejun Heo b9777cf8d7 cgroup: unexport locking interface and cgroup_attach_task()
Now that all external cgroup_lock() users are gone, we can finally
unexport the locking interface and prevent future abuse of
cgroup_mutex.

Make cgroup_[un]lock() and cgroup_lock_live_group() static.  Also,
cgroup_attach_task() doesn't have any user left and can't be used
without locking interface anyway.  Make it static too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07 09:29:51 -07:00
Tejun Heo 8cc9934520 cgroup, cpuset: replace move_member_tasks_to_cpuset() with cgroup_transfer_tasks()
When a cpuset becomes empty (no CPU or memory), its tasks are
transferred with the nearest ancestor with execution resources.  This
is implemented using cgroup_scan_tasks() with a callback which grabs
cgroup_mutex and invokes cgroup_attach_task() on each task.

Both cgroup_mutex and cgroup_attach_task() are scheduled to be
unexported.  Implement cgroup_transfer_tasks() in cgroup proper which
is essentially the same as move_member_tasks_to_cpuset() except that
it takes cgroups instead of cpusets and @to comes before @from like
normal functions with those arguments, and replace
move_member_tasks_to_cpuset() with it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07 09:29:50 -07:00
Li Zefan 081aa458c3 cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()
These two functions share most of the code.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 07:50:25 -07:00
Li Zefan e7b2dcc52b cgroup: remove cgroup_is_descendant()
It was used by ns cgroup, and ns cgroup was removed long ago.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-12 15:35:58 -07:00
Li Zefan 7d8e0bf56a cgroup: avoid accessing modular cgroup subsys structure without locking
subsys[i] is set to NULL in cgroup_unload_subsys() at modular unload,
and that's protected by cgroup_mutex, and then the memory *subsys[i]
resides will be freed.

So this is unsafe without any locking:

  if (!ss || ss->module)
  ...

v2:
- add a comment for enum cgroup_subsys_id
- simplify the comment in cgroup_exit()

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-05 09:33:25 -08:00
Li Zefan 65dff759d2 cgroup: fix cgroup_path() vs rename() race
rename() will change dentry->d_name. The result of this race can
be worse than seeing partially rewritten name, but we might access
a stale pointer because rename() will re-allocate memory to hold
a longer name.

As accessing dentry->name must be protected by dentry->d_lock or
parent inode's i_mutex, while on the other hand cgroup-path() can
be called with some irq-safe spinlocks held, we can't generate
cgroup path using dentry->d_name.

Alternatively we make a copy of dentry->d_name and save it in
cgrp->name when a cgroup is created, and update cgrp->name at
rename().

v5: use flexible array instead of zero-size array.
v4: - allocate root_cgroup_name and all root_cgroup->name points to it.
    - add cgroup_name() wrapper.
v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path.
v2: make cgrp->name RCU safe.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-04 09:50:08 -08:00
Li Zefan be44562613 cgroup: remove synchronize_rcu() from cgroup_diput()
Free cgroup via call_rcu(). The actual work is done through
workqueue.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-01-24 12:05:18 -08:00
Tejun Heo 12a9d2fef1 cgroup: implement cgroup_rightmost_descendant()
Implement cgroup_rightmost_descendant() which returns the right most
descendant of the specified cgroup.  This can be used to skip the
cgroup's subtree while iterating with
cgroup_for_each_descendant_pre().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07 08:50:28 -08:00
Tejun Heo d5b1fe68ba cgroup: remove unused dummy cgroup_fork_callbacks()
5edee61ede ("cgroup: cgroup_subsys->fork() should be called after the
task is added to css_set") removed cgroup_fork_callbacks() but forgot
to remove its dummy version for !CONFIG_CGROUPS.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
2013-01-07 08:49:16 -08:00
Tao Ma d0b2fdd2a5 cgroup: remove obsolete guarantee from cgroup_task_migrate.
'guarantee' is already removed from cgroup_task_migrate, so remove
the corresponding comments. Some other typos in cgroup are also
changed.

Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-11-20 06:44:58 -08:00
Tejun Heo 0a950f65e1 cgroup: add cgroup->id
With the introduction of generic cgroup hierarchy iterators, css_id is
being phased out.  It was unnecessarily complex, id'ing the wrong
thing (cgroups need IDs, not CSSes) and has other oddities like not
being available at ->css_alloc().

This patch adds cgroup->id, which is a simple per-hierarchy
ida-allocated ID which is assigned before ->css_alloc() and released
after ->css_free().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
2012-11-19 09:02:12 -08:00
Tejun Heo 033fa1c5f5 cgroup, cpuset: remove cgroup_subsys->post_clone()
Currently CGRP_CPUSET_CLONE_CHILDREN triggers ->post_clone().  Now
that clone_children is cpuset specific, there's no reason to have this
rather odd option activation mechanism in cgroup core.  cpuset can
check the flag from its ->css_allocate() and take the necessary
action.

Move cpuset_post_clone() logic to the end of cpuset_css_alloc() and
remove cgroup_subsys->post_clone().

Loosely based on Glauber's "generalize post_clone into post_create"
patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Glauber Costa <glommer@parallels.com>
Original-patch: <1351686554-22592-2-git-send-email-glommer@parallels.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
2012-11-19 08:13:39 -08:00
Tejun Heo 2260e7fc1f cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
clone_children is only meaningful for cpuset and will stay that way.
Rename the flag to reflect that and update documentation.  Also, drop
clone_children() wrapper in cgroup.c.  The thin wrapper is used only a
few times and one of them will go away soon.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
2012-11-19 08:13:38 -08:00
Tejun Heo 92fb97487a cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
Rename cgroup_subsys css lifetime related callbacks to better describe
what their roles are.  Also, update documentation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:38 -08:00
Tejun Heo b1929db42f cgroup: allow ->post_create() to fail
There could be cases where controllers want to do initialization
operations which may fail from ->post_create().  This patch makes
->post_create() return -errno to indicate failure and online_css()
relay such failures.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
2012-11-19 08:13:38 -08:00
Tejun Heo a31f2d3ff7 cgroup: introduce CSS_ONLINE flag and on/offline_css() helpers
New helpers on/offline_css() respectively wrap ->post_create() and
->pre_destroy() invocations.  online_css() sets CSS_ONLINE after
->post_create() is complete and offline_css() invokes ->pre_destroy()
iff CSS_ONLINE is set and clears it while also handling the temporary
dropping of cgroup_mutex.

This patch doesn't introduce any behavior change at the moment but
will be used to improve cgroup_create() failure path and allow
->post_create() to fail.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:37 -08:00
Tejun Heo 38b53abaa3 cgroup: make CSS_* flags bit masks instead of bit positions
Currently, CSS_* flags are defined as bit positions and manipulated
using atomic bitops.  There's no reason to use atomic bitops for them
and bit positions are clunkier to deal with than bit masks.  Make
CSS_* bit masks instead and use the usual C bitwise operators to
access them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:36 -08:00
Tejun Heo febfcef60d cgroup: cgroup->dentry isn't a RCU pointer
cgroup->dentry is marked and used as a RCU pointer; however, it isn't
one - the final dentry put doesn't go through call_rcu().  cgroup and
dentry share the same RCU freeing rule via synchronize_rcu() in
cgroup_diput() (kfree_rcu() used on cgrp is unnecessary).  If cgrp is
accessible under RCU read lock, so is its dentry and dereferencing
cgrp->dentry doesn't need any further RCU protection or annotation.

While not being accurate, before the previous patch, the RCU accessors
served a purpose as memory barriers - cgroup->dentry used to be
assigned after the cgroup was made visible to cgroup_path(), so the
assignment and dereferencing in cgroup_path() needed the memory
barrier pair.  Now that list_add_tail_rcu() happens after
cgroup->dentry is assigned, this no longer is necessary.

Remove the now unnecessary and misleading RCU annotations from
cgroup->dentry.  To make up for the removal of rcu_dereference_check()
in cgroup_path(), add an explicit rcu_lockdep_assert(), which asserts
the dereference rule of @cgrp, not cgrp->dentry.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-19 08:13:36 -08:00
Tejun Heo 574bd9f7c7 cgroup: implement generic child / descendant walk macros
Currently, cgroup doesn't provide any generic helper for walking a
given cgroup's children or descendants.  This patch adds the following
three macros.

* cgroup_for_each_child() - walk immediate children of a cgroup.

* cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
  in pre-order tree traversal.

* cgroup_for_each_descendant_post() - visit all descendants of a
  cgroup in post-order tree traversal.

All three only require the user to hold RCU read lock during
traversal.  Verifying that each iterated cgroup is online is the
responsibility of the user.  When used with proper synchronization,
cgroup_for_each_descendant_pre() can be used to propagate state
updates to descendants in reliable way.  See comments for details.

v2: s/config/state/ in commit message and comments per Michal.  More
    documentation on synchronization rules.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-09 09:12:29 -08:00
Tejun Heo eb6fd5040e cgroup: use rculist ops for cgroup->children
Use RCU safe list operations for cgroup->children.  This will be used
to implement cgroup children / descendant walking which can be used by
controllers.

Note that cgroup_create() now puts a new cgroup at the end of the
->children list instead of head.  This isn't strictly necessary but is
done so that the iteration order is more conventional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-09 09:12:29 -08:00
Tejun Heo a8638030f6 cgroup: add cgroup_subsys->post_create()
Currently, there's no way for a controller to find out whether a new
cgroup finished all ->create() allocatinos successfully and is
considered "live" by cgroup.

This becomes a problem later when we add generic descendants walking
to cgroup which can be used by controllers as controllers don't have a
synchronization point where it can synchronize against new cgroups
appearing in such walks.

This patch adds ->post_create().  It's called after all ->create()
succeeded and the cgroup is linked into the generic cgroup hierarchy.
This plays the counterpart of ->pre_destroy().

When used in combination with the to-be-added generic descendant
iterators, ->post_create() can be used to implement reliable state
inheritance.  It will be explained with the descendant iterators.

v2: Added a paragraph about its future use w/ descendant iterators per
    Michal.

v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
    Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
2012-11-09 09:12:29 -08:00
Tejun Heo 1db1e31b1e Merge branch 'cgroup-rmdir-updates' into cgroup/for-3.8
Pull rmdir updates into for-3.8 so that further callback updates can
be put on top.  This pull created a trivial conflict between the
following two commits.

  8c7f6edbda ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them")
  ed95779340 ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs")

The former added a field to cgroup_subsys and the latter removed one
from it.  They happen to be colocated causing the conflict.  Keeping
what's added and removing what's removed resolves the conflict.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-11-05 09:21:51 -08:00
Tejun Heo bcf6de1b91 cgroup: make ->pre_destroy() return void
All ->pre_destory() implementations return 0 now, which is the only
allowed return value.  Make it return void.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
2012-11-05 09:16:59 -08:00
Tejun Heo b25ed609d0 cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir()
CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
destruction rollback somewhat working.  cgroup_rmdir() used to drain
CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
helpers were used to allow the task performing rmdir to wait for the
next relevant event.

Unfortunately, the wait is visible to controllers too and the
mechanism got exposed to memcg by 887032670d ("cgroup avoid permanent
sleep at rmdir").

Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
unnecessary.  Remove it and all the mechanisms supporting it.  Note
that memcontrol.c changes are essentially revert of 887032670d
("cgroup avoid permanent sleep at rmdir").

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
2012-11-05 09:16:59 -08:00
Tejun Heo e93160803f cgroup: kill CSS_REMOVED
CSS_REMOVED is one of the several contortions which were necessary to
support css reference draining on cgroup removal.  All css->refcnts
which need draining should be deactivated and verified to equal zero
atomically w.r.t. css_tryget().  If any one isn't zero, all refcnts
needed to be re-activated and css_tryget() shouldn't fail in the
process.

This was achieved by letting css_tryget() busy-loop until either the
refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
(committing to removal).

Now that css refcnt draining is no longer used, there's no need for
atomic rollback mechanism.  css_tryget() simply can look at the
reference count and fail if it's deactivated - it's never getting
re-activated.

This patch removes CSS_REMOVED and updates __css_tryget() to fail if
the refcnt is deactivated.  As deactivation and removal are a single
step now, they no longer need to be protected against css_tryget()
happening from irq context.  Remove local_irq_disable/enable() from
cgroup_rmdir().

Note that this removes css_is_removed() whose only user is VM_BUG_ON()
in memcontrol.c.  We can replace it with a check on the refcnt but
given that the only use case is a debug assert, I think it's better to
simply unexport it.

v2: Comment updated and explanation on local_irq_disable/enable()
    added per Michal Hocko.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
2012-11-05 09:16:58 -08:00
Tejun Heo ed95779340 cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs
2ef37d3fe4 ("memcg: Simplify mem_cgroup_force_empty_list error
handling") removed the last user of __DEPRECATED_clear_css_refs.  This
patch removes __DEPRECATED_clear_css_refs and mechanisms to support
it.

* Conditionals dependent on __DEPRECATED_clear_css_refs removed.

* cgroup_clear_css_refs() can no longer fail.  All that needs to be
  done are deactivating refcnts, setting CSS_REMOVED and putting the
  base reference on each css.  Remove cgroup_clear_css_refs() and the
  failure path, and open-code the loops into cgroup_rmdir().

This patch keeps the two for_each_subsys() loops separate while open
coding them.  They can be merged now but there are scheduled changes
which need them to be separate, so keep them separate to reduce the
amount of churn.

local_irq_save/restore() from cgroup_clear_css_refs() are replaced
with local_irq_disable/enable() for simplicity.  This is safe as
cgroup_rmdir() is always called with IRQ enabled.  Note that this IRQ
switching is necessary to ensure that css_tryget() isn't called from
IRQ context on the same CPU while lower context is between CSS
deactivation and setting CSS_REMOVED as css_tryget() would hang
forever in such cases waiting for CSS to be re-activated or
CSS_REMOVED set.  This will go away soon.

v2: cgroup_call_pre_destroy() removal dropped per Michal.  Commit
    message updated to explain local_irq_disable/enable() conversion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-11-05 09:16:58 -08:00
Tejun Heo 5edee61ede cgroup: cgroup_subsys->fork() should be called after the task is added to css_set
cgroup core has a bug which violates a basic rule about event
notifications - when a new entity needs to be added, you add that to
the notification list first and then make the new entity conform to
the current state.  If done in the reverse order, an event happening
inbetween will be lost.

cgroup_subsys->fork() is invoked way before the new task is added to
the css_set.  Currently, cgroup_freezer is the only user of ->fork()
and uses it to make new tasks conform to the current state of the
freezer.  If FROZEN state is requested while fork is in progress
between cgroup_fork_callbacks() and cgroup_post_fork(), the child
could escape freezing - the cgroup isn't frozen when ->fork() is
called and the freezer couldn't see the new task on the css_set.

This patch moves cgroup_subsys->fork() invocation to
cgroup_post_fork() after the new task is added to the css_set.
cgroup_fork_callbacks() is removed.

Because now a task may be migrated during cgroup_subsys->fork(),
freezer_fork() is updated so that it adheres to the usual RCU locking
and the rather pointless comment on why locking can be different there
is removed (if it doesn't make anything simpler, why even bother?).

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
2012-10-16 15:03:14 -07:00
Linus Torvalds 68d47a137c Merge branch 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup hierarchy update from Tejun Heo:
 "Currently, different cgroup subsystems handle nested cgroups
  completely differently.  There's no consistency among subsystems and
  the behaviors often are outright broken.

  People at least seem to agree that the broken hierarhcy behaviors need
  to be weeded out if any progress is gonna be made on this front and
  that the fallouts from deprecating the broken behaviors should be
  acceptable especially given that the current behaviors don't make much
  sense when nested.

  This patch makes cgroup emit warning messages if cgroups for
  subsystems with broken hierarchy behavior are nested to prepare for
  fixing them in the future.  This was put in a separate branch because
  more related changes were expected (didn't make it this round) and the
  memory cgroup wanted to pull in this and make changes on top."

* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
2012-10-02 10:52:28 -07:00
Tejun Heo 8c7f6edbda cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
Currently, cgroup hierarchy support is a mess.  cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children.  blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup.  Others show yet different behaviors.

These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy.  Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support.  The goal of
this patch is two-fold.

* Move users away from using hierarchy on currently non-hierarchical
  subsystems, so that implementing proper hierarchy support on those
  doesn't surprise them.

* Keep track of which controllers are broken how and nudge the
  subsystems to implement proper hierarchy support.

For now, start with a single warning message.  We can whine louder
later on.

v2: Fixed a typo spotted by Michal. Warning message updated.

v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true.  Fixed a typo spotted by
    Glauber.

v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal.  Dropped unnecessary
    memcg root handling per Michal.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2012-09-14 12:01:16 -07:00
Daniel Wagner a6f00298b2 cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
Since we know exactly how many subsystems exists at compile time we are
able to define CGROUP_SUBSYS_COUNT correctly. CGROUP_SUBSYS_COUNT will
be at max 12 (all controllers enabled). Depending on the architecture
we safe either 32 - 12 pointers (80 bytes) or 64 - 12 pointers (416
bytes) per cgroup.

With this change we can also remove the temporary placeholder to avoid
compilation errors.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
2012-09-14 09:57:47 -07:00