Mastering Debian Cgroups with Debian Squeeze

Cgroups are important. They are a relief to all admins of heavily loaded machines.

What are Cgroups?

Cgroups can be used to improve the way Linux distributes CPU time among processes. Instead of scheduling the time per process, the time is first distributed among the Cgroups, and the time given to each Cgroup is then distributed among the processes in that Cgroup.

Cgroups work best when only a few Cgroups are active in parallel and the majority of tasks are bound to a proper Cgroup.

How Cgroups are handled

Cgroups are like Process Groups. But while Process Groups are mainly used for signal propagation of the controlling TTY (or for better control of forked processes), Cgroups are used to distribute all the processing time. The idea is that Cgroups are inherited from the parent, such that all processes which are started run in the same Cgroup.

This can improve things a lot when the scheduler sees many processes in parallel. For example, if you run "make -j128" (that is, 128 parallel makes), the CPU time of other processes is reduced to barely anything. When these 128 makes are started within a single Cgroup, other processes will hardly notice any difference, as the 128 processes only compete against each other, not against the rest of the system. The interesting part is that this holds on a single core system as well as on a 128 core system, but on a 128 core system the make will run a lot faster.

So with Cgroups you do not need to worry about the CPU architecture any more. They also allow better quota handling of CPU resources.
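As a rough illustration of the numbers behind this, the following sketch computes the CPU fraction left for one other busy process. It is a simplified model, not a measurement: it assumes a single core, equal weights, competitors that are always runnable, and that the other process does not sit in the same Cgroup as the makes. The function name cpu_fraction is made up for this example.

```shell
# cpu_fraction OTHERS
# Fraction of one CPU that a single busy entity gets when it competes
# with equal weight against OTHERS other busy entities. An entity is
# either a process or a whole Cgroup. Simplified model, see assumptions.
cpu_fraction() {
    awk -v others="$1" 'BEGIN { printf "%.1f%%\n", 100 / (others + 1) }'
}

# Without Cgroups: the process competes against 128 makes.
cpu_fraction 128    # prints 0.8%

# With Cgroups: it competes against one group holding all 128 makes.
cpu_fraction 1      # prints 50.0%
```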

Having said that, I still do not know how to do the same for the following:

I would like to have a way to handle processes in the same manner for N filesystem coils (or for limited IO transactions). That is, a "make -j1024" should not compete 1024 times with a single other process waiting for IO; instead, all those 1024 processes should compete together (like a single process) with that other process. Or even less: perhaps I want the make to use only free bandwidth, so it has to be throttled.

The same is true if you are somewhat short of RAM. You can then kill your machine by starting too many processes. I would like to see a way to let process groups compete for the free RAM and paging, instead of having the individual processes compete. For example, one process group shall get 50% of the RAM while another process group gets only 10%. If there is RAM left, it can be distributed. But if not, the 50% process group is NOT paged out as long as it stays below 50% of the RAM.

Perhaps this can be done with Cgroups in the future (at least, reading about what IBM wants, it seems so), but I am not yet completely sure.

How to enable Cgroups in Squeeze

apt-get install cgroup-bin
This automatically starts two daemons. (I really hate that, the default should be to keep the daemons shut down as there really is some bad setup here.)

How to fix Cgroups in Squeeze

I really have no idea why Cgroups are mounted to /mnt/cgroups/* in Debian, because this is a ridiculous path. The right path is one of:

  • /dev/cgroup/
  • /sys/cgroup/
  • /proc/cgroup/
  • /cgroup/
  • /tmp/cgroup/
but certainly not /mnt/cgroups! So we first have to fix that by changing the configuration.

Note that with my changes, once the new kernel support is there in the future, we can do
rmdir /cgroups/* /cgroups
ln -s /sys/cgroup /cgroups
to allow our old scripts to continue to work, but this is impossible today.

So the standard that I would propose is to have the following structure:
/cgroup/TYPE/   via: mount -o TYPE -t cgroup cgroup:TYPE /cgroup/TYPE

Or programmatically:
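# Mount every Cgroup type the kernel knows under /cgroups/TYPE.
# The first line of /proc/cgroups is a header, so it is skipped.
awk 'NR>1 {print $1}' /proc/cgroups |
while read -r a
do
  b="/cgroups/$a"
  mkdir -p "$b"
  mount -t cgroup -o "$a" "cgroup:$a" "$b"
done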

This would be self-explanatory, but as always, nobody else cares except me.

Note that you can do this as shown here. You can mount Cgroups multiple times, no problem, as usual with Unix.

So what I propose is to have the daemons work in /cgroups/ while you work in /cgroup/.

Change the config

vi /etc/cgconfig.conf

Change all /mnt/cgroups/ to /cgroups/
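If you prefer not to edit the file by hand, the replacement can be sketched as below. The function name rewrite_cgroup_paths is made up for this example, and it assumes the paths in the file really start with /mnt/cgroups/ as described above. A copy of the original is kept as FILE.orig.

```shell
# rewrite_cgroup_paths FILE
# Replace every /mnt/cgroups/ prefix in FILE with /cgroups/.
# The unmodified file is kept as FILE.orig.
rewrite_cgroup_paths() {
    cp -p "$1" "$1.orig" &&
    sed 's|/mnt/cgroups/|/cgroups/|g' "$1.orig" > "$1"
}
```

It would be called as "rewrite_cgroup_paths /etc/cgconfig.conf" before the daemons are restarted.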

Also add the following entries:

(Currently there is nothing important to add.)

For now, placing it directly into the root FS is probably the best thing. Apart from that I would recommend /sys/cgroup/ (that is where the future kernel patches will place the FS, I think I have read), but sysfs does not allow creating directories, and /proc/cgroup is not there today either. So we are stuck with /dev/ or /tmp/, neither of which is a good place I fear, and /var/ is out of the question.

Do not forget to clean up the old debris:

umount /mnt/cgroups/*
rmdir /mnt/cgroups/* /mnt/cgroups

You can also change the policies with following command:

vi /etc/cgrules.conf

Restart the daemons

/etc/init.d/cgred stop
/etc/init.d/cgconfig stop
/etc/init.d/cgconfig start
/etc/init.d/cgred start

If you have trouble reloading a config, the scripts seem to bail out until the next reboot. You can fix that as follows:
/etc/init.d/cgconfig stop
umount /cgroups/*
rmdir /cgroups/* /cgroups
/etc/init.d/cgconfig start

BUG ALERT

Warning! The daemons are seriously broken. They wipe all the current settings when started. This includes orphaning(!) processes which are in a different Cgroup OR have changed their Cgroup to another value. So anything the daemons are not aware of is simply killed and destroyed by the daemons on startup.

So I can only conclude that the daemons are not yet suitable for production systems, because an "apt-get upgrade" can bring your carefully corrected system into an unusable state. YOU HAVE BEEN WARNED!

Conclusion:

Disable these daemons, do not use them, do it yourself.

The problem:

The problem is the command "cgclear" which is invoked on "/etc/init.d/cgconfig stop"

Standards in this document

The following assumptions are made in the text below:

  • Cgroups are mounted as /cgroups/TYPE/
  • Daemons are installed and running properly

How to administer Cgroups

To list which Cgroups are active use
grep '^cgroup' /proc/mounts
I consider it a bug that the mounts do not show up in "mount" command yet.

To list which types of Cgroups are known use
cat /proc/cgroups
See below for a short note about each type.
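The two listings can also be condensed into a more script-friendly form. The following is only a sketch; the function names are made up. It matches on the filesystem type in /proc/mounts rather than on the device name, and it shows only those types which /proc/cgroups marks as enabled in its fourth column.

```shell
# cgroup_mounts [MOUNTS_FILE]
# Print "mount point <TAB> mount options" for every mounted Cgroup
# hierarchy. Matches the filesystem type (3rd field of /proc/mounts).
cgroup_mounts() {
    awk '$3 == "cgroup" { print $2 "\t" $4 }' "${1:-/proc/mounts}"
}

# cgroup_types [CGROUPS_FILE]
# Print the Cgroup types which the kernel reports as enabled
# (4th column is 1). Lines starting with # are the header.
cgroup_types() {
    awk '!/^#/ && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}
```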

Maintaining Cgroups is easy. The things you must know are:

  • Cgroups are administered hierarchically: the directories (path) are the name of the Cgroup, the filenames are the settings. You can choose the directory names yourself.
  • Child Cgroups live within their parent (directory); this way they inherit its settings and are restricted by the parent group. You can define the hierarchy as you like.
  • The root (empty Cgroup name or /) is the mount point, of course. There are some system default groups, too.
  • Processes which are not in any Cgroup are running in the root Cgroup.
  • There is a Cgroup for each type (the argument to mount -o). Cgroups can have more than one type, but it is better to keep it simple (either single types or all types).
  • Cgroups apply all the settings to each PID which is in the Cgroup.
  • Cgroups administration is PID based.
  • Cgroups are inherited, so a forked process inherits the Cgroup of its parent.
  • Access to Cgroups is handled through filesystem access rights; initially only root is allowed to alter Cgroups.
The two daemons only support us in automatically applying certain settings to new processes:

  • The first daemon mounts the cgroups filesystem and creates the Cgroups with the right parameters.
  • The second daemon applies rules to newly created processes, such that these processes get the right Cgroup.
Sadly the daemons do not learn what we changed on the filesystem level. You can also see that as a feature: manual changes are of a temporary nature until frozen in the config files.
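The rules above (a directory is a Cgroup, the tasks file takes PIDs, children inherit the group) can be tried out by hand without the daemons. The following is a minimal sketch, assuming the /cgroups/TYPE/ layout used in this text. The names CGROUP_ROOT, cg_create and cg_run are made up for this example.

```shell
# Where the hierarchies are mounted (assumption: /cgroups/TYPE/).
CGROUP_ROOT=${CGROUP_ROOT:-/cgroups}

# cg_create TYPE NAME
# A Cgroup is just a directory below the mount point of its type.
cg_create() {
    mkdir -p "$CGROUP_ROOT/$1/$2"
}

# cg_run TYPE NAME COMMAND [ARGS...]
# Start COMMAND inside the given Cgroup: a fresh shell writes its own
# PID into the tasks file and then replaces itself with COMMAND, so
# COMMAND and everything it forks inherit the group.
cg_run() {
    dir="$CGROUP_ROOT/$1/$2"
    shift 2
    sh -c 'echo "$$" > "$0/tasks" && exec "$@"' "$dir" "$@"
}
```

For example "cg_create cpu build" followed by "cg_run cpu build make -j128" would run the whole make inside /cgroups/cpu/build.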

Manuals

Currently the following manuals are in the package:
cgconfigparser.8
cgrulesengd.8

cgred.conf.5
cgconfig.conf.5
cgrules.conf.5

cgdelete.1
cgset.1
cgexec.1
cgget.1
cgclassify.1
cgclear.1
cgcreate.1
lscgroup.1
lssubsys.1

The documentation is very scarce. Perhaps have a look into the kernel documentation, which is much more useful:

Cgroup types

Common files to all Cgroup types:
  • cgroup.procs: output: list of TGIDs (not necessarily unique) -- Under Linux the PID is the thread ID while the TGID is the POSIX compliant PID (the main thread). So if you have a multithreaded application like Java, you might see thousands of different thread IDs (Linux PIDs) in tasks and the single TGID of the main Java process showing up in cgroup.procs a thousand times (always the same number).
  • notify_on_release: If enabled (1), the release_agent is invoked when the Cgroup becomes empty. Note that once a Cgroup is empty, no new PIDs can show up by themselves, unless you add them to the group from another script running in parallel. So the "exit" is free of race conditions and the "enter" can be controlled programmatically.
  • release_agent: Path to a program to invoke when notify_on_release is true. This file only shows up in the root Cgroup!
  • tasks: input: PIDs to add to this group -- output: Unsorted list of PIDs which are in this Cgroup
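How notify_on_release and release_agent fit together can be sketched as follows. The function name cg_watch_release is made up, and the agent path in the usage line is purely hypothetical. According to the kernel documentation the agent is called with the path of the emptied Cgroup, relative to the mount point, as its argument.

```shell
# cg_watch_release MOUNTPOINT GROUP AGENT
# Have the kernel run AGENT as soon as GROUP (a path relative to
# MOUNTPOINT) becomes empty.
cg_watch_release() {
    # release_agent only exists in the root Cgroup of the hierarchy.
    echo "$3" > "$1/release_agent" &&
    # notify_on_release is set per Cgroup.
    echo 1 > "$1/$2/notify_on_release"
}
```

A call could look like "cg_watch_release /cgroups/cpu build /usr/local/sbin/cg-cleanup", where cg-cleanup stands for a script of your own, for example one that removes the empty directory.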

cpu

This group manages CPU shares. The shares are distributed hierarchically, that is, each PID in the root group and each sub-group share the same percentage of processing time. So child groups cannot get more CPU than the parent group.

Files:
  • cpu.shares: input: The CPU shares for this group -- output: The set CPU shares
By default most PIDs are in /cgroups/cpu/sysdefault/, only a few tasks run in /cgroups/cpu/

By default cpu.shares is set to 1024. With a higher value the Cgroup gets a larger quantum of the CPU, with a lower value it gets a reduced fraction.

Note that you cannot raise the share of a child above that of the parent; the shares of all siblings are calculated at the parent level.

So if there are 2 siblings, ONE with a cpu.shares value of 1 and TWO with a value of 2, then ONE gets 1/3 of the CPU time which is given to the parent and TWO gets 2/3. If there are tasks in the parent, these get 1024/(1024+1+2) parts of the CPU (I think, I did not test it).
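The sibling arithmetic can be checked with a small helper. This is only a sketch of the proportional rule described here, assuming that all siblings are busy all the time; the function name share_percent is made up.

```shell
# share_percent SHARES...
# Given the cpu.shares values of all sibling Cgroups, print each value
# together with the percentage of the parent's CPU time it receives.
share_percent() {
    total=0
    for s in "$@"; do
        total=$((total + s))
    done
    for s in "$@"; do
        awk -v s="$s" -v t="$total" \
            'BEGIN { printf "%d\t%.1f%%\n", s, 100 * s / t }'
    done
}

# The siblings ONE and TWO from the example:
share_percent 1 2
# 1	33.3%
# 2	66.7%
```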

memory

This is missing from the Debian kernel for now, so I am unable to test it. It seems to be available from 2.6.35 but Debian is at 2.6.32. I will document it as soon as my kernels have reached this version and I start to use it.

Documentation is at www.kernel.org/doc/Documentation/cgroups/memory.txt

cpuacct

This is very interesting for accounting. Read www.kernel.org/doc/Documentation/cgroups/cpuacct.txt
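As a small taste: according to that document the file cpuacct.usage holds the total CPU time consumed by the group in nanoseconds. A sketch for reading it in seconds follows; the function name cg_cpu_seconds is made up.

```shell
# cg_cpu_seconds CGROUP_DIR
# Print the CPU time consumed so far by the Cgroup, in seconds.
# cpuacct.usage contains the value in nanoseconds.
cg_cpu_seconds() {
    awk '{ printf "%.3f\n", $1 / 1000000000 }' "$1/cpuacct.usage"
}
```

With the layout assumed in this text, the call would be "cg_cpu_seconds /cgroups/cpuacct/name".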

cpuset

Here you can restrict processes to CPUs, CPU groups, memory nodes (NUMA; a PC usually is not NUMA, so there is only 1 node) etc.

The most important thing perhaps is that you can reserve a CPU for certain system tasks or pin some processes to some CPUs. But this area is so broad that you had better read www.kernel.org/doc/Documentation/cgroups/cpusets.txt
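Pinning a process could be sketched like this. It assumes the hierarchy was mounted with -o cpuset, so that the files are called cpuset.cpus and cpuset.mems. A fresh cpuset starts with both of them empty and does not accept tasks until both are set. The function name cpuset_pin is made up.

```shell
# cpuset_pin CPUSET_MOUNT NAME CPUS MEMS PID
# Create the cpuset NAME, allow it the given CPUs and memory nodes,
# then move PID into it.
cpuset_pin() {
    dir="$1/$2"
    mkdir -p "$dir" &&
    # Both files must be filled before a task can be attached.
    echo "$3" > "$dir/cpuset.cpus" &&
    echo "$4" > "$dir/cpuset.mems" &&
    echo "$5" > "$dir/tasks"
}
```

For example "cpuset_pin /cgroups/cpuset pinned 0 0 $$" would restrict the current shell, and everything started from it afterwards, to CPU 0 and memory node 0.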

Files:

devices

This controls access to devices. It is meant to partition a system such that tasks cannot interfere with each other, for example such that even root users cannot access the root block filesystem. It seems a little bit theoretical to me and not yet able to improve security on multi user systems. However, it may be good for providing some testing scenario, like booting another image and restricting the devices it sees.

Documentation is at www.kernel.org/doc/Documentation/cgroups/devices.txt

Files:

freezer

This interface is extremely interesting. With it you are able to freeze the given Cgroup such that it ceases to use CPU time. The interesting part is that it is done on a task list instead of a single task. So it's very convenient.

Theoretically you now can save the core, record all the /proc/PID/ settings, read out the buffers of open PIPEs and then kill the task(s). Later you can reinstate the /proc/PID/ settings (by opening everything, re-filling the PIPEs, seeking, etc.) and then load the core and continue it where it left off. This can even be on another node.

(This will not work with every task type out of the box, as most likely you are not able to keep the TCP session states to external systems, but apart from this it is highly interesting to be able to "suspend" a task or task list to disk.)

Documentation is at www.kernel.org/doc/Documentation/cgroups/freezer-subsystem.txt

Files:
  • freezer.state: input: THAWED/FROZEN -- output: THAWED/FREEZING/FROZEN -- This file does not show up in the root Cgroup, as freezing the entire system would be a little bit weird.
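Since the state can read FREEZING for a while, a script should wait until FROZEN is actually reported. A minimal sketch follows; the function names cg_freeze and cg_thaw and the limit of 10 attempts are arbitrary choices for this example.

```shell
# cg_freeze CGROUP_DIR
# Request a freeze and wait until the kernel reports FROZEN.
# Gives up (returns 1) after about 10 seconds.
cg_freeze() {
    echo FROZEN > "$1/freezer.state" || return 1
    i=0
    while [ "$(cat "$1/freezer.state")" != "FROZEN" ]; do
        i=$((i + 1))
        [ "$i" -lt 10 ] || return 1
        sleep 1
    done
}

# cg_thaw CGROUP_DIR
# Let the tasks of the Cgroup continue.
cg_thaw() {
    echo THAWED > "$1/freezer.state"
}
```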

net_cls

I was not able to find any documentation about this.

Files:
  • net_cls.classid: I have no idea

ns

I was not able to find any documentation about this.

Files:
  • (none)
As there are no files in this class this seems to be just a way to track if there is still some process existing in a class (see notify_on_release).

Admin: The filesystem way

This is the most natural way to handle Cgroups. However this will not survive a reboot. Perhaps this is a good thing as long as you experiment with it.

Add a PID to a Cgroup

Just do
echo PID >> /cgroup/TYPE/name/tasks

If PID is in another Cgroup, it is removed there automatically. All newly created child tasks then will inherit the new Cgroup.
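One thing to keep in mind: the tasks file works on Linux PIDs, that is on single threads, so writing the PID of a multithreaded process moves only that one thread. A sketch that moves all threads which exist at that moment is shown below. The function name cg_move_process is made up, and the optional third argument exists only so that the function can be tried against a directory other than /proc.

```shell
# cg_move_process PID CGROUP_DIR [PROC_DIR]
# Move every thread of the process PID into the Cgroup.
# The threads are listed as directories under /proc/PID/task/.
cg_move_process() {
    for t in "${3:-/proc}/$1"/task/*; do
        # One thread ID per write, as the tasks file expects.
        echo "${t##*/}" >> "$2/tasks"
    done
}
```

It would be used like "cg_move_process PID /cgroup/TYPE/name". Threads created while the loop runs may be missed, so this is not race free.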

Remove a PID from a Cgroup

This is impossible. Add it to another Cgroup. If you want to freeze a PID use the freezer interface.
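The closest thing to a removal is to move the tasks back into the root Cgroup, whose tasks file sits directly in the mount point. The sketch below does that for a whole group; the function name cg_evacuate is made up, and tasks which appear in the group while the loop runs are not covered.

```shell
# cg_evacuate MOUNTPOINT GROUP
# Move every task of GROUP (a path relative to MOUNTPOINT) into the
# root Cgroup of the same hierarchy.
cg_evacuate() {
    while read -r pid; do
        # Errors are ignored, a task may have exited in the meantime.
        echo "$pid" >> "$1/tasks" 2>/dev/null
    done < "$1/$2/tasks"
}
```

It would be called as "cg_evacuate /cgroup/TYPE name".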