Mastering Debian Cgroups with Debian Squeeze
Cgroups are important. They are a relief to all admins of heavily loaded machines.
What are Cgroups?
Cgroups can be used to improve the way Linux distributes CPU time among processes. Instead of scheduling time per process, the scheduler first divides the time among the Cgroups, and the time given to a Cgroup is then distributed among all processes in that Cgroup. Cgroups work best when only a few Cgroups are active in parallel and the majority of tasks are bound to a proper Cgroup.
How Cgroups are handled
Cgroups are like process groups. But while process groups are mainly used for signal propagation from the controlling TTY (or for better control of forked processes), Cgroups are used to distribute processing time. Cgroups are inherited from the parent, so all processes started by a process run in the same Cgroup. This can improve things a lot when the scheduler sees many processes in parallel. For example, if you run "make -j128" (128 parallel makes) without Cgroups, this reduces the CPU time of other processes to barely anything. When these 128 makes are started within a single Cgroup, other processes are hardly affected, as the 128 processes only compete against each other, not against the rest of the system. The interesting part is that this holds on a single-core system as well as on a 128-core system, except that on the 128-core system the make will run a lot faster. So with Cgroups you no longer need to worry about the CPU architecture. They also allow better quota handling of CPU resources.
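The effect can be put into rough numbers. The following is only a sketch under simplifying assumptions: a single, fully loaded core, every Cgroup weighted equally, and every process inside a Cgroup weighted equally. The helper name share_pct is made up for this illustration.

```shell
#!/bin/sh
# share_pct NGROUPS NPROCS
# Rough CPU percentage one process receives when NGROUPS equally weighted
# Cgroups compete for one fully loaded core and the process shares its own
# Cgroup with NPROCS processes in total (illustrative model only).
share_pct() {
    awk -v g="$1" -v p="$2" 'BEGIN { printf "%.2f\n", 100 / g / p }'
}

# Without Cgroups: 128 makes and one other process compete directly,
# which is the same as a single group holding 129 processes.
share_pct 1 129    # share of the other process

# With two Cgroups: the other process is alone in its group,
# the 128 makes share the second group.
share_pct 2 1      # share of the other process
share_pct 2 128    # share of each single make
```

Under these assumptions the other process goes from 0.78% to 50.00% of the core, while each make drops to 0.39%.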
Having said that, I still do not know how to do the same for the following:
I would like a way to treat processes the same way for N filesystem coils (or a limited number of IO transactions). That is, a "make -j1024" should not compete 1024 times with a single other process waiting for IO; instead all those 1024 processes should compete together (like a single process) with the other single process. Or even less: perhaps I want the make to use only free bandwidth, so it has to be throttled.
The same is true if you are somewhat short of RAM. You can then kill your machine by starting too many processes. I would like to see a way to let process groups, rather than single processes, compete for free RAM and paging. For example, one process group shall get 50% of the RAM while another process group gets only 10%. If there is RAM left, it can be distributed. But if not, the 50% process group is NOT paged out as long as it stays below 50% of the RAM.
Perhaps this can be done with Cgroups in the future (at least from reading what IBM wants, it seems so), but I am not yet completely sure.
How to enable Cgroups in Squeeze
apt-get install cgroup-bin
How to fix Cgroups in Squeeze
I really have no idea why Cgroups are mounted to /mnt/cgroups/* in Debian, because this is a ridiculous path. The right path is one of:
- /dev/cgroup/
- /sys/cgroup/
- /proc/cgroup/
- /cgroup/
- /tmp/cgroup/
rmdir /cgroups/* /cgroups
ln -s /sys/cgroup /cgroups
Each type is mounted at /cgroup/TYPE/ via:
mount -o TYPE -t cgroup cgroup:TYPE /cgroup/TYPE
awk 'NR>1 {print $1}' /proc/cgroups |
while read -r a
do
b="/cgroups/$a"
mkdir -p "$b"
mount -tcgroup -o"$a" "cgroup:$a" "$b"
done
Note that this works as shown here: you can mount Cgroups multiple times without any problem, as usual with Unix.
So what I propose is to have the daemons work in /cgroups/ while you work in /cgroup/.
Change the config
vi /etc/cgconfig.conf
umount /mnt/cgroups/*
rmdir /mnt/cgroups/* /mnt/cgroups
vi /etc/cgrules.conf
Restart the daemons
/etc/init.d/cgred stop
/etc/init.d/cgconfig stop
/etc/init.d/cgconfig start
/etc/init.d/cgred start
/etc/init.d/cgconfig stop
umount /cgroups/*
rmdir /cgroups/* /cgroups
/etc/init.d/cgconfig start
BUG ALERT
Warning! The daemons are seriously broken. They kill all current settings when started. This includes orphaning(!) processes which are in a different Cgroup OR have changed their Cgroup to another value. So anything the daemons are not aware of is simply destroyed by the daemons on startup. I can only conclude that the daemons are not yet suitable for production systems, because an "apt-get upgrade" can bring your carefully corrected system into an unusable state. YOU HAVE BEEN WARNED! Conclusion: Disable these daemons, do not use them, do it yourself. The problem: The command "cgclear", which is invoked on "/etc/init.d/cgconfig stop".
Standards in this document
The following assumptions are made in the text below:
- Cgroups are mounted as /cgroups/TYPE/
- Daemons are installed and running properly
How to administer Cgroups
To list which Cgroups are active use:
grep '^cgroup' /proc/mounts
cat /proc/cgroups
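The raw table in /proc/cgroups can be condensed a little. The following is a minimal sketch, assuming the usual four columns (subsystem name, hierarchy ID, number of Cgroups, enabled flag) and assuming that a hierarchy ID of 0 means the type is not mounted anywhere. The function name list_subsystems is hypothetical.

```shell
# list_subsystems [FILE]
# Print "NAME STATE" for every enabled subsystem found in FILE
# (default: /proc/cgroups). STATE is "mounted" when the hierarchy ID in
# column 2 is non-zero (assumption: 0 means not attached), otherwise
# "unmounted". Comment lines and disabled subsystems are skipped.
list_subsystems() {
    awk '!/^#/ && $4 == 1 {
        print $1, ($2 != 0 ? "mounted" : "unmounted")
    }' "${1:-/proc/cgroups}"
}
```

Calling list_subsystems without an argument reads the live table. Passing a file name makes it easy to try the function on a saved copy.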
- Cgroups are administered hierarchically, the directories (path) are the name of the Cgroup, the filenames are the settings. You can choose the directory names by yourself.
- Child Cgroups are within their parent (directory) and inherit their settings this way and are restricted by the parental group. You can define the hierarchy as you like.
- The root (empty Cgroup name or /) is the mount point, of course. There are some system default groups, too.
- Processes which are not in any Cgroup are running in the root Cgroup.
- There is a Cgroup for each type (the argument to mount -o). Cgroups can have more than one type; however, it is better to keep it simple (either single types or all types).
- Cgroups apply all the settings to each PID which is in the Cgroup.
- Cgroups administration is PID based.
- Cgroups are inherited, so a forked process inherits the Cgroup of its parent.
- Access to Cgroups is handled through filesystem access rights; initially only root is allowed to alter Cgroups.
- The first daemon (cgconfig) mounts the cgroups filesystem and creates the Cgroups with the right parameters.
- The second daemon (cgred) applies rules to newly created processes, such that these processes get the right Cgroup.
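To see membership and inheritance at work, you can ask the kernel which Cgroup a process is in. /proc/PID/cgroup holds one line per mounted hierarchy in the form ID:TYPES:PATH, where TYPES is a comma-separated list. The following is a minimal sketch; the function name cgroup_of is hypothetical.

```shell
# cgroup_of TYPE [FILE]
# Print the Cgroup path of a process for the given TYPE (cpu, freezer, ...).
# FILE defaults to /proc/self/cgroup; pass /proc/PID/cgroup to inspect
# another process. Prints nothing if TYPE is not mounted.
cgroup_of() {
    awk -F: -v want="$1" '
        {
            n = split($2, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == want) {
                    # Everything after the second colon is the path,
                    # even if the path itself contains colons.
                    print substr($0, length($1) + length($2) + 3)
                    exit
                }
        }
    ' "${2:-/proc/self/cgroup}"
}
```

A process that was never moved reports / (the root Cgroup). A child forked from a shell in some other group reports the same path as that shell.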
Manuals
Currently the following manuals are in the package:
cgconfigparser.8
cgrulesengd.8
cgred.conf.5
cgconfig.conf.5
cgrules.conf.5
cgdelete.1
cgset.1
cgexec.1
cgget.1
cgclassify.1
cgclear.1
cgcreate.1
lscgroup.1
lssubsys.1
Cgroup types
Common files to all Cgroup types:
- cgroup.procs: output: list of TGIDs (not necessarily unique) -- Under Linux the PID is the thread ID while the TGID is the POSIX-compliant PID (that of the main thread). So if you have a multithreaded application like Java you might see thousands of different thread IDs (Linux PIDs) in tasks, and the single TGID of the main Java process showing up in cgroup.procs thousands of times (always the same number).
- notify_on_release: If enabled (1), the release_agent is invoked when the Cgroup becomes empty. Note that once a Cgroup is empty, no new PIDs can show up in it by themselves, unless you add them to the group from another script running in parallel. So the "exit" is free of race conditions and the "enter" can be controlled programmatically.
- release_agent: Path to a program to invoke when notify_on_release is true. This file only shows up in the root Cgroup!
- tasks: input: PIDs to add to this group -- output: Unsorted list of PIDs which are in this Cgroup
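The difference between tasks and cgroup.procs becomes visible when you count both. The following is a minimal sketch, assuming both files exist in the given Cgroup directory; the function name count_members is hypothetical.

```shell
# count_members DIR
# Print "THREADS PROCESSES" for the Cgroup directory DIR: the number of
# entries in tasks (thread IDs) and the number of distinct TGIDs in
# cgroup.procs (duplicates are collapsed by sort -u).
count_members() {
    threads=$(grep -c . "$1/tasks")
    procs=$(sort -un "$1/cgroup.procs" | grep -c .)
    echo "$threads $procs"
}
```

For a group that holds only one heavily multithreaded program, the first number is the thread count and the second is 1.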
cpu
This group manages CPU shares. Each group shares hierarchically, that is, each PID in the root group and each sub-group share the same percentage of processing time. So child groups cannot get more CPU than the parent group. Files:
- cpu.shares: input: The CPU shares for this group -- output: The set CPU shares
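How the cpu.shares values of sibling groups relate to each other under full load can be estimated with a few lines of shell. This is only a sketch: it assumes that all sibling groups are busy, it ignores tasks that sit directly in the parent group, and it expects group names without whitespace. The function name show_shares is hypothetical.

```shell
# show_shares PARENT
# For every child Cgroup of PARENT that has a readable cpu.shares file,
# print its name and its percentage of the siblings' total shares.
show_shares() {
    for f in "$1"/*/cpu.shares; do
        [ -r "$f" ] || continue        # also skips an unmatched glob
        dir=${f%/cpu.shares}
        echo "${dir##*/} $(cat "$f")"
    done | awk '
        { name[NR] = $1; share[NR] = $2; total += $2 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.1f%%\n", name[i], 100 * share[i] / total
        }'
}
```

With two groups "web" (1024 shares) and "build" (256 shares), this prints 80.0% for web and 20.0% for build. A share is set with a plain write, for example echo 256 > /cgroups/cpu/build/cpu.shares, where the group name build is just an example.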
memory
This is missing from the Debian kernel for now, so I am unable to test. It seems to be available from 2.6.35, but Debian is at 2.6.32. I will document it as soon as my kernels have reached this version and I start to use it. Documentation is at www.kernel.org/doc/Documentation/cgroups/memory.txt
cpuacct
This is very interesting for accounting. Read www.kernel.org/doc/Documentation/cgroups/cpuacct.txt
cpuset
Here you can restrict processes to CPUs, CPU groups, memory nodes (NUMA; a PC usually is not NUMA, so there is only 1 node) etc. Perhaps the most important thing is that you can reserve a CPU for certain system tasks or pin some processes to some CPUs. But this area is so broad that you had better read www.kernel.org/doc/Documentation/cgroups/cpusets.txt Files:
devices
This controls access to devices. It is meant to partition a system such that tasks cannot interfere with each other, for example such that even root users cannot access the root block filesystem. It seems a little bit theoretical to me and not yet able to improve security on multi-user systems. However, it may be good for providing some testing scenario, like booting another image and restricting the devices it sees. Documentation is at www.kernel.org/doc/Documentation/cgroups/devices.txt Files:
freezer
This interface is extremely interesting. With it you are able to freeze the given Cgroup such that it ceases to use CPU time. The interesting part is that it works on a task list instead of a single task, so it is very convenient. Theoretically you can now save the core, record all the /proc/PID/ settings, read out the buffers of open pipes and then kill the task(s). Later you can reinstate the /proc/PID/ settings (by opening everything, refilling the pipes, seeking, etc.) and then load the core and continue where it left off. This can even be on another node. (This will not work with every task type out of the box, as most likely you are not able to keep the TCP session states to external systems, but apart from this it is highly interesting to be able to "suspend" a task or task list to disk.) Documentation is at www.kernel.org/doc/Documentation/cgroups/freezer-subsystem.txt Files:
- freezer.state: input: THAWED/FROZEN -- output: THAWED/FREEZING/FROZEN -- This file does not show up in the root Cgroup, as freezing the entire system would be a little bit weird.
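Because freezer.state can report FREEZING for a while before it reaches FROZEN, a script should wait for the final state. The following is a minimal sketch, assuming the Cgroup directory already exists below a freezer mount; the function names freeze_group and thaw_group are hypothetical.

```shell
# freeze_group DIR [TRIES]
# Write FROZEN to DIR/freezer.state and poll once per second until the
# file reads back FROZEN. Returns 1 if the write fails or the state has
# not reached FROZEN after TRIES polls (default 10).
freeze_group() {
    tries=${2:-10}
    echo FROZEN > "$1/freezer.state" || return 1
    while [ "$tries" -gt 0 ]; do
        [ "$(cat "$1/freezer.state")" = FROZEN ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# thaw_group DIR
# Let the tasks of the Cgroup run again.
thaw_group() {
    echo THAWED > "$1/freezer.state"
}
```

A call could look like freeze_group /cgroups/freezer/batch && echo frozen, where the group name batch is just an example.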
net_cls
I was not able to find any documentation about this. Files:
- net_cls.classid: I have no idea
ns
I was not able to find any documentation about this. Files:
- (none)
Admin: The filesystem way
This is the most natural way to handle Cgroups. However, this will not survive a reboot. Perhaps this is a good thing as long as you experiment with it.
Add a PID to a Cgroup
Just do
echo PID >> /cgroup/TYPE/name/tasks
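Since Cgroups are inherited, the same write can be used to start a whole command inside a group. The following is a minimal sketch, assuming the Cgroup directory exists and is writable by the caller; the function name run_in_cgroup is hypothetical. The package's cgexec(1) serves a similar purpose.

```shell
# run_in_cgroup DIR COMMAND [ARGS...]
# Start COMMAND inside the Cgroup whose directory is DIR. A fresh shell
# first writes its own PID to DIR/tasks and then replaces itself with
# COMMAND via exec, so COMMAND keeps that PID and everything it forks
# inherits the Cgroup.
run_in_cgroup() {
    dir=$1
    shift
    # Inside sh -c: $0 is the directory, "$@" is the command line.
    sh -c 'echo "$$" > "$0/tasks" && exec "$@"' "$dir" "$@"
}
```

For example, run_in_cgroup /cgroup/cpu/build make -j128 would run the parallel make from the beginning of this text inside a group named build (the group name is just an example).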