Containerization Mechanisms: Cgroups

Today we’ll be continuing our post series on containerization mechanisms. In our last article on containerization, we talked about isolating processes using the namespaces mechanism. For containerization though, isolating resources isn’t enough. If we launch an application in an isolated environment, we need to be sure it is allocated enough resources, and that it cannot itself consume an inordinate amount and disrupt the rest of the system. For this task, the Linux kernel has a special mechanism, cgroups (short for control groups), which we will talk about today.

Cgroups is a particularly relevant topic today: kernel version 4.5, released in March 2016, officially shipped the latest version of this mechanism, cgroup v2. In the course of that work, cgroups was rewritten from scratch.

Why did this require so many radical changes? To answer this question, let’s look at how the first version of cgroups came to be.

Cgroups: A Brief History

Development on cgroups was started in 2006 by Google employees Paul Menage and Rohit Seth. At that time, the term “control group” still wasn’t in use, and instead they used “process containers”. Naturally, their initial goal wasn’t to create the cgroups we see today. The original idea was much more modest: to improve the cpuset mechanism for distributing processor time and memory between tasks. But over time, everything grew into a much bigger project.

By the end of 2007, the name “process containers” had been replaced by “control groups”. This was done to avoid confusion with the term “container” (at that time, OpenVZ was in active development, and the word “container” was already being used in the sense we know today).

In 2008, the cgroups mechanism was officially added to the Linux kernel (ver. 2.6.24). What was so new about this version?

No new system calls were added for cgroups. The biggest change was the cgroups virtual filesystem, cgroupfs.

In init/main.c, references were added to the functions that activate cgroups during boot: cgroup_init and cgroup_init_early. The functions used for spawning and terminating processes, fork() and exit(), were modified slightly.

New directories were added to the /proc virtual file system: /proc/{pid}/cgroup (for each process) and /proc/cgroups (for the system as a whole).
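Both files can be inspected directly on any Linux system with cgroups enabled; for example:

```shell
# List the subsystems the running kernel supports, with their
# hierarchy IDs, group counts, and enabled flags:
cat /proc/cgroups

# Show which control groups the current shell belongs to
# (one line per hierarchy: hierarchy-ID:subsystems:path):
cat /proc/$$/cgroup
```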

Architecture

The cgroups mechanism is made up of two constituent parts: the core (cgroup core) and subsystems, also called controllers. Kernel version 4.4 has 12 subsystems:

Name        Kernel module                  Function
blkio       block/blk-cgroup.c             Sets limits for reading from and writing to block devices
cpuacct     kernel/sched/cpuacct.c         Generates reports on the processor resources used by a control group
cpu         kernel/sched/core.c            Schedules CPU time for the processes in a control group
cpuset      kernel/cpuset.c                Assigns processor cores and memory nodes to a control group
devices     security/device_cgroup.c       Allows or blocks access to devices
freezer     kernel/cgroup_freezer.c        Suspends and resumes task execution in a control group
hugetlb     mm/hugetlb_cgroup.c            Enables huge-page memory support for control groups
memory      mm/memcontrol.c                Manages memory allocation for process groups
net_cls     net/core/netclassid_cgroup.c   Tags network packets so packets generated by a particular task in a control group can be identified
net_prio    net/core/netprio_cgroup.c      Dynamically sets traffic priorities
perf_event  kernel/events/core.c           Gives control groups access to perf events
pids        kernel/cgroup_pids.c           Limits the number of processes in a control group

We can print a list of subsystems in the console using the command:

$ ls /sys/fs/cgroup/
blkio    cpu,cpuacct  freezer  net_cls           perf_event
cpu      cpuset       hugetlb  net_cls,net_prio  pids
cpuacct  devices      memory   net_prio          systemd

Each subsystem is a directory with control files where configurations are saved. The following control files can be found in each of these directories:

  • cgroup.clone_children – when set, passes the parent control group’s settings on to newly created child groups
  • tasks – contains the list of PIDs of all processes attached to the control group
  • cgroup.procs – contains the list of TGIDs of the thread groups attached to the control group
  • cgroup.event_control – allows notifications to be sent when the status of the control group changes
  • release_agent – contains a command that is executed when the last process leaves a control group (if the notify_on_release option is enabled); this can be used to automatically delete empty control groups
  • notify_on_release – contains a boolean flag (0 or 1) that enables (or disables) execution of the command given in release_agent

Each subsystem also has its own control files. We’ll discuss some of these below.

To create a control group, we just create a subdirectory inside a subsystem directory. Control files are added to this subdirectory automatically (we’ll discuss them in more detail below). Adding processes to a group is just as simple: we write their PIDs to the tasks control file.

The tree of control groups nested within a subsystem is called a hierarchy. Let’s look at how cgroups works with a few practical examples.

Cgroups Hierarchy: Practical Orientation

Example 1: Managing CPU Resources

We execute the command:

$ mkdir /sys/fs/cgroup/cpuset/group0

With this command, we can create a control group containing the following control files:

$ ls /sys/fs/cgroup/cpuset/group0
cgroup.clone_children  cpuset.memory_pressure
cgroup.procs           cpuset.memory_spread_page
cpuset.cpu_exclusive   cpuset.memory_spread_slab
cpuset.cpus            cpuset.mems
cpuset.effective_cpus  cpuset.sched_load_balance
cpuset.effective_mems  cpuset.sched_relax_domain_level
cpuset.mem_exclusive   notify_on_release
cpuset.mem_hardwall    tasks
cpuset.memory_migrate

There aren’t any processes in our group at the moment. To add a process, we need to write its PID to the tasks file:

$ echo $$ > /sys/fs/cgroup/cpuset/group0/tasks

The $$ variable expands to the PID of the current shell process.

At this point the process is not restricted to any particular CPU core, which we can verify with the following command:

$ cat /proc/$$/status | grep '_allowed'
Cpus_allowed:   3
Cpus_allowed_list:  0-1
Mems_allowed:   00000000,00000001
Mems_allowed_list:  0

The printout from this command shows us that 2 CPU cores, numbered 0 and 1, are available for this process.

We’ll try to “attach” this process to core number 0:

$ echo 0 > /sys/fs/cgroup/cpuset/group0/cpuset.cpus

Let’s see the results:

$ cat /proc/$$/status | grep '_allowed'
Cpus_allowed:   1
Cpus_allowed_list:  0
Mems_allowed:   00000000,00000001
Mems_allowed_list:  0
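A note on cleanup: a control group directory can only be deleted once it holds no processes, and it is removed with rmdir rather than rm (the control files inside it are managed by the kernel). A sketch, assuming the shell is still the only process in group0:

```shell
# Move the shell back to the root group of the cpuset hierarchy:
echo $$ > /sys/fs/cgroup/cpuset/tasks

# Now that group0 is empty, the directory can be removed:
rmdir /sys/fs/cgroup/cpuset/group0
```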

Example 2: Managing Memory

We’ll embed the group we created in the previous example in another subsystem:

$ mkdir /sys/fs/cgroup/memory/group0

Then we execute:

$ echo $$ > /sys/fs/cgroup/memory/group0/tasks

We’ll try to limit memory usage for group0. To do this, we write the limit to the file memory.limit_in_bytes:

$ echo 40M > /sys/fs/cgroup/memory/group0/memory.limit_in_bytes
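The kernel parses the usual K/M/G suffixes and stores the limit in bytes, so reading the file back shows the numeric value; the group’s current and peak consumption can be watched through the standard memory-controller files:

```shell
# Read the limit back -- the kernel reports it in bytes:
cat /sys/fs/cgroup/memory/group0/memory.limit_in_bytes

# 40M is 40 * 1024 * 1024 = 41943040 bytes:
echo $((40 * 1024 * 1024))

# Current and peak memory usage of the group:
cat /sys/fs/cgroup/memory/group0/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/group0/memory.max_usage_in_bytes
```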

The cgroups memory controller offers quite comprehensive management. For example, it can be used to protect critical processes from being killed by the OOM killer:

$ echo 1 > /sys/fs/cgroup/memory/group0/memory.oom_control
$ cat /sys/fs/cgroup/memory/group0/memory.oom_control
oom_kill_disable 1
under_oom 0

If we were to place the ssh daemon in a separate control group, for example, and disable the OOM killer for that group, we could be sure it wouldn’t be killed under growing memory pressure.
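As a sketch (this requires root, and the group name ssh_protected is our own invention):

```shell
# Create a dedicated control group in the memory hierarchy:
mkdir /sys/fs/cgroup/memory/ssh_protected

# Attach the daemon by PID (pidof -s returns a single PID):
echo "$(pidof -s sshd)" > /sys/fs/cgroup/memory/ssh_protected/tasks

# Disable the OOM killer for this group:
echo 1 > /sys/fs/cgroup/memory/ssh_protected/memory.oom_control
```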

Example 3: Managing Devices

We’ll add our control group to another hierarchy:

$ mkdir /sys/fs/cgroup/devices/group0

By default, the group isn’t prohibited from accessing any device:

$ cat /sys/fs/cgroup/devices/group0/devices.list
a *:* rwm

We’ll try to set a restriction:

$ echo 'c 1:3 rwm' > /sys/fs/cgroup/devices/group0/devices.deny

This command adds the device /dev/null to our control group’s list of restricted devices. We wrote the line ‘c 1:3 rwm’ to the control file. The first field is the device type: in our case, a character device, marked by the letter c. The other two types are block devices (b) and all devices (a). Next come the device’s major and minor numbers. You can find a device’s numbers using the command:

$ ls -l  /dev/null

Instead of /dev/null you can enter any path. The printout will look like this:

crw-rw-rw- 1 root root 1, 3 May 30 10:49 /dev/null

The first digit in the printout is the major number, and the second is the minor.

The last three letters represent the access rights: r — permission to read from the device; w — permission to write to the device; m — permission to create new device files (mknod).

Next we execute:

$ echo $$ > /sys/fs/cgroup/devices/group0/tasks
$ echo "test" > /dev/null

When executing the last command, the system returns an error message:

-bash: /dev/null: Operation not permitted

We cannot interact with /dev/null because access is blocked.

We’ll restore access:

$ echo a > /sys/fs/cgroup/devices/group0/devices.allow

After running this command, the entry a *:* rwm will be added to the file /sys/fs/cgroup/devices/group0/devices.list and all restrictions will be lifted.

Cgroups and Containers

From these examples, we can see the principles behind cgroups: we place specific processes in groups, which we can then “embed” in subsystems. Now we’ll look at more complicated examples to see how cgroups is used in modern containerization tools. For these examples, we’ll be using LXC.

We’ll install LXC and create a container:

$ sudo apt-get install lxc debootstrap bridge-utils
$ sudo lxc-create -n ubuntu -t ubuntu -f /usr/share/doc/lxc/examples/lxc-veth.conf
$ sudo lxc-start -d -n ubuntu

We’ll see what changed in the cgroups directory after creating and launching the container:

$ ls /sys/fs/cgroup/memory

cgroup.clone_children  memory.limit_in_bytes            memory.swappiness
cgroup.event_control   memory.max_usage_in_bytes        memory.usage_in_bytes
cgroup.procs           memory.move_charge_at_immigrate  memory.use_hierarchy
cgroup.sane_behavior   memory.numa_stat                 notify_on_release
lxc                    memory.oom_control               release_agent
memory.failcnt         memory.pressure_level            tasks
memory.force_empty     memory.soft_limit_in_bytes

As we can see, each hierarchy now contains an lxc directory, which in turn contains an ubuntu directory. A separate subdirectory is created in the lxc directory for every new container. The PIDs of all processes launched in a container are written to the file /sys/fs/cgroup/cpu/lxc/[container name]/tasks.

Resources can be allocated to containers through the cgroups control files with the lxc-cgroup command:

$ lxc-cgroup -n [container name] memory.limit_in_bytes 400M

This can similarly be done for Docker, systemd-nspawn, and other container tools.
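In Docker, for example, the same cgroup knobs are exposed as flags on docker run; the commands below use standard Docker options and are shown for illustration:

```shell
# Cap the container's memory at 400 MB (memory cgroup):
docker run --memory=400m ubuntu sleep 60

# Pin the container to CPU cores 0 and 1 (cpuset cgroup):
docker run --cpuset-cpus=0-1 ubuntu sleep 60
```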

Drawbacks to Cgroups

Throughout its nearly ten years of existence, cgroups has been criticized on more than one occasion. As the author of one LWN.net article put it, developers “love to hate” the cgroups core. The examples in this article hint at why, even though we have tried to stay neutral: having to nest a control group separately into each subsystem is very inconvenient. And on closer inspection, the approach turns out to be quite inconsistent.

For example, if we create a nested control group, the parent group’s configuration is inherited in some subsystems but not in others.

In the cpuset subsystem, any change to the parent control group is automatically propagated to nested groups, but other subsystems don’t do this unless the cgroup.clone_children parameter is set.

The kernel developer community began discussing ways to fix these and other issues long ago: one of the first texts on the subject dates back to early 2012.

The author of that text, Facebook engineer Tejun Heo, points out directly that the main problem with cgroups is its organizational model, in which subsystems are attached to multiple control group hierarchies. He suggested using one, and only one, hierarchy, with subsystems enabled for each group individually. This approach required serious changes, including a change of name: the resource isolation mechanism is now called cgroup (singular), not cgroups.

We’ll take a closer look at these innovations.

Cgroup v2: What’s New

As we stated above, cgroup v2 has been included in the Linux kernel since version 4.5. The old version is still supported as well. For version 4.6, there is already a patch that lets you disable first-version support at kernel boot.

Currently, cgroup v2 works with only three controllers: io, memory, and pids. Test patches that add CPU resource management have already been published.

Cgroup v2 is mounted with the following command:

$ mount -t cgroup2 none [mount point]

Let’s say we mounted cgroup v2 in the /cgroup2 directory. The following control files will automatically be created in this directory:

  • cgroup.controllers – contains the list of supported controllers
  • cgroup.procs – immediately after mounting, contains the list of all processes running in the system, including zombie processes. When a group is created, the same file is created in it; it remains empty until processes are added to the group
  • cgroup.subtree_control – contains the list of controllers enabled for the group’s children; empty by default

These files are created in every new control group. Each group other than the root also gets a cgroup.events file, which shows whether the group currently contains processes.

New groups are created by running:

$ mkdir /cgroup2/group1

To enable a controller, its name (prefixed with a plus sign) is written to the group’s cgroup.subtree_control file:

$ echo "+pids" > /cgroup2/group1/cgroup.subtree_control

To disable a controller, the same command is used but with a minus sign instead of a plus sign:

$ echo "-pids" > /cgroup2/group1/cgroup.subtree_control

When a controller is activated, additional control files become available. For example, once the pids controller is enabled, the files pids.max and pids.current appear. The former limits the number of processes in a group, while the latter shows how many processes the group currently contains.
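As a brief sketch of these files in action (note: in cgroup v2, pids.max and pids.current appear in the children of the group whose cgroup.subtree_control was written, so we first create a child group; the name child is ours):

```shell
# Create a child group; pids.max and pids.current appear in it
# because the pids controller is enabled in the parent:
mkdir /cgroup2/group1/child

# Cap the number of processes in the child group at 5:
echo 5 > /cgroup2/group1/child/pids.max

# Attach the current shell and check the current process count:
echo $$ > /cgroup2/group1/child/cgroup.procs
cat /cgroup2/group1/child/pids.current
```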

Subgroups can be created in existing groups:

$ mkdir /cgroup2/group1/subgroup1
$ mkdir /cgroup2/group1/subgroup2
$ echo "+pids" > /cgroup2/group1/cgroup.subtree_control

All subgroups inherit the parent group’s properties. In the last example, the pids controller is enabled for group1 and the subgroups nested in it, so pids.max and pids.current are added to them as well. This can be demonstrated with the following diagram:

To avoid the kind of ambiguity nested groups caused in the first version (see above), cgroup v2 observes the following rule: a process may not be added to a nested group if any controller is enabled in it (the “no internal processes” rule).

In the first version of cgroups, a process could belong to several groups at once, as long as those groups were in different hierarchies attached to different subsystems. In the second version, a process may belong to only one group, which avoids any confusion.
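This single membership can be seen in /proc: on a pure cgroup v2 system, a process’s cgroup file contains just one entry (hierarchy ID 0 with an empty controller list), instead of one line per hierarchy as in v1. The path shown is illustrative:

```shell
# One line per hierarchy in v1; exactly one line under pure v2:
cat /proc/$$/cgroup
# On a v2-only system this prints a single line of the form:
# 0::/group1/subgroup1
```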

Conclusion

In this article, we looked at how cgroups is built and what changes were introduced in the new version. If you have any questions or anything to add, please leave a comment below.

For anyone wishing to read more about this topic, below you will find links to relevant materials: