Hand-crafted containers
18 March, 2016
tl;dr
# CTNAME=blah
# mkdir -p /ns/$CTNAME/bin /ns/$CTNAME/lib
# ldd /bin/echo | grep '/' | cut -d'>' -f2 | awk '{print $1}' | xargs -I% cp % /ns/$CTNAME/lib/
# cp /bin/echo /ns/$CTNAME/bin/
# ip netns add $CTNAME
# ip netns exec $CTNAME unshare -fpium --mount-proc env -i container=handcraft chroot /ns/$CTNAME /bin/echo 'Hello, world!'
0. Intro
Containers are the latest trend, for a good reason: they leave room for new ideas in terms of security, flexibility, performance and much more.
But what are containers? A container is a group of processes isolated together from the host operating system. This isolation can happen in different places (namespaces): the network, the filesystem, the process tree, or all of them (there are more, in fact; more on this later).
We can differentiate three types of containers:
- operating system containers
- application containers
- I LIED!
If we think about it, an operating system is a process, /sbin/init, that will spawn other subprocesses. This way, an operating system is nothing more than an application (a complex one). In this regard, there is only a single type of container.
We can now focus on what's really important: how do they work?
1. Namespaces
That's a keyword, so let's ask our internet god what it means:
In computing, a namespace is a set of symbols that are used to organize objects of various kinds, so that these objects may be referred to by name.
-- sincerely, wikipedia
In other words, a namespace is a way to refer to one or more isolations applied
to a process.
When a namespace is created for a process, all its children will be created
within this namespace, and inherit the "limitations" of the parent.
Mount
The process will be able to mount and unmount filesystems without affecting the rest of the system. For example, if you unmount a partition within the namespace, all the processes within it will see it as unmounted, while it will remain mounted for all other processes on the host.
UTS (Unix Time-Sharing)
This will give the ability to change the host and domain name in the namespace without changing it on the host.
IPC (Inter-Process Communication)
This namespace concerns shared memory, System V message queues and semaphores. Processes in the namespace will be unable to communicate with the host's processes this way.
Network
Processes will have their own network stack. This includes the routing table, firewall rules, sockets, and so on.
PID (Process IDentification)
Process IDs will get a different mapping than the one they have on the host. They will get renumbered, starting from 1.
User
The namespace will have its own set of user and group IDs.
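To see which namespaces a process currently lives in, you can look at the symbolic links the kernel exposes under /proc/<pid>/ns. The following is only a minimal sketch, assuming a Linux host with /proc mounted; the ns_list function name is made up for this example.

```shell
#!/bin/sh
# ns_list [PID]: print each namespace type and its identifier for a process.
# ns_list is a hypothetical helper name; PID defaults to "self", which
# resolves to whichever process reads the link.
ns_list() {
    pid=${1:-self}
    for link in /proc/"$pid"/ns/*; do
        # each link points to something like "pid:[4026531836]"
        printf '%s\t%s\n' "${link##*/}" "$(readlink "$link")"
    done
}

ns_list self
```

Two processes that print the same identifier for a given type share that namespace. The exact list of types depends on your kernel version.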
2. Making containers
Now that we know what containers are and how they work, it's time to make one! For the purpose of this article, we will try to build the simplest container capable of printing "Hello, world!".
Here is the program:
$ cat <<'EOF' > hello.c
#include <unistd.h>
int
main(int argc, char **argv)
{
	/* write the 14-byte greeting to stdout (file descriptor 1) */
	write(1, "Hello, world!\n", 14);
	return 0;
}
EOF
$ cc hello.c -o hello
2.0 chroot(1)
This one is an old tool that will run a command or spawn an interactive shell after changing the root directory.
It is used to isolate a process, or group of processes, from the host's filesystem tree. This has long been used for security purposes (see chroot jail), but escaping from a chroot is rather easy for someone with root (UID 0) access. This is why chroot alone cannot be considered secure, but coupled with a user namespace and privilege dropping, one can turn a chroot into a real jail.
Back to the topic. Let's copy our hello binary into the chroot, and try to run it:
$ mkdir rootfs
$ cp ./hello ./rootfs/hello
# chroot ./rootfs ./hello
chroot: failed to run command "./hello": No such file or directory
This is the worst error message you can get. Of course ./hello exists! We just copied it. But what does this error mean then? Let's take a closer look at this binary:
$ file ./hello
./hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-x86-64.so.2, for GNU/Linux 3.12.0, not stripped
The output may differ slightly depending on your system, but the important part here is the following:
dynamically linked, interpreter /lib/ld-linux-x86-64.so.2
Dynamically linked binaries cannot be run on their own. Long story short, /lib/ld-linux-x86-64.so.2 is a program that is implicitly called to run all the dynamic binaries on a Linux system; it's called the dynamic linker. So in order to have a binary run in the chroot, you need to copy over the linker AND all the libraries your binary links to. To get a list of these libraries, use the ldd command:
$ ldd hello
linux-vdso.so.1 (0x00007ffd3e7dc000)
libc.so.6 => /lib/libc.so.6 (0x00007fdc1a482000)
/lib/ld-linux-x86-64.so.2 (0x00007fdc1a82a000)
You can ignore the vdso line: it is a virtual library that the kernel maps into every process, so there is no file to copy.
Our hello binary depends on two files: /lib/ld-linux-x86-64.so.2, the linker, and /lib/libc.so.6, the C library (containing the wrappers for system calls like write(2)).
In order to run our hello program, we'll have to copy them over in place. After that, our program should run totally fine:
$ mkdir -p rootfs/lib
$ cp /lib/ld-linux-x86-64.so.2 /lib/libc.so.6 ./rootfs/lib
# chroot ./rootfs ./hello
Hello, world!
TADAAAA!! That was easy right? Another option is to simply compile our program statically. It means that all the needed objects from libraries will be compiled into the program, removing the need for a linker and libc in the chroot:
$ mkdir rootfs
$ cc hello.c -o hello -static -s
$ cp hello ./rootfs
# chroot ./rootfs ./hello
Hello, world!
Let's take a look at the size of this "container". For scale, the "Smallest possible docker container" weighs 3.6MiB...
$ du -sh rootfs
720K rootfs
That's most likely the lightest container you've seen, right?
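If you would rather keep a dynamic binary, the manual copy of the linker and libraries can be automated. Here is a rough sketch, not a definitive tool: copy_with_libs is a hypothetical helper name, it assumes the binary is given as an absolute path, and it relies on ldd printing absolute paths for the linker and for every resolved library.

```shell
#!/bin/sh
# copy_with_libs BINARY ROOTFS
# Copy BINARY (absolute path) and everything ldd reports into ROOTFS,
# keeping the same paths so the linker finds the libraries in the chroot.
copy_with_libs() {
    bin=$1 root=$2
    mkdir -p "$root$(dirname "$bin")" || return 1
    cp "$bin" "$root$bin" || return 1
    # keep only the absolute paths; the vdso line has none and is skipped
    ldd "$bin" | grep -o '/[^ ]*' | sort -u | while read -r lib; do
        mkdir -p "$root$(dirname "$lib")"
        cp -L "$lib" "$root$lib"    # -L copies the file behind a symlink
    done
}
```

For example, "copy_with_libs /bin/echo ./rootfs" followed by "chroot ./rootfs /bin/echo 'Hello, world!'" (as root) should behave like the hello example above.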
2.1 env
To isolate our process from the host, we'll have to clear every variable from its environment, to make sure the container won't know anything about its host. We can do this with the env command:
$ export FOO="bar"
$ env -i /bin/sh
$ env # we are now in a subshell
PWD=/home/z3bra
You can see that the subprocess doesn't have the $FOO variable in its environment, even though it was exported earlier.
You can set the environment by passing variables AFTER the env -i command. This is useful to set the $container variable, which has been "standardized" as a way to tell processes they are running inside a container.
We now have a way to isolate our hello process from the host's environment:
# env -i container="handcraft" chroot ./rootfs ./hello
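A quick way to convince yourself that this works, without needing root or a chroot, is to run a shell through the same env -i prefix and print what it sees. This is just an illustrative sketch; run_clean is a made-up wrapper name.

```shell
#!/bin/sh
# run_clean COMMAND [ARGS...]: run a command with an empty environment,
# except for the $container marker. run_clean is a hypothetical name.
run_clean() {
    env -i container="handcraft" "$@"
}

export FOO="bar"
# FOO is gone inside the clean environment, container is set
run_clean /bin/sh -c 'echo "FOO=${FOO:-unset} container=${container}"'
# prints: FOO=unset container=handcraft
```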
2.2 unshare(1)
This tool is the one that will actually isolate containers. It was created especially for this purpose, and will let you run a process unshared from different namespaces: mount, user, network, PID, IPC and UTS.
In the same order, each flag below will separate your command from the given namespace. See unshare(1) for more information:
unshare -m -U -n -p -i -u <command>
We can actually leave out the -n flag, as some tools provide a better approach to network isolation (see ip-netns(1), described later in this post).
Another point worth mentioning is that if you want to isolate the process in its own PID namespace, you should consider using the options --fork --mount-proc, so that the process will see a "virtualized" /proc that represents the namespace, and not the host. For example:
# unshare -p --fork --mount-proc ps -faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 13012 2276 pts/2 R+ 23:57 0:00 ps -faux
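The renumbering can also be checked without ps: a shell started in a fresh PID namespace should report itself as PID 1. A small sketch, assuming the util-linux unshare and root privileges; pid_in_new_ns is a hypothetical name, and the fallback message is only there so that the function prints something when the namespace cannot be created.

```shell
#!/bin/sh
# pid_in_new_ns: print the PID a shell sees for itself in a new PID namespace.
# Needs root (or equivalent capabilities); otherwise prints a notice instead.
pid_in_new_ns() {
    unshare -p --fork --mount-proc /bin/sh -c 'echo "$$"' 2>/dev/null ||
        echo "unshare failed (not root, or namespaces unavailable)"
}

pid_in_new_ns
```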
We just found a way to isolate our program a bit more:
# unshare -fpiumU --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
For the curious, you can check the nsenter(1) program, which will help you run a process within the namespaces of another process.
2.3 ip-netns(1)
The ip(1) command includes a netns subcommand to manage network namespaces. It is useful to give network access to a process while keeping it away from the host's network stack.
You need to be familiar with the concepts of bridges and virtual network interface (veth) pairs here.
Virtual ethernet device pairs act like both ends of a tube: when a packet is written on one end, it comes out of the other. This simple concept will help us get internet access inside the container, while using the network stack of the host.
The process is easy: we will create a veth pair, move one end inside the container, and bridge the other side with a physical interface.
Let's assume your physical interface is named eth0. We will create a bridge br0, add eth0 to this bridge, and request an IP for the bridge:
# brctl addbr br0
# brctl addif br0 eth0
# dhcpcd br0
Then, we create a network namespace, a veth pair, and move one end of this pair inside the namespace (we will name it "handcraft"):
# ip netns add handcraft
# ip link add veth1 type veth peer name eth1
# ip link set eth1 netns handcraft
Now that our namespace has an interface, we can bridge the other end of the pair (veth1) together with eth0 and request an IP from inside the namespace:
# brctl addif br0 veth1
# ip link set veth1 up
# ip netns exec handcraft dhcpcd eth1
We now have a namespace 100% isolated from the host, that can reach the outside world over ethernet! You can run any command inside this namespace, and it will use the network stack we just created. For example:
# ip netns exec handcraft curl -s z3bra.org/slj
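The network steps of this section can be gathered into one small script. This is only a sketch under the same assumptions as the text (brctl and dhcpcd installed, interfaces named br0, veth1 and eth1); the run and netns_setup names and the DRY_RUN switch are inventions of this example, added so that you can review the commands before running them as root.

```shell
#!/bin/sh
# run CMD...: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# netns_setup [NAMESPACE] [PHYSICAL_INTERFACE]
# Bridge the physical interface, then plug a veth pair into the namespace.
netns_setup() {
    ns=${1:-handcraft} phys=${2:-eth0}
    run brctl addbr br0 &&
    run brctl addif br0 "$phys" &&
    run dhcpcd br0 &&
    run ip netns add "$ns" &&
    run ip link add veth1 type veth peer name eth1 &&
    run ip link set eth1 netns "$ns" &&
    run brctl addif br0 veth1 &&
    run ip link set veth1 up &&
    run ip netns exec "$ns" dhcpcd eth1
}
```

Running "DRY_RUN=1 netns_setup handcraft eth0" prints the nine commands without touching the system; dropping DRY_RUN executes them and stops at the first failure.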
We can now run our hello program with its own network stack (even though it doesn't make any sense!):
# ip netns exec handcraft unshare -fpiuUm --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
Don't feel ashamed by such a long-ass command, because that is what lxc, docker, and other container applications do behind your back!
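If typing that command gets old, it can be wrapped in a function. The sketch below simply reassembles the pieces from this article; handcraft_run and its DRY_RUN switch are hypothetical names for this example, and the network namespace is assumed to exist already.

```shell
#!/bin/sh
# handcraft_run NETNS ROOTFS COMMAND [ARGS...]
# Run COMMAND chrooted in ROOTFS, inside the NETNS network namespace,
# unshared from the other namespaces, with a clean environment.
# With DRY_RUN set, the command line is printed instead of executed.
handcraft_run() {
    netns=$1 rootfs=$2
    shift 2
    ${DRY_RUN:+echo} ip netns exec "$netns" \
        unshare -fpiuUm --mount-proc \
        env -i container="handcraft" \
        chroot "$rootfs" "$@"
}
```

For instance, "DRY_RUN=1 handcraft_run handcraft ./rootfs ./hello" prints the exact command shown above, and without DRY_RUN (as root) it runs it.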
3. Bonus: cgroups
Control groups are a feature of the kernel used to limit the resources used by a process, or a group of processes. Cgroups can limit CPU shares, RAM, network usage, disk I/O, ...
I will not cover their usage here, as this article is already long, but they are totally worth mentioning as an improvement over our containers.
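As a starting point for your own exploration, you can at least see which control groups a process belongs to, since the kernel lists them in /proc/<pid>/cgroup. This read-only sketch assumes a Linux host with cgroups enabled; cgroup_of is a made-up helper name, and it limits nothing.

```shell
#!/bin/sh
# cgroup_of [PID]: show the control group membership of a process.
# Prints a notice instead if the kernel exposes no cgroup file for it.
cgroup_of() {
    f=/proc/${1:-self}/cgroup
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "no cgroup information at $f"
    fi
}

cgroup_of self
```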
4. Congratz
... for reading this far.
Containers are a truly awesome concept. They make great use of new
technologies, and all the tools presented above allow the standard users
to exploit them in many different ways.
Applications like LXC and docker both recreate a full operating system,
even though they are used to run a single process (web server, database, ...).
By knowing how this works under the hood, we will be able to use the container technology to isolate the application in a smarter way than shipping it along with a full operating system.
For further reading, check out these links:
- http://doger.io
- http://git.r-36.net/ns-tools
- https://github.com/arachsys/containers
- https://github.com/p8952/bocker
Now get out there, and make some containers!