cgroups v2 works ‼️
btholt committed Apr 13, 2024
1 parent dbd8017 commit 11104ef
Showing 3 changed files with 249 additions and 0 deletions.
54 changes: 54 additions & 0 deletions lessons/02-crafting-containers-by-hand/B-chroot.md
---
title: "chroot"
---

I've heard people call this "cha-root" and "change root". I'm going to stick to "change root" because I feel less ridiculous saying that. It's a Linux command that allows you to set the root directory of a new process. In our container use case, we just set the root directory to be wherever the new container's root directory should be. Now the new group of processes can't see anything outside of it, which eliminates our security problem because the new process has no visibility outside of its new root.

Let's try it. Start up an Ubuntu VM however you feel most comfortable. I'll be using Docker (and doing containers within containers 🤯). If you're like me, run `docker run -it --name docker-host --rm --privileged ubuntu:jammy`. This will download the [official Ubuntu container][ubuntu] from Docker Hub and grab the version marked with the _jammy_ tag. In this case, _jammy_ means it's the latest stable release (22.04). You could put `ubuntu:devel` to get the latest development version of Ubuntu (as of writing that'd be 24.04). `docker run` means we're going to run some commands in the container, and the `-it` means we want to make the shell interactive (so we can use it like a normal terminal). The `--rm` removes the container when we exit, and `--privileged` gives it elevated access to the host, which we'll need for the namespace and cgroup exercises later.

If you're in Windows and using WSL, just open a new WSL terminal in Ubuntu. ✌️

To see what version of Ubuntu you're using, run `cat /etc/issue`. `cat` reads a file and dumps it to the output so we can read it, and `/etc/issue` is a file that will tell us what distro we're using. Mine says `Ubuntu 22.04.4 LTS \n \l`.
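If you want something easier to use in a script, most modern distros also ship `/etc/os-release`, a plain key=value file (the exact fields can vary a bit by distro, so treat this as a sketch):

```shell
# /etc/os-release is a simple KEY=value file, so we can source it
# directly into the shell and read the fields as variables.
. /etc/os-release
echo "$NAME $VERSION_ID"   # e.g. "Ubuntu 22.04"
```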

Okay, so let's attempt to use `chroot` right now.

1. Make a new folder in your root directory via `mkdir /my-new-root`.
1. Inside that new folder, run `echo "my super secret thing" >> /my-new-root/secret.txt`.
1. Now try to run `chroot /my-new-root bash` and see the error it gives you.

You should see something about failing to run a shell or not being able to find bash. That's because bash is a program and your new root wouldn't have bash to run (because it can't reach outside of its new root.) So let's fix that! Run:

1. `mkdir /my-new-root/bin`
1. `cp /bin/bash /bin/ls /my-new-root/bin/`
1. `chroot /my-new-root bash`

Still not working! The problem is that these commands rely on libraries to power them and we didn't bring those with us. So let's do that too. Run `ldd /bin/bash`. It will print out something like this:

```bash
$ ldd /bin/bash
        linux-vdso.so.1 (0x00007fffa89d8000)
        libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f6fb8a07000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6fb8803000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6fb8412000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6fb8f4b000)
```

These are the libraries we need for bash. Let's go ahead and copy those into our new environment.

1. `mkdir /my-new-root/lib /my-new-root/lib64` or you can do `mkdir /my-new-root/lib{,64}` if you want to be fancy
1. Then we need to copy all those paths (ignore the lines that don't have paths) into our directory. Make sure you get the right files in the right directory. In my case above (yours likely will be different) it'd be two commands:
1. `cp /lib/x86_64-linux-gnu/libtinfo.so.5 /lib/x86_64-linux-gnu/libdl.so.2 /lib/x86_64-linux-gnu/libc.so.6 /my-new-root/lib`
1. `cp /lib64/ld-linux-x86-64.so.2 /my-new-root/lib64`
1. Do it again for `ls`. Run `ldd /bin/ls`
1. Follow the same process to copy the libraries for `ls` into our `my-new-root`.
1. `cp /lib/x86_64-linux-gnu/libselinux.so.1 /lib/x86_64-linux-gnu/libpcre.so.3 /lib/x86_64-linux-gnu/libpthread.so.0 /my-new-root/lib`
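Copying each library by hand works, but it gets tedious. If you'd rather script it, here's a small sketch (the helper name `copy_with_libs` is made up for illustration) that parses `ldd`'s output and copies the binary plus every library path it mentions into the new root. It assumes a Linux host and dynamically linked binaries:

```shell
# copy_with_libs <binary> <new-root>
# Copies a binary into <new-root>/bin and every shared library ldd
# reports into the matching path under <new-root>.
copy_with_libs () {
  local bin="$1" root="$2" lib
  mkdir -p "$root/bin"
  cp "$bin" "$root/bin/"
  # ldd prints lines like "libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)";
  # grab every absolute path it mentions
  for lib in $(ldd "$bin" | grep -o '/[^ )]*'); do
    mkdir -p "$root$(dirname "$lib")"
    cp "$lib" "$root$(dirname "$lib")/"
  done
}

# usage (as root):
# copy_with_libs /bin/bash /my-new-root
# copy_with_libs /bin/ls /my-new-root
```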

Now, finally, run `chroot /my-new-root bash` and run `ls`. You should successfully see everything in the directory. Now try `pwd` to see your working directory. You should see `/`. You can't get out of here! Before the term "containers" existed, this was called a jail for exactly this reason. At any time, hit CTRL+D or run `exit` to get out of your chroot'd environment.

## cat exercise

Now try running `cat secret.txt`. Oh no! Your new chroot-ed environment doesn't know how to cat! As an exercise, go make `cat` work the same way we did above!

Congrats you just cha-rooted the \*\*\*\* out of your first environment!

[ubuntu]: https://hub.docker.com/_/ubuntu
55 changes: 55 additions & 0 deletions lessons/02-crafting-containers-by-hand/C-namespaces.md
---
title: "namespaces"
---

While chroot is pretty straightforward, namespaces and cgroups are a bit more nebulous to understand, but they're no less important. Both of these next two features are for security and resource management.

Let's say you're running a big server that's in your home and you're selling space to customers (that you don't know) to run their code on your server. What sort of concerns would you have about running their "untrusted" code? Let's say you have Alice and Bob who are running e-commerce services dealing with lots of money. They themselves are good citizens of the server and minding their own business. But then you have Eve join the server who has other intentions: she wants to steal money, source code, and whatever else she can get her hands on from your other tenants on the server. If we just gave all three of them unfettered root access to the server, what's to stop Eve from taking everything? Or what if she just wants to disrupt their businesses, even if she's not stealing anything?

Your first line of defense is that you could log them into chroot'd environments and limit them to only those. Great! Now they can't see each others' files. Problem solved? Well, no, not quite yet. Even though Eve can't see the files, she can still see all the processes running on the computer. She can kill processes, unmount filesystems, and even hijack processes.

Enter namespaces. Namespaces allow you to hide processes from other processes. If we give each chroot'd environment a different set of namespaces, now Alice, Bob, and Eve can't see each others' processes (they even get different PIDs, or process IDs, so they can't guess what the others have) and you can't steal or hijack what you can't see!

There's a lot more depth to namespaces beyond what I've outlined here. The above describes _just_ the PID namespace. There are several more namespaces as well, and together they help these containers stay isolated from each other.

## The problem with chroot alone

Now, this isn't secure. The only thing we've protected is the file system, mostly.

1. chroot in a terminal into our environment
1. In another terminal, run `docker exec -it docker-host bash`. This will get another terminal session #2 for us (I'll refer to the chroot'd environment as #1)
1. Run `tail -f /my-new-root/secret.txt &` in #2. This will start an infinitely running process in the background.
1. Run `ps` to see the process list in #2 and see the `tail` process running. Copy the PID (process ID) for the tail process.
1. In #1, the chroot'd shell, run `kill <PID you just copied>`. This will kill the tail process from inside the chroot'd environment. This is a problem: it means chroot isn't enough to isolate someone. We need more barriers. Processes are just one example, but it illustrates that we need isolation beyond just the file system.

## Safety with namespaces

So let's create a chroot'd environment that's isolated using namespaces via a new command: `unshare`. `unshare` creates a new isolated namespace from its parent (so you, the server provider, can't spy on Bob or Alice either) and from all other future tenants. Run this:

**NOTE**: This next command downloads about 150MB and takes at least a few minutes to run. Unlike Docker images, this will redownload it _every_ time you run it and does no caching.

```bash
exit # from our chroot'd environment if you're still running it, if not skip this

# install debootstrap
apt-get update -y
apt-get install debootstrap -y
debootstrap --variant=minbase jammy /better-root

# head into the new namespace'd, chroot'd environment
unshare --mount --uts --ipc --net --pid --fork --user --map-root-user chroot /better-root bash # this also chroot's for us
mount -t proc none /proc # process info filesystem
mount -t sysfs none /sys # kernel and device info filesystem
mount -t tmpfs none /tmp # in-memory scratch filesystem
```

This will create a new environment that's isolated on the system with its own PIDs, mounts (like storage and volumes), and network stack. Now we can't see any of the processes!

Now try our previous exercise again.

1. Run `tail -f /my-new-root/secret.txt &` from #2 (not the unshare env)
1. Run `ps` from #1, grab pid for `tail`
1. Run `kill <pid for tail>`, see that it doesn't work
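A quick way to see this isolation for yourself: the kernel exposes every process's namespaces as symlinks under `/proc/<pid>/ns`. Two processes share a namespace exactly when those links point at the same inode number, so comparing them between the host shell and the unshare'd shell shows different numbers. This sketch works on any Linux box, no `unshare` required:

```shell
# each link reads like "pid:[4026531836]"; the bracketed number is the
# namespace's inode, which uniquely identifies the namespace
readlink /proc/self/ns/pid
readlink /proc/self/ns/net
# run the same commands in the unshare'd environment and you'll see
# different numbers than on the host: different namespaces
```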

We used namespaces to protect our processes! We could explore the other namespaces, but know it's a similar exercise: using namespaces to restrict containers' ability to interfere with other containers (both for nefarious purposes and to protect ourselves from ourselves.)
140 changes: 140 additions & 0 deletions lessons/02-crafting-containers-by-hand/D-cgroups.md
---
title: cgroups
---

### TODO - This is the valid cgroup2 code

```bash

grep -c cgroup /proc/mounts # if this is greater than 1, you're on cgroups v1 and this won't work (cgroups v2 shows up as a single cgroup2 mount)

mkdir /sys/fs/cgroup/sandbox # creates the cgroup

# Find your PID, it's the bash one immediately after the unshare
cat /sys/fs/cgroup/cgroup.procs # should see the process in the root cgroup
echo <PID> > /sys/fs/cgroup/sandbox/cgroup.procs # puts the unshared env into the cgroup called sandbox
cat /sys/fs/cgroup/sandbox/cgroup.procs # should see the process in the sandbox cgroup

cat /sys/fs/cgroup/cgroup.procs # should see the process no longer in the root cgroup - processes belong to exactly 1 cgroup
mkdir /sys/fs/cgroup/other-procs # make a new cgroup for the rest of the processes; a cgroup can't both contain processes and enable controllers for its children, so everything has to move out of the root cgroup first
echo <PID> > /sys/fs/cgroup/other-procs/cgroup.procs # you have to do this one at a time for each process

cat /sys/fs/cgroup/sandbox/cgroup.controllers # no controllers
cat /sys/fs/cgroup/cgroup.controllers # should see all the available controllers
echo "+cpuset +cpu +io +memory +hugetlb +pids +rdma" > /sys/fs/cgroup/cgroup.subtree_control # add the controllers
cat /sys/fs/cgroup/sandbox/cgroup.controllers # all the controllers now available

### Peg the CPU

apt-get install htop # a cool visual representation of CPU and RAM being used
htop

yes > /dev/null # inside #1 / the cgroup/unshare – this will peg one core of a CPU at 100% of the resources available, see it peg 1 CPU
kill -9 <PID of yes> # from #2, (you'll have to stop htop with CTRL+C) to stop the CPU from being pegged
htop

echo '5000 100000' > /sys/fs/cgroup/sandbox/cpu.max # this allows the cgroup to only use 5% of a CPU
yes > /dev/null # inside #1 / the cgroup/unshare – this will peg one core of a CPU at 5% since we limited it
kill -9 <PID of yes> # from #2, to stop the CPU from being pegged
htop

### Limit memory

yes | tr \\n x | head -c 1048576000 | grep n # run this from #3 terminal and watch it in htop to see it consume about a gig of RAM and 100% of CPU core, CTRL+C to stop it
cat /sys/fs/cgroup/sandbox/memory.max # should see max, so the memory is unlimited
echo 83886080 > /sys/fs/cgroup/sandbox/memory.max # set the limit to 80MB of RAM
yes | tr \\n x | head -c 1048576000 | grep n # from inside #1, see it limit both the CPU and the RAM taken up

### Stop fork bombs

cat /sys/fs/cgroup/sandbox/pids.current # See how many processes the cgroup has at the moment
cat /sys/fs/cgroup/sandbox/pids.max # See how many processes the cgroup can create before being limited (max)
echo 5 > /sys/fs/cgroup/sandbox/pids.max # set a limit that the cgroup can only run 5 processes
for a in $(seq 1 5); do sleep 60 & done # this runs five 60-second background processes that run and then stop. Run this from within #2 and watch it work. Now run it in #1 and watch it not be able to.

:(){ :|:& };: # DO NOT RUN THIS ON YOUR COMPUTER. This is a fork bomb. If not accounted for, this would bring down your computer. However we can safely run inside our #1 because we've limited the amount of PIDs available. It will end up spawning about 100 processes total but eventually will run out of forks to fork.

```
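About that `cpu.max` line in the block above: the file takes two numbers, a quota and a period, both in microseconds. Each period, the cgroup may use up to quota worth of CPU time, so quota divided by period is the fraction of one core it gets. A quick sanity check of the math with the 5000 and 100000 values above:

```shell
# 5000us of CPU time allowed per 100000us window = 5% of a single core
awk -v quota=5000 -v period=100000 \
  'BEGIN { printf "%.1f%% of one core\n", 100 * quota / period }'
# prints "5.0% of one core"
```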

### END TODO - below is the old cgroup1 code

Okay, so now we've hidden the processes from Eve so Bob and Alice can engage in commerce in privacy and peace. So we're all good, right? They can no longer mess with each other, right? Not quite. We're almost there.

So now say it's Black Friday, Boxing Day or Singles' Day (three of the biggest shopping days in the year, pick the one that makes the most sense to you 😄) and Bob and Alice are gearing up for their biggest sales day of the year. Everything is ready to go and at 9:00AM their site suddenly goes down without warning. What happened!? They log on to their chroot'd, unshare'd shell on your server and see that the CPU is pegged at 100% and there's no more memory available to allocate! Oh no! What happened?

The first explanation could be that Eve has her site running on another virtual server and simply logged on and ran a malicious script that ate up all the available resources, so that Bob's and Alice's sites would go down and Eve's would be the only site left up, increasing her sales.

However another, possibly more likely explanation is that both Bob's and Alice's sites got busy at the same time and that in-and-of-itself took all the resources without any malice involved, taking down their sites and everyone else on the server. Or perhaps Bob's site had a memory leak and that was enough to take all the resources available.

Suffice to say, we still have a problem. Every isolated environment has access to all _physical_ resources of the server. There's no isolation of physical components from these environments.

Enter the hero of this story: cgroups, or control groups. Google saw this same problem when building their own infrastructure and wanted to protect runaway processes from taking down entire servers and made this idea of cgroups so you can say "this isolated environment only gets so much CPU, so much memory, etc. and once it's out of those it's out-of-luck, it won't get any more."

This is a bit more difficult to accomplish but let's go ahead and give it a shot.

```bash

# in #2, outside of unshare'd environment get the tools we'll need here
apt-get install -y cgroup-tools htop

# create new cgroups
cgcreate -g cpu,memory,blkio,devices,freezer:/sandbox

# add our unshare'd env to our cgroup
ps aux # grab the bash PID that's right after the unshare one
cgclassify -g cpu,memory,blkio,devices,freezer:sandbox <PID>

# list tasks associated to the sandbox cpu group, we should see the above PID
cat /sys/fs/cgroup/cpu/sandbox/tasks

# show the cpu share of the sandbox cpu group, this is the number that determines priority between competing resources, higher is higher priority
cat /sys/fs/cgroup/cpu/sandbox/cpu.shares

# kill all of sandbox's processes if you need it
# kill -9 $(cat /sys/fs/cgroup/cpu/sandbox/tasks)

# Limit usage at 5% for a multi core system
cgset -r cpu.cfs_period_us=100000 -r cpu.cfs_quota_us=$(( 5000 * $(getconf _NPROCESSORS_ONLN) )) sandbox

# Set a limit of 80M
cgset -r memory.limit_in_bytes=80M sandbox

# Get memory stats used by the cgroup
cgget -r memory.stat sandbox

# in terminal session #2, outside of the unshare'd env
htop # will allow us to see resources being used with a nice visualizer

# in terminal session #1, inside unshare'd env
yes > /dev/null # this will instantly consume one core's worth of CPU power

# notice it's only taking 5% of the CPU, like we set
# if you want, run the docker exec from above to get a third session to see the above command take 100% of the available resources
# CTRL+C stops the above any time

# in terminal session #1, inside unshare'd env
yes | tr \\n x | head -c 1048576000 | grep n # this will ramp up to consume ~1GB of RAM

# notice in htop it'll keep the memory closer to 80MB due to our cgroup
# as above, connect with a third terminal to see it work outside of a cgroup
```

And now we can call this a container. Using these features together, we allow Bob, Alice, and Eve to run whatever code they want and the only people they can mess with is themselves.

So while this is a container at its most basic sense, we haven't broached more advanced topics like networking, deploying, bundling, or anything else that something like Docker takes care of for us. But now you know, at the most basic level, what a container is, what it does, and how you _could_ do this yourself, though you'll be grateful that Docker does it for you. On to the next lesson!
