The power of the root user does not come from having the name root, and it does not come directly from having a UID of 0. The power of the root user is based on a concept inside the Linux kernel called capabilities.
There are things that only the root user can do. One of the most visible examples is the ability to bind to a network port below 1024. This is restricted because, if anyone could do it, they could impersonate core services like ssh, http, telnet, etc. and intercept their traffic.
There are currently 38 capabilities in Linux, by my count. They do all sorts of things and are documented in the Linux manpages. Run man capabilities for a full and up-to-date list. Let’s have a little fun and investigate the one I mentioned above.
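If you want to see which capabilities a process actually holds, the kernel exposes them as hexadecimal bitmasks in /proc/<pid>/status (the CapEff line is the effective set). Here is a minimal Python sketch that decodes such a mask into names. The name table follows the numbering in man capabilities for the 38 capabilities counted above. The function names are my own, and any bit the table does not know about is reported by number rather than guessed.

```python
"""Decode a Linux capability bitmask (as found in /proc/<pid>/status)."""

# Capability numbers 0-37, in the order given by capabilities(7).
# Newer kernels may define more; unknown bits are reported as "bit_<n>".
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]


def decode_caps(mask):
    """Return the names of all capability bits set in an integer mask."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            known = bit < len(CAP_NAMES)
            names.append(CAP_NAMES[bit] if known else "bit_%d" % bit)
        bit += 1
    return names


def effective_caps(status_path="/proc/self/status"):
    """Decode the CapEff line of a status file; empty list if unavailable."""
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("CapEff:"):
                    return decode_caps(int(line.split()[1], 16))
    except FileNotFoundError:
        pass  # not on Linux, or /proc is not mounted
    return []


if __name__ == "__main__":
    print(effective_caps())
```

Run as a regular user, this normally prints an empty list. Run under sudo, it should print the full set your kernel knows about.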
Opening ports below 1024 as a regular user
Oh that magic ability! In a normal world, we are not able to start the httpd daemon on port 80 as a regular user.
    $ httpd -d $(pwd) -DNO_DETACH
    (13)Permission denied: AH00072: make_sock: could not bind to address [::]:80
    (13)Permission denied: AH00072: make_sock: could not bind to address 0.0.0.0:80
    no listening sockets available, shutting down
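The same failure can be reproduced without httpd. Here is a small Python sketch that tries to bind a TCP socket to a port and reports what the kernel says. try_bind is a name made up for this example. The result for port 80 depends on your system, because some distributions and container runtimes lower the unprivileged port threshold through the net.ipv4.ip_unprivileged_port_start sysctl.

```python
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind a TCP socket to host:port and describe the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except PermissionError:
            # EACCES: the process lacks CAP_NET_BIND_SERVICE for this port
            return "permission denied"
        except OSError as exc:
            # Anything else, for example the port is already in use
            return "failed: %s" % exc.strerror
        return "ok"


if __name__ == "__main__":
    for port in (80, 8080):
        print(port, try_bind(port))
```

On a stock system, a regular user would typically see "permission denied" for port 80 and "ok" for 8080.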
Of course this doesn’t work. It doesn’t work by design. But what happens if we add the CAP_NET_BIND_SERVICE capability? We can do this with the setcap utility in Linux. It may seem a little counter-intuitive that we add the capability to a file and not a user. But when you think it through, this makes a lot of sense for what we are doing with containers. When we create a container, we are going to be able to specify the exact capabilities for the application that starts in it.
But before we get to that, let’s confirm that capabilities even work.
    $ sudo setcap cap_net_bind_service=+ep /usr/sbin/httpd
    $ getcap /usr/sbin/httpd
    /usr/sbin/httpd = cap_net_bind_service+ep
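So where does setcap put the capability? As far as I can tell from the kernel headers, file capabilities live in an extended attribute named security.capability, packed as little-endian 32-bit words: a header word, followed by pairs of permitted and inheritable masks. Here is a rough Python sketch that reads and decodes it. The helper names are hypothetical and os.getxattr is Linux-only. The e in +ep corresponds to the effective flag in the header, and the p to a bit in the permitted mask.

```python
import os
import struct

VFS_CAP_REVISION_MASK = 0xFF000000   # top byte of the header: format revision
VFS_CAP_FLAGS_EFFECTIVE = 0x000001   # the "e" in +ep
CAP_NET_BIND_SERVICE = 10            # bit number from capabilities(7)


def parse_file_caps(raw):
    """Parse the raw bytes of a security.capability extended attribute."""
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    revision = (magic_etc & VFS_CAP_REVISION_MASK) >> 24
    # Revision 1 stores one 32-bit word per set; later revisions store two.
    words = 1 if revision == 1 else 2
    permitted = inheritable = 0
    for i in range(words):
        perm, inh = struct.unpack_from("<II", raw, 4 + 8 * i)
        permitted |= perm << (32 * i)
        inheritable |= inh << (32 * i)
    return {
        "revision": revision,
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": permitted,
        "inheritable": inheritable,
    }


def file_caps(path):
    """Return the parsed file capabilities of path, or None if it has none."""
    try:
        raw = os.getxattr(path, "security.capability")
    except OSError:
        return None  # no such attribute, no such file, or not supported
    return parse_file_caps(raw)


if __name__ == "__main__":
    caps = file_caps("/usr/sbin/httpd")
    if caps is None:
        print("no file capabilities set")
    else:
        has_bind = bool(caps["permitted"] & (1 << CAP_NET_BIND_SERVICE))
        print("effective:", caps["effective"], "net_bind_service:", has_bind)
```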
Now the httpd executable has the CAP_NET_BIND_SERVICE capability. Let’s take this puppy for a test drive!
    $ whoami
    jduncan
    $ httpd -d $(pwd) -DNO_DETACH
    $ sudo netstat --numeric-ports -tpl | grep httpd
    tcp6       0      0 [::]:80        [::]:*        LISTEN      18024/httpd
Holy crap! It looks like it may have worked! Just to be sure, we can curl whatever is listening on port 80 on localhost.
    $ curl localhost
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
    <title>Test Page for the Apache HTTP Server on Fedora</title>
    ...
And there we have it. We have Apache running on port 80 as a completely normal user. To be honest, I had to do a little work for httpd itself to start up. I had to copy around some config files and tweak some ownership of logs and pid files, etc. It has nothing to do with the port, however. It is just the stuff that httpd needs to do its job.
Capabilities with docker
The docker daemon has the ability to manage capabilities as well.
The docker run command has an option called --privileged. This gives the container every capability and access to the host’s devices, which lets it do all sorts of powerful things. It is painted with a VERY broad brush. But sometimes that is just what you have to do. We can also do something like we did above in a container.
By default, docker starts a container with a subset of capabilities turned on. This is documented at https://docs.docker.com/engine/reference/run/#/runtime-privilege-and-linux-capabilities. CAP_NET_BIND_SERVICE is already in that list. That is why containers can open up low port numbers already. These capabilities can be dropped with the --cap-drop option.
If a container needs a capability that is not in the default set, you can add it with the --cap-add option.
This is handled at run time, you may notice. So you can launch a dev version of a container and give it tons of power in a dev lab. But then launch the same container in production and give it an incredibly locked-down set of capabilities for that environment.
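To make the dev-versus-production idea concrete, here is a minimal Python sketch that builds docker run command lines from per-environment capability profiles. The --cap-add and --cap-drop flags are the real ones described above. The image name, the profile contents, and the docker_run_command helper are all made up for illustration.

```python
import shlex


def docker_run_command(image, cap_add=(), cap_drop=()):
    """Build the argument list for a docker run with capability flags."""
    argv = ["docker", "run"]
    for cap in cap_drop:
        argv += ["--cap-drop", cap]
    for cap in cap_add:
        argv += ["--cap-add", cap]
    argv.append(image)
    return argv


# Hypothetical profiles: a roomy one for a dev lab and a locked-down one
# for production that drops everything except the ability to bind low ports.
PROFILES = {
    "dev": {"cap_add": ["SYS_PTRACE", "NET_ADMIN"], "cap_drop": []},
    "prod": {"cap_add": ["NET_BIND_SERVICE"], "cap_drop": ["ALL"]},
}

if __name__ == "__main__":
    for name, caps in PROFILES.items():
        command = docker_run_command("example/web:latest", **caps)
        print(name + ":", shlex.join(command))
```

The same image is used in both cases. Only the flags passed at launch differ, which is the whole point of handling this at run time.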
Extending capabilities with OpenShift
This is all great if I am running a handful of containers on a host. But OpenShift is designed to serve multiple applications across large clusters at massive scale. We need a workflow that will let us associate these concepts with users in a multi-tenant system. We accomplish this with Security Context Constraints (SCCs).
SCCs allow you to control permissions inside a Kubernetes/OpenShift pod. Inside OpenShift, several SCCs are deployed out of the box.
    $ oc get scc
    NAME              PRIV   CAPS  SELINUX    RUNASUSER         FSGROUP    SUPGROUP   PRIORITY  READONLYROOTFS  VOLUMES
    anyuid            false        MustRunAs  RunAsAny          RunAsAny   RunAsAny   10        false           [configMap downwardAPI emptyDir persistentVolumeClaim secret]
    hostaccess        false        MustRunAs  MustRunAsRange    MustRunAs  RunAsAny   <none>    false           [configMap downwardAPI emptyDir hostPath persistentVolumeClaim secret]
    hostmount-anyuid  false        MustRunAs  RunAsAny          RunAsAny   RunAsAny   <none>    false           [configMap downwardAPI emptyDir hostPath nfs persistentVolumeClaim secret]
    hostnetwork       false        MustRunAs  MustRunAsRange    MustRunAs  MustRunAs  <none>    false           [configMap downwardAPI emptyDir persistentVolumeClaim secret]
    nonroot           false        MustRunAs  MustRunAsNonRoot  RunAsAny   RunAsAny   <none>    false           [configMap downwardAPI emptyDir persistentVolumeClaim secret]
    privileged        true         RunAsAny   RunAsAny          RunAsAny   RunAsAny   <none>    false           [*]
    restricted        false        MustRunAs  MustRunAsRange    MustRunAs  RunAsAny   <none>    false           [configMap downwardAPI emptyDir persistentVolumeClaim secret]
Let’s take a deeper look at one of these SCCs.
    $ oc describe scc restricted
    Name:                           restricted
    Priority:                       <none>
    Access:
      Users:                        <none>
      Groups:                       system:authenticated
    Settings:
      Allow Privileged:             false
      Default Add Capabilities:     <none>
      Required Drop Capabilities:   KILL,MKNOD,SYS_CHROOT,SETUID,SETGID
      Allowed Capabilities:         <none>
      Allowed Volume Types:         configMap,downwardAPI,emptyDir,persistentVolumeClaim,secret
      Allow Host Network:           false
      Allow Host Ports:             false
      Allow Host PID:               false
      Allow Host IPC:               false
      Read Only Root Filesystem:    false
      Run As User Strategy: MustRunAsRange
        UID:                        <none>
        UID Range Min:              <none>
        UID Range Max:              <none>
      SELinux Context Strategy: MustRunAs
        User:                       <none>
        Role:                       <none>
        Type:                       <none>
        Level:                      <none>
      FSGroup Strategy: MustRunAs
        Ranges:                     <none>
      Supplemental Groups Strategy: RunAsAny
        Ranges:                     <none>
There is a ton of great information in here. For example, the SCC used to launch an application in OCP defines whether or not it can use any of the host namespaces. But for this topic we care about three lines here.
Default Add Capabilities - this is a list of capabilities to add to a pod by default when it is being created.
Required Drop Capabilities - this is a list of capabilities to drop when creating a pod.
Allowed Capabilities - this is a list of other capabilities that applications affected by this SCC are allowed to use.
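Here is one way to picture how those three lists interact, as a simplified Python model. This is my own sketch and not how OpenShift implements admission. The dictionary keys mirror the SCC field names (defaultAddCapabilities, requiredDropCapabilities, allowedCapabilities), the runtime default set is the one Docker documents at the time of writing, and resolve_capabilities is a hypothetical helper.

```python
# Assumption: the container runtime's default capability set, taken from
# the Docker documentation at the time of writing.
DOCKER_DEFAULTS = {
    "CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "MKNOD", "NET_RAW",
    "SETGID", "SETUID", "SETFCAP", "SETPCAP", "NET_BIND_SERVICE",
    "SYS_CHROOT", "KILL", "AUDIT_WRITE",
}


def resolve_capabilities(runtime_defaults, scc, requested_add=(), requested_drop=()):
    """Model the capability set a container ends up with under an SCC.

    Raises PermissionError if the pod asks for a capability that the SCC
    neither allows nor adds by default.
    """
    default_add = set(scc.get("defaultAddCapabilities", []))
    required_drop = set(scc.get("requiredDropCapabilities", []))
    allowed = set(scc.get("allowedCapabilities", []))
    requested_add = set(requested_add)

    not_allowed = requested_add - allowed - default_add
    if not_allowed:
        raise PermissionError("not allowed by SCC: %s" % sorted(not_allowed))

    caps = set(runtime_defaults) | default_add | requested_add
    # Required drops always win, along with anything the pod drops itself.
    caps -= required_drop | set(requested_drop)
    return caps


if __name__ == "__main__":
    restricted = {
        "requiredDropCapabilities": ["KILL", "MKNOD", "SYS_CHROOT", "SETUID", "SETGID"],
    }
    print(sorted(resolve_capabilities(DOCKER_DEFAULTS, restricted)))
```

In this model, a pod running under the restricted SCC shown above keeps NET_BIND_SERVICE but loses KILL, MKNOD, SYS_CHROOT, SETUID, and SETGID, and a request to add something like NET_ADMIN is rejected outright.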
SCCs are defined with YAML, like everything else in OpenShift.
    kind: SecurityContextConstraints
    apiVersion: v1
    metadata:
      name: scc-admin
    allowPrivilegedContainer: true
    requiredDropCapabilities:
    - KILL
    - MKNOD
    - SYS_CHROOT
    runAsUser:
      type: RunAsAny
    seLinuxContext:
      type: RunAsAny
    fsGroup:
      type: RunAsAny
    supplementalGroups:
      type: RunAsAny
    users:
    - my-admin-user
    groups:
    - my-admin-group
In this example, the scc-admin SCC allows privileged containers, but they would not have the KILL, MKNOD, and SYS_CHROOT capabilities.
Using our new SCC
SCCs are managed by cluster administrators in OpenShift. It is not a permission that everyone has access to. But you can create service accounts that have access to one or more SCCs. They can then use these SCCs to create applications with the exact security profiles they need to have.
I am not going to get into service accounts here. If you would like to dig into them, they are documented at https://docs.openshift.com/container-platform/3.4/dev_guide/service_accounts.html#dev-guide-service-accounts.
Putting it all together
There we are. That is a quick stroll through how Linux capabilities can be leveraged by containers in OpenShift.
Linux capabilities allow for very fine-grained access to administrative-level functions for applications.
Docker has a mechanism to add or remove these capabilities when containers are created.
Kubernetes and OpenShift take this further with the concept of Security Context Constraints that allow for large-scale control of application clusters.