Hey, everybody! Welcome back to the second post in my Kubernetes deployment series.
Here are the links to all three posts in this series:
Real-World Kubernetes Deployments Part 1 - Cloud Native CI/CD
Real-World Kubernetes Deployments Part 2 - Cloud Native CI/CD
Real-World Kubernetes Deployments Part 3 - Cloud Native CI/CD
In my previous post we investigated various Kubernetes manifest directives that help manage containers throughout the pod lifecycle. In that post we looked at how Kubernetes manages a rolling update and verifies pod health status in order to self-heal deployments when issues are encountered.
While that previous post explored the pod deployment process, we never actually generated any issues on purpose, such as containers crashing. Kubernetes will self-heal deployments when containers produce errors and/or crash. As such, the second post in this trilogy features a container image I created that does the following:

- Customizes the content returned by the apex route
- Delays application server startup
- Crashes on launch with a configurable frequency
- Crashes on demand via a URL path
- Returns health status output
- Alters health status output with a configurable frequency
- Resolves and requests a Kubernetes service
All of these features should be pretty self-explanatory aside from the last. I’ve been researching cross-availability-zone traffic within Kubernetes and wanted a way to determine how the Kubernetes proxy handles requests to services. I’ll explore cross-AZ traffic in a later post.
The container files used for this image can be found in the following GitHub repo.
https://github.com/trek10inc/probeserver
With all of that said, let’s look at the features built into this container and how we’ll go about using them!
Customize apex route response content
For the purposes of this blog post series, configuring what the application server includes in its response provides a way to distinguish between different commits being deployed to a cluster. Configuring the apex route’s response simply requires setting an environment variable.
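Under the hood, a feature like this is just an environment-variable lookup with a fallback. Here is a minimal sketch in Python; the CONTENT variable name matches the manifests later in this post, but the fallback value is my own placeholder, not the container's actual default:

```python
import os

def apex_response(env=os.environ):
    # Return the body served at '/'. CONTENT matches the manifests in
    # this post; the fallback value is illustrative, not the
    # container's actual default.
    return env.get("CONTENT", "probeserver")
```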
Delaying application server startup
Configuring the container to pause before starting an application server is meant to simulate the behavior of a production container taking a few seconds before being able to accept requests. This feature is also configured by setting an environment variable.
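A startup delay like this can be sketched as a sleep before the server loop begins; the START_WAIT_SECS variable name matches the manifests below, while the zero-second fallback is an assumption:

```python
import os
import time

def delayed_start(start_server, env=os.environ):
    # Sleep for START_WAIT_SECS seconds (zero if unset) before handing
    # control to the real server loop.
    time.sleep(int(env.get("START_WAIT_SECS", "0")))
    start_server()
```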
Crash on launch
Configuring the container to crash with a certain frequency is meant to simulate a buggy container that made its way into a production environment. Definitely not unheard of in today’s fast-paced agile development environments. Again, this feature is configured by setting an environment variable.
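The exact semantics live in the repo, but a crash factor is commonly a percentage chance. A sketch under that assumption:

```python
import random
import sys

def maybe_crash(env, rng=random.random):
    # Treat CRASH_FACTOR as a percentage: a value of 40 gives the
    # container a 40% chance of exiting non-zero at launch. (The
    # percentage interpretation is an assumption; the repo holds the
    # exact semantics.)
    if rng() * 100 < int(env.get("CRASH_FACTOR", "0")):
        sys.exit(1)
```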
Crash on URL path
I wanted a way to crash the container at any point after the application server has started, the thought being that at some point I might want to get into some chaos testing for future POCs. Accessing the ‘/crash’ endpoint of the application server achieves this.
Health status output
Kubernetes liveness and readiness probes require some form of health check. For the sake of simulating web applications, I configured the container’s application server to return a 200 status code and a brief JSON object when a request is made to the ‘/healthz’ endpoint to simulate an application’s (healthy) health check.
Alter health status output
I wanted a way to simulate a bad health check status within each pod. As such, I worked in a mechanism similar to what was used to crash the container outright into the health check endpoint to produce 500 status codes randomly in responses from the container’s application server. This feature is also configured by setting an environment variable.
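Putting the last two features together, the ‘/healthz’ handler can be sketched as returning a 200 by default and a 500 with some configurable probability; treating HEALTH_STATUS_FACTOR as a percentage is an assumption on my part:

```python
import random

def healthz(env, rng=random.random):
    # Return (status_code, body) for '/healthz'. HEALTH_STATUS_FACTOR
    # is treated as a percentage chance of returning a 500 -- an
    # assumption, since the exact semantics live in the repo.
    factor = int(env.get("HEALTH_STATUS_FACTOR", "0"))
    if rng() * 100 < factor:
        return 500, '{ "status": "error" }'
    return 200, '{ "status": "ok" }'
```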
Resolve and request a Kubernetes service
Again, this feature is meant for researching Kubernetes proxy behavior and its effects on cross-availability zone traffic routing. It won’t be used in this post. Regardless, this feature takes in a hostname, resolves it, makes an HTTP (note: not secure HTTPS) request, and then outputs some information about the container servicing the request, the IP address of the resolved hostname, and the response provided by the remote service. A good way to test this is to run it against “checkip.dyndns.org”.
This ends up looking like the following when run.
$ curl http://127.0.0.1/resolve?service=checkip.dyndns.org
----------------------------------------------------
Hostname = 1bc526ba48ce
Ip Address = 127.0.0.1
Service hostname = checkip.dyndns.org
Service IP = 158.101.44.242
----------------------------------------------------
<html><head><title>Current IP Check</title></head><body>Current IP Address: 174.51.70.14</body></html>
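The resolve-then-request flow can be sketched with the standard library; this illustrates the idea, not the container's actual code:

```python
import socket
import urllib.request

def resolve_service(hostname):
    # Resolve the hostname to an IPv4 address ourselves, as the
    # container does before issuing its request.
    return socket.gethostbyname(hostname)

def fetch_via_ip(ip, host):
    # Plain HTTP (not HTTPS) request against the resolved IP; the Host
    # header carries the original name so virtual hosting still works.
    req = urllib.request.Request("http://%s/" % ip, headers={"Host": host})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode()
```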
Putting all of this together, in conjunction with what we learned from the previous post, in the form of a Kubernetes manifest looks like the following. Please note that altering the health status being returned by the application server was not included in this manifest. To keep things simple(ish), we’ll work through that later.
kind: Deployment
apiVersion: apps/v1
metadata:
  name: prober-foo
  namespace: default
  labels:
    app: prober-foo
    deployment: foo
    env: prod
spec:
  progressDeadlineSeconds: 60
  replicas: 3
  selector:
    matchLabels:
      app: prober-foo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 30%
      maxSurge: 1
  template:
    metadata:
      labels:
        app: prober-foo
        deployment: foo
        env: prod
        version: 1.0.1
    spec:
      containers:
        - image: "public.ecr.aws/i4a3l2a7/probeserver:latest"
          name: prober-foo
          env:
            - name: START_WAIT_SECS
              value: '15'
            - name: CRASH_FACTOR
              value: '40'
            - name: CONTENT
              value: '{ "team": "foo", "version": "1.0.1" }'
          livenessProbe:
            httpGet:
              path: /healthz
              port: 80
            initialDelaySeconds: 20
            successThreshold: 1
            failureThreshold: 1
          readinessProbe:
            httpGet:
              path: /healthz
              port: 80
            initialDelaySeconds: 5
            successThreshold: 1
            failureThreshold: 1
You’ll notice that I’ve placed my container image in AWS’ public container registry. It currently lives at “public.ecr.aws/i4a3l2a7/probeserver:latest”. Be advised this may not be the case at some point in the future.
You may also notice that I’ve set three environment variables in the deployment manifest: START_WAIT_SECS, CRASH_FACTOR, and CONTENT. These delay the application server’s startup, introduce crashes at launch, and set the apex route’s response, respectively.
Additionally, we’ll accompany this deployment manifest with a service so we can access the application being launched from listening ports on each of the worker nodes. We’ll use the following to do so.
apiVersion: v1
kind: Service
metadata:
  name: foo-nodeport-svc
  labels:
    app: prober-foo
    deployment: foo
    env: prod
    version: 1.0.1
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
      nodePort: 30080
  selector:
    deployment: foo
    env: prod
  type: NodePort
After applying the manifest to a Kubernetes cluster and periodically capturing the deployment pod statuses you will initially see output like the following.
$ kubectl get pods
NAME                          READY   STATUS              RESTARTS   AGE
prober-foo-67b66bdbf7-4bxjb   0/1     ContainerCreating   0          1s
prober-foo-67b66bdbf7-7dnzd   0/1     ContainerCreating   0          1s
prober-foo-67b66bdbf7-hw78w   0/1     ContainerCreating   0          1s
We know the application server won’t be started immediately (due to our START_WAIT_SECS environment variable) so we expect to see the pods running but not yet in the ready state. Something like the following should be seen when checking pod statuses.
NAME                          READY   STATUS    RESTARTS   AGE
prober-foo-67b66bdbf7-4bxjb   0/1     Running   0          4s
prober-foo-67b66bdbf7-7dnzd   0/1     Running   0          4s
prober-foo-67b66bdbf7-hw78w   0/1     Running   0          4s
Looking very closely at the “RESTARTS” column, you will see some of the pods restarting. This is where the CRASH_FACTOR environment variable is causing the container to fail and Kubernetes is restarting it.
NAME                          READY   STATUS    RESTARTS     AGE
prober-foo-67b66bdbf7-4bxjb   0/1     Running   0            19s
prober-foo-67b66bdbf7-7dnzd   0/1     Running   0            19s
prober-foo-67b66bdbf7-hw78w   0/1     Running   1 (1s ago)   19s
Checking pod status again, you may see some of the pods in a crash backoff loop. This is where Kubernetes is applying an exponential back-off delay when restarting failed pods. As we know the container won’t fail to launch 100% of the time, it should eventually achieve a ready state.
NAME                          READY   STATUS             RESTARTS      AGE
prober-foo-67b66bdbf7-4bxjb   0/1     CrashLoopBackOff   2 (4s ago)    70s
prober-foo-67b66bdbf7-7dnzd   1/1     Running            1 (53s ago)   70s
prober-foo-67b66bdbf7-hw78w   0/1     Running            2 (36s ago)   70s
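For reference, the kubelet’s restart back-off follows a simple doubling pattern, starting at ten seconds and capped at five minutes (the documented defaults):

```python
def backoff_delay(restart_count):
    # Kubelet restart back-off: 10s, 20s, 40s, ... doubling each time
    # and capped at five minutes (the documented defaults).
    return min(10 * 2 ** restart_count, 300)
```

The back-off timer resets once a container has run successfully for a while, which is why pods that only crash occasionally don’t stay stuck in CrashLoopBackOff.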
And after a few short minutes, you should see all of the pods successfully launched and in the “ready” state.
NAME                          READY   STATUS    RESTARTS       AGE
prober-foo-67b66bdbf7-4bxjb   1/1     Running   3 (33s ago)    1m44s
prober-foo-67b66bdbf7-7dnzd   1/1     Running   1 (2m8s ago)   1m44s
prober-foo-67b66bdbf7-hw78w   1/1     Running   3 (38s ago)    1m44s
Accessing the application’s apex route via its service port on each worker node yields the following.
$ curl http://192.168.0.231:30080
{ "team": "foo", "version": "1.0.1" }
$ curl http://192.168.0.232:30080
{ "team": "foo", "version": "1.0.1" }
Accessing the application’s “/healthz” route also returns what we expect.
$ curl http://192.168.0.231:30080/healthz
{ "status": "ok" }
$ curl http://192.168.0.232:30080/healthz
{ "status": "ok" }
Now that we’ve verified the deployment was successful and operating correctly, let’s crash one of the pods and watch it restart.
$ curl http://192.168.0.232:30080/crash
curl: (52) Empty reply from server
$ kubectl get pods
NAME                          READY   STATUS      RESTARTS       AGE
prober-foo-67b66bdbf7-4bxjb   1/1     Running     3 (33s ago)    5m30s
prober-foo-67b66bdbf7-7dnzd   1/1     Running     1 (2m8s ago)   5m30s
prober-foo-67b66bdbf7-hw78w   0/1     Completed   4 (1s ago)     5m30s
Checking events shows us that the readiness probe failed for the “prober-foo-67b66bdbf7-hw78w” pod, prompting Kubernetes to pull the container image in order to restart it.
$ kubectl get events --sort-by=.metadata.creationTimestamp | grep prober-foo | grep -e Unhealthy -e Pulled
16s   Warning   Unhealthy   pod/prober-foo-67b66bdbf7-hw78w   Readiness probe failed: Get "http://10.44.0.3:80/healthz": dial tcp 10.44.0.3:80: connect: connection refused
22s   Normal    Pulled      pod/prober-foo-67b66bdbf7-hw78w   Successfully pulled image "public.ecr.aws/i4a3l2a7/probeserver:latest" in 781.431535ms
Waiting about a minute shows us that the crashed pod has recovered. Note the incremented value in the RESTARTS column.
NAME                          READY   STATUS    RESTARTS       AGE
prober-foo-67b66bdbf7-4bxjb   1/1     Running   3 (33s ago)    6m27s
prober-foo-67b66bdbf7-7dnzd   1/1     Running   1 (2m8s ago)   6m27s
prober-foo-67b66bdbf7-hw78w   1/1     Running   4 (58s ago)    6m27s
Great success! Everything went according to plan. During this short exercise we witnessed the test container exhibit the following behavior:

- Delayed application server startup, leaving pods Running but not yet Ready
- Random crashes at launch, with Kubernetes restarting the failed containers
- An on-demand crash via the “/crash” endpoint, followed by Kubernetes self-healing the pod
So as promised, we’ll also look at altering the health status being returned by the application server. We’ll be looking to verify that Kubernetes will self-heal pods failing liveness probes by restarting them.
The following deployment manifest will be utilized for this exercise. Compared to the original deployment manifest, these changes were introduced:

- The CRASH_FACTOR environment variable was replaced with HEALTH_STATUS_FACTOR
- The version label and the CONTENT value were bumped to 1.0.2
- The liveness probe’s failureThreshold was raised to 2, and a periodSeconds of 3 was added
kind: Deployment
apiVersion: apps/v1
metadata:
  name: prober-foo
  namespace: default
  labels:
    app: prober-foo
    deployment: foo
    env: prod
spec:
  progressDeadlineSeconds: 60
  replicas: 3
  selector:
    matchLabels:
      app: prober-foo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 30%
      maxSurge: 1
  template:
    metadata:
      labels:
        app: prober-foo
        deployment: foo
        env: prod
        version: 1.0.2
    spec:
      containers:
        - image: "public.ecr.aws/i4a3l2a7/probeserver:latest"
          name: prober-foo
          env:
            - name: START_WAIT_SECS
              value: '15'
            - name: HEALTH_STATUS_FACTOR
              value: '40'
            - name: CONTENT
              value: '{ "team": "foo", "version": "1.0.2" }'
          livenessProbe:
            httpGet:
              path: /healthz
              port: 80
            initialDelaySeconds: 20
            successThreshold: 1
            failureThreshold: 2
            periodSeconds: 3
          readinessProbe:
            httpGet:
              path: /healthz
              port: 80
            initialDelaySeconds: 5
            successThreshold: 1
            failureThreshold: 1
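With periodSeconds: 3 and failureThreshold: 2 on the liveness probe, the kubelet needs two consecutive failed checks, roughly six seconds in the worst case, before restarting the container:

```python
def time_to_restart(period_seconds, failure_threshold):
    # Worst-case time from the first failed probe until the kubelet has
    # seen `failure_threshold` consecutive failures and restarts the
    # container (ignoring probe timeouts).
    return period_seconds * failure_threshold
```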
Applying this updated manifest will force a rolling update for the deployment. Once the update has been completed we will look at pod status again to see if we witness any restarts.
Sure enough, we’re seeing all of the pods periodically restarting.
$ kubectl get pods
NAME                          READY   STATUS    RESTARTS      AGE
prober-foo-764cf9f454-5pqcx   1/1     Running   1 (88s ago)   2m43s
prober-foo-764cf9f454-8fsnc   0/1     Running   3 (3s ago)    3m3s
prober-foo-764cf9f454-8gjl4   0/1     Running   2 (10s ago)   2m13s
Making requests to the application’s “/healthz” endpoint shows that it is not always returning a healthy status code.
$ while true; do curl http://192.168.0.231:30080/healthz; sleep 1; done
{ "status": "ok" }
{ "status": "ok" }
{ "status": "error" }
{ "status": "ok" }
{ "status": "ok" }
{ "status": "error" }
{ "status": "ok" }
{ "status": "ok" }
{ "status": "error" }
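Assuming HEALTH_STATUS_FACTOR acts as a simple per-request failure percentage, a restart requires failureThreshold: 2 consecutive failed liveness probes, which only happens about 16% of the time for any given pair of checks. That is why each pod keeps running for stretches between restarts:

```python
def consecutive_failure_chance(factor_percent, failure_threshold):
    # Probability that `failure_threshold` consecutive probes all fail,
    # assuming independent checks that each fail `factor_percent`% of
    # the time (the percentage interpretation is an assumption).
    return (factor_percent / 100) ** failure_threshold
```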
Searching through events shows us what was happening under the hood.
$ kubectl get events --sort-by=.metadata.creationTimestamp | grep prober-foo | grep -e Unhealthy -e Killing
2m24s   Warning   Unhealthy   pod/prober-foo-764cf9f454-5pqcx   Liveness probe failed: HTTP probe failed with statuscode: 500
3m3s    Warning   Unhealthy   pod/prober-foo-764cf9f454-8fsnc   Liveness probe failed: HTTP probe failed with statuscode: 500
3m7s    Normal    Killing     pod/prober-foo-764cf9f454-8fsnc   Container prober-foo failed liveness probe, will be restarted
3m3s    Normal    Killing     pod/prober-foo-764cf9f454-5pqcx   Container prober-foo failed liveness probe, will be restarted
Kubernetes saw liveness probes fail and then restarted the unhealthy pods. Exactly what we were looking to see happen!
That about wraps up the intended purpose of this blog post. I was able to create a container that lets me experiment with the Kubernetes pod lifecycle and understand how Kubernetes manages it. I’m pretty happy with the outcome and hope that you’ll find value in it for your own POCs.
So once again, thanks for hanging out with me for a bit! Stay tuned for the next installment of this series, where we’ll work on creating an AWS cloud-native CI/CD pipeline to facilitate deployments to an EKS cluster.