An Open Post-Mortem: Investigating a GKE Cluster Connection Bug

This post covers how the Crossplane community addressed a recent minor bug that was shipped in stack-gcp v0.4.0. While it was a routine bug, detailing our review of the issue and the implementation of the subsequent fix is useful to illustrate how Crossplane works under the hood.

On Friday, January 3rd, Crossplane community member Suraj Banakar opened an issue regarding scheduling workloads to a GKE cluster provisioned by Crossplane using the new v1beta1 implementation. Suraj reached out in our Slack channel, and after we followed up with him it immediately became apparent that there was a bug in some of the work that had been done on stack-gcp a few weeks earlier. The community quickly organized to address the problem, discuss potential solutions, and decide on a direction. We were able to merge a fix within 14 hours.

What happened?

Some background on Crossplane is useful to understand exactly what happened in this situation. Crossplane allows users to provision infrastructure across cloud providers using Kubernetes custom resources. This infrastructure could be any type of compute or storage, including managed Kubernetes services.

Crossplane also allows you to consume these infrastructure resources from within a Kubernetes cluster. If the infrastructure happens to be another Kubernetes cluster, then you can deploy Kubernetes-native resources to that cluster using our KubernetesApplication custom resource.

Suraj, the user who raised the issue, executed the following steps to reach the point at which he encountered the bug:

  1. Provisioned and connected to a Kubernetes cluster running somewhere.
  2. Installed the Crossplane and stack-gcp operators and CustomResourceDefinitions.
  3. Dynamically provisioned a GKECluster using a GKEClusterClass and a KubernetesCluster claim.
  4. Statically provisioned a GKE NodePool and attached it to the running GKE cluster.
  5. Examined the connection Secret that was propagated to the Namespace of his KubernetesCluster claim.

Upon doing so, he noticed that the username and password fields of the Secret were empty. At this point, if he had tried to schedule application resources to his new GKECluster by referencing his KubernetesCluster claim, he would have been unsuccessful. However, he had not reached that point yet, and actually had discovered something interesting that in and of itself was not a bug. To understand why, a general knowledge of Kubernetes authentication is helpful.

I will not go into extensive depth on this topic, as many have already done quite a good job of doing that. One of the best descriptions is from the Kubernetes docs themselves. Importantly, authentication isn't the same thing as authorization. If you are interested in learning more about authorization, I highly recommend this guide from Bitnami, which walks through not only how RBAC works in Kubernetes, but practically shows you how to use it.

What we need to focus on for this post is the four main ways to authenticate to the Kubernetes API server in GKE. Briefly, they are:

  • Basic Authentication: your traditional username and password authentication with a static CSV file. This is not a recommended strategy, and is considered less secure than other authentication methods.
  • Client Certificate: an x509 certificate that has been signed by the Certificate Authority and pre-provisioned on the cluster. This strategy is also no longer recommended by GKE.
  • Service Account Bearer Token: a token (JWT) issued to a Kubernetes ServiceAccount and signed by the cluster's Certificate Authority.
  • Google as OpenID Connect Provider (OAuth): an ID token (JWT) signed by Google that contains specific claims about the authenticated user.

In Crossplane's GKECluster API prior to the v1beta1 version, we used basic authentication but did not expose any authentication configuration to the user. This means that Crossplane used a default username and disabled issuance of the client certificate. It then propagated the username and the Google-generated password back in the form of a connection Secret. Crossplane was then able to use that Secret to provision resources to the cluster as a cluster-admin. This changed with v1beta1 GKE clusters because we want our managed resources to be a high-fidelity representation of the cloud provider API. Therefore, if a provider exposes configuration, we don't want to make decisions for the user about the values provided for that configuration. We err on the side of verbosity rather than simplicity, believing sane defaults can be deferred to a higher level of abstraction. In this situation, if Google decides that an infrastructure administrator can enable basic authentication, we want to offer exactly the same capability, with no degradation in functionality.
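To make that concrete, here is a rough sketch, in Go, of how the pre-v1beta1 behavior maps onto the GKE API using the google.golang.org/api/container/v1 client. The function name and the default username are illustrative, not the exact values stack-gcp used:

```go
package gke

import container "google.golang.org/api/container/v1"

// buildClusterPreV1beta1 illustrates the old behavior: a hard-coded default
// username, client certificate issuance disabled, and no password, so that
// Google generates one for us.
func buildClusterPreV1beta1(name string) *container.Cluster {
	return &container.Cluster{
		Name: name,
		MasterAuth: &container.MasterAuth{
			// Hypothetical default; the real value was not chosen by the user.
			Username: "admin",
			// Password is left empty; GKE generates one and returns it in the
			// observed cluster's MasterAuth, which was then copied into the
			// connection Secret.
			ClientCertificateConfig: &container.ClientCertificateConfig{
				IssueClientCertificate: false,
			},
		},
	}
}
```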

However, the community is still designing a system for securely providing credentials for a managed service at provisioning time. For this reason, we don't want to expose fields like usernames and passwords in our resource specs. While implementing the v1beta1 version of the GKECluster resource, the decision was made to omit both of these fields and to allow a user to enable issuance of a client certificate that would be propagated back to the connection Secret (this follows the tenets above by striving for full fidelity while not exposing sensitive fields in the spec).
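Continuing the sketch above, the v1beta1 equivalent drops the username entirely and only forwards the user's choice about client certificate issuance (again, the function name is illustrative):

```go
package gke

import container "google.golang.org/api/container/v1"

// buildClusterV1beta1 reflects the v1beta1 design: no username or password
// anywhere in the spec, only a user-controlled switch for client certificate
// issuance.
func buildClusterV1beta1(name string, issueClientCertificate bool) *container.Cluster {
	return &container.Cluster{
		Name: name,
		MasterAuth: &container.MasterAuth{
			ClientCertificateConfig: &container.ClientCertificateConfig{
				IssueClientCertificate: issueClientCertificate,
			},
		},
	}
}
```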

The issue here is not one of security, at least not for the moment. Basic authentication and client certificates are certainly not the most secure ways to authenticate to a cluster, but they are being used with this knowledge in mind for the time being.

As mentioned previously, Crossplane uses the information in the connection Secret to deploy workloads into a remote Kubernetes cluster using a KubernetesApplication. We put all connection information returned to us by the cloud provider into the Secret, and GKE returns both the username/password combination and the client certificate. If a user wanted to schedule a KubernetesApplication to the cluster they provisioned, they should be able to provision a GKECluster with issueClientCertificate: true and then let Crossplane use that certificate to authenticate to the cluster. This is where we ran into a problem: the client certificate is issued with CN=client, a user with no RBAC permissions. This means that while we could authenticate to the cluster, we were not authorized to do anything. Crossplane certainly wouldn't be able to create resources on a user's behalf.
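For illustration, this is roughly how a client can be built from the certificate material in such a connection Secret using client-go; the Secret key names here are assumptions for the sketch, not necessarily the exact keys Crossplane uses:

```go
package gke

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// clientFromConnectionSecret builds a Kubernetes client from the client
// certificate material in a connection Secret. Authenticating this way works,
// but because the certificate is issued for CN=client, a user with no RBAC
// bindings, every write request is rejected as unauthorized.
func clientFromConnectionSecret(endpoint string, s *corev1.Secret) (kubernetes.Interface, error) {
	cfg := &rest.Config{
		Host: endpoint,
		TLSClientConfig: rest.TLSClientConfig{
			// Illustrative key names for the CA, certificate, and private key.
			CAData:   s.Data["clusterCA"],
			CertData: s.Data["clientCert"],
			KeyData:  s.Data["clientKey"],
		},
	}
	return kubernetes.NewForConfig(cfg)
}
```

A client built this way connects successfully, which is exactly why the failure only surfaces once Crossplane tries to create resources in the cluster.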

The decision to eliminate the username and password functionality for the time being was not a bug, but rather a thoughtful design choice. One could even argue that the fact that the client certificate had no RBAC permissions wasn't wrong, and that it is not incorrect for Crossplane to be unable to deploy resources into the cluster. However, it was a bug because that wasn't the intention of the change. The intention was to be able to provision resources to the GKE cluster from Crossplane as long as issueClientCertificate: true was in the resource spec. Software that behaves unexpectedly is most certainly a bug, regardless of the magnitude of impact it has on the rest of a system.

How did we fix it?

The current fix will likely leave some unsatisfied, but it accomplishes our stated goal for now and enables the community to use Crossplane to its full potential. At a different stage of maturity we might make a different decision, but at our current stage, we felt this was the right direction.

The fix consisted of adding a username field that allows a user to supply their own master username. If supplied, we rely on Google to auto-generate a master password, then propagate both the username and the password back in the connection Secret. This allows a user to provision clusters with basic authentication enabled or disabled, while also staying true (i.e. high fidelity) to the cloud provider API. In order to schedule a KubernetesApplication to a GKE cluster, a user must enable basic authentication, but they also have the option of a more secure configuration if they intend to use the cluster outside of Crossplane.
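The shape of the fix, sketched in the same style (function, field, and key names are illustrative): basic authentication is only enabled when the user supplies a username, and the generated password is read back from the observed cluster and written to the connection Secret.

```go
package gke

import container "google.golang.org/api/container/v1"

// buildClusterWithOptionalUsername sketches the fix: basic authentication is
// only enabled when the user supplies a master username, and the password is
// always left for GKE to generate.
func buildClusterWithOptionalUsername(name string, username *string, issueClientCertificate bool) *container.Cluster {
	auth := &container.MasterAuth{
		ClientCertificateConfig: &container.ClientCertificateConfig{
			IssueClientCertificate: issueClientCertificate,
		},
	}
	if username != nil {
		auth.Username = *username
	}
	return &container.Cluster{Name: name, MasterAuth: auth}
}

// connectionDetails maps the observed cluster's master auth back into
// connection Secret data (key names illustrative). With basic auth enabled,
// both fields are populated; otherwise they are empty.
func connectionDetails(observed *container.Cluster) map[string][]byte {
	if observed.MasterAuth == nil {
		return nil
	}
	return map[string][]byte{
		"username": []byte(observed.MasterAuth.Username),
		"password": []byte(observed.MasterAuth.Password),
	}
}
```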

In the future, we will be exploring more secure methods of authenticating to GKE clusters (and all other managed Kubernetes offerings). When correcting bugs in Crossplane releases, we try to balance shipping a fix in a timely manner with ensuring no degradation of functionality or security. In this case, we were able to accomplish this goal, and the PR was merged within 14 hours of the initial issue being opened.

What are we doing to make sure that it does not happen again?

The previous implementation of the GKECluster type even had a comment explaining why username and password must be enabled for Crossplane's KubernetesApplication to work properly. This is a perfect illustration of the need to establish systems for ensuring quality, rather than relying on human intervention. We believe very strongly that errors are the result of process failures, not people failures. Just trying to be more thorough with our code and reviews would be a naive strategy for improving the quality of our software moving forward.

This bug was a result of a lack of testing. While we have a robust suite of unit tests on almost all of the code in the Crossplane ecosystem, we still rely on manual integration testing with ad-hoc scenarios. With our manual testing, we do not have a hard and fast set of scenarios that we run on a regular interval, but instead formulate a set of scenarios in isolation when a large change is made to any code. To be perfectly honest, this inexact science has served us decently well in terms of ensuring quality in our releases, though it may have come at the expense of some community members' sleep schedules. However, as contributions and project breadth continue to grow, this system is breaking down. An engineer with the best of intentions is bound to miss a bug or two when substantially modifying any functionality. This process needs automation.

Fortunately, we have been working on building out an integration testing framework that wraps Kubernetes controller-runtime's envtest package. The added functionality, which includes installing CRDs from remote locations, registering an arbitrary number of controllers prior to a sequence of tests, and configurable authentication to any Kubernetes API server to run the tests against, allows us to write integration tests that can execute repeatable complex scenarios as easily as we write our unit tests. We have already begun implementation of integration tests in stack-gcp using this framework. These tests will be executed on a regular interval using our Jenkins infrastructure, ensuring that we catch any quality degradation soon after any new commits are merged. We will continue to iterate on these tests, establishing standards for when new tests should be implemented and how often the pipelines should be run. If you are interested in contributing to this effort, there are issues open across all of the Crossplane GitHub repositories (including stack-aws, stack-azure, and stack-gcp).
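To give a flavor of what these tests look like, here is a minimal sketch built directly on controller-runtime's envtest package; the test name and CRD path are illustrative, and the framework described above layers remote CRD installation and controller registration on top of this:

```go
package integration

import (
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// TestGKEClusterConnectionSecret starts a temporary API server with the
// stack-gcp CRDs installed so a reconciliation scenario can be exercised as
// repeatably as a unit test.
func TestGKEClusterConnectionSecret(t *testing.T) {
	env := &envtest.Environment{
		// Illustrative local path; the framework described above can also
		// install CRDs from remote locations.
		CRDDirectoryPaths: []string{"../../config/crd"},
	}

	cfg, err := env.Start()
	if err != nil {
		t.Fatalf("cannot start test environment: %v", err)
	}
	defer env.Stop()

	// cfg is a *rest.Config for the test API server. From here we would
	// register the GKECluster controller and assert that the connection
	// Secret it publishes contains the credentials we expect.
	_ = cfg
}
```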

Parting thoughts on blameless post-mortems

A lot has been written about the concept of a blameless post-mortem, a practice and set of values we aim to implement and adhere to in the Crossplane community. The fact that Crossplane is an open source project makes our post-mortems unique in that we can reference exact pull requests, Slack messages, and lines of code. We believe that being radically open about the work that we do and by critically assessing the events that led to a bug or vulnerability demonstrates trust, respect, and aligned interests in a common goal. Post-mortems require not just identifying and fixing problems, but developing a system and process to ensure that similar shortcomings don't occur in the future.

If you have questions about this post, the systems we are designing, or general community feedback, please feel free to open issues, join us on Slack, or show up at our biweekly community meeting. All are welcome here.

Our mission to create an "open multicloud control plane" is lofty, wide-reaching, and highly impactful. We hold ourselves to an extremely high standard as stewards of that mission and invite our community to hold us accountable. In the Crossplane community, we believe a rising tide lifts all boats. We hope that this post makes the work you do at least a little bit easier.

Note to readers: Contributors are not expected to write a public post-mortem when they are not comfortable doing so. We do not see post-mortems as an opportunity to focus on individual failures, but rather as a way of instilling accountability for the process and systems we use to guide our work.
