Solving the Crossplane Provider CRD Scaling Problem with Provider Families
TL;DR - Summary
- We have listened to the Crossplane community’s challenges with control plane performance due to the large number of CRDs installed by Crossplane providers
- Starting today, the new Provider Families available in the marketplace bring major relief by letting you install only the resources that are important for your platform’s needs
- Read on for more details…
Crossplane loves CRDs…maybe a little too much
As a “universal” control plane, a major pillar of Crossplane functionality is its extensibility. It is possible to essentially teach Crossplane to manage anything that has an API. However, this broad ability of Crossplane to manage so many things has also introduced performance concerns within the control plane due to the large number of CustomResourceDefinitions (CRDs) that get installed. Most Kubernetes add-ons install no more than tens of CRDs. Crossplane, on the other hand, frequently installs multiple hundreds of CRDs or even thousands in some cases. For example, Upbound’s official provider-aws installs more than 900 just by itself!
This massive number of CRDs could often lead to unstable and unresponsive control planes while the new resource definitions were being processed. The control plane nodes and machinery would undergo scaling-up operations to meet this higher demand on the cluster’s resources. During this scale-up period, which could last up to an hour, the API would be unresponsive, and most requests would fail. Obviously, this is not a good experience for clients, other workloads, and operators running in the cluster.
Going where no add-on has gone before
While the machinery of extending Kubernetes via CRDs is quite clever and beneficial to the ecosystem as a whole (thanks @the_sttts!), the number of CRDs within a single cluster was never officially included as a Kubernetes scaling threshold. The upper limits for the number of nodes, pods, namespaces, etc., have been tested and official guidance has been published for these scaling dimensions. Still, the number of CRDs has never needed to be explored…until Crossplane’s “universal control plane” for everything blew well past any other project’s usage of CRDs that came before it.
The Crossplane contributor community put a lot of work into solving this problem in upstream Kubernetes. We took a few deep dives and found several inefficiencies and performance-restricting issues, then submitted fixes and proposals for addressing these to the community. These fixes, outlined in depth in this blog post, were well received and eventually included in recent versions of Kubernetes client and server releases.
However, these upstream fixes take a long time to roll out to all the distributions and managed services that run Kubernetes, and on their own they were insufficient to overcome the growing problem. At the same time, the Crossplane provider ecosystem was rapidly expanding, and provider vendors like Upbound were investing in code generation to automatically add support for new resources to their providers. The massive increase in CRDs enabled by this automation outpaced the gains made by the upstream improvements, so the performance concerns were never fully resolved.
Light at the end of the tunnel
The Crossplane community suffered from these performance issues and we heard feedback from many affected folks. This feedback was very helpful for deepening our understanding of the problem space and kept leading us toward a more sustainable solution. Feedback was so common that just one of the tracking issues for this area is still the issue with the most “reactions” of all time in the Crossplane issue backlog. We clearly needed to take further steps to supplement the effort we were making in upstream Kubernetes.
We deeply explored many possible solutions for how to address the issue within the Crossplane project. Nic Cope provided a fantastic write-up and proposal of his thorough investigation that was eventually approved and merged into an official design document.
In the Crossplane project, we heavily emphasize API design, user experience, and reliability. We needed to find a solution that would be easy and intuitive to use while driving the number of unsupported “edge cases” toward zero. After lengthy and passionate discussions amongst the community, we concluded that the best experience for adopters of Crossplane would be to more granularly organize the largest providers into smaller and more reasonably scoped units. We call these Provider Families.
This collaboration amongst the community was a strong example of inclusivity and thoughtful consideration of all the complicated trade-offs that accompany such an impactful design decision. There were multiple approaches to choose from, each supported by some community members, so we are truly grateful for all the participation and considerate discussion through the lifecycle of this proposal.
The scoped provider approach that we chose offers a number of benefits over the various alternative solutions. Most importantly, this approach results in:
- No new concepts or configuration that adopters have to learn and figure out
- A default experience for new users in which fewer CRDs are installed
- A simple and non-disruptive migration path from previous monolithic providers
Rolling the plan out
Crossplane is considered a framework that enables you to build your own opinionated cloud native control plane. While the proposal was heavily discussed in the context of the general Crossplane community, it resulted in no specific changes being needed within core Crossplane itself. Therefore, after ratifying the proposal into the Crossplane design docs, it still needed to be adopted and integrated by the largest providers that were causing the most pain.
The largest providers within the Crossplane ecosystem are the Upbound Official Providers, which use code generation to automatically build support for a wide variety of cloud provider resources. Starting today, each of these providers is available as a “Provider Family”, giving you the choice to install only the subset of each provider’s functionality that you actually need in your platform. This should result in fewer installed resources and less demand on the control plane, as well as better alignment with how you’re actually using the resources available in each provider.
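To make the idea of installing only a subset concrete, here is a minimal sketch in Python that builds Crossplane `Provider` package manifests for a handful of family members. The `pkg.crossplane.io/v1` `Provider` kind is the standard Crossplane package API, but the registry path, the `provider-<cloud>-<service>` naming pattern, the version tag, and the helper names below are assumptions for illustration only; check the marketplace listing for the actual package names and releases.

```python
import json

# Assumptions for illustration: registry path, naming pattern, and version tag
# are hypothetical here and should be checked against the marketplace listing.
REGISTRY = "xpkg.upbound.io/upbound"
VERSION = "vX.Y.Z"  # placeholder, substitute a real release tag


def provider_manifest(cloud: str, service: str) -> dict:
    """Build a Crossplane Provider manifest for one (assumed) family member."""
    name = f"provider-{cloud}-{service}"
    return {
        "apiVersion": "pkg.crossplane.io/v1",
        "kind": "Provider",
        "metadata": {"name": name},
        "spec": {"package": f"{REGISTRY}/{name}:{VERSION}"},
    }


def family_manifests(cloud: str, services: list[str]) -> list[dict]:
    """One manifest per requested service, de-duplicated and sorted by name."""
    return [provider_manifest(cloud, s) for s in sorted(set(services))]


if __name__ == "__main__":
    # Only the services this platform actually needs, not the whole monolith.
    manifests = family_manifests("aws", ["s3", "rds"])
    # Wrap the objects in a v1 List so they can be emitted as a single document.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

The printed JSON could then be reviewed and applied to a cluster with your usual tooling. The point of the sketch is only that the platform team enumerates the services it needs, and each one maps to its own smaller provider package.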
Below is a diagram that shows this transformation from a single monolithic provider to a set of smaller related providers grouped into a single provider family:
Learn More
While this blog has focused on the general Crossplane ecosystem experience with this problem area and the pattern we now recommend to all providers, there are further interesting specifics to read about the Upbound Official Providers in their blog post.
Now that the Upbound Official Providers have implemented this “Provider Family” pattern, we expect other large providers to follow along and integrate this approach as well. The Upbound provider maintainers are happy to provide guidance based on what they have learned from adopting the general pattern and the specific considerations Upbound makes for the scale they are operating at.
Crossplane is a community-driven project and we welcome you to join the community and contribute through a variety of opportunities, such as opening and commenting on issues, joining the community meetings, sharing your adoption story, and providing feedback on design docs and pull requests.
We love to hear from the community, as they are exactly what makes this project great. Whether you are a developer, user, or just interested in what we're up to, feel free to join us via one of the following methods: