Scaling the modern data center

Written 2020-12-16.


I recently met with a group of passionate technologists and we had an intriguing conversation around the challenges that they currently face with operating and scaling their SaaS applications hosted within their own data centers. This company has provisions in many of their customer contracts that prohibit them from moving to the cloud.

They are having difficulty scaling and responding quickly enough to handle the high-burst workloads that inundate their backend infrastructure. The problem is exacerbated when their sales team inks a deal with a large new customer, which adds a significant number of users, and a corresponding surge in web traffic, on short notice. The load is generated by consumers using mobile and web applications.

As is the case with many complex issues, this is an interesting and multi-faceted problem that cannot be solved with any one course of action. Thankfully, constraints are the mother of creativity and innovation!

After thinking on this problem, I suggest putting significant effort into exploring the following strategies going forward:

  1. Performance Engineering
  2. Adding elasticity to the data centers
  3. Optimizing the data center supply chain
  4. Partnering with Legal to eliminate anti-cloud provisions on contract renewals
  5. Partnering with Sales to form clear agreements on capacity expansion

Let’s explore what I mean with each of these points.

Performance Engineering

When we started the discussion, it centered on the data centers specifically, and I did not immediately consider application performance. The team shared with me that they have been able to surgically rewrite key portions of their Ruby-based application in Go, which has improved performance and allowed them to squeeze more out of their existing data center capacity.

This is brilliant! When your data centers have compute resource constraints, make sure you are maximizing the resource efficiency of those limited assets. The team achieved significant performance improvements with this strategy. That result should not surprise you, as compiled languages such as Go generally outperform interpreted languages such as Ruby. In addition, the concurrency features built into the Go language may have helped them achieve even greater efficiencies.
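To make the concurrency point concrete, here is a minimal sketch of the kind of fan-out that goroutines make cheap (the fetch functions are hypothetical stand-ins for real backend calls): two independent lookups run in parallel, so the request pays roughly the latency of the slowest call rather than the sum of both.

    // A minimal fan-out sketch: two independent backend calls run concurrently.
    // fetchProfile and fetchOrders are hypothetical placeholders.
    package main

    import (
    	"fmt"
    	"sync"
    	"time"
    )

    func fetchProfile(userID string) string {
    	time.Sleep(50 * time.Millisecond) // simulate a backend call
    	return "profile:" + userID
    }

    func fetchOrders(userID string) string {
    	time.Sleep(80 * time.Millisecond) // simulate a slower backend call
    	return "orders:" + userID
    }

    func main() {
    	var wg sync.WaitGroup
    	var profile, orders string

    	wg.Add(2)
    	go func() { defer wg.Done(); profile = fetchProfile("42") }()
    	go func() { defer wg.Done(); orders = fetchOrders("42") }()
    	wg.Wait() // total latency is roughly the slowest call, not the sum of both

    	fmt.Println(profile, orders)
    }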

This strategy has already paid off for them, and it should continue to be emphasized, but what else could we look at?

Are they using an APM (Application Performance Monitoring) product, which could help them shine a light on the inner workings of each application? This could unveil bottlenecks and potential areas to target for further improvements.
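Even before adopting a commercial APM, a thin layer of home-grown instrumentation can start answering where the time goes. A minimal sketch using only the Go standard library (the route name and handler are hypothetical):

    // A minimal instrumentation sketch: wrap handlers so every request logs its
    // route and latency. A real APM adds traces, profiles, and dashboards; this
    // just illustrates measuring before guessing.
    package main

    import (
    	"log"
    	"net/http"
    	"time"
    )

    func timed(name string, next http.Handler) http.Handler {
    	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		start := time.Now()
    		next.ServeHTTP(w, r)
    		log.Printf("route=%s method=%s duration=%s", name, r.Method, time.Since(start))
    	})
    }

    func main() {
    	mux := http.NewServeMux()
    	mux.Handle("/orders", timed("orders", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		time.Sleep(20 * time.Millisecond) // stand-in for real work
    		w.Write([]byte("ok"))
    	})))
    	log.Fatal(http.ListenAndServe(":8080", mux))
    }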

Has a Content Delivery Network (CDN) been implemented? Serving static assets at the edge improves end-user performance and offloads additional work from the backend servers.
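A CDN can only cache what the origin marks as cacheable. As a rough sketch, here is a Go origin serving fingerprinted static assets with long-lived cache headers that an edge network can honor (the paths and max-age value are illustrative, not a recommendation for any particular product):

    // A minimal origin sketch: serve assets under /static/ with aggressive cache
    // headers so a CDN can keep them at the edge. Assumes fingerprinted filenames.
    package main

    import (
    	"log"
    	"net/http"
    )

    func main() {
    	static := http.StripPrefix("/static/", http.FileServer(http.Dir("./public")))

    	http.Handle("/static/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    		// Fingerprinted assets can be cached for a year and treated as immutable.
    		w.Header().Set("Cache-Control", "public, max-age=31536000, immutable")
    		static.ServeHTTP(w, r)
    	}))

    	log.Fatal(http.ListenAndServe(":8080", nil))
    }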

Are they caching data sourced from expensive database queries by employing Redis, Memcached or a similar solution? Depending on data workloads, database types, query patterns, and other factors, database calls can be the most expensive transactions for SaaS applications. If you can eliminate some of the database queries via caching, you can often see significant performance improvements.
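To make the cache-aside idea concrete, here is a minimal sketch that assumes a local Redis instance and the go-redis client; the key scheme and loadFromDatabase are hypothetical placeholders for a real expensive query:

    // A minimal cache-aside sketch: check Redis first, fall back to the database
    // on a miss, then populate the cache with a TTL. Assumes a local Redis and
    // github.com/go-redis/redis; loadFromDatabase is a placeholder.
    package main

    import (
    	"context"
    	"fmt"
    	"time"

    	"github.com/go-redis/redis/v8"
    )

    func loadFromDatabase(id string) (string, error) {
    	// Stand-in for an expensive SQL query.
    	return "customer-record-for-" + id, nil
    }

    func getCustomer(ctx context.Context, rdb *redis.Client, id string) (string, error) {
    	key := "customer:" + id

    	// 1. Try the cache first.
    	if val, err := rdb.Get(ctx, key).Result(); err == nil {
    		return val, nil
    	} else if err != redis.Nil {
    		return "", err // a real Redis error, not just a cache miss
    	}

    	// 2. Cache miss: hit the database.
    	val, err := loadFromDatabase(id)
    	if err != nil {
    		return "", err
    	}

    	// 3. Populate the cache with a TTL so stale data eventually expires.
    	if err := rdb.Set(ctx, key, val, 5*time.Minute).Err(); err != nil {
    		return "", err
    	}
    	return val, nil
    }

    func main() {
    	ctx := context.Background()
    	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    	fmt.Println(getCustomer(ctx, rdb, "42"))
    }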

Caching is not the only lever for improving expensive data queries; there are many additional factors to consider in this space. Are database indexes in place in all the appropriate places? Is the current database the best fit for the data shape, query patterns, and load? Has a NoSQL solution been explored for query-intensive use cases where low-latency reads under heavy traffic are of utmost importance?
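To make the index question concrete, here is a hedged sketch assuming Postgres and the lib/pq driver (the orders table and its columns are made up): a quick EXPLAIN on a hot query shows whether it falls back to a sequential scan, and a matching index can then be added without blocking writes:

    // A hedged sketch: use EXPLAIN to see whether a hot query is scanning the
    // whole table, then add an index that matches its filter. Assumes Postgres
    // via github.com/lib/pq; the table and column names are hypothetical.
    package main

    import (
    	"database/sql"
    	"fmt"
    	"log"

    	_ "github.com/lib/pq"
    )

    func main() {
    	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()

    	// Inspect the plan for a query the monitoring flagged as slow.
    	rows, err := db.Query("EXPLAIN SELECT * FROM orders WHERE customer_id = 42")
    	if err != nil {
    		log.Fatal(err)
    	}
    	for rows.Next() {
    		var line string
    		rows.Scan(&line)
    		fmt.Println(line) // "Seq Scan on orders ..." suggests a missing index
    	}
    	rows.Close()

    	// CONCURRENTLY avoids locking writes while the index builds (Postgres-specific).
    	if _, err := db.Exec("CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id ON orders (customer_id)"); err != nil {
    		log.Fatal(err)
    	}
    }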

I would also look at the application architecture and the division of compute labor within the product. Is everything tightly coupled, requiring that every piece of the architecture be scaled regardless of its discrete need? If so, decomposing the application might be beneficial, allowing independent scaling of just the pieces that require additional computing power under increased load.

Additionally, load testing in a non-production environment can unveil weak points, assist with forecasting capacity capabilities, and help to inform future architectural needs.
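Load testing does not have to start with a heavyweight tool; dedicated products (JMeter, Gatling, k6, and others) go much further, but a small concurrent client can already reveal where things fall over. A minimal sketch, where the target URL, worker count, and duration are placeholders:

    // A minimal load-generator sketch: N workers hammer an endpoint for a fixed
    // duration and report success and error counts. Point it at a non-production
    // environment only.
    package main

    import (
    	"fmt"
    	"net/http"
    	"sync"
    	"sync/atomic"
    	"time"
    )

    func main() {
    	const (
    		workers  = 50
    		duration = 30 * time.Second
    		target   = "https://staging.example.com/api/health" // placeholder URL
    	)

    	var ok, failed int64
    	deadline := time.Now().Add(duration)

    	var wg sync.WaitGroup
    	for i := 0; i < workers; i++ {
    		wg.Add(1)
    		go func() {
    			defer wg.Done()
    			client := &http.Client{Timeout: 5 * time.Second}
    			for time.Now().Before(deadline) {
    				resp, err := client.Get(target)
    				if err != nil || resp.StatusCode >= 500 {
    					atomic.AddInt64(&failed, 1)
    				} else {
    					atomic.AddInt64(&ok, 1)
    				}
    				if resp != nil {
    					resp.Body.Close()
    				}
    			}
    		}()
    	}
    	wg.Wait()

    	fmt.Printf("ok=%d failed=%d over %s with %d workers\n", ok, failed, duration, workers)
    }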

Adding elasticity to the data centers

One of the biggest issues with running your own data centers is that your capacity is mostly static. It can take weeks, or even months, to acquire new hardware, install it, wire it up, configure it, and finally put it into service.

The cloud revolutionized our industry by providing resource elasticity. If you have to run your own data centers, it is imperative that you add elasticity to your compute layer.

What do I mean by elasticity? Creating your own PaaS (Platform as a Service) based on Kubernetes is one compelling option. With unpredictable workloads and bursting traffic, you won’t have a good time waiting 5-10 minutes for VMs to launch and enter service. Compare that to the Kubernetes HPA (Horizontal Pod Autoscaler), which can scale your applications out horizontally in seconds, provided the cluster has spare capacity.
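For illustration, here is a minimal HorizontalPodAutoscaler sketch for a hypothetical web-api Deployment, using the autoscaling/v2beta2 API that is current as of this writing: hold average CPU around 70%, scaling between 4 and 40 replicas. New pods typically come up in seconds, assuming the nodes already have headroom.

    # A minimal HPA sketch for a hypothetical "web-api" Deployment.
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-api
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-api
      minReplicas: 4
      maxReplicas: 40
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70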

Kubernetes isn’t the only solution, of course, but it’s difficult to ignore the vast mind-share, open source community support, CNCF projects, and numerous other resources that are dedicated to its ecosystem.

Additionally, are the data centers being managed and operated using a Software Defined Data Center (SDDC) model? While this term is heavily used in marketing materials and should set off your buzzword alarms, the underlying concept of managing all (or as much as you can) of your data center resources with software has validity. Leveraging this strategy can reduce the time needed to provision or reallocate resources in your data center, leading to greater elasticity.

Optimizing the data center supply chain

Having more customers than you can handle is a great problem to have. In the context of data center capacity, the biggest issue is that adding capacity is often slow.

This is a classic supply chain issue. Don’t settle for “this is how it’s always been done”. If this is your business’ bottleneck, then you should be fanatical about pushing the limits, exploring new innovations, talking with many suppliers, renegotiating contracts and anything else you can do to start chipping away at your expansion timelines.

If you don’t know how long it takes to increase capacity today, the first step is measuring the entire process, from the moment new resources are identified as needed to the point at which they are in service. Once you have that baseline, get the team involved in brainstorming how to reduce those timelines, devise experiments to undertake, and get to work on shrinking the cycle time.

Another option worth considering is minting a new role within the organization: a dedicated project manager tasked with optimizing data center provisioning.

Partnering with Legal to eliminate anti-cloud provisions on contract renewals

As mentioned in the introduction, this company has a “no cloud” provision written into many of their customer contracts. This was commonplace and expected 5-10 years ago, but in today’s business landscape even the world’s largest banks and financial institutions are undergoing massive cloud transformations, shifting from their own data centers to the public cloud.

It’s important to listen to your customers and give them what they want, but this can be a delicate balance in certain circumstances. On some things, you have to push back. I believe this is one of those areas where significant effort should be made to eliminate the anti-cloud provisions upon contract renewal. I expect this effort to be difficult and that some customers will not agree.

How do you make your customers more comfortable with a cloud solution? Highlight specific practices and safeguards that you will guarantee when moving them to the cloud. Discuss encryption at rest, adherence to the AWS Well-Architected Framework, cloud governance processes, security incident response processes, routine penetration testing, and other practices that help mature organizations ensure that their infrastructure and customer data are secured in the cloud.

The messy middle is when some of your customers agree to use cloud-backed services while others are adamant about avoiding them. In this case, I have often seen product capabilities and features diverge: cloud customers get more features and new capabilities more quickly than their on-premises peers.

This is a delicate strategy and should be considered carefully before being put into practice. If this is the path taken, the richer feature set of the cloud-hosted version of your services is one carrot that can be used to entice legacy customers to shift over.

Partnering with Sales to form clear agreements on capacity expansion

Speaking with the team, it was clear that one of the pain points of data center capacity expansion was the lack of Agreements (yes, capital A) between the Sales and Engineering teams.

It is critical to get Data Center Engineering and Sales on the same page. There should be no ambiguity. If Sales is pursuing a deal to add 10 million new customers to the platform, then there should be clear expectations on how long it will take the Data Center team to expand capacity to be able to serve that new user base.

A crude example: if Data Center Engineering needs 3 weeks to add enough capacity for 10 million new customers, then Sales should set the expectation that the customer cannot onboard until 4 weeks after the deal is signed. Leaving some room for error is always advisable, so that you can meet or exceed customer expectations.

Conclusion

Operating a SaaS application out of your own data centers, pressured by tremendous growth, is a challenging endeavor. There are many nuanced considerations to be made in this problem space, and it is imperative that you get creative and take a look at all possible strategic approaches while weighing the pros and cons of each direction.

No one has all the answers, and the specific strategies employed will be dependent on the context of the business, its leaders, and their particular preferences. It’s important to brainstorm with your teams on this, hear out all the ideas, run experiments, iterate, and focus on continual improvement.
