Scaling the modern data center

Written 2020-12-16.

I recently met with a group of passionate technologists and we had an intriguing conversation around the challenges that they currently face with operating and scaling their SaaS applications hosted within their own data centers. This company has provisions in many of their customer contracts that prohibit them from moving to the cloud.

They are having difficulty scaling and responding quickly enough to handle high burst workloads that inundate their backend infrastructure. The problems are exacerbated when their sales team inks a deal with a new large customer, resulting in a significant increase in their user base and a corresponding increase in web traffic, on short notice. The load is generated by consumers using mobile and web applications.

As is the case with many complex issues, this is an interesting and multi-faceted problem that cannot be solved with any one course of action. Thankfully, constraints are the mother of creativity and innovation!

After thinking on this problem, I suggest putting significant effort into exploring the following strategies going forward:

  1. Performance Engineering
  2. Adding elasticity to the data centers
  3. Optimizing the data center supply chain
  4. Partnering with Legal to eliminate anti-cloud provisions on contract renewals
  5. Partnering with Sales to form clear agreements on capacity expansion

Let’s explore what I mean with each of these points.

Performance Engineering

When we started the discussion, it centered around the data centers specifically, and I did not immediately consider application performance. The team shared with me that they have been able to surgically rewrite key portions of their Ruby-based application in Go, which has improved performance and allowed them to squeeze more out of their existing data center capacity.

This is brilliant! When your data centers have constrained compute resources, make sure that you are maximizing the efficiency of those limited assets. The team achieved significant performance improvements with this strategy. This result should not surprise you, as compiled languages generally outperform interpreted languages, particularly on CPU-bound work. In addition, the concurrency features built into the Go language may have helped them achieve even greater efficiencies.

This strategy has already paid off for them, and there should be an emphasis on this continuing, but what else could we look at?

Are they using an APM (Application Performance Monitoring) product, which could help them shine a light onto the inner workings of each application? This could unveil bottlenecks and potential areas to target for further improvements.

Has a Content Delivery Network been implemented? This would ensure that their static assets are served at the edge, for optimal end user performance and would offload additional work from their backend servers.

Are they caching data sourced from expensive database queries by employing Redis, Memcached or a similar solution? Depending on data workloads, database types, query patterns, and other factors, database calls can be the most expensive transactions for SaaS applications. If you can eliminate some of the database queries via caching, you can often see significant performance improvements.
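To make the caching idea concrete, here is a minimal cache-aside sketch in Python. It is only an illustration: the TTLCache class and the load_dashboard_from_db function are hypothetical names of mine, and an in-process dictionary stands in for Redis or Memcached. The pattern is what matters: check the cache first, and only run the expensive query on a miss or after the entry expires.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache; a stand-in for Redis or Memcached."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not touched
        value = loader()  # miss or expired entry: run the expensive query
        self._entries[key] = (now + self.ttl_seconds, value)
        return value


query_count = 0


def load_dashboard_from_db(customer_id: int) -> dict:
    """Hypothetical expensive query; here it only counts how often it runs."""
    global query_count
    query_count += 1
    return {"customer_id": customer_id, "open_orders": 3}


cache = TTLCache(ttl_seconds=30)
for _ in range(100):
    cache.get_or_load("dashboard:42", lambda: load_dashboard_from_db(42))

print(query_count)  # prints 1: only the first request reached the "database"
```

Choosing the TTL is the real design decision here, since it trades data freshness against backend load.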

Caching is not the only lever for improving expensive data queries. There are many additional factors to consider in this space. Are there database indexes in place in all the appropriate places? Is the current database solution the best fit for the data shape, query patterns and load? Has a NoSQL solution been explored for query intensive use cases where low latency reads under heavy traffic are of utmost importance?
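One low-cost way to check whether an index is actually being used is to ask the database for its query plan. The sketch below uses Python’s built-in sqlite3 module purely for illustration. The orders table, its columns, and the index name are made up, and the team’s real database will have its own equivalent of EXPLAIN.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = ?"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's plan for QUERY as one line of text."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return "; ".join(row[-1] for row in rows)


plan_before = query_plan(conn)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn)

print("before index:", plan_before)
print("after index: ", plan_after)
```

The exact wording of the plan varies by SQLite version, but the first plan should report a scan of the whole table and the second should mention the new index.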

I would also look at the application architecture and the division of compute labor within the product. Is everything highly coupled together, requiring that every piece of the architecture be scaled, regardless of the discrete need? If the answer is yes, then a decomposition of the application might be beneficial, to allow for independent scaling of just the pieces that require additional computing power under increased load.

Additionally, load testing in a non-production environment can unveil weak points, assist with forecasting capacity capabilities, and help to inform future architectural needs.

Adding elasticity to the data centers

One of the biggest issues with running your own data centers is that your capacity is mostly static. It can take weeks, or even months to acquire new hardware, install it, wire it up, configure it, and finally put it into service.

The cloud revolutionized our industry by providing resource elasticity. If you have to run your own data centers, it is imperative that you add elasticity to your compute layer.

What do I mean by elasticity? Creating your own PaaS (Platform as a Service) based on Kubernetes is one compelling option. With unpredictable workloads and bursting traffic, you won’t have a good time waiting 5-10 minutes for VMs to launch and enter service. Compare that to Kubernetes HPA (Horizontal Pod Autoscaling), where new pods can typically be scheduled and serving traffic within seconds, provided the cluster has spare capacity to place them.
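For a feel of how the Horizontal Pod Autoscaler decides how far to scale, the Kubernetes documentation describes its core calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance (0.1 by default). The Python sketch below mirrors that formula. The function name, the min/max bounds, and the example numbers are my own assumptions, and the real controller also deals with stabilization windows, missing metrics, and not-yet-ready pods, all of which are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified version of the HPA replica calculation."""
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 90% CPU against a 50% target: ceil(4 * 90 / 50) = ceil(7.2)
print(desired_replicas(4, 90, 50))  # prints 8
```

A burst that doubles per-pod load roughly doubles the pod count, but only up to max_replicas and only if the cluster has room for the new pods, which is where the data center capacity question comes back in.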

Kubernetes isn’t the only solution, of course, but it’s difficult to ignore the vast mind-share, open source community support, CNCF projects, and numerous other resources that are dedicated to its ecosystem.

Additionally, are the data centers being managed and operated using a Software Defined Data Center (SDDC) model? While this term is heavily used in marketing materials and should set off your buzzword alarms, the underlying concept of managing all (or as much as you can) of your data center resources with software has validity. Leveraging this strategy can reduce the time it takes to provision or reallocate resources in your data center, leading to greater elasticity.

Optimizing the data center supply chain

Having more customers than you can handle is a great problem to have. In the context of data center capacity, some of the biggest issues revolve around the fact that it is often slow to add capacity.

This is a classic supply chain issue. Don’t settle for “this is how it’s always been done”. If this is your business’ bottleneck, then you should be fanatical about pushing the limits, exploring new innovations, talking with many suppliers, renegotiating contracts and anything else you can do to start chipping away at your expansion timelines.

If you don’t know how long it takes to increase capacity now, the first step is measuring the entire process, from the start of when new resources are needed, to the very end point at which you have these new resources in service. Once you have that baseline, get the team involved in brainstorming how to reduce those timelines, devise experiments to undertake, and get to work on reducing the cycle time.
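As a rough sketch of what that baseline could look like, the Python below takes the dated milestones of one past expansion and reports how long each stage took, the total cycle time, and the slowest stage. The milestone names and dates are invented for illustration; the real ones would come from the team’s own tickets and purchase records.

```python
from datetime import date

# Hypothetical milestones for one past capacity expansion.
milestones = [
    ("capacity need identified", date(2020, 9, 1)),
    ("purchase order approved", date(2020, 9, 11)),
    ("hardware delivered", date(2020, 10, 9)),
    ("racked and cabled", date(2020, 10, 14)),
    ("configured and in service", date(2020, 10, 20)),
]


def stage_durations(events):
    """Return (stage description, days) for each pair of consecutive milestones."""
    return [
        (f"{start_name} -> {end_name}", (end_day - start_day).days)
        for (start_name, start_day), (end_name, end_day) in zip(events, events[1:])
    ]


durations = stage_durations(milestones)
total_days = sum(days for _, days in durations)
slowest_stage = max(durations, key=lambda item: item[1])

for stage, days in durations:
    print(f"{days:>3} days  {stage}")
print(f"total cycle time: {total_days} days")
print(f"slowest stage: {slowest_stage[0]} ({slowest_stage[1]} days)")
```

With these made-up dates the total is 49 days, of which 28 are spent waiting on delivery, which is the kind of finding that tells you where to aim the first experiment.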

Minting a new role within the organization for a dedicated project manager, tasked with data center provisioning optimization, is another option that could be considered.

Partnering with Legal to eliminate anti-cloud provisions on contract renewals

As mentioned in the introduction, this company has a “no cloud” provision written into many of their customer contracts. This was commonplace and expected 5-10 years ago, but in today’s business landscape, even the world’s largest banks and financial institutions are undergoing massive cloud transformations, shifting from their own home grown data centers to the public cloud.

It’s important to listen to your customers and give them what they want, but this can be a delicate balance in certain circumstances. On some things, you have to push back. I believe this is one of those areas where significant effort should be put in place to attempt to eliminate the anti-cloud provisions upon contract renewals. I expect this effort to be difficult, and that some customers would not agree.

How do you help make your customers more comfortable with a cloud solution? Highlight specific practices and safeguards that you will guarantee when moving them to the cloud. Discuss encryption at rest, AWS Well-Architected Framework adherence, cloud governance processes, security incident response processes, routine penetration testing, and other practices that help mature organizations ensure that their infrastructure and customer data are secured in the cloud.

The messy middle is when some of your customers agree to use cloud-backed services while others are adamant about avoiding them. In this case, I have often seen product capabilities and features diverge, where cloud customers get more features and different product capabilities, more quickly, than their on-premises peers.

This is a delicate strategy and should be considered carefully before being put into practice. If this is the path taken, then the cloud-hosted version of your services offering more features is one carrot that can be used to entice your legacy customers to shift over.

Partnering with Sales to form clear agreements on capacity expansion

Speaking with the team, it was clear that one of the pain points of the data center capacity expansion was missing Agreements (yes, capital A) between the Sales teams and the Engineering teams.

It is critical to get Data Center Engineering and Sales on the same page. There should be no ambiguity. If Sales is pursuing a deal to add 10 million new customers to the platform, then there should be clear expectations on how long it will take the Data Center team to expand capacity to be able to serve that new user base.

A crude example: if Data Center Engineering needs 3 weeks to add capacity to serve 10 million new customers, then Sales should set expectations with the customer that they will not be able to onboard until 4 weeks after the deal is signed. Leaving some room for error is always advisable, so that you can meet or exceed your customer expectations.
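The arithmetic behind that example is simple enough to write down, which is part of the point: an Agreement can be a formula both teams sign off on. In the sketch below, the function name, the one-week default buffer, and the signing date are assumptions of mine for illustration.

```python
from datetime import date, timedelta


def earliest_onboarding_date(signed_on: date,
                             lead_time_weeks: int,
                             buffer_weeks: int = 1) -> date:
    """Date Sales can safely quote: engineering lead time plus a safety buffer."""
    return signed_on + timedelta(weeks=lead_time_weeks + buffer_weeks)


# Deal signed on 2020-12-16; engineering needs 3 weeks, plus 1 week of buffer.
quoted = earliest_onboarding_date(date(2020, 12, 16), lead_time_weeks=3)
print(quoted)  # prints 2021-01-13, four weeks after signing
```

The lead time fed into this should come from measured provisioning cycle times rather than a guess, and it should be revisited as those timelines improve.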

Conclusion

Operating a SaaS application out of your own data centers, pressured by tremendous growth, is a challenging endeavor. There are many nuanced considerations to be made in this problem space, and it is imperative that you get creative and take a look at all possible strategic approaches while weighing the pros and cons of each direction.

No one has all the answers, and the specific strategies employed will be dependent on the context of the business, its leaders, and their particular preferences. It’s important to brainstorm with your teams on this, hear out all the ideas, run experiments, iterate, and focus on continual improvement.

2020

The fam at the park next to our home, in Sandy, UT

Happy Holidays!

If you haven’t read our May 2020 family update, give it a read!

This year has been quite a tumultuous one, as we’ve all lived through the many world changing events that have occurred, many of which are still on-going.

Dylan and Ashton

The boys turned 9 years old in September, and they’re currently in 3rd grade. For Ashton and Dylan, the biggest change at school was shifting to a “distance learning” model. Monday-Thursday, they meet with their class for “morning meeting”, which spans 30-60 minutes over Google Meet. During these sessions, they get to engage with their teacher and their schoolmates.

The rest of the day is spent in self paced study and learning, although they do get a curriculum to follow, with the teacher releasing new assignments and activities to complete every morning (on Canvas). Occasionally the assignments aren’t available for the boys when they try to start school at the butt crack of dawn, immediately after they wake up at 7AM.

Thankfully, the boys have both done well with this new learning model and have excelled in their class. We do miss the in person socialization and they miss their friends at the elementary school that they would typically see at lunch and recess. Tera and I look forward to the end of this pandemic and to sending them back to in person schooling.

The shift has led to them becoming proficient in typing, and learning a wide variety of new apps, programs, and computer navigation skills (on Chromebooks), which I’m happy about.

In terms of interests, Roblox and Legos are definitely the boys’ most beloved activities. They play Roblox just about every day on their “teampads” (Fire HD tablets) and they usually use their “TV time” to watch YouTube videos of gamers playing and talking through games.

Tera

This year Tera has spent most of her time helping the kids with online school, reading books, cooking, doing yoga, meditation and taking walks to help keep sane during the pandemic. The latest book she loved is “The Pale Faced Lie”, a true story by David Crow.

She’s also taken on the job of cutting everyone’s hair in the house, while Wes has taken on the duty of hair colorist for Tera. haha.

We can’t wait to get back to enjoying live concerts, hanging out with friends, going out to our favorite restaurants and traveling, once things go back to more normal. Luckily in 2019 we traveled more than we ever had in previous years. We are grateful for our good health, Wes’ job that he loves and for the extra time we get to spend together as a family.

Wes

In May, I took on a new role leading the Platform Engineering team at Pluralsight. Our team builds libraries, tools, and systems that enable our fellow product teams to be more productive. We work with many different languages, frameworks, and tools, and we are always innovating and learning new things. I love this work and I am very grateful to have a phenomenal team of Software, DevOps, and Product people that I get to spend my days with.

Separately from my full-time employment, I am also a Pluralsight Author. This year, I was able to help launch the new Cloud Labs product, authoring roughly a dozen Cloud Labs that cover both AWS and Azure! You can see some of these labs after logging into the app.pluralsight.com Skills Product on my Author Profile.

Wes’ Author profile

In October, I spoke at the Big Mountain Data & Dev conference, with my talk, Scrape the Web for Fun & Nonprofit with Python. Have a look if you’re interested in learning some Python with me!

This year, I also launched a mentorship program where I am mentoring and advising Computer Science students and folks who are aspiring to enter the tech industry. I appreciate the opportunity to help uplift others, serve as a resource, and I am excited to see my mentees continue to succeed.

Well, that’s all for now. We miss you all and hope to spend much more time with friends, family, and colleagues in the coming year. We wish you good health, both mentally and physically, and hope that 2021 is a much better year for us all.

Merry Christmas and Happy 2021!