Scaling to Millions of Users

April 01, 2017

Companies are like boats. The small ones are quick and agile. They move fast and pivot on a dime. But that agility comes at a cost: they are notoriously unstable. The rough seas of the market define what they can and cannot do. Make a bad turn and they risk capsizing. As our companies grow, our boats turn slower but get more stable in rough seas. But until you gain that stability, agility is your strength.

And sticking with boats, the bigger our vessel, the more cargo we can hold. Our cargo is customers. We are filling our holds with them.

Now you may wonder why I have belabored this point in a post about scaling a website. The reason is that the first law of scaling a website is to keep the tradeoffs of that scalability in perspective. If scaling were just a matter of building in "web scale" from the start without consequence, none of this would be an issue.

But there are consequences.

If we focus on scale too early, we sacrifice a lot of the agility that we need early on. We need to think about our growth as getting more cargo on our journey. With each new customer, we get a bit more valuable. We also get heavier and less agile. Our job as engineers is to patch the holes in the hull as we find them, and scale the size of the infrastructure at the right times.

Fight complexity in everything

The first rule of scaling is to keep your solutions as simple as possible, but not simpler. What does that mean? Before you scale up systems, make sure you have removed unnecessary complexities.

"Keep your solutions as simple as possible, but not simpler."

Is the app running queries it no longer requires? Does it have an expensive script scheduled too frequently? Does the client discard data it requested from the API but may need later? The answer to these questions is rarely "no".

These should be the first places to look when a website begins to slow down. Before we add complexity to a stack, we should be removing unnecessary actions and queries. Optimizing queries is another great way to tackle performance without increasing complexity. While indexing is far from a panacea, early performance boosts are often just a CREATE INDEX command away.
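To make the CREATE INDEX point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the index name are hypothetical, invented only for illustration, and the exact plan wording differs between SQLite versions. It asks the database how it intends to run the same lookup before and after the index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


# Hypothetical schema and data, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 1.5) for i in range(10_000)],
)

lookup = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index, the database has to scan every row in the table.
plan_before = query_plan(conn, lookup, (42,))

# One statement, no new moving parts in the stack.
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")

# With the index, the same query becomes an indexed search.
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The plan should change from a full table scan to a search that uses the new index. Your own database will have its own tooling for this (most have some form of EXPLAIN), but the habit is the same: check the plan before reaching for new infrastructure.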

Avoid caching and denormalization

Chief among the complexity injections are caching and denormalizing. Both are inevitabilities at scale, but in my experience neither is necessary before you reach hundreds of thousands of users. Your use case may vary.

"Normalize until it hurts, denormalize until it works." - Jeff Atwood

Whenever you decide to start denormalizing, or worse, caching, be conscious of the difficulties they bring. Each puts distance between your source of truth (the normalized table data) and the data you are using to populate a user's view. The more layers that exist between your source of truth and the data you display, the more likely they are to differ. Bugs caused by these differences are... exciting. Worse, by the time you hear about them, the system has probably resolved the de-sync, or the cache has been cleared.
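Here is a minimal sketch of how that drift happens, with a plain dict standing in for the database. The CachedProfiles class and its method names are hypothetical, made up for this illustration rather than taken from any real library. One write path forgets to invalidate the cache, and a later cache clear quietly erases the evidence.

```python
class CachedProfiles:
    """Hypothetical read-through cache in front of a source of truth."""

    def __init__(self, db):
        self.db = db      # source of truth (a dict standing in for a table)
        self.cache = {}   # the extra layer between the truth and the view

    def get_name(self, user_id):
        # Read-through: only hit the source of truth on a cache miss.
        if user_id not in self.cache:
            self.cache[user_id] = self.db[user_id]
        return self.cache[user_id]

    def rename(self, user_id, name):
        # Correct write path: update the truth, then invalidate the cache.
        self.db[user_id] = name
        self.cache.pop(user_id, None)

    def rename_buggy(self, user_id, name):
        # Buggy write path: updates the truth but forgets the cache.
        self.db[user_id] = name


db = {1: "Ada"}
profiles = CachedProfiles(db)

profiles.get_name(1)               # warms the cache with "Ada"
profiles.rename_buggy(1, "Grace")  # the truth changes, the cache does not

stale = profiles.get_name(1)       # the user still sees the old name

# A restart or expiry clears the cache and the evidence along with it.
profiles.cache.clear()
fresh = profiles.get_name(1)       # now correct, and the bug is hard to reproduce

print(stale, fresh)
```

Every write path has to remember the invalidation step, forever. That is the ongoing cost of the extra layer, and it is why it pays to put it off until the numbers demand it.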

Tracking down bugs when your system is clearing the evidence is a bear. Bugs like these don't just upset clients, they can be soul-sucking to the engineering team as a whole. Are you prepared to have morale blood on your hands?

Scale servers vertically

Hardware is cheap these days. Once you have squeezed most of the juice out of query optimization and reduction, a simple answer can be to throw hardware at the problem. This is often an appropriate move.

But as you scale your physical infrastructure, keep it simple. This means prioritizing bigger machines over more machines. You would much rather manage a single 16GB server than an 8GB jobs server and an 8GB web server. Databases are the epitome of this. A bigger database server costs a premium, but it is simpler to maintain than a solution sharded across boxes.

Time is money

Each injection of complexity, in our code and infrastructure, increases the amount of time that will be spent there in the future. Complexity reduces velocity and increases the number of things that can go wrong. Redundancy can mitigate the risk, but even that adds to the number of things that can go wrong. Complexity = time.

Think of scaling in order-of-magnitude chunks. With each new 0 added to user counts, expect new challenges with performance. Fix things until they work well without making the boat too large and sluggish. As we add more cargo, the boat sinks lower into the water and we find new leaks higher in the hull we didn't know of before. Fix them too.

Scaling is a lot like that. Do enough to keep the site high performing while keeping the team available to build actual value for customers. The longer you can keep the crew small, the more efficient each of you will be.

tl;dr: Scaling a company is a balance between making things as fast as they need to be and keeping your tech stack from getting over-complicated.

Written by Ben

https://www.shootskyward.com

Ben is the co-founder of Skyward. He has spent the last 10 years building products and working with startups.