WEB Advent 2012 / Going from One to a Million Users

Web development is a profession that has one of the lowest barriers to entry that I can think of. It requires a computer, an Internet connection, and time. While some may consider these prerequisites too high for many people on the planet less fortunate than us (and indeed, they would be right), programs such as OLPC are attempting to bridge this gap.

A somewhat unfortunate side effect of this low barrier to entry is that many people (especially those new to web development) are at a loss when it comes to scaling their app to accommodate an influx of traffic and users. Often, these problems are relegated to people with fancy titles such as Systems Architect or Senior Architect, and with good reason; this stuff is difficult. Very difficult. It takes quite a bit of blood, sweat, and sleepless nights to understand just how hard a problem this can be.

Nonetheless, even a superficial understanding of the challenges in scaling a web app can be useful to developers who wish to progress in their careers and better understand the trade-offs that are made during the lifetime of an app.

Let’s look at a broad and general overview of how a typical app might grow, what problems might arise, and some general-purpose solutions to these problems.

Money does solve problems

You might think that there’s something inherently cool or fun about maintaining all of the disparate servers and services that your app requires — and you might actually enjoy it, at least for a little while. After the initial honeymoon phase wears off, however, the hard facts start to set in. Every minute that you spend fiddling and maintaining your architecture is a minute that could be spent improving some facet of your app that could benefit your users in a much more direct way.

As is often the case, the simplest (and sometimes most cost-effective) solution is to throw money at the problem. With the recent proliferation of general-purpose platform-as-a-service (PaaS) providers such as Heroku and Engine Yard, and even language-specific ones like Gondor.io, the need for a full-time operations person on staff for a small- to mid-sized web app has decreased dramatically.

Take it from someone who has been paged at two o’clock in the morning in a foreign country because servers went down due to network issues, with an Internet connection that amounted to a hotel using an AOL CD they found in the trash: if you can pay someone else to have that problem, do it.

Every problem you pay someone else to have is time that you can use to focus on what makes your app special.

Starting small

If you’re like 99% of the web sites in the wild, you most likely started with a single server for running your app. Or, you might even have started on some form of shared hosting. There’s nothing wrong with that; we’ve all been there.

There comes a time when that single host isn’t able to keep up with the demands of your app — your database grows too big, you become memory or IO bound (depending on your app and your service stack), or you’re not able to attain whatever response time or throughput goals that you’ve set for yourself. This might be stressful, but remember, problems of scale are generally good problems to have. Most apps struggle to hold on to a few hundred users, let alone have the privilege of needing architectural changes to scale upwards.

Your go-to solution at this point should be one that minimizes changes to your procedures, your development time, and the complexity of the system and app architecture. For most situations, this means getting a more powerful machine. If you’re on some sort of virtualized hardware like Linode or Amazon Web Services, then this is simply a matter of a few clicks in their respective web consoles, and a few minutes of downtime. While the downtime might not be desirable, it is generally tolerable for your users if handled correctly.

Considering the various memory, processor, and disk configurations available from virtualized or bare-metal providers, this very simplistic approach of making your servers more powerful (also known as vertical scaling) can get you quite far.

Necessary complexity

If your user base continues to grow, vertical scaling will only get you so far. The next step is to separate out your system architecture by broad components and services.

The obvious choice for most is to separate the web server from the database. In doing so, an additional layer of complexity and an inherent network overhead is introduced. However, you suddenly gain the ability to independently scale both major components with respect to their own performance characteristics and requirements for your app as a whole.

Once the database and web server have been separated, the next phase is to add some high availability to the mix. It’s great that you can now serve more users, but if your web server node goes down for whatever reason, no one cares about your mean response time or throughput. What matters is that your app remains up and functioning for as many minutes of the year as possible.

Begin by performing an analysis of the single points of failure (SPOFs) in the current architecture. A simple heuristic is this: if any particular server were to become unavailable, would the app continue to function? If the answer is no, then you’ve found a SPOF.

In our current architecture, we have two very obvious (and separate) SPOFs: the database server and the web server. If either is disconnected, our app is no longer able to operate normally.
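
The heuristic can be expressed as a small check. This is only a minimal sketch, assuming a simplified model in which the app needs at least one live server per role; the server names, role names, and the `find_spofs` helper are made up for illustration.

```python
from collections import defaultdict


def find_spofs(servers, required_roles):
    """Return servers whose loss alone would leave a required role unserved.

    `servers` maps a (hypothetical) server name to the role it fills.
    """
    by_role = defaultdict(list)
    for name, role in servers.items():
        by_role[role].append(name)

    spofs = []
    for role in required_roles:
        members = by_role[role]
        # A role filled by exactly one server fails when that server does.
        # (A role with no servers at all means the app is already down.)
        if len(members) == 1:
            spofs.append(members[0])
    return sorted(spofs)


# The two-server architecture described above (names are illustrative).
architecture = {"web-1": "web", "db-1": "database"}
print(find_spofs(architecture, ["web", "database"]))  # ['db-1', 'web-1']
```

Adding a second server with the `web` role to the mapping removes `web-1` from the result, which mirrors the steps that follow.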

At this point, anyone not familiar with the concept of a shared-nothing architecture has some reading to do. Go on. I’ll wait.

Now, the easiest SPOF to address in this situation is the web server. As long as you aren’t storing any stateful information pertinent to a particular user on the web server node (file-based sessions being a typical offender), then adding more web servers is feasible.
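
To illustrate why per-user state has to live off the web server node, here is a minimal sketch. The `SharedSessionStore` class is an in-memory stand-in for whatever external store every node can reach (a database or a key-value service, for instance); both classes and all names are assumptions made for the example, not a real library’s API.

```python
class SharedSessionStore:
    """In-memory stand-in for an external store that every web server can reach."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._data.get(session_id, {}))

    def save(self, session_id, session):
        self._data[session_id] = dict(session)


class WebServer:
    """A web server node that keeps no per-user state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, updates=None):
        # Load the session, apply any changes, and write it back to the store.
        session = self.store.load(session_id)
        session.update(updates or {})
        self.store.save(session_id, session)
        return session


store = SharedSessionStore()
web_1 = WebServer("web-1", store)
web_2 = WebServer("web-2", store)

web_1.handle("abc123", {"user_id": 42})
# A later request lands on a different node and still sees the same session.
print(web_2.handle("abc123"))  # {'user_id': 42}
```

With file-based sessions, the second request would have found nothing on `web-2`, and the user would appear to be logged out.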

Now you run into yet another SPOF: your DNS A record, which ties your host (example.com) to a single IP address, and therefore to a single machine.

(I assume you aren’t managing your own DNS for the sake of simplicity. If this is not the case, then you probably don’t need this overview.)

One solution is an HTTP-capable load balancer such as HAProxy, nginx, the Amazon Web Services Elastic Load Balancer, or one of many other similar products. A properly configured load balancer will let you distribute incoming requests across a pool of web servers according to a set of rules or heuristics, so that no single web server is a SPOF. (The load balancer must itself be redundant or managed by your provider; otherwise, it simply becomes the new SPOF.) You can then add or remove web servers from this pool at will; you need only ensure that enough active servers remain in the pool for your app to process requests in a timely manner.

Now you’ve managed to remove the web server as a SPOF. If one or more web servers go offline, your app will continue to respond and serve requests as long as the number of remaining web servers is enough to satisfy your normal performance requirements.
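
The idea of a pool that keeps serving while nodes come and go can be sketched in a few lines. This is a toy round-robin model, not how HAProxy, nginx, or any other real load balancer is implemented; the `RoundRobinPool` class and the server names are assumptions for illustration.

```python
from itertools import cycle


class RoundRobinPool:
    """Toy round-robin balancer that skips servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        # Only servers that belong to the pool can be marked healthy.
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        if not self.healthy:
            raise RuntimeError("no healthy web servers in the pool")
        # At least one server is healthy, so one full turn of the ring finds it.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate


pool = RoundRobinPool(["web-1", "web-2", "web-3"])
print([pool.next_server() for _ in range(3)])  # ['web-1', 'web-2', 'web-3']

pool.mark_down("web-2")
print([pool.next_server() for _ in range(4)])  # ['web-1', 'web-3', 'web-1', 'web-3']
```

Real load balancers decide which servers are healthy through periodic health checks rather than manual calls, but the effect on request routing is the same.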

The next obvious SPOF, your database, is a tricky one. If you’re smart and can afford it, this is something that you might want to make someone else’s problem. Products such as Amazon’s Relational Database Service can take a huge chunk of pain out of managing a relational database yourself. If you’re more adventurous and willing to try new things, there is also a plethora of non-relational technologies such as MongoDB, Riak, and CouchDB that make various trade-offs in an attempt to better tolerate network partitions and provide better availability.

The little things that make a big difference

Open up your favorite web app in your browser. Now open the network traffic inspector panel (or whatever the equivalent is for your preferred browser). Chances are good that there are dozens (if not hundreds) of requests for external stylesheets, JavaScript, images, and fonts, not to mention any asynchronous XHR calls to load dynamic content after the page has loaded.

Even if all of these files have been properly minified and compressed, and you use all the best HTTP headers for caching, your web server still needs to respond to every request that reaches it. A very simple and effective way to reduce the load on your web server is to move these assets somewhere else, either to a dedicated pool of static content servers, or to an external service such as Amazon S3. There are a few caveats to this, but it’s generally a good idea and can save you quite a few CPU cycles that could be put to use elsewhere.
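
In practice, moving assets elsewhere usually means the app has to generate URLs that point at the other host. The sketch below shows one way this might look; it also embeds a content fingerprint in the file name, a common companion technique that this article does not prescribe, so that long-lived cache headers remain safe when a file changes. The `ASSET_HOST` value and the `asset_url` helper are hypothetical.

```python
import hashlib
import posixpath

# Hypothetical static host; in practice this might be a bucket or CDN URL.
ASSET_HOST = "https://static.example.com"


def asset_url(path, content):
    """Build a URL on the static host with a content fingerprint in the name."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    stem, ext = posixpath.splitext(path)
    return f"{ASSET_HOST}/{stem.lstrip('/')}.{digest}{ext}"


css = b"body { margin: 0; }"
print(asset_url("css/site.css", css))
```

Because the fingerprint changes whenever the file’s content does, browsers fetch the new version immediately while unchanged files stay cached.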

Going even further

As with most things worth doing, your job is far from done. You might have a solid, mostly-available foundation for your web app, but there’s always more work. Here are some additional things to consider:

  1. If your app lets users search through content on the site, moving search-related querying and storage to a separate tier is highly advisable. Products such as Solr and Elasticsearch are well worth your time to investigate.
  2. Caching content that is expensive to generate is sometimes a necessary evil, and it’s always a double-edged sword. Familiarize yourself with common read and write cache policies before adding something like Memcached or Redis.
  3. Often, there are expensive requests that you are unable to cache, such as requests to third parties like Twitter or Facebook for lists of friends or other volatile and personal information. In these cases, an asynchronous job queue is your friend. Anything that does not need to be immediately computed, calculated, or queried to serve a request to a user should (at some point) be moved to an asynchronous task. Good solutions exist, including Celery, a task queue that works with several brokers and result stores, ranging from Amazon Simple Queue Service to RabbitMQ and even Redis and MongoDB.
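
As an illustration of the second point, here is a minimal sketch of one common read policy, often called cache-aside, with a time-to-live and explicit invalidation on writes. It uses an in-process dictionary as a stand-in for something like Memcached or Redis; the `CacheAside` class and `expensive_lookup` function are assumptions made for the example.

```python
import time


class CacheAside:
    """Cache-aside reads with a time-to-live, backed by an in-process dict."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # called on a miss to produce the value
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # fresh hit: skip the expensive work
        value = self._loader(key)  # miss or expired: do the work, then store
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call this after writing to the underlying data to avoid stale reads.
        self._entries.pop(key, None)


calls = []


def expensive_lookup(key):
    calls.append(key)
    return key.upper()


cache = CacheAside(expensive_lookup, ttl_seconds=30)
cache.get("profile:1")
cache.get("profile:1")
print(len(calls))  # 1, because the second read was served from the cache
```

The double-edged part shows up in `invalidate`: forget to call it after a write, and users see stale data until the entry expires.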

Not quite there yet

This article didn’t begin to scratch the surface of what’s possible for a modern web app. Many things were omitted, caveats were left out, and naive simplifications were made. Fear not! One of the great advantages of web apps over desktop apps is that it is possible (and advisable) to make incremental changes to both functionality and performance while remaining relatively transparent to the end user. The strategies above do not all have to be implemented at once. Thankfully.

Much of the necessary skill and intuition is learned through trial and error, and it can take years to become part of your natural tech vocabulary. For all you new and aspiring developers and DevOps engineers out there, don’t give up! Systems and app architecture are not only challenging topics, but also incredibly rewarding.
