WEB Advent 2011 / Egomaniacal and Scalable Apps

The rise of cloud computing over the past few years has been incredible. Unfortunately, many people still do not see its appeal or grasp its importance.

This is not a post about cloud computing and its benefits, but rather a refresher about understanding some basic rules of scalability and how to get started with building scalable apps while keeping in mind that most people and companies have limited budgets.

The concept of shared-nothing architectures is well known in the scalability community — if such a thing even exists — but it seems to be frequently forgotten when it comes to building web apps.

Scalability

Before we dive into scalability theory, a reminder is in order: fast apps and infrastructures don’t necessarily scale.

Although scalability is a complex and arduous concept to explain, the generally accepted meaning of load scalability is the ability of a system to handle a growing amount of work in a capable manner. In other words, it’s the ability to handle more load by adding more computing power. Will you be able to handle more users if you add more servers to your cluster?

Load scalability is about the ability to adjust and adapt. As Darwin once said (or meant to say):

“It is not the fastest web app that survives peaks and popularity; it is the one that is most adaptable to change.”

Share Nothing

For years, the PHP world (most of it) was deploying to a single server that contained the web server, the databases, the uploaded files, the session files, &c.

This outdated setup looks like this:

This was fine until web sites started going down because of a mention on Digg or Slashdot (for those of you who remember Slashdot). Apps were fast, but they couldn’t surpass a certain threshold of users.

This is about the time the concept of shared-nothing architectures began to take hold. Infrastructures are now decoupled, and every component can be easily replaced. This improved setup looks like this:

By sharing nothing, every server that powers your app becomes egomaniacal and does not care about the rest of the infrastructure. Like modular object-oriented code, parts of your infrastructure become independent. For instance, your web server should no longer save sessions in local files, because at any given point in time, the number of web servers can change.

A system that is tightly linked to its filesystem for file uploads, databases, sessions, &c. is not scalable. Luckily, in the PHP world, we have quite a few tools to help us attain a high degree of selfishness.

Sessions

Storing sessions on a single filesystem means that when the system scales and adds or removes servers, some sessions will be lost. There are a few solutions to the problem.

Memcached

Memcached has been used for many years in the PHP world, and it is a great way to store objects and sessions across a cluster of servers. For an even more scalable Memcached infrastructure, I recommend Membase, which is open source and provides elasticity as well as persistence for Memcached.

Once you set up your Memcached infrastructure (and the PHP extension), you simply modify your session handler to point to your Memcached server (or pool) as follows:

<?php
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', '1.2.3.4:11211');

Many people dislike using Memcached for session storage because the data is ephemeral. If your server dies, all of your sessions disappear, and your users are logged out.

Membase avoids this problem and is fully compliant with the Memcached protocol.
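
When the session store is a pool rather than a single server, the memcached extension reads session.save_path as a comma-separated list of host:port entries. The following is a minimal sketch of building that list; the memcached_save_path() helper is hypothetical, and the addresses are placeholders, not part of the original setup.

```php
<?php
// Hypothetical helper: joins host/port pairs into the comma-separated
// list that the memcached extension reads from session.save_path.
function memcached_save_path(array $servers): string
{
    $entries = [];
    foreach ($servers as $server) {
        [$host, $port] = $server;
        $entries[] = $host . ':' . (int) $port;
    }
    return implode(',', $entries);
}

// Placeholder addresses; replace them with your own pool.
$pool = [['1.2.3.4', 11211], ['4.3.2.1', 11211]];
$savePath = memcached_save_path($pool);

// Only change the session settings when the extension is available.
if (extension_loaded('memcached')) {
    ini_set('session.save_handler', 'memcached');
    ini_set('session.save_path', $savePath);
}
```

Keeping the server list in one array means web servers can be added or removed without touching the session code itself.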

Redis

Another way to distribute sessions persistently across a cluster of computers is to use Redis and the Redis PHP extension.

Redis is an advanced key-value store (data structure server) that can contain strings, hashes, lists, sets, and sorted sets. Redis can be naturally replicated, and, despite being an in-memory system, it can also fall back to storing data on disk.

After installing the aforementioned PHP extension, you have to modify your session handler to use Redis:

<?php
ini_set('session.save_handler', 'redis');
ini_set('session.save_path', 'tcp://1.2.3.4:6379');

If you want to use a cluster of Redis instances, you can specify multiple servers:

<?php
ini_set('session.save_handler', 'redis');
ini_set('session.save_path', 'tcp://1.2.3.4:6379?weight=1, tcp://4.3.2.1:6379?weight=2');

SessionHandler

PHP 5.4 has a new class called SessionHandler. This new class lets you write a custom object-oriented session handler.

Saving your sessions to Memcached using SessionHandler would look something like this:

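<?php
// Extends the built-in handler, so the configured save handler
// (Memcached here) still does the actual storage.
class MemSession extends SessionHandler {

    // Hash the session ID before it is used as the storage key.
    public function read($key) {
        return parent::read(sha1($key));
    }

    public function write($key, $value) {
        return parent::write(sha1($key), $value);
    }

    // Hash the ID here too; otherwise the stored entry is never removed.
    public function destroy($key) {
        return parent::destroy(sha1($key));
    }
}

ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', '1.2.3.4:11211');
session_set_save_handler(new MemSession);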

Of course, sessions can be, and often are, saved to a database. Using SessionHandler, it is now easier than ever to do so.
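
As a rough illustration of the database approach, here is a minimal sketch that implements SessionHandlerInterface on top of PDO. The class name, table name, and column names are hypothetical; the typed signatures assume PHP 8, and the INSERT OR REPLACE statement is SQLite syntax chosen only so the example runs without a database server.

```php
<?php
// Sketch only: PdoSessionHandler and the "sessions" table are made up
// for this example and are not part of PHP or any library.
final class PdoSessionHandler implements SessionHandlerInterface
{
    private PDO $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
        $this->pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        $this->pdo->exec(
            'CREATE TABLE IF NOT EXISTS sessions (
                id TEXT PRIMARY KEY,
                data TEXT NOT NULL,
                updated_at INTEGER NOT NULL
            )'
        );
    }

    public function open(string $path, string $name): bool
    {
        return true; // The PDO connection is already open.
    }

    public function close(): bool
    {
        return true;
    }

    public function read(string $id): string|false
    {
        $stmt = $this->pdo->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute([$id]);
        $data = $stmt->fetchColumn();
        // An unknown session ID reads as an empty session.
        return $data === false ? '' : (string) $data;
    }

    public function write(string $id, string $data): bool
    {
        // SQLite-specific upsert; other databases need their own syntax.
        $stmt = $this->pdo->prepare(
            'INSERT OR REPLACE INTO sessions (id, data, updated_at) VALUES (?, ?, ?)'
        );
        return $stmt->execute([$id, $data, time()]);
    }

    public function destroy(string $id): bool
    {
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE id = ?');
        return $stmt->execute([$id]);
    }

    public function gc(int $max_lifetime): int|false
    {
        // Remove sessions that have not been written to recently.
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE updated_at < ?');
        $stmt->execute([time() - $max_lifetime]);
        return $stmt->rowCount();
    }
}

// Usage (left commented out so the sketch has no side effects):
// session_set_save_handler(new PdoSessionHandler($pdo), true);
// session_start();
```

In a real cluster, the PDO connection would point at a database that every web server can reach. A local SQLite file would reintroduce exactly the shared-filesystem problem described above.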

Distributed, Redundant File Storage

Another limitation of most traditional web apps is saving files to a local filesystem.

If an app scales by adding more servers, saving files to the filesystem has the same limitations as saving sessions to the filesystem. The presence of files will be inconsistent across nodes, and your users will be frustrated.

A solution to this problem is to use systems that are built to store, distribute, and replicate files across multiple regions. A good example is Amazon S3, which is easily supported by Zend_Service_Amazon_S3 or PEAR::Services_Amazon_S3.

For Symfony2, I use a bundle I made named symfony2-s3streambundle. It lets you easily write logs and similar files directly to Amazon S3 using Monolog.

Once the bundle is installed, and your Amazon S3 bucket is created, all you have to do is modify the app/config/config_prod.yml file to add the following:

monolog:
    handlers:
        nested:
            type:  stream
            path:  s3://your-bucket/%kernel.environment%.log
            level: debug
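
A path such as s3://your-bucket/... can work because PHP routes its file functions through stream wrappers. Assuming a library has registered an s3:// wrapper (which the bundle above presumably does), the same idea can be applied to uploaded files. The sketch below uses a hypothetical store_file() helper, and the local temporary directory stands in for a bucket URI so that the example runs anywhere.

```php
<?php
// Hypothetical helper: writes a file under a base URI using PHP's
// stream-aware file functions, so the storage backend is only a prefix.
function store_file(string $baseUri, string $name, string $contents): string
{
    // basename() strips directory components from the supplied name.
    $target = rtrim($baseUri, '/') . '/' . basename($name);
    if (file_put_contents($target, $contents) === false) {
        throw new RuntimeException("Could not write $target");
    }
    return $target;
}

// Local stand-in for the example. With an s3:// stream wrapper
// registered, the base could be 's3://your-bucket' instead.
$base = sys_get_temp_dir();
$path = store_file($base, 'avatar.txt', 'hello');
```

Because the rest of the app only sees a base URI, switching from local disk to shared storage becomes a configuration change rather than a code change.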

There are other ways to distribute files, such as building your own filesystem cluster using GlusterFS, NFS, or any other variant of a distributed and eventually consistent file-storage system. I usually recommend using Amazon S3, because it is fast, reliable, highly available, and widely supported.

Conclusion

With cloud computing becoming more standardized, and platforms such as Orchestra providing scalable architectures at the click of a mouse, application architecture is becoming more important, while infrastructure architecture is becoming simpler and more reliable.

There are more things to ponder when building a truly scalable architecture, such as message queuing, database replication, automated backups, self-healing, elasticity, and locality. PHP has many extensions and tools to help your app scale.

On a related note, if you are unaware of the Gearman and ZeroMQ projects, I recommend learning more about them and considering whether they might fit into your current or next project.

Developer Gift

Einstein once said, “God does not play dice with the world.” Unfortunately for him, he may have been a few years too early for the very cute mathematician’s dice, and they might have changed his mind. These are a great gift for adults and children alike. Happy Christmas!
