A few years ago, Damien Seguy started collecting information about what version of PHP people were using and how PHP’s usage compared to competing technologies such as ASP.NET. As a release manager for PHP 5.1 and 5.2, it was particularly interesting to me, because the monthly stats showed the adoption trends of PHP 5.2 and served as a good gauge of how quickly people were migrating. I was also actively involved in the development of FUDforum, and this data helped determine what new PHP features I could rely on and whether support for older versions of PHP could be discontinued.
Unfortunately, sometime in 2008, the process of gathering these stats petered out, and the PHP community was left without them. About a month ago, after talking to Damien, I decided to restart the process and eventually expand it from 11 million domains to about 120 million. I want to share some of the data and the conclusions that can be drawn from my initial run.
Before we jump right into the statistics, let’s take a moment to review how the data was gathered. The first step was to write a tool that could collect it. To keep things simple, I decided to use pecl_http, because it allows multiple parallel requests, which I would need in order to make millions of requests within a reasonable timeframe. Given that the test was running on a fairly powerful server, I assumed that the bottleneck would likely be the network, so writing the tool in C would not yield any substantial benefit. At this point, the goal was to gauge the speed of data retrieval and see how practical it would be to regenerate the data on a monthly basis.
To minimize bandwidth usage and to speed things up, I used HEAD requests with a 3-second timeout. I arbitrarily decided that if a site does not respond within 3 seconds, it’s not worth testing, because I would certainly not wait more than 3 seconds for a page to start loading.
My first test run, with 25 parallel requests yielding 10 requests per second, was way too slow for my purposes. Increasing the number of parallel requests, surprisingly, did nothing to improve speed. Looking at the CPU and network utilization did not expose any issues either; the load was negligible, and there was very little traffic on the network. A bit of a WTF moment. Fortunately, the problem was quick to identify, although it did leave me a bit disappointed in libcurl, which pecl_http relies on. While libcurl can certainly process a large number of parallel requests, when it comes to resolving domain names to IP addresses, the process is actually sequential, not parallel! Surprised? Yeah, I was, too. So, how do you resolve 12 million domains quickly? Well, if Ken Thompson is to be believed, when in doubt, use brute force, and by brute force, I mean C. And thus, resolv.c came to be: a 150-line multi-process resolver. Using 50 forked children, it blew through 12 million domains in about 30 hours and, in the process, drove named's memory usage to about 3.4 gigabytes.
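The fork-and-divide idea behind resolv.c can be sketched in a few lines. This is not the actual resolv.c, just a toy version under the same scheme: each of N children takes every N-th name, so the slow, blocking getaddrinfo() calls run in parallel processes.

```c
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/wait.h>
#include <unistd.h>

/* Resolve one hostname to a dotted-quad IPv4 string.
 * Returns 0 on success, nonzero on failure. */
int resolve_one(const char *host, char *ip, size_t len) {
    struct addrinfo hints = {0}, *res = NULL;
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;
    struct sockaddr_in *sin = (struct sockaddr_in *)res->ai_addr;
    inet_ntop(AF_INET, &sin->sin_addr, ip, len);
    freeaddrinfo(res);
    return 0;
}

int main(void) {
    /* Toy worker pool: child c handles every nchildren-th name, so the
     * blocking lookups proceed in parallel across processes. */
    const char *names[] = { "localhost", "localhost", "localhost", "localhost" };
    const int nnames = 4, nchildren = 2;
    for (int c = 0; c < nchildren; c++) {
        if (fork() == 0) { /* child process */
            for (int i = c; i < nnames; i += nchildren) {
                char ip[INET_ADDRSTRLEN];
                if (resolve_one(names[i], ip, sizeof ip) == 0)
                    printf("%s -> %s\n", names[i], ip);
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0) ; /* reap all children */
    return 0;
}
```

The real version reads its 12 million names from disk and writes results back out, but the parallelism is the same: processes, not threads, so a stuck lookup in one child cannot wedge the others.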
After resolving, I tweaked the original PHP code to connect to the IPs directly and send a Host header containing the corresponding domain. With this in place, it was just a matter of determining how many parallel requests would saturate the 10 MB pipe. The magic number ended up being 400, which kept things going at an average of 150 requests per second. After another day of operation, I had 10.8 million successfully resolved and completed requests from the initial 12.3 million domain sample.
To minimize overhead during request processing, the actual data analysis was left until the end. If you are curious how many gigabytes it takes to store 10.8 million headers, the answer is around 4.9. For my purposes, I focused on three headers.
The above chart shows the breakdown of the six major identifiable languages across the 6.7 million domains where the language could be determined. One of the surprising things to me was the popularity of ASP.NET. The next chart, showing web server popularity, explains this anomaly.
As you can tell from the data, nearly every IIS server (92%) reports as being powered by ASP.NET, even though many of them are probably serving static content. One of the reasons is that it seems to be rather nontrivial to remove versioning information from IIS, which is why many people do not do so. Apache, Lighttpd, and Nginx make this trivial, so hiding versioning information is a fairly common practice on those servers: 40% of all domains served by Apache limit information exposure to just the server name, 52% of Lighttpd users hide all extra versioning data, and 29% of Nginx users follow the same pattern. From a statistical point of view, this is a bit annoying, but from a security and data-minimization perspective, it is a good approach.
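For reference, the hiding is a one-liner on each of the open-source servers; these are the standard directives, not something derived from the survey data:

```
# Apache (httpd.conf): report only "Apache", no version or module list
ServerTokens Prod
ServerSignature Off

# Lighttpd (lighttpd.conf): replace the default Server value outright
server.tag = "lighttpd"

# Nginx (nginx.conf): drop the version number from the Server header
server_tokens off;
```

PHP's own fingerprint, the X-Powered-By header, can likewise be suppressed with expose_php = Off in php.ini, which is part of why the language chart above has an "undetermined" majority.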
The PHP version breakdown is fairly positive.
As you can tell, nearly 80% of all PHP installations have migrated to PHP 5, with 5.2 being the most popular of the lot. 22% are holding on to 4.4 and 4.3 at this point, but that’s still a massive improvement over October 2008 (the most recent stats from Damien), when PHP 4 dominated at 52%. Clearly, there is much improvement to be made on the PHP 5.3 side, which has only managed to capture 4% in a little more than a year. That said, this can probably be attributed to the fact that most Linux distributions have only recently started to package PHP 5.3 as stable. So, over the next year, I suspect 5.3’s market share will grow dramatically.
On an amusing note, there are still some 390 sites running PHP 3, and two brave souls are still using PHP 1.3. An even braver group got their hands on the PHP 6 beta version (now defunct) and are running some 39 domains on it. Not to be outdone, a few others traveled to the future, got their hands on PHP 6.6 and 7.5, and are now running their sites on those. Either that, or they figured out how to change the version string.
Within 5.2, the usage pattern is fairly interesting. Just over a quarter are using the latest stable release, another 20% are within 3 versions of the stable release, and 23% are using PHP 5.2.6, the first 5.2 version that most distributions shipped as stable. The remaining 31% are just about evenly split among all of the remaining 5.2 versions.
When it comes to 5.3, early adopters rule the day. As you can tell from the graph, nearly everyone is either using the latest stable or the preceding version.
As part of the data gathering process, much more information has been captured. Over the next few weeks, I will be working on moving the raw data into a database you can reference, so keep an eye on my blog. Some of the data will include Linux distributions, speed of access, character sets, compression support, etc. You might be surprised just how much data can be gathered from HTTP headers alone. And, of course, there are still 110 million domains to go.