WEB Advent 2011 / Cracks in the Foundation

PHP has been around for a long time, and it’s starting to show its age. From top to bottom, the language has creaky joints. I’ve decided to take a look at how things got to this point, and what can be (and is being) done about it. I start out pretty gloomy, but bear with me; I promise it gets better.

In the Beginning, There Was Apache and CGI

And there was much rejoicing.

In 1994, Rasmus Lerdorf created the “Personal Home Page Tools,” a set of CGI binaries written in C. These tools looked little-to-nothing like the PHP we know today. Embedded in HTML comments, and using a syntax bearing no resemblance to C, they still contributed one critical principle to modern PHP: it just worked.

PHP is a language of convenience. It was built with the idea that anyone could toss together a few lines of code and have a working CGI script, without having to worry about the server interface, the cryptic syntax of Perl, or the pitfalls of C. It’s a great idea, in theory, and for the most part, it’s worked very well in practice.

Unfortunately, nothing’s perfect — PHP included. Over time, PHP has suffered everything from security failures to bad design decisions. Some of these problems were avoidable, but others weren’t.

Backward Compatibility Is a Female Dog

Backward Compatibility (or BC) is the bane of every library and app writer in existence. It stifles improvements, holds back innovation, promotes unsafe practices, frustrates users, and slows development. PHP, being a language intended for beginners, suffers from it even more than most.

When BC is broken, apps break. Operating system vendors who ship the new and improved version of PHP tend to come under fire for having a broken system, even though they did nothing wrong. More often, the writers of the apps are vilified for not providing working software, despite having done nothing but follow the manual.

Sometimes, the source of the problem is correctly identified, and the PHP developers are berated for trying to make a better language. No matter who takes the blame, though, one thing remains constant: users rarely understand anything other than “it’s broken!” They don’t care whether the new version is better. The old one worked. They want it to keep working. It’s a reasonable expectation.

Unfortunately, with a programming language, it’s often impossible to meet that expectation without sacrificing features, safety, or speed — usually more than one of these.

PHP 4.4 was released to fix a bug that caused memory corruption when references were misused. The fix changed the internal API, forcing every extension module to be rebuilt. Unfortunately, rebuilding extensions can be an arduous process in some environments. Some extensions don’t come with source code. Others are ancient, and code that managed to struggle along finally stops compiling cleanly. Vendors (such as those who provide various flavors of Linux) who ship packages have to rebuild and test not only PHP itself, but also every extension they ship before pushing to their repositories.

The results of all this are twofold. First, almost everyone holds off on updating to the new version to avoid the work and cost involved in fixing the problems that arise, leaving them all running a version with publicly disclosed memory corruption bugs. Second, PHP itself is discouraged from making similar changes in the future, lest adoption of compatible fixes be slowed. It only gets worse when the change breaks source compatibility, forcing app writers to change their code even after the vendors catch up.

All of this is bad enough when the change is essential for whatever reason. It’s far worse when the compatibility break is the result of bad choices or poor planning.

Innovations Are the Devil’s Playthings

There are several examples in PHP’s history of changes proposed by well-meaning, forward-thinking developers who were trying to make PHP better. Some were shouted down because of the trouble it would cause to implement them. Others, worse, were actually made, and chaos ensued because no one realized how far the effects would reach.

A recent example is a change in the behavior of the is_a() function between versions 5.3.6 and 5.3.7. is_a() used to accept a string as its first argument and simply return false; it was changed to treat that string as a class name, calling the autoloader when the named class doesn’t exist. The new behavior was technically correct and consistent with the is_subclass_of() function, but because the autoloader was now invoked where it previously hadn’t been, a great deal of working code broke. Several PEAR packages started throwing exceptions from the autoloader due to supposedly missing classes. Fixing these packages required adding an extra is_object() check to every place is_a() was used.
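The is_object() guard mentioned above might look something like the following. This is only a sketch: the isInstanceOf() helper and the $autoloadRequests array are hypothetical names invented for illustration, not code from PHP or PEAR, and the recording autoloader exists purely to show whether the autoloader was consulted.

```php
<?php
// Record every class name the autoloader is asked to load, so that
// unwanted autoload calls become visible.
$autoloadRequests = array();
spl_autoload_register(function ($className) use (&$autoloadRequests) {
    $autoloadRequests[] = $className;
});

// Hypothetical helper (not part of PHP or PEAR): only hand real objects
// to is_a(), so a stray string is never treated as a class name.
function isInstanceOf($value, $className)
{
    return is_object($value) && is_a($value, $className);
}

$object  = new ArrayObject();
$message = 'NoSuchClass'; // a plain string that happens to look like a class name

var_dump(isInstanceOf($object, 'ArrayObject'));  // bool(true)
var_dump(isInstanceOf($message, 'ArrayObject')); // bool(false)
var_dump(count($autoloadRequests));              // int(0): the autoloader was never consulted
```

Because is_object() short-circuits the check, the string never reaches is_a(), so the result is the same whichever way a given PHP version treats string arguments.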

Unfortunately, by the time the scope of the problem was realized, and a solution was agreed upon, a further version in the 5.3 series, 5.3.8, had been released, with the new behavior intact. To revert the behavior at that point would have created a whole new BC break and necessitated a second round of code changes. By the time the situation was settled with a patch to the 5.4 tree and a reversion in 5.3.9, a CVE report had been entered for the new behavior, and several mailing list threads regarding poor testing coverage and lack of procedures for BC breakage had reached impressive length.

All of this could have been avoided if a unit test had caught the break, or a procedure regarding BC breaks had been in place to prevent the change. The original author of the change can’t be held responsible for fixing a bug and correcting inconsistent behavior, but in an environment where fixes and corrections can have such far-reaching consequences, blame tends to get assigned (that fortunately didn’t happen in this case), and people become less willing to fix things.

On a rather less excessive note, for a long time there have been complaints about the inconsistent naming of array functions. It would be quite nice if, for example, the [uka][r]sort() family of functions was instead named array_[uka][r]sort(), but while the new function names could be added, the old ones couldn’t be removed for a very long time — at least two major PHP versions. With that limitation in mind, it seems pointless to add the new names. So, the old ones just stay — the result of a design decision far in PHP’s past, reasonable at the time (shorter function names were easier to understand, remember, and use).
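To make the renaming idea concrete, here is a minimal userland sketch of what one added name could look like. array_ksort() is a hypothetical function, not something PHP provides; the function_exists() guard is there in case that assumption ever stops holding. The sample data is invented for illustration.

```php
<?php
// Hypothetical alias: array_ksort() is NOT a built-in PHP function.
// It simply forwards to the existing ksort(), which sorts by key in place.
if (!function_exists('array_ksort')) {
    function array_ksort(array &$array, $flags = SORT_REGULAR)
    {
        return ksort($array, $flags);
    }
}

$ages = array('carol' => 41, 'alice' => 30, 'bob' => 25);

array_ksort($ages);

print_r(array_keys($ages)); // keys are now in the order alice, bob, carol
```

Note that ksort() is still there and still has to work; the alias adds a name without retiring one, which is exactly the limitation described above.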

When the Porpoises Ask the Few Survivors What Went Wrong

If, in the past, someone had asked me what was wrong with PHP, I probably would have said something like “developer apathy.” In early 2009, I took the lead in moving PHP’s source code from the increasingly creaky CVS repo to a shiny new SVN repo. I found that individuals with specific knowledge, such as whether a particular module needed to be saved or not, were in plentiful supply, but it seemed to me as if there was a lack of people to help with the overall process.

Let history be my judge in that regard. It may have been that offers of help were made that I misunderstood or chose to ignore. If so, I apologize to all who made the effort and were rebuffed. I do, however, stand by my assertion of apathy as a problem; PHP 6’s failure is an equally good example.

Even if we accept that perception as valid, it’s certainly not so any longer. A great deal more is happening with PHP now at the end of 2011 than was going on in early 2009. If asked what the problem is today, I would say, “no design and no plan.”

PHP has always been an evolving, almost-organic language. It has been rewritten from the bottom up at least four times, with massive internal changes to the engine at least twice more. Through all these mutations, however, its external interface (the language itself) has remained quite similar for a long time. Nearly everything that can be pointed to as different between PHP 3 and PHP 5.4 is an addition or extension to the language, not a change in existing behavior. There are exceptions, such as the new object model, but by and large, a coder who knows PHP 5 will be able to make complete sense of PHP 3 code, and vice versa. All of these versions share one flaw: there is no single specification of the language!

External tokenizers have to be implemented by reading the Zend Engine’s re2c and Bison input files. Reimplementations of PHP have to refer constantly to Zend’s implementation to understand the quirks of the engine. The behavior of the language is inconsistent, and it often feels clunky. In particular, expressions do not reduce recursively to values as one might expect. (new SomeClass)->methodReturningClosureReturningArray()()[5] causes a parse error, for example, though a similar construct in Objective-C, [[[SomeClass alloc] init] methodReturningBlockReturningArray]()[5] works fine.
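For the PHP versions discussed here, the usual workaround is to give each intermediate value its own variable. The sketch below reuses the SomeClass and methodReturningClosureReturningArray() names from the example above; the array contents are made up for illustration.

```php
<?php
// SomeClass and its method are illustrative names taken from the text;
// the array values are arbitrary.
class SomeClass
{
    public function methodReturningClosureReturningArray()
    {
        return function () {
            return array(10, 20, 30, 40, 50, 60);
        };
    }
}

// Instead of chaining everything in one expression, reduce it step by step.
$object  = new SomeClass();
$closure = $object->methodReturningClosureReturningArray();
$array   = $closure();
$value   = $array[5];

var_dump($value); // int(60)
```

Four statements do the work of one expression, which is the clunkiness the paragraph above is describing.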

There are a variety of reasons why this behavior is part of the PHP language, but they boil down to two major points. First, there is no specification that says how the language should work; there’s nothing to compare against and say, “this is right” or “this is wrong.” Second, fixing issues like these in a complete and lasting form would necessitate a parser rewrite, and that means reimplementing the entire language differently. “BC break” doesn’t even begin to cover it.

I Can See Clearly Now

I’ve gone on at quite some length about PHP’s problems. I’ve mentioned some solutions to those problems, but I haven’t said much about what’s actually happening. So, here’s the situation, and it’s not nearly so bad as I may have made it sound.

BC breaks
A new release process was adopted in June of 2011, clarifying the timeline for releases, including the proper times for changes which break BC. This has also put PHP on a track for more regular releases in general, which is a significant help for vendors who bundle PHP.
Communication problems
PHP has historically had trouble with no one outside the core team knowing what was happening. In the last several months, there has been considerably more communication with OS vendors and others affected by changes and the release timeline. Some of them came to us, but in other cases we went to them.
Lack of specification and standardization
The lack of a language specification remains a significant issue, but at the very least, awareness of the problem has increased. An initiative to document the language behavior in a format such as EBNF has been suggested.
Unit tests ignored and broken
This particular problem, brought to the public eye by a major security bug in the 5.3.7 release which was caught — and ignored — by a unit test, was dealt with shortly thereafter. Since that time, a huge number of failing tests were fixed, and the release managers now pay a great deal more attention to the test suite.
Developer apathy
Developer apathy in the PHP community has largely disappeared. Participation and discussion are considerably improved from where they once were. Development is now active on the 5.4 branch, the development line that grew out of the PHP 6 effort, and it’s already in RC status as of this writing.
Undocumented, confusing engine API
Unfortunately, the cruft of the Zend Engine’s API remains a sticking point. The API is complicated, mostly undocumented, and completely unintuitive. zval reference management comes to mind. Efforts made to document the API have stalled time and again. The news isn’t all grim, though; an RFC is in discussion to completely separate the internal API and create a new, clean external API.
Bit rot (old code and features holding back new things)
The new release process eases this quite a bit; it’s now safe to say there will be a new version at some point which isn’t afraid to break BC. This follows from a simple and obvious fact: the people affected by BC breaks will now have warning that it will happen. Deprecation in point releases is a bad way to handle future changes. Giving the expectation of major changes at a defined point in the future is a good way, and that’s where PHP is headed.
No Unicode
The lack of Unicode support in PHP remains a serious problem, but at least we’re no longer nursing a dying animal (PHP 6) in the vain hopes of making it work when the entire surrounding situation has changed. This opens the door for new ideas on how to fix the issue. For more on PHP 6’s history, see this excellent set of slides by Andrei Zmievski.

A Little Nonsense Now and Then Is Relished by the Wisest Men

It’s safe to say PHP still has a long way to go to be the shining beacon of light we’d all like to see in a language, but it has withstood the test of time better than any other language of its kind, and it sees daily use across millions of servers. Nothing that’s wrong with PHP or its development is unsolvable, and tremendous progress has been made in the last year. No matter how dark I may have sounded during some parts of this article, I’m proud to be a part of PHP, and I hope I’ll continue in the future.

Developer Gift

A gift I recommend for any developer is a copy of any comprehensive book on hardware architecture and operating system design, with Inside the Machine and Operating Systems Design and Implementation being two excellent examples. I have long held the belief that any developer’s skills — whether they’re writing raw machine code or PHP or anything in between — can be improved by a clear and comprehensive understanding of the machines themselves.
