WEB Advent 2010 / The Ghost of Christmas Past

PHP has been around nearly as long as there has been a Web to develop for, but it evolves constantly and is a modern programming language by almost any definition. Sure, it might not suit hipsters who only enjoy bands when they can say “you probably have not heard of them; they only formed tomorrow,” but it is not your granddad’s programming language.

Sadly, web development often also involves SQL, and that is your granddad’s programming language. SQL, and the stored procedure languages you are most likely to encounter, have deliberately not evolved much since the 70s and 80s. When your programming ancestors were hunting saber-toothed tigers and scratching proposed SQL syntax onto cave walls, they had some fundamentally different ideas to those behind scripting languages.

If you want to write good, efficient SQL for complex problems, you are going to have to learn some of that mindset, but you can go a fair way with simple SQL by avoiding some traps.

Null

There is a null concept in PHP, but it is a lot more “special” in databases. After you spend a few hours chasing down a bug related to a difference between values that render identically, you’ll consider joining the folks rushing to NoSQL database management systems just on the basis that you’d be able to avoid SQL.

PHP — as you are no doubt aware — is loosely-typed, which means that although null has some special properties, they can easily pass you by unless you’re paying attention. In PHP, null == '', null == 0, null == false and null == null all evaluate to true. That’s because null is a special PHP data type that can only ever have one value, but like other PHP data types, it gets converted into other types automatically, so comparisons can happen.

In strongly-typed languages like SQL, you might expect an empty string to not equal 0 nor to equal Boolean false, and you’d be right, but maybe not in the way you expect. In SQL, null is not a type; it is a value, or more precisely, the absence of a value. Any type can be set to null. It is intended to show that a piece of data is unknown or not applicable, so the reason those comparisons fail is not because of differing type.

When comparing two variables set to '' and null, it is not the fact that they are different types that will make the comparison fail, it is that the empty string is one specific value that a string variable can hold, and null is another. Even if that seems fair and logical, it might still bite you from time to time. If you are looking at the contents of your tables, it is not always obvious that a column contains some values that are null and some that are an empty string. They’ll both look exactly the same from a command line test query. In SQL, null == '', null == 0, null == false and null == null all evaluate to null. That’s right, even null = null is not true, but these values are not false, either. Because null is a special value with a special meaning, two nulls are not necessarily the same. Just because two values are unknown, does not mean they are the same, but at the same time if at least one value is unknown, I don’t know that they are different either. I don’t know next week’s lotto numbers, or those of the following week, but that does not imply they will be the same set. It is unlikely that they will be the same both weeks, but I will not be able to say for sure until both are drawn. Because nulls don’t match, you can’t select rows that have nulls by writing SQL like SELECT * FROM my_table WHERE foo = null. You need to write SELECT * FROM my_table WHERE foo IS NULL.

Stored procedures

There are lots of things that are great about stored procedures, but the syntax is rarely one of them. It’s great to be able to keep logic close to your data, so that manipulation, validation, and integrity checking can all be done with little risk of circumvention. It’s great that your database backend can have business logic shared by different parts of your system without having to share it between different programs or port it to different languages. Doing work in the database where it might get usefully optimized or cached without effort on your part is great.

The syntaxes, on the other hand, have crawled out of another era, and just like the stars of Jurassic Park, they are mean. PostgreSQL’s PL/pgSQL, MySQL’s ANSI SQL:2003 syntax, and Oracle’s PL/SQL are different in detail, but they have a lot in common. They come out of the same era to solve the same problem and have more in common with each other than with modern scripting languages.

Here is a trivial PL/pgSQL function.

CREATE FUNCTION double_two_numbers (IN INTEGER, IN INTEGER, OUT
INTEGER, OUT INTEGER) AS $$
BEGIN
 $3 := $1*2;
 $4 := $2*2;
 RETURN;
END;
$$ LANGUAGE 'plpgsql';

Once upon a time, many programming languages made a distinction between functions that take parameters and return a single value and procedures that use parameters to take input, return values, or take input and modify it. Most of those languages were eaten by a new class of mammals as they slept on rocks waiting for the sun to warm their reptile circulatory systems. Some, however, live on as the stored procedure languages in popular databases.

Because this sample uses out parameters to return the results, other databases would call it a procedure, not a function, but in PostgreSQL, everything is a function. If I call it from the CLI, I get the following:

SELECT * FROM double_two_numbers(21, 99);
column1 | column2
---------+---------
     42 |     198
(1 row)

I could make it easier to read by providing aliases for the parameters, or — depending on my PostgreSQL version — I could even give the parameters names, but for a PHP programmer straying into database territory, the main weirdness is the use of in and out parameters. We can easily just avoid them and be more comfortable. Making them inout parameters as in this version is no less weird looking to a PHP programmer and produces the same output.

CREATE FUNCTION double_two_numbers2 (INOUT INTEGER, INOUT INTEGER) AS $$
BEGIN
 $1 := $1*2;
 $2 := $2*2;
 RETURN;
END;
$$ LANGUAGE 'plpgsql';

That’s probably a reasonable solution to our trivial problem. The function is short and easy to read. The result would be easy to consume in PHP. Whatever database access method you use would produce a row with the two values. The concept of inout parameters is a little weird in the 21st century, though, and does not work well for large sets of returned data.

As a PHP programmer, you might be more comfortable using only in parameters (which is the default if you don’t specify OUT or INOUT. For this example, you then just have to find a way to return two values, and if you are trying to write PHP-style code in your database, the most natural solution might be to return an array, as shown in this third version of the function:

CREATE FUNCTION double_two_numbers3 (INTEGER, INTEGER) RETURNS INTEGER[] AS $$
DECLARE
 result INTEGER[];
BEGIN
 result[1] = $1 * 2;
 result[2] = $2 * 2;
 RETURN result;
END;
$$ LANGUAGE 'plpgsql';

We had to specify our return type and declare it before we used it, but the general logic there should look pretty natural. Sadly, it’s an approach you are likely to regret. Arrays are not widely used in databases. They are supported, but rarely used and as a result you’ll probably run into issues. Operations on them may not be very efficient, examples and documentation will be sparse, and PHP support is patchy.

Using your new function on the command line, you’d see:

SELECT * FROM double_two_numbers3(21, 99);
double_two_numbers3
---------------------
{42,198}
(1 row)

Sadly, calling it from PHP you’d get exactly the same thing, a string containing {42,198}, which you’d have to parse back into separate values or an array if that is what you wanted. That’s not very convenient. For a multi-dimensional array, it becomes kind of painful — definitely best avoided. If you ever have the urge to use an array inside a database, there’s probably a better way. It’s generally a bad idea to try to mechanically port a program from one language to another. A clean solution generally requires that you adopt the normal practices of the language you are using. A better solution to our trivial problem would be extending the previous version to this fourth version that returns values as rows rather than array elements.

CREATE FUNCTION double_two_numbers4 (INTEGER, INTEGER) RETURNS SETOF
INTEGER AS $$
DECLARE
 result INTEGER[];
BEGIN
 result[1] = $1 * 2;
 result[2] = $2 * 2;
 FOR i IN 1 ..2 LOOP
   RETURN NEXT result[i];
 END LOOP;
END;
$$ LANGUAGE 'plpgsql';

Called from the command line, it would look like this:

SELECT * FROM double_two_numbers4(21, 99);
double_two_numbers4
---------------------
                 42
                198
(2 rows)

This is easy to consume from PHP. Each element returned by RETURN NEXT … will be a row in the result set, so you can loop through it in PHP and deal with it as you will. The logic should seem reasonably familiar to a PHP programmer, but it is not using an array, so it’s more in the spirit of a database solution.

Branching out from the modern world of PHP programming into the conservative realm of database programming may not seem intuitive at first, but for a great many web apps, it is vital work. A well-designed, complex web app is often doing most of its work at the database layer, and doing it properly might be the difference between producing something you and your peers want to use and producing something that your granddad would like to drive.

Other posts