WEB Advent 2009 / Character Sets: Garbage In, Garbage Out

I recently checked on a project, because the developers were having some problems with strange characters displaying on the page. In particular, a number of accented characters were showing up as the Unicode replacement character (). How sad.

A bit of research revealed that their pages were being served in UTF-8, but the data store was using ISO-8859-1. A few patches were applied, and the string in question was fixed. In order to manage character sets correctly, your entire application — from input to database to output — needs to be configured correctly. This can be quite a challenge, and in this article I will introduce some of the problems and provide a few solutions.

Character sets

Character sets are basically sets of symbols that computers use to map numbers to the glyphs you see. You can think of them as tables that contain all of the glyphs you might need — not only letters, but also punctuation, symbols, space, backspace, &c. Here’s an example character set I’ll call Paul’s First Character Set:

 0123456789
0abcdefghij
1klmnopqrst
2uvwxyzABCD
3EFGHIJKLMN
4OPQRSTUVWX
5YZ01234567

To find a character’s code, take the label of its row followed by the label of its column. For example, h is 07, and i is 08. To look up a word (Hello, for example), look up each character in the chart: 33 04 11 11 14. Of course, there isn’t just one character set in the world; there are quite a few. Here’s another example character set I’ll call Paul’s Second Character Set:

 0123456789
00123456789
1ABCDEFGHIJ
2KLMNOPQRST
3UVWXYZabcd
4efghijklmn
5opqrstuvwx

In this new character set, Hello is 17 40 47 47 50. You might already be able to imagine how problems can arise. If your browser tries to display data in the wrong character set, it ends up looking all garbled. Some characters may look right, while others look broken. The reason that garbled text is often mostly right is that many Western character sets put the standard English alphabet — as well as numbers and a few other things — in the same place. (My two fictional examples don’t do this.) So, those characters appear to work fine, but less common characters don’t.
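To make the lookups concrete, here is a minimal PHP sketch of the two fictional charts. The helper name encode_with_chart() and the array layout are my own assumptions for illustration; each chart row is stored as a string, and a code is simply the row index followed by the position within that row.

```php
<?php
// Each row of a chart is one string. The row index is the first digit of
// a code and the position within the row is the second digit.
$firstSet = [
    'abcdefghij',
    'klmnopqrst',
    'uvwxyzABCD',
    'EFGHIJKLMN',
    'OPQRSTUVWX',
    'YZ01234567',
];
$secondSet = [
    '0123456789',
    'ABCDEFGHIJ',
    'KLMNOPQRST',
    'UVWXYZabcd',
    'efghijklmn',
    'opqrstuvwx',
];

// Hypothetical helper: turn text into space-separated two-digit codes.
function encode_with_chart(string $text, array $chart): string
{
    $codes = [];
    foreach (str_split($text) as $char) {
        foreach ($chart as $row => $glyphs) {
            $column = strpos($glyphs, $char);
            if ($column !== false) {
                $codes[] = $row . $column;
                continue 2; // found it, move on to the next character
            }
        }
        throw new InvalidArgumentException("No code for '$char' in this chart");
    }
    return implode(' ', $codes);
}

echo encode_with_chart('Hello', $firstSet), "\n";  // 33 04 11 11 14
echo encode_with_chart('Hello', $secondSet), "\n"; // 17 40 47 47 50
```

The same five letters produce two completely different sequences of numbers, which is all a character set mismatch really is.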

This also reveals another problem. If someone simply tells you that a message is 44 04 02 17 04 19 without identifying the character set, you can’t be sure how to read it. You could decode the message twice and see what you come up with (in this case, Secret and i42H4J), then pick whichever one looks best. One is an English word that contextually makes sense; the other could quite conceivably be a password. In the presence of other text you could probably make a decision, but it’s a mess, even when only dealing with two fictional character sets.
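Decoding the same numbers under both charts can be sketched in a few lines of PHP. As before, decode_with_chart() and the array layout are hypothetical names chosen for this illustration, not part of any library.

```php
<?php
// The two fictional charts, one string per row.
$firstSet = [
    'abcdefghij',
    'klmnopqrst',
    'uvwxyzABCD',
    'EFGHIJKLMN',
    'OPQRSTUVWX',
    'YZ01234567',
];
$secondSet = [
    '0123456789',
    'ABCDEFGHIJ',
    'KLMNOPQRST',
    'UVWXYZabcd',
    'efghijklmn',
    'opqrstuvwx',
];

// Hypothetical helper: turn space-separated two-digit codes back into text.
function decode_with_chart(string $codes, array $chart): string
{
    $text = '';
    foreach (explode(' ', $codes) as $code) {
        $row = (int) $code[0];    // first digit selects the row
        $column = (int) $code[1]; // second digit selects the column
        $text .= $chart[$row][$column];
    }
    return $text;
}

$message = '44 04 02 17 04 19';
echo decode_with_chart($message, $firstSet), "\n";  // Secret
echo decode_with_chart($message, $secondSet), "\n"; // i42H4J
```

Nothing in the numbers themselves says which result is the intended one; both decodings succeed without any error.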

In order to avoid generating garbage, character sets must be considered at every step.

Step 1: Form

Browsers take a hint from the character set of the page containing a form to determine what character set to use when sending data back. If your page is sent in UTF-8, browsers should return data in the same encoding. Remember that the content type can be set in two places: an HTTP response header and an HTML meta tag. I prefer the HTTP response header. If you’re sending both, make sure they agree with each other. Many sites send contradictory content types, which can confuse browsers, although most will prioritize the HTTP header. (Remember that configuration directives in .htaccess can override those in httpd.conf.)

The <form> tag has an accept-charset attribute that should be used to offer additional direction to the browser when it comes to transmitting data back to the server.
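One way to keep the header, the meta tag, and the form in agreement is to derive all three from a single value. The following is only a sketch under that assumption: the constant APP_CHARSET, the helpers content_type_value() and render_page(), the /signup action, and the username field are all placeholders I made up for illustration.

```php
<?php
// Assumption for this sketch: the whole application uses UTF-8.
const APP_CHARSET = 'UTF-8';

// Build the content type once, so the HTTP header and the meta tag
// cannot disagree with each other.
function content_type_value(string $charset): string
{
    return 'text/html; charset=' . $charset;
}

// Hypothetical helper that renders a small page with a form.
function render_page(string $charset): string
{
    $contentType = content_type_value($charset);
    return <<<HTML
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="{$contentType}">
<title>Sign up</title>
</head>
<body>
<form method="post" action="/signup" accept-charset="{$charset}">
<input type="text" name="username">
<input type="submit" value="Send">
</form>
</body>
</html>
HTML;
}

// Send the header before any output, then the page itself.
if (!headers_sent()) {
    header('Content-Type: ' . content_type_value(APP_CHARSET));
}
echo render_page(APP_CHARSET);
```

The charset value here is a constant under the application’s control; it should never come from user input, since it is written into the markup unescaped.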

Step 2: PHP

When validating input, the functions that you use need to be compatible with the character set you’re using. You can’t rely on strlen() to count the characters of a string in a multi-byte character set. The same goes for substr(), and pretty much the entire library of string functions. Luckily, this will change in PHP 6. Until then, take a look at the Multibyte String (mbstring) extension.

The problem with PHP’s internal string library is that it assumes every character is a single byte. In other words, strlen() actually counts bytes, not characters. Other functions, like substr(), use byte counts for offsets, rather than actually iterating through characters. The mbstring extension is pretty easy to use and generally available.
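The difference between bytes and characters is easy to see with a short word. This sketch assumes the mbstring extension is loaded and that the data is UTF-8; the bytes are written as escapes so that the result does not depend on how the file itself is saved.

```php
<?php
// "naïve" in UTF-8: the ï is the two bytes C3 AF.
$word = "na\xC3\xAFve";

echo strlen($word), "\n";             // 6 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n"; // 5 (characters)

// substr() cuts at a byte offset and can split a character in half.
$broken = substr($word, 0, 3);             // "na" plus a dangling C3 byte
$intact = mb_substr($word, 0, 3, 'UTF-8'); // "naï"

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($intact, 'UTF-8')); // bool(true)
```

A string cut with substr() like this is exactly the kind of input that later shows up on a page as a replacement character.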

When uploading UTF-8 files to the server, you should use the binary transfer method, not ASCII, for reasons I hope are now clear.

Step 3: Database

Your database needs to be configured to use the correct character set as well. If you tell your database to use ASCII, set up a varchar(25) for usernames, and try to store a string that’s 25 characters long, encoded in a multi-byte character set, you’ll have a problem. The collation setting will determine how ordering and search results work. To quote the MySQL documentation:

“A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.”
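Going back to the varchar(25) example, a quick PHP check shows why a 25-character username may not fit into storage that was sized for single-byte characters. The username is a made-up value, and what the database actually does with the overflow (truncate, reject, or mangle it) depends on the product and its settings, so this sketch only measures the string.

```php
<?php
// A hypothetical 25-character username made entirely of "é",
// which is the two bytes C3 A9 in UTF-8.
$username = str_repeat("\xC3\xA9", 25);

$characters = mb_strlen($username, 'UTF-8');
$bytes = strlen($username);

echo "characters: $characters\n"; // characters: 25
echo "bytes: $bytes\n";           // bytes: 50
```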

Remember that character sets are not only set on your tables; your application also needs to ensure that it’s using the correct character set for its connection to the database. This applies specifically to functions like mysql_real_escape_string(), which rely on the connection’s character set. If your database allows setting the character set on a column-by-column basis, this must also be carefully considered.
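As a rough illustration of setting the connection’s character set, here is a sketch that uses PDO rather than the mysql_* functions, which were removed in PHP 7. The helper names, host, database name, and the choice of utf8mb4 are all assumptions made for this example; connect() is defined but deliberately not called, since it needs a real server and credentials.

```php
<?php
// Hypothetical helper: build a PDO DSN that names the connection charset.
function build_mysql_dsn(string $host, string $database, string $charset): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $database, $charset);
}

// Hypothetical connection factory; host, database name, and credentials
// are placeholders. It is not called in this sketch.
function connect(string $user, string $password): PDO
{
    $dsn = build_mysql_dsn('localhost', 'app', 'utf8mb4');
    return new PDO($dsn, $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

echo build_mysql_dsn('localhost', 'app', 'utf8mb4'), "\n";
// mysql:host=localhost;dbname=app;charset=utf8mb4
```

Naming the charset when the connection is opened means every later query and escaping call on that connection works from the same assumption.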

Several popular databases allow you to select the character set on a table-by-table basis. This may seem like a feature, but it’s really just a great way to aim very carefully at your foot, and shoot it. Unless you’re very careful, you’re likely to run into problems whenever the table uses a different character set than its containing database. (I ran into problems with this recently, and it was a pain to track down.)

Step 4: PHP

When PHP retrieves data from the database, it may do some processing to prepare this data for output: cutting strings, gluing them together, changing case, &c. All of this needs to be done with respect to the character set of the data you’re processing. When working with user data, specify both ENT_QUOTES and the character set when using htmlspecialchars() or htmlentities().
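A small sketch of this kind of output processing, assuming UTF-8 data and the mbstring extension. The constant OUTPUT_CHARSET and the helper format_name() are hypothetical names for illustration.

```php
<?php
// Assumption for this sketch: all data is UTF-8.
const OUTPUT_CHARSET = 'UTF-8';

// Hypothetical helper: shorten a display name to $limit characters,
// upper-case it, and only then escape it for HTML.
function format_name(string $name, int $limit): string
{
    $short = mb_substr($name, 0, $limit, OUTPUT_CHARSET);
    $upper = mb_strtoupper($short, OUTPUT_CHARSET);
    return htmlspecialchars($upper, ENT_QUOTES, OUTPUT_CHARSET);
}

// "Zoë <'admin'>" with the ë written as explicit UTF-8 bytes.
echo format_name("Zo\xC3\xAB <'admin'>", 6), "\n"; // ZOË &lt;&#039;
```

Escaping comes last on purpose: cutting a string after it has been escaped could slice an entity such as &lt; in half.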

Step 5: Templates

Your templates need to be saved in the same character set as the data; otherwise, you’ll be serving data from two different character sets, while the client is only using one.

Helpful PHP functions

Although most PHP functions are blind when it comes to character sets, there are a few places where help can be found:

PCRE Functions
The PCRE library is able to handle UTF-8 strings when the u pattern modifier is used.
Multibyte String Library
This library contains a number of functions that are able to work with multibyte strings. The mb_strlen() function, for example, can be used to determine the length of a string, and it’s character set aware. Many of PHP’s string functions have an mb_ equivalent. The mb_detect_encoding() function will iterate through a list of character sets and return the name of one that is error-free when interpreting the string (or FALSE if none fits). This doesn’t mean it’s necessarily the right character set, but it’s valid.
MySQL Functions
The mysql_set_charset() function can be used to set the character set of the connection, and mysql_client_encoding() can be used to determine the character set of the current connection. The connection’s encoding will be used when mysql_real_escape_string() is called. For more details on this, see this portable PHP-MySQL connection charset fix.
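The PCRE and mbstring helpers described above can be tried out with a single accented character. This is a sketch that assumes the mbstring extension is loaded; the candidate lists passed to mb_detect_encoding() are arbitrary choices for the example.

```php
<?php
$utf8 = "\xC3\xA9"; // "é" encoded as UTF-8 (two bytes)
$latin1 = "\xE9";   // "é" encoded as ISO-8859-1 (one byte)

// Without the u modifier PCRE sees two separate bytes, so "exactly one
// character" does not match; with it, the pair counts as one character.
var_dump(preg_match('/^.$/', $utf8));  // int(0)
var_dump(preg_match('/^.$/u', $utf8)); // int(1)

// In strict mode, mb_detect_encoding() returns the name of a candidate
// that fits, or false when none of the candidates do.
var_dump(mb_detect_encoding($utf8, ['ASCII', 'UTF-8'], true));
var_dump(mb_detect_encoding($latin1, ['ASCII', 'UTF-8'], true));

// Valid is not the same as right: ISO-8859-1 accepts every byte, so
// UTF-8 data also passes as ISO-8859-1.
var_dump(mb_check_encoding($utf8, 'ISO-8859-1')); // bool(true)
```

The last line is the reason detection should be treated as a hint at best: the only reliable way to know a string’s character set is to have been told.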
