I recently checked on a project, because the developers were having some problems with strange characters displaying on the page. In particular, a number of accented characters were showing up as the Unicode replacement character (�). How sad.
A bit of research revealed that their pages were being served in UTF-8, but the data store was using ISO-8859-1. A few patches were applied, and the string in question was fixed. In order to manage character sets correctly, your entire application — from input to database to output — needs to be configured correctly. This can be quite a challenge, and in this article I will introduce some of the problems and provide a few solutions.
Character sets are basically sets of symbols that computers use to map numbers to the glyphs you see. You can think of them as tables that contain all of the glyphs you might need — not only letters, but also punctuation, symbols, space, backspace, &c. Here’s an example character set I’ll call Paul’s First Character Set:
To find a character, just look it up in the chart, across then down. For example, i is 08. If you want to look up a word (Hello, for example), look up each character in the chart: 33 04 11 11 14. Of course, there isn’t just one character set in the world; there are quite a few. Here’s another example character set I’ll call Paul’s Second Character Set:
In this new character set, Hello is 17 40 47 47 50. You might already be able to imagine how problems can arise. If your browser tries to display data in the wrong character set, it ends up looking all garbled. Some characters may look right, while others look broken. The reason that garbled text is often mostly right is that many Western character sets put the standard English alphabet, as well as numbers and a few other things, in the same place. (My two fictional examples don’t do this.) So, those characters appear to work fine, but less common characters don’t.
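To make the garbling concrete with real character sets, here is a minimal sketch using the two encodings from the project I mentioned at the start. The word and variable names are my own choices for illustration, and it assumes the mbstring extension is loaded.

```php
<?php
// "café" encoded two ways. The bytes are written as escapes so the example
// does not depend on how this file itself is saved.
$latin1 = "caf\xE9";       // ISO-8859-1: é is the single byte 0xE9
$utf8   = "caf\xC3\xA9";   // UTF-8: é is the two bytes 0xC3 0xA9

// ISO-8859-1 bytes are not valid UTF-8, so a browser told to expect UTF-8
// has nothing sensible to draw for 0xE9.
$latin1IsValidUtf8 = mb_check_encoding($latin1, 'UTF-8');

// The reverse mistake: reading UTF-8 bytes as ISO-8859-1 turns é into "Ã©".
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Converting with the correct source encoding gives the same UTF-8 bytes.
$repaired = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($latin1IsValidUtf8);
echo $misread, PHP_EOL;
var_dump($repaired === $utf8);
```

Notice that "caf" survives every one of these mix-ups. Only the accented character breaks, which is exactly the "mostly right" effect described above.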
This also reveals another problem. If someone simply tells you that a message is 44 04 02 17 04 19 without identifying the character set, you have a problem. You could decode the message twice and see what you come up with (in this case Secret and i32H4J), then pick whichever one looks best. One is an English word that contextually makes sense, the other could quite conceivably be a password. In the presence of other text you could probably make a decision, but it’s a mess, even when only dealing with two fictional character sets.
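Looking a message up in a chart is easy to sketch in code. The layout below is an assumption on my part: it is inferred from the codes quoted in the text (lowercase a to z at 00 to 25, uppercase A to Z at 26 to 51), and it covers letters only. The function names are hypothetical. The second character set can’t be rebuilt from the few codes given, so it is left out.

```php
<?php
// Hypothetical reconstruction of the first chart: a-z occupy 00-25 and
// A-Z occupy 26-51. Inferred from the codes quoted in the text, not copied
// from the chart itself.
function firstSetAlphabet(): array
{
    return array_merge(range('a', 'z'), range('A', 'Z'));
}

// Turn a word into space-separated two-digit codes.
function encodeFirstSet(string $word): string
{
    $codes = array_flip(firstSetAlphabet());   // character => number
    $out = [];
    foreach (str_split($word) as $char) {
        $out[] = sprintf('%02d', $codes[$char]);
    }
    return implode(' ', $out);
}

// Turn space-separated codes back into characters.
function decodeFirstSet(string $message): string
{
    $alphabet = firstSetAlphabet();            // number => character
    $out = '';
    foreach (explode(' ', $message) as $code) {
        $out .= $alphabet[(int) $code];
    }
    return $out;
}

echo encodeFirstSet('Hello'), PHP_EOL;
echo decodeFirstSet('44 04 02 17 04 19'), PHP_EOL;
```

The decoder only works because it knows which chart to use. Hand it the same numbers with a different table and it will happily return something else.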
In order to avoid generating garbage, character sets must be considered at every step.
Step 1: Form
Browsers take a hint from the character set of a form to determine what character set to use when sending data back. If your page is sent in UTF-8, browsers should return data in the same format. Remember that the content type can be set in two places, an HTTP response header and an HTML meta tag. I prefer the HTTP response header. If you’re sending both, make sure they agree with each other. Many sites send contradictory content types, which can confuse browsers; most will prioritize the HTTP header. (Remember that configuration directives in .htaccess can override those in the main server configuration.) The <form> tag has an accept-charset attribute that should be used to offer additional direction to the browser when it comes to transmitting data back to the server.
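Here is a minimal sketch of a page where the HTTP header, the meta tag, and the form’s accept-charset all agree. It assumes the whole application uses UTF-8; the form fields and page title are placeholders I made up.

```php
<?php
// Hypothetical choice for this sketch: the whole page uses UTF-8.
$charset = 'UTF-8';

// HTTP response header. Guarded because headers can only be sent before any
// output has started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=' . $charset);
}

// The meta tag and accept-charset repeat the same value, so all three agree.
$page = <<<HTML
<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset={$charset}">
  <title>Sign up</title>
</head>
<body>
  <form method="post" action="" accept-charset="{$charset}">
    <input type="text" name="username">
    <input type="submit" value="Send">
  </form>
</body>
</html>
HTML;

echo $page;
```

Keeping the character set in a single variable is the point of the design: there is only one place to change it, so the three declarations cannot drift apart.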
Step 2: PHP
When validating input, the functions that you use need to be compatible with the character set you’re using. You can’t use strlen() to analyze a string in a multi-byte character set. The same goes for substr(), and pretty much the entire library of string functions. Native Unicode strings were planned for PHP 6, but that release never shipped, so take a look at the Multibyte String (mbstring) extension.
The problem with PHP’s internal string library is that it assumes every character is a single byte. In other words, strlen() actually counts bytes, not characters. Other functions, like substr(), use byte counts for offsets, rather than actually iterating through characters. The mbstring extension is pretty easy to use and generally available.
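A short sketch of the difference, assuming UTF-8 data and the mbstring extension. The sample word is my own.

```php
<?php
// "héllo" in UTF-8, with é written as its two bytes 0xC3 0xA9.
$word = "h\xC3\xA9llo";

$bytes = strlen($word);                 // counts bytes
$chars = mb_strlen($word, 'UTF-8');     // counts characters

// substr() cuts after the second byte, splitting é in half.
$brokenPrefix = substr($word, 0, 2);
// mb_substr() cuts after the second character.
$cleanPrefix  = mb_substr($word, 0, 2, 'UTF-8');

var_dump($bytes, $chars);
var_dump(mb_check_encoding($brokenPrefix, 'UTF-8'));
var_dump(mb_check_encoding($cleanPrefix, 'UTF-8'));
```

The byte-based prefix is no longer valid UTF-8, and that kind of half-character is what a browser ends up drawing as a replacement character.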
When uploading UTF-8 files to the server, you should use the binary transfer method, not ASCII, for reasons I hope are now clear.
Step 3: Database
Your database needs to be configured to use the correct character set as well. If you tell your database to use ASCII, set up a varchar(25) for usernames, and try to store a string that’s 25 characters long, encoded in a multi-byte character set, you’ll have a problem: the string can take up more than 25 bytes. The collation setting will determine how ordering and search results work. To quote the MySQL documentation:
“A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.”
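The username problem is easy to reproduce on the PHP side. This sketch only measures the string; how a given database counts column length depends on the product and the column’s character set, so treat the 25-byte limit here as a hypothetical stand-in for a single-byte column.

```php
<?php
// A 25-character username made entirely of é, encoded as UTF-8.
$username = str_repeat("\xC3\xA9", 25);

$characters = mb_strlen($username, 'UTF-8');
$bytes      = strlen($username);

// Hypothetical limit: a column that can hold 25 single-byte characters.
$columnBytes = 25;
$fits = $bytes <= $columnBytes;

var_dump($characters, $bytes, $fits);
```

The string is 25 characters long as far as the user is concerned, but it needs twice as many bytes as the column was sized for.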
Remember that character sets are not only set on the database and its tables; your application also needs to ensure that it’s using the correct character set for the connection to the database. This applies specifically to functions like mysql_real_escape_string(). If your database allows setting the character set on a column-by-column basis, this must also be carefully considered.
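One way to pin down the connection’s character set is to name it when connecting. The sketch below swaps in PDO, which is not what this article uses elsewhere; PDO’s MySQL driver accepts a charset entry in its connection string. The helper name, host, database name, and the choice of utf8mb4 are all placeholders of mine.

```php
<?php
// Hypothetical helper: builds a PDO connection string for MySQL that names
// the connection character set explicitly.
function buildMysqlDsn(string $host, string $database, string $charset): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $database, $charset);
}

// Placeholder values for illustration only.
$dsn = buildMysqlDsn('localhost', 'example_app', 'utf8mb4');
echo $dsn, PHP_EOL;

// With a real server and credentials, the connection would be opened like so:
// $pdo = new PDO($dsn, $user, $password);
```

Setting the character set at connection time means every query and every escaping call on that connection starts from the same assumption.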
Several popular databases allow you to select the character set on a table-by-table basis. This may seem like a feature, but it’s really just a great way to aim very carefully at your foot, and shoot it. Unless you’re very careful, you’re likely to run into problems whenever the table uses a different character set than its containing database. (I ran into problems with this recently, and it was a pain to track down.)
Step 4: PHP
When PHP retrieves data from the database, it may do some processing to prepare this data for output: cutting strings, gluing them together, changing case, &c. All of this needs to be done with respect to the character set of the data you’re processing. When working with user data, specify both ENT_QUOTES and the character set when using PHP’s HTML escaping functions, such as htmlspecialchars() and htmlentities().
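As a sketch of output preparation that respects the character set, here is a case change through mbstring followed by escaping with htmlspecialchars(), passing ENT_QUOTES and the character set explicitly. The sample strings are mine, and UTF-8 is assumed throughout.

```php
<?php
$charset = 'UTF-8';
$comment = "<a href='x'>caf\xC3\xA9</a>";   // user data, UTF-8 bytes

// Case changes go through mbstring so é becomes É.
$shout = mb_strtoupper("caf\xC3\xA9", $charset);

// ENT_QUOTES escapes both single and double quotes; the third argument
// tells the function how to read the bytes.
$safe = htmlspecialchars($comment, ENT_QUOTES, $charset);

// Bytes that are not valid in the named character set yield an empty string
// when no substitution flag is passed.
$rejected = htmlspecialchars("caf\xE9", ENT_QUOTES, $charset);

echo $shout, PHP_EOL, $safe, PHP_EOL;
var_dump($rejected);
```

The last call is worth remembering when debugging: output that silently goes blank can mean the data and the declared character set disagree.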
Step 5: Templates
Your templates need to be saved in the same character set as the data, otherwise you’ll be serving data from two different character sets, while the client is only using one.
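A quick sanity check for templates can be sketched as follows, assuming the data is UTF-8. The helper name is hypothetical, and inline strings stand in for file contents. A pure ASCII template passes regardless of how it was saved, so this only catches files that contain non-ASCII characters in another encoding.

```php
<?php
// Hypothetical helper: reports whether a template's raw bytes are valid UTF-8.
function templateIsUtf8(string $contents): bool
{
    return mb_check_encoding($contents, 'UTF-8');
}

// In practice $contents would come from file_get_contents() on a template
// path; inline strings stand in for files here.
$savedAsUtf8   = "<p>Cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e</p>";
$savedAsLatin1 = "<p>Cr\xE8me br\xFBl\xE9e</p>";

var_dump(templateIsUtf8($savedAsUtf8));
var_dump(templateIsUtf8($savedAsLatin1));
```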
Helpful PHP functions
Although most PHP functions are blind when it comes to character sets, there are a few places where help can be found:
- PCRE Functions
- The PCRE library is able to handle UTF-8 strings when the u pattern modifier is used.
- Multibyte String Library
- This library contains a number of different functions that are able to work with multibyte strings. The mb_strlen() function, for example, can be used to determine the length of a string, and it’s character set aware. Many of PHP’s string functions have an mb_ counterpart. The mb_detect_encoding() function will iterate through a list of different character sets and return the name of one that is error-free when interpreting the string, or FALSE if none fits. This doesn’t mean it’s necessarily the right character set, but it’s valid.
- MySQL Functions
- The mysql_set_charset() function can be used to set the character set of the connection, and mysql_client_encoding() can be used to determine the character set of the current connection. The connection’s encoding will be used when mysql_real_escape_string() is called. (The mysql_* functions were removed in PHP 7; mysqli_set_charset() and mysqli_character_set_name() are the mysqli equivalents.) For more details on this, see this portable PHP-MySQL connection charset fix.
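The first two entries in this list can be tried without a database. This is a minimal sketch assuming UTF-8 input and the mbstring extension; the sample strings and the candidate list are my own.

```php
<?php
$e = "\xC3\xA9";   // é as UTF-8: one character, two bytes

// Without the u modifier, "." matches a single byte, so a lone é fails ^.$
$withoutU = preg_match('/^.$/', $e);
// With the u modifier the pattern and subject are treated as UTF-8.
$withU    = preg_match('/^.$/u', $e);

// Strict detection over a candidate list. 0xE9 alone is not valid UTF-8,
// so ISO-8859-1 is the only candidate left.
$guess = mb_detect_encoding("caf\xE9", ['UTF-8', 'ISO-8859-1'], true);

var_dump($withoutU, $withU, $guess);
```

The detection result is only as good as the candidate list. Almost any byte string is valid ISO-8859-1, so a match there says very little about what the author of the data intended.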