WEB Advent 2010 / Localization

When creating a web site, it is important to consider the audience. Regardless of whether the web site is selling goods, providing a service, or making information available, one should consider how the site will be displayed to a visitor from another country. If visitors want to buy something, use the service, or read the information provided, will the web site be able to accommodate them? Should it? Localization allows the user interface to accommodate a user’s expectations for the display of dates, currency, and text specific to their locale.

Many native English speakers don’t realize that, despite the far reach of their language, it is still a small piece of the spoken-language pie. This article states that while English is the most widely distributed language in the world, it is not the most spoken (currently third). There is so much useful information on the Web, and plenty of it is not in English. Solving language-related problems with computers is difficult, but making the web sites we create more internationally friendly is not as hard as it may seem. While it may not be practical for a site to cater to ten different languages, it may well be worthwhile to support a couple. As an example, the Spanish-speaking population of the United States is growing, and a commerce site that can communicate with that part of the country expands its potential customer base.

Before showing an example of how to start implementing locales for your web site, I’d like to take a moment to talk about character encoding. Character sets (charsets) and encoding are important to consider whether or not you foresee needing localization in your web site. If you are unfamiliar with character encoding, it tells software how to map the bytes it receives to characters. For example, the common ISO 8859-1 character set is intended for characters common to Western Europe. So, I can use German umlauts along with the English alphabet within this charset. If, however, I were to try to display an Arabic character, such as ق, when using ISO 8859-1, what may be displayed instead could be � or Ù, because the ISO 8859-1 charset has no way to represent that Arabic character.
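To make the mismatch concrete, here is a minimal sketch of what happens when the two UTF-8 bytes for the Arabic letter ق are decoded as if they were ISO 8859-1. The helper name latin1_bytes_to_utf8 is hypothetical and not part of PHP; it simply treats every byte as one ISO 8859-1 character, which is roughly what a browser does when it is told the wrong charset.

```php
<?php

/**
 * Hypothetical helper: interpret each byte of $bytes as one ISO 8859-1
 * character and return the result encoded as UTF-8.
 */
function latin1_bytes_to_utf8($bytes)
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $code = ord($bytes[$i]);
        if ($code < 0x80) {
            // ASCII bytes are identical in both encodings.
            $out .= chr($code);
        } else {
            // Code points U+0080 to U+00FF need two bytes in UTF-8.
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

$qaf = "\xD9\x82";                      // UTF-8 bytes for the Arabic letter qaf
$misread = latin1_bytes_to_utf8($qaf);  // the same bytes, read as ISO 8859-1

// One character has become two: U+00D9 (Ù) and the control character U+0082.
echo bin2hex($misread), "\n";
```

The original character is not recoverable from what the visitor sees, which is why the declared charset has to match the bytes that are actually sent.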

A very cross-language-friendly character encoding is UTF-8, which is used to represent the Unicode character set. Because it is ASCII-compatible, it also works well with legacy apps. Using UTF-8 for a web site provides a large list of supported characters that will best accommodate your users. To implement a character encoding properly, make sure it is the same across your entire app. If you are using UTF-8, then Apache, your database tables, PHP and its applicable functions, and the document type should all be set to UTF-8. If these are not consistent, you may end up with a mix of data in different character encodings, which can be very difficult, if not impossible, to remedy.


# Apache httpd.conf or .htaccess (Apache comments start with #)

# This will add a charset to the Content-Type response header.
AddDefaultCharset UTF-8

; php.ini (comments start with ;)
default_charset = "UTF-8"

// Example PHP function
htmlentities($data, ENT_COMPAT, 'UTF-8');

// XML
<?xml version="1.0" encoding="UTF-8" ?>

// HTML
<meta http-equiv="Content-Type"
    content="text/html; charset=utf-8" />
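Keeping the stack consistent also means not letting badly encoded input in. As a rough sketch, incoming strings can be checked for valid UTF-8 before they are stored. The function name is_valid_utf8 is an assumption made for this example; it relies on the fact that PCRE's u modifier makes preg_match() fail on a subject that is not valid UTF-8.

```php
<?php

/**
 * Hypothetical helper: return true when $value is a valid UTF-8 string.
 * With the "u" modifier, preg_match() returns false (instead of 0 or 1)
 * if the subject contains an invalid UTF-8 byte sequence.
 */
function is_valid_utf8($value)
{
    return preg_match('//u', $value) === 1;
}

$good = "caf\xC3\xA9"; // "café" encoded as UTF-8
$bad  = "caf\xE9";     // "café" encoded as ISO 8859-1

var_dump(is_valid_utf8($good));
var_dump(is_valid_utf8($bad));
```

What to do with a string that fails the check (reject it, or convert it from a known legacy encoding) depends on the app, so that decision is left out here.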


Now that we’ve spoken about character encoding, let’s jump back to implementing localization. How do we determine which locale to load for a user? How do we know where they are coming from? One could figure it out with the user’s IP address, but the simplest way would be to let the browser tell us what locale to use. The Accept-Language HTTP header is sent with an HTTP request and is something we can see in PHP’s $_SERVER superglobal. It looks something like this:

Accept-Language: en-us,en;q=0.5
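The header is a comma-separated list of language tags, each with an optional q weight (a tag without one counts as 1.0). The following is a minimal sketch of turning it into an ordered list of preferences. The function name parse_accept_language is hypothetical, and the sketch ignores edge cases such as the * wildcard.

```php
<?php

/**
 * Hypothetical helper: parse an Accept-Language header value into a list
 * of lowercase language tags, most preferred first.
 */
function parse_accept_language($header)
{
    $entries = array();

    foreach (explode(',', $header) as $position => $part) {
        $pieces = explode(';', trim($part));
        $tag = strtolower(trim($pieces[0]));
        if ($tag === '') {
            continue;
        }

        // A tag without an explicit weight defaults to q=1.0.
        $q = 1.0;
        foreach (array_slice($pieces, 1) as $param) {
            $param = trim($param);
            if (strncasecmp($param, 'q=', 2) === 0) {
                $q = (float) substr($param, 2);
            }
        }

        $entries[] = array('tag' => $tag, 'q' => $q, 'position' => $position);
    }

    // Highest weight first; ties keep the order in which the browser sent them.
    usort($entries, function ($a, $b) {
        if ($a['q'] != $b['q']) {
            return $a['q'] < $b['q'] ? 1 : -1;
        }
        return $a['position'] - $b['position'];
    });

    $tags = array();
    foreach ($entries as $entry) {
        $tags[] = $entry['tag'];
    }
    return $tags;
}

print_r(parse_accept_language('en-us,en;q=0.5'));
```

The first tag in the returned list that also appears in the site's list of supported locales is a reasonable starting choice.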

While this is a simple way to detect the starting language in which to display the web site, it does not mean that it is the language desired by the user. Be sure to provide an easy way for the user to switch locale. Many developers think that storing a locale code in a session or cookie is sufficient, but that is not very bookmark-friendly. Instead, you may want to provide the locale in the URL, e.g., http://example.com/en/blog/ and http://example.com/de/blog/, or in the subdomain, e.g., http://en.wikipedia.org/.

To implement this, we can set an Apache rewrite rule for our web site:

RewriteRule ^(en|es)/(.*) /$2 [PT,E=lang:$1]

With this rewrite rule, we do not have to duplicate the codebase of our web site for each locale. Instead, we instruct Apache to look for the locale in the request, store it as an environment variable, and pass it to our code. In our app logic, we can retrieve the locale:

$locale = apache_getenv('lang', true);
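apache_getenv() is only available when PHP runs as an Apache module. As a hedged alternative, variables set with the E= flag normally also show up in $_SERVER, and Apache prefixes them with REDIRECT_ after an internal redirect. The sketch below reads from an array shaped like $_SERVER so that it does not depend on the server API. The function name locale_from_server and its parameters are assumptions made for this example.

```php
<?php

/**
 * Hypothetical helper: find the "lang" variable set by the rewrite rule in
 * a $_SERVER-style array and map it to a full locale code.
 *
 * @param array  $server   Usually $_SERVER.
 * @param array  $accepted Map of URL codes to locale codes, e.g. 'en' => 'en_US'.
 * @param string $default  Locale code to use when nothing matches.
 */
function locale_from_server(array $server, array $accepted, $default = 'en_US')
{
    // Check the plain name first, then the name Apache uses after a redirect.
    foreach (array('lang', 'REDIRECT_lang') as $name) {
        if (isset($server[$name]) && is_string($server[$name])
            && isset($accepted[$server[$name]])) {
            return $accepted[$server[$name]];
        }
    }
    return $default;
}

$accepted = array('en' => 'en_US', 'es' => 'es_MX');
echo locale_from_server(array('REDIRECT_lang' => 'es'), $accepted), "\n";
```

Passing the array in, rather than reading $_SERVER inside the function, also makes the lookup easy to exercise without a web server.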

Here is a simple PHP implementation that supports locales:


<?php

/**
 * Stick to the codes that are already standard.
 * ISO 3166 (country codes) - http://en.wikipedia.org/wiki/ISO_3166-1
 * RFC 1766 (language tags) - http://www.faqs.org/rfcs/rfc1766.html
 */
$accepted_locales = array(
    'en' => 'en_US', 
    'es' => 'es_MX'
);


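// Get the locale from the browser. The header is optional, so guard against it being absent.
$accept_language = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
$browser_locale = strtolower(substr($accept_language, 0, 2));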

// For more specific locale codes, e.g., en-us (English, United States), es-mx (Spanish, Mexican), etc.
// $browser_locale = current(explode(',', $_SERVER['HTTP_ACCEPT_LANGUAGE']));

// Check against a whitelist instead of trusting $_GET['lang'].
// For reference: http://ha.ckers.org/blog/20100128/micro-php-lfi-backdoor/
if (isset($accepted_locales[$browser_locale])) {
    $locale_code = $accepted_locales[$browser_locale];
} else {
    $locale_code = 'en_US';
}

// Or, use a URL-based locale designator and a rewrite rule. 
// RewriteRule ^(en|es)/(.*) /$2 [PT,E=lang:$1] 
// Then, grab that environment variable with PHP.
/*
$locale_code = apache_getenv('lang', true);
if (isset($accepted_locales[$locale_code])) {
    $locale_code = $accepted_locales[$locale_code];
} else {
    $locale_code = 'en_US';
}
*/

$locale_entries = array();

// Load the locale file.
if (file_exists($locale_code . '.php')) {
    $locale_entries = include $locale_code . '.php';
}

/*
These locale entries are loaded from files that return an array.
en_US.php
<?php
return array(
    'WELCOME' => 'Hello %s!',
    'GOODBYE' => 'Goodbye!',
);

es_MX.php
<?php
return array(
    'WELCOME' => '¡Hola %s!',
    'GOODBYE' => '¡Adiós!',
);

You could store locales in a database as well. 
 */

// Now, you can create a simple function that will display the correct text.
function locale($locale_entries, $key, $replacements = array()) {
    if (isset($locale_entries[$key]) && empty($replacements)) {
        return $locale_entries[$key];
    } elseif (isset($locale_entries[$key])) {
        return vsprintf($locale_entries[$key], $replacements);
    }
    throw new Exception("Locale entry for '$key' does not exist.");
}

// echo locale($locale_entries, 'WELCOME', array('Anthony'));
// echo locale($locale_entries, 'GOODBYE');

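// A basic class to handle locales.
// Note: the intl extension already defines a class named Locale, so rename
// or namespace this class if that extension is loaded.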
class Locale {
    
    public $code = 'en_US';
    public $locale_path = '/var/www/';
    protected $_entries = array();
    
    public function setCode($code)
    {
        $this->code = $code;
        $this->load();
    }
    
    public function load()
    {
        if (!file_exists($this->locale_path . $this->code . '.php')) {
            throw new Exception("Locale file: {$this->locale_path}{$this->code}.php does not exist.");
        }
        
        $this->_entries = include $this->locale_path . $this->code . '.php';
    }
    
    public function fetch($key, $replacements = array())
    {
        if (isset($this->_entries[$key]) && empty($replacements)) {
            return $this->_entries[$key];
        } elseif (isset($this->_entries[$key])) {
            return vsprintf($this->_entries[$key], $replacements);
        }
        throw new Exception("Locale entry {$this->code}:$key does not exist.");
    }
}

$locale = new Locale();
$locale->setCode($locale_code);
//echo $locale->fetch('WELCOME', array('Anthony'));
//echo $locale->fetch('GOODBYE');

?>
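Both the function and the class above throw an exception when a key is missing from the selected locale. One possible alternative, sketched here under the assumption that the default locale is always complete, is to fill the gaps from the default locale's entries before looking anything up. The function name merge_locale_entries is hypothetical, and the arrays are inlined instead of being loaded from files so that the example stands alone.

```php
<?php

/**
 * Hypothetical helper: use the selected locale's entries where they exist
 * and fall back to the default locale's entries for any missing key.
 */
function merge_locale_entries(array $selected, array $fallback)
{
    // The array union operator keeps every key from the left operand and
    // only adds keys from the right operand that the left one lacks.
    return $selected + $fallback;
}

$en_US = array('WELCOME' => 'Hello %s!', 'GOODBYE' => 'Goodbye!');
$es_MX = array('WELCOME' => '¡Hola %s!'); // GOODBYE not translated yet

$entries = merge_locale_entries($es_MX, $en_US);

echo vsprintf($entries['WELCOME'], array('Anthony')), "\n";
echo $entries['GOODBYE'], "\n";
```

A visitor then sees an untranslated string in the default language instead of an error page, at the cost of missing translations being harder to notice.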

I hope this article has been informative and helps you get into the habit of thinking about your visitors and how to best accommodate them. I think most web sites should use UTF-8 throughout the stack. Please at least take some time at the beginning of projects to make sure the sites you develop are doing their part to be internationally friendly.
