In my previous article on Unicode, I discussed a little bit of background on Unicode, how to prep PHP to serve UTF-8 encoded content, and how to handle displaying Unicode characters. There's still a bit more we need to talk about, however, before we can truly claim internationalization support.
Prepping MySQL for Unicode
MySQL allows you to specify a character encoding at four different levels: server, database, table, and column. This flexibility becomes quite useful when working on a shared host (like I do at DreamHost). In my particular case, I do not have control over either the server or database setting (and both are unfortunately set to latin1). As a result, I set my desired character encoding at the table level.
To see what your current server and database settings are, issue the following SQL commands at the MySQL command prompt:
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'character_set_database';
To see what character set a table is using, issue the following command:
SHOW CREATE TABLE myTable;
If you are fortunate enough to have control over the database-level character set, you can set it using the following command:
(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8;
The table-specific commands are similar:
(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8;
Column level character encoding can be specified when creating a table or by altering the desired column:
CREATE TABLE MyTable ( column1 TEXT CHARACTER SET utf8 );
ALTER TABLE MyTable MODIFY column1 TEXT CHARACTER SET utf8;
I personally recommend setting the character encoding at the highest level you have control over. That way, you won't have to remember to set it on any new tables or columns (or even databases).
If you have existing tables that do not use the utf8 character encoding, you can convert them with a simple command:
ALTER TABLE ... CONVERT TO CHARACTER SET utf8;
Be very careful when attempting to convert your data. The convert command assumes that the existing data really is encoded in the character set the column currently declares (latin1, in my case). Any UTF-8 data already stored in such a column will be treated as latin1 and become corrupted during the conversion process. There are some ways to get around this limitation, which may be helpful if you've already got some Unicode data stored in your database.
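One commonly cited workaround, when a column is declared latin1 but actually holds UTF-8 bytes, is to pass the column through a binary type first, so that MySQL relabels the bytes instead of transcoding them. The sketch below only builds the two statements. The helper name, table, column, and column types are hypothetical, and anything like this should be tried on a backup copy first.

```php
<?php
// Hypothetical helper: builds the two ALTER statements for the
// "binary round trip". It does not talk to the database itself.
function build_relabel_statements($table, $column, $textType = 'TEXT', $binaryType = 'BLOB') {
    // Quote identifiers with backticks, doubling any embedded backtick
    $t = '`' . str_replace('`', '``', $table) . '`';
    $c = '`' . str_replace('`', '``', $column) . '`';

    return array(
        // Step 1: drop the character set label; the stored bytes are untouched
        "ALTER TABLE $t MODIFY $c $binaryType",
        // Step 2: relabel the same bytes as utf8; no transcoding takes place
        "ALTER TABLE $t MODIFY $c $textType CHARACTER SET utf8",
    );
}

// Print the statements for an assumed table and column
foreach (build_relabel_statements('myTable', 'column1') as $sql) {
    echo $sql, ";\n";
}
```

Keep in mind that MODIFY replaces the whole column definition, so attributes such as NOT NULL or a default value would need to be restated in the type arguments.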
Communicating with MySQL
Once our tables are ready to accept Unicode data, we need to make some minor changes in the way we connect our application to the database. Essentially, we will be specifying the character encoding that our connection should use. This call needs to be made very early in the order of operations. I personally make this call immediately after creating my database connection. There are several ways we can set the character encoding, depending on the version of PHP and the programming paradigms in use. The first method involves a call to the mysql_query() function:
mysql_query("SET NAMES 'utf8'");
An alternative to this in PHP version 5.2.3 or later involves a call to the mysql_set_charset() function:
mysql_set_charset('utf8',$conn);
And yet another alternative, if you're using the MySQL Improved extension, comes via the set_charset() function. Here's an example from my code:
// Change the character set to UTF-8 (have to do it early)
if (!$db->set_charset("utf8"))
{
    printf("Error loading character set utf8: %s\n", $db->error);
}
Once you have specified the character encoding for your database connection, your database queries (both setting and retrieving data) will be able to handle international characters.
Accepting Unicode Input
The final hurdle in adding internationalization support to our web application is accepting Unicode input from the user. This is pretty easy to do, thanks to the accept-charset attribute on the form element:
<form accept-charset="UTF-8" ... >
Explicitly setting the character encoding on each form that can accept extended characters from your users will solve all kinds of potential problems (see the "Form submission and i18n" link in the Resources section below for much more on this topic).
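Since a client can still send whatever bytes it likes, it may be worth checking on the server that submitted text really is well-formed UTF-8 before storing it. Here is a minimal sketch; the function name and the 'comment' field are made up for illustration.

```php
<?php
// Hypothetical helper: true when $value is well-formed UTF-8.
// With the /u modifier, PCRE validates the subject string first, and
// preg_match() returns false (rather than 0 or 1) when validation fails.
function is_valid_utf8($value) {
    return preg_match('//u', $value) === 1;
}

// 'comment' is an assumed form field name
$comment = isset($_POST['comment']) ? $_POST['comment'] : '';
if (!is_string($comment) || !is_valid_utf8($comment)) {
    $comment = ''; // reject the value, or re-prompt the user
}
```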
Potential Pitfalls
Since PHP's standard string functions consider a character to be just one byte long, there are some potential coding problems that you might run into in your application:
Checking String Length
Using the strlen function to check the length of a given string can cause problems with strings containing international characters. For example, a string comprising 10 characters of a double-byte alphabet would return a length of 20. This might cause problems if you are expecting the string to be no longer than 10 characters. Thankfully, there's an elegant hack that we can use to get around this:
function utf8_strlen($string) {
    return strlen(utf8_decode($string));
}
The utf8_decode function will turn anything outside of the standard ISO-8859-1 encoding into a question mark, which gets counted as a single character in the strlen function (which is exactly what we wanted). Pretty slick!
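One caveat: utf8_decode is deprecated as of PHP 8.2. Here is a sketch of an alternative that counts without decoding anything. In well-formed UTF-8, every character contributes exactly one byte that is not a continuation byte (0x80 to 0xBF), so counting those bytes gives the character count. The function name is hypothetical, and the result is only meaningful for valid UTF-8 input.

```php
<?php
// Hypothetical helper: counts characters in a well-formed UTF-8 string
// by counting every byte that is NOT a continuation byte (10xxxxxx).
// The pattern deliberately omits /u so that it works byte by byte.
function utf8_char_count($string) {
    return preg_match_all('/[^\x80-\xBF]/', $string);
}

// Three Japanese characters, three bytes each, written as byte escapes
$word = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

echo strlen($word), "\n";          // 9 (bytes)
echo utf8_char_count($word), "\n"; // 3 (characters)
```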
Case Conversions
Forcing a particular case for string comparisons can be problematic with international character sets. In some languages, case has no meaning. So there's not a whole lot that one can do short of creating a lookup table. One example of such a lookup table comes from the mbstring extension. The Dokuwiki project implemented this solution in their conversion to UTF-8.
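To make the lookup table idea concrete, here is a deliberately tiny sketch. The three-entry table and the function name are hypothetical; a real table would need to cover every cased character you care about.

```php
<?php
// Hypothetical, deliberately tiny lookup table (UTF-8 byte sequences)
$upperToLower = array(
    "\xC3\x89" => "\xC3\xA9", // É => é
    "\xC3\x9C" => "\xC3\xBC", // Ü => ü
    "\xC3\x91" => "\xC3\xB1", // Ñ => ñ
);

function utf8_strtolower_sketch($string, array $table) {
    // Lowercase plain ASCII letters first, one byte at a time
    $ascii = strtr($string, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
    // Then swap each multi-byte sequence found in the lookup table
    return strtr($ascii, $table);
}

// Prints the lowercase form of "ÉCOLE"
echo utf8_strtolower_sketch("\xC3\x89COLE", $upperToLower), "\n";
```

Characters that are missing from the table simply pass through unchanged, which is the main limitation of this approach.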
Using Regular Expressions
The Perl-Compatible Regular Expression (PCRE) functions in PHP support the UTF-8 encoding, through use of the /u pattern modifier. If you are making use of regular expressions in your application, you'll definitely want to look into this modifier.
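Here is a small sketch of the difference the modifier makes. The sample strings are arbitrary and are written as byte escapes so the result does not depend on how the source file is saved.

```php
<?php
$e = "\xC3\xA9"; // "é" encoded as UTF-8 (two bytes)

// Without /u the dot matches a single byte, so a two-byte character fails ^.$
var_dump(preg_match('/^.$/', $e));  // int(0)

// With /u the dot matches one whole character
var_dump(preg_match('/^.$/u', $e)); // int(1)

// Splitting a string into characters rather than bytes ("naïve")
$chars = preg_split('//u', "na\xC3\xAFve", -1, PREG_SPLIT_NO_EMPTY);
echo count($chars), "\n"; // 5
```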
Additional Resources
In learning about how to add internationalization support to web applications, I gathered a number of excellent resources that I highly recommend bookmarking. Without further ado, here's the list I've created:
- Character Sets / Character Encoding Issues
- Handling UTF-8 with PHP
- MySQL and UTF-8
- Do you know your character encodings?
- A tutorial on character code issues - Lots of theory; in-depth discussion
- MySQL and UTF-8 — no more question marks!
- Form submission and i18n
- Survival guide to i18n
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software