Upgrading ExpressionEngine to UTF-8

Character sets are a bugbear for many a web developer. The advice these days is to upgrade everything to UTF-8 and I decided to go down this route when I moved hosts. It is actually reasonably straightforward and this document gives the process I used.

UTF-8 (Unicode)

The reason for all this character confusion is historic. Originally computers were very limited in terms of speed and storage. Much of the early development in IT took place in the United States. A standard called ASCII was devised to represent the alphabet using the characters 0-127. This allowed for a few extra characters such as line-feed and carriage return (yes we are in the era of teletypes) as well as control characters to do stuff like ring the “bell” on the terminal. Obviously this precluded a whole range of languages from European accented characters to more extensive Russian, Arab and Asian characters.

A number of more extensive character representations were developed, in particular ISO-8859-1 (aka Latin1). 8859-1 consists of 191 characters from the Latin script and supports the special characters in Western European languages. These are mapped onto the number range 0-255 which conveniently fits into a single byte of data. It was adopted by the IANA in 1992 as the standard Internet character set. It thus became the default encoding of Web pages and for the descriptive parts of HTTP headers.

UTF-8 is a variable length character set. At its simplest it is completely compatible with ASCII. However if the character is 128 or greater it is followed by up to 3 bytes enabling over a million different characters to be represented. This means that English documents are not penalized with extra overhead while at the same time supporting other, more extensive, character sets.

Now you may wonder why you would want to migrate from ISO-8859-1 to UTF-8. Well for starters 8859 is obsolete. HTML 4.0 documents should use UTF-8. It lacks new characters such as the Euro symbol although its derivative Windows-1252 does have support for this. If you want your blog to support any language outside of the European latin world and to include characters missing from 8859 then UTF-8 is the way to go. If your blog has to support different alphabets at the same time – say you have Russian and English pages then UTF-8 is the only solution.

Prerequisites

  • MySQL 4.1.2 or greater
  • PHP version - unknown
  • Apache version 1.3 or greater

Useful Tools

  • curl - command line url
  • GNU iconv for converting between various character sets

Step 1. Export your database

First off I exported my ExpressionEngine 1.4.2 database using the control panel. You will find this option under the ADMIN tab -> utilites -> Database Backup. Export as either a zip or gzip to cut down on the amount of data to transfer. What EE does is produce a file full of SQL statements to recreate your database from scratch.

Step 2. Convert database to UTF-8

I have seen a lot of complicated process for moving to UTF-8. Normally if your database is already in a recognized character set (and how are you displaying it on a web browser if it is not?) then you can use the GNU iconv utility found on Unix or Cygwin systems.

However after reloading my database into ExpressionEngine I noticed some problems. Euro symbols and smart quotes were not showing up. It turned out that what I actually had in my database was not 8859 but a Windows derivative called Windows-1252. This uses codes between 0x80 and 0x9F for letters and punctuation. In particular the Euro symbol is represented by 0x80. Even if a document is labeled as 8859 most web browsers and e-mail clients will still interpret these codes correctly. I was unaware of this problem because I had been using Windows as my system for preparing documents.

I fixed this problem by rerunning iconv

$ iconv -f cp1252 -t utf-8 mydb_060929 > mydb.sql

There was one other thing to do. The tables exported by ExpressionEngine were all tagged as latin1

CREATE TABLE ...
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Using an editor that can open and save files without changing the character set (ie vi) I found and replaced all occurrences of

CHARSET=latin1

With

CHARSET=utf8

Step 3. Check mysql for UTF8 support

Now you may be wondering why you need any character set support at all in a database. Surely it is just a store for data entered and displayed by ExpressionEngine? Well that is true but if you, or your client, wants to do an operation such as search for a string the database will need to know how characters are represented in tables. You may potentially also want UTF-8 support for column names and other meta data although this shouldn’t be a concern for ExpressionEngine. A big issue is that a field of type VARCHAR may be given a size of say 25 characters. Under UTF-8 this may actually mean 25 to 75 characters depending on the characters used and the database needs to take this into account.
MySQL added UTF-8 support in 4.1.2 You can check which version you are using from the command line:

$ mysql -V
$ mysql Ver 14.7 Distrib 4.1.21, for pc-linux-gnu (i386) using readline 5.1

Or from phpMyAdmin on the home page

localhost

* Server version: 4.1.20a-nt

Note:

These are also issues for PHP. The standard string functions in PHP only work on single byte characters, using these functions on UTF-8 data can result in corruption. Presumably the EE authors have taken this into account. Certainly some special UTF8 functions are called.

Step 4. mysql configuration (optional)

If you are responsible for administering your mysql server you may want to set the default character set to UTF-8.

Edit my.cnf and look for the following entries

[mysql]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8

etc.

Step 4. Load the database

Connect to your database server and drop the database that holds the ExpressionEngine tables. Then recreate the database with the UTF8 character set and recreate your ExpressionEngine database user.

$ mysql

mysql > drop database mydb;

mysql> create database mydb CHARACTER SET utf8;
mysql> grant all privileges on mydb.* to user@localhost identified by 'password' WITH GRANT OPTION;

Note if you don’t have access to the mysql command you can insert these commands into the top of you SQL injection file from earlier.

Now reload the ExpressionEngine database

cat mydb.sql | mysql uks14234

You may need the –p flag to mysql if you have set up an admin and password.

You can do this through phpmysqladmin interface using the “run sql” option on older versions or the Import tab. In this case be sure to compress your file with zip or gzip before uploading.

Step 5. Configure ExpressionEngine for UTF-8

We need ExpressionEngine to know that it is using UTF-8. This is potentially needed for php to mysql conversions and also to generate the correct content-type meta element in the head of the document. This is generated by the charset variable in the document template:

<meta http-equiv="Content-Type" content="text/html; charset=utf8" />

There are two places where this can be configured. Either under the the general preferences for the individual weblog

CP Home > Admin > Weblog Administration > Weblog Management > Edit Weblog

Or in the System preferences

CP Home > Admin > System Preferences > General Configuration

However only the System Preferences seemed to have any effect.

Step 6. Configure Apache

Although ExpressionEngine is generating the correct charset information this is not what the browser users to decide what character set is used for a document delivered over the Internet. The important information for a browser is the content-type header. You can use a tool such as curl to examine the document header returned by the browser.

$ curl –D header.txt http://myhost/

HTTP/1.1 200 OK
Date: Fri, 29 Sep 2006 22:24:54 GMT
Server: Apache
X-Powered-By: PHP/4.4.3-pl1-gentoo
Set-Cookie: PHPSESSID=ff92a6ce34027cd2c756f6553eefd5ff; path=/
Expires: Tue, 01 Jan 2002 00:00:00 GMT
Cache-Control: no-cache
Pragma: no-cache
Last-Modified: Mon, 25 Sep 2006 18:17:22 GMT
Transfer-Encoding: chunked
Content-Type: text/html; charset=ISO-8859-1;

Here we see that the web server is Apache and uses PHP version 4.4.3 running on gentoo linux (that’s quite a bit of useful information for a potential hacker). The server has set a cookie and because this document is dynamic the browser is instructed not to cache its contents in a variety of ways. Finally we are told that it is an 8859 document despite setting ExpressionEngine to UTF8. This is because Apache has been configured with this character set as the default for html documents. To change this either edit the Apache global configuration file

$ vi /etc/apache2/httpd.conf

and change

AddDefaultCharset ISO-8859-1

to

AddDefaultCharset UTF-8

Or if you don’t have access to the Apache configuration file you can set this information in the .htaccess file where your ExpressionEngine index.php script is found.