Upgrading ExpressionEngine to UTF-8Character sets are a bugbear for many a web developer. The advice these days is to upgrade everything to UTF-8 and I decided to go down this route when I moved hosts. It is actually reasonably straightforward and this document gives the process I used. UTF-8 (Unicode)The reason for all this character confusion is historic. Originally computers
were very limited in terms of speed and storage. Much of the early development
in IT took place in the United States. A standard called ASCII was devised to
represent the alphabet using the characters 0-127. This allowed for a few extra
characters such as line-feed and carriage return (yes we are in the era of
teletypes) as well as control characters to do stuff like ring the “bell” on the
terminal. Obviously this precluded a whole range of languages from European
accented characters to more extensive Russian, Arab and Asian characters. Prerequisites
Useful Tools
Step 1. Export your databaseFirst off I exported my ExpressionEngine 1.4.2 database using the control panel. You will find this option under the ADMIN tab -> utilites -> Database Backup. Export as either a zip or gzip to cut down on the amount of data to transfer. What EE does is produce a file full of SQL statements to recreate your database from scratch. Step 2. Convert database to UTF-8I have seen a lot of complicated process for moving to UTF-8. Normally if your database is already in a recognized character set (and how are you displaying it on a web browser if it is not?) then you can use the GNU iconv utility found on Unix or Cygwin systems. $ iconv -f iso-8859-1 -t utf-8 mydb_060929 > mydb.sql However after reloading my database into ExpressionEngine I noticed some
problems. Euro symbols and smart quotes were not showing up. It turned out that
what I actually had in my database was not 8859 but a Windows derivative called
Windows-1252. This uses codes between 0x80 and 0x9F for letters and punctuation.
In particular the Euro symbol is represented by 0x80. Even if a document is
labeled as 8859 most web browsers and e-mail clients will still interpret these
codes correctly. I was unaware of this problem because I had been using Windows
as my system for preparing documents. $ iconv -f cp1252 -t utf-8 mydb_060929 > mydb.sql There was one other thing to do. The tables exported by ExpressionEngine were all tagged as latin1 CREATE TABLE ... Using an editor that can open and save files without changing the character set (ie vi) I found and replaced all occurrences of CHARSET=latin1 With CHARSET=utf8 Step 3. Check mysql for UTF8 supportNow you may be wondering why you need any character set support at all in a
database. Surely it is just a store for data entered and displayed by
ExpressionEngine? Well that is true but if you, or your client, wants to do an
operation such as search for a string the database will need to know how
characters are represented in tables. You may potentially also want UTF-8
support for column names and other meta data although this shouldn’t be a
concern for ExpressionEngine. A big issue is that a field of type VARCHAR may be
given a size of say 25 characters. Under UTF-8 this may actually mean 25 to 75
characters depending on the characters used and the database needs to take this
into account. $ mysql -V Or from phpMyAdmin on the home page localhost * Server version: 4.1.20a-nt Note:These are also issues for PHP. The standard string functions in PHP only work on single byte characters, using these functions on UTF-8 data can result in corruption. Presumably the EE authors have taken this into account. Certainly some special UTF8 functions are called. Step 4. mysql configuration (optional)If you are responsible for administering your mysql server you may want to set the default character set to UTF-8. Edit my.cnf and look for the following entries [mysql] etc. Step 4. Load the databaseConnect to your database server and drop the database that holds the ExpressionEngine tables. Then recreate the database with the UTF8 character set and recreate your ExpressionEngine database user. $ mysql mysql > drop database mydb; Note if you don’t have access to the mysql command you can insert these commands into the top of you SQL injection file from earlier. Now reload the ExpressionEngine database cat mydb.sql | mysql uks14234 You may need the –p flag to mysql if you have set up an admin and password. You can do this through phpmysqladmin interface using the “run sql” option on older versions or the Import tab. In this case be sure to compress your file with zip or gzip before uploading. Step 5. Configure ExpressionEngine for UTF-8We need ExpressionEngine to know that it is using UTF-8. This is potentially needed for php to mysql conversions and also to generate the correct content-type meta element in the head of the document. This is generated by the charset variable in the document template: <meta http-equiv="Content-Type" content="text/html; charset=utf8" /> There are two places where this can be configured. Either under the the general preferences for the individual weblog CP Home > Admin > Weblog Administration > Weblog Management > Edit Weblog Or in the System preferences CP Home > Admin > System Preferences > General Configuration However only the System Preferences seemed to have any effect. Step 6. Configure ApacheAlthough ExpressionEngine is generating the correct charset information this is not what the browser users to decide what character set is used for a document delivered over the Internet. The important information for a browser is the content-type header. You can use a tool such as curl to examine the document header returned by the browser. $ curl –D header.txt http://myhost/ HTTP/1.1 200 OK Here we see that the web server is Apache and uses PHP version 4.4.3 running on gentoo linux (that’s quite a bit of useful information for a potential hacker). The server has set a cookie and because this document is dynamic the browser is instructed not to cache its contents in a variety of ways. Finally we are told that it is an 8859 document despite setting ExpressionEngine to UTF8. This is because Apache has been configured with this character set as the default for html documents. To change this either edit the Apache global configuration file $ vi /etc/apache2/httpd.conf and change AddDefaultCharset ISO-8859-1 to AddDefaultCharset UTF-8 Or if you don’t have access to the Apache configuration file you can set this
information in the .htaccess file where your ExpressionEngine index.php script
is found. |
|
©1994-2006 All text and images copyright: www.abcseo.com; last updated: |