Migrating MySQL latin1 to utf8 – The process

Having covered the preparation and character set options of performing a latin1 to utf8 MySQL migration, just how do you perform the migration correctly.

Example Case

Just to recap, we have the following example table and data.

mysql> select c,length(c),char_length(c),charset(c), hex(c) from conv.test_latin1;
+---------------+-----------+----------------+------------+----------------------------+
| c             | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------------+-----------+----------------+------------+----------------------------+
| a             |         1 |              1 | latin1     | 61                         |
| abc           |         3 |              3 | latin1     | 616263                     |
| ☺             |         3 |              3 | latin1     | E298BA                     |
| abc ☺☹☻       |        13 |             13 | latin1     | 61626320E298BAE298B9E298BB |
+---------------+-----------+----------------+------------+----------------------------+

Migration approach 1

The easiest way is to use mysqldump to dump the schema and data, to change the schema definitions and then a client reload process, all with the correct client character sets. Working on our sample table test_latin1 which I have in a conv schema you can do.

$ mysqldump -uroot -p --default-character-set=latin1 --skip-set-charset conv > conv.sql
$ sed -i -e "s/latin1/utf8/g" conv.sql
$ mysql -uroot -p --default-character-set=utf8 conv < conv.sql

If you attempt this which technically works (as I've seen a similar example on the Internet) your bound to screw up something. While the mysqldump and mysql load use the correct client command line options, performing a blind conversion of all references of latin1 to utf8 will not only change your table definition, in my example it will change the name of table, and if would change any values of data that contained the word 'latin1'. For example.

mysql> select c,length(c),char_length(c),charset(c), hex(c) from conv.test_utf8;
+---------+-----------+----------------+------------+----------------------------+
| c       | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------+-----------+----------------+------------+----------------------------+
| a       |         1 |              1 | utf8       | 61                         |
| abc     |         3 |              3 | utf8       | 616263                     |
| ?       |         3 |              1 | utf8       | E298BA                     |
| abc ??? |        13 |              7 | utf8       | 61626320E298BAE298B9E298BB |
+---------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select c,length(c),char_length(c),charset(c), hex(c) from conv.test_utf8;
+---------------+-----------+----------------+------------+----------------------------+
| c             | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------------+-----------+----------------+------------+----------------------------+
| a             |         1 |              1 | utf8       | 61                         |
| abc           |         3 |              3 | utf8       | 616263                     |
| ☺             |         3 |              1 | utf8       | E298BA                     |
| abc ☺☹☻       |        13 |              7 | utf8       | 61626320E298BAE298B9E298BB |
+---------------+-----------+----------------+------------+----------------------------+
4 rows in set (0.01 sec)

Be sure to realize now you need to connect with --default-character-set=utf8 or in our example, we use set names as a stop gap measure.

Migration Approach 2

A good practice with using mysqldump regardless of migration is to always dump the schema with the --no-data option, and then dump the data separately with the --no-create-info option. In this case, you can then manually edit the schema file, carefully changing latin1 where appropriate. This is your production data, so some manual hand verification is a good thing. The additional benefit is you can create your schema and verify the syntax is correct before loading any data. While mysqldump creates a single file, due to this dump and reload, you can split your data file and perform some level of parallelism for data loading depending on your hardware capabilities.

$ mysqldump -uroot -p --default-character-set=latin1 --skip-set-charset --no-data conv > conv.schema.sql
$ mysqldump -uroot -p --default-character-set=latin1 --skip-set-charset --no-create-info conv > conv.data.sql
# For simplicity in this example I'm using a more strict global replacement. NOTE: this syntax depends on the version of mysql
$ sed -i -e "s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g" conv.schema.sql
$ mysql -uroot -p --default-character-set=utf8 conv < conv.schema.sql
$ mysql -uroot -p --default-character-set=utf8 conv < conv.data.sql
$ mysql -uroot -p --default-character-set=utf8 conv
mysql> select c,length(c),char_length(c),charset(c), hex(c) from conv.test_latin1;
+---------------+-----------+----------------+------------+----------------------------+
| c             | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------------+-----------+----------------+------------+----------------------------+
| a             |         1 |              1 | utf8       | 61                         |
| abc           |         3 |              3 | utf8       | 616263                     |
| ☺             |         3 |              1 | utf8       | E298BA                     |
| abc ☺☹☻       |        13 |              7 | utf8       | 61626320E298BAE298B9E298BB |
+---------------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

Other software interactions

While I only discuss the MySQL data and mysql client usage here, you should ensure that your programming language connecting to MySQL uses the correct character set management, e.g. PHP uses the default_charset settings in php.ini and even your webserver may benefit from the correct encoding, e.g. Apache httpd uses AddDefaultCharset httpd.conf.

Migration Approach Problems

In our example our migration is successfully, however like life, your production data is unlikely to be perfect. In my next post I will talk about identifying and handling exceptions, including data that was wrongly encoded originally into latin1, or translation such as html entities from rich editor input fields for example.

Comments

  1. Bryn Jones says

    If you are using database triggers, you will probably need to split the dump of the schema into two files when you do the mysqldump so that you can restore the data without upsetting the triggers. So when you restore, you first restore the tables, then the data and finally the triggers. The –skip-triggers option on mysqldump allows you to do this.