Migrating MySQL latin1 to utf8 – Preparation

This article is Part 1 of a series of articles regarding MySQL character set conversion.
Be sure to also check out character set options and the process for more information.

Before undertaking such migration the first step is a lesson in understanding more about how latin1 and utf8 work and interact in MySQL. latin1 in a common and historical character set used in MySQL. utf8 (first available in MySQL Version 4.1) is an encoding supporting multiple bytes and is the system default in MySQL 5.0

  • latin1 is a single byte character set.

    • utf8 is a 1-3 byte character set depending on the size of the character. NOTE: MySQL utf8 does not support the RFC 3629 4 byte sequences. (Updated: MySQL 5.5 now supports full Unicode support with the ” utf8m4″charset))

MySQL variables

MySQL has a number of different system variables to consider, the following is the default representation in MySQL 5.1

mysql> show global variables like '%char%';


+————————–+—————————————————————-+ | Variable_name | Value | +————————–+—————————————————————-+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | +————————–+—————————————————————-+

MySQL enables you to specify the character set for data at the Instance,Schema,Table and Column level. This is important because this complexity, especially between the Table and Columns can catch you out.

It is also important to ensure that not only is data stored in the appropriate format, the communication between client and server also needs to support the character set you wish to use.

### latin1 example

We start by creating a simple table, inserting some data, and reviewing the data.

<pre>mysql> create table test_latin1(c varchar(100) not null) default charset latin1;

mysql> insert into test_latin1(c) values (‘a’),(‘abc’),(’☺’),(‘abc ☺☹☻’);

mysql> select c,length(c),char_length(c),charset(c),hex(c) from test_latin1; +—————+———–+—————-+————+—————————-+ | c | length(c) | char_length(c) | charset(c) | hex(c) | +—————+———–+—————-+————+—————————-+ | a | 1 | 1 | latin1 | 61 | | abc | 3 | 3 | latin1 | 616263 | | ☺ | 3 | 3 | latin1 | E298BA | | abc ☺☹☻ | 13 | 13 | latin1 | 61626320E298BAE298B9E298BB | +—————+———–+—————-+————+—————————-+ 4 rows in set (0.00 sec)

We use the LENGTH() and CHAR_LENGTH() functions to confirm the known length of the string and the true number of characters.

In the above examples, the smiley characters can be reproduced in a web page with the following.

<pre>&#9786;&#9785;&#9787;

### utf8 example

In a separate session (because we re-use these later) we repeat for utf8.

<pre>mysql> set  names utf8;

mysql> show session variables like ‘%char%’; +————————–+—————————————————————-+ | Variable_name | Value | +————————–+—————————————————————-+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | +————————–+—————————————————————-+

mysql> create table test_utf8(c varchar(100) not null) default charset utf8; mysql> insert into test_utf8(c) values (‘a’),(‘abc’),(’☺’),(‘abc ☺☹☻’);

mysql> select c,length(c),char_length(c),charset(c), hex(c) from test_utf8; +—————+———–+—————-+————+—————————-+ | c | length(c) | char_length(c) | charset(c) | hex(c) | +—————+———–+—————-+————+—————————-+ | a | 1 | 1 | utf8 | 61 | | abc | 3 | 3 | utf8 | 616263 | | ☺ | 3 | 1 | utf8 | E298BA | | abc ☺☹☻ | 13 | 7 | utf8 | 61626320E298BAE298B9E298BB | +—————+———–+—————-+————+—————————-+ 4 rows in set (0.00 sec)

As you can see, the key difference here is the CHAR_LENGTH() of the utf8 strings differ.

### Mixing latin1 with utf8

If we now look at the latin1 data in the utf8 session we see that while the underlying data via LENGTH(),CHAR_LENGTH() and HEX() remains unchanged (because this is the database representation of the data), the actual output presentation is garbled due to the mismatch in the client communication.

<pre>mysql> select c,length(c),char_length(c),charset(c), hex(c) from test_latin1;

+————————+———–+—————-+————+—————————-+ | c | length(c) | char_length(c) | charset(c) | hex(c) | +————————+———–+—————-+————+—————————-+ | a | 1 | 1 | latin1 | 61 | | abc | 3 | 3 | latin1 | 616263 | | ☺ | 3 | 3 | latin1 | E298BA | | abc ☺☹☻ | 13 | 13 | latin1 | 61626320E298BAE298B9E298BB | +————————+———–+—————-+————+—————————-+ 4 rows in set (0.00 sec)

If we look at the utf8 data in latin1, we see a different garbled representation.

<pre>mysql> select c,length(c),char_length(c),charset(c),hex(c) from test_utf8;

+———+———–+—————-+————+—————————-+ | c | length(c) | char_length(c) | charset(c) | hex(c) | +———+———–+—————-+————+—————————-+ | a | 1 | 1 | utf8 | 61 | | abc | 3 | 3 | utf8 | 616263 | | ? | 3 | 1 | utf8 | E298BA | | abc ??? | 13 | 7 | utf8 | 61626320E298BAE298B9E298BB | +———+———–+—————-+————+—————————-+ 4 rows in set (0.00 sec)

### Conclusion

Armed now with a basic understanding we are ready to consider what approaches we may undertake to migrate this example table, and how we may be able to verify our data.

### References

[Character Set Support][4]  
[Connection Character Sets and Collations][5]
Tagged with: Databases MySQL Uncategorized

Related Posts

Beyond MySQL GA: patches, storage engines, forks, and pre-releases – FOSDEM 2010

Kristian Nielsen presented “Beyond MySQL GA: patches, storage engines, forks, and pre-releases”. This included a history of current products: Google Patches (5.0 & 5.1) included improvements in : statistics/monitoring lock contention binlog malloc() filesorts innodb I/O and wait statistics SHOW …STATISTICS statements smp scalability I/O scalability semisync replication many more Percona Patches (5.

Read more

Multi-Master Manager for MySQL – FOSDEM 2010

The next presentation by Piotr Biel from Percona was on Multi-Master Manager for MySQL. The introduction included a discussion of the popular MySQL HA solutions including: MySQL Master-slave replication with failover MMM managed bi-directional replication Heartbeat/SAN Heartbeat/DRBD NDB Cluster A key problem that was clarified in the talk is the discussion of Multi-Master and this IS NOT master-master.

Read more

10x Performance Improvements in MySQL – A Case Study

The slides for my presentation at FOSDEM 2010 are now available online at slideshare . In this presentation I describe a successful client implementation with the result of 10x performance improvements.

Read more