Python 3 semantics for integer division

As I refresh my skills in Python 2 to Python 3 semantics I discovered there is a difference in the division operator (i.e. /).

When using integers in Python 2 the result (by default) is an integer. For example.

$ python2
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
>>> 1/2
0
>>> 1/2.0
0.5

In Python 3 the result is a float.

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
>>> 1/2
0.5

It has been encouraged in the Porting Python 2 Code to Python 3 documentation to perform the following import.

$ python2
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
>>> from __future__ import division
>>> 1/2
0.5

I was also unaware of the floor operator (i.e. //) as specified in PEP 238.

$ python2
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
>>> from __future__ import division
>>> 1//2
0
>>> 1/2
0.5

I uncovered this by playing with algorithms between versions. This Newton Method for the Square Root was something I was unaware of, and this example failed when using Python 2.

My improved version on the referenced example without an import.

def squareroot(number, precision = 5):
  root = number/2.0
  for i in range(20):
    nroot = (1/2.0)*(root + (number / root))
    #print i, nroot
    if (root - nroot < 1.0/10**precision):
      break
    root = nroot
  return round(nroot, precision)
>>> squareroot(10)
3.16228
>>> squareroot(10,1)
3.2
>>> squareroot(10,2)
3.16
>>> squareroot(10,5)
3.16228
>>> squareroot(10,10)
3.1622776602

Migrating MySQL latin1 to utf8 – Character Set Options

Continuing on from preparation in our MySQL latin1 to utf8 migration let us first understand where MySQL uses character sets. MySQL defines the character set at 4 different levels for the structure of data.

  • Instance
  • Schema
  • Table
  • Column

In MySQL 5.1, the default character set is latin1. If not specified, this is what you will get. For example.

mysql> create table test1(c1 varchar(10) not null);
mysql> show create table test1G
Create Table: CREATE TABLE `test1` (
  `c1` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1

If you want all tables in your instance to always be a default of utf8, you can changed the server variable character_set_server. This can be set dynamically.

mysql> set global character_set_server=utf8;
mysql> set session character_set_server=utf8;
mysql> create table test2(c1 varchar(10) not null);
mysql> show create table test2G
Create Table: CREATE TABLE `test2` (
  `c1` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

If you change this dynamically be sure to include the option in your my.cnf to ensure this option is persisted for a mysqld restart.

You can define the default character set for all new tables in a given schema. You specify this when you create the schema.

mysql> set global character_set_server=latin1;
mysql> set session character_set_server=latin1;
mysql> create schema test_ucs2 default character set ucs2;
mysql> use test_ucs2;
mysql> create table test3(c1 varchar(10) not null);
mysql> show create table test3G
Create Table: CREATE TABLE `test3` (
  `c1` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=ucs2

Even though we have a schema default, you can always specify the default character set for a given table which overrides any defaults.

mysql> use test_ucs2;
mysql> create table test4_utf8 (c varchar(10) not null) default charset utf8;
mysql> show create table test4_utf8G
Create Table: CREATE TABLE `test4_utf8` (
  `c` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

And finally, if you really wanted to be specific you can define the character set on a per column level.

mysql> create table test4_utf8_latin1 (c varchar(10) not null, c2 varchar(20) charset latin1) default charset utf8;
mysql> show create table test4_utf8_latin1G
*************************** 1. row ***************************
       Table: test4_utf8_latin1
Create Table: CREATE TABLE `test4_utf8_latin1` (
  `c` varchar(10) NOT NULL,
  `c2` varchar(20) CHARACTER SET latin1 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

With great flexibility comes great responsibility. You should have a defined standard for your application that is simple and easy to understand. I am not a proponent of using utf8 for everything, the primary reason why is memory. As part of my consulting I spend a lot of time with clients that have limited resources, e.g. database servers with 2GB or 4GB of RAM. MySQL stores utf8 efficiently on disk, but when this data is stored in memory for internal usage, it automatically uses 3 bytes, when on disk it may only be 1 byte. You can test this by creating a MEMORY table with latin1 and utf8 examples and comparing the difference in size. Is this a serious problem? Well that depends on many factors such as the number of database connections, persistent or not persistent connections, the size of the results etc. While it’s difficult in MySQL to instrument the memory precisely on a per connection basis, prudence should be a consideration for any physical resources, especially RAM.

Now that we understand what’s possible, how can we change our existing latin1 tables in our preparation example?

We could try a simple ALTER TABLE command.

mysql> alter table test_latin1 default charset utf8;
mysql> show create table test_latin1G
Create Table: CREATE TABLE `test_latin1` (
  `c` varchar(100) CHARACTER SET latin1 NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

This does not work, because we are only changing the default storage engine of the table. The underlying columns remain the same. If we were to add a new column, it would default to utf8. We can however achieve what we expected with the CONVERT option.

mysql> alter table test_latin1 convert to character set utf8;
mysql> show create table test_latin1G
Create Table: CREATE TABLE `test_latin1` (
  `c` varchar(100) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

We look at our data, and it looks great? Are we done with our conversion?

mysql> select * from test_latin1;
+---------------+
| c             |
+---------------+
| a             |
| abc           |
| ☺           |
| abc ☺☹☻ |
+---------------+

The answer is no. While it may look like the data is correct, MySQL also manages character sets for the communication channel. In this case, we are still communicating in latin1. To ensure moving forward in the future we must always communicate in utf8 to ensure we correctly pass utf8 to the database. We can test this with the mysql client, and as you will see our data is still corrupt.

mysql> set names utf8;
mysql> select * from test_latin1;
+------------------------+
| c                      |
+------------------------+
| a                      |
| abc                    |
| ☺                 |
| abc ☺☹☻ |
+------------------------+

mysql> show session variables like 'character%';
+--------------------------+----------------------------------------------------------------+
| Variable_name            | Value                                                          |
+--------------------------+----------------------------------------------------------------+
| character_set_client     | utf8                                                           |
| character_set_connection | utf8                                                           |
| character_set_database   | latin1                                                         |
| character_set_filesystem | binary                                                         |
| character_set_results    | utf8                                                           |
| character_set_server     | latin1                                                         |
| character_set_system     | utf8                                                           |
+--------------------------+----------------------------------------------------------------+

While you can see how we could migrate the schema definition, this does not complete our migration. In my next post, I will discuss the various different ways to correctly perform a data migration between latin1 and utf8.

Migrating MySQL latin1 to utf8 – Preparation

This article is Part 1 of a series of articles regarding MySQL character set conversion.
Be sure to also check out character set options and the process for more information.

Before undertaking such migration the first step is a lesson in understanding more about how latin1 and utf8 work and interact in MySQL. latin1 in a common and historical character set used in MySQL. utf8 (first available in MySQL Version 4.1) is an encoding supporting multiple bytes and is the system default in MySQL 5.0

  • latin1 is a single byte character set.
  • utf8 is a 1-3 byte character set depending on the size of the character. NOTE: MySQL utf8 does not support the RFC 3629 4 byte sequences. (Updated: MySQL 5.5 now supports full Unicode support with the ” utf8m4″charset))

MySQL variables

MySQL has a number of different system variables to consider, the following is the default representation in MySQL 5.1

mysql> show global variables like '%char%';
+--------------------------+----------------------------------------------------------------+
| Variable_name            | Value                                                          |
+--------------------------+----------------------------------------------------------------+
| character_set_client     | latin1                                                         |
| character_set_connection | latin1                                                         |
| character_set_database   | latin1                                                         |
| character_set_filesystem | binary                                                         |
| character_set_results    | latin1                                                         |
| character_set_server     | latin1                                                         |
| character_set_system     | utf8                                                           |
+--------------------------+----------------------------------------------------------------+

MySQL enables you to specify the character set for data at the Instance,Schema,Table and Column level. This is important because this complexity, especially between the Table and Columns can catch you out.

It is also important to ensure that not only is data stored in the appropriate format, the communication between client and server also needs to support the character set you wish to use.

latin1 example

We start by creating a simple table, inserting some data, and reviewing the data.

mysql> create table test_latin1(c varchar(100) not null) default charset latin1;
mysql> insert into test_latin1(c) values ('a'),('abc'),('☺'),('abc ☺☹☻');

mysql> select c,length(c),char_length(c),charset(c),hex(c) from test_latin1;
+---------------+-----------+----------------+------------+----------------------------+
| c             | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------------+-----------+----------------+------------+----------------------------+
| a             |         1 |              1 | latin1     | 61                         |
| abc           |         3 |              3 | latin1     | 616263                     |
| ☺             |         3 |              3 | latin1     | E298BA                     |
| abc ☺☹☻       |        13 |             13 | latin1     | 61626320E298BAE298B9E298BB |
+---------------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

We use the LENGTH() and CHAR_LENGTH() functions to confirm the known length of the string and the true number of characters.

In the above examples, the smiley characters can be reproduced in a web page with the following.

&#9786;&#9785;&#9787;

utf8 example

In a separate session (because we re-use these later) we repeat for utf8.

mysql> set  names utf8;
mysql> show session variables like '%char%';
+--------------------------+----------------------------------------------------------------+
| Variable_name            | Value                                                          |
+--------------------------+----------------------------------------------------------------+
| character_set_client     | utf8                                                           |
| character_set_connection | utf8                                                           |
| character_set_database   | latin1                                                         |
| character_set_filesystem | binary                                                         |
| character_set_results    | utf8                                                           |
| character_set_server     | latin1                                                         |
| character_set_system     | utf8                                                           |
+--------------------------+----------------------------------------------------------------+

mysql> create table test_utf8(c varchar(100) not null) default charset utf8;
mysql> insert into test_utf8(c) values ('a'),('abc'),('☺'),('abc ☺☹☻');

mysql> select c,length(c),char_length(c),charset(c), hex(c) from test_utf8;
+---------------+-----------+----------------+------------+----------------------------+
| c             | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------------+-----------+----------------+------------+----------------------------+
| a             |         1 |              1 | utf8       | 61                         |
| abc           |         3 |              3 | utf8       | 616263                     |
| ☺            |         3 |              1 | utf8       | E298BA                     |
| abc ☺☹☻       |        13 |              7 | utf8       | 61626320E298BAE298B9E298BB |
+---------------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

As you can see, the key difference here is the CHAR_LENGTH() of the utf8 strings differ.

Mixing latin1 with utf8

If we now look at the latin1 data in the utf8 session we see that while the underlying data via LENGTH(),CHAR_LENGTH() and HEX() remains unchanged (because this is the database representation of the data), the actual output presentation is garbled due to the mismatch in the client communication.

mysql> select c,length(c),char_length(c),charset(c), hex(c) from test_latin1;
+------------------------+-----------+----------------+------------+----------------------------+
| c                      | length(c) | char_length(c) | charset(c) | hex(c)                     |
+------------------------+-----------+----------------+------------+----------------------------+
| a                      |         1 |              1 | latin1     | 61                         |
| abc                    |         3 |              3 | latin1     | 616263                     |
| ☺                    |         3 |              3 | latin1     | E298BA                     |
| abc ☺☹☻          |        13 |             13 | latin1     | 61626320E298BAE298B9E298BB |
+------------------------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

If we look at the utf8 data in latin1, we see a different garbled representation.

mysql> select c,length(c),char_length(c),charset(c),hex(c) from test_utf8;
+---------+-----------+----------------+------------+----------------------------+
| c       | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------+-----------+----------------+------------+----------------------------+
| a       |         1 |              1 | utf8       | 61                         |
| abc     |         3 |              3 | utf8       | 616263                     |
| ?       |         3 |              1 | utf8       | E298BA                     |
| abc ??? |        13 |              7 | utf8       | 61626320E298BAE298B9E298BB |
+---------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

Conclusion

Armed now with a basic understanding we are ready to consider what approaches we may undertake to migrate this example table, and how we may be able to verify our data.

References

Character Set Support
Connection Character Sets and Collations