A Cassandra twitter clone

Following my successful Cassandra Cluster setup and having a potential client example to work with running Ruby On Rails (RoR), I came across the following examples in Ruby.

Not being a ruby developer, I thought it was time to investigate further. Starting first on Mac OS X 10.5, I found the first line example of installing cassandra via gem unsuccessful.

$ gem install cassandra
Updating metadata for 1 gems from http://gems.rubyforge.org
.
complete
ERROR:  could not find cassandra locally or in a repository

Some more reading highlights Otherwise, you need to install Java 1.6, Git 1.6, Ruby, and Rubygems in some reasonable way.

In case you didn’t read my earlier posts, Java 6 is installed, but not the default.

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
export PATH=$JAVA_HOME/bin:$PATH

I achieved installing RubyGems via Installing Ruby on Rails on Mac OS X.

$ sudo gem install rubygems-update

Updating metadata for 1 gems from http://gems.rubyforge.org
.
complete
Successfully installed rubygems-update-1.3.6
1 gem installed
Installing ri documentation for rubygems-update-1.3.6...
Installing RDoc documentation for rubygems-update-1.3.6...
Could not find main page README
Could not find main page README
Could not find main page README
Could not find main page README

$ sudo update_rubygems
RubyGems 1.3.6 installed

=== 1.3.6 / 2010-02-17

NOTE:

http://rubygems.org is now the default source for downloading gems.

You may have sources set via ~/.gemrc, so you should replace
http://gems.rubyforge.org with http://rubygems.org

http://gems.rubyforge.org will continue to work for the forseeable future.

New features:

* `gem` commands
  * Added `gem push` and `gem owner` for interacting with modern/Gemcutter
    sources
  * `gem dep` now supports --prerelease.
  * `gem fetch` now supports --prerelease.
  * `gem server` now supports --bind.  Patch #27357 by Bruno Michel.
  * `gem rdoc` no longer overwrites built documentation.  Use --overwrite
    force rebuilding.  Patch #25982 by Akinori MUSHA.
* Captial letters are now allowed in prerelease versions.

Bug fixes:

* Development deps are no longer added to rubygems-update gem so older
  versions can update sucessfully.
* Installer bugs:
  * Prerelease gems can now depend on non-prerelease gems.
  * Development dependencies are ignored unless explicitly needed.  Bug #27608
    by Roger Pack.
* `gem` commands
  * `gem which` now fails if no paths were found.  Adapted patch #27681 by
    Caio Chassot.
  * `gem server` no longer has invalid markup.  Bug #27045 by Eric Young.
  * `gem list` and friends show both prerelease and regular gems when
    --prerelease --all is given
* Gem::Format no longer crashes on empty files.  Bug #27292 by Ian Ragsdale.
* Gem::GemPathSearcher handles nil require_paths. Patch #27334 by Roger Pack.
* Gem::RemoteFetcher no longer copies the file if it is where we want it.
  Patch #27409 by Jakub Šťastný.

Deprecation Notices:

* lib/rubygems/timer.rb has been removed.
* Gem::Dependency#version_requirements is deprecated and will be removed on or
  after August 2010.
* Bulk index update is no longer supported.
* Gem::manage_gems was removed in 1.3.3.
* Time::today was removed in 1.3.3.


------------------------------------------------------------------------------

RubyGems installed the following executables:
	/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/gem

NOTE: This second command took over 60 seconds with no user feedback.

I was then able to successfully install cassandra via ruby’s gem package manager.

$ sudo gem install cassandra
Building native extensions.  This could take a while...
Successfully installed thrift-0.2.0
Successfully installed thrift_client-0.4.0
Successfully installed simple_uuid-0.1.0
Successfully installed cassandra-0.7.5
4 gems installed
Installing ri documentation for thrift-0.2.0...

Enclosing class/module 'thrift_module' for class BinaryProtocolAccelerated not known

Enclosing class/module 'thrift_module' for class BinaryProtocolAccelerated not known
Installing ri documentation for thrift_client-0.4.0...
Installing ri documentation for simple_uuid-0.1.0...
Installing ri documentation for cassandra-0.7.5...
Installing RDoc documentation for thrift-0.2.0...

Enclosing class/module 'thrift_module' for class BinaryProtocolAccelerated not known

Enclosing class/module 'thrift_module' for class BinaryProtocolAccelerated not known
Installing RDoc documentation for thrift_client-0.4.0...
Installing RDoc documentation for simple_uuid-0.1.0...
Installing RDoc documentation for cassandra-0.7.5...

My use of cassandra_helper provided the following expected dependency error.

$ cassandra_helper cassandra
Set the CASSANDRA_INCLUDE environment variable to use a non-default cassandra.in.sh and friends.
(in /Library/Ruby/Gems/1.8/gems/cassandra-0.7.5)
You need to install git 1.6 or 1.7

I found instructions to install git at Installing git (OSX) and installed via GUI installer.

I had to include to my current session path to get my Ruby Cassandra installation.

$ export PATH=/usr/local/git/bin:$PATH

$ cassandra_helper cassandra
Set the CASSANDRA_INCLUDE environment variable to use a non-default cassandra.in.sh and friends.
(in /Library/Ruby/Gems/1.8/gems/cassandra-0.7.5)
Checking Cassandra out from git
Initialized empty Git repository in /Users/rbradfor/cassandra/server/.git/
remote: Counting objects: 16715, done.
remote: Compressing objects: 100% (2707/2707), done.
remote: Total 16715 (delta 9946), reused 16011 (delta 9364)
Receiving objects: 100% (16715/16715), 19.22 MiB | 1.15 MiB/s, done.
Resolving deltas: 100% (9946/9946), done.
Updating Cassandra.
Buildfile: build.xml

clean:

BUILD SUCCESSFUL
Total time: 2 seconds
HEAD is now at 298a0e6 check-in debian packaging
Building Cassandra
Buildfile: build.xml

build-subprojects:

init:
    [mkdir] Created dir: /Users/rbradfor/cassandra/server/build/classes
    [mkdir] Created dir: /Users/rbradfor/cassandra/server/build/test/classes
    [mkdir] Created dir: /Users/rbradfor/cassandra/server/src/gen-java

check-gen-cli-grammar:

gen-cli-grammar:
     [echo] Building Grammar /Users/rbradfor/cassandra/server/src/java/org/apache/cassandra/cli/Cli.g  ....

build-project:
     [echo] apache-cassandra-incubating: /Users/rbradfor/cassandra/server/build.xml
    [javac] Compiling 247 source files to /Users/rbradfor/cassandra/server/build/classes
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

build:

BUILD SUCCESSFUL
Total time: 42 seconds
CASSANDRA_HOME: /Users/rbradfor/cassandra/server
CASSANDRA_CONF: /Library/Ruby/Gems/1.8/gems/cassandra-0.7.5/conf
Listening for transport dt_socket at address: 8888
DEBUG - Loading settings from /Library/Ruby/Gems/1.8/gems/cassandra-0.7.5/conf/storage-conf.xml
....

I was then able to complete the example at up and running with cassandra running via the ruby interactive console.

I was also able to fire up the cassandra-cli and see the data added in ruby.

$ bin/cassandra-cli -host localhost
Connected to localhost/9160
cassandra> get Twitter.Statuses['1']
=> (column=user_id, value=5, timestamp=1267072406503471)
=> (column=text, value=Nom nom nom nom nom., timestamp=1267072406503471)
Returned 2 results.
cassandra> get Twitter.UserRelationships['5'];
=> (super_column=user_timeline,
     (column=???!??zvZ+?!, value=1, timestamp=1267072426991872)
     (column=??-?!???C?th?, value=2, timestamp=1267072427019091))
Returned 1 results.

No sure about the data in the second example.

Configuring a Cassandra Cluster

Continuing on from Getting started with Cassandra I’m now trying to configure two servers as a cluster. The Getting Started Step 3 was not clear the first time I read it (after writing this is makes sense), so a Google search yielded the second link as Building a Small Cassandra Cluster for Testing and Development. I love finding reference material from people I know, Padraig being a significant contributor to Drizzle.

Here is what I did to create a running Cassandra Cluster.

  • Stop individual Cassandra instances
  • Re-created data and log directories (I did this just to ensure a clean slate)
  • I added to my local hosts file two aliases for my servers (cass01 and cass02). This helped in the following step.
  • Three changes are needed to the default conf/storage-conf.xml file on my first server.
    • Change <ListenAddress> from localhost to cass01
    • Change <ThristAddress> from localhost to cass01
    • Change <Seed> from 127.0.0.1 to cass01
  • On my second server I changed the <ListenAddress> and <ThriftAddress> accordingly to cass02 and made <Seed> cass01
  • Started Cassandra servers and tested successfully using the set …/get Keyspace1.Standard1[‘jsmith’] example. I was able to connect to both hosts via cassandra-cli and see the results created on just one node. I was able to create data on the second node and view on the first node.

A new command is available to describe your cluster.

$ bin/nodeprobe -host cass01 ring
Address       Status     Load          Range                                      Ring
                                       148029780173059661585165369000220362256
192.168.100.4 Up         0 bytes       59303445267720348277007645348152900920     |<--|
192.168.100.5 Up         0 bytes       148029780173059661585165369000220362256    |-->|

Now with my first introduction successful, time to start using and seeing the true power of using Cassandra.

Edward Screven of Oracle to Answer Questions for future of MySQL

For those of you on the O’Reilly MySQL conference list you will no doubt see this email, but for readers here is the important bits.


Oracle Executive Will Speak at O’Reilly MySQL Conference & Expo
Edward Screven to Answer Questions re: Future of MySQL

Sebastopol, CA, February 24, 2010—Wonder about the future of MySQL? Curious about what Oracle plans for the open source database software? Expect answers when Edward Screven, Oracle’s chief corporate architect and leader of the MySQL business, speaks at the O’Reilly MySQL Conference & Expo, scheduled for April 12-15, at the Santa Clara Convention Center and the Hyatt Regency Santa Clara.

Edward Screven reports to CEO Larry Ellison, and he drives technology and architecture decisions across all Oracle products to ensure that product directions are consistent with Oracle’s overall strategy. He’ll discuss the current and future state of MySQL, now part of the Oracle family of products. His presentation will also cover Oracle’s investment in MySQL technology and community, as well as the role that open source in general is playing within heterogeneous customer environments around the world.

I have not found a link yet to provide reference to this.

Ineffective concatenated indexes

In MySQL significant performance improvements can be achieved by the correct use of indexes. It is important to understand different MySQL index implementations and one key improvement on indexes defined on single columns is to use multiple column or more commonly known concatenated indexes.

However it’s also possible to define ineffective indexes. This example shows you how to identify a concatenated index that is ineffective.

CREATE TABLE example (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  a  INT UNSIGNED NOT NULL,
  b  INT UNSIGNED NOT NULL,
  c  INT UNSIGNED NOT NULL,
  d  INT UNSIGNED NOT NULL,
  x  VARCHAR(10),
  y  VARCHAR(10),
  z  VARCHAR(10),
PRIMARY KEY (id),
UNIQUE INDEX (a,b,c,d)
) ENGINE=InnoDB;

INSERT INTO example(a,b,c,d) VALUES
(1,0,1,1),(1,0,1,2), (1,0,2,3), (1,0,4,5),
(2,0,2,1),(2,0,2,2), (2,0,2,3), (2,0,2,5),
(3,0,2,1),(3,0,2,3), (3,0,3,3), (3,0,3,5);

And our sample query is

SELECT id,x,y,z
FROM   example
WHERE  a = 2
AND    c = 2
AND    d = 3;

The EXPLAIN plan is

+----+-------------+---------+------+---------------+------+---------+-------+------+-------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref   | rows | Extra       |
+----+-------------+---------+------+---------------+------+---------+-------+------+-------------+
|  1 | SIMPLE      | example | ref  | a             | a    | 4       | const |    4 | Using where |
+----+-------------+---------+------+---------------+------+---------+-------+------+-------------+

While we are using the index (see the key column), the full benefit of the index is not utilized (see the key_len). 4 indicates the number of bytes used, that is only 1 INT column.

Let’s look at a second example.

SELECT id,x,y,z
FROM   example
WHERE  a = 2
AND    b = 0
AND    c = 2
AND    d = 3;

+----+-------------+---------+-------+---------------+------+---------+-------------------------+------+-------+
| id | select_type | table   | type  | possible_keys | key  | key_len | ref                     | rows | Extra |
+----+-------------+---------+-------+---------------+------+---------+-------------------------+------+-------+
|  1 | SIMPLE      | example | const | a             | a    | 16      | const,const,const,const |    1 |       |
+----+-------------+---------+-------+---------------+------+---------+-------------------------+------+-------+

In this example the index is used however the key_len is 16, that is 4 x 4 byte INT columns. This is ideal for this index.

In the above example, the client was using a common data structure however was not using one column, it’s values were all effectively 0. Certain queries were written with this knowledge and instead of specifying the column, they elected to remove it, however the impact of this developer code change was increased load on the database and more inefficient performance.

While this was easily addressed by a code change, an alternative could have been to change the index definition. It was not possible to remove the column due to legacy requirements.

ALTER TABLE example
DROP INDEX a,  ADD UNIQUE INDEX (a,c,d);
mysql> explain SELECT id,x,y,z FROM   example  WHERE  a = 2 AND    c = 2 AND    d = 3;
+----+-------------+---------+-------+---------------+------+---------+-------------------+------+-------+
| id | select_type | table   | type  | possible_keys | key  | key_len | ref               | rows | Extra |
+----+-------------+---------+-------+---------------+------+---------+-------------------+------+-------+
|  1 | SIMPLE      | example | const | a             | a    | 12      | const,const,const |    1 |       |
+----+-------------+---------+-------+---------------+------+---------+-------------------+------+-------+

A better solution may be to enforce better integrity on this business rule, that b only contains 0. MySQL does not support check constraint. The closest option would be to use an ENUM data type and a STRICT sql_mode however ENUM is only a string object so this would break other code.

ALTER TABLE example MODIFY b ENUM('0') NOT NULL DEFAULT '0';

Not only is performance impacted in this situation, in other examples of ineffective indexes it’s simply a waste of diskspace to sort additional columns in an index that are never used, and also a waste of memory as the index pages stored in memory contain information that is not used.

Getting started with Cassandra

With the motivation from today’s public news on Twitter’s move from MySQL to Cassandra, my own skills desire following in-depth discussions at last November’s Open SQL Camp to consider Cassandra and yesterday’s discussion with a new client on persistent key-value store products, today I download installed and configured for the first time. Not that today’s news was unexpected, if you follow the Twitter Engineering Open Source projects you would have seen Cassandra as well as other products being used or evaluated by Twitter.

So I went from nothing to a working Cassandra node in under 5 minutes. This is what I did.

  1. While I knew this was an Apache project, a Google Search yields for me the 3rd link for the The Apache Cassandra Project at http://incubator.apache.org/cassandra/. Congrats for Cassandra now a top level Apache Project. This url will update soon.
  2. Download Cassandra. Hard to miss with a big green button on home page. Current version is 0.5
  3. I read Getting Started, which is the 3rd top level link on menu after Home and Download. Step 1 is picking a version which I’ve already done, Step 2 is Running a single node.
  4. The Getting Started indicated a problem on Mac OS X for the required minimum Java version. I was installing on Mac OS X 10.5 and CentOS 5.4. I’ve experienced this Java 6 default path issue before. Set my JAVA_HOME and PATH accordingly (after I updated the wiki with correct value)
  5. I extracted the tar file, changed to the directory and took at look at the README.txt file. Yes, I always check this first with any software and relevant because it includes valuable instructions on creating the default data and log directories.
  6. Start with bin/cassandra -f. No problems!
  7. I then followed the instructions from the link in Step 2 with the CassandraCli. This tests and confirms the installation is operational.

Ok, a working environment. I’ve now installed on a second machine and tested however I now need to configure the cluster, and the documentation is not as straightforward. Time to try out Google again.

On a side note, this is one reason why I love Open Source. I followed the instructions online and found a mistake in the Mac OS X path, I simply registered and corrected providing the benefit of my experience for the next reader(s).

You may also like to view future posts including.

The correct approach to rolling MySQL logs

I say correct because there are several incorrect approaches to managing MySQL logs. In MySQL you have two important log files, the MySQL error log (configured with –log-error) and the MySQL slow query log (configured with –log-slow-queries or –slow-query-log and –slow-query-log-file which is available from 5.1.29).

The ideal management of these log files is different for each type of file.

The MySQL Error Log

With the error log you want to have one file showing a contiguous history of the server instance. This log provides valuable information over time and you should not discard this. You do NOT want to roll your error log. If for feel the content in the error log is too much, then these are errors you need to be addressing, and archiving after correcting. There are circumstances where error logs are rolled outside of your control.

The first is the FLUSH LOGS command. When this command is run, the error log is renamed to -old and a new log is created. For example:

$ ls -l log/error*
-rw-rw---- 1 mysql root  1733 Feb 22 16:11 error.log

$ mysql -uroot -p -e "FLUSH LOGS"

$ ls -l log/error*
-rw-rw---- 1 mysql mysql    0 Feb 22 18:08 error.log
-rw-rw---- 1 mysql root  1733 Feb 22 16:11 error.log-old

What happens when you run this command again?

$ mysql -uroot -p -e "FLUSH LOGS"

$ ls -l log/error*
-rw-rw---- 1 mysql mysql 0 Feb 22 18:10 log/error.log
-rw-rw---- 1 mysql mysql 0 Feb 22 18:08 log/error.log-old

You have now lost all valuable information in the error log, both the current log and the -old log are ZERO bytes in size.

The second example is the Ubuntu specific MySQL distribution on Ubuntu OS that logs MySQL information to the system error log (i.e. /var/log/syslog). You are then receiving a daily log rotate via the default OS log-rotate settings. You effectively lose information after 7 days. Here is what you will find on a stock Ubuntu server.

$ ls -l /var/log
...
-rw-r----- 1 mysql       adm       0 2008-05-28 20:33 mysql.err
-rw-r----- 1 mysql       adm       0 2008-05-28 20:33 mysql.log
...
-rw-r----- 1 syslog      adm  278480 2010-02-22 20:22 syslog
-rw-r----- 1 syslog      adm  366934 2010-02-22 06:25 syslog.0
-rw-r----- 1 syslog      adm   21025 2010-02-21 06:27 syslog.1.gz
-rw-r----- 1 syslog      adm   18551 2010-02-20 06:47 syslog.2.gz
-rw-r----- 1 syslog      adm   20086 2010-02-19 06:25 syslog.3.gz
-rw-r----- 1 syslog      adm   17135 2010-02-18 06:40 syslog.4.gz
-rw-r----- 1 syslog      adm   19238 2010-02-17 06:32 syslog.5.gz
-rw-r----- 1 syslog      adm   16101 2010-02-16 06:34 syslog.6.gz

You have to troll the syslog files to find any mysql specific information, even that is not possible with one command.

$ grep syslog syslog.0 | grep mysql

$ zcat syslog*gz | grep mysql
Feb 16 22:12:20 db1 mysqld[21769]: 100216 22:12:20 [ERROR] /usr/sbin/mysqld: Incorrect key file for table '/tmp/#sql_5508_40.MYI'; try to repair it
Feb 16 22:12:20 db1 mysqld[21769]: 100216 22:12:20 [ERROR] /usr/sbin/mysqld: Incorrect key file for table '/tmp/#sql_5508_33.MYI'; try to repair it
Feb 16 22:30:15 db1 mysqld[21769]: 100216 22:19:13 [ERROR] /usr/sbin/mysqld: Incorrect key file for table '/tmp/#sql_5508_29.MYI'; try to repair it
Feb 16 22:30:17 db1 mysqld[21769]: 100216 22:19:13 [ERROR] /usr/sbin/mysqld: Incorrect key file for table '/tmp/#sql_5508_44.MYI'; try to repair it
...

I’m lucky I looked, because tomorrow would have been too late to see these errors, as this is the oldest log file.

I place a disclaimer that if you have proper backups in place, these files are retrievable, but that takes time and effort which I consider unnecessary for a database server. The mysql log is one of the most important log files for this type of server, you want this information on the server.

The solution to this problem is easy in Ubuntu, always define the -log-error variable for the [mysqld] and [mysqld_safe] sections to a specific file, e.g. /var/log/mysql/mysql.log

The MySQL slow query log

With the slow query log, you DO want to rotate the log file produced. The best practice is to rotate this file daily. In addition you should both analyze the log file producing a top 5 or top 10 list of slow SQL queries each day. As load may not always predictable, it’s ideal to also analyze the combined logs of the past 7 days for cross reference.

The daily granularity allows you to track your load of slow performing queries more consistently and it also enables you verify more easily the impact of improvements made when they have been deployed.

Care needs to be taken to roll your log filed. Simply moving the log file will not work. For example.

A good tip to help in slow query SQL analysis is to
Comment your SQL (url)

$ mysql -uroot -p -e "select sleep(5)"
$ tail -1 log/slow.log
select sleep(5);

$ ls -l log/slow*
-rw-rw---- 1 mysql mysql 352 Feb 22 15:26 log/slow.log

$ mv log/slow.log log/slow.log.1; touch log/slow.log
$ ls -l log/slow*
-rw-r--r-- 1 root  root    0 Feb 22 15:26 log/slow.log
-rw-rw---- 1 mysql mysql 352 Feb 22 15:26 log/slow.log.1


$ mysql -uroot -p -e "SELECT SLEEP(4)"

$ ls -l log/slow*
-rw-r--r-- 1 root  root    0 Feb 22 15:26 log/slow.log
-rw-rw---- 1 mysql mysql 533 Feb 22 16:01 log/slow.log.1
$ tail -1 log/slow.log.1
SELECT SLEEP(4);

As you can see, the slow log was not written to, but the previous file which has the same inode.

$ rm -f log/slow.log; mv log/slow.log.1 log/slow.log
$ ls -l log/slow.log*
-rw-rw---- 1 mysql mysql 533 Feb 22 16:01 log/slow.log
$ cp log/slow.log log/slow.log.`date +%M`; > log/slow.log
$ mysql -uroot -p -e "SELECT SLEEP(3)"
$ ls -l log/slow.log*
-rw-rw---- 1 mysql mysql 181 Feb 22 16:03 log/slow.log
-rw-r----- 1 root  root  533 Feb 22 16:03 log/slow.log.03
$  tail -1 log/slow.log
SELECT SLEEP(3);

$ cp log/slow.log log/slow.log.`date +%M`; > log/slow.log
$ mysql -uroot -p -e "SELECT SLEEP(6)"
$ ls -l log/slow.log*
-rw-rw---- 1 mysql mysql 181 Feb 22 16:04 log/slow.log
-rw-r----- 1 root  root  533 Feb 22 16:03 log/slow.log.03
-rw-r----- 1 root  root  181 Feb 22 16:04 log/slow.log.04
$ tail -1 log/slow.log
SELECT SLEEP(6);

As you can see by copying and truncating you can perform an effective log rotate manually. Ideally you should config logrotate to manage this log.

More Information

What's your MySQL version?

I’ve heard that the mechanic’s wife always has a car that needs repair or tuneup, the painter’s wife always had walls of peeling paint, you get the picture. What about MySQL DBA’s and their own databases? While I have many versions of MySQL for testing including for example the latest 5.1.44 which I was using for my previous post, what is running on my production server? Let’s see:

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.1.25-rc |
+-----------+

That’s really old. And yes, to prove my point that we can be our own worst enemy, the previous version before 5.1.25 was 5.1.6. Yes, .6 which worked just fine, and never crashed once for my 20+ websites. While I have downloaded onto my production server several versions ready for upgrade including versions 5.1.30, 5.1,38, and 5.4.1 I’ve never actually gone through the upgrade process.

Migrating MySQL latin1 to utf8 – Character Set Options

Continuing on from preparation in our MySQL latin1 to utf8 migration let us first understand where MySQL uses character sets. MySQL defines the character set at 4 different levels for the structure of data.

  • Instance
  • Schema
  • Table
  • Column

In MySQL 5.1, the default character set is latin1. If not specified, this is what you will get. For example.

mysql> create table test1(c1 varchar(10) not null);
mysql> show create table test1G
Create Table: CREATE TABLE `test1` (
  `c1` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1

If you want all tables in your instance to always be a default of utf8, you can changed the server variable character_set_server. This can be set dynamically.

mysql> set global character_set_server=utf8;
mysql> set session character_set_server=utf8;
mysql> create table test2(c1 varchar(10) not null);
mysql> show create table test2G
Create Table: CREATE TABLE `test2` (
  `c1` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

If you change this dynamically be sure to include the option in your my.cnf to ensure this option is persisted for a mysqld restart.

You can define the default character set for all new tables in a given schema. You specify this when you create the schema.

mysql> set global character_set_server=latin1;
mysql> set session character_set_server=latin1;
mysql> create schema test_ucs2 default character set ucs2;
mysql> use test_ucs2;
mysql> create table test3(c1 varchar(10) not null);
mysql> show create table test3G
Create Table: CREATE TABLE `test3` (
  `c1` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=ucs2

Even though we have a schema default, you can always specify the default character set for a given table which overrides any defaults.

mysql> use test_ucs2;
mysql> create table test4_utf8 (c varchar(10) not null) default charset utf8;
mysql> show create table test4_utf8G
Create Table: CREATE TABLE `test4_utf8` (
  `c` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

And finally, if you really wanted to be specific you can define the character set on a per column level.

mysql> create table test4_utf8_latin1 (c varchar(10) not null, c2 varchar(20) charset latin1) default charset utf8;
mysql> show create table test4_utf8_latin1G
*************************** 1. row ***************************
       Table: test4_utf8_latin1
Create Table: CREATE TABLE `test4_utf8_latin1` (
  `c` varchar(10) NOT NULL,
  `c2` varchar(20) CHARACTER SET latin1 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

With great flexibility comes great responsibility. You should have a defined standard for your application that is simple and easy to understand. I am not a proponent of using utf8 for everything, the primary reason why is memory. As part of my consulting I spend a lot of time with clients that have limited resources, e.g. database servers with 2GB or 4GB of RAM. MySQL stores utf8 efficiently on disk, but when this data is stored in memory for internal usage, it automatically uses 3 bytes, when on disk it may only be 1 byte. You can test this by creating a MEMORY table with latin1 and utf8 examples and comparing the difference in size. Is this a serious problem? Well that depends on many factors such as the number of database connections, persistent or not persistent connections, the size of the results etc. While it’s difficult in MySQL to instrument the memory precisely on a per connection basis, prudence should be a consideration for any physical resources, especially RAM.

Now that we understand what’s possible, how can we change our existing latin1 tables in our preparation example?

We could try a simple ALTER TABLE command.

mysql> alter table test_latin1 default charset utf8;
mysql> show create table test_latin1G
Create Table: CREATE TABLE `test_latin1` (
  `c` varchar(100) CHARACTER SET latin1 NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

This does not work, because we are only changing the default storage engine of the table. The underlying columns remain the same. If we were to add a new column, it would default to utf8. We can however achieve what we expected with the CONVERT option.

mysql> alter table test_latin1 convert to character set utf8;
mysql> show create table test_latin1G
Create Table: CREATE TABLE `test_latin1` (
  `c` varchar(100) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

We look at our data, and it looks great? Are we done with our conversion?

mysql> select * from test_latin1;
+---------------+
| c             |
+---------------+
| a             |
| abc           |
| ☺           |
| abc ☺☹☻ |
+---------------+

The answer is no. While it may look like the data is correct, MySQL also manages character sets for the communication channel. In this case, we are still communicating in latin1. To ensure moving forward in the future we must always communicate in utf8 to ensure we correctly pass utf8 to the database. We can test this with the mysql client, and as you will see our data is still corrupt.

mysql> set names utf8;
mysql> select * from test_latin1;
+------------------------+
| c                      |
+------------------------+
| a                      |
| abc                    |
| ☺                 |
| abc ☺☹☻ |
+------------------------+

mysql> show session variables like 'character%';
+--------------------------+----------------------------------------------------------------+
| Variable_name            | Value                                                          |
+--------------------------+----------------------------------------------------------------+
| character_set_client     | utf8                                                           |
| character_set_connection | utf8                                                           |
| character_set_database   | latin1                                                         |
| character_set_filesystem | binary                                                         |
| character_set_results    | utf8                                                           |
| character_set_server     | latin1                                                         |
| character_set_system     | utf8                                                           |
+--------------------------+----------------------------------------------------------------+

While you can see how we could migrate the schema definition, this does not complete our migration. In my next post, I will discuss the various different ways to correctly perform a data migration between latin1 and utf8.

Checked your MySQL recovery process recently?

I sound like a broken record with every client when I talk to about the resilience of their production environments. It’s very simple in theory, however in practice many organizations fail.

Ask yourself these checklist questions for your MySQL backup and recovery process?

  1. Do you have MySQL backups in place?
  2. Do you backup ALL your MySQL data?
  3. Do you have consistent MySQL backups?
  4. Do you have backups that include both static snapshot and point in time transactions?
  5. Do you review your backup logs EVERY SINGLE day or have tested backup monitoring in place?
  6. Do you perform a test recovery of your static backup?
  7. Do you perform a test recovery to point in time?
  8. Do you time your backup and recovery process and review over time?
  9. Do you have off-site copies of your backups?
  10. Do you backup your primary binary logs?

In the past month I’ve discovered clients that have an online only business (i.e. Their MySQL data is their only tangible asset), they perform daily backups but they don’t have binary logging enabled. I’ve also discovered an example of backup logs not being checked, and an underlying mysqldump error was resulting in an incomplete backup, yet the backup script apparently completed successfully.

Disaster is inevitable.

If you don’t score 8 or better in the above checklist in your business, you are at higher risk. If you are a owner/founder/executive this should keep you awake at night if your not sure of your business viability. If your organization needs help, please contact me for assistance.

  • Can you recover from a small or large disaster?
  • Do you have confidence in your DR plan?
  • Do you know how long your DR plan will take.
  • What does your online business look like or operate during your DR time?

The Blue Pill or the Red Pill

At the recent FOSDEM 2010 event, I presented in my keynote Dolphins, now and beyond a option which I termed the “Blue Pill” or the “Red Pill”. The following slide produced noticed interest in a packed room, and subsequent conversation.

While the ownership of MySQL has changed, the option between MySQL and Oracle as a product for use still remains. While MySQL is the most popular for modern online applications, Oracle continues to have the widely used enterprise database product and has a large number of Oracle DBAs in the IT marketplace.

Over the past 5 years I have presented a number of topics on MySQL for Oracle DBA’s. At the upcoming MySQL Users Conference 2010 I will be presenting the first two talks in my new series “MySQLCamp for the Oracle DBA”.

  • IGNITION – is the preparation necessary for a successful launch of a MySQL ecosystem for an Oracle DBA. This volume covers the preparation needed to be ready for ongoing production administration of MySQL.
  • LIFTOFF – Only after a successful preparation covered in IGNITION can you be ready for the implementation and management of a MySQL ecosystem and a successful launch of your product.

Migrating MySQL latin1 to utf8 – Preparation

This article is Part 1 of a series of articles regarding MySQL character set conversion.
Be sure to also check out character set options and the process for more information.

Before undertaking such migration the first step is a lesson in understanding more about how latin1 and utf8 work and interact in MySQL. latin1 in a common and historical character set used in MySQL. utf8 (first available in MySQL Version 4.1) is an encoding supporting multiple bytes and is the system default in MySQL 5.0

  • latin1 is a single byte character set.
  • utf8 is a 1-3 byte character set depending on the size of the character. NOTE: MySQL utf8 does not support the RFC 3629 4 byte sequences. (Updated: MySQL 5.5 now supports full Unicode support with the ” utf8m4″charset))

MySQL variables

MySQL has a number of different system variables to consider, the following is the default representation in MySQL 5.1

mysql> show global variables like '%char%';
+--------------------------+----------------------------------------------------------------+
| Variable_name            | Value                                                          |
+--------------------------+----------------------------------------------------------------+
| character_set_client     | latin1                                                         |
| character_set_connection | latin1                                                         |
| character_set_database   | latin1                                                         |
| character_set_filesystem | binary                                                         |
| character_set_results    | latin1                                                         |
| character_set_server     | latin1                                                         |
| character_set_system     | utf8                                                           |
+--------------------------+----------------------------------------------------------------+

MySQL enables you to specify the character set for data at the Instance,Schema,Table and Column level. This is important because this complexity, especially between the Table and Columns can catch you out.

It is also important to ensure that not only is data stored in the appropriate format, the communication between client and server also needs to support the character set you wish to use.

latin1 example

We start by creating a simple table, inserting some data, and reviewing the data.

mysql> create table test_latin1(c varchar(100) not null) default charset latin1;
mysql> insert into test_latin1(c) values ('a'),('abc'),('☺'),('abc ☺☹☻');

mysql> select c,length(c),char_length(c),charset(c),hex(c) from test_latin1;
+---------------+-----------+----------------+------------+----------------------------+
| c             | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------------+-----------+----------------+------------+----------------------------+
| a             |         1 |              1 | latin1     | 61                         |
| abc           |         3 |              3 | latin1     | 616263                     |
| ☺             |         3 |              3 | latin1     | E298BA                     |
| abc ☺☹☻       |        13 |             13 | latin1     | 61626320E298BAE298B9E298BB |
+---------------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

We use the LENGTH() and CHAR_LENGTH() functions to confirm the known length of the string and the true number of characters.

In the above examples, the smiley characters can be reproduced in a web page with the following.

&#9786;&#9785;&#9787;

utf8 example

In a separate session (because we re-use these later) we repeat for utf8.

mysql> set  names utf8;
mysql> show session variables like '%char%';
+--------------------------+----------------------------------------------------------------+
| Variable_name            | Value                                                          |
+--------------------------+----------------------------------------------------------------+
| character_set_client     | utf8                                                           |
| character_set_connection | utf8                                                           |
| character_set_database   | latin1                                                         |
| character_set_filesystem | binary                                                         |
| character_set_results    | utf8                                                           |
| character_set_server     | latin1                                                         |
| character_set_system     | utf8                                                           |
+--------------------------+----------------------------------------------------------------+

mysql> create table test_utf8(c varchar(100) not null) default charset utf8;
mysql> insert into test_utf8(c) values ('a'),('abc'),('☺'),('abc ☺☹☻');

mysql> select c,length(c),char_length(c),charset(c), hex(c) from test_utf8;
+---------------+-----------+----------------+------------+----------------------------+
| c             | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------------+-----------+----------------+------------+----------------------------+
| a             |         1 |              1 | utf8       | 61                         |
| abc           |         3 |              3 | utf8       | 616263                     |
| ☺            |         3 |              1 | utf8       | E298BA                     |
| abc ☺☹☻       |        13 |              7 | utf8       | 61626320E298BAE298B9E298BB |
+---------------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

As you can see, the key difference here is the CHAR_LENGTH() of the utf8 strings differ.

Mixing latin1 with utf8

If we now look at the latin1 data in the utf8 session we see that while the underlying data via LENGTH(),CHAR_LENGTH() and HEX() remains unchanged (because this is the database representation of the data), the actual output presentation is garbled due to the mismatch in the client communication.

mysql> select c,length(c),char_length(c),charset(c), hex(c) from test_latin1;
+------------------------+-----------+----------------+------------+----------------------------+
| c                      | length(c) | char_length(c) | charset(c) | hex(c)                     |
+------------------------+-----------+----------------+------------+----------------------------+
| a                      |         1 |              1 | latin1     | 61                         |
| abc                    |         3 |              3 | latin1     | 616263                     |
| ☺                    |         3 |              3 | latin1     | E298BA                     |
| abc ☺☹☻          |        13 |             13 | latin1     | 61626320E298BAE298B9E298BB |
+------------------------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

If we look at the utf8 data in latin1, we see a different garbled representation.

mysql> select c,length(c),char_length(c),charset(c),hex(c) from test_utf8;
+---------+-----------+----------------+------------+----------------------------+
| c       | length(c) | char_length(c) | charset(c) | hex(c)                     |
+---------+-----------+----------------+------------+----------------------------+
| a       |         1 |              1 | utf8       | 61                         |
| abc     |         3 |              3 | utf8       | 616263                     |
| ?       |         3 |              1 | utf8       | E298BA                     |
| abc ??? |        13 |              7 | utf8       | 61626320E298BAE298B9E298BB |
+---------+-----------+----------------+------------+----------------------------+
4 rows in set (0.00 sec)

Conclusion

Armed now with a basic understanding we are ready to consider what approaches we may undertake to migrate this example table, and how we may be able to verify our data.

References

Character Set Support
Connection Character Sets and Collations

Beyond MySQL GA: patches, storage engines, forks, and pre-releases – FOSDEM 2010

Kristian Nielsen presented “Beyond MySQL GA: patches, storage engines, forks, and pre-releases”.
This included a history of current products:

Google Patches (5.0 & 5.1) included improvements in :

  • statistics/monitoring
  • lock contention
  • binlog
  • malloc()
  • filesorts
  • innodb I/O and wait statistics
  • SHOW …STATISTICS statements
  • smp scalability
  • I/O scalability
  • semisync replication
  • many more

Percona Patches (5.0) focus on

  • statistics/monitoring
  • performance/scalability
  • buffer pool content/mutexes
  • microslow patch

These have been ported to 5.1 and mainly integrated into XtraDB.

EBay Patches (5.0) have included:

  • variable length memory storage engine
  • pool of threads
  • Virtual columns

XtraDB storage engine (5.1) includes

  • Percona patches
  • Google patches
  • Innodb patches
  • Has XtraBackup for backup

Other engines/patches discussed included:

  • PBXT storage engine – community contribution
  • FederatedX – replacement to Federated
  • Sphinx storage engine
  • Pinba storage engine – Collects PHP statistics
  • Others OQGraph/Spider
  • Galera – Synchronous replication
  • Drizzle

Alternative packaging options for MySQL 5.0 and MySQL 5.1 including Our Delta, Percona and MariaDB.

FOSDEM 2010 MySQL Developer Room Schedule
FOSDEM 2010 Website
Brussels, Belgium
February 7, 2010

Multi-Master Manager for MySQL – FOSDEM 2010

The next presentation by Piotr Biel from Percona was on Multi-Master Manager for MySQL.

The introduction included a discussion of the popular MySQL HA solutions including:

  • MySQL Master-slave replication with failover
  • MMM managed bi-directional replication
  • Heartbeat/SAN
  • Heartbeat/DRBD
  • NDB Cluster

A key problem that was clarified in the talk is the discussion of Multi-Master and this IS NOT master-master. You only write to a single node. With MySQL is this critical because MySQL replication does not manage collision detection.

The MMM Cluster Elements are:

  • monitoring node
  • database nodes

And the Application Components are:

  • mon
  • agent
  • angel

MMM works with 3 layers.

  • Network Layer – uses a virtual IP address, related to servers, not a physical machine
  • Database Layer
  • Application Layer

MMM uses two roles for management with your application.

  • exclusive – also known as the writer
  • balanced – also known as the reader

There are 3 different statuses are used to indicate node state

  • proper operation
  • maintenance
  • fatal errors

The mmm_control is the tool used to manage the cluster including:

  • move roles
  • enable/disable individual nodes
  • view cluster status
  • configure failover

The Implementation challenges require the use of the following MySQL settings to minimize problems.

  • auto_increment_offset/auto_increment_increment
  • log_slave_updates
  • read_only

FOSDEM 2010 MySQL Developer Room Schedule
FOSDEM 2010 Website
Brussels, Belgium
February 7, 2010

10x Performance Improvements in MySQL – A Case Study

The slides for my presentation at FOSDEM 2010 are now available online at slideshare. In this presentation I describe a successful client implementation with the result of 10x performance improvements. My presentation covers monitoring, reviewing and analyzing SQL, the art of indexes, improving SQL, storage engines and caching.

The end result was a page load improvement from 700+ms load time to a a consistent 60ms.

State of phpMyAdmin – FOSDEM 2010

Following the opening keynote “Dolphins, now and beyond”, Marc Delisle presented on “State of phpMyAdmin”.

phpMyAdmin is an DBA administration tool for MySQL available today in 57 different languages. This is found today in many distributions, LAMP stack products and also in cpanel. The product is found at http://phpmyadmin.net.

There are current two versions, the legacy 2.x version to support older php 3.x & 4.x, The current version 3.x is for PHP 5.2 or greater.

The current UI includes some new features including.

  • calendar input for date fields
  • meta data for mime types e.g images, which is great for showing the output as an image, otherwise blob data
  • Relational designer with the able to show and create foreign keys

The New features in 3.3 (currently in beta) include:

  • Replication support including configuring master/slave, start/stop slave.
  • Synchronization model showing structure and data differences between two servers and ability to sync.
  • New export to php array, xslx, mediawiki, new importing features including progress bar.
  • Changes tracking for changes on per instance or per table. Providing change report and export options.

FOSDEM 2010 MySQL Developer Room Schedule
FOSDEM 2010 Website
Brussels, Belgium
February 7, 2010

Dolphins, now & beyond – FOSDEM 2010

I had the honor of opening the day at the MySQL developer room at FOSDEM 2010 where I had a chance to talk about the MySQL product and community, now and what’s happening moving forward.

For those that missed the talk, my slides are available online at Slideshare however slides never due justice to some of the jokes including:

  • What do you consider? the Blue Pill, or the Red Pill
  • Why think two dimensionally, how about the Green Pill
  • Emerging Breeds with performance enhancing modifications