Looking just at the data

There are many areas you need to review when addressing MySQL performance, such as current database load, executed SQL statements, connections, configuration parameters, memory usage, disk-to-memory ratio, and hardware performance and bottlenecks, just to name a few.

If you were to look just at the data held in the database, what would you consider?
Here are my tips when looking just at the data.

  1. What is the current database size? (See the query sketch after this list.)
  2. What is the growth of data over time, say daily, weekly?
  3. Which are the 2 largest tables now?
  4. What 2 tables are growing the fastest?
  5. What tables have the greatest churn, specifically DELETEs?
  6. How often do you optimize your tables?
  7. What is your archiving/purging strategy? Do you even have one?
  8. Have you reviewed your data types? I average a 25% reduction in footprint just by choosing optimal data types, generally with zero code changes.
  9. What further data simplification can occur to reduce size, e.g. INT for IPs (see the example after this list), ENUMs, removing repeating text, etc.?
  10. What normalization of data can occur?
  11. What storage engines are in use?
  12. What data is write-once data?
  13. Can data be stored in other forms, e.g. outside a relational database?
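
For the first few questions (and question 11), the INFORMATION_SCHEMA makes this easy to answer. A minimal sketch; note that for InnoDB the sizes are estimates, so treat the numbers as approximate:

    -- Total data and index footprint per schema
    SELECT table_schema,
           ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS total_mb
    FROM   information_schema.tables
    GROUP  BY table_schema;

    -- The largest tables, including the storage engine in use
    SELECT table_schema, table_name, engine,
           ROUND((data_length + index_length) / 1024 / 1024, 1) AS total_mb
    FROM   information_schema.tables
    WHERE  table_schema NOT IN ('mysql', 'information_schema')
    ORDER  BY data_length + index_length DESC
    LIMIT  5;

Capture this output on a daily or weekly schedule and simple differences give you the growth over time and the fastest-growing tables.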
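As an example for question 9, MySQL's built-in INET_ATON() and INET_NTOA() functions let an IPv4 address be stored in a 4-byte INT UNSIGNED instead of a VARCHAR(15). The table and column names here are hypothetical:

    -- Store IPv4 addresses as 4-byte unsigned integers, not strings
    CREATE TABLE access_log (
      ip INT UNSIGNED NOT NULL
    );

    INSERT INTO access_log (ip) VALUES (INET_ATON('192.168.1.1'));

    SELECT INET_NTOA(ip) FROM access_log;  -- returns '192.168.1.1'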

Even without looking at the SQL statements or the MySQL configuration, you can generally deduce a lot of information about the application just by looking at the data.

My favorite MySQL data type – DECIMAL(31,0)

It may seem hard to believe, but I have seen DECIMAL(31,0) in action on a production server. Not in just one column, but in 15 columns across the largest 4 tables of one schema. The columns were being used to represent integer primary or foreign keys.
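
The fix is a straightforward ALTER TABLE, sketched here with hypothetical table and column names, and assuming every value fits within the unsigned 32-bit range:

    -- DECIMAL(31,0) occupies 14 bytes per value; INT UNSIGNED occupies 4
    ALTER TABLE orders
      MODIFY order_id    INT UNSIGNED NOT NULL,
      MODIFY customer_id INT UNSIGNED NOT NULL;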

In a representative production instance (one of more than a dozen distributed production database servers) the overall database footprint decreased from ~10 GB to ~2 GB, a 78% saving. In total, 15 columns across just 4 tables were changed from DECIMAL(31,0) to INT UNSIGNED.

One single table of more than 5GB was reduced to under 1GB (an 81% saving). This is my record for any GB+ table in my time working with MySQL.

With this server having, for example, 4GB of RAM and say 2.5GB allocated to innodb_buffer_pool_size, this one change moved the system from requiring consistent disk access (4x as much data as memory) to being able to store all data in memory. Tests showed a clear improvement in InnoDB buffer pool reads and hit ratio.
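
You can measure this yourself from the server status counters: Innodb_buffer_pool_reads counts logical reads that had to go to disk, while Innodb_buffer_pool_read_requests counts all logical read requests.

    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

    -- Hit ratio = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)
    -- The closer to 1.0, the more reads are served from memory.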

Today’s lesson, as described in my 2008 conference presentation Top 20 design tips for data architects, is: choose the right integer data type for your data.
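
As a quick reference, check the real range of your data and pick the smallest integer type that covers it (a hypothetical example):

    SELECT MIN(customer_id), MAX(customer_id) FROM orders;

    -- TINYINT UNSIGNED    0 .. 255                         (1 byte)
    -- SMALLINT UNSIGNED   0 .. 65,535                      (2 bytes)
    -- MEDIUMINT UNSIGNED  0 .. 16,777,215                  (3 bytes)
    -- INT UNSIGNED        0 .. 4,294,967,295               (4 bytes)
    -- BIGINT UNSIGNED     0 .. 18,446,744,073,709,551,615  (8 bytes)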

Seeking public data for benchmarks

I have several side projects when time permits, and one is benchmarking various MySQL versions (e.g. MySQL 5.0, 5.1, 5.4), variants (e.g. MariaDB, Drizzle) and storage engines (e.g. Tokutek, the InnoDB plugin), and even other products like Tokyo Cabinet, which is gaining large implementations.

You have two options with benchmarks. The first is the brute-force approach, with tools such as sysbench, TPC, Juice Benchmark, iibench, mysqlslap and skyload. I prefer the realistic approach, however those runs are always on a client's private data. What is needed first is better access to public data for benchmarks. I have compiled this list to date, and I am seeking additional sources for reference.

Of course, the data is only the starting point; having representative transactions and queries to execute, a framework to execute them, and a reporting module are also necessary. The introduction of Lua into sysbench may now be a better option than my tool of choice, mybench, which I use simply because I can generally configure, write and deploy it for a client in under 1 hour.

If anybody has other good references to free public data that's easily loadable into MySQL, please let me know.