Seeking public data for benchmarks

I have several side projects for when time permits, and one is benchmarking various MySQL versions (e.g. MySQL 5.0, 5.1, 5.4), variants (e.g. MariaDB, Drizzle), storage engines (e.g. Tokutek, the InnoDB plugin) and even other products like Tokyo Cabinet, which is gaining large implementations.

You have two options with benchmarks: the brute-force approach with tools such as sysbench, TPC, Juice Benchmark, iibench, mysqlslap and skyload, or the realistic approach using representative data and workloads. I prefer the realistic approach; however, those benchmarks always run on a client’s private data. What is first needed is better access to public data for benchmarks. I have compiled this list to date and I am seeking additional sources for reference.

Of course, the data is only the starting point: representative transactions and queries to execute, a framework to execute them, and a reporting module are also necessary. The introduction of Lua into sysbench may now make it a better option than my tool of choice, mybench, which I use simply because I can configure, write and deploy it for a client generally in under an hour.
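
To give an idea of scale, the kind of throwaway harness I mean is only a few dozen lines. This is not mybench itself, just a minimal sketch of the idea in Python; the connection details, thread counts and the placeholder query against the sakila sample database are all values you would swap out per client.

    import threading
    import time

    import MySQLdb  # assumes the MySQL-python driver is installed

    # Placeholders: swap in the client's schema and a representative query.
    QUERY    = "SELECT COUNT(*) FROM film WHERE rating = 'PG'"
    THREADS  = 4        # concurrent client connections
    REQUESTS = 1000     # queries per connection

    def worker(results, idx):
        # one connection per thread, to mimic concurrent clients
        conn = MySQLdb.connect(host="localhost", user="bench", passwd="bench", db="sakila")
        cur = conn.cursor()
        start = time.time()
        for _ in range(REQUESTS):
            cur.execute(QUERY)
            cur.fetchall()
        results[idx] = time.time() - start
        conn.close()

    results = [0.0] * THREADS
    threads = [threading.Thread(target=worker, args=(results, i)) for i in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    total = THREADS * REQUESTS
    elapsed = max(results)   # the slowest thread bounds the wall-clock time
    print("%d queries in %.2fs, %.1f qps" % (total, elapsed, total / elapsed))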

If anybody has other good references to free public data that’s easily loadable into MySQL please let me know.

Comments

  1. EdwinF says

    Thank you Ronald, you just made my day with these links!!!
    I’m building (actually, already finished but still tuning) a synthetic data generator inspired by DBMonster and some others, and these dictionaries are just what I needed to complement it.
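
    The core idea is simple enough to sketch in a few lines of Python (the dictionary file names, output file and row layout here are made up for illustration):

        import random

        # Dictionary files (hypothetical paths): one value per line, e.g. real
        # first names, surnames and city names rather than random strings.
        first_names = [line.strip() for line in open("dict/first_names.txt")]
        last_names  = [line.strip() for line in open("dict/last_names.txt")]
        cities      = [line.strip() for line in open("dict/cities.txt")]

        def fake_person(person_id):
            # combine real dictionary values so rows look plausible
            return (person_id,
                    random.choice(first_names),
                    random.choice(last_names),
                    random.choice(cities))

        # Emit tab-separated rows ready for LOAD DATA INFILE into a test table.
        with open("people.tsv", "w") as out:
            for i in range(1, 1000001):
                out.write("%d\t%s\t%s\t%s\n" % fake_person(i))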

  2. says

    I think database examples work best when everyone already understands the content. This is why MySQL always uses the world database in training (the only shame is that it’s too small).

    IMDB is my favorite of the examples you mentioned. I once cleaned it up for personal usage, but the problem is that it doesn’t really have a free license to distribute under.

  3. Gavin Towey says

    http://www.openstreetmap.org/ has lots of geographic data, to the tune of 160G of XML (look for the Planet.osm file). There’s an application called Osmosis which will load all of that into a database (it took approximately 5 days for me; it handles MySQL, PostgreSQL and ODBC).
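
    If you only need part of the data (say, node coordinates), a streaming parse keeps memory flat regardless of file size. A rough sketch in Python, assuming the usual planet.osm layout of <node id lat lon> elements and a hypothetical output file:

        import xml.sax

        class NodeHandler(xml.sax.ContentHandler):
            # Writes node id/lat/lon as tab-separated rows for LOAD DATA INFILE.
            def __init__(self, out):
                xml.sax.ContentHandler.__init__(self)
                self.out = out

            def startElement(self, name, attrs):
                if name == "node":
                    self.out.write("%s\t%s\t%s\n" % (attrs["id"], attrs["lat"], attrs["lon"]))

        with open("nodes.tsv", "w") as out:
            xml.sax.parse("planet.osm", NodeHandler(out))  # streams the XML, never loads it all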

  4. says

    The Bureau of Transportation Statistics has lots of data, including on-time arrival information for 20+ major air carriers for 20+ years:
    http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time

    We converted this data into a star schema and have made it available to the public in our “cloud demo”. We’ve included both MyISAM and KFDB versions so you can compare results. There is around 50GB of raw data to work with.

    I’ve been working on a benchmark tool which is kind of like sysbench for ROLAP queries on a star schema. It is a PHP application because I didn’t want to deal with sysbench, which is written in C, and I wanted to be able to connect to more databases than sysbench provides drivers for. It scales from 1M to 10B rows or more. I’ve never generated more than 10B rows, which is 1TB of data, because my data generator is single threaded. I can give you the data generator and queries that I’ve written if you like. It has two “tunables” to control behavior. (A sketch of the kind of query involved is below.)

    If you want a really big data set, there is the Sloan Digital Sky Survey:
    http://en.wikipedia.org/wiki/Sloan_Digital_Sky_Survey It is over 4TB of raw data.
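
    For a flavour of the workload, the queries are typical ROLAP aggregates over the fact table. A sketch of one such query timed from Python, with hypothetical star-schema table and column names (ontime_fact, dim_date, dim_carrier), not the actual demo schema:

        import time

        import MySQLdb  # assumes the MySQL-python driver

        # Hypothetical star-schema names for illustration only.
        QUERY = """
            SELECT d.year, c.carrier_name,
                   COUNT(*)         AS flights,
                   AVG(f.arr_delay) AS avg_arrival_delay
            FROM   ontime_fact f
            JOIN   dim_date    d ON f.date_id    = d.date_id
            JOIN   dim_carrier c ON f.carrier_id = c.carrier_id
            WHERE  d.year BETWEEN 2000 AND 2008
            GROUP  BY d.year, c.carrier_name
            ORDER  BY avg_arrival_delay DESC
        """

        conn = MySQLdb.connect(host="localhost", user="bench", passwd="bench", db="ontime")
        cur = conn.cursor()
        start = time.time()
        cur.execute(QUERY)
        rows = cur.fetchall()
        print("%d rows in %.2fs" % (len(rows), time.time() - start))
        conn.close()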

  5. says

    Hi!

    “IMDB – Not clean, but Roland Bouman is working on this at http://code.google.com/p/imbi/”

    I should point out that I am not actively working on it at the moment. The project was just a place for me and a few other professionals to stuff code while learning how to use Pentaho Data Integration. The stuff that is up there is usable, but it’s not complete.

    Also, as Morgan pointed out, once you’ve cleaned the data and loaded it, you cannot freely distribute the data itself.

  6. says

    Hi!

    A separate post to point you to the Fake Name Generator:

    http://www.fakenamegenerator.com/order.php

    This offers fake person record generation free of charge. I was quite pleased to find out that for US data, the geographical distribution (at state level) is pretty close to the data indicated by the US census, so it’s not just random data. You can only grab 50,000 records at a time, and have up to 3 queued requests at any time, but if you do that a couple of times you can get a reasonably large data set.
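
    Once you have a few batches downloaded, loading them is straightforward. A sketch in Python, assuming hypothetical batch file names, CSV headers and a target person table (adjust to the fields you actually ordered):

        import csv
        import glob

        import MySQLdb  # assumes the MySQL-python driver

        # Hypothetical target table and CSV headers; adjust to your order.
        INSERT = ("INSERT INTO person (first_name, last_name, city, state, zip) "
                  "VALUES (%s, %s, %s, %s, %s)")

        conn = MySQLdb.connect(host="localhost", user="bench", passwd="bench", db="bench")
        cur = conn.cursor()

        for path in glob.glob("fng_batch_*.csv"):   # one file per 50,000-record download
            with open(path) as f:
                rows = [(r["GivenName"], r["Surname"], r["City"], r["State"], r["ZipCode"])
                        for r in csv.DictReader(f)]
            cur.executemany(INSERT, rows)
            conn.commit()

        conn.close()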

  7. says

    You might try the OpenStreetMap data (openstreetmap.org). There is a huge database (planet.osm) – it’s an XML file and the data is pretty geo-centric. It’s HUGE, though.

    dhw