I have several side projects when time permits, and one is benchmarking various MySQL versions (e.g. MySQL 5.0, 5.1, 5.4), variants (e.g. MariaDB, Drizzle) and storage engines (e.g. Tokutek, the InnoDB plugin), and even other products such as Tokyo Cabinet, which is gaining large deployments.
You have two options with benchmarks: the brute-force approach with tools such as sysbench, TPC, Juice Benchmark, iibench, mysqlslap and skyload, or the realistic approach using representative workloads. I prefer the realistic approach; however, such workloads are always built on a client’s private data. What is first needed is better access to public data for benchmarks. I have compiled this list to date, and I am seeking additional sources for reference.
- Freebase – Data is in a clean, loadable format
- IMDB – Not clean, but Roland Bouman is working on this at http://code.google.com/p/imbi/
- data.gov. Check out Sunlight Labs as a starting point.
- Netflix offered data but this contest is now closed.
Of course, the data is only the starting point; having representative transactions and queries to execute, a framework to execute them, and a reporting module are also necessary. The introduction of Lua into sysbench may now make it a better option than my tool of choice, mybench, which I use simply because I can configure, write and deploy it for a client generally in under one hour.
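To make that concrete, here is a minimal sketch of such a query-timing harness with a basic reporting step. The query names and schema are hypothetical, and Python’s built-in sqlite3 stands in for a MySQL driver purely so the sketch is self-contained; for a real run you would connect with a MySQL client library instead.

```python
import sqlite3
import statistics
import time

def run_benchmark(conn, queries, iterations=100):
    """Time each named query over several iterations; report basic stats."""
    report = {}
    for name, sql in queries.items():
        timings = []
        for _ in range(iterations):
            start = time.perf_counter()
            conn.execute(sql).fetchall()
            timings.append(time.perf_counter() - start)
        report[name] = {
            "min": min(timings),
            "avg": statistics.mean(timings),
            "max": max(timings),
        }
    return report

# Demo against an in-memory SQLite database standing in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t (val) VALUES (?)",
                 [("row%d" % i,) for i in range(1000)])
queries = {
    "point_select": "SELECT val FROM t WHERE id = 500",
    "range_scan":   "SELECT COUNT(*) FROM t WHERE id BETWEEN 100 AND 900",
}
for name, stats in run_benchmark(conn, queries, iterations=50).items():
    print("%-12s min=%.6fs avg=%.6fs max=%.6fs"
          % (name, stats["min"], stats["avg"], stats["max"]))
```

The realistic approach is then just a matter of swapping in your own schema, data set and representative queries.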
If anybody has other good references to free public data that’s easily loadable into MySQL please let me know.
Frank Mashraqi says
Check out Socrata. They have some great publicly available data sets
EdwinF says
Thank you Ronald, you just made my day with these links!!!
I’m building (actually, already finished but still tuning) a synthetic data generator inspired by DBMonster and some others, and these dictionaries are just what I needed to complement it.
Ronald says
I’ve also just found DBpedia — http://wiki.dbpedia.org/Downloads33
Ronald says
Stock data is available from here for FREE, but requires registration. I’m not sure how easy it would be to automate. http://www.eoddata.com/
Morgan Tocker says
I think database examples work best when everyone already understands the content. This is why MySQL always uses the world database in training (the only shame is that it’s too small).
IMDB is my favorite of the examples you mentioned. I once cleaned it up for personal usage, but the problem is that it doesn’t really have a free license to distribute under.
Eric Day says
Hi Ronald! Last week at the Drizzle meetings we were talking about this, and one other is Dell e-commerce tests. It can be found here: http://linux.dell.com/dvdstore/
Gavin Towey says
http://www.openstreetmap.org/ has lots of geographic data, to the tune of 160GB of XML (look for the Planet.osm file). There’s an application called Osmosis which will load all of that into a database (it took approximately five days for me, and it handles MySQL, PostgreSQL and ODBC).
Justin Swanhart says
The National Bureau of Transportation Statistics has lots of data, including on-time arrival information for 20+ major air carriers over 20+ years:
http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time
We converted this data into a STAR schema and made it available to the public in our “cloud demo”. We’ve included both MyISAM and KFDB versions so you can compare results. There is around 50GB of raw data to work with.
—
I’ve been working on a benchmark tool which is kind of like sysbench for ROLAP queries on a STAR schema. It is a PHP application because I didn’t want to deal with sysbench, which is written in C, and I wanted to be able to connect to more databases than sysbench provides drivers for. It scales from 1M to 10B rows or more. I’ve never generated more than 10B rows, which is 1TB of data, because my data generator is single-threaded. I can give you the data generator and queries that I’ve written if you like. It has two “tunables” to control behavior.
—
If you want a really big data set, there is the Sloan Digital Sky Survey:
http://en.wikipedia.org/wiki/Sloan_Digital_Sky_Survey It is over 4TB of raw data.
Roland Bouman says
Hi!
“IMDB – Not clean, but Roland Bouman is working on this at http://code.google.com/p/imbi/”
I should point out that I am not actively working on it at the moment. The project was just a place for me and a few other professionals to put code while learning how to use Pentaho Data Integration. The stuff that is up there is usable, but it’s not complete.
Also, as Morgan pointed out, once you have cleaned and loaded the data, you cannot freely distribute the data itself.
Roland Bouman says
Hi!
A separate post to point you to the Fake Name Generator:
http://www.fakenamegenerator.com/order.php
This offers free-of-charge fake person record generation. I was quite pleased to find out that for US data, the geographical distribution (at the state level) is pretty close to the data indicated by the US census, so it’s not just random data. You can only grab 50,000 records at a time, and have up to 3 queued requests at any time, but if you do that a couple of times you can get a reasonably large data set.
Justin Swanhart says
@Gavin:
Are there any example queries available for the OSM data set? There are a lot of tables.
ronald says
Lists of various word categories at http://www.cotse.com/tools/wordlists1.htm
ronald says
Free US Zip Code Data http://www.census.gov/geo/www/tiger/zip1999.html (unfortunately in .DBF format)
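A .DBF file like that can be converted for LOAD DATA INFILE without special tooling. Below is a minimal sketch in Python, assuming simple fixed-width character fields (the demo’s ZIP/STATE layout is hypothetical, not the census file’s actual schema); a dedicated DBF library would be more robust for real files with other field types.

```python
import csv
import io
import struct

def dbf_to_csv(data):
    """Convert raw dBASE (.DBF) bytes to CSV text for LOAD DATA INFILE.

    Handles only simple character fields; deleted records are skipped.
    """
    # Header: record count (uint32), header length and record length (uint16s).
    n_records, header_len, record_len = struct.unpack("<IHH", data[4:12])
    # Field descriptors: 32 bytes each from offset 32, terminated by 0x0D.
    fields, pos = [], 32
    while data[pos] != 0x0D:
        desc = data[pos:pos + 32]
        name = desc[:11].split(b"\x00")[0].decode("ascii")
        fields.append((name, desc[16]))  # (field name, field width)
        pos += 32
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([name for name, _ in fields])
    for i in range(n_records):
        rec = data[header_len + i * record_len:
                   header_len + (i + 1) * record_len]
        if rec[:1] == b"*":      # '*' flags a deleted record
            continue
        row, offset = [], 1      # byte 0 of each record is the deletion flag
        for _, width in fields:
            row.append(rec[offset:offset + width].decode("ascii").strip())
            offset += width
        writer.writerow(row)
    return out.getvalue()

# Demo with a tiny hand-built DBF holding two hypothetical ZIP/STATE records.
header = bytes([3, 99, 1, 1]) + struct.pack("<IHH", 2, 97, 8) + b"\x00" * 20
zip_fld = b"ZIP".ljust(11, b"\x00") + b"C" + b"\x00" * 4 + bytes([5, 0]) + b"\x00" * 14
st_fld = b"STATE".ljust(11, b"\x00") + b"C" + b"\x00" * 4 + bytes([2, 0]) + b"\x00" * 14
sample = header + zip_fld + st_fld + b"\x0D" + b" 90210CA" + b" 10001NY"
print(dbf_to_csv(sample))
```

Write the result to a file and MySQL can ingest it with LOAD DATA INFILE, skipping the header row.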
ronald says
Free dictionaries:
http://www.dicts.info/dictionaries.php
ronald says
More sources to try.
http://numbrary.com/
https://infochimps.org/
http://aws.amazon.com/publicdatasets
Long List at http://www.datawrangling.com/some-datasets-available-on-the-web/
ronald says
http://www.gutenberg.org/wiki/Main_Page
Look at open API’s from for example http://www.programmableweb.com/apis
Open Government Data Initiative http://ogdisdk.cloudapp.net/Developers.aspx
ronald says
Last FM Artists from 2007
http://blogs.sun.com/plamere/entry/open_research_the_data_lastfm
Words and phrases – http://www.gutenberg.org/etext/3201
David H. Wilkins says
You might try the OpenStreetMap data (openstreetmap.org). There is a huge database (planet.osm) – it’s an XML file and the data is pretty geo-centric. It’s HUGE, though.
dhw