I have several side projects when time permits, and one is benchmarking various MySQL versions (e.g. MySQL 5.0, 5.1, 5.4), variants (e.g. MariaDB, Drizzle) and storage engines (e.g. Tokutek, the InnoDB plugin), and even other products such as Tokyo Cabinet, which is gaining large deployments.
You have two options with benchmarks: the brute-force approach with tools such as sysbench, TPC, Juice Benchmark, iibench, mysqlslap and skyload, or the realistic approach using representative workloads. I prefer the realistic approach; however, such workloads are always built on a client’s private data. What is first needed is better access to public data for benchmarks. I have compiled this list to date, and I am seeking additional sources for reference.
- Freebase – Data is in a clean, loadable format
- IMDB – Not clean, but Roland Bouman is working on this at http://code.google.com/p/imbi/
- data.gov. Check out Sunlight Labs as a starting point.
- Netflix offered data but this contest is now closed.
Of course, the data is only the starting point; having representative transactions and queries to execute, a framework to execute them, and a reporting module are also necessary. The introduction of Lua into sysbench may now make it a better option than my tool of choice, mybench, which I use simply because I can configure, write and deploy it for a client generally in under one hour.
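To make that concrete, here is a minimal sketch of such a query-timing harness with a basic reporting step. The query names and schema are hypothetical, and Python’s built-in sqlite3 stands in for a MySQL driver purely so the sketch is self-contained; for a real run you would connect with a MySQL client library instead.

```python
import sqlite3
import statistics
import time

def run_benchmark(conn, queries, iterations=100):
    """Time each named query over several iterations; report basic stats."""
    report = {}
    for name, sql in queries.items():
        timings = []
        for _ in range(iterations):
            start = time.perf_counter()
            conn.execute(sql).fetchall()
            timings.append(time.perf_counter() - start)
        report[name] = {
            "min": min(timings),
            "avg": statistics.mean(timings),
            "max": max(timings),
        }
    return report

# Demo against an in-memory SQLite database standing in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t (val) VALUES (?)",
                 [("row%d" % i,) for i in range(1000)])
queries = {
    "point_select": "SELECT val FROM t WHERE id = 500",
    "range_scan":   "SELECT COUNT(*) FROM t WHERE id BETWEEN 100 AND 900",
}
for name, stats in run_benchmark(conn, queries, iterations=50).items():
    print("%-12s min=%.6fs avg=%.6fs max=%.6fs"
          % (name, stats["min"], stats["avg"], stats["max"]))
```

The realistic approach is then just a matter of swapping in your own schema, data set and representative queries.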
If anybody has other good references to free public data that’s easily loadable into MySQL please let me know.
Frank Mashraqi says
Check out Socrata. They have some great publicly available data sets
EdwinF says
Thank you Ronald, you just made my day with these links!!!
I’m building (actually, already finished but still tuning) a synthetic data generator inspired by DBMonster and some others, and these dictionaries are just what I needed to complement it.
Ronald says
I’ve also just found DBpedia — http://wiki.dbpedia.org/Downloads33
Ronald says
Stock data is available from here for FREE, but requires registration. I’m not sure how easy it would be to automate. http://www.eoddata.com/
Morgan Tocker says
I think database examples work best when everyone already understands the content. This is why MySQL always uses the world database in training (the only shame is that it’s too small).
IMDB is my favorite of the examples you mentioned. I once cleaned it up for personal usage, but the problem is that it doesn’t really have a free license to distribute under.
Eric Day says
Hi Ronald! Last week at the Drizzle meetings we were talking about this, and one other is Dell e-commerce tests. It can be found here: http://linux.dell.com/dvdstore/
Gavin Towey says
http://www.openstreetmap.org/ has lots of geographic data, to the tune of 160GB of XML (look for the Planet.osm file). There’s an application called Osmosis which will load all of that into a database (it took approximately five days for me, and it handles MySQL, PostgreSQL and ODBC).
Justin Swanhart says
The National Bureau of Transportation Statistics has lots of data, including on-time arrival information for 20+ major air carriers over 20+ years:
http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time
We converted this data into a STAR schema and made it available to the public in our “cloud demo”. We’ve included both MyISAM and KFDB versions so you can compare results. There is around 50GB of raw data to work with.
—
I’ve been working on a benchmark tool which is kind of like sysbench for ROLAP queries on a STAR schema. It is a PHP application because I didn’t want to deal with sysbench, which is written in C, and I wanted to be able to connect to more databases than sysbench provides drivers for. It scales from 1M to 10B rows or more. I’ve never generated more than 10B rows, which is 1TB of data, because my data generator is single-threaded. I can give you the data generator and queries that I’ve written if you like. It has two “tunables” to control behavior.
—
If you want a really big data set, there is the Sloan Digital Sky Survey:
http://en.wikipedia.org/wiki/Sloan_Digital_Sky_Survey It is over 4TB of raw data.
Roland Bouman says
Hi!
“IMDB – Not clean, but Roland Bouman is working on this at http://code.google.com/p/imbi/”
I should point out that I am not actively working on it at the moment. The project was just a place for me and a few other professionals to put code while learning how to use Pentaho Data Integration. The stuff that is up there is usable, but it’s not complete.
Also, as Morgan pointed out, once you have cleaned and loaded the data, you cannot freely distribute the data itself.
Roland Bouman says
Hi!
A separate post to point you to the Fake Name Generator:
http://www.fakenamegenerator.com/order.php
This offers free-of-charge fake person record generation. I was quite pleased to find out that for US data, the geographical distribution (at the state level) is pretty close to the data indicated by the US census, so it’s not just random data. You can only grab 50,000 records at a time, and have up to 3 queued requests at any time, but if you do that a couple of times you can get a reasonably large data set.
Justin Swanhart says
@Gavin:
Are there any example queries available for the OSM data set? There are a lot of tables.
ronald says
Lists of various word categories at http://www.cotse.com/tools/wordlists1.htm
ronald says
Free US Zip Code Data http://www.census.gov/geo/www/tiger/zip1999.html (unfortunately in .DBF format)
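A .DBF file like that can be converted for LOAD DATA INFILE without special tooling. Below is a minimal sketch in Python, assuming simple fixed-width character fields (the demo’s ZIP/STATE layout is hypothetical, not the census file’s actual schema); a dedicated DBF library would be more robust for real files with other field types.

```python
import csv
import io
import struct

def dbf_to_csv(data):
    """Convert raw dBASE (.DBF) bytes to CSV text for LOAD DATA INFILE.

    Handles only simple character fields; deleted records are skipped.
    """
    # Header: record count (uint32), header length and record length (uint16s).
    n_records, header_len, record_len = struct.unpack("<IHH", data[4:12])
    # Field descriptors: 32 bytes each from offset 32, terminated by 0x0D.
    fields, pos = [], 32
    while data[pos] != 0x0D:
        desc = data[pos:pos + 32]
        name = desc[:11].split(b"\x00")[0].decode("ascii")
        fields.append((name, desc[16]))  # (field name, field width)
        pos += 32
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([name for name, _ in fields])
    for i in range(n_records):
        rec = data[header_len + i * record_len:
                   header_len + (i + 1) * record_len]
        if rec[:1] == b"*":      # '*' flags a deleted record
            continue
        row, offset = [], 1      # byte 0 of each record is the deletion flag
        for _, width in fields:
            row.append(rec[offset:offset + width].decode("ascii").strip())
            offset += width
        writer.writerow(row)
    return out.getvalue()

# Demo with a tiny hand-built DBF holding two hypothetical ZIP/STATE records.
header = bytes([3, 99, 1, 1]) + struct.pack("<IHH", 2, 97, 8) + b"\x00" * 20
zip_fld = b"ZIP".ljust(11, b"\x00") + b"C" + b"\x00" * 4 + bytes([5, 0]) + b"\x00" * 14
st_fld = b"STATE".ljust(11, b"\x00") + b"C" + b"\x00" * 4 + bytes([2, 0]) + b"\x00" * 14
sample = header + zip_fld + st_fld + b"\x0D" + b" 90210CA" + b" 10001NY"
print(dbf_to_csv(sample))
```

Write the result to a file and MySQL can ingest it with LOAD DATA INFILE, skipping the header row.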
ronald says
Free dictionaries:
http://www.dicts.info/dictionaries.php
ronald says
More sources to try.
http://numbrary.com/
https://infochimps.org/
http://aws.amazon.com/publicdatasets
Long List at http://www.datawrangling.com/some-datasets-available-on-the-web/
ronald says
http://www.gutenberg.org/wiki/Main_Page
Look at open API’s from for example http://www.programmableweb.com/apis
Open Government Data Initiative http://ogdisdk.cloudapp.net/Developers.aspx
ronald says
Last FM Artists from 2007
http://blogs.sun.com/plamere/entry/open_research_the_data_lastfm
Words and phrases – http://www.gutenberg.org/etext/3201
David H. Wilkins says
You might try the OpenStreetMap data (openstreetmap.org). There is a huge database (planet.osm) – it’s an XML file and the data is pretty geo-centric. It’s HUGE, though.
dhw