Do you have a MySQL horror story to share?

I am looking for a few more unique examples to add to the final chapter of my upcoming book on MySQL Backup & Recovery. If you would like to share your fun experience, receive a mention, and get a free copy, please let me know via a comment. If you would like to share but not have your comment published, please say so at the top of your feedback.

Thanks for helping to build a detailed list of what could go wrong and how to be prepared for a MySQL disaster.

Comments

  1. James Day says

    Someone deletes ibdata1. Someone asks Oracle’s MySQL Support team before shutting down their server. They don’t lose their data. It’s useful to know that a Linux system will not discard the ibdata1 file until the process that holds it open is shut down. So long as you don’t shut down mysqld you can still use mysqldump to back up your data and/or just recover the file. At least one MySQL data recovery service from a well-known third-party support vendor still gives, on its data recovery welcome page, the dangerous advice to shut down mysqld as soon as possible if someone has deleted data. That can be good advice. It can also cost you data that you could have saved if you follow it in the wrong situation. Knowing how the data has been deleted matters.
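
    As a minimal SQL-level illustration that the data really is still readable while the server stays up (the table name and output path below are made up), something like this works before any shutdown, alongside a full mysqldump:

      -- The running server still holds the unlinked ibdata1 open, so queries work:
      SELECT COUNT(*) FROM mydb.orders;
      -- Rescue copies can be written out table by table before any restart;
      -- a full mysqldump of all databases is the more complete option.
      SELECT * FROM mydb.orders INTO OUTFILE '/tmp/orders.rescue.tsv';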

    At Wikipedia, when dealing with recovery of damaged data on a slave, I once dropped a table on the master instead of the slave. I recovered from a very lagged slave. One of the nice features in MySQL 5.6 is the ability to set up a slave so that it is guaranteed to always lag by a specified amount. This can save you a lot of time, giving you a slave you can clone from instead of having to restore a backup.
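
    For example, on the slave (the one-hour delay here is just an example value):

      -- MySQL 5.6 delayed replication: keep this slave a fixed time behind the master.
      STOP SLAVE;
      CHANGE MASTER TO MASTER_DELAY = 3600;  -- in seconds; one hour in this sketch
      START SLAVE;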

    At Wikipedia we once had a power outage that damaged data on all of our masters and slaves that had RAID controllers. A simple and cheap slave with striped SATA drives worked fine. We recloned the site from that slave. The problem with the main database servers was that neither the SATA nor the SCSI RAID controllers disabled the hard drives’ write buffers, making the battery backup on the controllers useless. One of the vendors used to disable them but stopped because it made performance in benchmarks worse. In the ensuing Slashdot discussion I resisted the temptation to name that RAID vendor. But I was sorely tempted. InnoDB crash recovery is almost 100% reliable if you don’t have your hard drive write buffers turned on. If you do have them turned on it can be more like 0% reliable, as expected. If you like durability, do be sure to use a supported configuration and don’t lose data due to drive buffers being turned on. And test it with real power unplugging, don’t just use resets that don’t really remove power.
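
    The SQL side of a durable setup looks something like this (persist the same values in my.cnf too; disabling the drive write cache itself is done at the operating system or RAID controller level, not from SQL):

      -- Flush and sync the InnoDB redo log at every commit:
      SET GLOBAL innodb_flush_log_at_trx_commit = 1;
      -- Sync the binary log at every commit:
      SET GLOBAL sync_binlog = 1;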

    Backups and the binary log are good. Use them. So is replication. A dirt cheap slave, even on an old and slow developer box or your own computer, can be a good tool to save data if the worst happens.
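
    A quick sanity check that the binary log is actually there to use, before you need it:

      -- Confirm binary logging is enabled and note the current file and position:
      SHOW VARIABLES LIKE 'log_bin';
      SHOW MASTER STATUS;
      -- Rotate to a fresh binlog file before risky maintenance work:
      FLUSH BINARY LOGS;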

    What will you do if your data center burns down? Your slaves and backups aren’t all at the same place that just burned down, I hope? One remote slave can save you a lot of pain. That can be in your office instead of your colo. Or in a different colo, ideally on a different regional power system so it’ll stay up if local power goes down.

    Your UPS and on site generator won’t work very well if someone hits the emergency power off button. That’s normally required by fire safety codes to disconnect UPS and generator power as well as utility power. Don’t rely on UPS and generators for durability. They fail. And are required by law to fail.

    Learn from the above and your data is much more likely to survive common bad events.

    I don’t really need a copy of the book. :) Just get people using and testing proper durable settings and at least one remote slave, please. :)

    If you want the official view of Oracle Inc., contact an Oracle press officer, the above is just my opinion.

    James Day, MySQL Senior Principal Support Engineer, Oracle

  2. Rares Mirica says

    Well, I issued a REPLACE INTO statement on the master thinking it would solve a data shift problem on one of the slaves, only to figure out later that it had deleted all the records in the tables that were related to it. I then spent the night recovering all the deleted data by parsing the binlogs (we used row-based replication) with grep, sed, and a makeshift parser written in Python.
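
    For anyone wondering how that can happen: REPLACE is a DELETE followed by an INSERT, so foreign keys with ON DELETE CASCADE pointing at the replaced rows are one way related tables get emptied. A made-up illustration:

      CREATE TABLE parent (id INT PRIMARY KEY, payload INT) ENGINE=InnoDB;
      CREATE TABLE child (
        id INT PRIMARY KEY,
        parent_id INT,
        FOREIGN KEY (parent_id) REFERENCES parent (id) ON DELETE CASCADE
      ) ENGINE=InnoDB;
      INSERT INTO parent VALUES (1, 10);
      INSERT INTO child VALUES (100, 1);
      -- The duplicate key makes REPLACE delete the old parent row first,
      -- which cascades into child, and only then insert the new row:
      REPLACE INTO parent VALUES (1, 11);
      SELECT * FROM child;  -- empty: the related row is gone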

  3. willem says

    In a system with one master and 6 slaves, it turned out that an application had accidentally been inserting millions of bogus test records for a few months. I identified the data and issued a DELETE statement to remove all the records in one go. The statement – which used an index, not a full table scan – ran for one whole week(!).
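
    A commonly used alternative, with made-up table and column names, is to delete in bounded chunks so that no single statement runs for days:

      -- Repeat (from a small script) until ROW_COUNT() reports 0 rows deleted:
      DELETE FROM app_test_data
       WHERE created_at < '2012-01-01'
       LIMIT 10000;
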
    My other MySQL stories are not of the above caliber, but they add up to an ongoing disappointment in the robustness of MySQL. There are many cases of ‘checklist’ development in MySQL – for example, the close-to-useless implementation of partitioning, where the partitioning columns must be part of the primary key (!!) – and other unpleasant surprises.