Recently I was asked to provide guidelines for testing failover of a MySQL configuration that was provided by a hosting provider.
The first observation was that the client had no technical details from the hosting provider about what the moving parts were, and no confirmation, other than what I believe was a verbal assurance, that it had ever been tested.
The first rule of using hosting: never assume. Too many times I’ve seen details supplied by a client, for example the H/W configuration, only to audit and find otherwise. RAID is a big one, and is generally far more complex to determine. Even for companies with internal systems I’ve seen the simplest question go unanswered. Q: How do you know your RAID is fully operational? A: Somebody will tell us? It’s really amazing to investigate on site with the client and find the RAID system running in degraded mode due to a disk failure, with nobody aware of it.
It took some more digging to realize the configuration in question was built on Red Hat Cluster Suite. A word of warning for any clients that use this: DO NOT USE MyISAM. I’ll leave it to the readers to ask me why.
Here is a short list I provided as the minimum requirements I’d test just to ensure the configuration was operational.
Verifying a working Red Hat Cluster Suite MySQL Environment
The MySQL Environment
The database environment consists of two MySQL database servers, configured in an active/passive mode using a shared disk storage via SAN.
For the purposes of the following procedures the active server will be known as the ‘primary’ server, and the passive server will be the ‘secondary server’.
The two physical servers for the purposes of these tests will be defined as ‘alpha’ and ‘beta’, with specific H/W that does not change during these tests.
Normal Operations
Expected Configuration under normal operations. A quick command-line verification sketch follows the two lists below.
Primary Server
- server is pingable
- server accepts SSH Connection
- MySQL service is started
- has /data appropriately mounted
- has assigned VIP address
- MySQL configuration file and settings are correct
Secondary Server
- server is pingable
- server accepts SSH Connection
- MySQL service IS NOT started
- DOES NOT have /data mounted
- DOES NOT have assigned VIP address
- MySQL configuration file is not available
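A minimal sketch of how these checks could be scripted from a management host. The hostnames, the example VIP 192.0.2.100 and the /etc/my.cnf path are assumptions for illustration; only the /data mount point and the /etc/init.d/mysqld service come from the environment described above.

# Run against the current primary ('alpha' under normal operations);
# on the secondary, the MySQL service, /data mount and VIP checks should all be negative.
ping -c 3 alpha                            # server is pingable
ssh alpha 'hostname'                       # server accepts SSH connections
ssh alpha '/etc/init.d/mysqld status'      # MySQL service is started (primary only)
ssh alpha 'mount | grep /data'             # shared storage mounted (primary only)
ssh alpha 'ip addr | grep 192.0.2.100'     # VIP present (primary only, example address)
ssh alpha 'md5sum /etc/my.cnf'             # record the MySQL configuration for later comparison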
1. Reboot servers ‘alpha’ and ‘beta’.
Test Status:
- alpha server is the designated primary server
- alpha and beta servers are operational
Action:
1.1 Restart alpha server (init 6)
1.2 Restart beta server (init 6)
Checklist:
1.3 Alpha server matches primary server configuration
1.4 Beta server matches secondary server configuration
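To confirm checks 1.3 and 1.4 from the cluster’s own point of view, something like the following can be run on either node once it is back up. This assumes the cluster service is named mysql-svc, as used in the failover steps that follow.

clustat                  # both members should show Online; mysql-svc started on alpha
clustat -s mysql-svc     # status of just the MySQL cluster service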
2. Controlled failover from ‘alpha’ to ‘beta’
Test Status:
- alpha server is the designated primary server
- alpha and beta servers are operational
Action:
2.1 Alpha server – Instigate Cluster failover (clusvcadm -r mysql-svc); see the sketch after this checklist
Checklist:
2.2 Beta server matches primary server configuration
2.3 Alpha server matches secondary server configuration
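One way step 2.1 could be driven and then verified. This is a sketch only; the -m option (naming the preferred target member) and the VIP value are additions of mine, not confirmed details of the hosted environment.

clusvcadm -r mysql-svc -m beta     # relocate the MySQL service to the named member
clustat                            # confirm mysql-svc is now started on beta
ssh beta '/etc/init.d/mysqld status; mount | grep /data; ip addr | grep 192.0.2.100'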
3. Controlled failover from ‘beta’ to ‘alpha’
Test Status:
- beta server is the designated primary server
- alpha and beta servers are operational
Action:
3.1 Beta server – Instigate Cluster failover (clusvcadm -r mysql-svc)
Checklist:
3.2 Alpha server matches primary server configuration
3.3 Beta server matches secondary server configuration
Exception Operations
4. Loss of connectivity to primary server
Test Status:
- alpha server is the designated primary server
- beta server is online
Action:
4.1 Stop networking services on ‘alpha’ (ifdown bond0)
Checklist:
4.2 Monitoring detects and reports connectivity loss
4.3 Automated failover occurs
4.4 Beta server matches primary server configuration
4.5 Alpha server matches secondary server configuration
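For this test (and the other failover tests) it is worth timing the actual outage window as seen by an application. A minimal sketch, assuming the example VIP 192.0.2.100 and placeholder monitoring credentials:

# Poll MySQL through the VIP once a second; the timestamps bracket the outage window.
while true; do
  echo -n "$(date '+%H:%M:%S') "
  mysqladmin -h 192.0.2.100 -u monitor -pXXXX ping 2>/dev/null || echo "NOT AVAILABLE"
  sleep 1
done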
5. Restore connectivity to secondary server
Test Status:
- beta server is the designated primary server
- alpha server is online, but not accessible via private IP
Action:
5.1 Start networking services on ‘alpha’ (ifup bond0)
Checklist:
5.2 Monitoring detects and reports connectivity restored
5.3 No failback occurs
5.4 Beta server matches primary server configuration
5.5 Alpha server matches secondary server configuration
6. Loss of connectivity to secondary server
Test Status:
- beta server is the designated primary server
- alpha server is online
Action:
6.1 Stop networking services on ‘alpha’ (ifdown bond0)
Checklist:
6.2 Monitoring detects and reports connectivity lost
6.3 No failback occurs
6.4 Beta server matches primary server configuration
6.5 Alpha server matches secondary server configuration
7. Restore connectivity to secondary server
Test Status:
- beta server is the designated primary server
- alpha server is online, but not accessible via private IP
Action:
7.1 Start networking services on ‘alpha’ (ifup bond0)
Checklist:
7.2 Monitoring detects and reports connectivity restored
7.3 No failback occurs
7.4 Beta server matches primary server configuration
7.5 Alpha server matches secondary server configuration
8. Power down secondary server
Test Status:
- beta server is the designated primary server
- alpha server is online
Action:
8.1 Power down alpha (init 0). NOTE: requires remote power/boot capability; see the sketch after this checklist
Checklist:
8.2 Monitoring detects and reports connectivity lost
8.3 Beta server matches primary server configuration
8.4 Additional paging for extended downtime (‘degraded failover support’)
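Steps 8 through 11 assume out-of-band power control. How that is achieved depends entirely on the hosting provider’s hardware; as one possibility, if the servers have IPMI-capable management interfaces, the sequence might look like this (management hostnames and credentials are placeholders):

ipmitool -I lanplus -H alpha-mgmt -U admin -P XXXX chassis power off      # step 8.1 equivalent
ipmitool -I lanplus -H alpha-mgmt -U admin -P XXXX chassis power status   # confirm state
ipmitool -I lanplus -H alpha-mgmt -U admin -P XXXX chassis power on       # step 10.1 equivalent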
9. Power down primary server
Test Status:
- beta server is the designated primary server
- alpha server is offline
Action:
9.1 Power down beta (init 0). NOTE: requires remote power/boot capability
Checklist:
9.2 Monitoring detects and reports connectivity lost
9.3 Site database connectivity completely unavailable
9.4 Additional paging for loss of HA solution
10. Power restored to ‘alpha’
Test Status:
- alpha server is offline
- beta server is offline
Action:
10.1 Power on alpha
Checklist:
10.2 Monitoring detects and reports server up
10.3 Alpha server assumes primary role (previously held by beta)
10.4 Alpha server matches primary server configuration
10.5 Additional paging for degraded HA
11. Power restored to ‘beta’
Test Status:
- alpha server is primary server
- beta server is offline
Action:
11.1 Power on beta
Checklist:
11.2 Monitoring detects and reports server up
11.3 Alpha server matches primary server configuration
11.4 Beta server matches secondary server configuration
Database Operations
12. MySQL services on primary server go offline
Test Status:
- alpha server is the designated primary server
- beta server is online
Action:
12.1 Stop mysql services on ‘alpha’ (/etc/init.d/mysqld stop)
Checklist:
12.2 Monitoring detects and reports database loss (while connectivity is still available)
12.3 Automated failover occurs
12.4 Beta server matches primary server configuration
12.5 Alpha server matches secondary server configuration
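A sketch of how I’d watch the cluster react to step 12.1. Whether the cluster restarts MySQL locally or relocates the service depends on the service’s configured recovery policy, which is exactly what this test should expose. The VIP and credentials are placeholders.

ssh alpha '/etc/init.d/mysqld stop'               # 12.1 - stop MySQL while the node stays up
clustat -i 2                                      # refresh every 2 seconds; watch mysql-svc recover (Ctrl-C to exit)
mysqladmin -h 192.0.2.100 -u monitor -pXXXX ping  # confirm the database answers on the VIP again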
13. MySQL services on secondary server go offline
Test Status:
- beta server is the designated primary server
- alpha server is online
Action:
13.1 Stop mysql services on ‘beta’ (/etc/init.d/mysqld stop)
Checklist:
13.2 Monitoring detects and reports database loss (while connectivity is still available)
13.3 Automated failover occurs
13.4 Alpha server matches primary server configuration
13.5 Beta server matches secondary server configuration
14. Load Testing during failure
Test Status:
- alpha server is the designated primary server
- beta server is online
Action:
14.1 Aggressive load testing against the database server (a load-generation sketch follows this checklist)
14.2 MySQL killed without prejudice (killall -9 mysqld_safe mysqld)
Checklist:
14.3 Monitoring detects and reports mysql service loss
14.4 Automated failover occurs
14.5 Beta server matches primary server configuration
14.6 Alpha server matches secondary server configuration
14.7 Beta MySQL error log shows a forced MySQL recovery
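For step 14.1, mysqlslap (shipped with MySQL 5.1 and later) is one way to generate sustained mixed load without writing a custom driver. The connection details and volumes below are placeholders to be tuned for the environment.

# Terminal 1: sustained mixed read/write load through the VIP
mysqlslap --host=192.0.2.100 --user=loadtest --password=XXXX \
  --concurrency=50 --iterations=10 --auto-generate-sql \
  --auto-generate-sql-load-type=mixed --number-of-queries=100000

# Terminal 2: step 14.2, on the current primary
ssh alpha 'killall -9 mysqld_safe mysqld'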
15. Forced Recovery
Test Status:
- alpha server is the designated primary server
- beta server is online
Action:
15.1 A manual full database backup is taken (in case the recovery does not work); the hosting provider is not told of this. A backup/recovery sketch follows this checklist.
15.2 A dummy new table/schema is created (used as a verification point)
15.3 The database on the alpha primary server is dropped
15.4 The hosting provider is notified and asked for a full database recovery, including point-in-time recovery to just before the drop (no time given, only the command that was run)
Checklist:
15.5 Site is marked as unavailable
15.6 Hosting provider restores data from backup and recovers to the point in time
15.7 Confirmation that the new table/schema is restored, and the full schema is available
15.8 Site is made available
15.9 Total time for the full disaster recovery exercise is recorded
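A sketch of what 15.1 and the provider’s point-in-time recovery might look like in practice. This assumes InnoDB (per the MyISAM warning above) and that binary logging is enabled; the paths, binlog file name and stop time are purely illustrative.

# 15.1 - safety backup, taken quietly before notifying the provider
mysqldump --all-databases --single-transaction --master-data=2 --routines > /backup/pre_test_full.sql

# Provider-side recovery, conceptually: restore the last full backup, then replay the
# binary logs up to just before the DROP (by datetime or log position)
mysqlbinlog --stop-datetime="YYYY-MM-DD HH:MM:SS" /data/mysql/mysql-bin.000042 | mysql -u root -p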
Conclusion
This is not an exhaustive test plan; it is a documented approach offered to show a client what the minimum testing should be. As no dry run actually occurred, there may be inaccuracies, and additions may be necessary when this document is first executed. I would need access to an appropriate configuration in order to perform the level of testing needed to complete it.
About the Author
Ronald Bradford provides Consulting and Advisory Services in Data Architecture, Performance and Scalability for MySQL Solutions. An IT industry professional for two decades with extensive database experience in MySQL, Oracle and Ingres, his expertise covers data architecture, software development, migration, performance analysis and production system implementations. His knowledge from 10 years of consulting across many industry sectors, technologies and countries has provided unique insight into providing solutions to problems. For more information, Contact Ronald.