Databases fail. No one can promise 100% uptime; it’s impossible. Whether a database is large or small, on-premises or cloud-based, it has the potential to fail. Failures can take the form of transaction errors, system crashes, out-of-memory errors, or out-of-disk-space errors. Sometimes databases fail suddenly, and sometimes they just can’t cope with growing demand and “fail slowly” over a period of time.
The list of causes of database failure is long and includes:
- Application code changes
- Workload change as the user base grows or shifts
- Hardware failure or change (Spectre and Meltdown patches, take a bow)
- Database version upgrades
- Configuration changes made to accommodate a new architecture or improve performance
- Configuration assumptions made for an old workload that no longer hold
What happens when a database fails?
- Data loss
- Loss of productivity
- Other systems can be negatively affected
- Poor user experience as entire systems fail slowly
What can you do?
Be prepared. Failures are caused by changes, some that you control and others that you don’t. It's not about preparing for the apocalypse; it's about keeping your application at its best every day. Optimization is an ongoing process that should never stop.
If you're making changes that could result in failure, we suggest you:
- Set up monitoring on all essential systems
- Test in stages
- Do gradual rollouts
- Have a plan to roll back if changes cause problems
- Back up and snapshot systems regularly
Real-world example: One of our long-standing customers, a high-profile online retailer, recently shared their story. A software developer made a change to a structure, causing the system to slow down. During a surge of holiday season shopping, the system crashed. They were notified immediately and turned to SolarWinds® Database Performance Monitor (DPM) to compare the environment and pinpoint the change among thousands of queries. They rectified the issue and were back up and running in minutes.
Another example from our own environment: At SolarWinds, we are constantly ingesting metrics data from our customers’ monitored environments. If our database inserts become too slow, our data pipeline backs up and other systems are affected. The end result becomes visible to users as delayed data in dashboards. Catching up becomes increasingly difficult the longer the performance degradation drags on. The monitoring that we have in place enables us to identify the root cause of a problem before it blows up and has cascading effects.
If you expect things outside of your control to change and potentially cause failure, we suggest you:
- Back up and archive regularly and consistently
- Test backups
- Practice strategies by introducing controlled failures
- Set up alerts
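To make the "Test backups" item concrete, here is a minimal sketch of an automated backup check. It is an illustration only: it uses SQLite from the Python standard library as a stand-in for whatever database you run, and the function names (`backup_and_verify`, `demo`), the `orders` table, and the file names are all hypothetical. The idea is simply that a backup is not trusted until the copy has been opened and checked.

```python
import os
import sqlite3
import tempfile


def backup_and_verify(source_path: str, backup_path: str, table: str) -> bool:
    """Copy a SQLite database, then check that the copy is actually usable."""
    source = sqlite3.connect(source_path)
    backup = sqlite3.connect(backup_path)
    try:
        # Connection.backup copies the whole source database into the target.
        source.backup(backup)

        # Check 1: SQLite's own consistency check must report "ok".
        integrity = backup.execute("PRAGMA integrity_check").fetchone()[0]

        # Check 2: the backup must hold as many rows as the source.
        # (table is a trusted, hard-coded name in this sketch, not user input.)
        count_sql = f"SELECT COUNT(*) FROM {table}"
        source_rows = source.execute(count_sql).fetchone()[0]
        backup_rows = backup.execute(count_sql).fetchone()[0]

        return integrity == "ok" and source_rows == backup_rows
    finally:
        source.close()
        backup.close()


def demo() -> bool:
    """Build a small throwaway database and run the backup check on it."""
    with tempfile.TemporaryDirectory() as workdir:
        source_path = os.path.join(workdir, "orders.db")
        backup_path = os.path.join(workdir, "orders_backup.db")

        conn = sqlite3.connect(source_path)
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
        conn.executemany(
            "INSERT INTO orders (total) VALUES (?)",
            [(9.99,), (25.0,), (4.5,)],
        )
        conn.commit()
        conn.close()

        return backup_and_verify(source_path, backup_path, "orders")


if __name__ == "__main__":
    print("backup verified:", demo())
```

In practice a check like this would run on a schedule, and a `False` result would feed whatever alerting you already have in place. A row count is a deliberately cheap comparison; a stricter check might compare checksums or replay a few known queries against the restored copy.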
Be ready. Things are going to break, and you need to be prepared for the sake of your users, your business, and your team. Invest in the necessary tools for high availability, monitoring, and backups. Start a free trial here to see how SolarWinds DPM can help you today!