Software Updates in the IT Stack: (Not) a Game of Jenga
June 14, 2018
In my last post, I talked about upgrading storage systems; more specifically, how to determine the right upgrade interval for them. What that post did not cover is that your storage system is part of a larger ecosystem. It attaches to a SAN and LAN, and is accessed by clients. Technologies such as backup, snapshots, and replication protect the data on it. Monitoring systems ensure you are aware of what is going on with all the individual components. This list can go on and on…
Each component in that ecosystem will receive periodic updates. Some components are relatively static: a syslog server may receive a few patches, but it is unlikely that these patches fundamentally change how the syslog server works. Moreover, in case of a bad update, it is relatively easy to roll back to a previous version. As a result, it is easy to keep a syslog server up to date.
Other components change more frequently or see larger feature changes between releases. For example, hyper-converged infrastructure, still a growing market, receives many new features to make it attractive to a wider audience. Upgrading these systems is more of a gamble: new features might break existing functionality that your peripheral systems rely on.
Finally, do not forget the systems that sit like a spider in the web, such as hypervisors. They run on hardware that needs to be on a compatibility list. Backup software talks to them, using snapshots or features like Changed Block Tracking to create backups and restore data. Automation tools talk to them to create new VMs. On top of that, the guest OS in each VM receives virtual hardware and tools upgrades. These systems are harder to upgrade simply because so many aspects of the software are exposed to other components of the ecosystem.
So how can you keep this ecosystem healthy, without too many “Uh-oh, I broke something!” moments with the whole IT stack collapsing like a game of Jenga?
Reading, testing, and building blocks
Again, read! Release notes, compatibility lists, advisories, etc. Do not just look for changes in the product itself, but also for changes to APIs or peripheral components. A logical drawing of your infrastructure helps: visualize which systems talk to which other systems.
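A drawing works well for humans; for larger environments it can also help to keep a small machine-readable map of the same information, so you can answer “what talks to this system?” before you touch it. A minimal sketch in Python, with invented component names:

```python
# Hypothetical dependency map: which systems talk to which.
# All component names below are invented examples, not a real environment.
dependencies = {
    "hypervisor": ["storage-array", "san-fabric", "backup-server", "automation"],
    "backup-server": ["hypervisor", "storage-array"],
    "storage-array": ["san-fabric", "syslog-server"],
    "monitoring": ["hypervisor", "storage-array", "san-fabric", "backup-server"],
}

def impacted_by(system: str) -> list:
    """Everything that talks to `system`, and may break when it is upgraded."""
    return [src for src, targets in dependencies.items() if system in targets]

print(impacted_by("storage-array"))
# -> ['hypervisor', 'backup-server', 'monitoring']
```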
Next is testing. A vendor tests upgrade paths and compatibility, but no environment is exactly like your own, so test diligently. If you cannot afford a test environment, then at the very least test your upgrades on a production system of lesser importance. After the upgrade, test again: does your backup still run? No errors? At one customer we had a “no upgrades on Friday afternoon” policy: it avoids having to work through the weekend to fix issues, or missing three backups because nobody noticed something was broken.
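Part of that “test again after the upgrade” routine can be codified as a simple smoke test you run after every change. A minimal sketch, assuming you wrap your own backup and monitoring tooling in the placeholder check functions:

```python
# Hypothetical post-upgrade smoke test. The check functions are placeholders:
# replace their bodies with calls into your own backup and monitoring tooling.
def last_backup_succeeded() -> bool:
    # e.g. query the backup server's API for the status of the last job
    return True  # placeholder result

def no_new_alerts_since_change() -> bool:
    # e.g. ask the monitoring system for alerts raised after the upgrade window
    return True  # placeholder result

CHECKS = {
    "last backup job succeeded": last_backup_succeeded,
    "no new alerts since the change": no_new_alerts_since_change,
}

def run_smoke_tests() -> bool:
    all_ok = True
    for name, check in CHECKS.items():
        ok = check()
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
        all_ok = all_ok and ok
    return all_ok

if __name__ == "__main__":
    run_smoke_tests()
```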
As soon as you find the ideal combination of software and hardware versions, create a building block out of it. TOGAF can help here: it is an adaptable framework for IT architecture that you can tailor to your own needs and capabilities. Moreover, you do not need to adopt “all of it” before you can reap the benefits: just pick the parts you like.
Let us assume you want to run an IaaS platform. It consists of many systems: storage, SAN, servers, hypervisor, LAN, etc. You have read the HCLs and done the required testing, so you are certain that a combination of products works for you. Whatever keeps your VMs running! This is your solution building block.
Some components in the solution building block need careful specification: for example, Cisco UCS firmware 3.2(2d) with VMware ESXi 6.5U1 needs FNIC driver Y. Others can be specified more loosely: syslogd, any version.
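That distinction between tightly and loosely pinned components is easy to write down as data. A hypothetical sketch of the building block above (the structure is my own illustration, not a formal standard; “driver Y” stays a placeholder):

```python
# Hypothetical solution building block: pinned components must match exactly,
# loosely specified components accept any installed version.
iaas_block = {
    "name": "iaas-compute",
    "components": {
        "vmware-esxi":  {"version": "6.5U1",   "pinned": True},
        "ucs-firmware": {"version": "3.2(2d)", "pinned": True},
        "fnic-driver":  {"version": "Y",       "pinned": True},   # placeholder version from the text
        "syslogd":      {"version": "any",     "pinned": False},
    },
}

def complies(block: dict, installed: dict) -> bool:
    """Do the installed versions match the building block definition?"""
    for name, spec in block["components"].items():
        if spec["pinned"] and installed.get(name) != spec["version"]:
            return False
    return True
```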
Next, track the life cycle of these building blocks, starting with the one you are currently running in production: the active standard. Think ESXi 6.5U1 on UCS blades with firmware 3.2(2d), on a VMAX with SRDF/Metro for replication and Veeam for backup and recovery. Again, specify versions or version ranges where required.
You might also be testing a new building block: the proposed or provisional standard. That could use newer software versions (like vSphere 6.7) or different components. It could even be completely different and use hyper-converged infrastructure such as VxRail.
Finally, there are the old building blocks, either phasing-out or retired. The difference between these states is the amount of effort you put into upgrading or removing them from your landscape. A building block with ESXi 5.5 could have been “phasing-out” in late 2017, meaning you would not deploy new instances of it, but you would not actively retire it either. Now, with the end of life of ESXi 5.5 around the corner, that building block should transition to retired: you need to remove it from your environment because it is an impending support risk.
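Keeping track of those states does not need heavy tooling; a small register that you review periodically is enough. A hypothetical sketch, reusing the examples from this post:

```python
# Hypothetical life cycle register for building blocks.
# States move from proposed -> active -> phasing-out -> retired.
building_blocks = [
    {"name": "iaas-esxi65-ucs", "state": "active",
     "note": "ESXi 6.5U1, UCS 3.2(2d), VMAX with SRDF/Metro, Veeam"},
    {"name": "iaas-vsphere67",  "state": "proposed",
     "note": "newer software versions, still under test"},
    {"name": "iaas-esxi55",     "state": "phasing-out",
     "note": "retire before ESXi 5.5 end of support"},
]

def deployable(block: dict) -> bool:
    """Only the active standard should be used for new deployments."""
    return block["state"] == "active"

for block in building_blocks:
    print(block["name"], "->", "deployable" if deployable(block) else "do not deploy")
```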
By doing the necessary legwork before you upgrade, and by documenting the software and hardware versions you use, upgrades should become less of a game of Jenga where one small upgrade brings down the entire stack.