In recent years the Azure Stack team has made some really good improvements to Patch and Update (PnU) of Azure Stack Hubs. When the PnU process kicks off, guest VMs are moved between the nodes so that a node can be put into maintenance and upgraded.
Over the last couple of years I have noticed that during this migration a small percentage of VMs sometimes become unresponsive (or reboot). There can be many reasons for this, but something I hear often is that Linux VMs do this more often than Windows VMs. This reminded me of an all-time virtualization classic: the disk timeout.
During a migration, and especially a storage migration, there is always a moment when the “owner node” of the VM disk changes. This causes a small hiccup in storage I/O, but within the OS things get cached. Usually it takes a second or two and it is business as usual afterwards. But sometimes it takes longer. This can have several causes (resource issues/overcommitting, hypervisor performance, disk I/O, …). Azure Stack Hub orchestrates the live migration for you, and for now an operator has no control over it. So we need to look at what we can do in this case.
Disk timeout values for Windows and Linux
The default disk timeout value in Windows is 60 seconds. You can see this in the following registry setting: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue
The default disk timeout on most Linux distributions is 30 seconds.
See this VMware KB article: https://kb.vmware.com/s/article/1009465
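To check what a Linux guest is currently using, you can read the timeout that the kernel exposes per disk in sysfs. The snippet below is a minimal sketch, assuming SCSI-style disks that have a /sys/block/<device>/device/timeout file. The function name read_disk_timeouts and the sys_block parameter are my own, made up for illustration. Devices without that file (loop devices, for example) are skipped.

```python
from pathlib import Path


def read_disk_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_in_seconds} for disks that expose a timeout.

    read_disk_timeouts and sys_block are illustrative names, not part of any tool.
    """
    root = Path(sys_block)
    timeouts = {}
    if not root.is_dir():
        # Not a Linux system, or sysfs is not mounted here.
        return timeouts
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        # Devices without a device/timeout file are skipped.
        if timeout_file.is_file():
            timeouts[dev.name] = int(timeout_file.read_text().strip())
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_disk_timeouts().items():
        print(f"{name}: {seconds}s")
```

Run inside the Linux guest, this prints one line per disk with the timeout in seconds.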
So a possible solution for the rebooting or unresponsive Linux VMs is to increase the disk timeout value on Linux to 60 seconds, so that a storage pause of more than 30 seconds during the migration is less likely to be treated as a failed disk. Of course this can impact other things as well. Be especially careful with this on I/O-heavy machines.
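As a rough illustration of raising the timeout, the sketch below writes a new value to the same sysfs file the kernel uses for the per-disk timeout. It assumes a SCSI-style disk with a /sys/block/<device>/device/timeout file and needs root privileges. The function name set_disk_timeout and its parameters are hypothetical. A value written this way does not survive a reboot, so a permanent change needs something like a udev rule or the mechanism your distribution provides.

```python
from pathlib import Path


def set_disk_timeout(device, seconds=60, sys_block="/sys/block"):
    """Set the disk timeout for one device and return the previous value.

    set_disk_timeout, device and sys_block are illustrative names.
    Writing to the real /sys/block requires root.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_file = Path(sys_block) / device / "device" / "timeout"
    # Remember the old value so the change can be rolled back if needed.
    previous = int(timeout_file.read_text().strip())
    timeout_file.write_text(f"{seconds}\n")
    return previous


if __name__ == "__main__":
    # "sda" is only an example device name; adjust it to your own disk.
    old = set_disk_timeout("sda", 60)
    print(f"sda: timeout changed from {old}s to 60s")
```

Keeping the previous value around makes it easy to revert if the longer timeout turns out to hurt an I/O-heavy workload.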
Let me know what you think and if you have other ideas on how to solve this. You can reach me here on Twitter.