Patch and Update Automation Tool for Dell EMC VxRack Azure Stack 14G

Update: I want to update this blog with information I have since received. It looks like the bits I downloaded on the day this tool was released were the beta version. Dell EMC has since corrected this, and the GA release of the product is now available. Some of the issues I describe below would most likely have been addressed in the GA version. I am looking forward to testing the new release and to the possibilities this tool will bring as new features are added.

A few months ago I wrote a blog about patching and updating the Dell EMC VxRack Azure Stack 14G. A few things have changed since then, thanks to a new tool Dell EMC recently released. On November 8th, Dell EMC released the Patch and Update Automation Tool. Taken from their description, “The Dell EMC Patch and Update Automation Tool uses a wizard interface that automates and facilities installing Dell EMC updates for the Dell EMC Cloud for Microsoft Stack solution.” Having already updated their Azure Stack solution manually, I jumped at the chance to try out this new automation tool. There wasn’t anything wrong with the old way of patching the Dell EMC Azure Stack solution; it worked, but everything was manual and it took time. This blog is based on my experience so far with the new tool and with the existing Dell EMC PNU. The tool itself seems to be a work in progress, but it is definitely headed in the right direction.

Previous Update Process

First, let’s talk a little about the older process at a high level. The first thing you had to do was sign in to the Dell EMC support site and download the most recent Cloud For Microsoft Azure Stack VxRackAS Customer Toolkit, which normally lined up with the most recent Microsoft Azure Stack PNU. Once it was downloaded, you performed various tasks, such as updating the HLH with the included Cumulative Windows Update and doing the same for the two VMs that run on the HLH, the OME VM and the OMNM VM. Depending on the package, there were also some updated policies to install.

A little later, Dell EMC introduced a way to update the firmware on all the hosts through the Azure Stack PNU process. This was pretty simple: you followed the same process you would use to manually upload the Microsoft PNU into your Azure Stack. You uploaded about three files to the update storage account, the Azure Stack Update Resource Provider processed those packages, and soon an update from Dell EMC appeared in your Update blade. You kicked off the update and, just like the Microsoft PNU, came back a few hours later to a successful update of your Azure Stack OEM hardware.

At the time there was no way to update the firmware on the TOR switches or the BMC switch. I also don’t recall a way to update Dell EMC OpenManage Network Manager through the old process. The Customer Toolkit did include the update for Dell EMC OpenManage Essentials and, I think, an update to Support Assist Enterprise, which runs on the OME-VM. For someone documenting these processes for our Operations Teams and future clients, the process meant a lot of hours prepping and installing. It also meant a few reboots of the Azure Stack nodes, which required a decent maintenance window. Add the Microsoft PNU and you need a good number of working hours to keep your Azure Stack updated and current.

The Patch and Update Tool

Now, the new Dell EMC Patch and Update tool looks really promising. From my experience with it tonight, I think it still has some maturing to do, but for a 1.0 release it comes out pretty strong. The one thing I didn’t find was documentation for the tool. Then again, I didn’t look too hard and, as with most things (non-prod), I just jumped right in. The first thing you want to do is log on to the Dell EMC support site and download the tool as well as the most recent Customer Toolkit, which at the time of this blog was 1809-4. The tool needs to be installed on the OME-VM in order to update all the nodes, the HLH, and the Windows OS on both the OME-VM and the OMNM-VM. I didn’t see anything about updating the OME and OMNM applications themselves or any of the network devices at this time. I will contact someone at Dell EMC to verify this and to find out when those are on the roadmap for this tool.

So, our first step is to download the tool and the Customer Toolkit. I downloaded the installer to the DellEMC-Updates folder and ran it from there. For the Customer Toolkit, I created a new directory structure, C:\DellEMC\PNU\Toolkit, and downloaded the zip file there.
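If you script the prep work, a sketch like the following creates the Toolkit folder and checks that the downloaded zip is a readable archive before you point the tool at it. This is a hypothetical Python helper of my own, not part of the Dell EMC tool; the folder layout mirrors the one described above, and the function names are assumptions.

```python
from __future__ import annotations

import zipfile
from pathlib import Path


def prepare_toolkit_folder(root: Path) -> Path:
    """Create <root>/PNU/Toolkit (including missing parents) and return its path."""
    toolkit = root / "PNU" / "Toolkit"
    toolkit.mkdir(parents=True, exist_ok=True)
    return toolkit


def find_valid_zips(folder: Path) -> list[Path]:
    """Return the .zip files in folder that open cleanly and pass a CRC check."""
    valid = []
    for candidate in sorted(folder.glob("*.zip")):
        if not zipfile.is_zipfile(candidate):
            continue  # e.g. an interrupted or mislabeled download
        with zipfile.ZipFile(candidate) as archive:
            # testzip() returns the first corrupt member's name, or None if all are intact
            if archive.testzip() is None:
                valid.append(candidate)
    return valid
```

On the OME-VM this would be called as `find_valid_zips(prepare_toolkit_folder(Path(r"C:\DellEMC")))`; an empty list means the download should be repeated.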

The installation of the Patch and Update Automation tool went fairly fast.  There will be a new icon on your desktop.

After the installation, the first thing you see when starting the Patch and Update tool is a summary of the steps available in the current release. Just select the check box to accept the terms and click Accept.

Configuration:  Profile Tab

The next window is the Configuration window. Since I was in a rush and didn’t look for documentation, I went ahead and took a stab at each field under Profile, Workflow, and Settings. You will need two files on hand: ConfigurationData.json and DeploymentData.json, which the Dell EMC engineer who installed the system should have given you. I placed these in a directory called Azure Stack Stamp Information.
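Since the Profile tab depends on these two files, it can help to confirm they are present and parse as JSON before launching the wizard. A minimal Python sketch, assuming both files sit together in one folder (such as the Azure Stack Stamp Information folder); `load_stamp_files` is a hypothetical name, and it only checks that the files load, not that their contents are what the tool expects.

```python
import json
from pathlib import Path

STAMP_FILES = ("ConfigurationData.json", "DeploymentData.json")


def load_stamp_files(folder: Path) -> dict:
    """Load both stamp files, failing early with a clear message if one is missing or invalid."""
    loaded = {}
    for name in STAMP_FILES:
        path = folder / name
        if not path.is_file():
            raise FileNotFoundError(f"{name} not found in {folder}")
        try:
            # utf-8-sig also accepts files saved with a byte-order mark, common on Windows
            loaded[name] = json.loads(path.read_text(encoding="utf-8-sig"))
        except json.JSONDecodeError as err:
            raise ValueError(f"{name} is not valid JSON: {err}") from err
    return loaded
```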

I filled in the rest of the fields with information I already had. A disclaimer: I am not sure whether Dell EMC hands out all of the information needed here, though I can’t see why they wouldn’t.

The Service Admin Id is actually the admin account for the Azure subscription your Azure Stack is registered with, so something like admin@tenant.onmicrosoft.com, along with that account’s password. You may have had a dedicated Service Admin account created and used during installation. I didn’t have a separate account, so I used the default admin account for the subscription.

The only other field on the Profile tab I had questions about at first was the Firmware SMB Share Host User. I found that hovering the mouse over each field shows a tooltip with useful information. This field just needs an account with access to the machine you are running the tool from, so the tool can grant permissions on the share it creates.

I already had the information for the remaining fields. One note: for the ERCS Endpoint User, make sure you enter it in the format azurestack\username.
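If you keep these values in a script or a notes file, a small helper can enforce the azurestack\username format. This is a hypothetical Python sketch; `ercs_user` is my own name, and it only checks for a domain prefix, it does not validate the account against the ERCS endpoint.

```python
def ercs_user(name: str, domain: str = "azurestack") -> str:
    """Return the account as DOMAIN\\user, adding the domain prefix only if it is missing."""
    name = name.strip()
    if not name:
        raise ValueError("user name is empty")
    if "\\" in name:
        return name  # already qualified, leave it untouched
    return f"{domain}\\{name}"
```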

Configuration:  Work Flow Tab

The next tab is the Workflow tab. All of its fields were already populated except the path of the DellEMC Customer Toolkit folder. I added that, as well as the OMNM IP Address. Note: every time I came back to this tab, the OMNM IP Address field was blank again. I am not sure whether that is a known bug or intended behavior.

Configuration:  Settings Tab

The next tab is the Settings tab. I didn’t change anything here and just clicked Next.

Status:  Scale Unit Firmware Upgrades

The Status section offers two views: a Graphical View and a Console View. The graphical view doesn’t really show what is going on, so I prefer the Console View, which shows the up-to-date logs.

The Status process starts by running a VxRack Azure Stack pre-checker to make sure the tool can communicate with all the nodes, the iDRACs, and the Windows servers it will be updating. I ran into an issue where the tool couldn’t open a remote session to the HLH, and the pre-checker failed.

Log Name : DellEMC MHC PNU Log
Host Name : DELLEMC-OMEVM
Level : Error
Logged : 11/11/2018 17:45:50
Details : Unable to complete all updates
[10.99.2.30] Connecting to remote server 10.99.2.30 failed with the following error message : Access is denied. For more information, see the about_Remote_Troubleshooting Help topic.
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ExecuteStep(ExecutableStep step)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ExecuteStepAndUpdate(Int32 i, ExecutableStep step)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ExecuteSteps(IReadOnlyList`1 steps)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.Start()
at DellEmc.MHC.PNU.Automation.StateManagement.InProgressState.StartWorkflow()
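When the Console View scrolls past, it can be handy to pull entries like the one above out of a saved copy of the log. The following Python sketch assumes the plain-text layout shown here (Log Name, Host Name, Level, Logged, Details, then continuation lines); the class and function names are hypothetical, and the Logged timestamp is kept as a string because its date format isn’t documented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Single-value header fields and the attribute each one fills
FIELD_ATTRS = {
    "Log Name": "log_name",
    "Host Name": "host_name",
    "Level": "level",
    "Logged": "logged",
}


@dataclass
class PnuLogEntry:
    log_name: str = ""
    host_name: str = ""
    level: str = ""
    logged: str = ""  # kept as text; the date format is not documented
    details: List[str] = field(default_factory=list)


def parse_pnu_log(text: str) -> List[PnuLogEntry]:
    """Split exported log text into entries; each 'Log Name :' line starts a new one."""
    entries: List[PnuLogEntry] = []
    current: Optional[PnuLogEntry] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(" : ")
        if sep and key == "Log Name":
            current = PnuLogEntry(log_name=value)
            entries.append(current)
        elif current is None:
            continue  # ignore anything before the first entry
        elif sep and key in FIELD_ATTRS:
            setattr(current, FIELD_ATTRS[key], value)
        elif sep and key == "Details":
            current.details.append(value)
        else:
            # message lines and stack-trace lines that follow Details
            current.details.append(line)
    return entries
```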

I started troubleshooting and verified that the account I had entered in the configuration section worked; I could RDP to the HLH server with that account just fine. After some more troubleshooting, I decided to make a configuration change on the HLH. I am not sure whether Dell EMC allows this, but I did it anyway to get the tool to run a successful pre-check. I will take the slap on the wrist later and revert the change if needed. What I ended up doing was running winrm quickconfig. It asked whether to grant administrative rights remotely to local users. Since every account we use here is a local user, I answered Yes and let it make the change. Again, I still need to verify with Dell EMC whether this was OK, so if you are following this blog, I suggest you do the same.

PS C:\> winrm quickconfig
WinRM service is already running on this machine.
WinRM is not set up to allow remote access to this machine for management.
The following changes must be made:

Configure LocalAccountTokenFilterPolicy to grant administrative rights remotely to local users.

Make these changes [y/n]? y

WinRM has been updated for remote management.

Configured LocalAccountTokenFilterPolicy to grant administrative rights remotely to local users.
PS C:\>

I was able to make a remote connection now from the OME-VM to the HLH.  I then restarted the Pre-checker and it went through without any other issues.  At this point, I clicked the Update button.
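An “Access is denied” message means the connection reached WinRM and was then refused, which is different from the port being unreachable. Before re-running the pre-checker, a quick TCP check like this Python sketch can separate the two cases. It assumes the default WinRM HTTP listener on port 5985 (5986 is the HTTPS default), and `port_open` is a hypothetical helper; it does not test credentials or the LocalAccountTokenFilterPolicy setting.

```python
import socket

WINRM_HTTP_PORT = 5985  # default WinRM HTTP listener; HTTPS defaults to 5986


def port_open(host: str, port: int = WINRM_HTTP_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, host unreachable, or timed out
        return False
```

Run from the OME-VM, `port_open("10.99.2.30")` returning True while the tool still reports Access is denied points at authorization rather than networking.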

At this point, you can switch between the Graphical View and the Console view again.  I think the Console view shows more detailed information so a majority of the time I had it in the Console view.

As the process runs, you can easily track what is happening. In the first few stages, the tool puts the first node into maintenance mode and drains it. It then runs a series of verifications to make sure the node is safe to reboot and update its firmware and drivers. From what I have watched, it takes an hour or more per node on average. You can also go into the Admin Portal and see that the current node is in maintenance mode.

The only other thing I noticed was the critical alerts I received saying the node was offline and unavailable for virtual machine placement. I will need to reach out to Microsoft to find out why we get these alerts when a node is in maintenance mode.

I did run into an issue when the tool tried to update the firmware on node 4. The tool put the node into maintenance mode, as it had done for nodes 1-3 without an issue, but then aborted because storage jobs were still running.

Log Name : DellEMC MHC PNU Log
Host Name : DELLEMC-OMEVM
Level : Error
Logged : 11/11/2018 23:51:45
Details : 11/11/2018 23:51:45 [ERROR] Storage jobs are running and not completed. Aborting firmware upgrade

Log Name : DellEMC MHC PNU Log
Host Name : DELLEMC-OMEVM
Level : Error
Logged : 11/11/2018 23:51:45
Details : 11/11/2018 23:51:45 [ERROR] Manual intervention may be required to resume the node. For additional information please refer the Patch and Update Automation documentation.

The fix for me was to go to the Admin Portal, resume node 4, and then go back to the update tool and click the Retry button. It restarted the process at node 4 and didn’t have any further issues. After a few more hours of uneventful firmware updates, the process moved on to updating the HLH firmware, drivers, and OS. Overall, a pretty painless update of the nodes themselves.

Status:  HLH Firmware, Drivers, and OS Upgrade

So the question in my head was how the tool would update the HLH without killing itself, since it runs on the OME-VM, which is hosted on the HLH. I had my answer shortly after the HLH’s first reboot following the firmware installation. The OME-VM came back online but reported that the machine had shut down unexpectedly. My guess is that because I was logged on, Hyper-V couldn’t shut down the guest cleanly, and the VM went down when the HLH rebooted. When the OME-VM came back up, I started the tool again and it resumed at the same stage where it left off. The process ran for a few minutes and then failed. The logs showed that the tool again couldn’t open a remote session to the HLH. I tested it myself and again got Access Denied when trying to remote to the HLH, even though I could RDP to the box, so I knew the account worked.

Log Name : DellEMC MHC PNU Log
Host Name : DELLEMC-OMEVM
Level : Error
Logged : 12/11/2018 09:18:14
Details : Unable to complete Step :Upgrade HLH Firmware( 5)
[WIN-CBGILT1503D] Connecting to remote server WIN-CBGILT1503D failed with the following error message : Access is denied. For more information, see the about_Remote_Troubleshooting Help topic.

There seems to be a policy on the HLH that disables LocalAccountTokenFilterPolicy, which grants administrative rights remotely to local users. I once again applied the fix from above by running winrm quickconfig and allowing local accounts administrative rights remotely. I resumed the tool’s job, and the workflow started again at the point where it had failed. However, shortly afterward the tool failed again, and every time I clicked Retry it kept failing, this time with the following error message:

Log Name : DellEMC MHC PNU Log
Host Name : DELLEMC-OMEVM
Level : Error
Logged : 12/11/2018 09:25:37
Details : Unable to complete all updates
Step Execution Failed
A specified logon session does not exist. It may already have been terminated.
A specified logon session does not exist. It may already have been terminated.
A specified logon session does not exist. It may already have been terminated.
A specified logon session does not exist. It may already have been terminated.
A specified logon session does not exist. It may already have been terminated.

at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ExecuteStep(ExecutableStep step)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ResumeExecuteStep(ExecutableStep step, Queue`1 currentStepStack)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ResumeExecuteStepAndUpdate(Int32 i, ExecutableStep step, Queue`1 currentStepStack)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ResumeExecuteSteps(IReadOnlyList`1 steps, Queue`1 currentStepStack)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ResumeExecuteStep(ExecutableStep step, Queue`1 currentStepStack)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ResumeExecuteStepAndUpdate(Int32 i, ExecutableStep step, Queue`1 currentStepStack)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.ResumeExecuteSteps(IReadOnlyList`1 steps, Queue`1 currentStepStack)
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.Retry()
at DellEmc.MHC.PNU.Automation.Execution.Engine.WorkFlowExecutor.Retry(Workflow workFlow)
at DellEmc.MHC.PNU.Automation.StateManagement.InProgressState.RetryWorkflow()

Note: I think that once we got around the HLH remote connection issues, this wouldn’t have been a problem.

At this point there wasn’t much I could do except exit the tool and hope a restart would work. So I closed the tool and started the entire process again, but I ran into the same issues with the same errors. I opened a Service Request with Dell EMC support and within a few hours was on a support call. After troubleshooting for a while, we disabled a security policy on the HLH, and the tool was able to connect again and continue with the driver installation. We knew the credentials worked because we could see in the logs the tool connecting and verifying that the system needed drivers. The issues above only appeared when the tool tried to reconnect.

The security policy we disabled on the HLH was “Network access: Do not allow storage of passwords and credentials for network authentication.”

We had no issues with the driver installation after disabling that setting. The HLH rebooted again and the third task started: updating the HLH Windows OS. I did have to change the WinRM settings again for the tool to connect. The Windows OS update on the HLH went well, but when the tool came back up it was stuck on “Waiting for WinRM service to start….” while trying to reconnect to the HLH. I killed the update tool and started the process again. The workflow picked up where it had hung, and I now have a fully patched HLH server with updated drivers and firmware. It wasn’t smooth, but at least we worked through the issues fairly quickly.
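When the tool sits on “Waiting for WinRM service to start….”, it helps to know whether the HLH’s WinRM listener ever came back. This Python sketch polls a TCP port until a deadline; `wait_for_port` is a hypothetical helper, and the 5985 default, the 15-minute timeout, and the 10-second interval are assumptions to adjust. An open port still doesn’t guarantee that authentication will succeed.

```python
import socket
import time


def wait_for_port(host: str, port: int = 5985, timeout: float = 900.0, interval: float = 10.0) -> bool:
    """Poll host:port until a TCP connection succeeds; give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # short per-attempt timeout so one hung attempt can't eat the whole budget
            with socket.create_connection((host, port), timeout=min(interval, 5.0)):
                return True
        except OSError:
            pass  # not listening yet
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval, remaining))
```

If this returns True while the tool is still waiting, restarting the tool (as I did) is a reasonable next step; if it returns False, the problem is on the HLH side.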

Status:  OMNM VM Upgrade

The update to the OMNM VM went pretty well. Again, we had to run winrm quickconfig to configure LocalAccountTokenFilterPolicy to grant administrative rights remotely to local users. The workflow installed the updates, rebooted, and continued on to the OME VM without any issues. One question I have, which I will include in the Questions Needed Answered section below, is whether this tool will one day update OpenManage Network Manager along with patching the Windows OS. The tool also doesn’t update any of the network devices, such as the TOR switches and the BMC switch. That would be a nice addition in future releases.
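Because the same few errors kept coming back on different machines, it may be worth keeping a lookup from error text to the fix that worked. This Python sketch encodes only the three errors and fixes from this run; the `REMEDIES` table and `suggest_remedy` are hypothetical, and the fixes are what worked in my environment, not official Dell EMC guidance.

```python
from typing import Optional

# (substring seen in the log Details, fix that worked during this run)
REMEDIES = (
    ("Access is denied",
     "Run 'winrm quickconfig' on the target and allow LocalAccountTokenFilterPolicy; "
     "repeat after reboots if the setting reverts."),
    ("Storage jobs are running",
     "Resume the node in the Admin Portal, then click Retry in the tool."),
    ("A specified logon session does not exist",
     "Check the HLH policy 'Network access: Do not allow storage of passwords and "
     "credentials for network authentication' (disabled here with Dell EMC support)."),
)


def suggest_remedy(details: str) -> Optional[str]:
    """Return the first remedy whose trigger text appears in details, or None."""
    for trigger, remedy in REMEDIES:
        if trigger in details:
            return remedy
    return None
```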

Status:  OME VM Upgrade

This server was pretty simple as well. Since this is the box the tool runs on, we didn’t run into any security policy issues, and the tool updated the server OS without problems. Just like with the OMNM VM, I have questions about the roadmap for this tool: will it one day update OpenManage Essentials and Support Assist Enterprise? At this time it appears to update only the Windows OS. That would be a nice addition in future releases.

Summary

In the end, we now have a fully patched HLH server, OMNM VM, and OME VM. All four nodes, along with the HLH server, are now up to date with their firmware and drivers.

I do have another question for Dell EMC at this time. In the Azure Stack Admin Portal, the update blade still shows the OEM version as 1.0.1808.1. I would assume it should now say 1.0.1809.4. I think I can call this a success even though we encountered several problems along the way. All of them were fairly easy to resolve, and only one required getting Dell EMC support involved. The two issues that caused most of the problems were easy to fix, although I am hoping for a permanent fix moving forward. I will look through the documentation to see whether it covers these two issues and how to avoid them in future updates.
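To check the update blade’s OEM version against the release you expect, compare the dotted numbers as integers rather than as strings; as plain text, "1.0.1809.10" would sort before "1.0.1809.4". A Python sketch, assuming the versions always look like 1.0.1808.1; the function names are hypothetical.

```python
def parse_oem_version(text: str) -> tuple:
    """Turn a dotted version such as '1.0.1808.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def oem_version_is_current(displayed: str, expected: str) -> bool:
    """True if the displayed version is at least the expected one (tuples compare element by element)."""
    return parse_oem_version(displayed) >= parse_oem_version(expected)
```

With the values from this run, `oem_version_is_current("1.0.1808.1", "1.0.1809.4")` is False, which matches what the update blade was showing.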

The very last thing we need to do is re-enable BitLocker on the HLH. The Patch and Update Automation Tool suspended BitLocker for us, but we need to re-enable it manually. This is another item I would like to see on the roadmap for this tool.
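Until the tool does this itself, the re-enable step can be scripted. This Python sketch wraps `manage-bde -protectors -enable`, the command-line counterpart of PowerShell’s Resume-BitLocker; it assumes the HLH OS volume is C: and that it runs in an elevated session on the HLH, and the function names are hypothetical. Running `manage-bde -status C:` afterward confirms whether protection is on.

```python
import subprocess
import sys


def bitlocker_enable_command(mount_point: str = "C:") -> list:
    """Build the manage-bde command that re-enables BitLocker key protectors on a volume."""
    return ["manage-bde", "-protectors", "-enable", mount_point]


def enable_bitlocker(mount_point: str = "C:") -> int:
    """Run the command on Windows (elevated) and return manage-bde's exit code."""
    if sys.platform != "win32":
        raise OSError("manage-bde is only available on Windows")
    return subprocess.run(bitlocker_enable_command(mount_point), check=False).returncode
```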

Questions Needed Answered

I have a few questions that I will be reaching out to Dell EMC and Microsoft to get answered:

  • Will the future process automate the downloading of the most recent Customer Toolkit?
  • Will the tool update OpenManage Network Manager and OpenManage Essentials?
  • Will we be able to save the configurations to a file so we don’t have to type everything in each time?
  • Why are we getting alerts about the nodes being offline if they are in maintenance mode?
  • When does the Current OEM version get updated to the current release after running the tool?
  • Are there undocumented security configurations that need to be made on the HLH to make this a more hands-off solution?
  • Will the tool in the future re-enable BitLocker on the HLH once the workflow is done with the HLH updates?
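On the question of saving configurations, something like this Python sketch could keep the non-secret values between monthly runs until the tool supports it. The field names are hypothetical stand-ins for the Profile and Workflow fields, and any key containing “password” is dropped before writing, so credentials still have to be typed in each time.

```python
import json
from pathlib import Path


def strip_secrets(values: dict) -> dict:
    """Drop any field whose name mentions a password."""
    return {key: value for key, value in values.items() if "password" not in key.lower()}


def save_profile(values: dict, path: Path) -> None:
    """Write the non-secret fields as pretty-printed JSON."""
    path.write_text(json.dumps(strip_secrets(values), indent=2, sort_keys=True), encoding="utf-8")


def load_profile(path: Path) -> dict:
    """Read a previously saved profile back into a dict."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Storing the file next to the stamp JSON files keeps everything needed for the next run in one place.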

Final Thoughts

As I was writing this blog, I was actually running the tool for the first time. I had some questions about how it would update the HLH without killing the Patch and Update tool itself, since the tool runs on the OME VM, which is hosted on the HLH. I am excited to see what Dell EMC will do with this tool. I would like to see the OEMs and Microsoft work on a combined PNU at some point. Since we have to run the OEM PNU and the Microsoft PNU separately, that means at least two maintenance windows for ourselves and the clients we manage. The good news is that the OEM PNU doesn’t (or shouldn’t) take down any of the Azure Stack infrastructure, so our tenants wouldn’t notice anything and, in theory, could carry on with business as usual. Time will tell how this turns out. Overall, I am pretty impressed with version 1.0 so far.

Oh, and as soon as I finished this process, I found the documentation for this tool. It is part of the monthly patch and update documentation, available from the same download location as the Customer Toolkit and the Patch and Update Automation Tool.

https://support.emc.com/docu91534_Cloud-for-Microsoft-Azure-Stack-Powered-by-VxRack-AS-Patch-and-Update-Installation-Guide,-Based-on-1809-Release.pdf?language=en_US
