Azure Stack Infrastructure Operations and Management

My first session on day 3 of Ignite was by Vijay Tewari, Group Product Manager of Azure Stack. He gave an excellent overview of how to operate and manage an Azure Stack environment, showing how this works at a much smaller scale than public Azure. Clusters in Azure are not based on Failover Clustering and are close to 1000 nodes in size, so management and operations at the much smaller scale of Azure Stack require some adaptation.

Cloud Operating Model

In Azure, Microsoft is the cloud provider, offers services to tenants and has a large number of engineers to operate that cloud. With Azure Stack the roles change, because the service provider or enterprise is the cloud provider and is responsible for the infrastructure. The customers of Azure Stack are your tenants, who consume the services that you are offering.

cloud-operating-model

Key design decisions

Microsoft looked at System Center, as well as the tools that are currently used in Azure. Microsoft decided that the tools in Azure were preferred because that is how Microsoft operates its own cloud. It would have been difficult to mold System Center to manage Azure Stack. Nevertheless, Microsoft has taken learnings from both sides, Azure and System Center.

design-decisions

Hyper-converged or Converged?

Microsoft gained a lot of insight into industry trends, and with Windows Server 2016 it built great technology that follows those trends. For Azure Stack, Microsoft chose hyper-converged with Storage Spaces Direct (S2D) over converged with SOFS. Combining storage and hypervisor on the same nodes is now supported in Windows Server 2016.

How do we allow you to scale?

Like Azure you have one or more clusters per region and you can have multiple regions. Azure Stack has the exact same scale model as public Azure.

How open can we be with Hyper-V?

The fact is that if you want to keep everything stable and up-to-date you need to seal the host. You cannot touch the Hyper-V host in Azure; you can only use its services via APIs. A lot of work goes into properly updating the system, and no risks can be taken with admins trying to modify something at the hypervisor level. For this reason, the hypervisor and management layer will be sealed and kept in a known state to facilitate automated updates. Of course the Azure Stack admin will get all the tools to manage the systems, keep them secure and continue to provide value with the Azure services to their tenants.

What hardware does Azure Stack run on?

Microsoft decided to start with three hardware vendors and work closely together to build integrated systems. A well known set of hardware is required and there is still a lot to be learnt before everything works as expected. On a large scale Microsoft has experience with this approach, since Azure is probably the largest integrated system. Aligning firmware and driver versions with the operating system and applications on top of the systems is crucial for the success of both Azure and Azure Stack. Azure changes very rapidly and Microsoft wants to keep Azure Stack closely behind Azure. If an Azure Stack service provider or enterprise stays behind, they don’t really offer Azure anymore. Eventually Microsoft expects to open up the choice of hardware and rely on its architecture and automation processes. So for these reasons Microsoft decided to choose integrated systems as a delivery approach.

azure-stack-integrated-system

When the box shows up, all you need to do is connect it and power it on. After this, the system will be automatically configured to run as an Azure Stack system. A lot of automation is used for this.

Validation is a massive investment. With the rapid pace of updates, Microsoft needs to validate that the system remains operational after each update. Currently, systems are being continually updated to make sure this works perfectly. Of course there is doubt. Many hands were raised when Vijay asked how many people have seen applications fail after a Windows update. Microsoft has been far from perfect at this, but using the integrated systems approach will increase the success rate.

Update will be offered as a service in the cloud which will run validation against your system to check the current version and available updates. You can monitor the hardware, software and services using the portal and tools included with Azure Stack. Alerts are shown per resource provider and you can zoom in to any issue and find guidance about solving the issue.

Security and privacy

A lot of work is done to keep Azure Stack secure, and likewise business continuity is a requirement for any service provider or enterprise that is going to implement Azure Stack. Patching & Updating is a challenge for many admins, but remains a fact of life. In Azure Stack this task will be much lighter because of extreme validation and automation. A fair thing to ask is what happens if things fail. Hardware can fail. Software can fail. If a disk fails it has to be replaced. If a server fails it has to be replaced. The recovery is all done in an automated way. We can replace disks, re-apply virtual switches and replace servers, and Azure Stack will come back up and running.

Hardware Partners

Microsoft chose Dell EMC, HPE and Lenovo as its first three partners to develop Azure Stack on their hardware. Microsoft was able to pull the best engineers together and the cooperation is phenomenal. The hardware vendors did not offer their gold-plated servers but best-of-breed servers like the Dell PowerEdge R730 and HPE ProLiant DL380 Gen9. These are all very capable industry standard servers.

Thinking about scale

Scale largely depends on which services will be offered in a particular region. You can have many regions in your environment, and you have to think of a scale unit as a fault domain. The number of fault domains is what you specify in an ARM JSON template.
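To make the fault domain remark a bit more concrete, here is a minimal sketch of what such a template fragment could look like, built as a Python dictionary and printed as JSON. It uses the availability set resource type from public Azure, where the fault domain count is the `platformFaultDomainCount` property. The resource name, the counts and the API version are my own illustrative choices and were not shown in the session; whether Azure Stack accepts these exact values is an assumption.

```python
import json


def availability_set_resource(name, fault_domains, update_domains):
    """Build an ARM availability set resource with explicit domain counts."""
    return {
        "type": "Microsoft.Compute/availabilitySets",
        "apiVersion": "2015-06-15",  # assumption: a version known from public Azure
        "name": name,
        "location": "[resourceGroup().location]",
        "properties": {
            # Number of fault domains the VMs in this set are spread over.
            "platformFaultDomainCount": fault_domains,
            "platformUpdateDomainCount": update_domains,
        },
    }


# Hypothetical values, chosen only for illustration.
template = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [availability_set_resource("demo-avset", 3, 5)],
}

print(json.dumps(template, indent=2))
```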

A difference with Azure is that Azure Stack uses failover clusters running Windows Server 2016, so think about the size of the cluster, the number of clusters in a region and the number of regions. Scale in Azure Stack can be defined as #servers per Scale Unit x #Scale Units per region x #regions. With Azure Stack a minimum of 4 cluster nodes is used. The eventual design goal is really high so you will not be capacity constrained in the future. But don’t expect to go massive scale from day 1.
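The scale formula is simple enough to write down as a small helper. This is just a sketch of the arithmetic: the function name and the example numbers are mine, and only the 4-node minimum comes from the session.

```python
MIN_NODES_PER_SCALE_UNIT = 4  # minimum cluster size mentioned in the session


def total_servers(servers_per_scale_unit: int,
                  scale_units_per_region: int,
                  regions: int) -> int:
    """Return servers per scale unit x scale units per region x regions."""
    if servers_per_scale_unit < MIN_NODES_PER_SCALE_UNIT:
        raise ValueError(
            f"a scale unit needs at least {MIN_NODES_PER_SCALE_UNIT} nodes")
    if scale_units_per_region < 1 or regions < 1:
        raise ValueError("need at least one scale unit and one region")
    return servers_per_scale_unit * scale_units_per_region * regions


if __name__ == "__main__":
    # Hypothetical example: 2 regions, 3 scale units each, 4 nodes per unit.
    print(total_servers(4, 3, 2))  # 4 * 3 * 2 = 24
```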

Things to consider

Microsoft promises to always offer a free single box for Azure Stack POC, so make sure you have a server under your desk to try each new version of Azure Stack and test your services. It will help you think about your architecture in terms of regions, VM sizes and services. It will help you determine what kind of box you need and what kind of workloads you are going to offer. If you need a lot of IO, use a combination of HDD, SSD and NVMe or go all flash. If your services will be CPU intensive, buy faster CPUs. Think about your identity system. Will you use Azure AD and let tenants federate with Azure Active Directory, or are you an enterprise that will use its internal AD and federate with ADFS?

things-to-consider

Integration in your datacenter

Azure Stack will include two TOR switches and a BMC switch, which will have to be connected to your datacenter switches. Make sure your networking teams are involved right from the start and help them understand the architecture. Because Azure Stack will be in its own dedicated AD forest, you’ll need a 1-way trust from your AD with Azure Stack if you want to use your management domain AD groups and users for things like datacenter monitoring.

integration-in-your-datacenter

For preparation of Azure Stack integration in your datacenter, Microsoft will offer a Microsoft Azure Stack Deployment Worksheet, which must be filled out. It will also show a graphical layout of the environment.

azure-stack-worksheet

azure-stack-worksheet-graph

Peek into a Scale Unit

The following slide shows many details about a scale unit of Azure Stack, including the different resource providers:

  • FRP – Fabric Resource Provider
  • SRP – Storage Resource Provider
  • NRP – Network Resource Provider
  • HRP – Health Resource Provider
  • URP – Update Resource Provider

peek-into-a-scale-unit

Each Resource Provider has a northbound interface to the ARM layer (portal, PowerShell, CLI, SDK) and a southbound interface to the infrastructure roles doing the real work (storage, network, compute controllers, etc.)

Azure Stack has a robust automation engine which is responsible for both deployment and changes to the system. In fact, a huge amount of infrastructure components have to be installed. The good thing: you don’t have to worry about them. Azure Stack is always installed and configured in exactly the same way. Everything else layers on top of this.

peek-into-a-scale-unit-2

At a minimum, 1 SSD boot disk is required for the hypervisor and at least 2 caching devices (SSD/NVMe) and 4 capacity disks (HDD, SSD). You can also make all disks SSD or NVMe or use more disks. It all depends on what is required for the services that are offered.
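As a quick sanity check for your own hardware planning, the per-node minimums above can be expressed as a small validation routine. This is only a sketch: the class and function names are hypothetical and not part of any Azure Stack tooling, and the thresholds are simply the numbers quoted in the session.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeDisks:
    boot_ssd: int        # SSD boot disks for the hypervisor
    cache_devices: int   # caching devices (SSD or NVMe)
    capacity_disks: int  # capacity disks (HDD or SSD)


def check_minimum(disks: NodeDisks) -> List[str]:
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    if disks.boot_ssd < 1:
        problems.append("need at least 1 SSD boot disk")
    if disks.cache_devices < 2:
        problems.append("need at least 2 caching devices")
    if disks.capacity_disks < 4:
        problems.append("need at least 4 capacity disks")
    return problems


if __name__ == "__main__":
    print(check_minimum(NodeDisks(boot_ssd=1, cache_devices=2, capacity_disks=4)))
    print(check_minimum(NodeDisks(boot_ssd=1, cache_devices=1, capacity_disks=2)))
```

The first call prints an empty list, the second lists the two shortfalls.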

Azure Stack runs Windows Server 2016 Core with Hyper-V and is configured as a hyper-converged cluster which provides both the storage and hypervisor layer. Every service is deployed resiliently, such as the portal which sits behind the software load balancer that is offered by the network controller in Windows Server 2016.

Cloud Administration vision for Azure Stack

Everything is designed so a service provider or enterprise can focus on their SLAs with their tenants or their customers. Microsoft aims at requiring a single admin to manage a 4-node cluster (and possibly more). This admin should be able to work across all these domains.

cloud-administration-vision

If you log in to the Azure Stack portal and are identified as an admin, you get a different experience than the tenant. You will be able to see Region Management and its respective Fabric, Health and Update RPs, see events and receive alerts. An admin can pick up the REST API call provided and use it in PowerShell or any other tool.

Monitoring

It is Microsoft’s goal not to flood the system with alerts. Azure Stack will offer specific actions using links to online troubleshooting and documentation.

monitoring

Patching & Update

Microsoft will be testing updates on a daily basis to validate that the update process is of very high quality. The goal is to offer monthly updates. Azure Stack will contain an automation framework that will not bring down tenant workloads. It uses failover clustering, so live migration is your friend. An important fact is that the hypervisor host is never updated in place, but re-imaged for each new rollup update.

patching-and-update

A new VHD is constructed and pointed to a hypervisor host. Then the boot order is flipped, so that the host is booted from the VHD that already includes all the patches. Azure Stack will only boot from VHD.
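To picture how this can work without tenant downtime, here is a toy simulation of a rolling update as I understood it from the session: each host is drained by live-migrating its VMs to the other cluster nodes, then booted from the new VHD. This is my own simplified model and not the actual Azure Stack update engine; the host names, VM names and VHD name are made up.

```python
from typing import Dict, List


def rolling_update(hosts: Dict[str, List[str]], new_vhd: str) -> Dict[str, str]:
    """Update hosts one at a time.

    `hosts` maps a host name to the VMs running on it and is modified in place.
    Returns a mapping of host name to the VHD it booted from.
    """
    names = list(hosts)
    if len(names) < 2:
        raise ValueError("need at least two hosts to drain one of them")
    booted = {}
    for name in names:
        others = [h for h in names if h != name]
        # Drain: "live-migrate" each VM to the currently least-loaded other host.
        for vm in list(hosts[name]):
            target = min(others, key=lambda h: len(hosts[h]))
            hosts[name].remove(vm)
            hosts[target].append(vm)
        # The host is now empty: flip the boot order to the patched VHD.
        booted[name] = new_vhd
    return booted


if __name__ == "__main__":
    cluster = {"node1": ["vm-a", "vm-b"], "node2": ["vm-c"],
               "node3": [], "node4": ["vm-d"]}
    result = rolling_update(cluster, "rollup-update.vhd")
    print(result)
    print(cluster)
```

No VM is ever stopped in this model; VMs only move between hosts, which is the point of combining failover clustering with live migration.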

Backup and Disaster Recovery

From the tenant view, this will be very similar to what you get in Azure today. Tenants will want Azure Backup to work with Azure Stack. This will be possible over time.

For Azure Stack system backup, an Infrastructure Backup Controller is used to create backups to a file share. You can take any backup product to protect that share. If rehydration is required, just give the data back to the Infrastructure Backup Controller, which will allow you to roll back Azure Stack to a known good state.

backup-and-disaster-recovery

Security Principles

All data will be encrypted with BitLocker, and Just Enough Administration (JEA) is used to only allow access to the right person at the right time for only the required actions. There will be Application Whitelisting: if something malicious turns up it will be refused.

security-principles

Azure Stack is hardened by default and will provide all the details that are required for auditing so you can prove that you are well protected against breach. Microsoft has decided not to use Nano Server yet, but Server Core. The switch to Nano Server is expected in future builds, not necessarily the next or GA, although this is possible.

If your servers are not connected to the Internet, you’ll have to download updates and put them in a certain location so the Update RP can find them. If you are connected, updates are downloaded automatically.

The only thing that is not updated automatically is the physical switch, but there will be guidance for that. A switch update failure will not bring down Azure Stack if these guidelines are used.

There is a lot of software below the OS: Disk firmware, controller firmware, network firmware, BIOS/UEFI. It is not yet determined how these updates will be distributed, but one thing is clear: you’ll get one payload (from Microsoft or the hardware vendor).

Timelines

During Ignite 2016, Azure Stack TP2 (single node) was released and we expect a third technical preview in early 2017. The multi-node version of Azure Stack became available this September and is currently being tested in a private preview. The public version will be generally available sometime in mid-CY17. GA of Azure Stack is coincident with GA of hardware.

timelines

Summary

  • Azure Stack is delivered as a pre-validated Integrated System
  • Contains purpose-built infrastructure management that integrates into your environment
  • Start now to develop skills as Cloud Architect/Admin with TP2

Questions

Q: Hardware management?
A: Will be done without agents on the hypervisor host and works via the BMC. You will be able to use OMS. HP will integrate OpenView.

Q: Will multiple regions be supported at GA?
A: Too early to tell, still 9 months away. You will be able to use multiple clusters per region.
