Dive into Microsoft Azure Stack Architecture (part 1)

Hall C1 holds 2,000 people, and every seat was taken for Spencer Shepler’s Dive into Microsoft Azure Stack Architecture session.

Azure Stack Integrated Systems

The first picture Spencer showed was of the three Azure Stack integrated systems by HPE, Dell EMC and Lenovo. They are a must-see at the booth at Ignite 2016.



A quick look at the timeline shows that Azure Stack TP2 is now available as a 1-node proof of concept. The multi-node version will remain in private preview for a limited audience, and both the 1-node and multi-node versions are expected to arrive in mid calendar year 2017.



Azure Stack: What is it?

At a high level, we are bringing the Azure services into your datacenter on your hardware. We’re trying to provide consistency so you can take your workloads and deploy them in Azure or deploy them on-premises to Azure Stack. You can move them back and forth for dev/test or for regulatory reasons. There is a variety of reasons why you want the full ecosystem in place for Azure Stack.

Azure Stack Integrated System

It is not just a software solution. It is also a set of hardware that has been tested and configured to align with the requirements. It includes a support infrastructure: we bring the solution together with our OEM partners, making sure that you get the right support both from them and from us. So it is a complete solution and not just a handoff from one vendor to the other. And finally there is a series of services wrapped around Azure Stack: the system is racked and stacked, brought into the datacenter and deployed, and ongoing services are available if you want them, as we come to understand what our customers need.

Let me describe what the hardware stack looks like. It is a hardware and software solution, and that’s how you acquire Azure Stack: as what we’ll call an Azure Stack integrated system. It is a full lifecycle of services that come together. It’s not a matter of installing the software and then leaving and forgetting about it. We try to account for the whole solution over time.

Azure Stack Integrated System (Life Cycle)

We first go through what the architecture should be: the underpinnings of the hardware, Windows Server and all the services on top of that. It has to come together and deliver the scale, performance and availability for you as a customer to run the Azure services on top of it.

Deployment and configuration: we work through our OEM partners and engage with customers to understand how they deploy services in their datacenters, making sure we can accommodate the variety of environments, configurations and needs. It is another big piece of what we are trying to do.

Validation is a big piece of what we do, obviously for our own software, but we also work with our hardware partners to ensure that we get the right features and functions out of the hardware components: all the way from HBAs, NICs, storage devices, servers and storage switches through the whole stack. That way you can be assured that when you get an integrated system, those components have been tested individually and also as a whole, as an overall solution.

Monitoring and diagnostics: on a large system we have to understand what’s going on, and you have to be able to see what’s happening. What do I need to react to? How do I fit this into my existing environment, and how do I integrate it into the things that I do from day to day?

Field replacement of parts: hardware and software can fail, and we need to be able to account for that. In our partnership we have certain ideas and capabilities in our software stack, while partners have their own supply chains and support infrastructure. We bring those together in a way that we can actually deliver and that satisfies the SLAs for the overall solution.

Patching and updating: we are looking at how we deliver software on a regular cadence so that we can keep up the consistency with Azure, providing services, security patches and firmware updates, essentially the whole set of things that we have to take advantage of, and do that in a way that is not disruptive to the workloads running on top of an Azure Stack instance.

Business continuity: things like backup and disaster recovery, and how they integrate into this solution.

And then finally the security practices: how we do deployment, how we manage it and bring it together for you in your environment, and how we make sure it meets the requirements of the enterprise, service delivery partners and a variety of other entities, so that we have a secure platform to work from.

So this is the whole life cycle, or the way we think about the life cycle structure for Azure Stack. It is actually a pretty good guide for the overall agenda that we have for today, the topics that I’ll go through and spend time on. The majority of my time is going to be focused on architecture: the hardware and topology characteristics of Azure Stack, the things we’re building for the initial release in the middle of next year.

Azure Stack Architecture Overview

This is kind of a classic block diagram of the architecture. Starting at the very top is the consistency with Azure: the Azure portal, the templates, the ability to interact with PowerShell, the way we’re managing the overall topology of Azure and Azure Stack, and then Azure Resource Manager (ARM), which runs as an instance in Azure Stack. Essentially, this is the endpoint that you interact with as both a tenant and an administrator of an Azure Stack instance. This is key to the overall delivery.


Underneath that is an architectural layer, a Partition Request Broker. We have a variety of acronyms listed underneath that, all of which essentially refer to Resource Providers (RPs). An RP exposes a well-known API that interacts back and forth with ARM, just as in Azure. The architecture is exactly the same.
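
To make the ARM and Resource Provider relationship a little more concrete, here is a minimal sketch in Python. It relies only on the public shape of an ARM resource ID, where the segment after `providers` names the resource provider namespace. The `HANDLERS` table and the `route` function are hypothetical stand-ins for illustration; this is a toy model of the routing idea, not how ARM or the Partition Request Broker is implemented.

```python
from typing import Callable, Dict


def provider_namespace(resource_id: str) -> str:
    """Return the resource provider namespace from an ARM-style resource ID."""
    parts = [p for p in resource_id.split("/") if p]
    lowered = [p.lower() for p in parts]
    if "providers" not in lowered:
        raise ValueError("resource ID has no 'providers' segment")
    idx = lowered.index("providers")
    if idx + 1 >= len(parts):
        raise ValueError("resource ID ends before the provider namespace")
    return parts[idx + 1]


# Hypothetical handlers standing in for resource providers (RPs).
HANDLERS: Dict[str, Callable[[str], str]] = {
    "microsoft.compute": lambda rid: f"compute RP handles {rid}",
    "microsoft.storage": lambda rid: f"storage RP handles {rid}",
    "microsoft.network": lambda rid: f"network RP handles {rid}",
}


def route(resource_id: str) -> str:
    """Dispatch a request to the handler registered for its namespace."""
    namespace = provider_namespace(resource_id).lower()
    handler = HANDLERS.get(namespace)
    if handler is None:
        raise KeyError(f"no resource provider registered for {namespace}")
    return handler(resource_id)


vm_id = ("/subscriptions/0000/resourceGroups/rg1"
         "/providers/Microsoft.Compute/virtualMachines/vm1")
print(provider_namespace(vm_id))
print(route(vm_id))
```

The point of the sketch is only that a single front door can hand requests to different providers based on a well-known naming scheme, which is the consistency the session describes between Azure and Azure Stack.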

Then underneath that we have essentially the variations that start to occur because our architecture, unlike the massive datacenters in Azure, is going to be four servers to start with. You have to take the software that runs in that datacenter in some form, provide the consistency of Azure and bring it onto a 4-node configuration to start. This represents the underlying pieces where the variance from Azure begins to happen. Now this is all underlying detail. This is an infrastructure that we’re providing, and it isn’t externally exposed to you through a set of APIs. You’re not going to see it when you go off and deploy workloads as a tenant or administer the overall server solution.

Finally, below the Infrastructure Roles sits the Hardware Layer that underlies all of this: network, compute and storage as the underlying feature set.

And then there are the details of what’s inside the Infrastructure Roles. You can think of those as the services you see running in Task Manager. You may know what they are, but you don’t need to get into the details or manage them. You may have to stop and restart them occasionally. The idea is that those are well engineered, tested, validated services that you don’t need to interact with or know in detail. That is what is represented in those roles.

As I mentioned, we’ll focus on the network switches, compute and storage configurations and the choices we made to deliver Azure Stack for your datacenter.

What is an Installation or Instance of Azure Stack?

We have a single instance of Azure Resource Manager (ARM) that manages the portal for the instance of Azure Stack. Then you can have one or more regions underneath. The same concept holds for Azure: I have a datacenter in US West as a place where I can deploy resources, and I also have a datacenter in US East where I can deploy resources. The same thing is going to happen with Azure Stack. You’ll have those regions present, and ARM will be able to manage the single entity of Azure Stack.

Under each of these regions will be one or more Scale Units (SU). You can think of them as a group of compute, storage and network resources that interact in a tightly bound fashion, delivered as a unit or whole. You’ll have one or more of them in a region to provide you with scale. Within each Scale Unit we’ll start off with four servers, and I’ll explain the logic: why that is the case, why we offer four instead of one or two, what the resiliency model is for that, and how to think about capacity expansion starting from that four-server unit.
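
The hierarchy described here (one Azure Stack instance, one or more regions, one or more Scale Units per region, servers per Scale Unit) can be sketched as a small data model. The class names, field names and the helper `make_scale_unit` are hypothetical, and treating the four-server starting size as a hard minimum is an assumption made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: treat the four-server starting size as a minimum.
MIN_SERVERS_PER_SCALE_UNIT = 4


@dataclass
class ScaleUnit:
    name: str
    servers: List[str]

    def __post_init__(self) -> None:
        if len(self.servers) < MIN_SERVERS_PER_SCALE_UNIT:
            raise ValueError(
                f"{self.name}: needs at least "
                f"{MIN_SERVERS_PER_SCALE_UNIT} servers"
            )


@dataclass
class Region:
    name: str
    scale_units: List[ScaleUnit] = field(default_factory=list)

    def server_count(self) -> int:
        return sum(len(su.servers) for su in self.scale_units)


@dataclass
class AzureStackInstance:
    """One ARM instance managing one or more regions."""
    regions: List[Region] = field(default_factory=list)

    def server_count(self) -> int:
        return sum(region.server_count() for region in self.regions)


def make_scale_unit(name: str, size: int = 4) -> ScaleUnit:
    """Build a Scale Unit with placeholder server names."""
    return ScaleUnit(name, [f"{name}-node{i}" for i in range(1, size + 1)])


instance = AzureStackInstance(regions=[
    Region("region-a", [make_scale_unit("su1")]),
    Region("region-b", [make_scale_unit("su2"), make_scale_unit("su3")]),
])
print(instance.server_count())  # 12
```

In this model, adding capacity to a region means appending another `ScaleUnit` to its list, which mirrors the expansion model the session goes on to describe.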


Azure Stack Region

An Azure Stack region essentially represents that set of Scale Units, typically in the same physical location. You’ll have a set of racks and a set of Scale Units, probably with a “single administrator”, although it is of course up to the customer to decide what these characteristics mean. We have underlying requirements for regions: they must have network bandwidth and latency that match what you’d expect to see in Azure. That means high bandwidth and very low latency, because you’ll have workloads deployed in that region that need to interact very effectively. What it comes down to is that I need a network infrastructure that provides that bandwidth and latency. So I’m not going to lay down a couple of 1Gb links between a couple of Scale Units in a Region. I’m going to run 10Gb, possibly multiple 10Gb, configurations between those Scale Units to provide that.
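
A rough back-of-the-envelope calculation shows why 1Gb links are ruled out between Scale Units. The sketch below computes the ideal transfer time for a payload over links of different speeds. The 1 TB payload is a hypothetical figure chosen for illustration, and the calculation ignores protocol overhead and latency, so real transfers would be slower.

```python
def transfer_seconds(size_bytes: float, link_gbps: float) -> float:
    """Ideal time to move size_bytes over a link of link_gbps gigabits/s."""
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


ONE_TB = 1e12  # hypothetical payload: 1 TB (decimal)

for gbps in (1, 10, 20):
    minutes = transfer_seconds(ONE_TB, gbps) / 60
    print(f"{gbps:>2} Gb/s: {minutes:.1f} minutes")
```

Under these idealized assumptions, 1 TB takes a little over two hours on a single 1Gb link, about 13 minutes on a 10Gb link, and about 7 minutes on two aggregated 10Gb links.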


Then lastly, there is the other set of attributes associated with a Region. The nice thing about Regions is that they are a separation of physical hardware, so you as a customer can apply other attributes to them. For example: you have two lines of business that are going to invest in hardware, but they want to manage who gets access to that hardware and who is going to deploy workloads. You may be the administrator who doesn’t want to run multiple instances of Azure Stack, so you’ll take one set of hardware and place it in one Region, take a second set of hardware purchased by the other line of business and place it in a second Region, and put both under a single instance of ARM. You can then create Plans and Offers that restrict tenants to a particular Region when they do deployments. So you have control and alignment. A classic example is having two Regions placed in two different datacenters, with an application pushing data between them as replication for disaster recovery. I can physically separate those Regions by hundreds of feet, by different buildings, or by tens of kilometers if that’s required for my business.

Azure Stack Scale Unit

When you deploy a Scale Unit, it is consumed fully by one Region: each Scale Unit belongs to exactly one Region. If I need more capacity, I add more Scale Units. There is an alignment with Azure for the concept of Fault Domains. If I deploy workloads in Azure, I can choose to deploy three VMs or services in three different Fault Domains. The ARM templates allow that to be expressed, so the workloads will be placed in the appropriate physical locations for that separation. For Azure Stack, the Fault Domains align with the Scale Units. It is another great way to model and think about physical location and the workloads that are placed across the physical entities. Finally there is an alignment with the Hardware SKUs. Servers must be homogeneous within a Scale Unit: they have to have the same memory, CPU, network and storage devices.
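
Two rules from this section lend themselves to a short sketch: servers in a Scale Unit must be homogeneous, and workloads can be separated across Fault Domains, which in Azure Stack align with Scale Units. The `ServerSpec` fields and the round-robin placement below are assumptions made for illustration; the session does not describe how Azure Stack actually places workloads.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServerSpec:
    """Hypothetical description of one server's hardware."""
    memory_gb: int
    cpu_model: str
    nic_model: str
    storage_layout: str


def is_homogeneous(servers: List[ServerSpec]) -> bool:
    """True when every server in the Scale Unit has an identical spec."""
    return len(set(servers)) <= 1


def spread_across_fault_domains(
    vms: List[str], scale_units: List[str]
) -> Dict[str, List[str]]:
    """Assign VMs to Scale Units round-robin (an illustrative policy)."""
    if not scale_units:
        raise ValueError("at least one Scale Unit is required")
    placement: Dict[str, List[str]] = {su: [] for su in scale_units}
    for index, vm in enumerate(vms):
        target = scale_units[index % len(scale_units)]
        placement[target].append(vm)
    return placement


spec = ServerSpec(256, "cpu-x", "nic-10g", "4xSSD+12xHDD")
odd_one = ServerSpec(128, "cpu-x", "nic-10g", "4xSSD+12xHDD")
print(is_homogeneous([spec] * 4))             # True
print(is_homogeneous([spec] * 3 + [odd_one])) # False
print(spread_across_fault_domains(["vm1", "vm2", "vm3"], ["su1", "su2", "su3"]))
```

With three VMs and three Scale Units, each VM lands in a different Fault Domain, so the loss of one Scale Unit affects only one of the three.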


Azure Stack Scale Unit – Implementation Details

Let’s go a little bit deeper and talk about the Scale Unit and what the implementation details are. First, we build on Windows Server 2016. In the Azure Stack infrastructure we use quite a number of features that have been delivered this week with the availability of Windows Server: things like software defined storage, software defined networking and a lot of other enhancements that have been added in Windows Server.

We’re going to use a Hyper-Converged configuration, which gives us a lot of flexibility in how we do deployment. It is very straightforward to reason about. In the initial release this is what we are going to rely on.


Also we’re going to use Windows Failover Clustering and, if you’re familiar with our software defined storage, Storage Spaces Direct (S2D). We’ll also use Failover Clustering to manage VMs, so that they are live migrated or restarted if there are failures.

A set of servers in a Scale Unit will be physically attached to a set of Top of Rack (ToR) switches, a ToR pair of individual switches. You can actually configure one or more Scale Units, or Failover Clusters, per ToR pair. So we don’t need a ToR pair for every Scale Unit or each Compute/Storage configuration. There is a little flexibility here.

Once we do all this, we gain the ability to expand capacity very easily. We start off very small compared to a disaggregated compute and storage configuration. Those have to start much larger, because I need a storage unit and a compute unit, which start larger from a cost and structure standpoint. A hyper-converged configuration can start at a much smaller size.

Continues in part 2
