In my last blog post I explained what the future of “Azure: The world computer” looks like with Azure Stack. Azure Data Factory could be another Azure service that plays a role in this hybrid / edge scenario. It is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. With Azure Data Factory you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores, and process and transform that data using compute services such as Azure HDInsight (Hadoop), Spark, Azure Data Lake Analytics, and Azure Machine Learning.
Azure Stack has a service called Azure Storage: Storage as a Service that is consistent with the one running in public Azure. But once you try to connect to Azure Stack blob storage from Azure Data Factory, an error occurs:
The key to resolving this consists of two parts:
- Install the Self-Hosted Integration Runtime (either on Azure Stack, or in public Azure if your Azure Stack blob endpoint is publicly accessible over the internet)
- Create an Azure Blob storage connection to an existing public Azure Storage account and then change the type using the code editor.
Part 1
For those who have never worked with the Self-Hosted Integration Runtime, you can learn more about it here. Deploy one or more VMs that you would like to use as the Self-Hosted Integration Runtime (IR). In Azure, create the Integration Runtime in Azure Data Factory, but don’t download the express setup:
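If you prefer code over the portal wizard, the integration runtime definition itself is tiny. A minimal sketch of what it looks like in the Data Factory code editor; the name AzureStackIR is just my example (we will reference it again later):

```json
{
    "name": "AzureStackIR",
    "properties": {
        "type": "SelfHosted",
        "description": "Self-hosted IR that can reach the Azure Stack blob endpoint"
    }
}
```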
Once you have deployed one or more virtual machines for the IR, download the integration runtime from here. This is the version that works with Azure Stack blob storage. Just run the installer, and when asked, use one of the keys to register it.
!! Important note!! Disable auto-update in the Auto Update tab! This is really important: if the IR gets updated to the latest version, it will cause compatibility issues with Azure Stack storage.
When you have successfully installed the IR, this is what you should see:
Part 2
Now that we have the Self-Hosted Integration Runtime up and running, we need to add the Azure and Azure Stack storage linked services for our data copy demo. What I am going to explain now sounds a bit weird, but create two Azure Blob Storage connections and name them Azure_Blob and AzureStack_blob (or any other names you prefer, as long as they differ). Connect both to the same public Azure Storage account, like I did with mine here:
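For reference, a sketch of what the generated linked service JSON looks like at this point (the second one is identical apart from its name). Account name and key are placeholders, and both connections still point at the standard public endpoint suffix:

```json
{
    "name": "AzureStack_blob",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
            }
        }
    }
}
```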
The trick now is to edit the linked service in code so that the Self-Hosted Integration Runtime can consume it using Azure Stack blob storage. Because we used a different version for the integration runtime installation, we need to edit the code on both linked services.
First, let’s start with AzureStack_blob. Click on the code button and change the type from AzureBlobStorage to AzureStorage, then click Finish:
Now edit the service and fill in all the information. Don’t forget to change the integration runtime to the self-hosted one created earlier:
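After the edit, the AzureStack_blob linked service should look roughly like the sketch below. The account name, key, and endpoint suffix are placeholders you must replace with your Azure Stack region’s blob endpoint values, and AzureStackIR is the example IR name from earlier:

```json
{
    "name": "AzureStack_blob",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<stack-account>;AccountKey=<key>;EndpointSuffix=<your-azure-stack-blob-endpoint-suffix>"
            }
        },
        "connectVia": {
            "referenceName": "AzureStackIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```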
When you run “test connection” you should get the green check mark and “connection successful”. Remember, you need to apply the same trick to the Azure Blob Storage linked service that points to the public Azure storage account. This is because this version of the IR doesn’t understand the AzureBlobStorage type and will throw this error once you try to use it in a copy pipeline:
Once both linked services are edited and you create a new copy data pipeline, you can copy data from Azure Stack to Azure using Azure Data Factory!
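For completeness, a minimal sketch of what such a copy pipeline could look like in JSON. The pipeline and dataset names (AzureStackBlobDataset, AzureBlobDataset) are hypothetical and assume you have created blob datasets on top of the two linked services:

```json
{
    "name": "CopyStackToAzure",
    "properties": {
        "activities": [
            {
                "name": "CopyBlobData",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "AzureStackBlobDataset", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "AzureBlobDataset", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "BlobSink" }
                }
            }
        ]
    }
}
```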
I would like to shout out and thank Abhishek Narain, PM on the Azure Data Factory team, for helping me sort this out! Happy data copying!