
DataOps for the Modern Data Warehouse on Microsoft Azure. https://aka.ms/mdw-dataops.

License: MIT License



page_type: sample
languages:
  - python
  - csharp
  - typescript
  - bicep
products:
  - azure
  - microsoft-fabric
  - azure-sql-database
  - azure-data-factory
  - azure-databricks
  - azure-stream-analytics
  - azure-synapse-analytics
description: Code samples showcasing how to apply DevOps concepts to the modern data warehouse architecture leveraging different Azure data technologies.

DataOps for the Modern Data Warehouse

This repository contains code samples and artifacts demonstrating how to apply DevOps principles to data pipelines built according to the Modern Data Warehouse (MDW) architectural pattern on Microsoft Azure.

The samples either focus on a single Azure service (Single Tech Samples) or showcase an end-to-end data pipeline solution as a reference implementation (End to End Samples). Each sample contains code and artifacts relating to one or more of the following:

  • Infrastructure as Code (IaC)
  • Build and Release Pipelines (CI/CD)
  • Testing
  • Observability / Monitoring

Single Technology Samples

End to End Samples

Parking Sensor Solution

This sample demonstrates a batch, end-to-end data pipeline following the MDW architecture, along with a corresponding CI/CD process.

Architecture

There are two versions of the solution:

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

modern-data-warehouse-dataops's People

Contributors

abeebu, akirakakar, azadehkhojandi, balakrishnaakuleti, balteravishay, deniscep, dependabot[bot], devlace, elenaterenzi, hannesne, herman-wu, jmostella, jsburckhardt, kiwibayer, microsoft-github-operations[bot], microsoftopensource, mmclende, promisinganuj, quickns, sapinderpalsingh, scogromsft, shawndeggans, siliang-j-1225, snorris31, sreedhar-guda, sudivate, tejado, tessferrandez, thurstonchen, ydaponte


modern-data-warehouse-dataops's Issues

AzFunction: split stream based on temperature

Simple Azure Function that will:

  • Read from EventHub filteredDevices
  • Split stream based on temperature
    • Temperature <100: Send output to Eventhub TemperatureOutput
    • Temperature >100: Send output to Eventhub TemperatureBadOutput
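The routing rule above can be sketched as a plain function. This is a hypothetical sketch: the event shape is an assumption, and only the hub names come from the issue.

```python
# Hypothetical sketch of the routing rule; the event dict shape is an
# assumption, only the Event Hub names come from the issue description.
def choose_output_hub(event):
    """Route an event from 'filteredDevices' by its temperature reading."""
    if event["temperature"] < 100:
        return "TemperatureOutput"
    # The issue leaves temperature == 100 unspecified; treat it as 'bad' here.
    return "TemperatureBadOutput"
```

The real function would wrap this in an Event Hub trigger and two output bindings.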

AzureSQL: Make deployment script idempotent

The deployment script needs to be re-runnable so that it can recover from 'half run' states.

  • If the AzDevOps service connection or pipelines already exist, then either drop and recreate them, or skip them -- make this configurable
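The configurable "drop-recreate or skip" behaviour can be sketched generically. This is a hypothetical helper; a real deployment script would call the az CLI rather than Python callbacks.

```python
# Hypothetical sketch of idempotent resource creation with a configurable
# 'skip' vs 'recreate' policy; create/delete stand in for real az CLI calls.
def ensure(name, existing, create, delete, mode="skip"):
    """Create a resource idempotently.

    mode='skip'     -- leave an existing resource untouched.
    mode='recreate' -- drop and recreate it.
    """
    if name in existing:
        if mode == "skip":
            return "skipped"
        delete(name)
    create(name)
    return "created"
```

Running the script twice then converges on the same state instead of failing on the second run.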

Synapse: Document how to use the sample

Include the following:

  • Overview
  • Prerequisites - any necessary software prerequisites
  • Setup - how to set up and deploy the sample
  • Running the sample - after deploying, how to do a build and release

Git usage in devcontainer(s)

I want to use the devcontainer for the Parking Sensors example.
I also verified that I have to open the repo from the /e2e_samples/parking_sensors folder in order to get VS Code to build it.

Initially I wanted to update the documentation with that information, but found that, since the workspace is then a subfolder of the git repo, you cannot use git inside the container: it relies on the .git folder at the repo root.

I followed the discussion in the PR on reasons not to put it on root level.

Still I want to get the full dev experience for e.g. contributing and deploying the samples.

Ideas:

  1. Keep the devcontainer-related code at the sample folder level, and document that the relevant one must be copied to the repo root in order to get the full developer experience for a given sample.
  2. The approach described in "Connecting to multiple containers at once" looks most interesting to me; I will investigate whether it can be set up so that only the needed devcontainer is started. That container would then be able to see the .git folder at the repo root. This seems the most user-friendly option, provided we don't end up spinning up 10+ devcontainers on the dev's host.

CC @devlace @Azadehkhojandi

Parking Sensors: Add ADF integration tests

Integration tests which verify that data from the test input API is correctly written to Azure Synapse (SQL DW). This would require an additional test environment, preferably one that can be spun up automatically as part of the pipeline.

Currently, only unit tests are demonstrated.
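One building block such an integration test needs is a polling helper that waits for the pipeline output to land in Synapse before asserting on it. The sketch below is hypothetical; query_fn would be supplied by the test, e.g. a pyodbc query against the SQL DW.

```python
import time

# Hypothetical polling helper for an ADF integration test: trigger the
# pipeline with test input, then poll the warehouse until the expected
# rows appear or a timeout expires. query_fn is supplied by the test.
def wait_for_rows(query_fn, expected_count, timeout_s=300, interval_s=5):
    """Poll query_fn() until it returns at least expected_count rows."""
    rows = []
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rows = query_fn()
        if len(rows) >= expected_count:
            return rows
        time.sleep(interval_s)
    raise TimeoutError(f"expected {expected_count} rows, last saw {len(rows)}")
```

Polling with a deadline avoids both flaky fixed sleeps and tests that hang forever when the pipeline fails.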

AzureSQL: Document how to use the sample

Include the following:

  • Overview
  • Prerequisites - any necessary software prerequisites
  • Setup - how to set up and deploy the sample
  • Running the sample - after deploying, how to do a build and release

Terraform template

Create a Terraform template that creates the following components:

  • Application Insights
  • EventHub: Ingress
  • Azure Function: Include list filter
  • EventHub: FilteredDevices
  • Azure Function: Temperature filter
  • EventHub: TemperatureOutput
  • EventHub: TemperatureBadOutput


AzureSQL: Make Azure resources deployed as part of single deployment easily identifiable as being part of the same deployment

  • Each deployment should have an "id"/name, either user-supplied or (if not supplied) a generated random string. This id can then be appended as a suffix to Azure resources which require globally unique names, such as sqlserver and keyvault.
  • This is so that resources can easily be identified as to which deployment they belong.
  • All resources should be tagged with this deployment id and "mdw-dataops".
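A minimal sketch of the proposed scheme, assuming a 6-character suffix and these particular tag keys (both are assumptions, not part of the issue):

```python
import secrets

# Hypothetical naming/tagging helpers for the scheme described above;
# suffix length and tag keys are assumptions.
def deployment_id(user_supplied=None):
    """Use the supplied id, or fall back to a short random string."""
    return user_supplied or secrets.token_hex(3)  # e.g. 'a1b2c3'

def resource_name(base, dep_id):
    """Suffix resources needing globally unique names (sqlserver, keyvault)."""
    return f"{base}-{dep_id}"

def resource_tags(dep_id):
    """Tags applied to every resource in the deployment."""
    return {"deployment-id": dep_id, "project": "mdw-dataops"}
```

Filtering the resource group by the deployment-id tag then lists everything a single deployment created.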

CD for Azure functions

A pipeline.yaml file that can deploy the two Azure Functions to an existing Azure subscription whose infrastructure has been pre-created via the Terraform template.

Azure functions:

  • Device include-list
  • Temperature stream split

Options:

  • GitHub Actions pipeline yaml
  • Azure DevOps pipeline yaml

Core: Add credential scan to prevent accidental commits

I think there are two options for this.

  1. A GitHub Action that scans code, in addition to the default credscan in GitHub
    Pros: it prevents commits from being merged from a PR to the main branch, and it can easily be applied to all PRs
    Cons: it does not prevent committing to a branch
  2. Adding a git hook to check code before committing anything
    Pros: it prevents committing credentials at all
    Cons: each engineer needs to install it

We could do both, but we need to prioritize which one to work on first.
I am thinking of working on option 1 first.
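Option 2 could be illustrated with a naive scan like the one below. The patterns are examples only, and a real pre-commit hook would run an existing tool such as gitleaks or detect-secrets rather than hand-rolled regexes.

```python
import re

# Illustrative only: example patterns a pre-commit credential scan might
# flag. A real hook should use a dedicated tool (gitleaks, detect-secrets).
SECRET_PATTERNS = [
    re.compile(r"AccountKey=[A-Za-z0-9+/=]{20,}"),        # storage connection string
    re.compile(r"(?i)password\s*=\s*['\"][^'\"]+['\"]"),  # inline password literal
]

def find_secrets(text):
    """Return matched snippets so the hook can reject the commit."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(text)]
```

The hook would run this over the staged diff and exit non-zero when find_secrets returns anything.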

ADF deployment error - linkedservices path incorrect

Looks like the linked-services reference is pointing to the incorrect folder.

Updating Data Factory LinkedService to point to newly deployed resources (KeyVault and DataLake).

    jq: error: Could not open file .tmp/adf/linkedService/Ls_KeyVault_01.json: No such file or directory
    jq: error: Could not open file .tmp/adf/linkedService/Ls_AdlsGen2_01.json: No such file or directory
    Deploying Data Factory artifacts.
    Creating ADF LinkedService: Ls_KeyVault_01
    Unsupported Media Type({"message":"The request contains an entity body but no Content-Type header. The inferred media type 'application/octet-stream' is not supported for this resource."})

Possible fixes (either):

  1. Update the copy command to copy everything to .tmp/adf/
  2. Update adfLsDir to point to .tmp/linkedService instead of .tmp/adf/linkedService
