
DataOps for the Modern Data Warehouse on Microsoft Azure. https://aka.ms/mdw-dataops.

License: MIT License



page_type: sample
languages:
  - python
  - csharp
  - typescript
  - bicep
products:
  - azure
  - microsoft-fabric
  - azure-sql-database
  - azure-data-factory
  - azure-databricks
  - azure-stream-analytics
  - azure-synapse-analytics
description: Code samples showcasing how to apply DevOps concepts to the modern data warehouse architecture leveraging different Azure data technologies.

DataOps for the Modern Data Warehouse

This repository contains code samples and artifacts demonstrating how to apply DevOps principles to data pipelines built according to the Modern Data Warehouse (MDW) architectural pattern on Microsoft Azure.

The samples either focus on a single Azure service (Single Tech Samples) or showcase an end-to-end data pipeline solution as a reference implementation (End to End Samples). Each sample contains code and artifacts relating to one or more of the following:

  • Infrastructure as Code (IaC)
  • Build and Release Pipelines (CI/CD)
  • Testing
  • Observability / Monitoring

Single Technology Samples

End to End Samples

Parking Sensor Solution

This sample demonstrates a batch, end-to-end data pipeline following the MDW architecture, along with a corresponding CI/CD process.

Architecture

There are two versions of the solution:

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

modern-data-warehouse-dataops's People

Contributors

abeebu, akirakakar, azadehkhojandi, balakrishnaakuleti, balteravishay, deniscep, dependabot[bot], devlace, elenaterenzi, hannesne, herman-wu, jmostella, jsburckhardt, kiwibayer, microsoft-github-operations[bot], microsoftopensource, mmclende, promisinganuj, quickns, sapinderpalsingh, scogromsft, shawndeggans, siliang-j-1225, snorris31, sreedhar-guda, sudivate, tejado, tessferrandez, thurstonchen, ydaponte


modern-data-warehouse-dataops's Issues

AzFunction: split stream based on temperature

Simple Azure Function that will:

  • Read from EventHub filteredDevices
  • Split stream based on temperature
    • Temperature <100: Send output to Eventhub TemperatureOutput
    • Temperature >100: Send output to Eventhub TemperatureBadOutput
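The routing rule above can be sketched as a plain function. This is a hypothetical sketch: the event shape is an assumption, and only the hub names come from the issue.

```python
# Hypothetical sketch of the routing rule; the event dict shape is an
# assumption, only the Event Hub names come from the issue description.
def choose_output_hub(event):
    """Route an event from 'filteredDevices' by its temperature reading."""
    if event["temperature"] < 100:
        return "TemperatureOutput"
    # The issue leaves temperature == 100 unspecified; treat it as 'bad' here.
    return "TemperatureBadOutput"
```

The real function would wrap this in an Event Hub trigger and two output bindings.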

AzureSQL: Make deployment script idempotent

The deployment script needs to be re-runnable so that it can recover from 'half run' states.

  • If the AzDevOps service connection or pipelines already exist, then either drop and recreate them, or skip them -- make this configurable
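The configurable "drop-recreate or skip" behaviour can be sketched generically. This is a hypothetical helper; a real deployment script would call the az CLI rather than Python callbacks.

```python
# Hypothetical sketch of idempotent resource creation with a configurable
# 'skip' vs 'recreate' policy; create/delete stand in for real az CLI calls.
def ensure(name, existing, create, delete, mode="skip"):
    """Create a resource idempotently.

    mode='skip'     -- leave an existing resource untouched.
    mode='recreate' -- drop and recreate it.
    """
    if name in existing:
        if mode == "skip":
            return "skipped"
        delete(name)
    create(name)
    return "created"
```

Running the script twice then converges on the same state instead of failing on the second run.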

Synapse: Document how to use the sample

Include the following:

  • Overview
  • Prerequisites - any necessary software prerequisites
  • Setup - how to set up and deploy the sample
  • Running the sample - after deploying, how to do a build and release

Git usage in devcontainer(s)

I want to use the devcontainer for the Parking Sensors example.
I also verified that I have to open the repo from the /e2e_samples/parking_sensors folder in order to get VS Code to build it.

Initially I wanted to update the documentation with that information, but found that, since the workspace is then a subfolder of the git repo, you cannot use git inside the container: it relies on the .git folder at the repo root.

I followed the discussion in the PR on reasons not to put it on root level.

Still I want to get the full dev experience for e.g. contributing and deploying the samples.

Ideas:

  1. Keep the devcontainer-related code at the sample folder level, and document that the relevant one must be copied to the repo root in order to get the full developer experience for a given sample.
  2. The approach described in "Connecting to multiple containers at once" looks most interesting to me; I will investigate whether it can be set up so that only the needed devcontainer is started. That container would then be able to see the .git folder at the repo root. This seems the most user-friendly option, provided we don't end up spinning up 10+ devcontainers on the dev's host.

CC @devlace @Azadehkhojandi

Parking Sensors: Add ADF integration tests

Integration tests which verify that data from the test input API is correctly written to Azure Synapse (SQL DW). This would require an additional test environment, preferably one that can be spun up automatically as part of the pipeline.

Currently, only unit tests are demonstrated.
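One building block such an integration test needs is a polling helper that waits for the pipeline output to land in Synapse before asserting on it. The sketch below is hypothetical; query_fn would be supplied by the test, e.g. a pyodbc query against the SQL DW.

```python
import time

# Hypothetical polling helper for an ADF integration test: trigger the
# pipeline with test input, then poll the warehouse until the expected
# rows appear or a timeout expires. query_fn is supplied by the test.
def wait_for_rows(query_fn, expected_count, timeout_s=300, interval_s=5):
    """Poll query_fn() until it returns at least expected_count rows."""
    rows = []
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rows = query_fn()
        if len(rows) >= expected_count:
            return rows
        time.sleep(interval_s)
    raise TimeoutError(f"expected {expected_count} rows, last saw {len(rows)}")
```

Polling with a deadline avoids both flaky fixed sleeps and tests that hang forever when the pipeline fails.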

AzureSQL: Document how to use the sample

Include the following:

  • Overview
  • Prerequisites - any necessary software prerequisites
  • Setup - how to set up and deploy the sample
  • Running the sample - after deploying, how to do a build and release

Terraform template

Create a Terraform template that creates the following components:

  • Application Insights
  • EventHub: Ingress
  • Azure Function: Include list filter
  • EventHub: FilteredDevices
  • Azure Function: Temperature filter
  • EventHub: TemperatureOutput
  • EventHub: TemperatureBadOutput


AzureSQL: Make Azure resources deployed as part of single deployment easily identifiable as being part of the same deployment

  • Each deployment should have an "id"/name, either user-supplied or (if not supplied) a generated random string. This id can then be appended as a suffix to Azure resources which require globally unique names, such as sqlserver and keyvault.
  • This is so that resources can easily be identified as to which deployment they belong.
  • All resources should be tagged with this deployment id and "mdw-dataops".
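A minimal sketch of the proposed scheme, assuming a 6-character suffix and these particular tag keys (both are assumptions, not part of the issue):

```python
import secrets

# Hypothetical naming/tagging helpers for the scheme described above;
# suffix length and tag keys are assumptions.
def deployment_id(user_supplied=None):
    """Use the supplied id, or fall back to a short random string."""
    return user_supplied or secrets.token_hex(3)  # e.g. 'a1b2c3'

def resource_name(base, dep_id):
    """Suffix resources needing globally unique names (sqlserver, keyvault)."""
    return f"{base}-{dep_id}"

def resource_tags(dep_id):
    """Tags applied to every resource in the deployment."""
    return {"deployment-id": dep_id, "project": "mdw-dataops"}
```

Filtering the resource group by the deployment-id tag then lists everything a single deployment created.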

CD for Azure functions

A pipeline.yaml file that can deploy the two Azure Functions to an existing Azure subscription whose infrastructure has been pre-created via the Terraform template.

Azure functions:

  • Device include-list
  • Temperature stream split

Options:

  • GitHub Actions pipeline yaml
  • Azure DevOps pipeline yaml

Core: Add credential scan to prevent accidental commits

I think there are two options for this.

  1. A GitHub Action that scans code, in addition to the default credscan in GitHub
    Pros: it prevents commits from being merged from a PR to the main branch, and it can easily be applied to all PRs
    Cons: it does not prevent committing to a branch
  2. Adding a git hook to check code before committing anything
    Pros: it prevents committing credentials at all
    Cons: each engineer needs to install it

We could do both, but we need to prioritize which one to work on first.
I am thinking of working on option 1 first.
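Option 2 could be illustrated with a naive scan like the one below. The patterns are examples only, and a real pre-commit hook would run an existing tool such as gitleaks or detect-secrets rather than hand-rolled regexes.

```python
import re

# Illustrative only: example patterns a pre-commit credential scan might
# flag. A real hook should use a dedicated tool (gitleaks, detect-secrets).
SECRET_PATTERNS = [
    re.compile(r"AccountKey=[A-Za-z0-9+/=]{20,}"),        # storage connection string
    re.compile(r"(?i)password\s*=\s*['\"][^'\"]+['\"]"),  # inline password literal
]

def find_secrets(text):
    """Return matched snippets so the hook can reject the commit."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(text)]
```

The hook would run this over the staged diff and exit non-zero when find_secrets returns anything.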

ADF deployment error - linkedservices path incorrect

Looks like the linked-services reference is pointing to the incorrect folder.

Updating Data Factory LinkedService to point to newly deployed resources (KeyVault and DataLake).

    jq: error: Could not open file .tmp/adf/linkedService/Ls_KeyVault_01.json: No such file or directory
    jq: error: Could not open file .tmp/adf/linkedService/Ls_AdlsGen2_01.json: No such file or directory
    Deploying Data Factory artifacts.
    Creating ADF LinkedService: Ls_KeyVault_01
    Unsupported Media Type({"message":"The request contains an entity body but no Content-Type header. The inferred media type 'application/octet-stream' is not supported for this resource."})

Possible fixes (either):

  1. Update the copy command to copy everything to .tmp/adf/
  2. Update adfLsDir to point to .tmp/linkedService instead of .tmp/adf/linkedService
