

Azure Databricks Fundamentals Workshop (adbfundamentalsws)

Use this repo for the Azure Databricks Fundamentals Workshop

Welcome to the workshop. The purpose of the workshop is to bring you up to speed on some of the Data Engineering, Data Science, and Data Analysis capabilities of Azure Databricks. You will be using resources in Azure for this workshop. Each of you will receive a userid and password to access the resources. Please consult your instructor if you have any questions or blockers.

The Workshop is broken into two parts:

- Data Engineering and Data Science

- Data Analysis

Prior to starting the hands-on activities you will need to complete Step 1.

During Step 1 you will log into a VM that is deployed on Azure Lab Services. This will ensure that everyone has the same capabilities by using a standard Windows 11 VM that already has the software you need installed.

Sections:

Step 1 Login to Azure Lab Services

Step 2 Data Engineering

Step 3 Data Science Simple Starter

Step 4 Data Analysis

Step 5 Introducing Microsoft Fabric

Step 1 Login to Azure Lab Services

Here you will register, start, and log in to the Windows 11 desktop that you will use during the workshop.

  1. Click on this Registration Link for Azure Lab Services

Refer to the paper you get during the workshop for your userid and password.

You may be using an organization similar to this: @MngEnvMCAP004660.onmicrosoft.com

Enter your userid and click Next

labsignin

Enter your password and click Sign in

password

Note: You may be prompted for multifactor auth. Check with your instructor.

You should end up at a screen that looks like this.

adbfundamentals

Click on the stopped toggle on the VM to start your VM

startingvm

Once it says that it is Running, click on the icon that looks like a monitor

runningvm

Keep the download if your browser prompts you.

keep

Open the RDP file and click Connect

openrdp

Enter the password provided by your instructor and click OK

vmok

Click Yes

yes

Your Windows desktop should look like this. Notice the following applications that you will use during the workshop:

- Power BI Desktop
- Microsoft Azure Storage Explorer
- Git (to clone this repository to your VM)

win11desktop

Open up the Microsoft Edge browser on the desktop and open this GitHub repo to follow the rest of the directions from that remote desktop VM. If you prefer to cut and paste, here is the link:

https://github.com/DataSnowman/adbfundamentalsws

gitrepo

Step 2 Data Engineering

During this section of the workshop we will focus on reviewing a couple of Databricks notebooks and using them for Data Engineering activities.

Click on this link to access the Databricks environment for the workshop

When the sign-in pops up, click Sign in with Azure AD

signinwithaad

Note: You may be prompted for multifactor auth. Check with your instructor.

Your Azure Databricks Screen will look like this

adbscreen

I like the new UI, so you might want to click the Enable new UI toggle

We will review the items available on the left nav panel

reviewleftnavitems

For the workshop today we will focus on:

- Workspace
- Data
- Compute
- SQL Editor
- SQL Warehouses
- Partner Connect

To start, please clone or download a zip of the GitHub repo we are using in the workshop

clone

Git is installed on the VM so you can open a command prompt and paste in the following:

git clone https://github.com/DataSnowman/adbfundamentalsws.git

Go into your Workspace in Databricks, navigate to the three dots to the right of your username, and select Import

import

Click on Browse, navigate to where you cloned or downloaded the zip of the repo, select the dataengineering.dbc file, and click Open

usersdbc

Click Import

clickimport

Open the dataengineering folder and you should see these two notebooks

notobooks

Open the first notebook LoadCMSFiles

loadcmsfiles

In the first cell, uncomment (remove the #) the last item and modify it so it is your username:

user = ""

An example would be the following (be sure to use your own user so you don't collide with another user):

user = "usereighteen"

moduser
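
For reference, the first cell likely looks something like the sketch below; the usernames shown are illustrative, so uncomment and edit the line that matches the user you were assigned.

```python
# Uncomment (remove the #) exactly one line and set it to your assigned user.
# user = "userone"
# user = "usertwo"
user = "usereighteen"  # illustrative; replace with your own user
```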

IMPORTANT: Let's review what is going on in the notebook, and please run one cell at a time.

Before you run anything you need to attach the notebook to the cluster. Find the WorkshopNIS cluster and click Attach

attach

When you have finished running each cell, there should be Delta tables in the Data section of the left nav

data
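
If you want to double-check from the notebook itself, a quick listing along these lines should work (this assumes the tables are registered in the cms schema, as the SQL examples later in the workshop suggest):

```python
# List the tables registered in the cms schema so you can confirm
# your user-suffixed Delta tables were created.
spark.sql("SHOW TABLES IN cms").show(truncate=False)
```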

Open up Microsoft Azure Storage Explorer and check your Storage accounts to find the input source data in Bronze

bronze

The output of the notebook will be in Silver

silver

Congrats: You just finished reading the source csv files from the bronze container and writing the data as Delta/Parquet output to the silver container. You also created a Delta table that references those files.
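
In essence, the notebook did something along the lines of the sketch below. The container paths and table name here are illustrative; the real notebook builds them from the user variable you set in the first cell.

```python
# Read the source csv files from the bronze container (path is illustrative).
bronze_path = "abfss://bronze@<storageaccount>.dfs.core.windows.net/cms/"
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(bronze_path))

# Write the data as Delta (Parquet files plus a transaction log) to silver.
silver_path = f"abfss://silver@<storageaccount>.dfs.core.windows.net/cms/{user}"
df.write.format("delta").mode("overwrite").save(silver_path)

# Register a Delta table that references the files in silver.
spark.sql(f"CREATE TABLE IF NOT EXISTS cms.medicareinpatienthospitals{user} "
          f"USING DELTA LOCATION '{silver_path}'")
```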

Now you are going to take the Delta/Parquet data created in the first notebook and create three Delta tables (a sketch of the idea follows the list):

- Datedim
- DRGdim
- MedicareInpatientFact
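
To give a flavor of the approach before you open the notebook, a dimension table can be built as a CREATE TABLE ... AS SELECT over the silver data. For example, the DRG dimension is essentially the distinct code/description pairs; the table names below are illustrative and would carry your user suffix.

```python
# Build the DRG dimension as the distinct DRG code/description pairs
# (column names match the ad hoc query used later in this workshop).
spark.sql("""
    CREATE TABLE IF NOT EXISTS cms.drgdimusereighteen
    USING DELTA
    AS SELECT DISTINCT DRG_Cd, DRG_Desc
    FROM cms.medicareinpatienthospitalsusereighteen
""")
```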

Open the second notebook LoadGoldSchema

loadgoldschema

As in the other notebook, in the first cell uncomment (remove the #) the last item and modify it so it is your username:

user = ""

An example would be the following (be sure to use your own user so you don't collide with another user):

user = "usereighteen"

IMPORTANT: Let's review what is going on in the notebook, and please run one cell at a time.

When you have finished running each cell, there should be Delta tables in the Data section of the left nav

data

Check your Storage accounts to find the input source data in Silver

silver

The output of the notebook will be in the gold folder in the Silver container

gold

Congrats: You just completed creating the Delta/Parquet output that we will use in the Data Analysis section later in the workshop.

Let's now look at the Data Science capabilities in Azure Databricks.

Step 3 Data Science Simple Starter

Go into your Workspace in Databricks, navigate to the three dots to the right of your username, and select Import

import

Click on Browse, navigate to where you cloned or downloaded the zip of the repo, select the diabetesmlmodel.dbc file, and click Open

diabetesdbc

Click Import

importdiabetes

Open the dataengineering folder and you should see the new notebook. You could also have created a new folder in your Databricks workspace called datascience and imported the notebook into that folder.

diabetesnotebook

Open the diabetesmlmodel notebook

loaddiabetes

In the first cell, uncomment (remove the #) the last item and modify it so it is your username:

user = ""

An example would be the following (be sure to use your own user so you don't collide with another user):

user = "usereighteen"

moduser18

IMPORTANT: Let's review what is going on in the notebook, and please run one cell at a time.

When you have finished running each cell, there should be a Delta table in the Data section of the left nav called diabetesageadb<username> (your username is concatenated onto the end of the table name so you can find your table among the other users' tables)
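
The per-user naming boils down to something like the sketch below; the DataFrame here is a stand-in for the one the notebook actually builds.

```python
# "user" is the value you set in the first cell of the notebook.
user = "usereighteen"

# Illustrative stand-in for the DataFrame the notebook builds.
df = spark.createDataFrame([(25, 0), (60, 1)], ["Age", "Diabetic"])

# Concatenate your username onto the table name so every attendee
# gets a distinct table in the shared workspace.
df.write.format("delta").mode("overwrite").saveAsTable(f"diabetesageadb{user}")
```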

Let's stop here and take a break before we resume. The diabetes notebook demonstrated several ways of accessing the data and the creation of a machine learning model. This is just a start, but be aware that much of the machine learning and deep learning that creates AI models like ChatGPT uses Python. You have just run three PySpark notebooks that immerse you in what Data Engineers and Data Scientists work with.

Step 4 Data Analysis

We are now going to leverage the Delta tables we created in the Data Engineering section.

Navigate to the SQL Editor in the Left Nav

sqleditor

Run some queries. Here is a sample query

SELECT * FROM cms.datedimuserone1

Note that you will need to put your username on the end of SELECT * FROM cms.datedim<username>

Try some ad hoc queries of your own. Here is another example to try:

SELECT DISTINCT DRG_Cd, DRG_Desc FROM cms.medicareinpatienthospitals ORDER BY DRG_Cd

Access Databricks SQL Warehouse using Power BI

Now let's try to access these Delta tables in Power BI Desktop

Navigate to Partner Connect in the Left Nav and click on Setup Power BI Desktop

partnerconnect

Click on Download connection file

downloadconnectionfile

Click on Open file

openfile

Make sure you select Azure Active Directory and then click on Sign in and enter your username and password. Then click Connect

aadsignin

Select your tables and click Load

loadpbi

Create the joins in your model

joins

Start Building a Power BI Report

pbireport

Please ask any questions you have about using Power BI. Our focus today is Azure Databricks, but many Data Analysts and Business users are comfortable using Power BI. Power BI can access the Delta tables you created in all the previous steps.

Step 5 Introducing Microsoft Fabric

Now we are going to connect to the Delta tables we just created in the workshop by creating a Shortcut to the Delta files in Azure Data Lake Storage Gen2 using Microsoft Fabric.

Log in to Microsoft Fabric in your browser on your Lab Services VM by opening another tab and clicking on the following link: Microsoft Fabric. Here is the link if you need to cut and paste it into the browser:

https://fabric.microsoft.com/

fabrichome

Click on Synapse Data Engineering

synapsede

Find your workspace by clicking on Workspaces on the left nav.

workspaces

Your workspace should look like this:

yourws

You will either be in NC Databricks Workshop 1-9

ncdw1-9

Or NC Databricks Workshop 10-18

ncdw10-18

Create a Lakehouse of your own

Create a Lakehouse by clicking on +New and choosing Lakehouse. It will appear as Lakehouse (Preview) while still in public preview.

lakehouse

As we have been doing all day, please name it user<number>. Click Create

An example would be "userone", "usertwo", etc.

useronelh

You should now have an empty Lakehouse that looks something like this:

yourlh

Click on New shortcut

newsc

Click on Azure Data Lake Storage Gen2

adls

Now you need to create a connection string

constring

Go back to Azure Storage Explorer and find your gold schema

goldtables

Right-click on a folder, scroll down, and select Properties

props

Select the DFS URL

dfs

It should look something like this:

https://fabaccelerla3wfpqdmcv7c.dfs.core.windows.net/silver/cms/MedicareInpatientHospitalsByProviderAndService/userone1/gold/medicareinpatientfact

For the URL, enter just the storage account portion:

https://fabaccelerla3wfpqdmcv7c.dfs.core.windows.net

Then click Next

url

In the Shortcut settings enter the Shortcut Name (for example, the name of the folder which represents the Delta table) and the Sub Path. Something like this:

Shortcut Name: medicareinpatientfact 
Sub Path: /silver/cms/MedicareInpatientHospitalsByProviderAndService/userone/gold/medicareinpatientfact

The path for your folder will vary

scname

Click Create

Now create shortcuts for the date and drg tables using the same process. The Storage URL will be the same as above, and each Sub Path will just need the last folder name changed, as shown below.
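
Assuming the second notebook wrote the other gold folders as datedim and drgdim, the two remaining shortcuts would look something like this (your user folder will differ):

Shortcut Name: datedim
Sub Path: /silver/cms/MedicareInpatientHospitalsByProviderAndService/userone/gold/datedim

Shortcut Name: drgdim
Sub Path: /silver/cms/MedicareInpatientHospitalsByProviderAndService/userone/gold/drgdim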

You should now have the following three tables in your Lakehouse:

goldschema

Explore the SQL Endpoint

While you are in your lakehouse you can go to the top right corner and switch to the SQL Endpoint

sqlep

You should now be in the Home screen of the SQL Endpoint

sqlephome

Click on New SQL query and try some SQL statements. These are T-SQL queries like you would run in SQL Server, but you are querying the Delta tables you created shortcuts to in your Lakehouse.

```sql
SELECT count(*)
FROM [medicarereports].[dbo].[medicareinpatientproviderfact];

SELECT year, count(twodigityear) AS records
FROM MedicareInpatientHospitalsByProviderAndService
GROUP BY year;

SELECT [Rndrng_Prvdr_CCN], [Rndrng_Prvdr_Org_Name], sum(Tot_Dschrgs) AS Discharges
FROM medicareinpatienthospitals
WHERE [Rndrng_Prvdr_Org_Name] LIKE '%Upmc%'
GROUP BY [Rndrng_Prvdr_CCN], [Rndrng_Prvdr_Org_Name]
ORDER BY Discharges DESC;
```

This last query was used to create a list of UPMC hospitals. We will now use a csv of that list to create a personal Delta table back in our Lakehouse.

Back to the Lakehouse

Go to the top right corner and switch to the Lakehouse

sqlep

Click on the three dots next to Files and click New subfolder

newsubfld

Name the Subfolder csv

newsfcreate

Click on the three dots next to the new csv subfolder you just created. Click Upload>Upload files.

upload

Click on the folder icon in the dropdown. Navigate to the upmchospitals.csv that was downloaded by your clone or download of this Git repo. Click Open. Click Upload.

uploadfiles

Click the x in the top right corner

clickx

The uploaded csv should now be in the csv subfolder

csvsub

Right click on the three dot by the csv file and select Load to Tables

loadtotables

Go with the default name and click Confirm

This process creates a new Delta table in the Lakehouse from the csv in Files. This also works with Parquet files.

csvtodelta
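
If you prefer code over the UI, the same load can be done from a Fabric notebook attached to your Lakehouse. Here is a minimal sketch, assuming the relative Files/ path that Fabric notebooks expose for the default Lakehouse and the default table name:

```python
# Read the uploaded csv from the Lakehouse Files area and save it as a
# managed Delta table (mirrors what Load to Tables does in the UI).
df = spark.read.option("header", "true").csv("Files/csv/upmchospitals.csv")
df.write.format("delta").mode("overwrite").saveAsTable("upmchospitals")
```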

Back to the SQL Endpoint

Go to the top right corner and switch to the SQL Endpoint

sqlep

You should now be in the Home screen of the SQL Endpoint

sqlephome

Click on Model at the bottom of the UI

model

Create the joins between:

- medicareinpatientfact Rndrng_Prvdr_CCN and the upmchospitals Rndrng_Prvdr_CCN
- medicareinpatientfact DRG_Cd and the drgdim DRG_Cd
- medicareinpatientfact twodigityear and the datedim twodigityear

Create a New Power BI report

You are now ready to build a Power BI report

Click New report on the top bar above the Model

Build a simple PBI report

fabricpbireport

Congrats, you have finished the workshop!
