Azure Databricks Fundamentals Workshop (adbfundamentalsws)

Use this repo for the Azure Databricks Fundamentals Workshop

Welcome to the workhop. The purpose of the workshop is to bring you up to speed on some of the Data Engineering, Data Science, and Data Analysis capabilities of Azure Databricks. You will be using resources in Azure for this workshop. Each of you will recieve a userid and password to access the resources. Please consult your instuctor if you have any questions or blockers.

The Workshop is broken into two parts:

- Data Engineering and Data Science

- Data Analysis

Prior to starting the hands-on activites you will need to complete Step 1.

During Step 1 you will log into a VM that is deployed on Azure Labservice. This will ensure that everyone has the same capabilites by using a standard Window 11 VM that already has the software you need installed.

Sections:

Step 1 Login to Azure Lab Services

Step 2 Data Engineering

Step 3 Data Science Simple Starter

Step 4 Data Analysis

Step 5 Introducing Microsoft Fabric

Step 1 Login to Azure Lab Services

Here you will to register, start, and login to your Windows 11 desktop that you will use during the workshop.

Click on this Registration Link for Azure Lab Services

Refer to the paper you get during the workshop for your userid and password.

you may be using a organization similar to this @MngEnvMCAP004660.onmicrosoft.com

Enter your userid and click Next

Enter your password and click Sign in

Note: You may be prompted for multifactor auth. Check with your instructor.

You should endup at a screen that looks like this.

Click on the stopped toggle on the VM to start your VM

Once it says that it is Running click on the icon that looks like a monitor

Keep the download if you browser prompts you.

Open the RDP file and click Connect

Enter the password provided by your instructor and click OK

Click Yes

Your Windows desktop should look like this. Notice the following applications that you will use during the workshop:

- Power BI Desktop
- Microsoft Azure Storage Explorer
- Git (to clone this repository to your VM)

Open up the Microsoft Edge browser on the desktop and open up this GitHub Repo to follow the rest of the directions from that remote desktop VM. If you prefer to cut and paste it here is the link

https://github.com/DataSnowman/adbfundamentalsws

Step 2 Data Engineering

During this section of the workshop we will focus on reviewing and looking at a couple of Databricks Notebooks and using them for Data Engineering activities.

Click on this link to access the Databricks environment for the workshop

When the sign in pops up click Sign in with Azure AD

Note: You may be prompted for multifactor auth. Check with your instructor.

Your Azure Databricks Screen will look like this

In like the new UI so you might want to click the Enable new UI toggle

We will review the items available on the left nav panel

For the workshop today we will focus on:

- Workspace
- Data
- Compute
- SQL Editor
- SQL Warehouses
- Partner Connect

To start this please clone or download a zip of the GitHub Repo we are using in the workshop

Git is installed on the VM so you can open a command prompt and paste in the following:

git clone https://github.com/DataSnowman/adbfundamentalsws.git

Go into your Workspace in Databricks and navigate to the three dots to the right of your username and select import

Click on browse and navigate to where you cloned or downloaded the zip of the repo and select the dataengineering.dbc file and click Open

Click Import

Open the dataengineering folder and you should see these two notebooks

Open the first notebook LoadCMSFiles

In the first cell modify the uncomment (remove the #) last item so it is your username

user = ""

example would be (besure to use your user so you don't colide with another user)

user = "usereighteen"

IMPORTANT Let review what is going on in the notebook and please run one cell at a time

Before you run anything you need to attach the notebook to the cluster. Find the WorkshopNIS cluster and click Attach

When you complete running each cell there should be delta tables in the Data section of the Leftnav

Open up Microsoft Azure Storage Explorer to check your Storage accounts to find the input source data in Bronze

The output of the notebook will be in Silver

Congrats: You just completed moving source csv files in the bronze container and wrote the data as Delta/Parquet output to the silver container. You created a Delta table that references those files.

Now you are going to take that Delta/Parquet data created in the first notebook and create three Delta tables:

- Datedim
- DRGdim
- MedicareInpatientFact

Open the second notebook LoadGoldSchema

As in the other notebook, in the first cell modify the uncomment (remove the #) last item so it is your username

user = ""

example would be (besure to use your user so you don't colide with another user)

user = "usereighteen"

IMPORTANT Let review what is going on in the notebook and please run one cell at a time

When you complete running the each cell there should be delta tables in the Data section of the Leftnav

Check your Storage accounts to find the input source data in Silver

The output of the notebook will be in gold folder in the Silver container

Congrats: You just completed creating the Delta/Parquet output that we will use in the Data Analysis section later in the workshop.

Lets now look at the Data Science capabilities in Azure Databricks.

Step 3 Data Science Simple Starter

Go into your Workspace in Databricks and navigate to the three dots to the right of your username and select import

Click on browse and navigate to where you cloned or downloaded the zip of the repo and select the diabetesmlmodel.dbc file and click Open

Click Import

Open the dataengineering folder and you should see these the new notebook. You could of also created a new folder in you Databricks workspace call datascience and import the notebook into that folder.

Open the diabetesmlmodel notebook

In the first cell modify the uncomment (remove the #) last item so it is your username

user = ""

example would be (besure to use your user so you don't colide with another user)

user = "usereighteen"

IMPORTANT Let review what is going on in the notebook and please run one cell at a time

When you complete running the each cell there should be a delta table in the Data section of the Leftnav called diabetesageadb<username> (your username will be concatinated on the end of the table name so you can find your table amongst the other users tables)

Let's stop hear and take a break before we resume. The diabetes notebook demonstrated a bunch of ways of accessing the data and the creation of a machine learning model. This is just a start put be aware that much of the Machine Learning and Deep Learning that creates AI Models like ChatGPT use Python. You have just run three PySpark notebooks that emerse you in what Data Engineers and Data Scientists work with.

Step 4 Data Analysis

We are now going to leverage the Delta tables we created in the Data Engineering section.

Navigate to the SQL Editor in the Left Nav

Run some queries. Here is a sample query

SELECT * FROM cms.datedimuserone1

Note that you will need to put your username on the end of SELECT * FROM cms.datedim<username>

Try some adhoc queries of your own. Here is another example to try:

select distinct DRG_Cd, DRG_Desc from cms.medicareinpatienthospitals order by DRG_Cd

Access Databricks SQL Warehouse using Power BI

Now lets try to access these delta tables in Power BI desktop

Navigate to Partner Connect in the Left Nav and click on Setup Power BI Desktop

Click on Download connection file

Click on openfile

Make sure you select Azure Active Directory and then click on Sign in and enter your username and password. Then click Connect

Select your tables and click Load

Create the joins in your model

Start Building a Power BI Report

`Please ask any questions you have about using Power BI. Our focus today is Azure Databricks but many Data Analysts and Business users are comfortable using Power BI. Power BI can access the Delta table you created in all the previous steps.

Step 5 Introducing Microsoft Fabric

Now we are going to connect to the Delta tables we just created in the workshop by creating a Shortcut to the Delta files in Azure Data Lake Storage Gen2 using Microsoft Fabric.

Login to Microsoft Fabric in your browser on your lab services VM by opening another tab by clicking on the following link Microsoft Fabric. Here is the link if you need to cut and paste it into the browser

https://fabric.microsoft.com/

Click on Synapse Data Engineering

Find your workspace by clicking on Workspaces on the left nav.

Your workspace should look like this:

Your will either be in NC Databricks Workshop 1-9

Or NC Databricks Workshop 10-18

Create a Lakehouse of your own

Create a Lakehouse by clicking on +New and chosing Lakehouse. Lakehouse (Preview) while still in public preview

As we have been doing all day please name it user<number>. Click Create

An examlpe would be "userone", "Usertwo", etc

You should now have an empty Lakehouse that looks something like this:

Click on New shortcut

Click on Azure Data Lake Storage Gen2

Now you need to create a connection string

Go back to Azure Storage Explorer and find your gold schema

Right click on a folder and scroll down to properties and select properties

Select the DFS URL

It should look something like this:

https://fabaccelerla3wfpqdmcv7c.dfs.core.windows.net/silver/cms/MedicareInpatientHospitalsByProviderAndService/userone1/gold/medicareinpatientfact

For the URL enter just the storage account

https://fabaccelerla3wfpqdmcv7c.dfs.core.windows.net

Then click Next

In the Shortcut setting enter the Shortcut name (the name of the folder which represents the delta table for example) and the Subpath. Something like this:

Shortcut Name: medicareinpatientfact 
URL: /silver/cms/MedicareInpatientHospitalsByProviderAndService/userone/gold/medicareinpatientfact

The path for your folder will vary

Click Create

Now create shortcuts for the data and drg tables using the same process. The Storage URL will be the same as above and each Subpath will just need the last folder name changed.

You now should have the following 3 tables in your Lakehouse:

Explore the SQL Endpoint

While you are in your lakehouse you can go to the top right corner and switch to the SQL Endpoint

You now should be in the Home screen of the SQL Endpoint

Click on New SQL query and try some SQL statements. These are T-SQL queries like you would run in SQL Server, but you are querying the Delta tables you created shortcuts to in your Lakehouse.

SELECT count(*)
FROM [medicarereports].[dbo].[medicareinpatientproviderfact]

SELECT year, count(twodigityear) as records from MedicareInpatientHospitalsByProviderAndService group by year

select [Rndrng_Prvdr_CCN],[Rndrng_Prvdr_Org_Name], sum(Tot_Dschrgs) AS Discharges from medicareinpatienthospitals where [Rndrng_Prvdr_Org_Name] like ('%Upmc%') group by [Rndrng_Prvdr_CCN], [Rndrng_Prvdr_Org_Name] order by Discharges desc

This last query was used to create a list of UPMC hospitals. We will now use a csv of that list to create a personal Delta table back in our Lakehouse.

Back to the Lakehouse

Go to the top right corner and switch to the Lakehouse

Click on the three dots next to Files and click New subfolder

Name the Subfolder csv

Click on the three dots next to the new csv subfolder you just created. Click Upload>Upload files.

Click on the fold icon in the dropdown. Navigate to the upmchospitals.csv that was downloaded by your clone or download of this GitRepo. Click Open. Click Upload.

Click the x in the top right corner

The uploaded csv should now be in the csv subfolder

Right click on the three dot by the csv file and select Load to Tables

Go with the default name and click Confirm

This process creates a new Delta table in the Lakehouse from the csv in Files. This also works with Parquet files.

Back to the SQL Endpoint

Go to the top right corner and switch to the SQL Endpoint

You now should be in the Home screen of the SQL Endpoint

Click on Model on the bottom of the UI

Create the joins between:

medicareinpatientfact Rndrng_Prvdr_CNN and the upmchospitals Rndrng_Prvdr_CNN

medicareinpatientfact DRG_Cd and the drgdim DRG_Cd

medicareinpatientfact twodigityear and the datedim twodigityear

datasnowman / adbfundamentalsws Goto Github PK

adbfundamentalsws's Introduction

Azure Databricks Fundamentals Workshop (adbfundamentalsws)

Step 1 Login to Azure Lab Services

Step 2 Data Engineering

Step 3 Data Science Simple Starter

Step 4 Data Analysis

Access Databricks SQL Warehouse using Power BI

Step 5 Introducing Microsoft Fabric

Create a Lakehouse of your own

Explore the SQL Endpoint

Back to the Lakehouse

Back to the SQL Endpoint

Create a New Power BI report

Congrats you have finished the workshop!

adbfundamentalsws's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent