
azure-synapse-analytics-workshop-400's Introduction

Azure Synapse Analytics Workshop (level 400, 4 days)

Event preflight checklist.

Workshop Agenda

Day 1: Ingest & Transform
08:00  Azure Synapse Analytics 101
08:30  Demo Walkthrough - Synapse Analytics In Action
08:45  Break
09:00  Design Presentation: Data Loading & Data Lake Organization (*)
10:00  Q&A / Digital Cohort Activity: Data Lake Design & Security Considerations (**)
11:00  Break
11:15  Build Hands-on: Data Integration Part 1 (***)
11:45 / 12:00  Main Break
13:00  Design Presentation: Data Transformations (*)
13:45  Q&A / Digital Cohort Activity: Data Engineering Discussion (**)
14:00  Build Hands-on: Data Integration Part 2 (***)
15:00  End of Day 1

Day 2: Optimize, Query & Secure
08:00  Design Presentation: DW Optimization (*)
08:45  Build Hands-on: DW Optimization Part 1 (***)
10:00  Q&A / Digital Cohort Activity: DW Optimization (**)
10:30  Break
10:45  Design Presentation: DW Optimization Part 2 (*)
11:15  Build Hands-on: DW Optimization Part 2 (***)
11:45 / 12:00  Main Break
12:45  Design Presentation: Security (*)
13:30  Q&A / Digital Cohort Activity: Security (**)
14:00  Break
14:15  Build Hands-on: Security (***)
14:45  Q&A
15:00  End of Day 2

Day 3: Work with Spark, Machine Learning, Monitor & Manage
08:00  Design Presentation: Synapse Spark (*)
09:00  Build Hands-on: Machine Learning (***)
10:00  Break
10:15  Build Hands-on: Spark (***)
11:00  Demo Walkthrough - CI/CD with Synapse Workspaces (**)
11:45 / 12:00  Main Break
12:45  Design Presentation: Monitor & Manage (*)
13:15  Q&A / Digital Cohort Activity: Monitor & Manage (**)
13:45  Break
14:00  Build Hands-on: Monitoring (***)
14:45  Closing (if 3-day event)
15:00  End of Day 3

Day 4: Challenge-based Learning
08:00  Challenge Introduction (****)
08:30  Begin POC Challenge (****)
09:30  Break
09:45  Continue POC Challenge (****)
11:00  Break
11:15  Continue POC Challenge (****)
11:45 / 12:00  Main Break
13:00  Finish POC Challenge (****)
14:00  Challenge Debrief (****)
14:30  Closing
15:00  End of Day 4
 
Key
*     SME Presentations (L400 Technical)
**    Q&A Discussion and small group cohort activity
***   Guided virtual lab experience with remote SME assistance
****  Challenge & team-based learning experience (hack)

Session tracks: Design (addresses the "why" behind designing this way; 8 hr 45 min total), Build (addresses the "how"; 7 hr 15 min total), Deploy (applies the "how" and "why" learning to a scenario; 5 hr 30 min total). Content levels: LEVEL 400 / LEVEL 200-300.

azure-synapse-analytics-workshop-400's People

Contributors

ciprianjichici, codingbandit, daronyondem, givenscj, jasonhorner, joelhulen, kcris, kylebunting, manesh-r, roxanagoidaci, srushti-714, thetrainingboss, zoinertejada


azure-synapse-analytics-workshop-400's Issues

L400 - Day 3 - Lab 6 - Exercise 1

When attempting to check if XGBoost is properly installed in the notebook, the following error is returned:

Exception : Not support to display this type of data: <class 'list'>
Traceback (most recent call last):
File "", line 891, in
File "", line 718, in display
File "", line 671, in _convert
Exception: Not support to display this type of data: <class 'list'>

image

Lab 1 / Exercise 1 / Task 1 / Step 2

You are asking the user to give the new linked service (to Azure Cosmos DB) the same name as the storage account, but when I tried that, it failed with "Name is not unique".

image

Lab 1 / Exercise 1 / Task 2 / Step 1

You're telling the user to create a new dataset, but you are not saying how to do it (clicking the ellipsis next to Datasets). That may be fine for an L400 course, but I'd reconsider for the sake of the user experience.

Lab 2 / Exercise 2 / Task 1 / Step 6

When I clicked "Import projections", I was shown a dialog

image

I picked the IR we created at the beginning of the exercise and clicked "Turn on debug".

DW Optimizer/Task 2/Step 4

The complex query is throwing an error:
SELECT
AVG(TotalProfit) as AvgMonthlyCustomerProfit
FROM
(
SELECT
S.CustomerId
,D.Year
,D.Month
,SUM(S.TotalAmount) as TotalAmount
,AVG(S.TotalAmount) as AvgAmount
,SUM(S.ProfitAmount) as TotalProfit
,AVG(S.ProfitAmount) as AvgProfit
FROM
[wwi_perf].[Sale_Partition01] S
join [wwi].[Date] D on
D.DateId = S.TransactionDateId
WHERE
D.Year in (2010, 2011, 2012, 2013, 2014)
GROUP BY
S.CustomerId
,D.Year
,D.Month
) T

Lab 2 / Exercise 3 / Task 3 / Step 7

For some reason, I'm seeing 5,216,822 rows as opposed to the 15M rows you are getting. It could be that I messed something up in the data flow, but I've reviewed everything and can't find any differences. I'd advise you to test it again and see whether you get the same results.

image

DW Optimization/Exercise 1/Task1/Step 5

It would be good to explicitly state that the user needs to run the query:

SELECT TOP 1000 * FROM
(
SELECT
S.CustomerId
,SUM(S.TotalAmount) as TotalAmount
FROM
[wwi_perf].[Sale_Heap] S
GROUP BY
S.CustomerId
) T
OPTION (LABEL = 'Lab03: Heap')

before querying sys.dm_pdw_exec_requests.
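For reference, a minimal sketch of that follow-up DMV query, reusing the 'Lab03: Heap' label from the statement above:

-- Find the labeled request in the dedicated SQL pool's request DMV;
-- the label value matches the OPTION (LABEL = ...) hint used above.
SELECT request_id, command, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Lab03: Heap';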

Missing .\artifacts\environment-setup\labfiles\AzureCred.ps1

Could you please post an example of what this AzureCred.ps1 file should look like, since it is missing from the repo? I'd like to get an idea of what I need to write so the repo's scripts can create the environment and all of the linked services.

DW Optimization Exercise1/Task1

Please explicitly state that the query needs to run on the dedicated SQL pool. Initially I didn't realize it was running on SQL on-demand, and it was throwing an error.

Lab 2 / Exercise 4 / Step 8

Possibly because I'm getting much less data (see the previous issue I logged), from this step forward I'm not getting any rows back from the queries.

Step 8:
image

Step 9:
image

Step 10:
image

Step 11:
image

Step 12:
image

Datasets

Could you please document (and respond to this issue with) which datasets need to be in the data lake for these labs to succeed?

DW Optimizer/Exercise2/Task2/Step 4

  1. Same issue: I can't see the query plan, and it asks me to download SSMS to view it.

  2. Also, it's not clear where in the query plan you can see the optimization resulting from the materialized view.

L400 - Lab 6 – Exercise 1

When running the job executions in Lab 6, Exercise 1, cell 7 returns an error. I wonder whether a previous lab includes an Azure AD integration step, or whether some steps are missing from this lab.
Creating the Spark temporary views for sales and products fails with an authentication error (cell 7). See below.

Error : com.microsoft.sqlserver.jdbc.SQLServerException: Login failed for user ''. ClientConnectionId:f934fdeb-4dbd-4e17-b8b4-f3e4c90d00fd at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:258) at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onEOF(tdsparser.java:256) at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:108) at com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:4290) at com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:3157) at com.microsoft.sqlserver.jdbc.SQLServerConnection.access$100(SQLServerConnection.java:82) at com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:3121) at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7151) at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2478) at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:2026) at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:1687) at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:1528) at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:866) at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:569) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54) at com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper$$anonfun$3.apply(SQLAnalyticsJDBCWrapper.scala:70) at com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper$$anonfun$3.apply(SQLAnalyticsJDBCWrapper.scala:70) at scala.util.Try$.apply(Try.scala:192) at com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper.createConnection(SQLAnalyticsJDBCWrapper.scala:70) at com.microsoft.spark.sqlanalytics.utils.Utils$.createConnection(Utils.scala:287) at com.microsoft.spark.sqlanalytics.read.SQLAnalyticsReader.com$microsoft$spark$sqlanalytics$read$SQLAnalyticsReader$$connection$lzycompute(SQLAnalyticsReader.scala:39)

It appears we are missing instructions for an authentication prerequisite, or it was not set up correctly as part of the lab: https://docs.microsoft.com/en-us/sql/connect/jdbc/connecting-using-azure-active-directory-authentication?view=sql-server-ver15

DW Optimization/Exercise 1/Task1/Step 4

It is not clear from the instructions what the user needs to do to use the EXPLAIN statement to get the data movement recommendations and the cost of the query operations.
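As a hedged illustration (not the lab's exact text), EXPLAIN is simply prefixed to the statement being analyzed, and WITH_RECOMMENDATIONS adds data movement recommendations; the query below reuses the Sale_Heap table from the earlier steps:

-- Returns an XML query plan with estimated costs, data movement operations,
-- and (with WITH_RECOMMENDATIONS) suggested changes such as distribution choices.
EXPLAIN WITH_RECOMMENDATIONS
SELECT S.CustomerId, SUM(S.TotalAmount) AS TotalAmount
FROM [wwi_perf].[Sale_Heap] S
GROUP BY S.CustomerId;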

CLI 2.0 Commands

In the 01-environment-setup.ps1 script, you are not using the newest Azure CLI 2.0 command. I had to update to the following:

$userName = ((az account show) | ConvertFrom-JSON).user.name

Versus the deprecated command shown below, at line 65:
$userName = ((az ad signed-in-user show) | ConvertFrom-JSON).UserPrincipalName

Lab 1 / Exercise 4 / Task 2 / Step 19

The pipeline run is failing for me with an error message:

{
"errorCode": "2200",
"message": "ErrorCode=UserErrorSqlDWCopyCommandError,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=SQL DW Copy Command operation failed with error 'COPY statement failed because non nullable column 'TransactionDate' cannot have the NULL value as its default.',Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=COPY statement failed because non nullable column 'TransactionDate' cannot have the NULL value as its default.,Source=.Net SqlClient Data Provider,SqlErrorNumber=105211,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=105211,State=1,Message=COPY statement failed because non nullable column 'TransactionDate' cannot have the NULL value as its default.,},],'",
"failureType": "UserError",
"target": "Copy sales",
"details": []
}

L400 – Lab 6 Exercise 1 – Documentation

This exercise highlights four basic things that need to be performed, but when you open the notebook there are about 51 cells to run. Does running these cells cover everything needed to complete the lab? I would suggest a clearer connection between what needs to be performed and what we are actually doing in the lab exercise.

Day 2/Lab2/Exercise 2/Part2

While I was able to pick another table and run the transaction, it would be great to add a bit more information about the use case here. Just executing the steps didn't make it clear to me how the table distribution impacts query performance.
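A small sketch of the kind of context that would help, assuming the hash-distributed and heap tables already used elsewhere in the labs (wwi_perf.Sale_Hash and wwi_perf.Sale_Heap); the example table name in the CTAS is hypothetical:

-- Compare how rows are spread across the 60 distributions for the two tables;
-- a hash-distributed table co-locates rows sharing the distribution key,
-- which avoids data movement when queries join or aggregate on that key.
DBCC PDW_SHOWSPACEUSED('wwi_perf.Sale_Hash');
DBCC PDW_SHOWSPACEUSED('wwi_perf.Sale_Heap');

-- Hypothetical CTAS showing how the distribution choice is declared
CREATE TABLE wwi_perf.Sale_Hash_Example
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM wwi_perf.Sale_Heap;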

Lab 1 / Exercise 6 / Task 2 / Step 14

My destination column names are slightly different from yours. For some reason, they do not show the type after the name, as yours do.

image

I double-checked that my source and sink match yours.

image

image

Lab 3/Exercise 2/Step 4

It would be good to provide some guidance on why this feature is important before the user begins this exercise: for example, why we need to turn on the statistics and why they matter to the query optimizer.
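For instance, a hedged sketch of the kind of explanation that would help (table and column names borrowed from earlier labs):

-- Check whether automatic statistics creation is enabled for the pool;
-- without up-to-date statistics the optimizer's cardinality estimates, and
-- therefore its plan choices, can be far off.
SELECT name, is_auto_create_stats_on
FROM sys.databases;

-- Manually create single-column statistics the optimizer can use
CREATE STATISTICS Sale_Hash_CustomerId_stats
ON wwi_perf.Sale_Hash (CustomerId);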

Lab 1 / Exercise 3 / Task 5 / Step 5

At first, my chart displayed differently from yours.

image

I noticed that in your chart the categories are shown in descending order; mine are not ordered at all. To get the same results, I added an ORDER BY clause to the query.

image
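For illustration, the shape of the change I made (table and column names here are hypothetical, since the lab's actual query isn't reproduced in this issue):

-- Adding ORDER BY ... DESC makes the chart categories come back sorted by the
-- measure, matching the guide's screenshot.
SELECT ProductCategory, SUM(ProfitAmount) AS TotalProfit
FROM SalesByCategory            -- hypothetical source for the chart
GROUP BY ProductCategory
ORDER BY TotalProfit DESC;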

Lab 1 / Exercise 4 / Task 1 / Header

It feels like the second paragraph is a duplicate and conveys the same idea as the first:

Task 1: Configure workload management classification
When loading a large amount of data, it is best to run only one load job at a time for fastest performance. If this isn't possible, run a minimal number of loads concurrently. Be sure that you allocate enough memory to the pipeline session. To do this, increase the resource class of a user which has permissions to rebuild the index on this table to the recommended minimum.

For fastest loading speed, run only one load job at a time. If that isn't feasible, run a minimal number of loads concurrently. If you expect a large loading job, consider scaling up your SQL pool before the load.
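For context, a minimal sketch of what the task is describing, with a hypothetical login name (asa.sql.import01) standing in for whichever user the lab actually uses:

-- Option 1: put the loading user into a larger static resource class
EXEC sp_addrolemember 'staticrc60', 'asa.sql.import01';

-- Option 2: classify the user's requests into a larger resource class via workload management
CREATE WORKLOAD CLASSIFIER [wcELTLoads]
WITH (WORKLOAD_GROUP = 'xlargerc', MEMBERNAME = 'asa.sql.import01');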

DW Optimization/Exercise/Task1/Step 3

Small typo:

Check the structure of the Sale_Heap table, by right clicking on it in the Data bub and selecting New SQL script and then CREATE. Take a look at the script used to create the table:

Please correct it to Data Hub

Lab3/Exercise2/task3

Setting the result set cache to ON throws a syntax error. I am using the following command:
ALTER DATABASE [SQLPool01]
SET RESULT_SET_CACHING ON
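One likely cause, worth double-checking: at the database level this statement has to be executed while connected to the master database of the server, not to the SQL pool itself. A sketch:

-- Run while connected to master (not to SQLPool01); database-level
-- RESULT_SET_CACHING can only be changed from master.
ALTER DATABASE [SQLPool01]
SET RESULT_SET_CACHING ON;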

Lab 1 / Exercise 3 / Task 1 / Step 4

When trying to create the [external] schema, it tells me the schema already exists. It could be a bug in the lab guide, or it could be that someone already ran this lab and the environment wasn't cleaned up.
image
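If it helps triage, a hedged sketch of a guard that would let the step tolerate re-runs:

-- Create the schema only if it is not already present, so the step can be re-run safely
IF NOT EXISTS (SELECT 1 FROM sys.schemas WHERE name = 'external')
    EXEC('CREATE SCHEMA [external]');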

L400 – Lab 6 – Documentation

This lab can be very confusing. A user can spend quite a bit of time digging through all of the text just trying to figure out where to go to perform the lab. I would highlight more clearly that this lab needs to be performed in "asaworkspace03" and that "Synapse Analytics" can be found in the upper right. For example, there is so much text about ONNX, but only half a line telling you where to go to perform the lab.

Lab 3/Exercise 2/Step 5/Step 2

The query using the clustered index column retrieved the results in 1 second, versus the 15 seconds it took with the CCI option.

This is contrary to what is mentioned in the steps.
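To make the comparison easier to reproduce, a sketch of how both runs could be timed from the DMV (the label filter is hypothetical); a warm run or result set caching, if enabled, can make the second query look unrealistically fast:

-- After re-running each variant with OPTION (LABEL = '...'), compare elapsed times here
SELECT [label], total_elapsed_time, submit_time
FROM sys.dm_pdw_exec_requests
WHERE [label] LIKE 'Lab03%'
ORDER BY submit_time DESC;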

Lab 2 / Exercise 2 / Task 3 / Step 13

It's unclear to me how we're changing the ProductCategory column in this transformation. Both before and after the Derived column task, I see Décor shown the same way:

Before:
image

After: (notice how the Revenue column is well formed, but no changes to the category you highlighted in the instructions)
image

DW Optimizer/Lab 1/Exercise 2/Step 6

DBCC PDW_SHOWMATERIALIZEDVIEWOVERHEAD ( 'wwi_perf.mvCustomerSales' )

Unable to run this query as it says invalid object name:

Failed to execute query. Error: Invalid object name 'SQLPool01.wwi_perf.mvCustomerSales'.
Total execution time: 00:00:00.288
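In case the earlier step that creates the view was skipped, a hypothetical sketch of creating a materialized view with that name (this is not necessarily the lab's exact definition):

-- Hypothetical definition so that DBCC PDW_SHOWMATERIALIZEDVIEWOVERHEAD has an object to inspect
CREATE MATERIALIZED VIEW wwi_perf.mvCustomerSales
WITH (DISTRIBUTION = HASH(CustomerId))
AS
SELECT S.CustomerId,
       SUM(S.TotalAmount) AS TotalAmount,
       COUNT_BIG(*) AS RowCountBig
FROM [wwi_perf].[Sale_Partition01] S
GROUP BY S.CustomerId;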

Lab 1 / Exercise 1 / Task 1 / Step 1

You're asking the user to go to the Synapse workspace, but you don't provide a link. I know this is L400, but that doesn't mean everyone knows the URL web.azuresynapse.net. It may be helpful to provide a link.

Lab 1 / Exercise 2 / Task 2 / Step 1

Minor, but worth noting: you're asking the user to "navigate" to the folder where the Parquet file is, but we're already there from the previous exercise. Typically in other Microsoft courses, you would say "navigate to the folder if needed".

Lab 2 / Exercise 1 / Task 3 / Step 3

You are telling the user to leave configurations as they are, but for every field you say "Compression type: Select none", "Column delimiter: Select Comma (,)", etc. It should not say "select", because it feels like we need to change something.

Empty Storage Containers with Cosmos DB Data

The Cosmos DB documents are not copying over via AzCopy. It appears there may not be any data at the path below:

https://solliancepublicdata.blob.core.windows.net/wwi-02/online-user-profiles-02/

Would you please verify if this is the correct path? I can modify the code below to use the updated path.

The relevant PowerShell code is as follows:
# Copy each public source directory into the data lake, preserving the folder layout
foreach ($dataDirectory in $dataDirectories.Keys) {

    $vals = $dataDirectories[$dataDirectory].tostring().split(",");

    $source = $publicDataUrl + $vals[1];

    $path = $vals[0];

    $destination = $dataLakeStorageBlobUrl + $path + $destinationSasKey
    Write-Information "Copying directory $($source) to $($destination)"
    & $azCopyCommand copy $source $destination --recursive=true
}

Day 2/Lab2/Exercise 2/Part1

DBCC PDW_SHOWSPACEUSED('wwi_perf.Sale_Hash')

Running this command throws the following error:
Failed to execute query. Error: Could not find object 'Sale_Hash' or you do not have permission.
Total execution time: 00:00:00.311
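One quick check worth adding to the guide (a sketch): confirm the table actually exists under the expected schema before running the DBCC command; as far as I can tell it is created by a CTAS step earlier in the lab.

-- Verify the table was created by the earlier lab steps before inspecting its space usage
SELECT s.name AS schema_name, t.name AS table_name
FROM sys.tables AS t
JOIN sys.schemas AS s ON t.schema_id = s.schema_id
WHERE t.name = 'Sale_Hash';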
