Deployment of the Purview ADB Lineage Solution Accelerator

To install the solution as a working connector, your data sources, Azure Databricks, and Azure Purview must already be set up and running.

Services Installed

Prerequisites

Installing the base connector requires that you have already configured the Databricks CLI for your Azure Databricks workspace.

Deployment Steps

Clone the repository into Azure cloud shell
Run the installation script
Post Installation
Download the OpenLineage Spark agent and configure with your Azure Databricks clusters

Clone the repository into Azure cloud shell

From the Azure Portal

At the top of the page, click the Cloud Shell icon

Click "Confirm" if the "Switch to PowerShell in Cloud Shell" pop up appears.
Change directory and clone this repository into the clouddrive directory. If this directory is not available, please follow these steps to mount a new clouddrive.
```
cd clouddrive
git clone https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator.git
```

<Private Preview only: remove before publication>

For internal testing you will require a GitHub access token. To generate one, follow the instructions at: https://catalyst.zoho.com/help/tutorials/githubbot/generate-access-token.html

When copying values out of GitHub to the cloud shell, be sure to choose the "copy as plain text" option.

On the first run of the git clone command, you will be prompted to enter your GitHub username and password.

Enter your GitHub username and press "Enter".

For the password, copy and paste the access token created in the previous step. NOTE: The password field will look empty after you enter the token.
```
Cloning into 'Purview-ADB-Lineage-Solution-Accelerator'...
Username for 'https://github.com': <UserName>
Password for 'https://<UserName>@github.com': <AccessToken>
```
Note: These steps are only required for MS internal employees accessing the private repo. They will not be required for customer access.

</Private Preview only>

Run the installation script

Set the Azure subscription you want to use:

az account set --subscription <SubscriptionID>

If needed, create a new working Resource Group:

az group create --name <ResourceGroupName> --location <ResourceGroupLocation>

Change into the deployment directory:

cd clouddrive/Purview-ADB-Lineage-Solution-Accelerator/deployment/infra/

Deploy solution resources:
```
az deployment group create --resource-group <ResourceGroupName> --template-file "./newdeploymenttemp.json" --parameters purviewName=<ExistingPurviewServiceName>
```
- A prompt will ask you to provide the following:
  - Prefix (this is added to service names)
  - Client ID & Secret (from the App ID required as a prerequisite; these can be random values if using Managed Identity)
  - Resource Tags (optional, in the following format: {"Name":"Value","Name2":"Value2"})
- This deployment will take approximately 5 minutes.
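
The Resource Tags value is easy to mistype by hand. As a minimal sketch, the Bash helper below assembles a string in the format shown above from Name=Value arguments. The function name `build_tags_json` is hypothetical and not part of the accelerator, and the helper does no escaping, so it assumes names and values contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# build_tags_json is a hypothetical helper, not part of the accelerator.
# It turns Name=Value arguments into {"Name":"Value",...}.
# Assumption: names and values contain no double quotes or backslashes.
build_tags_json() {
  local json="{" sep="" pair
  for pair in "$@"; do
    # Split each argument at the first "=" into a name and a value.
    json+="${sep}\"${pair%%=*}\":\"${pair#*=}\""
    sep=","
  done
  printf '%s}\n' "$json"
}

# Example: print a value that can be pasted at the Resource Tags prompt.
build_tags_json "Name=Value" "Name2=Value2"
```

The printed string can then be pasted at the Resource Tags prompt.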

Post Installation

If needed, change into the deployment directory:

cd clouddrive/Purview-ADB-Lineage-Solution-Accelerator/deployment/infra/

(Manual Configuration) After the installation finishes, you will need to add the service principal to the data curator role in your Purview resource. Follow the "Set up Authentication using Service Principal" documentation, using the Application Identity you created as a prerequisite to installation.
Install the necessary types into your Purview instance by running the following commands in Bash.

You will need:
- your Tenant ID
- your Client ID - used when you ran the installation script above
- your Client Secret - used when you ran the installation script above

purview_endpoint="https://<enter_purview_account_name>.purview.azure.com"
TENANT_ID="<TENANT_ID>" 
CLIENT_ID="<CLIENT_ID>" 
CLIENT_SECRET="<CLIENT_SECRET>"

acc_purview_token=$(curl https://login.microsoftonline.com/$TENANT_ID/oauth2/token --data "resource=https://purview.azure.net&client_id=$CLIENT_ID&client_secret=$CLIENT_SECRET&grant_type=client_credentials" -H Metadata:true -s | jq -r '.access_token')
purview_type_resp_spark_type=$(curl -s -X POST $purview_endpoint/catalog/api/atlas/v2/types/typedefs \
        -H "Authorization: Bearer $acc_purview_token" \
        -H "Content-Type: application/json" \
        -d @Atlas2.2.json )
echo $purview_type_resp_spark_type
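
If the token request fails, `jq -r '.access_token'` prints the literal string `null` (or nothing at all), and the later POST then fails with an authorization error that is harder to diagnose. A small guard such as the following sketch can catch this early. The function name `require_token` is hypothetical and not part of the accelerator.

```shell
#!/usr/bin/env bash
# require_token is a hypothetical helper, not part of the accelerator.
# It returns non-zero when the token variable holds no usable value.
require_token() {
  local token="$1"
  # jq -r prints the literal string "null" when the field is missing.
  if [ -z "$token" ] || [ "$token" = "null" ]; then
    echo "No access token returned; check TENANT_ID, CLIENT_ID and CLIENT_SECRET." >&2
    return 1
  fi
}

# Example with a placeholder value; in practice pass "$acc_purview_token".
require_token "sample-token" && echo "token looks usable"
```

In the commands above, this check would sit between the token request and the POST to the typedefs endpoint.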

Or, without a service principal, using the logged-in user:

purview_endpoint="https://<enter_purview_account_name>.purview.azure.com"

acc_purview_token=$(az account get-access-token --resource https://purview.azure.net | jq -r '.accessToken')
purview_type_resp_spark_type=$(curl -s -X POST $purview_endpoint/catalog/api/atlas/v2/types/typedefs \
        -H "Authorization: Bearer $acc_purview_token" \
        -H "Content-Type: application/json" \
        -d @Atlas2.2.json )
echo $purview_type_resp_spark_type
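
Both variants only echo the raw response, so a failed upload can go unnoticed. The sketch below shows one way to inspect the response body. It assumes that a failed request returns either an empty body or a JSON body containing an `errorCode` field, as Apache Atlas error responses do; other failure shapes (for example, gateway-level authorization errors) may look different. The function name `check_typedefs_response` is hypothetical.

```shell
#!/usr/bin/env bash
# check_typedefs_response is a hypothetical helper, not part of the accelerator.
# Assumption: a failure yields an empty body or JSON containing "errorCode".
check_typedefs_response() {
  local body="$1"
  if [ -z "$body" ]; then
    echo "Empty response from the typedefs endpoint." >&2
    return 1
  fi
  if printf '%s' "$body" | grep -q '"errorCode"'; then
    echo "Type upload reported an error: $body" >&2
    return 1
  fi
  echo "Type definitions accepted."
}

# Example with a placeholder body; in practice pass "$purview_type_resp_spark_type".
check_typedefs_response '{"entityDefs":[]}'
```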

Download the OpenLineage Spark agent and configure with your Azure Databricks clusters

You will need the default API / Host key configured on your Function app. To retrieve this:

Browse to your Function App in the Azure portal
Select Functions in the left-hand navigation
Click on the OpenLineageIn function
Select Function Keys in the left-hand navigation
Select Show Values, copy and save for use in the steps below

<Private Preview Only: Important - do not download the jar from the site. The jar file is provided for you in "deployment/infra/MSFT-0.xxxx" during the private preview>

To retrieve the ADB-WORKSPACE-ID, in the Azure portal, navigate to the Azure Databricks (ADB) service in your resource group. In the overview section, copy the ADB workspace identifier from the URL.
To retrieve the FUNCTION_APP_DEFAULT_HOST_KEY, go back to the resource group view and click on the Azure Function. In the right pane go to 'App Keys', then click on the show values icon and copy the _default key.
Follow the Databricks Install Instructions to download the latest release and to enable OpenLineage in Databricks.
1. Ensure the upload-to-databricks script matches your jar path before running
2. Replace Spark configuration (fourth bullet) with the following:
```
spark.openlineage.version 1
spark.openlineage.namespace <ADB-WORKSPACE-ID>#<DB_CLUSTER_ID>
spark.openlineage.host https://<FUNCTION_APP_NAME>.azurewebsites.net
spark.openlineage.url.param.code <FUNCTION_APP_DEFAULT_HOST_KEY>
```
3. The ADB-WORKSPACE-ID value should be the first part of the URL when navigating to Azure Databricks, not the workspace name. For example, if the URL is: https://adb-4630430682081461.1.azuredatabricks.net/, the ADB-WORKSPACE-ID should be adb-4630430682081461.1.
4. To support Spark 2 Job clusters, you must:
  - Add your Service Principal to the Spark 2 cluster
  - Assign the Service Principal as a contributor to the Databricks Workspace
5. You should store the API Host Key in a secure location. If you will be configuring individual clusters with the OpenLineage agent, you can use Azure Databricks secrets to store the key in Azure Key Vault and retrieve it as part of the cluster initialization script. For more information on this, see the Azure documentation.

After configuring the secret storage, the API key for OpenLineage can be configured in the Spark config, as in the following example: spark.openlineage.url.param.code {{secrets/secret_scope/Ol-Output-Api-Key}}

You can also configure the OpenLineage listener to run globally, so that any cluster that is created automatically runs the listener. To do this, you can use a global init script. <Insert link to OL global init setup doc when available>

Note: Global initialization cannot currently use values from the Azure Databricks Key Vault integration mentioned above. If using global initialization scripts, this key would need to be retrieved in the notebooks themselves, or hardcoded into the global init script.
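
To tie the Spark configuration steps together, the following Bash sketch derives the ADB-WORKSPACE-ID from a workspace URL and prints the four configuration lines described above. The function names, the sample cluster ID, and the sample Function App name are hypothetical placeholders. The sketch also assumes the workspace URL ends in `.azuredatabricks.net`, as in the example URL.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the accelerator.

# Derive the ADB-WORKSPACE-ID: the host part of the workspace URL
# without the ".azuredatabricks.net" suffix.
workspace_id_from_url() {
  local host="${1#*://}"   # drop the scheme, e.g. "https://"
  host="${host%%/*}"       # drop any trailing path
  printf '%s\n' "${host%.azuredatabricks.net}"
}

# Print the Spark configuration lines for one cluster.
openlineage_spark_conf() {
  local workspace_url="$1" cluster_id="$2" function_app="$3" key_ref="$4"
  local workspace_id
  workspace_id="$(workspace_id_from_url "$workspace_url")"
  cat <<EOF
spark.openlineage.version 1
spark.openlineage.namespace ${workspace_id}#${cluster_id}
spark.openlineage.host https://${function_app}.azurewebsites.net
spark.openlineage.url.param.code ${key_ref}
EOF
}

# Example with placeholder values; the key is passed as a secret reference.
openlineage_spark_conf "https://adb-4630430682081461.1.azuredatabricks.net/" \
  "0000-000000-example" "my-function-app" \
  "{{secrets/secret_scope/Ol-Output-Api-Key}}"
```

Passing the secret reference rather than the raw key keeps the host key out of shell history and out of the printed configuration.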
