
Deployment of the Purview ADB Lineage Solution Accelerator

This guide assumes that your data sources, Azure Databricks, and Azure Purview are already set up and running before you install the connector.

Services Installed

![High-level architecture of the services installed](high-level-architecture.svg)

Prerequisites

Installing the base connector requires that you have already configured the Databricks CLI against your Azure Databricks workspace.

Deployment Steps

  1. Clone the repository into Azure cloud shell
  2. Run the installation script
  3. Post Installation
  4. Download and configure OpenLineage Spark agent with your Azure Databricks clusters

From the Azure Portal

  1. At the top of the page, click the Cloud Shell icon

    ![Cloud Shell icon](ShellIcon.png)

    ![Select PowerShell](SelectPowershell.png)

    Click "Confirm" if a "Switch to PowerShell in Cloud Shell" pop-up appears.

    ![Cloud Shell confirmation dialog](CloudShellConfirm.png)

  2. Change into the clouddrive directory and clone this repository into it. If this directory is not available, please follow these steps to mount a new clouddrive.

    cd clouddrive
    git clone https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator.git
    

<Private Preview only: remove before publication>

  • For internal testing you will need a GitHub access token. To generate one, see: https://catalyst.zoho.com/help/tutorials/githubbot/generate-access-token.html

    When copying values out of GitHub to the cloud shell, be sure to choose the "copy as plain text" option.

    On the first run of the git clone command, you will be prompted to enter your GitHub username and password.

    Enter your GitHub username and hit "Enter".

    For the password, copy and paste the access token created in the previous step. NOTE: The password field will appear empty after you enter the token.

    Cloning into 'Purview-ADB-Lineage-Solution-Accelerator'...
    Username for 'https://github.com': <UserName>
    Password for 'https://<UserName>@github.com': <AccessToken>
    

    Note: These steps are only required for MS internal employees accessing the private repo. They will not be required for customer access.

</Private Preview only>


Run the Installation Script

  1. Set the Azure subscription you want to use:

    az account set --subscription <SubscriptionID>
  2. If needed, create a new working Resource Group:

    az group create --name <ResourceGroupName> --location <ResourceGroupLocation>
  3. Change into the deployment directory:

    cd clouddrive/Purview-ADB-Lineage-Solution-Accelerator/deployment/infra/
  4. Deploy solution resources:

    az deployment group create --resource-group <ResourceGroupName> --template-file "./newdeploymenttemp.json" --parameters purviewName=<ExistingPurviewServiceName>
    • A prompt will ask you to provide the following:
      • Prefix (this is added to service names)
      • Client ID & Secret (from the App ID required as a prerequisite; these can be random values if you are using Managed Identity)
      • Resource Tags (optional, in the following format: {"Name":"Value","Name2":"Value2"})
    The deployment will take approximately 5 minutes.
Post Installation

  1. If needed, change into the deployment directory:

    cd clouddrive/Purview-ADB-Lineage-Solution-Accelerator/deployment/infra/
  2. (Manual Configuration) After the installation finishes, you will need to add the service principal to the data curator role in your Purview resource. Follow this documentation to set up authentication using a service principal, with the Application Identity you created as a prerequisite to installation.

  3. Install necessary types into your Purview instance by running the following commands in Bash.

    You will need:

    • your Tenant ID
    • your Client ID - used when you ran the installation script above
    • your Client Secret - used when you ran the installation script above
    purview_endpoint="https://<enter_purview_account_name>.purview.azure.com"
    TENANT_ID="<TENANT_ID>"
    CLIENT_ID="<CLIENT_ID>"
    CLIENT_SECRET="<CLIENT_SECRET>"

    acc_purview_token=$(curl https://login.microsoftonline.com/$TENANT_ID/oauth2/token --data "resource=https://purview.azure.net&client_id=$CLIENT_ID&client_secret=$CLIENT_SECRET&grant_type=client_credentials" -H Metadata:true -s | jq -r '.access_token')
    purview_type_resp_spark_type=$(curl -s -X POST $purview_endpoint/catalog/api/atlas/v2/types/typedefs \
        -H "Authorization: Bearer $acc_purview_token" \
        -H "Content-Type: application/json" \
        -d @Atlas2.2.json)
    echo $purview_type_resp_spark_type

Or, without a service principal, using the logged-in user:

    purview_endpoint="https://<enter_purview_account_name>.purview.azure.com"

    acc_purview_token=$(az account get-access-token --resource https://purview.azure.net | jq -r '.accessToken')
    purview_type_resp_spark_type=$(curl -s -X POST $purview_endpoint/catalog/api/atlas/v2/types/typedefs \
        -H "Authorization: Bearer $acc_purview_token" \
        -H "Content-Type: application/json" \
        -d @Atlas2.2.json)
    echo $purview_type_resp_spark_type
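In either variant, the final echo prints the raw Atlas response. A minimal sketch of checking that response for success (the sample JSON below is illustrative only — the actual type names come from Atlas2.2.json, and a real Atlas error response carries an "errorCode" field):

```shell
# Illustrative sample of a successful typedef response (type names are
# examples, not the exact contents of Atlas2.2.json).
sample_resp='{"entityDefs":[{"name":"spark_process"},{"name":"spark_application"}]}'

# Atlas error responses include an "errorCode" field, so a quick sanity check:
if echo "$sample_resp" | grep -q '"errorCode"'; then
  echo "typedef registration failed: $sample_resp"
else
  echo "typedefs registered"
fi
```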

You will need the default API / Host key configured on your Function app. To retrieve this:

  1. Browse to your Function App in the Azure portal
  2. Select Functions in the left-hand navigation
  3. Click on the OpenLineageIn function
  4. Select Function Keys in the left-hand navigation
  5. Select Show Values, copy and save for use in the steps below

<Private Preview Only: Important - do not download the jar from the site. The jar file is provided for you in "deployment/infra/MSFT-0.xxxx" during the private preview>

  1. To retrieve the ADB-WORKSPACE-ID, in the Azure portal, navigate to the Azure Databricks (ADB) service in your resource group. In the overview section, copy the ADB workspace identifier from the URL (as highlighted below).

    ![ADB workspace identifier highlighted in the workspace URL](GetAdbEndpoint.png)

  2. To retrieve the FUNCTION_APP_DEFAULT_HOST_KEY, go back to the resource group view and click on the Azure Function. In the right pane go to 'App Keys', then click on the show values icon and copy the _default key.

    ![Function App keys blade showing the _default key](FunctionKeys.png)

  3. Follow the Databricks Install Instructions to download the latest release and to enable OpenLineage in Databricks.

    1. Ensure the upload-to-databricks script matches your jar path before running
    2. Replace Spark configuration (fourth bullet) with the following:
      spark.openlineage.version 1 
      spark.openlineage.namespace <ADB-WORKSPACE-ID>#<DB_CLUSTER_ID>
      spark.openlineage.host https://<FUNCTION_APP_NAME>.azurewebsites.net
      spark.openlineage.url.param.code <FUNCTION_APP_DEFAULT_HOST_KEY>
      
    3. The ADB-WORKSPACE-ID value should be the first part of the URL when navigating to Azure Databricks, not the workspace name. For example, if the URL is: https://adb-4630430682081461.1.azuredatabricks.net/, the ADB-WORKSPACE-ID should be adb-4630430682081461.1.
    4. To support Spark 2 Job clusters, you must:
    5. You should store the API Host Key in a secure location. If you will be configuring individual clusters with the OpenLineage agent, you can use Azure Databricks secrets to store the key in Azure KeyVault and retrieve it as part of the cluster initialization script. For more information on this, see the Azure documentation
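The ADB-WORKSPACE-ID extraction described in step 3 above can be sketched in shell (the URL is the example from the text):

```shell
# Extract the ADB-WORKSPACE-ID from an Azure Databricks workspace URL.
url="https://adb-4630430682081461.1.azuredatabricks.net/"
workspace_id=$(echo "$url" | sed -E 's#https://(adb-[0-9]+\.[0-9]+)\.azuredatabricks\.net/?#\1#')
echo "$workspace_id"   # adb-4630430682081461.1
```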

After configuring the secret storage, the API key for OpenLineage can be configured in the Spark config, as in the following example:

    spark.openlineage.url.param.code {{secrets/secret_scope/Ol-Output-Api-Key}}
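Putting the pieces together, the four Spark configuration lines can be rendered from your own values. The workspace ID below is the example from the text; the cluster ID and Function App name are hypothetical placeholders:

```shell
# Hypothetical values — substitute your own workspace ID, cluster ID, and
# Function App name.
ADB_WORKSPACE_ID="adb-4630430682081461.1"
DB_CLUSTER_ID="0123-456789-abcd123"
FUNCTION_APP_NAME="myfunctionapp"

# Render the Spark config block; the {{secrets/...}} reference is resolved by
# Databricks at cluster start, not by this script.
conf=$(cat <<EOF
spark.openlineage.version 1
spark.openlineage.namespace ${ADB_WORKSPACE_ID}#${DB_CLUSTER_ID}
spark.openlineage.host https://${FUNCTION_APP_NAME}.azurewebsites.net
spark.openlineage.url.param.code {{secrets/secret_scope/Ol-Output-Api-Key}}
EOF
)
echo "$conf"
```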

You can also configure the OpenLineage listener to run globally, so that any cluster which is created automatically runs the listener. To do this, you can utilize a global init script. <Insert link to OL global init setup doc when available>

Note: Global initialization cannot currently use values from Azure Databricks KeyVault integration mentioned above. If using global initialization scripts, this key would need to be retrieved in the notebooks themselves, or hardcoded into the global init script.
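If you do hardcode the key into a global init script, one sketch of that approach is shown below. This assumes the Databricks pattern of init scripts writing Spark driver defaults under /databricks/driver/conf — verify against the current Databricks documentation before using; here the script writes to a local demo directory instead:

```shell
# Sketch of a global init script fragment (assumption: Databricks picks up
# driver defaults from /databricks/driver/conf; we default to a local demo
# directory so this is runnable anywhere).
CONF_DIR="${DATABRICKS_CONF_DIR:-./demo-conf}"
mkdir -p "$CONF_DIR"

# HARDCODED_HOST_KEY and myfunctionapp are placeholders, not real values.
cat > "$CONF_DIR/00-openlineage-defaults.conf" <<'EOF'
[driver] {
  "spark.openlineage.version" = "1"
  "spark.openlineage.host" = "https://myfunctionapp.azurewebsites.net"
  "spark.openlineage.url.param.code" = "HARDCODED_HOST_KEY"
}
EOF
echo "wrote $CONF_DIR/00-openlineage-defaults.conf"
```

Hardcoding the host key this way trades the KeyVault indirection for simplicity; rotate the key by updating and re-running the init script.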
