
mlops-on-gcp's Introduction

ML Engineering on Google Cloud Platform

License

This repository maintains hands-on labs and code samples that demonstrate best practices and patterns for implementing and operationalizing production-grade machine learning workflows on Google Cloud Platform.

Navigating this repository

This repository is organized into two sections:

Mini workshops

This section contains hands-on labs for instructor-led ML Engineering mini workshops.

Code Samples

This section compiles samples demonstrating design and code patterns for a variety of ML Engineering topics.

mlops-on-gcp's People

Contributors

ajayhemnani, akshaykumarpatil-tudip, annasochandure-ssk, anupkumaryadav-tudip, benoitdherin, dougkelly, duygune, dylan-stark, enakai00, harshapatel-tudip, hemantsinalkar-ssk, jarokaz, kartikagrawal-tudip, kornelregius, krutimangal-tudip, ksalama, kylesteckler, maabel0712, maneerali-ssk, merlin1649, munnm, prafullkotecha, rosetn, siddharthchaurasia-tudip, swetasingh-tudip, vishvanathwaychal-tudip


mlops-on-gcp's Issues

Creating notebook instance - build error

Building the mlops-dev image (step 4 of environments_setup/mlops-kfp-mlmd/creating-notebook-instance) fails in step 2 of the build with the following error:

W: GPG error: http://packages.cloud.google.com/apt gcsfuse-bionic InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 8B57C5C2836F4BEB NO_PUBKEY FEEA9169307EA071
E: The repository 'http://packages.cloud.google.com/apt gcsfuse-bionic InRelease' is not signed.
W: GPG error: http://packages.cloud.google.com/apt cloud-sdk-bionic InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 8B57C5C2836F4BEB NO_PUBKEY FEEA9169307EA071
E: The repository 'http://packages.cloud.google.com/apt cloud-sdk-bionic InRelease' is not signed.

Adding the following line to the Dockerfile, before the apt-get update statement, seems to have fixed it:
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -

Not sure if this is a common error folks have encountered but wanted to flag it.

Creating instance of notebook

In the command ZONE=[YOUR_ZONE], what should [YOUR_ZONE] be replaced with?

Unable to recognize "STDIN": no matches for kind "Deployment" in version "apps/v1beta2"

It seems that the following script to create a Kubeflow instance no longer works on GCP:
https://github.com/GoogleCloudPlatform/mlops-on-gcp/tree/master/examples/mlops-env-on-gcp/provisioning-kfp

The issue is in the install.sh script and I am getting:
Unable to recognize "STDIN": no matches for kind "Deployment" in version "apps/v1beta2"

Current (not working):

# Deploy KFP to the cluster
export PIPELINE_VERSION=0.2.5
kustomize build \
    github.com/kubeflow/pipelines/manifests/kustomize/base/crds/?ref=$PIPELINE_VERSION | kubectl apply -f -
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kustomize build . | kubectl apply -f -

Working solution:

# Deploy KFP to the cluster
export PIPELINE_VERSION=1.0.4
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"

I am not an expert, so I am not opening a PR; just highlighting what worked for me in case it helps other users.

TFX on Cloud AI Platform Pipelines: Skaffold not deployed on Notebooks instance

In Lab 02, "Exercise: deploy your pipeline container to AI Platform Pipelines with TFX CLI" fails because Skaffold is not installed by default:

No executable skaffold
please refer to https://github.com/GoogleContainerTools/skaffold/releases for installation instructions.
No container image is built.
Traceback (most recent call last):
  File "/home/jupyter/.local/lib/python3.7/site-packages/tfx/tools/cli/container_builder/skaffold_cli.py", line 40, in __init__
    stdout=subprocess.DEVNULL)
  File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['which', 'skaffold']' returned non-zero exit status 1.

Skaffold can be installed by the following:

curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/v1.30.0/skaffold-linux-amd64 && chmod +x skaffold && sudo mv skaffold /usr/local/bin

ml-ops examples do not work

There are several issues with the ML ops examples.

The first requires modifications to the Docker image to work (and throws many dependency errors along the way). Furthermore, it does not describe where to find the link to the running Jupyter notebook or what functionality this example provides. (I can see there is a Jupyter instance running and a containerd instance running, but it's unclear how they are related.)

The second also fails, but in a way much harder to debug:

Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=400, message=Location "\"us-central1-a\"" does not exist.
Error on line: 97
Caused by: gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE --project $PROJECT_ID
That returned exit status: 1

These examples are clearly out of date and need to be checked against newly created vanilla projects.

TFX on Cloud AI Platform Pipelines: TFX pipeline run cannot write into defined bucket (403 Insufficient Permission)

In Lab 02, when the TFX Tuner SA is configured as documented:

CUSTOM_SERVICE_ACCOUNT = 'tfx-tuner-caip-service-account@qwiklabs-gcp-01-1057c4de4b13.iam.gserviceaccount.com'

Runs of the TFX pipeline fail because Pipelines can't write into the bucket (which is missing by default, see issue #124).

tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request: HTTP response code 403 with body '{
  "error": {
    "code": 403,
    "message": "Insufficient Permission",
    "errors": [
      {
        "message": "Insufficient Permission",
        "domain": "global",
        "reason": "insufficientPermissions"
      }
    ]
  }
}
'
	 when initiating an upload to gs://my-missing-bucket/tfx_covertype_continuous_training/

I'm somewhat baffled as to why, since the Tuner SA and a few others all have Storage Object Admin privileges on the bucket:

| Member | Name | Roles |
| --- | --- | --- |
| qwiklabs-gcp-01-1057c4de4b13@qwiklabs-gcp-01-1057c4de4b13.iam.gserviceaccount.com | Qwiklabs User Service Account | Storage Admin |
| [email protected] | Google Cloud ML Engine Service Agent | AI Platform Service Agent, Storage Object Admin |
| tfx-tuner-caip-service-account@qwiklabs-gcp-01-1057c4de4b13.iam.gserviceaccount.com | TFX Tuner CAIP Vizier | Storage Object Admin |

Unfortunately you can't really tell from the logs which SA it's using.

[TFX Standard Components Walkthrough] Outdated library dependencies

In the lab "TFX Standard Components Walkthrough", task #4 "Clone the example repo within your AI Platform Notebooks instance" can't be completed because the versions used in the install script are too old.

Specifically, the command

cd mlops-on-gcp/workshops/tfx-caip-tf23
./install.sh

results in an error; even pinning the library versions explicitly via pip install doesn't work. I think this lab needs to be updated to be compatible with currently supported versions.

Missing CS dataset

In mlops-on-gcp/workshops/tfx-caip-tf23/lab-01-tfx-walkthrough/solutions/lab-01.ipynb
DATA_ROOT = 'gs://workshop-datasets/covertype/small'

NotFoundError: Error executing an HTTP request: HTTP response code 404 with body '{ "error": { "code": 404, "message": "The requested project was not found.", "errors": [ { "message": "The requested project was not found.", "domain": "global", "reason": "notFound" } ] } } ' when reading gs://workshop-datasets/covertype/small

Also:

gs://workshop-datasets/: ERROR: (gcloud.alpha.storage.ls) gs://workshop-datasets not found: 404.

Bad argument name in "chicago_taxi_dag.py"

In "mlops-on-gcp/continuous_training/composer/solutions/chicago_taxi_dag.py", line 308

bq_check_rmse_query_op = BigQueryValueCheckOperator(     
                task_id="bq_value_check_rmse_task",
                sql=model_check_sql,
                pass_value=0,
                tolerence=0,
                use_legacy_sql=False,
                )

and "mlops-on-gcp/continuous_training/composer/labs/chicago_taxi_dag.py ", line 294

bq_check_rmse_query_op = BigQueryValueCheckOperator(      
                tolerence=0,     
                use_legacy_sql=False,
                #ADD YOUR CODE HERE
                )

the argument "tolerence" should be "tolerance".
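For reference, a corrected call would look like the following (a sketch only: `model_check_sql` and the surrounding DAG definition are assumed from the solution file):

```python
bq_check_rmse_query_op = BigQueryValueCheckOperator(
    task_id="bq_value_check_rmse_task",
    sql=model_check_sql,
    pass_value=0,
    tolerance=0,  # corrected spelling; the misspelled kwarg is not a recognized parameter
    use_legacy_sql=False,
)
```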

See the Apache Airflow documentation for `bigquery_check_operator`.

Missing `covertype_training_pipeline.py`

The file covertype_training_pipeline.py is missing from its appropriate folder mlops-on-gcp/workshops/kfp-caip-sklearn/lab-02-kfp-pipeline/exercises/pipeline/

`Pusher.png` mistakenly labeled as "InfraValidator"

The label on the orange box in images/Pusher.png is incorrect - it says "InfraValidator", but it should be "Pusher" instead.

(image: Pusher.png)

This image is used in the workshops/tfx-caip-tf23/lab-01-tfx-walkthrough/labs/lab-01.ipynb notebook.

Feature request - AI Platform prediction custom training routine

Hi -

First of all please allow me to congratulate you on putting together such a clear/concise training on such a complex topic.

May I also suggest that you look into adding a notebook on how to perform predictions using custom routines -
https://cloud.google.com/ai-platform/prediction/docs/custom-prediction-routines

I have tried implementing this but receive errors similar to the below -

https://stackoverflow.com/questions/59917440/create-version-failed-bad-model-detected-with-error-no-module-named-skle

regards
Sebastian

Lengthy run-time for lab-02-tfx-pipeline

Ran through lab-02-tfx-pipeline 3 times with the following run-times:

  • 1 hrs 46 min
  • 2 hrs 6 min
  • 1 hour 54 min

I was a bit concerned by this run-time on a small dataset (~500k examples), both for delivery and for motivating the use of CAIP Pipelines over the existing CAIP training and prediction services, so I wanted to flag it and discuss improvement opportunities.

There are a lot of deprecation warnings and non-fatal errors in the log. I am still learning the KFP interface compared to the SmartEngine UI, so I wasn't sure how to view the runtimes of individual components for profiling. From what I can tell, the ordering by wall time is Trainer > Evaluator > Transform > CsvExampleGen.

To improve performance, are there opportunities to:

  • Add additional worker machines / accelerator (GPU, TPU) to Trainer?
  • Add additional worker machines to Evaluator?

I see the GKE cluster created has 2 nodes with autoscaling enabled for up to 5. The cluster looks well within memory and CPU limits, but one of the nodes did have an autoscaler pod run. This guide https://cloud.google.com/ai-platform/pipelines/docs/configure-gke-cluster?hl=en_US#ensure mentions having at least 3 nodes (+1 node) with 2 CPUs and 4 GB (+1 GB each) of memory. Perhaps mirroring this config and allocating more resources upfront would yield performance benefits?

lab-02-kfp-pipeline: kpf BQ component run failure

When running the pipeline from the notebook two BQ components are failing:

(screenshot omitted)

It seems that the first component that succeeds creates a dataset, and the two other components that are failing are doing so because that dataset has already been created by that first component:

(screenshot omitted)

The pipeline succeeds though when run directly from the UI.

Error in Docker build - GPG error: http://packages.cloud.google.com/apt gcsfuse-bionic InRelease: The following signatures couldn't be verified because the public key is not available

Following the README at mlops-on-gcp/examples/mlops-env-on-gcp/creating-notebook-instance/README.md step by step results in below error:

W: GPG error: http://packages.cloud.google.com/apt gcsfuse-bionic InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY FEEA9169307EA071 NO_PUBKEY 8B57C5C2836F4BEB
E: The repository 'http://packages.cloud.google.com/apt gcsfuse-bionic InRelease' is not signed.
W: GPG error: http://packages.cloud.google.com/apt cloud-sdk-bionic InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY FEEA9169307EA071 NO_PUBKEY 8B57C5C2836F4BEB
E: The repository 'http://packages.cloud.google.com/apt cloud-sdk-bionic InRelease' is not signed.
The command '/bin/sh -c apt-get update -y && apt-get -y install kubectl' returned a non-zero code: 100
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: step exited with non-zero status: 100

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ERROR: (gcloud.builds.submit) build 9113de9f-5276-4960-8a47-994431acdf5c completed with status "FAILURE"
crogers@mbp-crogers lab-workspace % nano Dockerfile 
crogers@mbp-crogers lab-workspace % gcloud builds submit --timeout 15m --tag ${IMAGE_URI} .
Creating temporary tarball archive of 3 file(s) totalling 3.0 KiB before compression.
Uploading tarball of [.] to [gs://PROJECT-REDACTED_cloudbuild/source/1623070913.599017-56c228c79bb6488d876045d2e8ceb9bc.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/PROJECT-REDACTED/locations/global/builds/9571dabd-6d86-4139-9adf-960de21e3434].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/9571dabd-6d86-4139-9adf-960de21e3434?project=169360861282].


The workaround is to add `RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -` as the second line of the Dockerfile.

MLFLOW_TRACKING_EXTERNAL_URI is empty. It seems MLFlow Tracking Server is not properly deployed

Hi,

I hit an endless loop when installing mlops-composer-mlflow. As far as I understand the code, there is no specific script to install the MLflow Tracking Server.

Problematic paragraph

MLFLOW_TRACKING_EXTERNAL_URI="https://"
# Internal access from Composer to Mlflow
MLFLOW_URI_FOR_COMPOSER=="http://"
while [ "$MLFLOW_TRACKING_EXTERNAL_URI" == "https://" ] || [ "$MLFLOW_URI_FOR_COMPOSER" == "http://" ]
do
  echo "wait 5 seconds..."
  sleep 5s
  MLFLOW_TRACKING_EXTERNAL_URI="https://"$(kubectl describe configmap inverse-proxy-config -n mlflow | grep "googleusercontent.com")
  MLFLOW_URI_FOR_COMPOSER="http://"$(kubectl get svc -n mlflow mlflow -o jsonpath='{.spec.clusterIP}{":"}{.spec.ports[0].port}')
done

Command line output:

Waiting for MLflow Tracking server provisioned
wait 5 seconds...
Error on line: 234
Caused by: MLFLOW_TRACKING_EXTERNAL_URI="https://"$(kubectl describe configmap inverse-proxy-config -n mlflow | grep "googleusercontent.com")
That returned exit status: 1
Aborting...
wait 5 seconds...
Error on line: 234
Caused by: MLFLOW_TRACKING_EXTERNAL_URI="https://"$(kubectl describe configmap inverse-proxy-config -n mlflow | grep "googleusercontent.com")
That returned exit status: 1
Aborting...
^C

Debug code:

_<username>_@cloudshell:~/mlops-composer-mlflow _(<gcp_project_name>)_$ kubectl describe configmap inverse-proxy-config -n mlflow
Name:         inverse-proxy-config
Namespace:    mlflow
Labels:       app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=mlflow
Annotations:  meta.helm.sh/release-name: mlflow
              meta.helm.sh/release-namespace: mlflow

Data
====

BinaryData
====

Events:  <none>
I have managed to install the MLflow Docker image and the MLflow UI proxy image; everything up to that problematic line has worked. Please let me know how to fix the problem.

TFX on Cloud AI Platform Pipelines: Bucket not created by default

In Lab 02, for "Exercise: build your pipeline with the TFX CLI", the instructions state:

- `ARTIFACT_STORE` - An existing GCS bucket. You can use any bucket or use the GCS bucket created during installation of AI Platform Pipelines. The default bucket name will contain the `kubeflowpipelines-` prefix.

But no such bucket is created by default; you need to create it manually.

How to deploy *.h5 based model?

Is there a way to use an *.h5 model file when creating a model resource, and also a model version?
It looks like the only supported format is *.pb.

Notebook Instance Creation

I followed the steps to create the notebook. The steps completed successfully, but the instance does not appear under the "Notebooks" tab in "AI Platform".

LAB-01 Unable to create model resource & model version

Create model resource

Section "Deploy the model to AI Platform Prediction", create model resource.

The following code always executes the ELSE branch instead of the IF, which means the resource does NOT get created.

model_name = 'forest_cover_classifier'
labels = "task=classifier,domain=forestry"
filter = 'name:{}'.format(model_name)
models = !(gcloud ai-platform models list --filter={filter} --format='value(name)')
 
if not models:
    !gcloud ai-platform models create  $model_name \
    --regions=$REGION \
    --labels=$labels
else:
    print("Model: {} already exists.".format(models[0]))

The reason this fails is that the following command

gcloud ai-platform models list

generates the following output:

...@cloudshell:~ (my-project)$ gcloud ai-platform models list
Using endpoint [https://ml.googleapis.com/]
Listed 0 items.

The "Using endpoint…" string passes the filter, and therefore the models variable is not empty.

models == ['Using endpoint [https://ml.googleapis.com/]']

This causes the ELSE branch to execute, and therefore no resource is created.

Create model version

Section "Deploy the model to AI Platform Prediction", create model version.

A similar thing happens for creating the version:

model_version = 'v01'
filter = 'name:{}'.format(model_version)
versions = !(gcloud ai-platform versions list --model={model_name} --format='value(name)' --filter={filter})

if not versions:
  !gcloud ai-platform versions create {model_version} \
    --model={model_name} \
    --origin=$JOB_DIR \
    --runtime-version=1.15 \
    --framework=scikit-learn \
    --python-version=3.7
else:
    print("Model version: {} already exists.".format(versions[0]))

The version is not created either.
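One possible workaround (a sketch only; the helper name is hypothetical) is to strip gcloud's status lines out of the captured output before the emptiness check, so that an empty listing is actually treated as empty:

```python
def strip_gcloud_status_lines(lines):
    """Drop gcloud status output (e.g. 'Using endpoint [...]', 'Listed 0 items.')
    so that only real resource names remain."""
    noise_prefixes = ("Using endpoint", "Listed 0 items")
    return [line for line in lines
            if line.strip() and not line.strip().startswith(noise_prefixes)]

# With the captured output from the issue, the result is empty, so the
# `if not models:` branch now runs and the model resource gets created.
models = strip_gcloud_status_lines(
    ["Using endpoint [https://ml.googleapis.com/]", "Listed 0 items."])
```

The same filter fixes the version check. Alternatively, passing gcloud's global `--verbosity=error` flag may suppress the "Using endpoint" log line at the source.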

TFX on Cloud AI Platform Pipelines: kfp not installed by default

In Lab 02, for "Validate lab package version installation", the kfp package is not installed by default:

TFX version: 0.26.3
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'kfp'

Fortunately, the suggested `%pip install --upgrade --user kfp==1.0.4` command works.
