
Comments (33)

iamhritik290799 commented on September 23, 2024

@CeliaGMqrz any update on this?


act-mreeves commented on September 23, 2024

@iamhritik290799 I am doing an ugly hack to work around this for now. You can figure out what your tracking args are by ssh-ing into a running tracking pod and running ps aux | cat. Thanks for troubleshooting this.

tracking:
  command: [ "/bin/sh", "-c" ]
  args:
    - >
      unset MLFLOW_S3_ENDPOINT_URL;
      mlflow server --host=0.0.0.0 --port=5000 --app-name=basic-auth
      --serve-artifacts --artifacts-destination=s3://$YOUR_BUCKET
      --backend-store-uri=postgresql://postgres:$(MLFLOW_DATABASE_PASSWORD)@$YOUR_DB_HOST:5432/mlflow_db;


iamhritik290799 commented on September 23, 2024

@andresbono I've raised a PR to mitigate this env issue.

#25294


iamhritik290799 commented on September 23, 2024

Hi @carrodher

I have checked, and after removing the MLFLOW_S3_ENDPOINT_URL env var from the deployment it works fine; I'm able to load artifacts in my MLflow experiments.

[screenshot]

I checked the MLflow documentation as well; it suggests unsetting the MLFLOW_S3_ENDPOINT_URL env var on the client system, but in our case removing the env var from the deployment is what worked.

[screenshot]

However, in the bitnami/mlflow helm chart's tracking deployment template there is no parameter to exclude this env var when it is not required; please find the reference below.

[screenshot]
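For reference, the chart currently renders the env var into the tracking deployment along the lines of the following (a simplified sketch; the exact value depends on your externalS3 protocol/host/port settings):

env:
  - name: MLFLOW_S3_ENDPOINT_URL
    value: "https://s3.amazonaws.com:443"  # always rendered, even when the bucket needs no custom endpoint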


dwolfeu commented on September 23, 2024

We have the same use case as @iamhritik290799 (artifacts saved in S3, no custom endpoint, AWS service user) and the suggested solution also worked for us.


dwolfeu commented on September 23, 2024

@aaj-synth Alas this solution doesn't work for us: If we remove externalS3.host from values.yaml, then we get the error message "No Artifacts Recorded Use the log artifact APIs to store file outputs from MLflow runs" in the Artifacts tab of the web interface. So far only the solution suggested by @iamhritik290799 has solved the issue for us.


act-mreeves commented on September 23, 2024

Thanks @iamhritik290799 ! Great work.


Jasper-Ben commented on September 23, 2024

The only thing that still baffles me is that I am 100% certain I tried including the region code before. But maybe something else was misconfigured at that time. Which I guess shows that it is good to challenge your own assumptions.

Edit: the original issue description also included the region, so apparently it was broken at some point but maybe got fixed in MLflow itself? Especially since, from an AWS S3 use-case perspective, nothing relevant has changed in the chart (as far as I can tell), as I already mentioned in #23959 (comment). Super weird.

Edit edit: Ok, the initial bug description also contained the bucket name in the host, so maybe we really just never tested it with "just" the regional endpoint. Well, hopefully the updated value description in my PR will eliminate any remaining question marks that dummies like myself could have in the future, and we can finally close this issue once and for all πŸ˜…


javsalgar commented on September 23, 2024

Hi,

Looking at the issue, it is not clear to me whether it is related to the Bitnami packaging of MLflow or to some issue with S3 inside the MLflow code itself. Did you check with the upstream developers?


iamhritik290799 commented on September 23, 2024

I checked, but they said there may be some issue in the Bitnami MLflow image that causes this error when trying to access artifacts from the S3 bucket.

BTW, we are using these args in our mlflow container:

containers:
  - args:
    - server
    - --backend-store-uri=postgresql://admin:$(MLFLOW_DATABASE_PASSWORD)@rds-instance-endpoint.rds.amazonaws.com:5432/mlflow
    - --artifacts-destination=s3://mlflow-artifacts
    - --serve-artifacts
    - --host=0.0.0.0
    - --port=5000
    - --app-name=basic-auth


carrodher commented on September 23, 2024

The issue may not be directly related to the Bitnami container image or Helm chart, but rather to how the application is being utilized or configured in your specific environment.

Having said that, if you think that's not the case and are interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

If you have any questions about the application itself, customizing its content, or questions about technology and infrastructure usage, we highly recommend that you refer to the forums and user guides provided by the project responsible for the application or technology.

With that said, we'll keep this ticket open until the stale bot automatically closes it, in case someone from the community contributes valuable insights.


iamhritik290799 commented on September 23, 2024

Team, have you made any changes to the helm chart to customize the env variables for the mlflow deployment?


andresbono commented on September 23, 2024

Even before we make any changes in the Helm chart, I think we should clarify in which specific cases or scenarios setting the MLFLOW_S3_ENDPOINT_URL env var for the tracking component is required. @iamhritik290799, @act-mreeves, can you help clarify that?

BTW @iamhritik290799, just to confirm, the screenshot you shared that suggests to unset the env-var comes from this documentation page, right? https://mlflow.org/docs/2.10.2/tracking/artifacts-stores.html#setting-bucket-region


act-mreeves commented on September 23, 2024

@andresbono My specific issue, which requires me to unset MLFLOW_S3_ENDPOINT_URL, occurs when using IRSA (IAM Roles for Service Accounts) and an AWS S3 bucket.
You are 100% correct that I have not exhaustively tested if and when this env var IS required.

What I think I am seeing is that only the bucket name is needed in this scenario; these are the relevant arguments given to the mlflow binary: --serve-artifacts --artifacts-destination=s3://my-mlflow-bucket.

tracking:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::1234567890:role/my-mlflow-s3-role

externalS3:
  useCredentialsInSecret: false
  protocol: "https"
  host: "my-mlflow-bucket.s3.us-east-1.amazonaws.com"
  bucket: "my-mlflow-bucket"
  serveArtifacts: true

In a nutshell: if you use external AWS S3, then per mlflow/mlflow#9523 (comment) you should not set MLFLOW_S3_ENDPOINT_URL; if you use minio (which is the default on this helm chart), you do use MLFLOW_S3_ENDPOINT_URL.
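To sketch the two scenarios as I understand them (values illustrative, not exhaustively tested):

# 1) In-cluster MinIO (chart default): MLFLOW_S3_ENDPOINT_URL must point at the MinIO service.
minio:
  enabled: true

# 2) AWS S3 via IRSA: no custom endpoint, so MLFLOW_S3_ENDPOINT_URL should be unset;
#    credentials come from the service-account role instead of a secret.
minio:
  enabled: false
externalS3:
  bucket: "my-mlflow-bucket"
  useCredentialsInSecret: false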

There is a lot of discussion here too: mlflow/mlflow#7104. I think @Gekko0114 would have more domain knowledge to explain what is going on here.


iamhritik290799 commented on September 23, 2024

@act-mreeves correct. Additionally, in my setup, I only require the "--artifacts-destination" argument, which I can define in the pod args section. However, I do not need the default "MLFLOW_S3_ENDPOINT_URL" environment variable that comes with the Bitnami Helm chart.

@andresbono My request is to either remove this env variable, which is currently set by default in the helm chart, or make it optional rather than mandatory.


iamhritik290799 commented on September 23, 2024

Hi @andresbono, any update on this?


andresbono commented on September 23, 2024

Thank you for all the additional information you provided. Based on that, I think the best option is to remove the environment variable from the deployment. When needed in some specific scenarios, users can always add it via tracking.extraEnvVars.
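For example, a minimal values.yaml sketch (the endpoint value is a made-up placeholder for an S3-compatible store that actually needs one):

tracking:
  extraEnvVars:
    - name: MLFLOW_S3_ENDPOINT_URL
      value: "https://minio.example.com:9000"  # hypothetical custom endpoint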

Would you like to send a PR addressing the change? Thank you!!


aaj-synth commented on September 23, 2024

Another solution is to not set externalS3.host at all when using IRSA. I tried it and it worked perfectly.
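A sketch of what I mean (bucket name illustrative; the host key is deliberately omitted):

externalS3:
  # no "host" set, so the chart does not render MLFLOW_S3_ENDPOINT_URL
  bucket: "my-mlflow-bucket"
  useCredentialsInSecret: false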


aaj-synth commented on September 23, 2024

Okay, that's weird. I'm on chart version 1.0.2, using IRSA to provide access to the S3 bucket and RDS instance, and for me simply removing externalS3.host did the trick. In the pod definition I can no longer see the env var.

Environment:
    BITNAMI_DEBUG:                false
    MLFLOW_DATABASE_PASSWORD:     <redacted>
    AWS_STS_REGIONAL_ENDPOINTS:   regional
    AWS_DEFAULT_REGION:           <redacted>
    AWS_REGION:                   <redacted>
    AWS_ROLE_ARN:                 <redacted>
    AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token


andresbono commented on September 23, 2024

A team member will review the PR soon. Thank you so much for your contribution.


Jasper-Ben commented on September 23, 2024

Has someone already verified whether this actually fixes the issue?

We have now tried to remove @act-mreeves's workaround and update to the v1.0.3 helm chart (I believe the fix is already included there?) but we are now back to experiencing issues with artifact access.


Jasper-Ben commented on September 23, 2024

> Has someone already verified whether this actually fixes the issue?
>
> We have now tried to remove @act-mreeves's workaround and update to the v1.0.3 helm chart (I believe the fix is already included there?) but we are now back to experiencing issues with artifact access.

Nvm, figured it out πŸ‘


andresbono commented on September 23, 2024

FYI, #26462 may look like a regression of this issue, but it shouldn't be. Please check the comments in the PR for more information. TL;DR:

  • ❌ externalS3.host=mlflow-bucket-name.s3.eu-central-1.amazonaws.com
  • βœ… externalS3.host=s3.eu-central-1.amazonaws.com
  • βœ… externalS3.host=s3.amazonaws.com


Jasper-Ben commented on September 23, 2024

@andresbono I cannot confirm that: after upgrading to the latest release, we get this error yet again!

As stated before, MLFLOW_S3_ENDPOINT_URL should not be set when using AWS S3; see also: mlflow/mlflow#9523 (comment)

The change in #26462 once again causes MLFLOW_S3_ENDPOINT_URL to be set when using AWS S3. We have had to fall back to the initial workaround described in #23959 (comment).

Can we get this issue re-opened please...


andresbono commented on September 23, 2024

Hi @Jasper-Ben, could you share the value you are passing for externalS3.host? See #23959 (comment). You can redact it; I'm just interested in the format.

I don't know if you had a chance to check the comments of #26462, but we did some extensive testing and it worked for all the test cases, given the proper values were passed.


Jasper-Ben commented on September 23, 2024

> Hi @Jasper-Ben, could you share the value you are passing for externalS3.host? See #23959 (comment). You can redact it; I'm just interested in the format.
>
> I don't know if you had a chance to check the comments of #26462, but we did some extensive testing and it worked for all the test cases, given the proper values were passed.

Yes, I have read the comments and we are using s3.amazonaws.com as host.

What did these tests include? The initial connection test to S3 works (it did before the initial fix as well). The issue only appears when you try to actually access the artifacts of a job.

If that helps, this is our terraform config (without the workaround):

resource "helm_release" "k8s_mlflow" {
  name       = local.release_name
  namespace  = kubernetes_namespace.mlflow.metadata[0].name
  repository = "https://charts.bitnami.com/bitnami"
  chart      = "mlflow"
  version    = "1.4.22"
  values = [
    file("${path.module}/helm_values.yaml"),
    yamlencode(var.extra_helm_configuration),
    yamlencode({
      commonLabels = local.common_labels_k8s
      tracking = {
        auth = {
          username = var.tracking_username
          password = var.tracking_password
        }
        persistence = {
          enabled = false
        }
      },
      run = {
        persistence = {
          enabled = false
        }
      },
      externalS3 = {
        host   = "s3.amazonaws.com"
        bucket = local.artifact_bucket # this is the plain bucket name

        accessKeyID     = aws_iam_access_key.s3_access.id
        accessKeySecret = aws_iam_access_key.s3_access.secret
      },
      externalDatabase = {
        host                      = kubernetes_manifest.postgres.manifest.metadata.name
        user                      = keys(kubernetes_manifest.postgres.manifest.spec.users).0
        existingSecret            = "${local.postgres_user}.${local.postgres_name}.credentials.postgresql.acid.zalan.do"
        existingSecretPasswordKey = "password"
        database                  = "${keys(kubernetes_manifest.postgres.manifest.spec.databases).0}?sslmode=require"
        authDatabase              = "${keys(kubernetes_manifest.postgres.manifest.spec.databases).1}?sslmode=require"
      }
    })
  ]
}

Also, we have the following additional helm values:

minio:
  enabled: false
tracking:
  nodeAffinityPreset:
    type: hard
    key: node.kubernetes.io/lifecycle
    values:
      - normal
  service:
    type: "ClusterIP"
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  resources:
    requests:
      cpu: 2
      memory: 6Gi
    limits:
      cpu: 3
      memory: 10Gi
  auth:
    enabled: true
run:
  nodeAffinityPreset:
    type: hard
    key: node.kubernetes.io/lifecycle
    values:
      - normal
  source:
    type: "configmap"
postgresql:
  enabled: false


Jasper-Ben commented on September 23, 2024

We have been using s3.amazonaws.com from the beginning, btw, and we have also tried the regional endpoint. Neither works.

Also, I do not see how #26462 would fix anything compared to the pre-#25294 code.

#25294 caused the MLFLOW_S3_ENDPOINT_URL variable to be set only when the internal minio setup is used, which fixed it for AWS S3 users but broke it for other external S3-compatible storage solutions.

#26462, from an AWS S3 perspective, basically reverted the previous change, except that it now sets the MLFLOW_S3_ENDPOINT_URL variable under the following condition ("hidden" behind the include):

{{- if or .Values.minio.enabled .Values.externalS3.host -}}

This of course always evaluates to true for the AWS S3 use case, since we also need to set externalS3.host for mlflow to be configured to use S3 at all, so the MLFLOW_S3_ENDPOINT_URL variable is set again.

So basically we went full circle on this issue and it has been "fixed" for one use-case while breaking it for another (2x).

Maybe the addressing-style stuff from #26462 (comment) fixes things (I haven't fully understood / tested that yet), but just setting s3.amazonaws.com does not.
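For illustration, a paraphrased sketch of what that template renders (not the chart's literal source; details simplified):

{{- if or .Values.minio.enabled .Values.externalS3.host }}
- name: MLFLOW_S3_ENDPOINT_URL
  # resolves to the in-cluster MinIO service when minio.enabled,
  # otherwise to externalS3.protocol://externalS3.host:externalS3.port
  value: "..."
{{- end }}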


andresbono commented on September 23, 2024

Thank you @Jasper-Ben.

> What did these tests include? The initial connection test to S3 works (it did before the initial fix as well). The issue only appears when you try to actually access the artifacts of a job.

You can find what we tested here: #26462 (comment) (unfold Scenario 2). I specifically tested access to job artifacts; see the screenshot there. When I tested with the externalS3.host=s3.amazonaws.com value, it worked fine. There must be some relevant difference between my testing scenario and yours.

I share your concern about going in circles on this issue. My assumption was that setting the proper externalS3.host was enough, that is why merging #26462 made sense.

> Maybe the addressing-style stuff from #26462 (comment) fixes things (I haven't fully understood / tested that yet)

Please try that and let us know about any other updates you may have.



Jasper-Ben commented on September 23, 2024

Just to reiterate the status quo:

I set up a second test instance using the exact same configuration as mentioned in #23959 (comment).

The important bits:

  1. externalS3.host is set to s3.amazonaws.com
  2. externalS3.bucket is set to a plain bucket name

I then used the following example project to create an experiment:

import mlflow
from mlflow.models import infer_signature

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

mlflow.set_tracking_uri(uri="<MLFLOW_URI>")

# Load the Iris dataset
X, y = datasets.load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the model hyperparameters
params = {
    "solver": "lbfgs",
    "max_iter": 1000,
    "multi_class": "auto",
    "random_state": 8888,
}

# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)


# Create a new MLflow Experiment
mlflow.set_experiment("MLflow Quickstart")

# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the loss metric
    mlflow.log_metric("accuracy", accuracy)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")

    # Infer the model signature
    signature = infer_signature(X_train, lr.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="tracking-quickstart",
    )

(basically steps 3 and 4 from https://mlflow.org/docs/latest/getting-started/intro-quickstart/index.html)

This will cause the following error while trying to upload the artifacts to AWS S3:

mlflow boto3.exceptions.S3UploadFailedError: Failed to upload /tmp/tmpl8tifcwt/input_example.json to <BUCKET_NAME>/2/d4780a6c6c50403fab62785c7a08d8db/artifacts/iris_model/input_example.json: An error occurred (PermanentRedirect) when calling the PutObject operation: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.

So I was able to reproduce the issue on a fresh setup. I will now experiment with the addressing style.


Jasper-Ben commented on September 23, 2024

I changed the addressing style to path as suggested in #26462 (comment). Still, I get the same error message. So that does not seem to help.

Pinging @frittentheke for visibility.

The tracking pod looks like this now:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-08-21T13:48:36Z"
  generateName: iris-devops-mlflow-test-tracking-ffb6fd9f-
  labels:
    app.kubernetes.io/component: tracking
    app.kubernetes.io/instance: iris-devops-mlflow-test
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mlflow
    app.kubernetes.io/part-of: mlflow
    app.kubernetes.io/version: 2.15.1
    generator: Terraform
    helm.sh/chart: mlflow-1.4.22
    pod-template-hash: ffb6fd9f
  name: iris-devops-mlflow-test-tracking-ffb6fd9f-kqblg
  namespace: mlflow-test
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: iris-devops-mlflow-test-tracking-ffb6fd9f
    uid: 0a5ca0e4-48c7-4007-8d37-1edc0156792c
  resourceVersion: "755350757"
  uid: 709a958f-f5ee-4450-b172-e147e05153a3
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/lifecycle
            operator: In
            values:
            - normal
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: mlflow
              app.kubernetes.io/instance: iris-devops-mlflow-test
              app.kubernetes.io/name: mlflow
          topologyKey: kubernetes.io/hostname
        weight: 1
  automountServiceAccountToken: false
  containers:
  - args:
    - server
    - --backend-store-uri=postgresql://mlflow:$(MLFLOW_DATABASE_PASSWORD)@iris-devops-mlflow-test-postgres:5432/mlflow?sslmode=require
    - --artifacts-destination=s3://<BUCKET_NAME>
    - --serve-artifacts
    - --host=0.0.0.0
    - --port=5000
    - --expose-prometheus=/bitnami/mlflow/metrics
    - --app-name=basic-auth
    command:
    - mlflow
    env:
    - name: BITNAMI_DEBUG
      value: "false"
    - name: MLFLOW_DATABASE_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: mlflow.iris-devops-mlflow-test-postgres.credentials.postgresql.acid.zalan.do
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: root-user
          name: iris-devops-mlflow-test-externals3
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: root-password
          name: iris-devops-mlflow-test-externals3
    - name: MLFLOW_S3_ENDPOINT_URL
      value: https://s3.amazonaws.com:443
    - name: MLFLOW_BOTO_CLIENT_ADDRESSING_STYLE
      value: path
    image: docker.io/bitnami/mlflow:2.15.1-debian-12-r0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - pgrep
        - -f
        - mlflow.server
      failureThreshold: 5
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: mlflow
    ports:
    - containerPort: 5000
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 5
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: http
      timeoutSeconds: 5
    resources:
      limits:
        cpu: "3"
        memory: 10Gi
      requests:
        cpu: "2"
        memory: 6Gi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /app/mlruns
      name: mlruns
    - mountPath: /app/mlartifacts
      name: mlartifacts
    - mountPath: /bitnami/mlflow-basic-auth/basic_auth.ini
      name: rendered-basic-auth
      subPath: basic_auth.ini
    - mountPath: /bitnami/mlflow
      name: data
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - command:
    - bash
    - -ec
    - |
      #!/bin/bash
      retry_while() {
        local -r cmd="${1:?cmd is missing}"
        local -r retries="${2:-12}"
        local -r sleep_time="${3:-5}"
        local return_value=1

        read -r -a command <<< "$cmd"
        for ((i = 1 ; i <= retries ; i+=1 )); do
            "${command[@]}" && return_value=0 && break
            sleep "$sleep_time"
        done
        return $return_value
      }

      check_host() {
          local -r host="${1:-?missing host}"
          local -r port="${2:-?missing port}"
          if wait-for-port --timeout=5 --host=${host} --state=inuse $port ; then
             return 0
          else
             return 1
          fi
      }

      echo "Checking connection to iris-devops-mlflow-test-postgres:5432"
      if ! retry_while "check_host iris-devops-mlflow-test-postgres 5432"; then
          echo "Connection error"
          exit 1
      fi

      echo "Connection success"
      exit 0
    image: docker.io/bitnami/os-shell:12-debian-12-r27
    imagePullPolicy: IfNotPresent
    name: wait-for-database
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
  - command:
    - bash
    - -ec
    - |
      #!/bin/bash
      cp /bitnami/mlflow-basic-auth/basic_auth.ini /bitnami/rendered-basic-auth/basic_auth.ini
    image: docker.io/bitnami/mlflow:2.15.1-debian-12-r0
    imagePullPolicy: IfNotPresent
    name: get-default-auth-conf
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /bitnami/rendered-basic-auth
      name: rendered-basic-auth
  - command:
    - bash
    - -ec
    - |
      #!/bin/bash
      # First render the overrides
      render-template /bitnami/basic-auth-overrides/*.ini > /tmp/rendered-overrides.ini
      # Loop through the ini overrides and apply it to the final basic_auth.ini
      # read the file line by line
      while IFS='=' read -r key value
      do
        # remove leading and trailing spaces from key and value
        key="$(echo $key | tr -d " ")"
        value="$(echo $value | tr -d " ")"

        ini-file set -s mlflow -k "$key" -v "$value" /bitnami/rendered-basic-auth/basic_auth.ini
      done < "/tmp/rendered-overrides.ini"
      # Remove temporary files
      rm /tmp/rendered-overrides.ini
    env:
    - name: MLFLOW_DATABASE_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: mlflow.iris-devops-mlflow-test-postgres.credentials.postgresql.acid.zalan.do
    - name: MLFLOW_DATABASE_AUTH_URI
      value: postgresql://mlflow:$(MLFLOW_DATABASE_PASSWORD)@iris-devops-mlflow-test-postgres:5432/mlflow_auth?sslmode=require
    - name: MLFLOW_TRACKING_USERNAME
      valueFrom:
        secretKeyRef:
          key: admin-user
          name: iris-devops-mlflow-test-tracking
    - name: MLFLOW_TRACKING_PASSWORD
      valueFrom:
        secretKeyRef:
          key: admin-password
          name: iris-devops-mlflow-test-tracking
    - name: MLFLOW_BOTO_CLIENT_ADDRESSING_STYLE
      value: path
    image: docker.io/bitnami/os-shell:12-debian-12-r27
    imagePullPolicy: IfNotPresent
    name: render-auth-conf
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
    - mountPath: /bitnami/basic-auth-overrides
      name: basic-auth-overrides
    - mountPath: /bitnami/rendered-basic-auth
      name: rendered-basic-auth
  - args:
    - -m
    - mlflow.server.auth
    - db
    - upgrade
    - --url
    - postgresql://mlflow:$(MLFLOW_DATABASE_PASSWORD)@iris-devops-mlflow-test-postgres:5432/mlflow_auth?sslmode=require
    command:
    - python
    env:
    - name: MLFLOW_DATABASE_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: mlflow.iris-devops-mlflow-test-postgres.credentials.postgresql.acid.zalan.do
    image: docker.io/bitnami/mlflow:2.15.1-debian-12-r0
    imagePullPolicy: IfNotPresent
    name: upgrade-db-auth
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
  - command:
    - bash
    - -ec
    - |
      #!/bin/bash
      retry_while() {
        local -r cmd="${1:?cmd is missing}"
        local -r retries="${2:-12}"
        local -r sleep_time="${3:-5}"
        local return_value=1

        read -r -a command <<< "$cmd"
        for ((i = 1 ; i <= retries ; i+=1 )); do
            "${command[@]}" && return_value=0 && break
            sleep "$sleep_time"
        done
        return $return_value
      }

      check_host() {
          local -r host="${1:-?missing host}"
          local -r port="${2:-?missing port}"
          if wait-for-port --timeout=5 --host=${host} --state=inuse $port ; then
             return 0
          else
             return 1
          fi
      }

      echo "Checking connection to s3.amazonaws.com:443"
      if ! retry_while "check_host s3.amazonaws.com 443"; then
          echo "Connection error"
          exit 1
      fi

      echo "Connection success"
      exit 0
    image: 693612562064.dkr.ecr.eu-central-1.amazonaws.com/docker.io/bitnami/os-shell:12-debian-12-r27
    imagePullPolicy: IfNotPresent
    name: wait-for-s3
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 1001
      runAsNonRoot: true
      runAsUser: 1001
      seLinuxOptions: {}
      seccompProfile:
        type: RuntimeDefault
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /tmp
      name: tmp
  nodeName: ip-10-208-18-75.eu-central-1.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1001
    fsGroupChangePolicy: Always
  serviceAccount: iris-devops-mlflow-test-tracking
  serviceAccountName: iris-devops-mlflow-test-tracking
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tmp
  - emptyDir: {}
    name: mlruns
  - emptyDir: {}
    name: mlartifacts
  - configMap:
      defaultMode: 420
      name: iris-devops-mlflow-test-tracking-auth-overrides
    name: basic-auth-overrides
  - emptyDir: {}
    name: rendered-basic-auth
  - emptyDir: {}
    name: data


frittentheke commented on September 23, 2024

@Jasper-Ben

> mlflow boto3.exceptions.S3UploadFailedError: Failed to upload /tmp/tmpl8tifcwt/input_example.json to <BUCKET_NAME>/2/d4780a6c6c50403fab62785c7a08d8db/artifacts/iris_model/input_example.json: An error occurred (PermanentRedirect) when calling the PutObject operation: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.

I suppose the bucket resides in some region other than us-east-1, and AWS does not want you to keep using the global S3 hostname.
See e.g. thoughtbot/paperclip#2151 for how setting the endpoint to the correct regional endpoint fixes things.

See https://docs.aws.amazon.com/general/latest/gr/s3.html#s3_region for a list of endpoints.


Jasper-Ben commented on September 23, 2024

I figured it out (also thanks to @frittentheke).

It works when using a regional endpoint, regardless of the addressing style.

Turns out that the HTTP endpoint is always regional, in contrast to the s3:// endpoint (https://<bucket_name>.s3.eu-central-1.amazonaws.com vs s3://<bucket_name>). (Yes, that is f****n confusing.) I probably knew that at some point, but the information was purged from my brain, so I had to rediscover it. When the MLFLOW_S3_ENDPOINT_URL environment variable is set, MLflow uses the HTTP endpoint.

So the reason why @andresbono tested successfully with the host set to s3.amazonaws.com is that he just happened to test with a bucket deployed in us-east-1, the region AWS defaults to for HTTP endpoints when no region-specific endpoint is used (see: https://stackoverflow.com/questions/51611874/access-amazon-s3-bucket-without-region-end-point/51612461#51612461).

We use a bucket in eu-central-1, which is why just setting externalS3.host=s3.amazonaws.com breaks for us.

So basically the fix here is to always use the regional endpoint; everything else will just cause confusion. What I would do: delete/update comment #23959 (comment) so that only the regional endpoint is marked βœ…, and update the example at https://github.com/bitnami/charts/blob/main/bitnami/mlflow/README.md?plain=1#L451C74-L451C83 to use a regional endpoint, with a note to pick the appropriate regional endpoint from https://docs.aws.amazon.com/general/latest/gr/s3.html#s3_region. For the latter I will create a PR.
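In other words, a working configuration looks like this (bucket name and region are placeholders; pick the endpoint matching your bucket's region):

externalS3:
  host: "s3.eu-central-1.amazonaws.com"  # regional endpoint, not the global s3.amazonaws.com
  bucket: "my-mlflow-bucket"             # plain bucket name; no region, no hostname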

Also, maybe someone else could verify/reproduce my findings, just in case?

