
Comments (13)

adriangay commented on September 26, 2024

@singhniraj08 hi, not sure why this issue was closed - I did not knowingly close it. Thx for re-opening!


singhniraj08 commented on September 26, 2024

@adriangay,

I think you need to make a few changes in training_clients.py to use the google.cloud.aiplatform_v1beta1 client.

The client creation and create_custom_job calls need to be updated as shown in the create_custom_job example code. Also, get_custom_job needs to be updated as shown here.

Let us know if this works for you. Thank you!
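Putting those pieces together, the change would look roughly like this. This is only a minimal sketch against the v1beta1 GAPIC surface, not the exact TFX code; region, parent and the training_job payload are placeholders:

    from google.cloud.aiplatform_v1beta1 import (
        JobServiceClient,
        CreateCustomJobRequest,
        GetCustomJobRequest,
    )

    region = "us-central1"  # placeholder
    parent = f"projects/my-project/locations/{region}"  # placeholder
    training_job = {  # placeholder CustomJob payload
        "display_name": "tfx-training",
        "job_spec": {
            "worker_pool_specs": [{
                "machine_spec": {"machine_type": "n1-standard-4"},
                "replica_count": 1,
                "container_spec": {"image_uri": "gcr.io/my-project/trainer"},
            }],
        },
    }

    # The v1beta1 client accepts preview-only CustomJob fields
    # such as persistent_resource_id.
    client = JobServiceClient(
        client_options={"api_endpoint": region + "-aiplatform.googleapis.com"})

    response = client.create_custom_job(
        CreateCustomJobRequest(parent=parent, custom_job=training_job))
    job = client.get_custom_job(GetCustomJobRequest(name=response.name))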


adriangay commented on September 26, 2024

@singhniraj08 Thank you for responding quickly. It's unlikely I would have stumbled upon the create_custom_job example code on my own, so I really appreciate that! tl;dr: I made the changes, but the symptom is still the same, i.e. the job does not appear to be submitted to the VAI Persistent Resource cluster. The changes I made were in class VertexJobClient only. That may be the issue, i.e. should I have done the same for class CAIPJobClient as well, since it's a CAIP job that submits the VAI Trainer job?

In training_clients.py:

from google.cloud.aiplatform_v1beta1 import JobServiceClient, CreateCustomJobRequest, GetCustomJobRequest
...
from google.cloud.aiplatform_v1beta1.types.custom_job import CustomJob
from google.cloud.aiplatform_v1beta1.types.job_state import JobState

Note that existing TFX code did:

  self._client = gapic.JobServiceClient(
        client_options=dict(api_endpoint=self._region +
                            _VERTEX_ENDPOINT_SUFFIX))

I changed this to use the additional v1beta1 import shown above as per the example code:

   self._client = JobServiceClient(
        client_options=dict(api_endpoint=self._region +
                            _VERTEX_ENDPOINT_SUFFIX))

In launch_job():

    request = CreateCustomJobRequest(
        parent=parent,
        custom_job=training_job,
    )
    response = self._client.create_custom_job(request)

in get_job():

    request = GetCustomJobRequest(name=self._job_name)
    return self._client.get_custom_job(request)

I could not see anywhere else that needed to be changed.
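One thing the snippets above don't show is where the JobState import gets used: the polling side. Here is a rough sketch of how get_job()'s result is typically consumed, reusing self._client and self._job_name from the snippets above (hypothetical loop, not the exact TFX runner code):

    import time

    # Terminal states for a Vertex AI CustomJob.
    _DONE_STATES = {
        JobState.JOB_STATE_SUCCEEDED,
        JobState.JOB_STATE_FAILED,
        JobState.JOB_STATE_CANCELLED,
    }

    while True:
        job = self._client.get_custom_job(GetCustomJobRequest(name=self._job_name))
        if job.state in _DONE_STATES:
            break
        time.sleep(30)  # hypothetical poll interval; TFX uses its own value

    if job.state != JobState.JOB_STATE_SUCCEEDED:
        raise RuntimeError(f"Job ended in state {job.state}: {job.error}")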


briron commented on September 26, 2024

> I should have done the same for class CAIPJobClient also

If you set enable_vertex=True when calling training_clients.get_job_client, you don't have to.

If you've already changed JobServiceClient, that should be enough. But the job is not routed to the persistent resource cluster, right?
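For anyone following along, that selection happens when the client is constructed; a minimal sketch, assuming the import path of training_clients in the TFX tree:

    from tfx.extensions.google_cloud_ai_platform import training_clients

    # Returns the Vertex client (VertexJobClient) rather than the
    # legacy CAIP client when enable_vertex is True.
    client = training_clients.get_job_client(enable_vertex=True)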

I've found some configuration requirements [1] that need to be met when using a persistent resource:

- Specify the persistent_resource_id parameter and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
- Specify the worker_pool_specs parameter such that the values of machine_spec and disk_spec for each resource pool matches exactly with a corresponding resource pool from the persistent resource. Specify one machine_spec for single node training and multiple for distributed training.
- Specify a replica_count less than or equal to the replica_count or max_replica_count of the corresponding resource pool, excluding the replica count of any other jobs running on that resource pool.

It seems that you already specified the persistent_resource_id, but I have no idea whether machine_spec and disk_spec in worker_pool_specs match exactly with a corresponding resource pool from the persistent resource.

Could you please check this?

[1] https://cloud.google.com/vertex-ai/docs/training/persistent-resource-train#create_a_training_job_that_runs_on_a_persistent_resource
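One way to check the match programmatically is sketched below. This assumes the v1beta1 PersistentResourceServiceClient is available in your google-cloud-aiplatform version; the resource name is a placeholder:

    from google.cloud.aiplatform_v1beta1 import PersistentResourceServiceClient

    # Placeholder: your persistent resource's full resource name.
    name = ("projects/my-project/locations/us-central1/"
            "persistentResources/persistent-preview-test")

    client = PersistentResourceServiceClient(
        client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"})
    resource = client.get_persistent_resource(name=name)

    # The job's worker_pool_specs must use a machine_spec and disk_spec that
    # match one of these pools exactly, with replica_count within its limits.
    for pool in resource.resource_pools:
        print(pool.id, pool.machine_spec, pool.disk_spec, pool.replica_count)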


adriangay commented on September 26, 2024

@briron hi, thanks for the reply. Yes, I have added persistent_resource_id; it is accepted on CustomJobSpec with the v1beta1 changes. The worker_pool_specs we provide are the same as before, and the persistent cluster was provisioned with the same machine type and GPU. The persistent cluster has a replica_count of 1 and a maxReplicaCount of 2 (so we can see if scaling works), and the replica_count in the worker_pool_specs is 1. The relevant parts of the custom_config passed by TFX Trainer are:

  "ai_platform_training_args": {
    "persistent_resource_id": "persistent-preview-test",
    "project": "<redacted>",
    "worker_pool_specs": [
      {
        "container_spec": {
          "image_uri": "<redacted>"
        },
        "machine_spec": {
          "accelerator_count": 1,
          "accelerator_type": "NVIDIA_TESLA_A100",
          "machine_type": "a2-highgpu-1g"
        },
        "replica_count": 1
      }
    ]
  },

which I think aligns with the VAI Persistent Resource cluster we provisioned:

$ gcloud beta ai persistent-resources list --project nbcu-disco-int-nft-003 --region us-central1
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
---
createTime: '2023-08-30T10:09:35.302158Z'
displayName: persistent-preview-test
name: projects/<redacted>/locations/us-central1/persistentResources/persistent-preview-test
resourcePools:
- autoscalingSpec:
    maxReplicaCount: '2'
    minReplicaCount: '1'
  diskSpec:
    bootDiskSizeGb: 100
    bootDiskType: pd-ssd
  id: a2-highgpu-1g-nvidia-tesla-a100-1
  machineSpec:
    acceleratorCount: 1
    acceleratorType: NVIDIA_TESLA_A100
    machineType: a2-highgpu-1g
  replicaCount: '1'
startTime: '2023-08-30T10:14:34.743355734Z'
state: RUNNING
updateTime: '2023-08-30T10:14:36.212384Z'


briron commented on September 26, 2024

@adriangay
Thanks for the detailed information. How about disk_spec? Your custom_config doesn't include it. Have you set it before?


adriangay commented on September 26, 2024

@briron No, we don't normally set it. Do you think it's required? I can try that 😸


briron commented on September 26, 2024

@adriangay Let's try it. Meanwhile, I'll keep investigating separately.


adriangay commented on September 26, 2024

@briron added:

"disk_spec":  {
    "boot_disk_size_gb": 100, 
    "boot_disk_type": "pd-ssd"
}

to worker_pool_specs, job was submitted, saw this logged:

INFO 2023-09-21T21:31:21.613955595Z [resource.labels.taskName: service] Waiting for job to be provisioned.
ERROR 2023-09-21T21:31:29.469867460Z [resource.labels.taskName: service] Resources are insufficient in region: us-central1. Please try a different region. If you use K80, please consider using P100 or V100 instead.
INFO 2023-09-21T21:32:00.155728709Z [resource.labels.taskName: service] Job failed.
INFO 2023-09-21T21:32:00.174426138Z [resource.labels.taskName: service] Waiting for job to be provisioned.
ERROR 2023-09-21T21:34:18.042487875Z [resource.labels.taskName: workerpool0-0] 2023-09-21 21:34:18.042273: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
ERROR 2023-09-21T21:34:27.701628499Z [resource.labels.taskName: workerpool0-0] I0921 21:34:27.701316 139906954585920 run_executor.py:139] Executor tfx.components.trainer.executor.GenericExecutor do: inputs: {'base_model': [Artifact(artifact: id: 8957739397649224294
.
.
INFO 2023-09-21T21:39:54.742265861Z [resource.labels.taskName: service] Job completed successfully.

"Resources are insufficient..." job failure, then resource was acquired after a retry and training started. So I'm assuming I got lucky on the retry and this is not running on the persistent cluster. I have no direct way of observing where execution occurred other than the labels for the log messages:

{
  insertId: "1v5dcqofdf6b25"
  jsonPayload: {
    attrs: {
      tag: "workerpool0-0"
    }
    levelname: "ERROR"
    message: "2023-09-21 21:34:18.042273: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`."
  }
  labels: {
    compute.googleapis.com/resource_id: "741144473188239551"
    compute.googleapis.com/resource_name: "cmle-training-17744171291569385386"
    compute.googleapis.com/zone: "us-central1-f"
    ml.googleapis.com/job_id/log_area: "root"
    ml.googleapis.com/tpu_worker_id: ""
    ml.googleapis.com/trial_id: ""
    ml.googleapis.com/trial_type: ""
  }
  logName: "projects/nbcu-disco-int-nft-003/logs/workerpool0-0"
  receiveTimestamp: "2023-09-21T21:34:41.154586863Z"
  resource: {
    labels: {
      job_id: "3598193926936199168"
      project_id: "nbcu-disco-int-nft-003"
      task_name: "workerpool0-0"
    }
    type: "ml_job"
  }
  severity: "ERROR"
  timestamp: "2023-09-21T21:34:18.042487875Z"
}


adriangay commented on September 26, 2024

@briron The logging to the VAI pipeline console UI does not show any of the logging I see in Stackdriver logs. All I see in VAI UI is:

2023-09-21 22:31:20.655 BST
I0921 21:31:20.655704 140564913690432 training_clients.py:419] Submitting custom job='tfx_20230921213120_fa456afa', parent='projects/nbcu-disco-int-nft-003/locations/us-central1' to Vertex AI Training.
2023-09-21 22:37:53.336 BST
I0921 21:37:53.336246 140564913690432 runner.py:123] Job 'projects/636088981528/locations/us-central1/customJobs/3598193926936199168' successful.
2023-09-21 22:37:59.688 BST
Tearing down training program.

The messages about insufficient resources and the retry are not there. But retries may be happening on other, successful jobs too, and I wouldn't see them in this UI regardless of where the job ran?


adriangay commented on September 26, 2024

@briron I've uploaded my modified training_clients.py module. Maybe you can check that I've made the changes correctly? Thank you.
training_clients.py.zip


briron commented on September 26, 2024

If you're using VertexJobClient, it looks right. The job seems to be submitted correctly, but I have no idea how VAI handles it on its side.


adriangay commented on September 26, 2024

@briron ok, thank you for investigating
