
Comments (13)

adriangay commented on September 26, 2024

@singhniraj08 hi, not sure why this issue was closed - I did not knowingly close it. Thx for re-opening!


singhniraj08 commented on September 26, 2024

@adriangay,

I think you need to make a few changes in training_clients.py to use the google.cloud.aiplatform_v1beta1 client.

The client creation and create_custom_job calls need to be updated as shown in the create_custom_job example code. Also, get_custom_job needs to be updated as shown here.

Let us know if this works for you. Thank you!
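Putting those pieces together, the change would look roughly like this. This is only a minimal sketch against the v1beta1 GAPIC surface, not the exact TFX code; region, parent and the training_job payload are placeholders:

    from google.cloud.aiplatform_v1beta1 import (
        JobServiceClient,
        CreateCustomJobRequest,
        GetCustomJobRequest,
    )

    region = "us-central1"  # placeholder
    parent = f"projects/my-project/locations/{region}"  # placeholder
    training_job = {  # placeholder CustomJob payload
        "display_name": "tfx-training",
        "job_spec": {
            "worker_pool_specs": [{
                "machine_spec": {"machine_type": "n1-standard-4"},
                "replica_count": 1,
                "container_spec": {"image_uri": "gcr.io/my-project/trainer"},
            }],
        },
    }

    # The v1beta1 client accepts preview-only CustomJob fields
    # such as persistent_resource_id.
    client = JobServiceClient(
        client_options={"api_endpoint": region + "-aiplatform.googleapis.com"})

    response = client.create_custom_job(
        CreateCustomJobRequest(parent=parent, custom_job=training_job))
    job = client.get_custom_job(GetCustomJobRequest(name=response.name))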


adriangay commented on September 26, 2024

@singhniraj08 Thank you for responding quickly. It's unlikely I would have stumbled upon the create_custom_job example code on my own, so I really appreciate that! tl;dr: I made the changes, but the symptom is still the same, i.e. the job does not appear to be submitted to the VAI Persistent Resource cluster. The changes I made were in class VertexJobClient only. That may be the issue, i.e. should I have done the same for class CAIPJobClient as well, since it's a CAIP job that submits the VAI Trainer job?

In training_clients.py:

from google.cloud.aiplatform_v1beta1 import JobServiceClient, CreateCustomJobRequest, GetCustomJobRequest
...
from google.cloud.aiplatform_v1beta1.types.custom_job import CustomJob
from google.cloud.aiplatform_v1beta1.types.job_state import JobState

Note that existing TFX code did:

  self._client = gapic.JobServiceClient(
        client_options=dict(api_endpoint=self._region +
                            _VERTEX_ENDPOINT_SUFFIX))

I changed this to use the additional v1beta1 import shown above as per the example code:

   self._client = JobServiceClient(
        client_options=dict(api_endpoint=self._region +
                            _VERTEX_ENDPOINT_SUFFIX))

In launch_job():

    request = CreateCustomJobRequest(
        parent=parent,
        custom_job=training_job,
    )
    response = self._client.create_custom_job(request)

in get_job():

    request = GetCustomJobRequest(name=self._job_name)
    return self._client.get_custom_job(request)

I could not see anywhere else that needed to be changed.
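One thing the snippets above don't show is where the JobState import gets used: the polling side. Here is a rough sketch of how get_job()'s result is typically consumed, reusing self._client and self._job_name from the snippets above (hypothetical loop, not the exact TFX runner code):

    import time

    # Terminal states for a Vertex AI CustomJob.
    _DONE_STATES = {
        JobState.JOB_STATE_SUCCEEDED,
        JobState.JOB_STATE_FAILED,
        JobState.JOB_STATE_CANCELLED,
    }

    while True:
        job = self._client.get_custom_job(GetCustomJobRequest(name=self._job_name))
        if job.state in _DONE_STATES:
            break
        time.sleep(30)  # hypothetical poll interval; TFX uses its own value

    if job.state != JobState.JOB_STATE_SUCCEEDED:
        raise RuntimeError(f"Job ended in state {job.state}: {job.error}")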


briron commented on September 26, 2024

> I should have done the same for class CAIPJobClient also

If you set enable_vertex=True when calling training_clients.get_job_client, you don't have to.

If you've already changed JobServiceClient, that should be enough. But the job is not routed to the persistent resource cluster, right?
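For anyone following along, that selection happens when the client is constructed; a minimal sketch, assuming the import path of training_clients in the TFX tree:

    from tfx.extensions.google_cloud_ai_platform import training_clients

    # Returns the Vertex client (VertexJobClient) rather than the
    # legacy CAIP client when enable_vertex is True.
    client = training_clients.get_job_client(enable_vertex=True)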

I've found some configuration requirements [1] that need to be met when using a persistent resource:

- Specify the persistent_resource_id parameter and set the value to the ID of the persistent resource (PERSISTENT_RESOURCE_ID) that you want to use.
- Specify the worker_pool_specs parameter such that the values of machine_spec and disk_spec for each resource pool matches exactly with a corresponding resource pool from the persistent resource. Specify one machine_spec for single node training and multiple for distributed training.
- Specify a replica_count less than or equal to the replica_count or max_replica_count of the corresponding resource pool, excluding the replica count of any other jobs running on that resource pool.

It seems that you already specified the persistent_resource_id, but I have no idea whether machine_spec and disk_spec in worker_pool_specs match exactly with a corresponding resource pool from the persistent resource.

Could you please check this?

[1] https://cloud.google.com/vertex-ai/docs/training/persistent-resource-train#create_a_training_job_that_runs_on_a_persistent_resource
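One way to check the match programmatically is sketched below. This assumes the v1beta1 PersistentResourceServiceClient is available in your google-cloud-aiplatform version; the resource name is a placeholder:

    from google.cloud.aiplatform_v1beta1 import PersistentResourceServiceClient

    # Placeholder: your persistent resource's full resource name.
    name = ("projects/my-project/locations/us-central1/"
            "persistentResources/persistent-preview-test")

    client = PersistentResourceServiceClient(
        client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"})
    resource = client.get_persistent_resource(name=name)

    # The job's worker_pool_specs must use a machine_spec and disk_spec that
    # match one of these pools exactly, with replica_count within its limits.
    for pool in resource.resource_pools:
        print(pool.id, pool.machine_spec, pool.disk_spec, pool.replica_count)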


adriangay commented on September 26, 2024

@briron hi, thanks for the reply. Yes, I have added persistent_resource_id; it is accepted on CustomJobSpec with the v1beta1 changes. The worker_pool_specs we provide are the same as before, and the persistent cluster was provisioned with the same machine type and GPU. The persistent cluster has a replica_count of 1 and a maxReplicaCount of 2 (so we can see if scaling works), and the replica_count in the worker_pool_specs is 1. The relevant parts of the custom_config passed by TFX Trainer are:

  "ai_platform_training_args": {
    "persistent_resource_id": "persistent-preview-test",
    "project": "<redacted>",
    "worker_pool_specs": [
      {
        "container_spec": {
          "image_uri": "<redacted>"
        },
        "machine_spec": {
          "accelerator_count": 1,
          "accelerator_type": "NVIDIA_TESLA_A100",
          "machine_type": "a2-highgpu-1g"
        },
        "replica_count": 1
      }
    ]
  },

which I think aligns with the VAI Persistent Resource cluster we provisioned:

$ gcloud beta ai persistent-resources list --project nbcu-disco-int-nft-003 --region us-central1
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
---
createTime: '2023-08-30T10:09:35.302158Z'
displayName: persistent-preview-test
name: projects/<redacted>/locations/us-central1/persistentResources/persistent-preview-test
resourcePools:
- autoscalingSpec:
    maxReplicaCount: '2'
    minReplicaCount: '1'
  diskSpec:
    bootDiskSizeGb: 100
    bootDiskType: pd-ssd
  id: a2-highgpu-1g-nvidia-tesla-a100-1
  machineSpec:
    acceleratorCount: 1
    acceleratorType: NVIDIA_TESLA_A100
    machineType: a2-highgpu-1g
  replicaCount: '1'
startTime: '2023-08-30T10:14:34.743355734Z'
state: RUNNING
updateTime: '2023-08-30T10:14:36.212384Z'


briron commented on September 26, 2024

@adriangay
Thanks for the detailed information. How about disk_spec? Your custom_config doesn't include it. Have you set it before?


adriangay commented on September 26, 2024

@briron No, we don't normally set it. Do you think it's required? I can try that 😸


briron commented on September 26, 2024

@adriangay Let's try it. Meanwhile, I'll keep investigating separately.


adriangay commented on September 26, 2024

@briron added:

"disk_spec":  {
    "boot_disk_size_gb": 100, 
    "boot_disk_type": "pd-ssd"
}

to worker_pool_specs, job was submitted, saw this logged:

INFO 2023-09-21T21:31:21.613955595Z [resource.labels.taskName: service] Waiting for job to be provisioned.
ERROR 2023-09-21T21:31:29.469867460Z [resource.labels.taskName: service] Resources are insufficient in region: us-central1. Please try a different region. If you use K80, please consider using P100 or V100 instead.
INFO 2023-09-21T21:32:00.155728709Z [resource.labels.taskName: service] Job failed.
INFO 2023-09-21T21:32:00.174426138Z [resource.labels.taskName: service] Waiting for job to be provisioned.
ERROR 2023-09-21T21:34:18.042487875Z [resource.labels.taskName: workerpool0-0] 2023-09-21 21:34:18.042273: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
ERROR 2023-09-21T21:34:27.701628499Z [resource.labels.taskName: workerpool0-0] I0921 21:34:27.701316 139906954585920 run_executor.py:139] Executor tfx.components.trainer.executor.GenericExecutor do: inputs: {'base_model': [Artifact(artifact: id: 8957739397649224294
.
.
INFO 2023-09-21T21:39:54.742265861Z [resource.labels.taskName: service] Job completed successfully.

"Resources are insufficient..." job failure, then resource was acquired after a retry and training started. So I'm assuming I got lucky on the retry and this is not running on the persistent cluster. I have no direct way of observing where execution occurred other than the labels for the log messages:

{
  insertId: "1v5dcqofdf6b25"
  jsonPayload: {
    attrs: {
      tag: "workerpool0-0"
    }
    levelname: "ERROR"
    message: "2023-09-21 21:34:18.042273: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`."
  }
  labels: {
    compute.googleapis.com/resource_id: "741144473188239551"
    compute.googleapis.com/resource_name: "cmle-training-17744171291569385386"
    compute.googleapis.com/zone: "us-central1-f"
    ml.googleapis.com/job_id/log_area: "root"
    ml.googleapis.com/tpu_worker_id: ""
    ml.googleapis.com/trial_id: ""
    ml.googleapis.com/trial_type: ""
  }
  logName: "projects/nbcu-disco-int-nft-003/logs/workerpool0-0"
  receiveTimestamp: "2023-09-21T21:34:41.154586863Z"
  resource: {
    labels: {
      job_id: "3598193926936199168"
      project_id: "nbcu-disco-int-nft-003"
      task_name: "workerpool0-0"
    }
    type: "ml_job"
  }
  severity: "ERROR"
  timestamp: "2023-09-21T21:34:18.042487875Z"
}


adriangay commented on September 26, 2024

@briron The logging to the VAI pipeline console UI does not show any of the logging I see in Stackdriver logs. All I see in VAI UI is:

2023-09-21 22:31:20.655 BST
I0921 21:31:20.655704 140564913690432 training_clients.py:419] Submitting custom job='tfx_20230921213120_fa456afa', parent='projects/nbcu-disco-int-nft-003/locations/us-central1' to Vertex AI Training.
2023-09-21 22:37:53.336 BST
I0921 21:37:53.336246 140564913690432 runner.py:123] Job 'projects/636088981528/locations/us-central1/customJobs/3598193926936199168' successful.
2023-09-21 22:37:59.688 BST
Tearing down training program.

The messages about insufficient resources and the retry are not there. But retries may be happening on other, successful jobs too, and I wouldn't see them in this UI regardless of where the job ran?


adriangay commented on September 26, 2024

@briron I've uploaded my modified training_clients.py module. Maybe you can check that I've made the changes correctly? Thank you.
training_clients.py.zip


briron commented on September 26, 2024

If you're using VertexJobClient, it looks right. The job seems to be submitted correctly, but I have no idea how VAI handles it on its side.


adriangay commented on September 26, 2024

@briron ok, thank you for investigating
