Comments (13)
@singhniraj08 hi, not sure why this issue was closed - I did not knowingly close it. Thx for re-opening!
I think you need to make a few changes in `training_clients.py` to use `google.cloud.aiplatform_v1beta1`.
The client creation and `create_custom_job` need to be updated as shown in the `create_custom_job` example code. Also, `get_custom_job` needs to be updated as shown here.
Let us know if this works for you. Thank you!
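(For reference, the linked sample follows roughly the pattern below; a minimal sketch, where the endpoint, parent, and job payload are illustrative placeholders rather than the actual TFX code.)

```python
from google.cloud import aiplatform_v1beta1

# Minimal sketch of the v1beta1 create/get flow from the sample;
# project, region, and the job payload are placeholders.
client = aiplatform_v1beta1.JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)
parent = "projects/<project>/locations/us-central1"
custom_job = {
    "display_name": "example-custom-job",
    "job_spec": {"worker_pool_specs": []},  # fill in real worker pool specs
}
response = client.create_custom_job(
    aiplatform_v1beta1.CreateCustomJobRequest(parent=parent, custom_job=custom_job)
)
job = client.get_custom_job(
    aiplatform_v1beta1.GetCustomJobRequest(name=response.name)
)
print(job.state)
```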
@singhniraj08 Thank you for responding quickly. It's unlikely I would have stumbled upon the `create_custom_job` example code, so I really appreciate that! tl;dr: I made the changes, but the symptom is still the same, i.e. the job does not appear to be submitted to the VAI Persistent Resource cluster. The changes I made were in class `VertexJobClient` only - this may be the issue, i.e. should I have done the same for class `CAIPJobClient` as well, since it's a CAIP job that submits the VAI Trainer job?
In `training_clients.py`:

```python
from google.cloud.aiplatform_v1beta1 import JobServiceClient, CreateCustomJobRequest, GetCustomJobRequest
...
from google.cloud.aiplatform_v1beta1.types.custom_job import CustomJob
from google.cloud.aiplatform_v1beta1.types.job_state import JobState
```
Note that the existing TFX code did:

```python
self._client = gapic.JobServiceClient(
    client_options=dict(api_endpoint=self._region +
                        _VERTEX_ENDPOINT_SUFFIX))
```

I changed this to use the `v1beta1` import shown above, as per the example code:

```python
self._client = JobServiceClient(
    client_options=dict(api_endpoint=self._region +
                        _VERTEX_ENDPOINT_SUFFIX))
```
In `launch_job()`:

```python
request = CreateCustomJobRequest(
    parent=parent,
    custom_job=training_job,
)
response = self._client.create_custom_job(request)
```

In `get_job()`:

```python
request = GetCustomJobRequest(name=self._job_name)
return self._client.get_custom_job(request)
```
I could not see anywhere else that needed to be changed.
> I should have done the same for class `CAIPJobClient` also

If you set `enable_vertex=True` when calling `training_clients.get_job_client`, you don't have to. If you have already changed `JobServiceClient`, that seems to be enough. But the job is not routed to the persistent resource cluster, right?
I've found some configurations [1] that need to be done when using a persistent resource:

- Specify the `persistent_resource_id` parameter and set the value to the ID of the persistent resource (`PERSISTENT_RESOURCE_ID`) that you want to use.
- Specify the `worker_pool_specs` parameter such that the values of `machine_spec` and `disk_spec` for each resource pool match exactly with a corresponding resource pool from the persistent resource. Specify one `machine_spec` for single-node training and multiple for distributed training.
- Specify a `replica_count` less than or equal to the `replica_count` or `max_replica_count` of the corresponding resource pool, excluding the replica count of any other jobs running on that resource pool.

It seems that you have already specified the `persistent_resource_id`, but I have no idea whether `machine_spec` and `disk_spec` in `worker_pool_specs` match exactly with a corresponding resource pool from the persistent resource. Could you please check this? A sketch of a matching entry follows below.
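(A minimal sketch, assuming the matching rules above: a `worker_pool_specs` entry whose `machine_spec` and `disk_spec` mirror one resource pool of the persistent resource. All values are placeholders to be checked against your own pool.)

```python
# Hypothetical worker_pool_specs entry; every machine_spec/disk_spec value
# must equal the corresponding field of a pool on the persistent resource.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "a2-highgpu-1g",          # pool's machineType
            "accelerator_type": "NVIDIA_TESLA_A100",  # pool's acceleratorType
            "accelerator_count": 1,                   # pool's acceleratorCount
        },
        "disk_spec": {
            "boot_disk_type": "pd-ssd",   # pool's bootDiskType
            "boot_disk_size_gb": 100,     # pool's bootDiskSizeGb
        },
        "replica_count": 1,  # <= pool's replicaCount / maxReplicaCount
        "container_spec": {"image_uri": "<your-training-image>"},
    }
]
```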
@briron hi, thanks for the reply. Yes, I have added `persistent_resource_id`. This is accepted on `CustomJobSpec` with the `v1beta1` changes. The `worker_pool_specs` we provide are the same as before, and the persistent cluster was provisioned with the same machine type and GPU. The persistent cluster has a `replica_count` of 1 and a `maxReplicaCount` of 2 (so we can see if scaling works), and the `replica_count` in the `worker_pool_specs` is 1. The relevant parts of the `custom_config` passed by TFX Trainer are:
"ai_platform_training_args": {
"persistent_resource_id": "persistent-preview-test",
"project": "<redacted>",
"worker_pool_specs": [
{
"container_spec": {
"image_uri": "<redacted>"
},
"machine_spec": {
"accelerator_count": 1,
"accelerator_type": "NVIDIA_TESLA_A100",
"machine_type": "a2-highgpu-1g"
},
"replica_count": 1
}
]
},
which I think aligns with the VAI Persistent Resource cluster we provisioned:
```
$ gcloud beta ai persistent-resources list --project nbcu-disco-int-nft-003 --region us-central1
Using endpoint [https://us-central1-aiplatform.googleapis.com/]
---
createTime: '2023-08-30T10:09:35.302158Z'
displayName: persistent-preview-test
name: projects/<redacted>/locations/us-central1/persistentResources/persistent-preview-test
resourcePools:
- autoscalingSpec:
    maxReplicaCount: '2'
    minReplicaCount: '1'
  diskSpec:
    bootDiskSizeGb: 100
    bootDiskType: pd-ssd
  id: a2-highgpu-1g-nvidia-tesla-a100-1
  machineSpec:
    acceleratorCount: 1
    acceleratorType: NVIDIA_TESLA_A100
    machineType: a2-highgpu-1g
  replicaCount: '1'
startTime: '2023-08-30T10:14:34.743355734Z'
state: RUNNING
updateTime: '2023-08-30T10:14:36.212384Z'
```
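(For context, this `custom_config` reaches `training_clients.py` through the Vertex-enabled Trainer extension. A minimal sketch of that wiring, following the standard TFX-on-Vertex pattern; `module_file`, `example_gen`, and the step counts are illustrative placeholders, not our actual pipeline code.)

```python
from tfx import v1 as tfx

# Illustrative wiring: ai_platform_training_args is passed under
# TRAINING_ARGS_KEY, and ENABLE_VERTEX_KEY selects the Vertex job client.
vertex_job_spec = {
    "persistent_resource_id": "persistent-preview-test",
    "project": "<redacted>",
    "worker_pool_specs": [...],  # as shown above
}

trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
    module_file=module_file,                   # placeholder
    examples=example_gen.outputs["examples"],  # placeholder upstream component
    train_args=tfx.proto.TrainArgs(num_steps=100),
    eval_args=tfx.proto.EvalArgs(num_steps=5),
    custom_config={
        tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY: True,
        tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY: "us-central1",
        tfx.extensions.google_cloud_ai_platform.TRAINING_ARGS_KEY: vertex_job_spec,
    },
)
```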
@adriangay Thanks for the detailed information. How about `disk_spec`? Your `custom_config` doesn't include that. Have you set this before?
@briron No, we don't normally set it. You think this is required? I can try that 😸
@adriangay Let's try. I'll investigate more apart from that.
@briron added:

```json
"disk_spec": {
    "boot_disk_size_gb": 100,
    "boot_disk_type": "pd-ssd"
}
```

to `worker_pool_specs`. The job was submitted, and I saw this logged:
```
INFO 2023-09-21T21:31:21.613955595Z [resource.labels.taskName: service] Waiting for job to be provisioned.
ERROR 2023-09-21T21:31:29.469867460Z [resource.labels.taskName: service] Resources are insufficient in region: us-central1. Please try a different region. If you use K80, please consider using P100 or V100 instead.
INFO 2023-09-21T21:32:00.155728709Z [resource.labels.taskName: service] Job failed.
INFO 2023-09-21T21:32:00.174426138Z [resource.labels.taskName: service] Waiting for job to be provisioned.
ERROR 2023-09-21T21:34:18.042487875Z [resource.labels.taskName: workerpool0-0] 2023-09-21 21:34:18.042273: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
ERROR 2023-09-21T21:34:27.701628499Z [resource.labels.taskName: workerpool0-0] I0921 21:34:27.701316 139906954585920 run_executor.py:139] Executor tfx.components.trainer.executor.GenericExecutor do: inputs: {'base_model': [Artifact(artifact: id: 8957739397649224294
...
INFO 2023-09-21T21:39:54.742265861Z [resource.labels.taskName: service] Job completed successfully.
```
"Resources are insufficient..." job failure, then resource was acquired after a retry and training started. So I'm assuming I got lucky on the retry and this is not running on the persistent cluster. I have no direct way of observing where execution occurred other than the labels for the log messages:
```json
{
  insertId: "1v5dcqofdf6b25"
  jsonPayload: {
    attrs: {
      tag: "workerpool0-0"
    }
    levelname: "ERROR"
    message: "2023-09-21 21:34:18.042273: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
"
  }
  labels: {
    compute.googleapis.com/resource_id: "741144473188239551"
    compute.googleapis.com/resource_name: "cmle-training-17744171291569385386"
    compute.googleapis.com/zone: "us-central1-f"
    ml.googleapis.com/job_id/log_area: "root"
    ml.googleapis.com/tpu_worker_id: ""
    ml.googleapis.com/trial_id: ""
    ml.googleapis.com/trial_type: ""
  }
  logName: "projects/nbcu-disco-int-nft-003/logs/workerpool0-0"
  receiveTimestamp: "2023-09-21T21:34:41.154586863Z"
  resource: {
    labels: {
      job_id: "3598193926936199168"
      project_id: "nbcu-disco-int-nft-003"
      task_name: "workerpool0-0"
    }
    type: "ml_job"
  }
  severity: "ERROR"
  timestamp: "2023-09-21T21:34:18.042487875Z"
}
```
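(One way to confirm what the submitted job actually requested is to fetch it back and inspect its spec; a minimal sketch, assuming the `v1beta1` `persistent_resource_id` field round-trips on `GetCustomJob`, using the job name from the logs above.)

```python
from google.cloud.aiplatform_v1beta1 import JobServiceClient, GetCustomJobRequest

# Fetch the submitted custom job and print the persistent resource ID its
# spec carries; this shows what was requested, not where Vertex placed it.
client = JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)
job = client.get_custom_job(GetCustomJobRequest(
    name="projects/636088981528/locations/us-central1/customJobs/3598193926936199168"
))
print(job.state, job.job_spec.persistent_resource_id)
```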
@briron The logging in the VAI pipeline console UI does not show any of the logging I see in Stackdriver logs. All I see in the VAI UI is:

```
2023-09-21 22:31:20.655 BST
I0921 21:31:20.655704 140564913690432 training_clients.py:419] Submitting custom job='tfx_20230921213120_fa456afa', parent='projects/nbcu-disco-int-nft-003/locations/us-central1' to Vertex AI Training.
2023-09-21 22:37:53.336 BST
I0921 21:37:53.336246 140564913690432 runner.py:123] Job 'projects/636088981528/locations/us-central1/customJobs/3598193926936199168' successful.
2023-09-21 22:37:59.688 BST
Tearing down training program.
```

The messages re: insufficient resources and retry are not there. But retries may be happening on other, successful jobs too, and I wouldn't see them regardless of where they ran?
@briron I've uploaded my modified `training_clients.py` module. Maybe you can check that I've made the changes correctly? Thank you.

training_clients.py.zip
If you're using `VertexJobClient`, it looks right. The job seems to be submitted correctly, but I have no idea how VAI works on its side.
@briron ok, thank you for investigating