Comments (27)
This also didn't work for me the first time I tried it. Then I realised you also need to make sure Dataflow actually uses your custom image by adding f"--sdk_container_image={PIPELINE_IMAGE}" to BEAM_DATAFLOW_PIPELINE_ARGS.
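For reference, a minimal sketch of what that looks like. PIPELINE_IMAGE, the image URI, and the rest of the argument list are illustrative assumptions, not the exact pipeline code:

```python
# Sketch only: PIPELINE_IMAGE is an assumed variable holding your custom
# TFX image URI; the other args are illustrative.
PIPELINE_IMAGE = "gcr.io/my-project/my-tfx-image:latest"  # assumption

BEAM_DATAFLOW_PIPELINE_ARGS = [
    "--runner=DataflowRunner",
    # Without this flag, Dataflow workers fall back to the default Beam SDK
    # image instead of the custom TFX image.
    f"--sdk_container_image={PIPELINE_IMAGE}",
]
```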
from tfx.
@singhniraj08 I have added that flag but I am still getting the same error.
Dataflow job id: 2023-11-17_04_23_23-6315535713304255245
@singhniraj08, should this environment variable not be added to the TFX base image before the issue is closed? Is the TFX base image not intended to be used to run TFX jobs (on Vertex AI or Kubeflow)? Those TFX jobs might reasonably include Dataflow components.
@IzakMaraisTAL, yes, it would make more sense to add the environment variable to the TFX base image to avoid these issues in future. I have to make sure it doesn't break any other scenarios where the Dockerfile is used apart from Dataflow.
Reopening this issue. We will update this thread. Thank you for bringing this up!
I tried updating the ENV variable in the TFX Dockerfile and building the image, but it takes forever to build because of #6468: installing the TFX dependencies takes a long time and results in installation failure. Once that issue is fixed, I will be able to add the environment variable to the Dockerfile and test it. Thanks.
@jonathan-lemos, thank you for bringing this up. This should be fixed once we fix issue #6468.
I also observed this going from TFX 1.12.0 to 1.14.0. My only Dataflow component is the Transform component, so it is the one that gets stuck.
Some metrics (screenshots omitted): there is no throughput; CPU usage has a cyclical pattern; and it is not a lack of memory, since the job is only using a fraction of the available memory.
I came across a few troubleshooting steps to debug the issue further. Can you please follow the steps shown in Troubleshooting Dataflow errors and let us know what is causing this error? This will help us a lot in finding the root cause of the issue. Thank you!
These are the errors in the logs:
System logs:
- ima: Can not allocate sha384 (reason: -2)
Kubelet logs:
- "Error initializing dynamic plugin prober" err="error (re-)creating driver directory: mkdir /usr/libexec/kubernetes: read-only file system"
- "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
It also seems like it's struggling to download the container image.
@coocloud could you please provide us with the Dataflow job ID?
@AnuarTB This is the Dataflow job ID that failed:
2023-10-24_02_28_52-1220281879361432238
This issue was previously reported by one of the TFX users; apologies for missing that. Dataflow struggles to fetch the TFX image because the image is very large.
The solution is to set the flag --experiments=disable_worker_container_image_prepull in the pipeline options. Ref: similar issue.
Please try the above solution and let us know if you face any issues. Thank you!
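As a sketch, the flag can be added alongside the other Beam pipeline args. The variable name mirrors the one used earlier in this thread; the other values are illustrative:

```python
# Sketch only: skip the worker's image pre-pull check, which can time out
# while fetching a large image such as the TFX base image.
BEAM_DATAFLOW_PIPELINE_ARGS = [
    "--runner=DataflowRunner",
    "--experiments=disable_worker_container_image_prepull",
]
```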
Can you share your Dataflow job id?
Regarding 2023-11-17_04_23_23-6315535713304255245, it looks like the image is so big that you ran out of disk space. Try bumping it up.
Increasing the disk space also didn't seem to fix the issue.
Job id: 2023-11-20_05_04_45-14348954509617130059
The size of the image is around 11 GB, so surely the disk should be enough?
It is not about the image size. If you check your worker-startup log, one error occurred:
2023/11/20 13:11:49 failed to create a virtual environment. If running on Ubuntu systems, you might need to install `python3-venv` package. To run the SDK process in default environment instead, set the environment variable `RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1`. In custom Docker images, you can do that with an `ENV` statement. Encountered error: python venv initialization failed: exit status 1
Any update on this? I am facing the exact same issue highlighted above.
Check my comment in #6386 (comment). If you see that error, you can get a shell inside your container with docker run --rm -it --entrypoint=/bin/bash YOUR_CONTAINER_IMAGE and check whether venv is installed. Alternatively, when building your container, use an ENV statement to define RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1.
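For example, a minimal Dockerfile sketch. The base image tag is an assumption taken from later comments in this thread; adjust it to your TFX version:

```dockerfile
# Sketch: run the Beam SDK process in the image's default environment
# instead of creating a virtual environment at worker startup.
FROM gcr.io/tfx-oss-public/tfx:1.14.0
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
```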
Thanks @liferoad: I got the exact same error, so I built a custom image, which basically ran TFX 1.14.0 but added that ENV, and it all worked fine.
Thank you!!
@liferoad Thanks, my Dataflow job runs successfully after adding that environment variable.
Closing this issue, since it is resolved for you. Please take a look at the answers provided above; feel free to reopen and post your comments if you still have queries. Thank you!
I want to mention that, on the Dataflow side, it seems that when the job is cancelled after the 1-hour timeout, the pip install of the TFX package is still running with high CPU usage. I can reproduce this issue locally with:
pyenv global 3.8.10 # issue also occurs on 3.10.8
mkdir /tmp/venv
python -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install 'tfx==1.14.0'
which outputs logs such as:
INFO: pip is looking at multiple versions of exceptiongroup to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of anyio to determine which version is compatible with other requirements. This could take a while.
Collecting anyio>=3.1.0
Downloading anyio-4.0.0-py3-none-any.whl (83 kB)
|████████████████████████████████| 83 kB 2.3 MB/s
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
Using cached anyio-3.7.1-py3-none-any.whl (80 kB)
Using cached anyio-3.7.0-py3-none-any.whl (80 kB)
Using cached anyio-3.6.2-py3-none-any.whl (80 kB)
Using cached anyio-3.6.1-py3-none-any.whl (80 kB)
Using cached anyio-3.6.0-py3-none-any.whl (80 kB)
Using cached anyio-3.5.0-py3-none-any.whl (79 kB)
INFO: pip is looking at multiple versions of anyio to determine which version is compatible with other requirements. This could take a while.
Using cached anyio-3.4.0-py3-none-any.whl (78 kB)
Using cached anyio-3.3.4-py3-none-any.whl (78 kB)
Using cached anyio-3.3.3-py3-none-any.whl (78 kB)
Using cached anyio-3.3.2-py3-none-any.whl (78 kB)
Using cached anyio-3.3.1-py3-none-any.whl (77 kB)
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
Using cached anyio-3.3.0-py3-none-any.whl (77 kB)
Using cached anyio-3.2.1-py3-none-any.whl (75 kB)
Using cached anyio-3.2.0-py3-none-any.whl (75 kB)
Using cached anyio-3.1.0-py3-none-any.whl (74 kB)
INFO: pip is looking at multiple versions of jupyter-server to determine which version is compatible with other requirements. This could take a while.
Collecting jupyter-server<3,>=2.4.0
Using cached jupyter_server-2.10.1-py3-none-any.whl (378 kB)
...
It has been doing so for the last 20+ minutes.
I believe the Dataflow job fails because pip is unable to resolve the versions within the 1-hour time limit.
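One possible mitigation, as a sketch: give pip a constraints file so the resolver has no room to backtrack. The pins below are assumptions taken from the log output above, not verified against TFX's actual requirements; derive real pins from a known-good environment (e.g. pip freeze):

```shell
# Write a constraints file pinning the packages pip was backtracking on.
# These exact versions are hypothetical examples.
cat > /tmp/constraints.txt <<'EOF'
anyio==3.7.1
jupyter_server==2.10.1
EOF
# Then install with the constraints applied, e.g.:
#   pip install 'tfx==1.14.0' -c /tmp/constraints.txt
```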
Hello @singhniraj08 and @liferoad,
I'm still encountering the same problem, even after adding "--experiments=disable_worker_container_image_prepull" and ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1.
My job id: 2024-01-28_15_08_34-3324575780147226127
Here is my code for the Docker image:

FROM gcr.io/tfx-oss-public/tfx:1.14.0
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
COPY requirementsfinal.txt requirements.txt
RUN sed -i 's/python3/python/g' /usr/bin/pip
RUN pip install -r requirements.txt
COPY src/ src/
ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
And the rest is similar to this:

train_output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(
                name="train", hash_buckets=int(config.NUM_TRAIN_SPLITS)
            ),
            example_gen_pb2.SplitConfig.Split(
                name="eval", hash_buckets=int(config.NUM_EVAL_SPLITS)
            ),
        ]
    )
)

# Train example generation
train_example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    query=train_sql_query,
    output_config=train_output_config,
    custom_config=json.dumps({})
).with_beam_pipeline_args(config.BEAM_DATAFLOW_PIPELINE_ARGS).with_id("TrainDataGen")
BEAM_DATAFLOW_PIPELINE_ARGS = [
    f"--project={PROJECT}",
    f"--temp_location={os.path.join(GCS_LOCATION, 'temp')}",
    "--region=us-east1",
    f"--runner={BEAM_RUNNER}",
    "--disk_size_gb=50",
    "--experiments=disable_worker_container_image_prepull",
]
For the runner:

runner = tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(
        default_image=config.TFX_IMAGE_URI
    ),
    output_filename=PIPELINE_DEFINITION_FILE,
)

and the worker logs (screenshot omitted).
@IzakMaraisTAL and @coocloud, can you share your config files or anything you did differently to make it work, please?
It worked!! Thank you @IzakMaraisTAL!
I hope your issue is resolved. If you still have an issue, please reopen this thread.