Hello! According to your error message, it seems that my Docker image is a bit out of date regarding CUDA:
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
I'll try to update the Dockerfile and see if it can solve your problem!
from carefree-creator.
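In case it helps to see the mismatch in the warning concretely: NVIDIA's compatibility docs pair each CUDA toolkit with a minimum Linux driver version. The values below (520.61.05 for CUDA 11.8, 510.39.01 for CUDA 11.6) are taken from those tables and may differ between releases, so treat this as a sketch of the check rather than an authoritative table:

```python
# Rough check (stdlib only) of why the container falls back to Minor Version
# Compatibility: the host driver is older than what CUDA 11.8 was built for.
# Minimum-driver values are assumptions from NVIDIA's compatibility tables.

def parse(v: str) -> tuple:
    """Turn a dotted driver version like '510.108.03' into a comparable tuple."""
    return tuple(int(x) for x in v.split("."))

HOST_DRIVER = "510.108.03"               # from the warning in the log
MIN_DRIVER_FOR_CUDA_11_8 = "520.61.05"   # assumed minimum for CUDA 11.8
MIN_DRIVER_FOR_CUDA_11_6 = "510.39.01"   # assumed minimum for CUDA 11.6

print(parse(HOST_DRIVER) >= parse(MIN_DRIVER_FOR_CUDA_11_8))  # False: too old for 11.8
print(parse(HOST_DRIVER) >= parse(MIN_DRIVER_FOR_CUDA_11_6))  # True: fine for 11.6
```

So the driver can run the container, just in the slower Minor Version Compatibility mode rather than natively.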
Hi, I updated the Dockerfile just now, could you use the latest Dockerfile and try again? Thanks!
from carefree-creator.
I ran the "docker build -t $TAG_NAME ." again, but it reports errors. Do I have to delete the Dockerfile and rebuild?
from carefree-creator.
I ran the "docker build -t $TAG_NAME ." again, but it reports errors. Do I have to delete the Dockerfile and rebuild?

What are the errors? And yes, you may need to delete the original Dockerfile, download the latest Dockerfile, and rebuild again!
from carefree-creator.
I rebuilt again, but the errors seem to still be there.
$ docker run --gpus all --rm -p 8123:8123 cfcreator:latest
=============
== PyTorch ==
=============

NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f
Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 510.108.03 which has support for CUDA 11.6. This container
was built with CUDA 11.8 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
Traceback (most recent call last):
  File "/opt/conda/bin/cfcreator", line 5, in <module>
    from cfcreator.cli import main
  File "/opt/conda/lib/python3.8/site-packages/cfcreator/__init__.py", line 2, in <module>
    from .common import *
  File "/opt/conda/lib/python3.8/site-packages/cfcreator/common.py", line 22, in <module>
    from cflearn.zoo import DLZoo
  File "/opt/conda/lib/python3.8/site-packages/cflearn/__init__.py", line 3, in <module>
    from .schema import *
  File "/opt/conda/lib/python3.8/site-packages/cflearn/schema.py", line 27, in <module>
    from accelerate import Accelerator
  File "/opt/conda/lib/python3.8/site-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/opt/conda/lib/python3.8/site-packages/accelerate/accelerator.py", line 34, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/opt/conda/lib/python3.8/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/__init__.py", line 112, in <module>
    from .launch import (
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/launch.py", line 27, in <module>
    from ..utils.other import merge_dicts
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/other.py", line 24, in <module>
    from .transformer_engine import convert_model
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/transformer_engine.py", line 21, in <module>
    import transformer_engine.pytorch as te
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/__init__.py", line 7, in <module>
    from . import pytorch
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
    from .module import LayerNormLinear
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/module.py", line 16, in <module>
    import transformer_engine_extensions as tex
ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106SymInt8toSymIntENS_13intrusive_ptrINS_14SymIntNodeImplENS_6detail34intrusive_target_default_null_typeIS2_EEEE
from carefree-creator.
Ok, I caught another potential mistake:
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
which means the shared memory of your docker is restricted to 64MB, which may cause errors. You can try the command it suggests, or manually mount your server's /dev/shm to the docker's /dev/shm, and see if it helps!
from carefree-creator.
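For reference, the two fixes suggested above can be sketched as shell commands. The image name cfcreator:latest is taken from the log above; --shm-size is an alternative Docker flag for enlarging the container's /dev/shm directly, and the 8g value is just an example:

```shell
# Option 1: NVIDIA's suggested flags -- share the host IPC namespace and
# lift the memlock/stack limits (67108864 bytes = 64 MiB stack).
docker run --gpus all --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -p 8123:8123 cfcreator:latest

# Option 2: keep the container's IPC namespace but enlarge its /dev/shm.
docker run --gpus all --rm --shm-size=8g -p 8123:8123 cfcreator:latest
```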
Seems like another error?
$docker run --gpus 0 --rm -p 8123:8123 $TAG_NAME:latest --ipc=host --ulimit memlock=-1 --ulimit stack=67108864
[... NVIDIA container banner, copyright notices, and CUDA compatibility warning, same as above ...]
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
/opt/nvidia/nvidia_entrypoint.sh: line 49: exec: --: invalid option
exec: usage: exec [-cl] [-a name] [command [arguments ...]] [redirection ...]
from carefree-creator.
Maybe you need to put $TAG_NAME:latest at the end? Because now it still complains about the 64MB ram, and it says your command is invalid (I'm not sure, I'm not an expert in docker commands either 🤣)
from carefree-creator.
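This guess matches how docker run parses its arguments: everything after the image name is handed to the container's entrypoint as a command, which is why the trailing --ipc=host produced the "exec: --: invalid option" error. A sketch of the corrected ordering, using the same flags and tag as in the log above:

```shell
# General form: options must come BEFORE the image name; anything after
# the image is passed to the container's entrypoint as a command.
#   docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
docker run --gpus all --rm -p 8123:8123 \
    --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    $TAG_NAME:latest
```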
Yes! You are right. But the ImportError still seems to exist.
$ docker run --gpus 0 --rm -p 8123:8123 --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 $TAG_NAME:latest
[... NVIDIA container banner, copyright notices, and CUDA compatibility warning, same as above; the SHMEM note is gone ...]
Traceback (most recent call last):
  [... same import chain as in the earlier comment ...]
  File "/opt/conda/lib/python3.8/site-packages/transformer_engine/pytorch/module.py", line 16, in <module>
    import transformer_engine_extensions as tex
ImportError: /opt/conda/lib/python3.8/site-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106SymInt8toSymIntENS_13intrusive_ptrINS_14SymIntNodeImplENS_6detail34intrusive_target_default_null_typeIS2_EEEE
from carefree-creator.
Hmmm, here's another guess of mine: maybe the CUDA driver on your physical server (i.e., 510.108.03) or something like that is too old for the latest PyTorch. To verify it: can you run other PyTorch 2.0 projects on your server?
from carefree-creator.
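One more triage idea, offered only as a guess from the log: the mangled name in the ImportError begins with _ZN3c10, i.e. it belongs to PyTorch's c10 library (running it through c++filt would demangle it to a c10::SymInt method). An undefined c10 symbol usually means the transformer_engine_extensions binary was compiled against a different PyTorch build than the one importing it, so matching the transformer_engine and PyTorch versions inside the container may matter more than the host driver. A minimal helper for pulling such symbols out of the error message:

```python
import re
from typing import Optional

def undefined_symbol(import_error: str) -> Optional[str]:
    """Extract the mangled symbol from an 'undefined symbol' ImportError
    message, so it can be demangled (e.g. `c++filt <symbol>`) to see
    which library it comes from."""
    m = re.search(r"undefined symbol: (\S+)", import_error)
    return m.group(1) if m else None

msg = ("transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: "
       "undefined symbol: _ZN3c106SymInt8toSymIntENS_13intrusive_ptr")
sym = undefined_symbol(msg)
print(sym)
print(sym.startswith("_ZN3c10"))  # True: the missing symbol lives in PyTorch's c10
```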