Hello author, I ran into the following error while training. Could you please help me take a look?
bash ./scripts/train_mvtec.sh
Setting ds_accelerator to cuda (auto detect)
[2023-10-18 10:51:44,336] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-10-18 10:51:44,350] [INFO] [runner.py:555:main] cmd = /home/witai4090/anaconda3/envs/zkh/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=28400 --enable_each_rank_log=None train_mvtec.py --model openllama_peft --stage 1 --imagebind_ckpt_path /home/witai4090/data/zkh/AnomalyGPT-main/pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth --vicuna_ckpt_path /home/witai4090/data/zkh/AnomalyGPT-main/vicuna_ckpt/7b_v0/ --delta_ckpt_path /home/witai4090/data/zkh/AnomalyGPT-main/pretrained_ckpt/pandagpt_ckpt/7b/pytorch_model.pt --max_tgt_len 1024 --data_path /home/witai4090/data/zkh/AnomalyGPT-main/data/pandagpt4_visual_instruction_data.json --image_root_path /home/witai4090/data/zkh/AnomalyGPT-main/data/images/ --save_path /home/witai4090/data/zkh/AnomalyGPT-main/code/ckpt/train_mvtec/ --log_path /home/witai4090/data/zkh/AnomalyGPT-main/code/ckpt/train_mvtec/log_rest/
Setting ds_accelerator to cuda (auto detect)
[2023-10-18 10:51:45,133] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-10-18 10:51:45,133] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-10-18 10:51:45,133] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-10-18 10:51:45,133] [INFO] [launch.py:163:main] dist_world_size=2
[2023-10-18 10:51:45,133] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
[!] load base configuration: config/base.yaml
[!] load configuration from config/openllama_peft.yaml
[!] load base configuration: config/base.yaml
[!] load configuration from config/openllama_peft.yaml
[2023-10-18 10:51:49,415] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-18 10:51:49,415] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-18 10:51:49,415] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-10-18 10:51:49,415] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-10-18 10:51:49,415] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[!] collect 161151 samples for training
[!] collect 161151 samples for training
Initializing visual encoder from /home/witai4090/data/zkh/AnomalyGPT-main/pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth ...
Initializing visual encoder from /home/witai4090/data/zkh/AnomalyGPT-main/pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth ...
Visual encoder initialized.
Initializing language decoder from /home/witai4090/data/zkh/AnomalyGPT-main/vicuna_ckpt/7b_v0/ ...
Visual encoder initialized.
Initializing language decoder from /home/witai4090/data/zkh/AnomalyGPT-main/vicuna_ckpt/7b_v0/ ...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:28<00:00, 14.10s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:28<00:00, 14.15s/it]
trainable params: 33554432 || all params: 6771970048 || trainable%: 0.49548996469513035
trainable params: 33554432 || all params: 6771970048 || trainable%: 0.49548996469513035
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy
(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False
. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy
(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False
. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Language decoder initialized.
Language decoder initialized.
[2023-10-18 10:54:12,519] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.3, git-hash=unknown, git-branch=unknown
[2023-10-18 10:54:12,519] [INFO] [comm.py:619:init_distributed] Distributed backend already initialized
[2023-10-18 10:54:34,114] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/witai4090/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /home/witai4090/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Traceback (most recent call last):
File "train_mvtec.py", line 152, in <module>
main(**args)
File "train_mvtec.py", line 115, in main
agent = load_model(args)
File "/home/witai4090/data/zkh/AnomalyGPT-main/code/model/__init__.py", line 10, in load_model
agent = globals()[agent_name](model, args)
File "/home/witai4090/data/zkh/AnomalyGPT-main/code/model/agent.py", line 29, in __init__
self.ds_engine, self.optimizer, _ , _ = deepspeed.initialize(
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1174, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
return self.jit_load(verbose)
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(name=self.name,
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1597, in _write_ninja_file_and_build_library
get_compiler_abi_compatibility_and_version(compiler)
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 336, in get_compiler_abi_compatibility_and_version
if not check_compiler_ok_for_platform(compiler):
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 290, in check_compiler_ok_for_platform
which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/subprocess.py", line 411, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
Loading extension module cpu_adam...
Traceback (most recent call last):
File "train_mvtec.py", line 152, in <module>
main(**args)
File "train_mvtec.py", line 115, in main
agent = load_model(args)
File "/home/witai4090/data/zkh/AnomalyGPT-main/code/model/__init__.py", line 10, in load_model
agent = globals()[agent_name](model, args)
File "/home/witai4090/data/zkh/AnomalyGPT-main/code/model/agent.py", line 29, in __init__
self.ds_engine, self.optimizer, _ , _ = deepspeed.initialize(
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1174, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
return self.jit_load(verbose)
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(name=self.name,
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 556, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1101, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/witai4090/.cache/torch_extensions/py38_cu117/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f880d33daf0>
Traceback (most recent call last):
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f5f1d8fdaf0>
Traceback (most recent call last):
File "/home/witai4090/anaconda3/envs/zkh/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
[2023-10-18 10:54:43,332] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 785186
[2023-10-18 10:54:43,379] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 785187
[2023-10-18 10:54:43,382] [ERROR] [launch.py:320:sigkill_handler] ['/home/witai4090/anaconda3/envs/zkh/bin/python', '-u', 'train_mvtec.py', '--local_rank=1', '--model', 'openllama_peft', '--stage', '1', '--imagebind_ckpt_path', '/home/witai4090/data/zkh/AnomalyGPT-main/pretrained_ckpt/imagebind_ckpt/imagebind_huge.pth', '--vicuna_ckpt_path', '/home/witai4090/data/zkh/AnomalyGPT-main/vicuna_ckpt/7b_v0/', '--delta_ckpt_path', '/home/witai4090/data/zkh/AnomalyGPT-main/pretrained_ckpt/pandagpt_ckpt/7b/pytorch_model.pt', '--max_tgt_len', '1024', '--data_path', '/home/witai4090/data/zkh/AnomalyGPT-main/data/pandagpt4_visual_instruction_data.json', '--image_root_path', '/home/witai4090/data/zkh/AnomalyGPT-main/data/images/', '--save_path', '/home/witai4090/data/zkh/AnomalyGPT-main/code/ckpt/train_mvtec/', '--log_path', '/home/witai4090/data/zkh/AnomalyGPT-main/code/ckpt/train_mvtec/log_rest/'] exits with return code = 1
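For what it's worth, my reading of the traceback: rank 0 fails because DeepSpeed tries to JIT-compile the cpu_adam op and `which c++` exits non-zero (no C++ compiler on PATH), and rank 1 then fails importing the cpu_adam.so that was never built. A rough check and cleanup I plan to try (the cache path is from my machine, and the install commands are just my guesses for a typical Linux/conda setup):

```shell
# DeepSpeed's JIT build calls `which c++` before compiling; check whether a compiler exists
which c++ >/dev/null 2>&1 && echo "c++ found" || echo "c++ missing"

# If missing, install a toolchain first, e.g. (Debian/Ubuntu):
#   sudo apt install build-essential
# or inside the conda env:
#   conda install -c conda-forge gxx_linux-64

# Then clear the stale extension cache so cpu_adam.so gets rebuilt on the next run
rm -rf ~/.cache/torch_extensions/py38_cu117/cpu_adam
```

Does this look like the right direction, or could something else in my environment be causing it?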