
aios's Issues

Error in installation

Dear authors,

Thank you for your great work, but I encountered some problems during installation.

The first is that the requirements.txt mentioned in your installation instructions is missing from the repository.

The second is the following error when running the official demo:

Traceback (most recent call last):
  File "main.py", line 14, in <module>
    from detrsmpl.data.datasets import build_dataloader
ModuleNotFoundError: No module named 'detrsmpl.data'

Question about computational cost and speed

Hi @ttxskk,
Thanks for sharing this nice work.

I skimmed the paper quickly and didn't find details about computational cost and speed.

Looking at the overall framework's workflow, the pipeline does not seem any shorter than OSX (CVPR 2023), in my opinion.

So I'm very curious how your work achieves faster and lighter inference than previous works.

Sincerely,

Bad paper title: OS is reserved for Operating System

This is a very bad use of an acronym. "AI OS" already has a well-established meaning: a (generative) AI-powered operating system.

Using "Ai" to stand for "All-in" and "OS" to stand for "one-stage" is pointless and misleading.

Please come up with a better, non-misleading name for your paper and project.

Error running demo

Dear Authors,

Thanks for your great work, but I encountered the following error when running the official demo:

(/mnt/qb/work/ponsmoll/pba178/.conda/aios) [pba178@galvani-cn219 AiOS]$ sh scripts/inference.sh short_video demo

/mnt/qb/work/ponsmoll/pba178/.conda/aios/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


usage: DETR training and evaluation script [-h] --config_file CONFIG_FILE [--options OPTIONS [OPTIONS ...]] [--output_dir OUTPUT_DIR] [--device DEVICE] [--seed SEED] [--resume RESUME]
[--pretrain_model_path PRETRAIN_MODEL_PATH] [--finetune_ignore FINETUNE_IGNORE [FINETUNE_IGNORE ...]] [--start_epoch N] [--eval] [--num_workers NUM_WORKERS]
[--test] [--debug] [--find_unused_params] [--save_log] [--to_vid] [--inference] [--rank RANK] [--local_rank LOCAL_RANK] [--amp]
[--inference_input INFERENCE_INPUT]
DETR training and evaluation script: error: unrecognized arguments: --local-rank=0
usage: DETR training and evaluation script [-h] --config_file CONFIG_FILE [--options OPTIONS [OPTIONS ...]] [--output_dir OUTPUT_DIR] [--device DEVICE] [--seed SEED] [--resume RESUME]
[--pretrain_model_path PRETRAIN_MODEL_PATH] [--finetune_ignore FINETUNE_IGNORE [FINETUNE_IGNORE ...]] [--start_epoch N] [--eval] [--num_workers NUM_WORKERS]
[--test] [--debug] [--find_unused_params] [--save_log] [--to_vid] [--inference] [--rank RANK] [--local_rank LOCAL_RANK] [--amp]
[--inference_input INFERENCE_INPUT]
DETR training and evaluation script: error: unrecognized arguments: --local-rank=1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 3022941) of binary: /mnt/qb/work/ponsmoll/pba178/.conda/aios/bin/python
Traceback (most recent call last):
  File "/mnt/qb/work/ponsmoll/pba178/.conda/aios/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/qb/work/ponsmoll/pba178/.conda/aios/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/qb/work/ponsmoll/pba178/.conda/aios/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/mnt/qb/work/ponsmoll/pba178/.conda/aios/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/mnt/qb/work/ponsmoll/pba178/.conda/aios/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/mnt/qb/work/ponsmoll/pba178/.conda/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/qb/work/ponsmoll/pba178/.conda/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/qb/work/ponsmoll/pba178/.conda/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2024-05-21_23:12:25
host : galvani-cn219
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 3022942)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-05-21_23:12:25
host : galvani-cn219
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 3022941)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


I have downloaded the checkpoint and the SMPL content and placed them in the correct paths, but this error still occurs. I would like to know if you have any solution for this problem.
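
Not an official fix, but the deprecation warning above already points at the usual workaround: either read the local rank from the LOCAL_RANK environment variable, or let the argument parser accept the --local-rank spelling that newer torch.distributed launchers pass. A minimal sketch (the parser name is taken from the usage message above; the rest is an assumption, not code from the repo):

import argparse
import os

parser = argparse.ArgumentParser('DETR training and evaluation script')
# Accept both the old "--local_rank" and the new "--local-rank" spelling that
# recent torch.distributed launchers pass; fall back to the LOCAL_RANK env var.
parser.add_argument('--local_rank', '--local-rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', 0)))

Alternatively, launching through torchrun and reading os.environ['LOCAL_RANK'] directly, as the warning suggests, avoids the command-line argument entirely.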

Bad results!

I ran the inference code with the default arguments on the following video, but it reconstructs a bad mesh with an ugly face and misaligned hand poses. I'm wondering whether this result is correct?

00001.mp4
out.mp4

Error running the demo

Hi! I'm having some trouble with the demo. I installed all the required libraries, except that my CUDA version is 11.1. I'm getting the following error.

Before torch.distributed.barrier()
End torch.distributed.barrier()
Loading config file from config/aios_smplx_inference.py
[05/22 14:34:28.837]: git:
sha: be1ea5a, status: has uncommited changes, branch: main

[05/22 14:34:28.837]: Command: main.py -c config/aios_smplx_inference.py --options batch_size=8 epochs=100 lr_drop=55 num_body_points=17 backbone=resnet50 --resume data/checkpoint/aios_checkpoint.pth --eval --inference --to_vid --inference_input demo/short_video.mp4 --output_dir demo/demo
[05/22 14:34:28.839]: Full config saved to demo/demo/config_args_all.json
[05/22 14:34:28.839]: world size: 1
[05/22 14:34:28.839]: rank: 0
[05/22 14:34:28.839]: local_rank: 0
[05/22 14:34:28.839]: args: Namespace(agora_benchmark='na', amp=False, aux_loss=True, backbone='resnet50', backbone_freeze_keywords=None, batch_norm_type='FrozenBatchNorm2d', batch_size=8, bbox_loss_coef=5.0, bbox_ratio=1.2, body_3d_size=2, body_bbox_loss_coef=5.0, body_giou_loss_coef=2.0, body_model_test={'type': 'smplx', 'keypoint_src': 'smplx', 'num_expression_coeffs': 10, 'num_betas': 10, 'keypoint_dst': 'smplx_137', 'model_path': 'data/body_models/smplx', 'use_pca': False, 'use_face_contour': True}, body_model_train={'type': 'smplx', 'keypoint_src': 'smplx', 'num_expression_coeffs': 10, 'num_betas': 10, 'keypoint_dst': 'smplx_137', 'model_path': 'data/body_models/smplx', 'use_pca': False, 'use_face_contour': True}, body_only=True, camera_3d_size=2.5, clip_max_norm=0.1, cls_loss_coef=2.0, cls_no_bias=False, code_dir=None, config_file='config/aios_smplx_inference.py', config_path='config/aios_smplx.py', continue_train=True, cur_dir='/home/seba/Documents/AiOS/config', data_dir='/home/seba/Documents/AiOS/config/../dataset', data_strategy='balance', dataset_list=['AGORA_MM', 'BEDLAM', 'COCO_NA'], ddetr_lr_param=False, debug=False, dec_layer_number=None, dec_layers=6, dec_n_points=4, dec_pred_bbox_embed_share=False, dec_pred_class_embed_share=False, dec_pred_pose_embed_share=False, decoder_module_seq=['sa', 'ca', 'ffn'], decoder_sa_type='sa', device='cuda', dilation=False, dim_feedforward=2048, distributed=True, dln_hw_noise=0.2, dln_xy_noise=0.2, dn_attn_mask_type_list=['match2dn', 'dn2dn', 'group2group'], dn_batch_gt_fuse=False, dn_bbox_coef=0.5, dn_box_noise_scale=0.4, dn_label_coef=0.3, dn_label_noise_ratio=0.5, dn_labelbook_size=100, dn_number=100, dropout=0.0, ema_decay=0.9997, ema_epoch=0, embed_init_tgt=False, enc_layers=6, enc_loss_coef=1.0, enc_n_points=4, end_epoch=150, epochs=100, eval=True, exp_name='output/exp52/dataset_debug', face_3d_size=0.3, face_bbox_loss_coef=5.0, face_giou_loss_coef=2.0, face_keypoints_loss_coef=10.0, face_oks_loss_coef=4.0, find_unused_params=False, finetune_ignore=None, fix_refpoints_hw=-1, focal=(5000, 5000), focal_alpha=0.25, frozen_weights=None, gamma=0.1, giou_loss_coef=2.0, gpu=0, hand_3d_size=0.3, hidden_dim=256, human_model_path='data/body_models', indices_idx_list=[1, 2, 3, 4, 5, 6, 7], inference=True, inference_input='demo/short_video.mp4', input_body_shape=(256, 192), input_face_shape=(192, 192), input_hand_shape=(256, 256), interm_loss_coef=1.0, keypoints_loss_coef=10.0, lhand_bbox_loss_coef=5.0, lhand_giou_loss_coef=2.0, lhand_keypoints_loss_coef=10.0, lhand_oks_loss_coef=0.5, local_rank=0, log_dir=None, losses=['smpl_pose', 'smpl_beta', 'smpl_expr', 'smpl_kp2d', 'smpl_kp3d', 'smpl_kp3d_ra', 'labels', 'boxes', 'keypoints'], lr=1.414e-05, lr_backbone=1.414e-06, lr_backbone_names=['backbone.0'], lr_drop=55, lr_drop_list=[30, 60], lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], make_same_len=False, masks=False, match_unstable_error=False, matcher_type='HungarianMatcher', model_dir=None, modelname='aios_smplx', multi_step_lr=True, nheads=8, nms_iou_threshold=-1, no_aug=False, no_interm_box_loss=False, no_mmpose_keypoint_evaluator=True, num_body_points=17, num_box_decoder_layers=2, num_classes=2, num_face_points=6, num_feature_levels=4, num_group=100, num_hand_face_decoder_layers=4, num_hand_points=6, num_patterns=0, num_queries=900, num_select=50, num_workers=0, oks_loss_coef=4.0, onecyclelr=False, options={'batch_size': 8, 'epochs': 100, 'lr_drop': 55, 'num_body_points': 17, 'backbone': 'resnet50'}, 
output_dir='demo/demo', output_face_hm_shape=(8, 8, 8), output_hand_hm_shape=(16, 16, 16), output_hm_shape=(16, 16, 12), param_dict_type='default', pe_temperatureH=20, pe_temperatureW=20, position_embedding='sine', pre_norm=False, pretrain_model_path=None, pretrained_model_path='../output/train_gta_synbody_ft_20230410_132110/model_dump/snapshot_2.pth.tar', princpt=(96.0, 128.0), query_dim=4, random_refpoints_xy=False, rank=0, result_dir='/home/seba/Documents/AiOS/config/../exps62/result', resume='data/checkpoint/aios_checkpoint.pth', return_interm_indices=[1, 2, 3], rhand_bbox_loss_coef=5.0, rhand_giou_loss_coef=2.0, rhand_keypoints_loss_coef=10.0, rhand_oks_loss_coef=0.5, rm_detach=None, rm_self_attn_layers=None, root_dir='/home/seba/Documents/AiOS/config/..', save_checkpoint_interval=1, save_log=False, scheduler='step', seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, set_cost_keypoints=10.0, set_cost_kpvis=0.0, set_cost_oks=4.0, smpl_beta_loss_coef=0.01, smpl_body_kp2d_ba_loss_coef=0.0, smpl_body_kp2d_loss_coef=1.0, smpl_body_kp3d_loss_coef=1.0, smpl_body_kp3d_ra_loss_coef=1.0, smpl_expr_loss_coef=0.01, smpl_face_kp2d_ba_loss_coef=0.0, smpl_face_kp2d_loss_coef=0.1, smpl_face_kp3d_loss_coef=0.1, smpl_face_kp3d_ra_loss_coef=0.1, smpl_lhand_kp2d_ba_loss_coef=0.0, smpl_lhand_kp2d_loss_coef=0.5, smpl_lhand_kp3d_loss_coef=0.1, smpl_lhand_kp3d_ra_loss_coef=0.1, smpl_pose_loss_body_coef=0.1, smpl_pose_loss_jaw_coef=0.1, smpl_pose_loss_lhand_coef=0.1, smpl_pose_loss_rhand_coef=0.1, smpl_pose_loss_root_coef=1.0, smpl_rhand_kp2d_ba_loss_coef=0.0, smpl_rhand_kp2d_loss_coef=0.5, smpl_rhand_kp3d_loss_coef=0.1, smpl_rhand_kp3d_ra_loss_coef=0.1, start_epoch=0, step_size=20, strong_aug=False, test=False, test_max_size=1333, test_sample_interval=100, test_sizes=[800], testset='INFERENCE', to_vid=True, total_data_len='auto', train_batch_size=32, train_max_size=1333, train_sample_interval=10, train_sizes=[480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800], trainset_2d=[], trainset_3d=['AGORA_MM', 'BEDLAM', 'COCO_NA'], trainset_humandata=[], trainset_partition={'AGORA_MM': 0.4, 'BEDLAM': 0.7, 'COCO_NA': 1}, transformer_activation='relu', two_stage_bbox_embed_share=False, two_stage_class_embed_share=False, two_stage_default_hw=0.05, two_stage_keep_all_tokens=False, two_stage_learn_wh=False, two_stage_type='standard', use_cache=True, use_checkpoint=False, use_dn=True, use_ema=True, vis_dir=None, weight_decay=0.0001, world_size=1)

aios_smplx
Traceback (most recent call last):
  File "main.py", line 437, in <module>
    main(args)
  File "main.py", line 173, in main
    model, criterion, postprocessors, postprocessors_aios = build_model_main(
  File "main.py", line 86, in build_model_main
    from models.registry import MODULE_BUILD_FUNCS
  File "/home/seba/Documents/AiOS/models/__init__.py", line 1, in <module>
    from .aios import build_aios_smplx
  File "/home/seba/Documents/AiOS/models/aios/__init__.py", line 1, in <module>
    from .aios_smplx import build_aios_smplx
  File "/home/seba/Documents/AiOS/models/aios/aios_smplx.py", line 17, in <module>
    from .transformer import build_transformer
  File "/home/seba/Documents/AiOS/models/aios/transformer.py", line 10, in <module>
    from .transformer_deformable import DeformableTransformerEncoderLayer, DeformableTransformerDecoderLayer
  File "/home/seba/Documents/AiOS/models/aios/transformer_deformable.py", line 11, in <module>
    from .ops.modules import MSDeformAttn
  File "/home/seba/Documents/AiOS/models/aios/ops/modules/__init__.py", line 9, in <module>
    from .ms_deform_attn import MSDeformAttn
  File "/home/seba/Documents/AiOS/models/aios/ops/modules/ms_deform_attn.py", line 21, in <module>
    from ..functions import MSDeformAttnFunction
  File "/home/seba/Documents/AiOS/models/aios/ops/functions/__init__.py", line 9, in <module>
    from .ms_deform_attn_func import MSDeformAttnFunction
  File "/home/seba/Documents/AiOS/models/aios/ops/functions/ms_deform_attn_func.py", line 18, in <module>
    import MultiScaleDeformableAttention as MSDA
ImportError: /home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/MultiScaleDeformableAttention-1.0-py3.8-linux-x86_64.egg/MultiScaleDeformableAttention.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c1015SmallVectorBaseIjE8grow_podEPvmm
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4686) of binary: /home/seba/anaconda3/envs/aios/bin/python
Traceback (most recent call last):
  File "/home/seba/anaconda3/envs/aios/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-05-22_14:34:30
host : seba-GE66-Raider-10UH
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 4686)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Thanks for your help!
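
For context, this kind of undefined-symbol ImportError typically means the MultiScaleDeformableAttention CUDA extension was compiled against a different PyTorch build than the one currently installed, so the extension usually needs to be rebuilt inside the active environment. A small diagnostic sketch (an assumption, not code from the repo) that surfaces whether the installed extension matches the running torch:

import torch

# The compiled extension must match the ABI of the installed torch build; if the
# import fails with an undefined symbol, it was likely built against another torch.
print('torch:', torch.__version__, 'built with CUDA:', torch.version.cuda)
try:
    import MultiScaleDeformableAttention  # noqa: F401
    print('MultiScaleDeformableAttention imports cleanly.')
except ImportError as err:
    print('Suspected extension/torch ABI mismatch:', err)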

About preprocessed datasets

Thanks for your wonderful work.

Could you share the preprocessed BEDLAM and AGORA datasets (*.npz files)?

Error running the demo

Thanks for your fantastic work, but I encountered a series of problems when running the demo. I would really appreciate some help. Here are the problems I hit:
Environment error
If I follow the instructions in the README to install PyTorch 1.10.1 and then PyTorch3D, I get a CUDA version mismatch error: "The detected CUDA version (12.4) mismatches the version that was used to compile PyTorch (11.3). Please make sure to use the same CUDA versions."

I solved this by installing the latest PyTorch 2.3.1 and manually downloading and installing the PyTorch3D conda package. I don't know whether I should instead install an older NVIDIA driver on my machine.

debugpy always waiting
If I don't comment out the line debugpy.wait_for_client(), the code just stops there and waits forever for a debugpy client to attach (a possible guard is sketched after the excerpt below).

def main(args):
    
    utils.init_distributed_mode_ssc(args)
    # utils.init_distributed_mode(args)
    if args.rank == 0:
        debugpy.listen(("127.0.0.1", 10086))
        debugpy.wait_for_client()
    print('Loading config file from {}'.format(args.config_file))
    shutil.copy2(args.config_file,'config/aios_smplx.py')
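
One non-invasive option is to keep the attach point but guard it behind an opt-in environment variable so normal runs never block. This is only an assumption about how the snippet above might be restructured; the AIOS_DEBUGPY flag name is made up for illustration:

import os
import debugpy

def maybe_attach_debugger(rank: int) -> None:
    # Only block and wait for a debugger when explicitly requested via AIOS_DEBUGPY=1.
    if rank == 0 and os.environ.get('AIOS_DEBUGPY') == '1':
        debugpy.listen(('127.0.0.1', 10086))
        debugpy.wait_for_client()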

Some distributed running error
If I use the default mmcv distributed wrapper in the code, I get the following error, which seems like a bug related to device types (a possible workaround is sketched after the traceback):

[rank0]: Traceback (most recent call last):
[rank0]:   File "main.py", line 395, in <module>
[rank0]:     main(args)
[rank0]:   File "main.py", line 297, in main
[rank0]:     inference(model,
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]:     outputs, targets, data_batch_nc = model(data_batch)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/distributed.py", line 162, in _run_ddp_forward
[rank0]:     inputs, kwargs = self.to_kwargs(  # type: ignore
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/distributed.py", line 27, in to_kwargs
[rank0]:     return scatter_kwargs(inputs, kwargs, [device_id], dim=self.dim)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 60, in scatter_kwargs
[rank0]:     inputs = scatter(inputs, target_gpus, dim) if inputs else []
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 50, in scatter
[rank0]:     return scatter_map(inputs)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
[rank0]:     return list(zip(*map(scatter_map, obj)))
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 40, in scatter_map
[rank0]:     out = list(map(type(obj), zip(*map(scatter_map, obj.items()))))
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 35, in scatter_map
[rank0]:     return list(zip(*map(scatter_map, obj)))
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/scatter_gather.py", line 33, in scatter_map
[rank0]:     return Scatter.forward(target_gpus, obj.data)
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/_functions.py", line 75, in forward
[rank0]:     streams = [_get_stream(device) for device in target_gpus]
[rank0]:   File "/home/tony/mycode/mmcv/mmcv/parallel/_functions.py", line 75, in <listcomp>
[rank0]:     streams = [_get_stream(device) for device in target_gpus]
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 119, in _get_stream
[rank0]:     if device.type == "cpu":
[rank0]: AttributeError: 'int' object has no attribute 'type'
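
Reading the traceback, mmcv's scatter path passes plain integer GPU ids into torch.nn.parallel._functions._get_stream, which in recent PyTorch releases expects torch.device objects. One hedged workaround (not an official patch; pinning an mmcv version matched to the installed torch is the other common route) is to normalize the ids before they reach _get_stream, along these lines:

import torch

def as_cuda_devices(target_gpus):
    # Convert integer GPU ids (as older mmcv passes them) into torch.device
    # objects, which newer torch versions expect when looking up CUDA streams.
    return [g if isinstance(g, torch.device) else torch.device('cuda', g)
            for g in target_gpus]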

If I disable distributed running, another error shows up, which also seems to be related to data type conversion:

[rank0]: Traceback (most recent call last):
[rank0]:   File "main.py", line 395, in <module>
[rank0]:     main(args)
[rank0]:   File "main.py", line 297, in main
[rank0]:     inference(model,
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/engine.py", line 338, in inference
[rank0]:     outputs, targets, data_batch_nc = model(data_batch)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/tony/miniconda3/envs/aios/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 962, in forward
[rank0]:     samples, targets = self.prepare_targets(data_batch)
[rank0]:   File "/home/tony/mycode/motion_estimation/AiOS/models/aios/aios_smplx.py", line 1845, in prepare_targets
[rank0]:     data_batch_coco = []
[rank0]: AttributeError: 'DataContainer' object has no attribute 'float'

My environment for running the code is:

OS: Ubuntu 24.04 LTS x86_64
Kernel: 6.8.0-38-generic
CPU: 13th Gen Intel i9-13900K (32) @ 5.500GHz
GPU: NVIDIA GeForce RTX 4090
Nvidia Driver Version: 550.90.07
CUDA Version: 12.4 

Required specs for inference

Thank you for the interesting work.
Can you describe the computer specs required for running the inference?

I tried to run the inference, but I kept getting a device ordinal error.

The LOCAL_RANK environment variable seems to be set to 1 by default. I suspect you use multiple GPUs in a single machine?

Here are the specs of the computer I used to run the inference code:

OS   : WSL2 (Windows Subsystem for Linux, Ubuntu 20.04)
CPU  : Intel(R) Core(TM) i7-8700
RAM  : 32 Gb
GPU  : NVidia RTX 3080 Ti (12 Gb)

pytorch 1.13.0 + CUDA 11.7

I also tested other PyTorch versions, but the error persists.

I modified the utils.init_distributed_mode(args) function in misc.py (line 581), adding a print statement at line 612 to show which GPU is being used:

args.distributed = True
print(f"Cuda GPU device set to {args.local_rank}")
torch.cuda.set_device(args.local_rank)

And here is the resulting error:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Cuda GPU device set to 0
Before torch.distributed.barrier()
Cuda GPU device set to 1
Traceback (most recent call last):
  File "main.py", line 441, in <module>
    main(args)
  File "main.py", line 99, in main
    utils.init_distributed_mode(args)
  File "/mnt/d/PythonProject/NERF/AiOS-main/util/misc.py", line 614, in init_distributed_mode
    torch.cuda.set_device(args.local_rank)
  File "/home/testrun/.virtualenvs/torch4/lib/python3.8/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17593 closing signal SIGTERM

The line "Cuda GPU device set to X" is printed twice, before print("End torch.distributed.barrier()").
I suppose this is because of multiprocessing?

I then forced it to use only GPU 0 by adding os.environ['LOCAL_RANK'] = "0" at the beginning of main.py, but I got the following error.

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Cuda GPU device set to 0
Cuda GPU device set to 0
Before torch.distributed.barrier()
Before torch.distributed.barrier()
Traceback (most recent call last):
Traceback (most recent call last):
  File "main.py", line 441, in <module>
  File "main.py", line 441, in <module>
    main(args)
  File "main.py", line 99, in main
    main(args)
  File "main.py", line 99, in main
    utils.init_distributed_mode(args)
  File "/mnt/d/PythonProject/NERF/AiOS-main/util/misc.py", line 620, in init_distributed_mode
    utils.init_distributed_mode(args)
  File "/mnt/d/PythonProject/NERF/AiOS-main/util/misc.py", line 620, in init_distributed_mode
    torch.distributed.barrier()
  File "/home/testrun/.virtualenvs/torch4/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3145, in barrier
        work = default_pg.barrier(opts=opts)
torch.distributed.barrier()
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000

So, does running the inference require multiple GPUs?
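
This looks less like a hard multi-GPU requirement than like the launcher starting more worker processes than there are visible GPUs, so the process with LOCAL_RANK=1 asks for a device that does not exist (and forcing both ranks onto GPU 0 then trips NCCL's duplicate-GPU check). A minimal guard, written as an assumption rather than repo code, that makes the mismatch explicit before torch.cuda.set_device is called:

import os
import torch

local_rank = int(os.environ.get('LOCAL_RANK', 0))
if local_rank >= torch.cuda.device_count():
    # More worker processes were launched than there are visible GPUs.
    raise RuntimeError(
        f'LOCAL_RANK={local_rank}, but only {torch.cuda.device_count()} CUDA '
        'device(s) are visible; launch with one process per available GPU.')
torch.cuda.set_device(local_rank)

Launching a single process on a single-GPU machine should avoid both the invalid device ordinal and the duplicate-GPU NCCL error.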

Global pose for the bones

I have a query regarding the global pose: how can we get the global pose for each bone, similar to what is done in WHAM?
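
Not speaking for the authors, but with SMPL-X-style output the per-joint rotations are parent-relative, so global bone rotations can be accumulated by walking the kinematic tree from the root. A minimal sketch, assuming you already have per-joint rotation matrices and a parent-index array (both inputs are assumptions here, not documented AiOS outputs):

import numpy as np

def relative_to_global(rel_rots: np.ndarray, parents: np.ndarray) -> np.ndarray:
    # rel_rots: (J, 3, 3) parent-relative rotation matrices; rel_rots[0] is the
    # global orientation. parents: (J,) parent index per joint, -1 for the root.
    # Assumes parents[j] < j for every non-root joint, as in the SMPL/SMPL-X ordering.
    global_rots = np.empty_like(rel_rots)
    for j, p in enumerate(parents):
        global_rots[j] = rel_rots[j] if p < 0 else global_rots[p] @ rel_rots[j]
    return global_rots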

Code release timeline

Hi,

Great paper! Do you have an estimate on when you'll release the training code and pretrained weights?
