
Comments (15)

vamseedharr avatar vamseedharr commented on June 27, 2024 1

UPDATE: I installed it on our HPC and this time it was marginally faster. It got through the whole initial process and then went through the GNN iterations. After the initial run, it seems to be quite fast (~5-10 minutes) for all subsequent runs. If this is only related to the weights not getting downloaded, I'm wondering why that is the case.

from model-angelo.

jamaliki avatar jamaliki commented on June 27, 2024

Hi,

Thank you for the error report. While it is running, does nvidia-smi show any GPU utilization from ModelAngelo?

If not, could you make sure that the conda environment's pytorch is able to access the GPU?

Best,
Kiarash.
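
The check requested above (whether the conda environment's PyTorch can reach the GPU) can be sketched as follows. This is a minimal illustration, not part of ModelAngelo; the helper name describe_gpu_access is hypothetical, and it should be run with the Python interpreter of the same conda environment that model_angelo is installed in.

```python
def describe_gpu_access():
    """Return a one-line report of what PyTorch can see in this environment."""
    try:
        import torch  # imported lazily so the report still works without PyTorch
    except ImportError:
        return "PyTorch is not importable in this environment"

    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} imported, but CUDA is not available"

    # List every CUDA device PyTorch can address, by index.
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return f"PyTorch {torch.__version__} sees {len(names)} CUDA device(s): " + ", ".join(names)


if __name__ == "__main__":
    print(describe_gpu_access())
```

A report that CUDA is not available would point at the environment (driver, CUDA toolkit, or a CPU-only PyTorch build) rather than at ModelAngelo itself.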


vamseedharr avatar vamseedharr commented on June 27, 2024

Hi Kiarash,

nvidia-smi reports NO GPU utilization.

PyTorch is available through the conda environment (screenshot attached).

Downgraded to cudatoolkit=11.6 and still no go. Not sure how to proceed. Any ideas?

Thanks,

-Vamsee


jamaliki avatar jamaliki commented on June 27, 2024

This is very interesting.

What happens when you type "which model_angelo"? Is it installed in the right conda environment?

If you manually pass "--device 0" to "model_angelo build", does it use the GPU?


vamseedharr avatar vamseedharr commented on June 27, 2024
  1. which model_angelo returns the environment path (/home/vamsee/anaconda3/envs/model_angelo/bin/model_angelo)
  2. Still no GPU utilization with the "--device 0" option.


jamaliki avatar jamaliki commented on June 27, 2024

Could you please send me the log file?


vamseedharr avatar vamseedharr commented on June 27, 2024

2022-12-20 at 22:19:59 | ERROR | Error in ModelAngelo
Traceback (most recent call last):

File "/home/vamsee/anaconda3/envs/model_angelo/bin/model_angelo", line 33, in <module>
sys.exit(load_entry_point('model-angelo==0.2.2', 'console_scripts', 'model_angelo')())
│ │ └ <function importlib_load_entry_point at 0x7fc30ed130d0>
│ └
└ <module 'sys' (built-in)>
File "/home/vamsee/anaconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-0.2.2-py3.9.egg/model_angelo/__main__.py", line 51, in main
args.func(args)
│ │ └ Namespace(volume_path='./Model_building/Bin2_zGP_11883_11886_C1_map_flipped.mrc', fasta_path='./initial_models/zGP-11883-1188...
│ └ <function main at 0x7fc2616fdc10>
└ Namespace(volume_path='./Model_building/Bin2_zGP_11883_11886_C1_map_flipped.mrc', fasta_path='./initial_models/zGP-11883-1188...

File "/home/vamsee/anaconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-0.2.2-py3.9.egg/model_angelo/apps/build.py", line 130, in main
model_bundle_path = download_and_install_model(parsed_args.model_bundle_name)
│ │ └ 'original'
│ └ Namespace(volume_path='./Model_building/Bin2_zGP_11883_11886_C1_map_flipped.mrc', fasta_path='./initial_models/zGP-11883-1188...
└ <function download_and_install_model at 0x7fc264549af0>
File "/home/vamsee/anaconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-0.2.2-py3.9.egg/model_angelo/utils/torch_utils.py", line 460, in download_and_install_model
with zipfile.ZipFile(dest + ".zip", "r") as zip_object:
│ │ └ '/home/vamsee/.cache/torch/hub/checkpoints/model_angelo/original'
│ └ <class 'zipfile.ZipFile'>
└ <module 'zipfile' from '/home/vamsee/anaconda3/envs/model_angelo/lib/python3.9/zipfile.py'>
File "/home/vamsee/anaconda3/envs/model_angelo/lib/python3.9/zipfile.py", line 1266, in __init__
self._RealGetContents()
│ └ <function ZipFile._RealGetContents at 0x7fc30ea043a0>
└ <zipfile.ZipFile [closed]>
File "/home/vamsee/anaconda3/envs/model_angelo/lib/python3.9/zipfile.py", line 1333, in _RealGetContents
raise BadZipFile("File is not a zip file")
└ <class 'zipfile.BadZipFile'>

zipfile.BadZipFile: File is not a zip file


jamaliki avatar jamaliki commented on June 27, 2024

Could you email me the log file itself at [email protected]

There seems to be an issue during installation.


vamseedharr avatar vamseedharr commented on June 27, 2024

Sent


vamseedharr avatar vamseedharr commented on June 27, 2024

UPDATE:

Tried several things.

  1. Removed the environment and set it up again. No luck.
  2. Removed anaconda and set the whole process up again. No luck.
  3. Tried miniconda. No luck.

In the process, I noticed there was an HTTP request error. This has happened with both anaconda and miniconda and is apparently well documented. I found some workarounds for it and the HTTP errors disappeared.

  4. torch.cuda.is_available() returns True.
  5. Finally, reinstalled the operating system (upgraded to LM 21.1) and started from absolute scratch. No luck.
  6. Tried cuda-11.6 as the previous 2080Ti post suggested. No luck.
  7. The video card works. Tested it with gpu-burn with no errors.

I did notice something interesting though. Every time I start the run, there is a small spike in power usage and volatile memory usage (20W, 20%) of the GPU which then drops back down to base levels (~9W and 0%) in about 10-15 seconds.
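
A short spike like the one described above is easy to miss by eye, so it can help to log GPU readings at a fixed interval while the run starts. The following is a rough sketch, assuming nvidia-smi is on the PATH; it uses nvidia-smi's --query-gpu CSV output, and the function names (parse_gpu_sample, sample_gpus) are hypothetical helpers, not part of ModelAngelo.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = "utilization.gpu,power.draw,memory.used"


def _to_float(text):
    """Convert a CSV field to float; unsupported fields such as '[N/A]' become None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_sample(line):
    """Parse one 'util, power, memory' row printed with csv,noheader,nounits."""
    util, power, memory = (_to_float(field.strip()) for field in line.split(","))
    return {"util_percent": util, "power_watts": power, "memory_mib": memory}


def sample_gpus():
    """Return one reading per GPU, or an empty list if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_sample(row) for row in result.stdout.splitlines() if row.strip()]
```

Calling sample_gpus() once per second in a loop during the first minute of a run would show whether the GPU is touched only briefly (for example while the process initializes) or is doing sustained work.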

Unsure of what to do next. I've tried to eliminate all possible variables. If you can think of any more, please suggest them. I'm willing to try.


jamaliki avatar jamaliki commented on June 27, 2024

Hi,

From the log file, it seems that the weight download did not happen correctly. Could you please delete the folder '/home/vamsee/.cache/torch' and try again?

Sorry for the inconvenience
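
Before deleting the whole cache, it may be worth confirming that the cached weight archive really is the problem. The BadZipFile error in the log is what Python raises when a file with a .zip name is truncated or is not a zip at all (for example, an HTML error page saved by a failed download). A minimal sketch using only the standard library, assuming the archives live somewhere under ~/.cache/torch as the traceback suggests; find_bad_archives is a hypothetical helper name:

```python
import zipfile
from pathlib import Path


def find_bad_archives(cache_dir):
    """Return the .zip files under cache_dir that cannot be read as zip archives."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return []

    bad = []
    for path in sorted(root.rglob("*.zip")):
        try:
            with zipfile.ZipFile(path) as archive:
                # testzip() returns the name of the first corrupt member, or None.
                if archive.testzip() is not None:
                    bad.append(path)
        except zipfile.BadZipFile:
            # Raised when the file is not a zip at all or its directory is damaged.
            bad.append(path)
    return bad


if __name__ == "__main__":
    for path in find_bad_archives("~/.cache/torch"):
        print(f"not a valid zip archive: {path}")
```

Any file this reports is a candidate for deletion so that the next run downloads it again.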


vamseedharr avatar vamseedharr commented on June 27, 2024

Hi Kiarash,

No inconvenience at all.

I'm not sure your suggestion will help anymore. I've reinstalled it several times, including reinstalling the operating system itself (see my previous update for details). I feel like I am missing something but I'm not sure what.

-Vamsee


jamaliki avatar jamaliki commented on June 27, 2024

Does it always fail with the same message in the log? I think this has to do with the HTTP request failing. We could try manually downloading the weights; I could give you commands for that if the failure is always at the same point.


vamseedharr avatar vamseedharr commented on June 27, 2024

The HTTP error is internal to anaconda/miniconda and has been well documented across the web. I've disabled the SSL settings for it and haven't gotten the HTTP errors since. I would like to try downloading the weights manually though. Maybe that'll help. Please send them to me. Thanks.


jamaliki avatar jamaliki commented on June 27, 2024

Sorry for the late replies, I am currently out of office for a couple of weeks.

Yes, it was downloading weights before. I am unsure why it was so slow for you, but the runs after that are actually indicative of the model building speed.
