
large-scale-lm-tutorials's Introduction

Large-scale language modeling tutorials with PyTorch

Hello, I am Hyunwoong Ko (고현웅), a machine learning engineer at TUNiB. These materials were prepared to introduce the various techniques needed to develop large-scale language models, and they assume prior knowledge of PyTorch and Transformer language models. If anything is incorrect or you have questions, please reach out via an issue or email.

  • In the table of contents, each top-level item is called a 'session' and each sub-item a 'chapter'.
  • All source code and notebook files are publicly available on Github.
  • We recommend reading the notebooks via NBViewer rather than directly on Github.

Contents

  1. Introduction
  2. Motivation
  3. Distributed Programming
  4. Overview of Parallelism
  5. Data Parallelism
  6. Pipeline Parallelism
  7. Tensor Parallelism
  8. Zero Redundancy Optimization
  9. Multi-dimensional Parallelism
  10. Additional Techniques

Environments

Local Environments

  • Linux Ubuntu 18.04 LTS
  • 4 * A100 GPU
  • Python 3.7
  • pytorch==1.9.0+cu111

Docker Environments

  • docker pull pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel
  • For smooth hands-on practice, increase --shm-size or set the --ipc=host option when starting the container, as in the example below.
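
For reference, assuming the NVIDIA Container Toolkit is installed on the host, a container for these tutorials could be started roughly as follows (the mount path is only an illustration):

    # --ipc=host shares the host's IPC namespace so multi-process data loading
    # and NCCL have enough shared memory; alternatively pass e.g. --shm-size=8g
    docker run -it --gpus all --ipc=host \
        -v /path/to/large-scale-lm-tutorials:/workspace \
        pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel /bin/bash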

LICENSE

Copyright 2021 TUNiB Inc

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

large-scale-lm-tutorials's People

Contributors

bernardscumm, hyunwoongko


large-scale-lm-tutorials's Issues

An error occurs at broadcast while following 03_distributed_programming.ipynb

Hello. I am learning PyTorch parallel-processing concepts from the notebooks you published.

In 03_distributed_programming.ipynb, everything ran fine up through the P2P Communication section.

However, when I run the sample program for the first broadcast operation in the Collective Communication section, an error occurs. Output is printed up to the call to dist.broadcast(tensor, src=0), and the error appears right after it.
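
For reference, the script I am running is essentially the notebook's broadcast example; roughly it looks like this (reconstructed from memory, so details may differ slightly from the notebook):

    # broadcast.py -- rough reconstruction of the notebook's example, not the exact file
    import torch
    import torch.distributed as dist

    dist.init_process_group("nccl")  # the launcher sets MASTER_ADDR/PORT, RANK, WORLD_SIZE
    rank = dist.get_rank()
    torch.cuda.set_device(rank)      # pin each process to its own GPU
    device = torch.device("cuda", rank)
    print(f"torch.cuda.current_device() : {torch.cuda.current_device()}")

    if rank == 0:
        tensor = torch.randn(2, 2, device=device)
    else:
        tensor = torch.zeros(2, 2, device=device)

    print(f"before rank {rank}: {tensor}\n")
    dist.broadcast(tensor, src=0)    # copy rank 0's tensor to every other rank
    print(f"after rank {rank}: {tensor}\n")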

Could you tell me why this error might be occurring?

My test environment is as follows:
Windows 10 Enterprise 22H2, running Ubuntu 20.04 in Docker Desktop on top of WSL2.
Python 3.8.10, PyTorch 1.13.1+cu116.
Two GTX 1080 Ti GPUs are installed.

The output of the test run is shown below.

root@9023c839d35c:/data/hf_test# python -m torch.distributed.launch --nproc_per_node=2 broadcast.py
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


torch.cuda.current_device() : 0
torch.cuda.current_device() : 1
before rank 0: tensor([[ 0.5138, -1.4212],
[-0.8317, 1.0614]], device='cuda:0')

before rank 1: tensor([[0., 0.],
[0., 0.]], device='cuda:1')

Traceback (most recent call last):
File "broadcast.py", line 21, in
print(f"after rank {rank}: {tensor}\n")
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in format
return object.format(self, format_spec)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in repr
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in init
nonzero_finite_vals = torch.masked_select(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f93e939f457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f93e93693ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f9414a47c64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e3e5 (0x7f9414a1f3e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f9414a22054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f943f11fe23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f93e937f9e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f93e937faf9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f943f37dc68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f943f37df85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f945f86d083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

Traceback (most recent call last):
File "broadcast.py", line 21, in
print(f"after rank {rank}: {tensor}\n")
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in format
return object.format(self, format_spec)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in repr
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in init
nonzero_finite_vals = torch.masked_select(
RuntimeError: numel: integer multiplication overflow
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f32834b2457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f328347c3ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f32aeb5ac64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e3e5 (0x7f32aeb323e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f32aeb35054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f32d9232e23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f32834929e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f3283492af9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f32d9490c68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f32d9490f85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f32f9980083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 260) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

broadcast.py FAILED

Failures:
[1]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 261)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 261

Root Cause (first observed failure):
[0]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 260)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 260
