
large-scale-lm-tutorials's Introduction

Large-scale language modeling tutorials with PyTorch

Hello, I am Hyunwoong Ko (고현웅), a machine learning engineer at TUNiB. These materials were prepared to introduce the various techniques needed to develop large-scale language models, and they assume prior knowledge of PyTorch and Transformer language models. If anything is incorrect or you have questions, please reach out via an issue or email.

  • In the table of contents, each top-level item is called a 'session' and each sub-item a 'chapter'.
  • All source code and notebook files are publicly available on Github.
  • We recommend reading the notebooks via NBViewer rather than directly on Github.

Contents

  1. Introduction
  2. Motivation
  3. Distributed Programming
  4. Overview of Parallelism
  5. Data Parallelism
  6. Pipeline Parallelism
  7. Tensor Parallelism
  8. Zero Redundancy Optimization
  9. Multi-dimensional Parallelism
  10. Additional Techniques

Environments

Local Environments

  • Linux Ubuntu 18.04 LTS
  • 4 * A100 GPU
  • Python 3.7
  • pytorch==1.9.0+cu111

Docker Environments

  • docker pull pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel
  • For smooth hands-on practice, increase --shm-size or set the --ipc=host option when starting the container, as in the example below.
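
For reference, assuming the NVIDIA Container Toolkit is installed on the host, a container for these tutorials could be started roughly as follows (the mount path is only an illustration):

    # --ipc=host shares the host's IPC namespace so multi-process data loading
    # and NCCL have enough shared memory; alternatively pass e.g. --shm-size=8g
    docker run -it --gpus all --ipc=host \
        -v /path/to/large-scale-lm-tutorials:/workspace \
        pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel /bin/bash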

LICENSE

Copyright 2021 TUNiB Inc

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

large-scale-lm-tutorials's People

Contributors

bernardscumm, hyunwoongko


large-scale-lm-tutorials's Issues

An error occurs at broadcast while following 03_distributed_programming.ipynb

Hello. I am learning PyTorch parallel-processing concepts from the notebooks you published.

In 03_distributed_programming.ipynb, everything ran fine up through the P2P Communication section.

However, when I run the sample program for the first broadcast operation in the Collective Communication section, an error occurs. Output is printed up to the call to dist.broadcast(tensor, src=0), and the error appears right after it.
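
For reference, the script I am running is essentially the notebook's broadcast example; roughly it looks like this (reconstructed from memory, so details may differ slightly from the notebook):

    # broadcast.py -- rough reconstruction of the notebook's example, not the exact file
    import torch
    import torch.distributed as dist

    dist.init_process_group("nccl")  # the launcher sets MASTER_ADDR/PORT, RANK, WORLD_SIZE
    rank = dist.get_rank()
    torch.cuda.set_device(rank)      # pin each process to its own GPU
    device = torch.device("cuda", rank)
    print(f"torch.cuda.current_device() : {torch.cuda.current_device()}")

    if rank == 0:
        tensor = torch.randn(2, 2, device=device)
    else:
        tensor = torch.zeros(2, 2, device=device)

    print(f"before rank {rank}: {tensor}\n")
    dist.broadcast(tensor, src=0)    # copy rank 0's tensor to every other rank
    print(f"after rank {rank}: {tensor}\n")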

Could you tell me why this error might be occurring?

My test environment is as follows:
Windows 10 Enterprise 22H2, running Ubuntu 20.04 in Docker Desktop on top of WSL2.
Python 3.8.10, PyTorch 1.13.1+cu116.
Two GTX 1080 Ti GPUs are installed.

The output of the test run is shown below.

root@9023c839d35c:/data/hf_test# python -m torch.distributed.launch --nproc_per_node=2 broadcast.py
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


torch.cuda.current_device() : 0
torch.cuda.current_device() : 1
before rank 0: tensor([[ 0.5138, -1.4212],
[-0.8317, 1.0614]], device='cuda:0')

before rank 1: tensor([[0., 0.],
[0., 0.]], device='cuda:1')

Traceback (most recent call last):
File "broadcast.py", line 21, in
print(f"after rank {rank}: {tensor}\n")
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in format
return object.format(self, format_spec)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in repr
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in init
nonzero_finite_vals = torch.masked_select(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f93e939f457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f93e93693ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f9414a47c64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e3e5 (0x7f9414a1f3e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f9414a22054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f943f11fe23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f93e937f9e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f93e937faf9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f943f37dc68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f943f37df85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f945f86d083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

Traceback (most recent call last):
File "broadcast.py", line 21, in
print(f"after rank {rank}: {tensor}\n")
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in format
return object.format(self, format_spec)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in repr
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in init
nonzero_finite_vals = torch.masked_select(
RuntimeError: numel: integer multiplication overflow
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f32834b2457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f328347c3ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f32aeb5ac64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e3e5 (0x7f32aeb323e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f32aeb35054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f32d9232e23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f32834929e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f3283492af9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f32d9490c68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f32d9490f85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f32f9980083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 260) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

broadcast.py FAILED

Failures:
[1]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 261)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 261

Root Cause (first observed failure):
[0]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 260)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 260
