tencent / patrickstar
PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP and democratizes AI for everyone.
License: BSD 3-Clause "New" or "Revised" License
I got an error when trying to run train_simple_net.py from your examples:
Traceback (most recent call last):
File "/home/somnus/Learn_deeplearing/PatrickStar-master/examples/train_simple_net.py", line 90, in
model, optim = initialize_engine(model_func=model_func, local_rank=0, config=config)
File "/home/somnus/Learn_deeplearing/PatrickStar-master/patrickstar/runtime/init.py", line 70, in initialize_engine
client = PatrickStarClient(
File "/home/somnus/Learn_deeplearing/PatrickStar-master/patrickstar/core/client.py", line 53, in init
tracer_config = config.get("mem_tracer", None)
AttributeError: 'NoneType' object has no attribute 'get'
I did not modify the code at all. What could be causing this?
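The traceback shows that config reached PatrickStarClient as None, so config.get("mem_tracer", None) fails. A minimal sketch of passing a non-None config (only the mem_tracer key is confirmed by the traceback; the other fields follow the README-style config and are otherwise assumptions):

from patrickstar.runtime import initialize_engine

config = {
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 0.001, "betas": (0.9, 0.999), "eps": 1e-6, "weight_decay": 0},
    },
    "fp16": {"enabled": True, "loss_scale": 0},
    "default_chunk_size": 64 * 1024 * 1024,
    "mem_tracer": {},  # config must not be None when PatrickStarClient reads this key
}
model, optim = initialize_engine(model_func=model_func, local_rank=0, config=config)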
pip install . --user hit the error below, because "from __future__ import annotations" is not supported on Python 3.6. Commenting out patrickstar/core/eviction_policy.py#L30 solved it; it seems the __future__ import is not necessary?
ERROR: Command errored out with exit status 1:
command: /mnt/cache/share/spring/conda_envs/miniconda3/envs/s0.3.4/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-9pb26brm/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-9pb26brm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-req-build-9pb26brm/pip-egg-info
cwd: /tmp/pip-req-build-9pb26brm/
Complete output (14 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-9pb26brm/setup.py", line 32, in <module>
from patrickstar.ops.op_builder import CPUAdamBuilder
File "/tmp/pip-req-build-9pb26brm/patrickstar/__init__.py", line 30, in <module>
from .core import PatrickStarClient
File "/tmp/pip-req-build-9pb26brm/patrickstar/core/__init__.py", line 31, in <module>
from .chunk_list import ChunkList
File "/tmp/pip-req-build-9pb26brm/patrickstar/core/chunk_list.py", line 43, in <module>
from patrickstar.core.eviction_policy import ChunkEvictionPolicyBase
File "/tmp/pip-req-build-9pb26brm/patrickstar/core/eviction_policy.py", line 30
from __future__ import annotations
^
SyntaxError: future feature annotations is not defined
Similar to DeepSpeed.
PatrickStar is awesome; it helps reduce the memory used by model states!
The current trend is to use Megatron-DeepSpeed as the framework for training transformer-based NLP models, for both pretraining and fine-tuning. So will you support PatrickStar on Megatron-DeepSpeed?
Memory-centric tiling (MCT) can split a model data tensor into pieces that do not need to be stored in contiguous memory. This helps reduce the chunk size.
DeepSpeed MCT
This technique is a trick and should not be implemented in the core of PatrickStar.
It is helpful for improving our benchmark results, and therefore should live in the dir ./example.
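As a rough illustration of the idea (not DeepSpeed's actual TiledLinear API), a large Linear can be split along its output dimension so that no single parameter tensor must occupy one contiguous region:

import torch
import torch.nn as nn

class TiledLinear(nn.Module):
    # Split one big Linear along the output dimension into n_tiles
    # sub-Linears; each tile's weight is a separate, smaller tensor.
    def __init__(self, in_features, out_features, n_tiles=4):
        super().__init__()
        assert out_features % n_tiles == 0
        tile_out = out_features // n_tiles
        self.tiles = nn.ModuleList(
            nn.Linear(in_features, tile_out) for _ in range(n_tiles)
        )

    def forward(self, x):
        # Compute each tile independently and concatenate the results.
        return torch.cat([tile(x) for tile in self.tiles], dim=-1)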
Currently, multi-GPU training is incompatible with loading models through HF's from_pretrained interface.
https://huggingface.co/transformers/_modules/transformers/modeling_utils.html#PreTrainedModel.from_pretrained
The main problem arises during model initialization: we set non-local params to [], while from_pretrained still performs a _load_state_dict_into_model pass after initialization to copy the state_dict into the model's parameters. It errors out when it encounters a [] parameter.
The MP trend was brought into PTM training by Megatron-LM, which partitions the model by inserting custom collective communication ops into the transformer implementation.
Model parallelism has many drawbacks.
MP and PatrickStar
In PatrickStar, peak GPU memory consumption is tied to the chunk size. Even without heterogeneous storage, with all chunks kept on the GPU, the model data size is 1/N of the original, similar to MP's consumption. PatrickStar only needs to be compatible with PP; it does not need to be compatible with MP.
It was strange that ZeRO-Offload went out of its way to be compatible with MP. Reading the code, I think the reason is that ZeRO-3's communication uses a very poor design: it temporarily allocates a buffer of size world_size * tensor_numel on the GPU, and with prefetching, several such buffers may exist at once. For layers with large parameters, such as the embedding layer, this can blow up memory, so MP is needed to reduce the per-process tensor_numel.
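A back-of-envelope example of why such buffers explode (the model dimensions here are illustrative, not taken from the issue):

vocab, hidden, world_size = 50257, 1600, 8    # GPT-2-style embedding, 8 GPUs
tensor_numel = vocab * hidden                 # ~80.4M elements
buffer_bytes = world_size * tensor_numel * 2  # fp16 = 2 bytes/element
print(f"{buffer_bytes / 2**30:.1f} GB per temporary buffer")  # ~1.2 GB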
TencentPretrain is a repo of the TEG Data Security Center; we can leverage its model structures and data.
https://git.woa.com/TencentNLP/TencentPretrain/merge_requests/61
TencentPretrain also has a public open-source counterpart:
https://github.com/dbiir/UER-py
Definitions: P = param fp16 (grad fp16), OS = optimizer states.
Assumption 1: the GPU cannot hold P; CPU+GPU can hold P; CPU+GPU cannot hold OS+P.
Assumption 2: the CPU-GPU-NVMe traffic in access_chunk cannot overlap with computation.
ADAM:
direction of access_chunk
GPU (P) -(P)-> CPU (P, OS) <-(OS)- NVMe (OS)
direction of offload_chunk
GPU (P) <-(P)- CPU (P, OS) -(OS)-> NVMe (OS)
Corollary 1: after the ADAM computation, the CPU is full.
Corollary 2: before FWD, the P on the GPU is full.
FWD:
direction of access_chunk
GPU (P) <-(P)- CPU (P, OS) <-(None)- NVMe (OS)
direction of offload_chunk
GPU (P) -(P)-> CPU (P, OS) -(OS)-> NVMe (OS)
The total P on the GPU decreases -> the total P on the CPU increases.
BWD:
direction of access_chunk
GPU (P) <-(P)- CPU (P, OS) <-(None)- NVMe (OS)
direction of offload_chunk
GPU (P) -(P)-> CPU (P, OS) -(None)-> NVMe (OS)
The total P on the GPU increases -> the total P on the CPU decreases.
Corollary 3: during FWD+BWD, the OS must move toward NVMe at chunk granularity (as marked above), and this movement is interleaved with the FWD computation.
Corollary 4: the OS never appears on the GPU.
Corollary 5: part of the OS stays on the CPU the whole time; call it OS_cpu.
Corollary 6: the OS minus OS_cpu is OS_nvme.
Optimization:
After the ADAM computation, when FWD starts, asynchronously offload OS_nvme to NVMe. This offload overlaps with the FWD computation.
This reduces the fine-grained NVMe traffic during the FWD stage.
After FWD ends and before BWD starts, asynchronously move OS_nvme from NVMe back to the CPU; this can overlap with the BWD computation.
This reduces the NVMe traffic during the ADAM stage.
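A minimal sketch of the proposed overlap; offload_to_nvme and fetch_from_nvme are placeholder chunk I/O primitives, not PatrickStar APIs:

import threading

def offload_to_nvme(chunk):
    pass  # placeholder: write the chunk's payload to NVMe and free its CPU copy

def fetch_from_nvme(chunk):
    pass  # placeholder: read the chunk's payload from NVMe back into CPU memory

def train_step(model, loss_fn, batch, os_nvme_chunks):
    # FWD: push OS_nvme out to NVMe in the background while computing.
    offload = threading.Thread(
        target=lambda: [offload_to_nvme(c) for c in os_nvme_chunks])
    offload.start()
    loss = loss_fn(model(batch))
    offload.join()

    # BWD: prefetch OS_nvme back to the CPU while the backward pass runs,
    # so the OS is resident in time for ADAM.
    prefetch = threading.Thread(
        target=lambda: [fetch_from_nvme(c) for c in os_nvme_chunks])
    prefetch.start()
    loss.backward()
    prefetch.join()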
Relates to #120.
The following is the CPU memory figure for the GPT2_4B model. We can see that the chunk memory is much larger than the total used memory (psutil.virtual_memory().used).
One possible reason is that psutil fails to count pinned memory.
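A quick way to test this hypothesis (the buffer size is arbitrary):

import psutil
import torch

before = psutil.virtual_memory().used
buf = torch.empty(1024 ** 3, dtype=torch.uint8, pin_memory=True)  # 1 GB pinned
after = psutil.virtual_memory().used
# If psutil misses pinned memory, the delta will be far below 1 GB.
print((after - before) / 2 ** 30, "GB visible to psutil")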
The current chunk reuse scheme reduces the overall memory footprint from DeepSpeed's 18M to 14M (M is the number of parameters). But the current PatrickStar implementation has a limitation: the reuse plan is designed statically. Before training starts, it dictates that the chunk memory holding param fp16 is reused by grad fp16. This assumes no parameter is updated twice during BWD, which is fine for BERT and GPT, but does not work for LSTMs or seq2seq transformers.
For the latter, we could switch to dynamic reuse: during BWD, track chunk gaps in real time (a param fp16 that is no longer needed can be released, opening a gap in its chunk) and allocate grad fp16 in those gaps.
However, there is currently little demand for extremely large LSTM or seq2seq models, so we can shelve this requirement until it becomes necessary.
Use a loss scaler in mixed-precision training. The goal is to align with DeepSpeed's mixed precision.
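For reference, the usual dynamic loss-scaling behaviour we would align with; this is a sketch of the common DeepSpeed/apex semantics, not PatrickStar's actual class:

class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 15, scale_window=1000,
                 backoff=0.5, growth=2.0):
        self.scale = init_scale
        self.scale_window = scale_window  # grow after this many clean steps
        self.backoff, self.growth = backoff, growth
        self.clean_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            self.scale *= self.backoff  # shrink the scale and skip this step
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps % self.scale_window == 0:
                self.scale *= self.growth  # no overflow for a while: grow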
Hi! I am a newbie in this field. DeepSpeed provides a tutorial on GAN (https://www.deepspeed.ai/tutorials/gan/). I am curious about PatrickStar's performance in models like GANs or other CV models. I really hope that PatrickStar can make my poor GPU accommodate a large-scale GAN.
The current profiler is messy and we have to reorganize this code.
Memory and speed profiler for both PatrickStar and PyTorch.
debug_flag is too scattered; simplify the logic for simulating multiple machines on a single machine.
env MODEL_NAME="GPT3_12B" CPU_EBD=0 CS=128 ACT_OFFLOAD=0 GPU_NUM=8 BS=8 AW=1 bash run_bert.sh
commit 6b4739a
Observation shows that our chunk data usage is somewhat excessive; if chunk data used less memory, we should be able to run larger models.
Currently, during training, we sample CUDA/CPU memory usage before and after each submodule (operator in the paper) computes. However, this cannot accurately capture the peak memory consumption inside a submodule, which easily leads to OOM during submodule computation!
This issue advocates a more accurate sampling method: during the warmup iteration, launch a thread that concurrently samples CPU and GPU memory usage every 0.01 s. This way we know the peak memory of a submodule more accurately.
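A sketch of such a warmup-phase sampler (class and method names are illustrative):

import threading
import time

import psutil
import torch

class PeakMemSampler:
    # Poll CPU/GPU memory on a background thread and keep the running peak.
    def __init__(self, interval=0.01):
        self.interval = interval
        self.peak_cpu = 0
        self.peak_gpu = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def _poll(self):
        while not self._stop.is_set():
            self.peak_cpu = max(self.peak_cpu, psutil.virtual_memory().used)
            if torch.cuda.is_available():
                self.peak_gpu = max(self.peak_gpu, torch.cuda.memory_allocated())
            time.sleep(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

Usage during warmup: run the submodule inside "with PeakMemSampler() as s:" and then read s.peak_cpu / s.peak_gpu.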
Two partitions for model data.
We are going to support both static and dynamic partitions for users.
Running a 40B model on an 8xA100 SuperPod node. The time details are as follows.
allocate payload takes too much time. This comes from frequently creating and deleting chunks for the communication buffer.
If that is true, we should cache (world_size - 1) chunks in GPU memory.
Step 4 elapse 63.65251803398132 s, 41.79747933716272 Tflops
CHUNK_LIST_prepare_device ............. 69.95204448699951, 16.157479393621628 %
CHUNK_allocate_payload ................ 170.51438927650452, 39.38530676631084 %
CLIENT_access ......................... 89.209885597229, 20.605643463536712 %
CLIENT_release ........................ 2.333692789077759, 0.5390348977945169 %
chunk_cpu_gpu_move .................... 123.20745587348938, 28.458380938192935 %
CLIENT_fetch_remote_chunks_broadcast .. 2.685667037963867, 0.620333689204676 %
CLIENT_fetch_remote_chunks ............ 130.1259322166443, 30.056406267825142 %
CLIENT_access_dist .................... 263.52323508262634, 60.86843167792419 %
CLIENT_release_dist ................... 18.868732452392578, 4.358287995999212 %
chunk_gpu_cpu_move .................... 76.71733140945435, 17.720121126871 %
CHUNK_LIST_chunk_move ................. 76.74013209342957, 17.725387614565417 %
FWD ................................... 79.00622844696045, 18.24880913004294 %
CLIENT_release_dist_reduce ............ 0.033098697662353516, 0.007645116441658537 %
HOOK_torch_allreduce .................. 5.245239019393921, 1.2115420212804222 %
BWD ................................... 238.41187143325806, 55.068224640575735 %
ADAM_prepare_data_grad_copy ........... 11.999637842178345, 2.771668828091965 %
ADAM_prepare_data ..................... 65.50793719291687, 15.130982276149549 %
ADAM_compute .......................... 27.286763429641724, 6.302679515178464 %
ADAM_param_fp32_to_fp16 ............... 21.466453075408936, 4.958307877399151 %
ADAM_release_data ..................... 0.21362972259521484, 0.04934405943401391 %
ADAM .................................. 115.520991563797, 26.68296622938133 %
CHUNK_LIST_make_room .................. 7.30447244644165, 1.6871824676488498 %
TOTAL ................................. 432.9390914440155
------------- DATA MOVE RESULTS --------------
chunk_cpu_gpu_move: 903168.0 MB, 1176 times, 7330.465462474785 MB/s
chunk_gpu_cpu_move: 962304.0 MB, 1253 times, 12543.50200040208 MB/s
ADAM_prepare_data_grad_copy: 99757.8125 MB, 1045 times, 8313.401938628052 MB/s
ADAM_param_fp32_to_fp16: 199515.625 MB, 1045 times, 9294.29861091289 MB/s
Performance results on Aug 10
log.GPT2small_gpu_1_cs_64_bs_128_cpueb_1_margin_0.8_warmup_0.2_gpu_0.8_adamcvt_1
2021-08-10:14:34:53,509 INFO [memory_monitor.py:65] CPU Virtual Memory: used = 15.08 GB, percent = 96.6%
2021-08-10:14:34:53,509 INFO [test_bert.py:223] ckp True fp16 True ps True: step elapse 5.177955627441406 sec/iter, 18.463766371092152 Tflops
2021-08-10:14:34:53,509 INFO [test_bert.py:225] model 0.72940493
2021-08-10:14:34:53,509 INFO [global_timer.py:45] *********** PROFILE RESULTS *************
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_LIST_prepare_device, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_allocate_payload, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_access, 0.019408226013183594, 0.338427821424322 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_release, 0.014924049377441406, 0.2602357121256555 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] chunk_cpu_gpu_move, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_access_dist, 0.03873419761657715, 0.6754213447995139 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_release_dist, 0.3606679439544678, 6.289089298897653 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] chunk_gpu_cpu_move, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_LIST_chunk_move, 0, 0.0 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] FWD, 0.28232502937316895, 4.9229973187357 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] BWD, 2.9886157512664795, 52.1135067722565 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy, 0.2039637565612793, 3.5565852198787224 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_prepare_data, 0.22702884674072266, 3.958779022397416 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_compute, 0.013135433197021484, 0.2290470049819615 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_param_fp32_to_fp16, 0.5844182968139648, 10.190700111226695 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_release_data, 0.016661882400512695, 0.29053889612597344 %
2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM, 0.9849364757537842, 17.174671477149886 %
2021-08-10:14:34:53,509 INFO [global_timer.py:76] *********** DATA MOVE RESULTS *************
2021-08-10:14:34:53,509 INFO [global_timer.py:86] chunk_cpu_gpu_move: 0.0 MB
2021-08-10:14:34:53,509 INFO [global_timer.py:86] chunk_gpu_cpu_move: 0.0 MB
2021-08-10:14:34:53,509 INFO [global_timer.py:83] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy: 2782.4589920043945 MB, 393 times, 13641.92854120348 MB/s
2021-08-10:14:34:53,509 INFO [global_timer.py:83] ADAM_param_fp32_to_fp16: 2782.4589920043945 MB, 393 times, 4761.0744002597885 MB/s
The chunk list construction in PreprocessCtx, even excluding memory copies, is very time-consuming. It prevents us from quickly testing large models and slows down the development iteration loop.
If we want to train a pretrained model ourselves with PatrickStar, we can adopt the following plan:
https://huggingface.co/uer/gpt2-chinese-cluecorpussmall
PatrickStar's overall functionality is now ready for open source. In September we need to focus on amplifying the impact of the open-source release, mainly on two fronts: application results and performance results.
Application results
Performance results
As we hope to support MoE in #187, and MoE is mainly a model-parallel structure rather than a data-parallel one, we need to support managing only part of the model with chunks.
There are several design choices to make, including but not limited to:
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.A = SubNetA(...)  # managed by chunk
        self.B = SubNetB(...)  # not managed by chunk
        self.C = SubNetC(...)  # managed by chunk
self.A and self.C need model.backward(loss), while self.B only needs loss.backward().
cc @feifeibear
The design of the current engine is chaotic and needs a refactor.
The engine has the following members: client (the access and release API), chunk-tensor-mapper (ctm) (a database storing tensor-chunk relations), and chunkmgr (manages chunks). These three modules are decoupled: every client access first resolves the address in the ctm, which may then trigger the mgr to return a tensor.
It is unreasonable for the current client to contain both the ctm and the chunkmgr.
Some ctm initialization happens during model initialization, and some happens in the client's initialization.
The engine would be initialized with the following logic:
chunk_tensor_mapper = ChunkTensorIndex()
chunkmgr = ChunkMgr()
# Build the chunk-tensor mapping for param fp16 and param fp32,
# and initialize the memory of the corresponding chunks.
with ps_model_init(ctm=chunk_tensor_mapper, chunkmgr=chunkmgr):
    model = model_func()
# From the param fp16 mapping, build the chunk-tensor mapping for
# momentum and variance, and initialize the corresponding memory.
optimizer = PS_fp16_adam(chunk_tensor_mapper, chunkmgr=chunkmgr)
client = PSClient(chunk_tensor_mapper, chunkmgr)
model.register_hook(client)
return model, optimizer
Because embedding parameters are much larger than those of other layers, we do not hand the embedding parameters over to chunk management, and we pin their computation to the CPU.
So before each computation, a hook copies the input from GPU to CPU; after the CPU embedding layers compute, the output activations are copied back to the GPU for the subsequent computation.
However, some PyTorch versions do not support CPU embedding computation on torch.half (e.g., 1.4.0+cu100 does not, while 1.7.1+cu110 does).
Currently the CPU embedding also keeps two copies, param fp16 and param fp32, but the param fp16 copy is actually stored as torch.float and used for the FWD and BWD computation, while param fp32 is used for the ADAM computation. Moreover, every process stores the full parameters.
This wastes a huge amount of memory: we really only need one torch.float copy of the params, and it could be sharded across processes in a model-parallel fashion.
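A sketch of the CPU-embedding trick described above (names are illustrative, not PatrickStar's actual classes):

import torch
import torch.nn as nn

class CpuEmbedding(nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        # Kept as torch.float on the CPU: some torch versions
        # (e.g. 1.4.0+cu100) have no half-precision CPU embedding kernel.
        self.embedding = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, input_ids):
        out = self.embedding(input_ids.cpu())  # compute on CPU
        # Copy the activations back to the GPU for the rest of the network.
        return out.to(device="cuda", dtype=torch.half)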
Currently, training will start whether or not the configs of the 2 nodes are the same. This may cause weird results during benchmarking. We should consider communicating the config among the nodes to make sure they are running the same program...
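One possible check (an assumed approach, not an existing PatrickStar utility): hash the config, broadcast rank 0's hash, and assert equality on every rank.

import hashlib
import json

import torch
import torch.distributed as dist

def check_config_consistency(config):
    digest = hashlib.md5(json.dumps(config, sort_keys=True).encode()).digest()
    local = torch.tensor(list(digest), dtype=torch.uint8)
    remote = local.clone()
    dist.broadcast(remote, src=0)  # every rank receives rank 0's hash
    assert torch.equal(local, remote), "config differs from rank 0"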
Completing the following features makes us ready for internal open source:
New feature development
Exploratory features
We would like to have a CI that runs the unit tests each time an MR is proposed to the develop and master branches. However, we currently have no idea how to find a GPU to run the unit tests. Does anyone have ideas?
The rank 0 process loads the model, then passes it to the other processes via P2P.
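A minimal sketch of this hand-off using a plain broadcast (the checkpoint path is hypothetical):

import torch
import torch.distributed as dist

def load_on_rank0_and_broadcast(model, ckpt_path="model.pt"):
    if dist.get_rank() == 0:
        model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
    for param in model.parameters():
        dist.broadcast(param.data, src=0)  # other ranks receive rank 0's weights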
PatrickStar currently does not manage buffers at all. As a result, all buffers are placed on the CPU, causing runtime errors. There are several possible ways to manage buffers:
Progress of translating code comments:
On SuperNode, 4xA100. The schema is:
log.GPT_DS_20B_gpu_4_cs_384_bs_8_cpueb_0_lightseq_0_offload_0_SP_0_AMM_1_MSC_1_CACHE_1
If we do not enable the loss scaler:
91.16946903509981 TFlops
CLIENT_fetch_remote_chunks_broadcast .. 0.008783340454101562, 0.045844479770245986 %
CHUNK_LIST_prepare_device ............. 1.2182786464691162, 6.358782407949941 %
CHUNK_allocate_payload_cuda ........... 2.8260343074798584, 14.750432744387322 %
CLIENT_fetch_remote_chunks ............ 3.865243434906006, 20.174565176495836 %
CLIENT_access_dist .................... 4.45051383972168, 23.229373011158142 %
CLIENT_release_dist ................... 0.5840051174163818, 3.04820369095599 %
chunk_cpu_gpu_move .................... 1.315248727798462, 6.864915917752167 %
FWD ................................... 3.430629253387451, 17.90610465665054 %
chunk_gpu_cpu_move .................... 1.211526870727539, 6.323541641863795 %
CHUNK_LIST_chunk_move ................. 1.2118439674377441, 6.325196722159518 %
CLIENT_release_dist_reduce ............ 0.004155874252319336, 0.021691507244168236 %
BWD ................................... 9.724532127380371, 50.757011949891286 %
ADAM_prepare_data_grad_copy ........... 0.7624289989471436, 3.9794837739847733 %
CLIENT_access ......................... 0.4744589328765869, 2.4764294477411446 %
ADAM_prepare_data ..................... 1.2415502071380615, 6.480247879758143 %
ADAM_compute .......................... 2.4682326316833496, 12.88289364880856 %
ADAM_param_fp32_to_fp16 ............... 2.1865973472595215, 11.41290359584114 %
CLIENT_release ........................ 0.02169013023376465, 0.11321122549126299 %
ADAM_release_data ..................... 0.024057865142822266, 0.1255695731730847 %
ADAM .................................. 6.003831148147583, 31.33688339345817 %
TOTAL ................................. 19.158992528915405
chunk_cpu_gpu_move: 165120.0 MB, 214 times, 1472599.3546247077 MB/s
chunk_gpu_cpu_move: 148992.0 MB, 194 times, 195513.34532889986 MB/s
ADAM_prepare_data_grad_copy: 49831.09375 MB, 525 times, 39540.773513608205 MB/s
ADAM_param_fp32_to_fp16: 99662.1875 MB, 525 times, 58856.407915226984 MB/s
With the loss scaler enabled:
69.97870338199793 TFlops
CLIENT_fetch_remote_chunks_broadcast .. 0.007257938385009766, 0.049356369721460944 %
CHUNK_LIST_prepare_device ............. 0.7660794258117676, 5.209592224489411 %
CHUNK_allocate_payload_cuda ........... 2.468181848526001, 16.784448888027352 %
CLIENT_fetch_remote_chunks ............ 3.210313558578491, 21.8311887637781 %
CLIENT_access_dist .................... 3.4126784801483154, 23.207336831979323 %
CLIENT_release_dist ................... 0.131638765335083, 0.8951869286978318 %
FWD ................................... 2.59273362159729, 17.63144193688909 %
chunk_gpu_cpu_move .................... 0.7620553970336914, 5.182227504426379 %
CHUNK_LIST_chunk_move ................. 0.7620978355407715, 5.182516100241714 %
chunk_cpu_gpu_move .................... 0.11212825775146484, 0.7625090559096998 %
CLIENT_release_dist_reduce ............ 0.0038404464721679688, 0.026116299962988396 %
BWD ................................... 6.16964864730835, 41.95564133156423 %
ADAM_prepare_data_grad_copy ........... 1.2602458000183105, 8.570086207136955 %
CLIENT_access ......................... 0.010293245315551758, 0.0699974558171156 %
ADAM_prepare_data ..................... 1.2735180854797363, 8.660342116395686 %
ADAM_compute .......................... 2.7817044258117676, 18.916505598856098 %
ADAM_param_fp32_to_fp16 ............... 1.6933107376098633, 11.5150703113443 %
CLIENT_release ........................ 0.010694026947021484, 0.07272290281474308 %
ADAM_release_data ..................... 0.011899471282958984, 0.0809203210300938 %
ADAM .................................. 5.942788362503052, 40.41291673154668 %
TOTAL ................................. 14.705170631408691
chunk_cpu_gpu_move: 195072.0 MB, 244 times, 148315.6728283042 MB/s
chunk_gpu_cpu_move: 179712.0 MB, 225 times, 148335.13341068564 MB/s
ADAM_prepare_data_grad_copy: 39864.875 MB, 420 times, 52286.67201149269 MB/s
ADAM_param_fp32_to_fp16: 79729.75 MB, 420 times, 36462.93182415404 MB/s
Currently, running PatrickStar always requires init_process_group. Users should be able to run it directly via python xxx.py.
Here is an idea: we can actually shrink the memory footprint further.
We could keep only param fp32. During FWD, the param fp16 a submodule (e.g. Linear) needs is allocated temporarily and copied from param fp32, then released as soon as the computation finishes.
During BWD, it is converted from param fp32 again on demand; once grad fp16 is produced, ADAM runs immediately to update param fp32, after which grad fp16 can also be discarded.
Total memory consumption drops from 14M to 12M, i.e. the size of the OS (M is the number of parameters).
In other words: fuse FWD, BWD, and ADAM.
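The per-parameter byte accounting behind those numbers, under the assumptions above:

os_bytes = 4 + 4 + 4     # param fp32 + momentum + variance = 12M (the OS)
current = 2 + os_bytes   # 14M: one persistent fp16 chunk (param fp16,
                         # whose memory is reused by grad fp16) plus the OS
proposed = os_bytes      # 12M: fp16 copies become transient temporaries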
A paper supports this idea:
OPTIMIZER FUSION: EFFICIENT TRAINING WITH BETTER LOCALITY AND PARALLELISM
https://arxiv.org/pdf/2104.00237.pdf
PatrickStar's mission is to democratize PTM training through open source. We must therefore make the interface as simple as DeepSpeed's, and guarantee accuracy consistent with some widely recognized training framework. There are several options for such a framework.
We could choose to integrate into the DeepSpeed ecosystem: concretely, align accuracy with DeepSpeed and merge PatrickStar into DeepSpeed with as few changes as possible, for the following reasons:
Currently, when running the BERT model in test_bert.py, there is a large speed gap between enabling and disabling PatrickStar. Measured on a V100: 0.35 s/iter with PatrickStar enabled vs 0.13 s/iter with it disabled, with a large gap in GPU utilization as well. For a small model like this that fits entirely on the GPU, PatrickStar should be close to native PyTorch speed.
We cannot rule out that the many profilers currently enabled in test_bert.py are a factor.
Mixture of Experts (MoE) is a popular PTM structure. We hope to support MoE training in PatrickStar.
So far, there are few MoE implementations in PyTorch, e.g. laekov/fastmoe, and none of them support huggingface. Therefore, we hope:
Notice that we may not be able to put the experts in chunks, as they may be visited randomly.
Chunk size is a critical hyperparameter in PatrickStar. An appropriate chunk size setting can significantly improve memory efficiency and throughput.
We intend to develop a script that chooses the best chunk size before the actual training starts.
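A sketch of what such a script could look like; run_warmup_steps is a placeholder for a short dry-run that returns throughput:

def run_warmup_steps(chunk_size, steps=3):
    raise NotImplementedError  # placeholder: run `steps` iterations, return Tflops

def choose_chunk_size(candidates=(32, 64, 128, 256)):  # candidate sizes in MB
    best, best_tput = None, 0.0
    for cs in candidates:
        try:
            tput = run_warmup_steps(chunk_size=cs * 1024 * 1024)
        except RuntimeError:  # e.g. CUDA OOM for this chunk size
            continue
        if tput > best_tput:
            best, best_tput = cs, tput
    return best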
For historical reasons, the Manager is a singleton and bundles the Metronome, the training stage, and the memory tracer.
I am going to split these functions out of the Manager and make it a runtime tracker belonging to the Client.
In production use we hit a strange problem: the business image does not allow writing compilation results at runtime (all kinds of errors are thrown), so we need to compile the corresponding .so at install time instead. That way all write operations happen while the image is being built, and nothing breaks at runtime.
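A sketch of ahead-of-time building via setup.py (the source path is hypothetical; the actual builder in the repo is CPUAdamBuilder from patrickstar.ops.op_builder):

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="patrickstar",
    ext_modules=[
        CppExtension(
            name="cpu_adam",
            sources=["patrickstar/ops/csrc/cpu_adam.cpp"],  # hypothetical path
            extra_compile_args=["-O3", "-fopenmp"],
        )
    ],
    # Compile at `pip install` time, so no .so is written at run time.
    cmdclass={"build_ext": BuildExtension},
)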