stas00 / ml-engineering Goto Github PK

View Code? Open in Web Editor NEW

9.8K 99.0 589.0 6.06 MB

Machine Learning Engineering Open Book

Home Page: https://stasosphere.com/machine-learning/

License: Creative Commons Attribution Share Alike 4.0 International

Python 84.84% Shell 12.38% Makefile 1.00% CSS 1.79%

pytorch slurm large-language-models llm machine-learning scalability transformers machine-learning-engineering mlops ai

ml-engineering's Introduction

Machine Learning Engineering Open Book

This is an open collection of methodologies, tools and step by step instructions to help with successful training of large language models and multi-modal models.

This is a technical material suitable for LLM/VLM training engineers and operators. That is the content here contains lots of scripts and copy-n-paste commands to enable you to quickly address your needs.

This repo is an ongoing brain dump of my experiences training Large Language Models (LLM) (and VLMs); a lot of the know-how I acquired while training the open-source BLOOM-176B model in 2022 and IDEFICS-80B multi-modal model in 2023. Currently, I'm working on developing/training open-source Retrieval Augmented Generation (RAG) models at Contextual.AI.

I've been compiling this information mostly for myself so that I could quickly find solutions I have already researched in the past and which have worked, but as usual I'm happy to share these with the wider ML community.

My apologies if the layout is a bit unstable while I'm writing new chapters and gradually re-organizing the content to be more intuitive.

Part 1. Insights

The AI Battlefield Engineering - what you need to know in order to succeed

Part 2. Hardware

Compute - accelerators, CPUs, CPU memory.
Storage - local, distributed and shared file systems.
Network - intra- and inter-node networking.

Part 3. Orchestration

SLURM - the main orchestration environment

Part 4. Training

Training - model training related guides

Part 5. Development

Debugging and Troubleshooting - how to debug easy and difficult issues
And more debugging
Testing - numerous tips and tools to make test writing enjoyable

Part 6. Miscellaneous

Resources - LLM/VLM chronicles

Updates

I announce any significant updates on my twitter channel https://twitter.com/StasBekman.

PDF version

Download the PDF version of the book.

I will try to rebuild it once a week or so, but if you want the latest, the instructions for building are here.

Thanks to HuggingFace for giving me permission to host my book's PDF at the HF hub.

Discussions

If you want to discuss something related to ML engineering this repo has the community discussions available - so please don't hesitate to share your experience or start a new discussion about something you're passionate about.

Key comparison tables

High end accelerators:

Networks:

Shortcuts

Things that you are likely to need to find quickly and often.

Tools:

all_reduce_bench.py - a much easier way to benchmark network throughput than nccl-tests.
torch-distributed-gpu-test.py - a tool to quickly test your inter-node connectivity

Guides:

debugging pytorch applications - quick copy-n-paste solutions to resolve hanging or breaking pytorch applications
slurm for users - a slurm cheatsheet and tricks
make tiny models/datasets/tokenizers
LLM/VLM chronicles collection

Gratitude

None of this would have been possible without me being entrusted with doing the specific LLM/VLM trainings I have learned the initial know-how from. This is a privilege that only a few enjoy due to the prohibitively expensive cost of renting huge ML compute clusters. So hopefully the rest of the ML community will vicariously learn from these notes.

Special thanks go to Thom Wolf who proposed that I lead the BLOOM-176B training back when I didn't know anything about large scale training. This was the project that catapulted me into the intense learning process. And, of course, HuggingFace for giving me the opportunity to work full time on BLOOM-176B and later on IDEFICS-80B trainings.

Currently, I'm continue expanding my knowledge and experience while training models and building systems at Contextual.AI and I'm grateful to that opportunity.

I'd also like to say thanks to the numerous contributors who have been making this text awesome and error-free.

Contributing

If you found a bug, typo or would like to propose an improvement please don't hesitate to open an Issue or contribute a PR.

License

The content of this site is distributed under Attribution-ShareAlike 4.0 International.

My repositories map

✔ Machine Learning: ML Engineering Open Book | ML ways | Porting

✔ Guides: The Art of Debugging

✔ Applications: ipyexperiments

ml-engineering's People

Contributors

Stargazers

Watchers

Forkers

databill86 techthiyanes muellerzr markhng525 lvg77 yparvej ishankgp ryanholbrook bcui19 carmocca saforem2 sanyamlakhanpal anminhhung ichirutake anh-vunguyen henrytranvan nguyenthienhy anhlocdeptrai thanhpham1987 vodanhbk95 martinhoang11 khacthuan1996 tonyle9 lienminhquang tuananh1406 underlmao theeyesneverlie28 sprklinginfo tomrailio pavrav fred-fan alexios-25 alee02 tpnguyen tonywhite11 phuocnguyenhuu msmiyels dreamin-edu john-adeojo giaplee jbellars jingli-wtbox hhtnghia321 tuanbc ayush-agarwala ifeanyi55 jackman337 sumit6597 evelynmitchell b0xtch chaosisnotrandomitisrhythmic laurentmazare ksindi ajnovice twebberbr mthad devbox10 pitmonticone alienrobotninja muharremokutan cx0 ole-e-ole goswamig ailabteam polya20 gitbeiro pablovazquezg aadehamid christindbose mbrukman antferdom mikedelafuente chschroeder hbcbh1999 tristanoprofetto dumpmemory mrxvt munirabobaker hungphongtrn ibrahimciko ashioyajotham ravinaik giangdip2410 seshakiran raoulbia skp80 pengzhangzhi abhijeet00 obi-wan-shinobi roshankarande hurricanejin zeroxclem thisishubert omkar-sh guillermo-ayala shahsarick gowsreeni al-amini lewieyasu iliemihai

ml-engineering's Issues

discuss the solutions to Not fully recovering spikes

Not fully recovering spikes are quite common in model training at scale. Like what I have:

It's worth discussing the causes and potential solutions to that.
My general guesses are:

some of the toxic data points at certain iterations
The learning rate.

Welcome everyone to talk about your experiences dealing with the spikes...

Reference:

https://github.com/stas00/ml-engineering/blob/master/training/instabilities/training-loss-patterns.md

Clarification for gradient memory in mixed precision training

Hi @stas00 ,

I was going over your excellent notes on model memory usage and I noticed the following:

Gradients

4 bytes * number of parameters for either fp32 or mixed precision training (gradients are almost always kept in fp32).
2 bytes * number of parameters for more recent works where half-precision is used

I was trying to see how you got the 4 bytes number for mixed precision! From what I've seen in the mixed precision paper and the ZeRO paper, the gradients are kept in half-precision and should take 2 bytes / parameter. However, during the optimization step, the gradients are converted to FP32 - but I assumed this happens on the fly and thus you don't have to write the full FP32 gradient tensor into HBM.

When I was benchmarking memory usage using some of your older BERT-based benchmarking code, the memory usage with FP16 training works out to be 11GB on my machine. If I use the 4 bytes number for gradients instead of 2 bytes, the estimated memory will end up coming closer to the 11GB amount! I was initially assuming this discrepancy is some overhead with temporary buffers (instead of number of bytes for gradient elements to be 4). I'm just curious to know more about this 4 byte number, and even better if you're aware of the implementation details for this in Apex or DeepSpeed. Thank you!

TPU v4 has 1,200GB/s of mem bandwidth and not 2,400, right?

In the README.md:

On Google's documentation: https://cloud.google.com/tpu/docs/v4

Daisy chain batch jobs

The job array works well to queue up multiple jobs:

https://github.com/stas00/ml-engineering/tree/master/fault-tolerance#queue-up-multiple-training-jobs

Another common approach is to "daisy chain" jobs by having the job script submit another job that is dependent on itself. For example, in train.slurm you'd have a line like:

# when train.slurm executes, have it submit another job dependent on itself
sbatch --dependency=$SLURM_JOBID train.slurm

This is usually done near the top of the script, before the command that actually launches the run.

One might also pair that with some logic to stop the chaining when the job is done. For example, the application or the user might touch a "run.done" file when it completes. Then the script can check for that file.

# exit right away if "run.done" file is detected
if [ -f run.done ] ; then
  exit 0
fi

# otherwise chain up another job
sbatch --dependency=$SLURM_JOBID train.slurm

# then launch the run
<<launch run>>>

Additionally, one could check for the "run.done" file after the run and attempt to cancel any already daisy-chained job.

I don't have a list of pros/cons vs the job array, but it's one more method I see in practice.

Conflicting opinions about streaming data from cloud storage?

(1) and (2) seem to express different opinions:

In the "3 Machine Learning IO needs" section, one of the bullet points under "Incoming suggestions from Ross Wightman to integrate" is "Note that once your datasets are optimally friendly for a large, distributed network filesystem, they can usually just be streamed from bucket storage in cloud systems that have that option. So better to move them off the network
filesystem in that case."
The section "Local storage beats cloud storage" starts with "While cloud storage is cheaper the whole idea of fetching and processing your training data stream dynamically at training time is very problematic with a huge number of issues around it...It’s so much better to have enough disk space locally for data loading."

What am I missing?

Convert to bfloat16 failing

I tried your shell script torch-checkpoint-convert-to-bf16 on a quantized model with no .bin files, but it didn't quite work. Here are my results:

(gptq) ~/gptq/models/TheBloke_StableBeluga2-70B-GPTQ$ ll
total 34506488
-rw-rw-r-- 1 matt matt        7020 Jul 28 12:44 LICENSE.txt
-rw-rw-r-- 1 matt matt       15560 Jul 28 12:44 README.md
-rw-rw-r-- 1 matt matt        4766 Jul 28 12:44 USE_POLICY.md
-rw-rw-r-- 1 matt matt         679 Jul 28 12:44 config.json
-rw-rw-r-- 1 matt matt         137 Jul 28 12:44 generation_config.json
-rw-rw-r-- 1 matt matt 35332232456 Jul 28 12:49 gptq_model-4bit--1g.safetensors
-rw-rw-r-- 1 matt matt         301 Jul 28 12:44 huggingface-metadata.txt
-rw-rw-r-- 1 matt matt         183 Jul 28 12:49 quantize_config.json
-rw-rw-r-- 1 matt matt         411 Jul 28 12:49 special_tokens_map.json
-rw-rw-r-- 1 matt matt     1842764 Jul 28 12:49 tokenizer.json
-rw-rw-r-- 1 matt matt      499723 Jul 28 12:49 tokenizer.model
-rw-rw-r-- 1 matt matt         649 Jul 28 12:49 tokenizer_config.json

(gptq) ~/gptq/models/TheBloke_StableBeluga2-70B-GPTQ$ bash ~/mr/utilz/torch-checkpoint-convert-to-bf16 
creating a new checkpoint under dir bf16
converting *bin torch files
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <listcomp>
  File "/home/matt/miniconda3/envs/gptq/lib/python3.11/site-packages/torch/serialization.py", line 791, in load
    with _open_file_like(f, 'rb') as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/gptq/lib/python3.11/site-packages/torch/serialization.py", line 271, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/gptq/lib/python3.11/site-packages/torch/serialization.py", line 252, in __init__
    super().__init__(open(name, mode))
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '*bin'
converting *safetensors files
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <listcomp>
  File "/home/matt/miniconda3/envs/gptq/lib/python3.11/site-packages/torch/serialization.py", line 791, in load
    with _open_file_like(f, 'rb') as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/gptq/lib/python3.11/site-packages/torch/serialization.py", line 271, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/gptq/lib/python3.11/site-packages/torch/serialization.py", line 252, in __init__
    super().__init__(open(name, mode))
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '*bin'
the dir bf16 now contains a copy of the original checkpoint with bf16 weights

(gptq) ~/gptq/models/TheBloke_StableBeluga2-70B-GPTQ$ ll bf16/
total 2312
-rw-rw-r-- 1 matt matt     679 Aug 15 18:27 config.json
-rw-rw-r-- 1 matt matt     137 Aug 15 18:27 generation_config.json
-rw-rw-r-- 1 matt matt     183 Aug 15 18:27 quantize_config.json
-rw-rw-r-- 1 matt matt     411 Aug 15 18:27 special_tokens_map.json
-rw-rw-r-- 1 matt matt 1842764 Aug 15 18:27 tokenizer.json
-rw-rw-r-- 1 matt matt  499723 Aug 15 18:27 tokenizer.model
-rw-rw-r-- 1 matt matt     649 Aug 15 18:27 tokenizer_config.json

What are your thoughts? Thank you!

Missing `hparams` section

Hi Stas, thank you for making these notes public! They are an invaluable resource.

I noticed that the hparams folder, linked here in the readme, seems to be missing from the repository. Was this intentional?

Parallel training hangs

Hi, I saw your toolbox link in a Huggingface issue and gave it a try. My four new GPUs hang when trying to fine tune a transformer, and they appear to do the same thing when running your torch-distributed-gpu-test.py tool, too. However, I'm not sure what the expected outcome is here. I should point out that I can fine tune a transformer with just a single GPU. I'm using Python 3.9.7, Transformers 4.17.0, PyTorch 1.11.0+cu113, NCCL 2.12.7 for CUDA 11.6, and four Nvidia A6000 GPUs.

$ NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
DeepWhite:21288:21288 [0] NCCL INFO Bootstrap : Using enp67s0:192.168.50.21<0>
DeepWhite:21288:21288 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:21288:21288 [0] NCCL INFO NET/IB : No device found.
DeepWhite:21288:21288 [0] NCCL INFO NET/Socket : Using [0]enp67s0:192.168.50.21<0>
DeepWhite:21288:21288 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
DeepWhite:21290:21290 [2] NCCL INFO Bootstrap : Using enp67s0:192.168.50.21<0>
DeepWhite:21289:21289 [1] NCCL INFO Bootstrap : Using enp67s0:192.168.50.21<0>
DeepWhite:21290:21290 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:21289:21289 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:21289:21289 [1] NCCL INFO NET/IB : No device found.
DeepWhite:21290:21290 [2] NCCL INFO NET/IB : No device found.
DeepWhite:21290:21290 [2] NCCL INFO NET/Socket : Using [0]enp67s0:192.168.50.21<0>
DeepWhite:21289:21289 [1] NCCL INFO NET/Socket : Using [0]enp67s0:192.168.50.21<0>
DeepWhite:21290:21290 [2] NCCL INFO Using network Socket
DeepWhite:21289:21289 [1] NCCL INFO Using network Socket
DeepWhite:21291:21291 [3] NCCL INFO Bootstrap : Using enp67s0:192.168.50.21<0>
DeepWhite:21291:21291 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DeepWhite:21291:21291 [3] NCCL INFO NET/IB : No device found.
DeepWhite:21291:21291 [3] NCCL INFO NET/Socket : Using [0]enp67s0:192.168.50.21<0>
DeepWhite:21291:21291 [3] NCCL INFO Using network Socket
DeepWhite:21289:21327 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
DeepWhite:21291:21329 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
DeepWhite:21290:21328 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
DeepWhite:21288:21326 [0] NCCL INFO Channel 00/02 :    0   1   2   3
DeepWhite:21288:21326 [0] NCCL INFO Channel 01/02 :    0   1   2   3
DeepWhite:21288:21326 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
DeepWhite:21288:21326 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff
DeepWhite:21291:21329 [3] NCCL INFO Channel 00 : 3[4a000] -> 0[3000] via P2P/IPC
DeepWhite:21289:21327 [1] NCCL INFO Channel 00 : 1[21000] -> 2[49000] via P2P/IPC
DeepWhite:21288:21326 [0] NCCL INFO Channel 00 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:21290:21328 [2] NCCL INFO Channel 00 : 2[49000] -> 3[4a000] via P2P/IPC
DeepWhite:21291:21329 [3] NCCL INFO Channel 01 : 3[4a000] -> 0[3000] via P2P/IPC
DeepWhite:21289:21327 [1] NCCL INFO Channel 01 : 1[21000] -> 2[49000] via P2P/IPC
DeepWhite:21288:21326 [0] NCCL INFO Channel 01 : 0[3000] -> 1[21000] via P2P/IPC
DeepWhite:21290:21328 [2] NCCL INFO Channel 01 : 2[49000] -> 3[4a000] via P2P/IPC
DeepWhite:21288:21326 [0] NCCL INFO Connected all rings
DeepWhite:21291:21329 [3] NCCL INFO Connected all rings
DeepWhite:21290:21328 [2] NCCL INFO Connected all rings
DeepWhite:21291:21329 [3] NCCL INFO Channel 00 : 3[4a000] -> 2[49000] via P2P/IPC
DeepWhite:21289:21327 [1] NCCL INFO Connected all rings
DeepWhite:21291:21329 [3] NCCL INFO Channel 01 : 3[4a000] -> 2[49000] via P2P/IPC
DeepWhite:21290:21328 [2] NCCL INFO Channel 00 : 2[49000] -> 1[21000] via P2P/IPC
DeepWhite:21289:21327 [1] NCCL INFO Channel 00 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:21290:21328 [2] NCCL INFO Channel 01 : 2[49000] -> 1[21000] via P2P/IPC
DeepWhite:21289:21327 [1] NCCL INFO Channel 01 : 1[21000] -> 0[3000] via P2P/IPC
DeepWhite:21291:21329 [3] NCCL INFO Connected all trees
DeepWhite:21291:21329 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
DeepWhite:21291:21329 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:21288:21326 [0] NCCL INFO Connected all trees
DeepWhite:21288:21326 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
DeepWhite:21288:21326 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:21290:21328 [2] NCCL INFO Connected all trees
DeepWhite:21290:21328 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
DeepWhite:21290:21328 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:21289:21327 [1] NCCL INFO Connected all trees
DeepWhite:21289:21327 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
DeepWhite:21289:21327 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
DeepWhite:21291:21329 [3] NCCL INFO comm 0x7f8894002fb0 rank 3 nranks 4 cudaDev 3 busId 4a000 - Init COMPLETE
DeepWhite:21289:21327 [1] NCCL INFO comm 0x7fd2c8002fb0 rank 1 nranks 4 cudaDev 1 busId 21000 - Init COMPLETE
DeepWhite:21290:21328 [2] NCCL INFO comm 0x7f7aa0002fb0 rank 2 nranks 4 cudaDev 2 busId 49000 - Init COMPLETE
DeepWhite:21288:21326 [0] NCCL INFO comm 0x7fb314002fb0 rank 0 nranks 4 cudaDev 0 busId 3000 - Init COMPLETE
DeepWhite:21288:21288 [0] NCCL INFO Launch mode Parallel

GPU requirements and cost estimation.

Hey, I just came towards this repo and I highly appreciate the content that you put up here. I also have started to do fine-tuning and moving towards large scale fine-tuning. One of the key things before getting started is estimating the number of GPUs required and the estimated cost for doing the fine-tuning. Now there are so many variables like:

What GPU you use.
How many GPUs we are using.
Is this full-finetuning or using LoRA or QLoRA, etc. etc.

The worst part is when we take some amount of GPU but that is not fully utilized. So, I want to understand whether there is any empirical way of doing things or not. Yes, it might not be precise. However, it would be really useful.

If you want, I can dump some of my findings here.

Question about changing precision post training

In the Changing precision post-training section it is stated that :

Using fp16-pretrained model in bf16 regime usually fails - due to overflows [...]

Using bf16-pretrained model in fp16 regime usually works - it will lose some performance on conversion [...]

When reading this statement I consider the following scenario:

model_in_fp16.to(bf16) # Overflow
model_in_bf16.to(fp16) # OK

I'm quite surprised and would have expected the opposite statement as converting weights from fp16 $[-65504;66504]$ to bf16 $[-2^{126}; 2^{127}]$ wouldn't results in a overflow where converting weights from bf16 $[-2^{126}; 2^{127}]$ to fp16 $[-65504;66504]$ could result in a under/overflow.

Is there something I'm overlooking or misunderstanding?
Is the term "in bf16 regime" actually implying that it receives bf16 inputs?

ML

Quarto Site

Hey @stas00 ,

I added a comment on my previous PR:

@stas00 if you're curious, I've done some re-org and tried rendering everything with Quarto in my fork

You can see it online at:

https://saforem2.github.io/ml-engineering/

This is mostly just a first (rough) pass (and copied + modified from existing Quarto sites I've made),
but I made an effort to fix / change things where appropriate (e.g. links, etc.)

disclaimer: I mostly just did this out of personal curiosity / testing, but thought I'd share incase you're interested 🤷🏻‍♂️

I know that I, personally have a hard time keeping track of GitHub comments ( especially on closed PRs 😂 ),
so I figured it probably made more sense to move the discussion to a separate issue (I hope that's okay)

Improve folder structure

Would it be possible to improve the folder structure so it somewhat matches your table of contents?

eg:
├── Part 1
│ ├── Topic 1
│ ├── Topic 2
├── Part 2
│ ├── Topic 3
│ ├── Topic 4

convert markdown to pdf

Hi, I had a good read of this book! Wondering if we can convert the markdown files to PDF so that we can print it out to read. I would like to submit a pr for that! Let me know if you are interested!

Question about the right hidden dim when using SwiGLU

Context

In the SwiGLU-based MLP section, it is stated :

The SwiGLU-based MLP contains an additional learned matrix in its activation function. There the MLP block contains 3 matrices instead of the original 2. To preserve the total number of parameters in the MLP block the paper that introduces SwiGLU proposes to use dim_mlp = 4dim_attninstead of the typical dim_mlp = 8/3dim_attn. The The Case for Co-Designing Model Architectures with Hardware paper provides recommendations for finding the value of hidden dimension (h) that would lead to the best matmul performance, and if you used 8/3*h is likely to result in a much slower MLP block, because 1/3 will break all the alignments.

And later on, in the final recommendations for model sizing section:

The full recommendations are:
[...]
6. For SwiGLU search for the best performing hidden size close to 8/3*h

It took me a couple of readings to understand those statements, and I'm still not entirely certain of their meaning tbh.
It could just be me, as I'm learning all those informations but I do have a few questions about it:

Questions

Right hidden dim when using SwiGLU

Here you said that: "For SwiGLU search for the best performing hidden size close to 8/3*h"

But earlier you've written "if you used 8/3*h is likely to result in a much slower MLP block, because 1/3 will break all the alignments".

Does that mean that you still recommend using 8/3*h despite it leading to slower MLP block?

1/3 breaking the alignments

Also regarding the statement "and if you used 8/3*h is likely to result in a much slower MLP block, because 1/3 will break all the alignments."
I'm not sure what the $1/3$ refers to, clarify its meaning for me?

Note: It could greatly help newcomers like myself to have those small sections clarified for better understanding :)

Minor Typo in emulate multi node

Please add import torch.distributed as dist in the section
4. Create a test script to check both GPUs are used.

I tried creating a branch and raise a PR but could not do so due to

$:~/ml-engineering-stats00$ git push origin asaha/emulate-multi-node
remote: Permission to stas00/ml-engineering.git denied to anindya-saha.
fatal: unable to access 'https://github.com/stas00/ml-engineering.git/': The requested URL returned error: 403

pip install -r build/requirements.txt fails due to github_md_utils

ERROR: Could not find a version that satisfies the requirement github_md_utils (from versions: none) ERROR: No matching distribution found for github_md_utils

Any ideas? Googling github_md_utils gives 0 hits: https://www.google.com/search?client=safari&rls=en&q=github_md_utils&ie=UTF-8&oe=UTF-8

Thanks!

stas00 / ml-engineering Goto Github PK

ml-engineering's Introduction

Machine Learning Engineering Open Book

Table of Contents

Updates

PDF version

Discussions

Key comparison tables

Shortcuts

Gratitude

Contributing

License

My repositories map

ml-engineering's People

Contributors

Stargazers

Watchers

Forkers

ml-engineering's Issues

Reference:

Context

Questions

Right hidden dim when using SwiGLU

1/3 breaking the alignments

Recommend Projects

Recommend Topics

Recommend Org