hpcaitech / palm-colossalai
Scalable PaLM implementation of PyTorch
License: Apache License 2.0
Hello,
Any idea if this warning will impact the training of the model when using alternative datasets? Or can it be ignored? I understand that PaLM needs concatenated input sequences of length 2048.
Warning thrown:
Token indices sequence length is longer than the specified maximum sequence length for this model (1366 > 1024). Running this sequence through the model will result in indexing errors.
Self-contained example:
import copy
from itertools import chain

from datasets import load_dataset
from transformers import GPT2TokenizerFast

dataset = load_dataset("the_pile", "enron_emails")
print(dataset)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
seq_len = 2048

def tokenize(examples):
    # Tokenize every document in the batch without truncation.
    examples = tokenizer(examples["text"])
    # Flatten each column (input_ids, attention_mask) into one long list.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder so every block has exactly seq_len tokens.
    if total_length >= seq_len:
        total_length = (total_length // seq_len) * seq_len
    result = {
        k: [t[i : i + seq_len] for i in range(0, total_length, seq_len)]
        for k, t in concatenated_examples.items()
    }
    # Causal language modeling: labels are a copy of the inputs.
    result["labels"] = copy.deepcopy(result["input_ids"])
    return result

tokenized_dataset = dataset.map(
    tokenize, batched=True, num_proc=16, keep_in_memory=True, remove_columns=["text", "meta"]
)
Thank you,
Enrico
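The regrouping step in the snippet above can be sketched in isolation. The following is a minimal pure-Python illustration, with small integers standing in for token ids and a block size of 4 instead of 2048. The function name `group_into_blocks` and the toy batch are hypothetical and not part of the repository. As far as I know, the warning comes from the tokenizer when a single document encodes to more tokens than the tokenizer's configured maximum (1024 for the "gpt2" checkpoint); the regrouping then cuts the concatenated tokens into fixed-size blocks regardless of where documents end.

```python
from itertools import chain


def group_into_blocks(batch, block_size):
    """Concatenate every column of the batch and cut it into block_size pieces.

    Mirrors the logic of the tokenize() function quoted above: the trailing
    remainder is dropped when at least one full block is available.
    """
    concatenated = {key: list(chain(*values)) for key, values in batch.items()}
    total_length = len(next(iter(concatenated.values())))
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        key: [tokens[i:i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }
    # Labels are an independent copy of the inputs (causal language modeling).
    result["labels"] = [list(block) for block in result["input_ids"]]
    return result


# Toy batch: three "documents" of different lengths (9 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1]],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Every emitted block has exactly `block_size` tokens, so the per-document length that triggered the warning no longer appears in the output. Whether a block length of 2048 is safe then depends on the maximum sequence length of the model being trained, not on the tokenizer's limit.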
Have you reproduced all the model details? Implementing only some of the model components (i.e., building the decoder as reported) should not be called reproducing PaLM or an implementation of PaLM. Have you really reproduced similar results (e.g., the sample model outputs in Appendix F)? Or are you just chasing clout and promoting your startup in a grandstanding manner?
Overclaiming is a form of falsification or academic misconduct, since you have not implemented or reproduced the model at all.
Spreading fake news is very disrespectful and could mislead the entire community.🙂️
See MR #41
The launching script:
env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=./configs/palm_8b_zero_gemini_badcase.py
When using 'auto', training fails in the backward pass of the 1st iteration.
However, using 'cpu' passes.
That indicates our Gemini is not robust enough.
Hi,
I tried to run your demo but ran into some problems.
With your default settings, i.e., colossalai 0.1.2, I got
Traceback (most recent call last):
  File "train.py", line 18, in <module>
    from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)
Then I upgraded to colossalai 0.1.3 and got
Traceback (most recent call last):
  File "train.py", line 176, in <module>
    train_palm()
  File "train.py", line 149, in train_palm
    hooks.ThroughputHook(ignored_steps=10, tflop_per_step = tflop),
TypeError: __init__() got an unexpected keyword argument 'tflop_per_step'
I also tried building colossalai 0.1.5 from your Dockerfile and got
NameError: name 'colossal_C' is not defined
Can you suggest how to set up the environment correctly?
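One way to narrow down this kind of version mismatch is to probe the installed package for the names that train.py imports before launching a full run. The following is a minimal sketch using only the standard library; the helper name `missing_names` is hypothetical, and it only checks that the names exist, not that their signatures match what train.py expects.

```python
import importlib


def missing_names(module_name, names):
    """Return the subset of `names` that `module_name` does not expose.

    If the module itself cannot be imported, every name counts as missing.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return list(names)
    return [name for name in names if not hasattr(module, name)]


if __name__ == "__main__":
    # The two names come from the import on line 18 of train.py quoted above.
    required = ["colo_set_process_memory_fraction", "colo_device_memory_capacity"]
    print(missing_names("colossalai.utils", required))
```

An empty list means both names are importable from the installed version. A non-empty list reproduces the ImportError condition without starting torchrun.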
Why not directly answer issue #27?
Have you reproduced all the model details? Yes or no?
Have you really reproduced similar results (e.g., the sample model outputs in Appendix F)? Yes or no?
As you said, GitHub encourages communication and ongoing review, so why are you tracking down my real identity and closing the issues rather than directly answering the questions? Is that how your team treats users' questions?
If you didn't overclaim, please show the reproduced results rather than closing or deleting the issues.
GitHub is a place to share source code and encourage open collaboration, but it does not encourage spreading fake news.
Chasing clout is your freedom, but you also have to respect the spirit of open source.
Even a high school student knows better than you how to respect others.
Using MR #41
The launching script is as follows.
env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=configs/palm_8b_zero_2p5d_badcase.py
It failed after a few iterations. I am inclined to attribute the bug to Gemini.
The error log looks like:
Or do I need to use the PaLM_PyTorch by lucidrains to run it efficiently?
The README file has a small error:
bash ./tools/download_token.py </PATH/TO/TOKENIZER/>
which is supposed to be:
bash ./tools/download_token.sh </PATH/TO/TOKENIZER/>
On a multi-GPU A100 system:
$ cat CONFIG_FILE.py
from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

parallel = dict(
    tensor=dict(mode="1d", size=4),
)

model = dict(
    type="palm_small",
    # use_grad_checkpoint=False,
    # use_act_offload=False,
)

fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
export DATA=wiki_dataset/
export TOKENIZER=tokenizer/
$ docker run -ti --gpus all --rm palm torchrun --nproc_per_node 1 train.py --from_torch --config CONFIG_FILE.py
Traceback (most recent call last):
  File "train.py", line 18, in <module>
    from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11) of binary: /root/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2022-05-29_08:01:15
  host      : 1d3306a6abee
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
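Separately from the ImportError, the config above requests `tensor=dict(mode="1d", size=4)` while the launch command uses `--nproc_per_node 1`. The sketch below checks that combination ahead of time. It assumes, as is usual for tensor parallelism, that the number of launched processes must be a multiple of the tensor-parallel size times the pipeline-parallel size; this is an assumption about the framework, and the helper name `check_parallel_sizes` is hypothetical. It is not the cause of the traceback shown here.

```python
def check_parallel_sizes(nproc, tensor_size=1, pipeline_size=1):
    """Return the implied data-parallel size for a launch with `nproc` processes.

    Assumption (not taken from the issue): nproc must be divisible by
    tensor_size * pipeline_size, and the quotient is the data-parallel size.
    """
    model_parallel = tensor_size * pipeline_size
    if nproc % model_parallel != 0:
        raise ValueError(
            f"{nproc} process(es) cannot be split into groups of "
            f"{model_parallel} (tensor={tensor_size}, pipeline={pipeline_size})"
        )
    return nproc // model_parallel


if __name__ == "__main__":
    # Values taken from the config and launch command quoted above.
    try:
        check_parallel_sizes(nproc=1, tensor_size=4)
    except ValueError as err:
        print(err)
    # Four processes with tensor size 4 leave a data-parallel size of 1.
    print(check_parallel_sizes(nproc=4, tensor_size=4))
```

Under that assumption, launching with `--nproc_per_node 4` would be the matching choice for a tensor-parallel size of 4.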