hpcaitech / palm-colossalai

Scalable PaLM implementation of PyTorch

License: Apache License 2.0

Python 98.46% Shell 0.44% Dockerfile 1.10%

palm-colossalai's Issues

Warning When Using Different HuggingFace Datasets

Hello,

Any idea if this warning will impact the training of the model when using alternative datasets? Or can it be ignored? I understand that PaLM needs concatenated input sequences of length 2048.

Warning thrown:

Token indices sequence length is longer than the specified maximum sequence length for this model (1366 > 1024). Running this sequence through the model will result in indexing errors.

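Self-contained example:

import copy
from itertools import chain

from datasets import load_dataset
from transformers import GPT2TokenizerFast

dataset = load_dataset("the_pile", 'enron_emails')

print(dataset)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

seq_len = 2048

def tokenize(examples):
    seq_length = seq_len
    # Tokenize the raw text; this call is where the length warning is emitted.
    examples = tokenizer(examples["text"])
    # Concatenate all tokenized examples in the batch into one long sequence per key.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the remainder so that every block has exactly seq_length tokens.
    if total_length >= seq_length:
        total_length = (total_length // seq_length) * seq_length

    # Split the concatenated sequence into blocks of seq_length tokens.
    result = {
        k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
        for k, t in concatenated_examples.items()
    }

    # For causal language modeling, the labels are a copy of the inputs.
    result["labels"] = copy.deepcopy(result["input_ids"])

    return result

tokenized_dataset = dataset.map(
    tokenize, batched=True, num_proc=16, keep_in_memory=True, remove_columns=['text', 'meta']
)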

Thank you,

Enrico
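A note on the warning: the Hugging Face tokenizer emits it whenever a single encoded text is longer than the tokenizer's own `model_max_length`, which is 1024 for the `gpt2` tokenizer. It refers to the GPT-2 model, not to whatever model later consumes the tokens. Because the `tokenize` function above concatenates everything and re-cuts it into fixed-size blocks, the length of any individual document does not determine the length of the training sequences. The sketch below illustrates this with made-up token ids and no external dependencies; the name `group_into_blocks` is hypothetical, and unlike the function above it also drops a batch that is shorter than one block.

```python
from itertools import chain


def group_into_blocks(batch, block_size):
    """Concatenate every tokenized example in the batch, then cut fixed-size blocks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in batch.items()}
    total = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    total = (total // block_size) * block_size
    blocks = {
        k: [v[i:i + block_size] for i in range(0, total, block_size)]
        for k, v in concatenated.items()
    }
    # Causal language modeling: labels are a copy of the inputs.
    blocks["labels"] = [list(b) for b in blocks["input_ids"]]
    return blocks


# Toy batch: three "documents" of 5, 3 and 4 made-up token ids.
toy = {"input_ids": [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]}
out = group_into_blocks(toy, block_size=4)
print(out["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```

Every block has exactly `block_size` tokens even though the first document is longer than a block. Whether a block length of 2048 is safe still depends on the model accepting sequences of that length.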

Have you really reproduced PaLM, or are you just joking?

Have you reproduced all model details? Implementing only some of the model components (i.e., building the decoder as reported) should not be called reproducing PaLM or an implementation of PaLM. Have you really reproduced similar results (e.g., the sample model outputs in Appendix F)? Or are you just chasing clout and promoting your startup in a grandstanding manner?

Overclaiming is a form of falsification or academic misconduct, since you have not implemented or reproduced it at all.

Spreading fake news is very disrespectful and could mislead the entire community.🙂️

Gemini badcase

See MR #41.
The launch script:

env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=./configs/palm_8b_zero_gemini_badcase.py

With 'auto', it fails in the backward pass of the first iteration (see the error log).

However, with 'cpu' it passes.
That indicates our Gemini is not robust enough.

There seems to be a version mismatch problem

Hi,

I tried to run your demo but ran into some problems.
With your default settings, i.e., colossalai 0.1.2, I got:

Traceback (most recent call last):
  File "train.py", line 18, in <module>
    from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)

Then I upgraded to colossalai 0.1.3 and got:

Traceback (most recent call last):
  File "train.py", line 176, in <module>
    train_palm()
  File "train.py", line 149, in train_palm
    hooks.ThroughputHook(ignored_steps=10, tflop_per_step = tflop),
TypeError: __init__() got an unexpected keyword argument 'tflop_per_step'

I also tried building colossalai 0.1.5 based on your Dockerfile and got:

NameError: name 'colossal_C' is not defined

So can you provide some suggestions about how to set up the environment in a correct way?
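When chasing a mismatch like this, it can help to print what is actually installed before launching training. The sketch below uses only the standard library; the helper names `installed_version` and `has_symbol` are hypothetical, and it makes no claim about which colossalai release provides which function.

```python
import importlib
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_symbol(module_name, symbol):
    """Return True if the module can be imported and exposes the given name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, symbol)


if __name__ == "__main__":
    print("colossalai version:", installed_version("colossalai"))
    # The two names that train.py imports in the first traceback above.
    for name in ("colo_set_process_memory_fraction", "colo_device_memory_capacity"):
        print(name, "available:", has_symbol("colossalai.utils", name))
```

Running this inside the same environment or container that runs train.py shows whether the installed package matches what the script expects.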

GitHub encourages communication and ongoing review.

Why not answer issue #27 directly?
Have you reproduced all model details? Yes or no?
Have you really reproduced similar results (e.g., the sample model outputs in Appendix F)? Yes or no?

As you said, GitHub encourages communication and ongoing review, so why are you tracking down my real identity and closing the issues rather than answering the questions directly? Is that how your team treats users' questions?

If you didn't overclaim, please show the reproduced results rather than closing or deleting the issues.

GitHub is a place to share source code and encourage open collaboration; it does not encourage spreading fake news.
Chasing clout is your freedom, but you also have to respect the spirit of open source.
Even a high school student knows better than you how to respect others.

[feature] add model checkpointing

torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Above is the program's log; it says torch.distributed.elastic.multiprocessing.errors.ChildFailedError.
Does anybody know why this happens? Thanks!
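For context: `ChildFailedError` is raised by the torchrun launcher when a worker process exits with a non-zero code, so it is a summary rather than the root cause. The real exception is usually printed by the worker earlier in the log. PyTorch's elastic error documentation describes a `record` decorator that makes the launcher's summary include the worker's traceback. The sketch below shows how it might be applied; `main` is a hypothetical stand-in for the training entry point, and the fallback branch exists only so the snippet still runs where PyTorch is not installed.

```python
try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback so the sketch still runs where PyTorch is not installed.
    def record(fn):
        return fn


@record
def main():
    # Hypothetical stand-in for the real training entry point.
    return 0


if __name__ == "__main__":
    main()
```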

Gemini+2.5D badcase

Using MR #41.

The launch script is as follows.

env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=configs/palm_8b_zero_2p5d_badcase.py

It failed after a few iterations. I suspect the bug is in Gemini.
Error log:

Can I run this on one rtx 4070 ti?

Or do I need to use the PaLM_PyTorch by lucidrains to run it efficiently?
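A rough way to reason about the single-GPU question is to estimate training memory from the parameter count. The sketch below assumes mixed-precision Adam at roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), ignores activations entirely, and takes 12 GB as the RTX 4070 Ti's memory. The 8-billion figure is only read off the palm_8b config names mentioned in other issues, and the helper names are hypothetical.

```python
# Assumed bytes per parameter: fp16 weight + fp16 grad + fp32 master weight
# + fp32 Adam momentum + fp32 Adam variance.
BYTES_PER_PARAM_MIXED_ADAM = 2 + 2 + 4 + 4 + 4


def training_state_gb(n_params, bytes_per_param=BYTES_PER_PARAM_MIXED_ADAM):
    """Estimate memory for weights, gradients and optimizer state in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def fits_on_gpu(n_params, gpu_memory_gb):
    """Return True if the estimated training state fits in the given GPU memory."""
    return training_state_gb(n_params) <= gpu_memory_gb


print(training_state_gb(8e9))  # 128.0
print(fits_on_gpu(8e9, 12))    # False
```

Under these assumptions an 8-billion-parameter model would not fit on one such card without offloading, while a much smaller configuration might.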

[feature] Add performance and scalability results

ModuleNotFoundError: No module named 'torch._six'

Why does this problem happen?

Is it the wrong version of torch?
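Most likely yes: `torch._six` was a private compatibility module that, to my knowledge, was removed in PyTorch 2.0, so this error typically means a library written for an older PyTorch is being imported under a newer one. Installing a PyTorch 1.x release that the library supports is the straightforward fix. Where editing the importing code is an option, a guarded import along the following lines is a common workaround; which file needs it is not shown in the report, so treat this as a sketch.

```python
import math

try:
    # Present in older PyTorch releases only.
    from torch._six import inf, string_classes
except ImportError:
    # Standard-library equivalents of the two most commonly imported names.
    inf = math.inf
    string_classes = (str, bytes)

print(inf, string_classes)
```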

bash ./tools/download_token.py </PATH/TO/TOKENIZER/>

The README file has a small error:
bash ./tools/download_token.py </PATH/TO/TOKENIZER/>
which is supposed to be:
bash ./tools/download_token.sh </PATH/TO/TOKENIZER/>

Fails with cannot import colo_set_process_memory_fraction in Docker

On a multi-GPU A100 system:

$ cat CONFIG_FILE.py
from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

parallel = dict(
    tensor=dict(mode="1d", size=4),
)

model = dict(
    type="palm_small",
    # use_grad_checkpoint=False,
    # use_act_offload=False,
)

fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0

export DATA=wiki_dataset/
export TOKENIZER=tokenizer/

$ docker run -ti --gpus all --rm palm torchrun --nproc_per_node 1 train.py --from_torch --config CONFIG_FILE.py

Traceback (most recent call last):
  File "train.py", line 18, in <module>
    from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11) of binary: /root/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2022-05-29_08:01:15
  host      : 1d3306a6abee
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
