palm-colossalai's People

Contributors

binmakeswell, fastalgo, feifeibear, frankleeeee, kurisusnowdeng, ver217, wesley-jzy

palm-colossalai's Issues

Warning When Using Different HuggingFace Datasets

Hello,

Any idea whether this warning will impact training of the model when using alternative datasets, or whether it can be ignored? I understand that PaLM needs concatenated input sequences of length 2048.

Warning thrown:

Token indices sequence length is longer than the specified maximum sequence length for this model (1366 > 1024). Running this sequence through the model will result in indexing errors.

Self-contained example:

import copy
from itertools import chain

from datasets import load_dataset
from transformers import GPT2TokenizerFast

dataset = load_dataset("the_pile", "enron_emails")
print(dataset)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

seq_len = 2048

def tokenize(examples):
    # Tokenize the raw text, then concatenate all sequences in the batch.
    examples = tokenizer(examples["text"])
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the remainder so the corpus splits evenly into seq_len chunks.
    if total_length >= seq_len:
        total_length = (total_length // seq_len) * seq_len
    result = {
        k: [t[i : i + seq_len] for i in range(0, total_length, seq_len)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modeling, the labels are a copy of the inputs.
    result["labels"] = copy.deepcopy(result["input_ids"])
    return result

tokenized_dataset = dataset.map(
    tokenize, batched=True, num_proc=16, keep_in_memory=True,
    remove_columns=["text", "meta"],
)

Thank you,

Enrico

Have you really reproduced PaLM, or are you just joking?

Have you reproduced all model details? Only implementing partial model components (i.e., building the decoder, as reported) should not be called reproducing PaLM or an implementation of PaLM. Have you really reproduced similar results (e.g., the sample model outputs in Appendix F)? Or are you just chasing clout and promoting your startup in a grandstanding manner?

Overclaiming is a form of falsification or academic misconduct if you have not actually implemented or reproduced it.

Making fake news is very disrespectful and could mislead the entire community. 🙂️

Gemini badcase

See MR #41.
The launching script:

env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=./configs/palm_8b_zero_gemini_badcase.py

When using 'auto', it failed in the first iteration's backward pass. The error log:

[screenshot of the error log]

However, using 'cpu' passes.
This indicates our Gemini is not robust enough.
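
For context, a hypothetical excerpt of the knob involved, since palm_8b_zero_gemini_badcase.py itself is not shown in this thread: in Colossal-AI 0.1.x, ZeRO's model_config selects where Gemini places tensors, and the failing case corresponds to the automatic policy.

# Hypothetical config excerpt; the real badcase config may differ.
zero = dict(
    model_config=dict(
        # 'auto' lets Gemini migrate tensors between CPU and GPU at runtime
        # and is the case that fails here; 'cpu' pins them on CPU and passes.
        tensor_placement_policy="auto",
    ),
)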

There seems to be a version mismatch problem

Hi,

I tried to run your demo but ran into some problems.
When using your default settings, i.e., colossalai 0.1.2, I got:

Traceback (most recent call last):
  File "train.py", line 18, in <module>
    from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)

Then I tried upgrading to colossalai 0.1.3 and got:

Traceback (most recent call last):
  File "train.py", line 176, in <module>
    train_palm()
  File "train.py", line 149, in train_palm
    hooks.ThroughputHook(ignored_steps=10, tflop_per_step=tflop),
TypeError: __init__() got an unexpected keyword argument 'tflop_per_step'

I also tried building colossalai 0.1.5 from your Dockerfile and got:

NameError: name 'colossal_C' is not defined

Could you provide some suggestions on how to set up the environment correctly?
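
Until exact versions are pinned, a quick hedged sanity check that an installed colossalai exposes the symbols this train.py needs, cheaper than a full distributed launch:

# Probe for the two symbols train.py imports at its line 18; if either prints
# False, the installed colossalai is incompatible with this train.py.
import colossalai.utils as utils

print(hasattr(utils, "colo_set_process_memory_fraction"))
print(hasattr(utils, "colo_device_memory_capacity"))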

GitHub encourages communication and ongoing review.

Why not directly answer issue #27?
Have you reproduced all model details? Yes or no?
Have you really reproduced similar results (e.g., the sample model outputs in Appendix F)? Yes or no?

As you said, GitHub encourages communication and ongoing review, so why are you tracking down my real identity and closing the issues rather than directly answering the questions? Is that how your team treats users' questions?

If you didn't overclaim, please show the reproduced results rather than closing or deleting the issues.

GitHub is a place to share source code and encourage open collaboration, but it does not encourage making fake news.
Chasing clout is your freedom, but you also have to respect the spirit of open source.
Even a high school student knows better than you how to respect others.

Gemini+2.5D badcase

Using MR #41

The launching script is as follows.

env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=configs/palm_8b_zero_2p5d_badcase.py

It failed after a few iterations. I am inclined to attribute the bug to Gemini.
The error log looks like:

[two screenshots of the error log]
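
For reference, a hypothetical sketch of the 2.5D tensor-parallel section such a config usually contains, since palm_8b_zero_2p5d_badcase.py is not shown here; in Colossal-AI 0.1.x the 2.5D mode expects size = depth * q**2 processes.

# Hypothetical excerpt; with 4 processes this corresponds to depth=1 and a
# 2x2 grid. The real badcase config may differ.
parallel = dict(
    tensor=dict(mode="2.5d", depth=1, size=4),
)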

Fails with cannot import colo_set_process_memory_fraction in Docker

On a multi-GPU A100 system:

$ cat CONFIG_FILE.py

from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

parallel = dict(
    tensor=dict(mode="1d", size=4),
)

model = dict(
    type="palm_small",
    # use_grad_checkpoint=False,
    # use_act_offload=False,
)

fp16 = dict(mode=AMP_TYPE.NAIVE)

clip_grad_norm = 1.0

$ export DATA=wiki_dataset/
$ export TOKENIZER=tokenizer/


$ docker run -ti --gpus all --rm palm torchrun --nproc_per_node 1 train.py --from_torch --config CONFIG_FILE.py

Traceback (most recent call last):
  File "train.py", line 18, in <module>
    from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11) of binary: /root/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2022-05-29_08:01:15
  host      : 1d3306a6abee
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
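
A hedged side observation on the config above: it requests tensor=dict(mode="1d", size=4), while the container is launched with --nproc_per_node 1. Colossal-AI needs the world size to be divisible by the tensor-parallel size, so even after the import error is fixed, this launch would likely fail on the parallel setup. A minimal sketch of that consistency check, using the WORLD_SIZE variable torchrun sets:

# Verify the launched world size can accommodate the tensor-parallel size
# requested in CONFIG_FILE.py before any heavy initialization happens.
import os

TP_SIZE = 4  # mirrors parallel = dict(tensor=dict(mode="1d", size=4))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
assert world_size % TP_SIZE == 0, (
    f"world size {world_size} is not divisible by tensor-parallel size {TP_SIZE}; "
    "launch with --nproc_per_node 4 or shrink the tensor-parallel size"
)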
