lassl / lassl
Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets
License: Apache License 2.0
poetry run python3 train_tokenizer.py --corpora_dir corpora \
--corpus_type sent_text \
--model_type roberta \
--vocab_size 51200 \
--min_frequency 2
poetry run python3 serialize_corpora.py --model_type roberta \
--tokenizer_dir tokenizers/roberta \
--corpora_dir corpora \
--corpus_type sent_text \
--max_length 512 \
--num_proc 96 \
--batch_size 1000 \
--writer_batch_size 1000
ref:
Allow users to add special tokens in train_tokenizer.py.
Update requirements.txt
The overall framework is basically in place. Before releasing v0.1.0, the following items need to be discussed.
The model_type values supported by serialize_corpora.py and train_tokenizer.py do not match:
serialize_corpora.py: roberta, gpt2, albert
train_tokenizer.py: bert-uncased, bert-cased, gpt2, roberta, albert, electra
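The gap between the two scripts can be made explicit with a set difference. The following is a minimal sketch that uses only the model types listed above; the constant and function names are hypothetical and do not come from the lassl code base.

```python
# Model types listed in the issue above for each script.
SERIALIZE_MODEL_TYPES = {"roberta", "gpt2", "albert"}
TOKENIZER_MODEL_TYPES = {"bert-uncased", "bert-cased", "gpt2", "roberta", "albert", "electra"}


def unsupported_for_serialization(tokenizer_types, serialize_types):
    """Return model types that can train a tokenizer but cannot yet be serialized."""
    return sorted(set(tokenizer_types) - set(serialize_types))


missing = unsupported_for_serialization(TOKENIZER_MODEL_TYPES, SERIALIZE_MODEL_TYPES)
print(missing)  # ['bert-cased', 'bert-uncased', 'electra']
```

A check like this could run in a unit test so that the two scripts do not drift apart again.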
README.md
pretrain_language_model.py
Blender class for mixing datasets
to: @seopbo
cc: @lassl/authors
Add a badge and a link to https://huggingface.co/lassl in README.md.
cc: @lassl/authors
transformers 4.13.0 has been released, so this updates the dependency. The release notes are at the link below.
Add CITATION.cff to repository.
cc: @lassl/authors
Improve RobertaPreProcessor in renew-hf-style.
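Improve the load_corpora function so that it additionally supports the following format:
Document 0
Sentence 0,0
Sentence 0,1
Sentence 0,2
...
Sentence 0,N
Document 1
Sentence 1,0
Sentence 1,1
Sentence 1,2
...
Sentence 1,M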
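A loader for a sentence-per-line corpus with document boundaries might look like the sketch below. The issue does not say how documents are delimited, so the sketch assumes blank lines separate documents; `split_documents` is a hypothetical helper, not the actual load_corpora implementation.

```python
from typing import Iterable, List


def split_documents(lines: Iterable[str]) -> List[List[str]]:
    """Group sentence lines into documents, treating blank lines as boundaries (assumed)."""
    documents: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            current.append(sentence)
        elif current:
            # A blank line closes the document collected so far.
            documents.append(current)
            current = []
    if current:
        # Flush the last document when the input does not end with a blank line.
        documents.append(current)
    return documents


sample = ["Sentence 0,0", "Sentence 0,1", "", "Sentence 1,0", "Sentence 1,1", "Sentence 1,2"]
docs = split_documents(sample)
print(len(docs))  # 2
```

Keeping each document as a list of sentences preserves the boundaries that a later preprocessing step may need.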
Add an author to pyproject.toml.
Add a preprocessor for gpt2 serialization.
Replace MIT with Apache-2.0.
cc: @lassl/authors
Fix a typo in the import statement in src/collators.py: _torch_collator_batch -> _torch_collate_batch (Line 5 in 27f8229)
cc: @lassl/authors
DataCollatorForBart
Is your feature request related to a problem? Please describe.
Add a BART processor and collator.
Describe the solution you'd like
Add the text_infilling method as a collator.
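Text infilling, as described in the BART paper, replaces each sampled span of tokens with a single mask token, with span lengths drawn from a Poisson distribution; a zero-length span inserts a mask without removing anything. The sketch below illustrates the idea on a plain list of string tokens. It is not the proposed DataCollatorForBart: the function name, its parameters, and the per-position start probability are assumptions made for illustration, and a real collator would operate on batched token IDs.

```python
import numpy as np


def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, poisson_lambda=3.0, seed=0):
    """Collapse randomly chosen spans into one mask token each (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    budget = int(round(len(tokens) * mask_ratio))  # maximum number of tokens to remove
    start_prob = mask_ratio / poisson_lambda       # heuristic chance of starting a span
    output = []
    i = 0
    while i < len(tokens):
        if budget > 0 and rng.random() < start_prob:
            span = min(int(rng.poisson(poisson_lambda)), budget, len(tokens) - i)
            output.append(mask_token)  # the whole span becomes ONE mask token
            budget -= span
            i += span
            if span == 0:
                # Zero-length span: the mask is a pure insertion, so keep the token.
                output.append(tokens[i])
                i += 1
        else:
            output.append(tokens[i])
            i += 1
    return output


tokens = [f"w{i}" for i in range(20)]
out = text_infilling(tokens, seed=0)
print(out)
```

Because a span of any length maps to one mask token, the model has to predict how many tokens are missing, which is what distinguishes this objective from token-level masking.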
Fix packaging subpackage of lassl
Add example configs (bert-small.yaml, roberta-small.yaml, gpt2-small.yaml, albert-small.yaml)
Describe the bug
(docu_text, DocuSent)
(docu_json, DocuJson)
(sent_text, SentText)
(sent_json, SentJson)
text_type_per_line -> corpus_type
scripts -> loading
Support training T5 model
Describe the bug
In train_tokenizer.py, np.random.choice is called with its default replace=True, so the same data may be sampled more than once.
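The difference is easy to see in isolation. The sketch below samples indices with and without replacement; `sample_indices` is a hypothetical helper, not code from train_tokenizer.py.

```python
import numpy as np


def sample_indices(population_size, sample_size, seed=0):
    """Sample distinct indices by passing replace=False explicitly."""
    rng = np.random.RandomState(seed)
    return rng.choice(population_size, size=sample_size, replace=False)


# Default behaviour: replace=True, so an index can be drawn more than once.
with_replacement = np.random.RandomState(0).choice(10, size=10)

# With replace=False every index appears at most once.
without_replacement = sample_indices(10, 10)

print(len(set(with_replacement.tolist())), "unique values with replacement")
print(len(set(without_replacement.tolist())), "unique values without replacement")
```

Note that replace=False requires the sample size to be no larger than the population; otherwise NumPy raises a ValueError.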
Support training Electra model
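I also mentioned this on Slack: the objective based on the Mixture of Denoisers, introduced in the Unifying Language Learning Paradigms (UL2) paper, is reported to be better overall than the existing span corruption, MLM, and CLM objectives. I happen to be planning to use it at work, so I would like to implement a collator and processor for it in lassl. What do you think?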
Resolve the version mismatch between poetry.lock, pyproject.toml, and requirements.txt.
Use GitHub Actions to enforce isort and black formatting.
Describe the bug
The DataCollatorForGpt2 class does not inherit from DataCollatorForLanguageModeling.
Is your feature request related to a problem? Please describe.
Describe the solution you'd like
lassl/pretrain_language_model.py (Line 47 in c507a54)
ModelArguments from config.json file
TrainingArguments from config.json file
Upload default configs for gpu, tpu.
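One way to read argument groups from a single config.json is to map each JSON section onto a dataclass. The sketch below uses only the standard library; the `ModelArguments` fields, the section names, and `load_arguments` are hypothetical and are not taken from pretrain_language_model.py. For real dataclass-based arguments, transformers also offers `HfArgumentParser.parse_json_file`.

```python
import json
import tempfile
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class ModelArguments:
    # Hypothetical fields, for illustration only.
    model_type: str = "roberta"
    hidden_size: int = 256


def load_arguments(cls, config_path, section):
    """Build a dataclass instance from one section of a JSON config file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    known = {f.name for f in fields(cls)}
    # Ignore keys the dataclass does not declare instead of failing on them.
    values = {k: v for k, v in config.get(section, {}).items() if k in known}
    return cls(**values)


with tempfile.TemporaryDirectory() as tmp_dir:
    config_file = Path(tmp_dir) / "config.json"
    config_file.write_text(
        json.dumps({"model_args": {"model_type": "gpt2", "unused_key": 1}}),
        encoding="utf-8",
    )
    model_args = load_arguments(ModelArguments, config_file, "model_args")

print(model_args)  # ModelArguments(model_type='gpt2', hidden_size=256)
```

Keys missing from the file fall back to the dataclass defaults, which makes it possible to ship small default configs for GPU and TPU and override only what differs.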
Add albert processor for serialization