langbox

Language + Vision pipelines playground

Implementations of my own ideas, done in my free time, mainly in the areas of NLP and CV.
Written for educational and research purposes only.

Contents

  1. Fake news generation pipeline.

Fake news generation

This solution lets you train the ru-gpt3 model on Telegram channel data and run inference with it.
After that, you can run the ru-dalle model on the generated text in a low-resource-friendly mode.

Pipeline

  1. Download any Telegram channel data in .json format. You can use the desktop version of the messenger for this.
  2. Parse the downloaded data into a training-ready format with the script:
python utils/parse_telegram.py --data_path /PATH/TO/DOWNLOADED/JSON.json --start_date 1980-01-01T00:00:00 --special_remove "STRINGS;OF;WORDS;TO;REMOVE"
  3. Optionally, do your own further data cleaning.
  4. Randomly split the data into train / val sets:
python dataset/utils.py /PATH/TO/PARSED/DATA.txt PORTION_OF_VAL
  5. Train the ru-gpt3 model on the prepared data:
python train_telegram.py --model_type /PATH/TO/rugpt3small_based_on_gpt2 --input_file /PATH/TO/train.txt --val_file /PATH/TO/val.txt --run_name EXPERIMENT_NAME --save_every 180 --n_epochs 10 --sample_file /PATH/TO/sample.txt --lr 0.00005

Here sample.txt is a file with the beginnings of the sentences you want generated, one per line. For example:

Британские учёные открыли
Российские физики из ННГУ им. Н.И. Лобачевского
Астрономы обнаружили

Sampling occurs at every validation step. You can use your own schedulers, optimizers and metrics. The LM loss is used for best-model selection.

  6. Run the trained language model:

python run_gpt.py /PATH/TO/BEST_OR_FINAL/CHECKPOINT /PATH/TO/sample.txt /PATH/TO/rut5_paraphrase

Here sample.txt is the file of sentence beginnings, in the format from the previous step. This script also extracts keywords from the generated text to use as input to the DALL-E model. I used a paraphraser to make the extracted texts more natural.
As a result, two files are generated: predict_sample.txt with the "fake news" and generation_predict_sample.txt with the extracted keywords.

Examples:

predict_sample.txt:

Британские учёные открыли новый тип графов — латинские символы с крутыми вершинами. Это стало возможным благодаря исследованию числа Δ (или единицы — это не «знак умножения», а просто число) и, как следствие, произошло изменение знака препинания в древнем и современном языках.
Российские физики из ННГУ им. Н.И. Лобачевского создали квантовый компьютер, в котором реализована технология двойных преобразований информации. А теперь можно проверить работу системы на сжатие.
Астрономы обнаружили на границе созвездия Венера протопланету, которая будет наблюдать рождение в нем нестабильных звезд. По расчетам ученых, они могут быть связаны с ростом масс звезд, расположенными близко к Земле.

generation_predict_sample.txt:

новый тип графов и изменение знака препинания
двойные преобразования информации и российские физика
рост масс звёзд и расчёты учёных

  7. Run the rudalle model on the generated texts:
python run_rudalle.py /PATH/TO/generation_predict_sample.txt --rudalle_path /PATH/TO/rudalle_malevich --rudalle_name Malevich

This script runs inference with the rudalle model, ranks the generated pictures with the ru-clip model, applies x2 upscaling, and stores the pictures with the top CLIP scores in the folder dalle_generation_predict_sample. Inference was tested with CUDA 11.4 on an NVIDIA RTX 2070 Super with 8 GB of VRAM.
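
The ranking part of this step can be illustrated with a minimal sketch. It is not the code from run_rudalle.py: the function name select_top_images, the file names, and the scores are all hypothetical, and the scores stand in for whatever similarity values the ru-clip model assigns to each candidate picture.

```python
from pathlib import Path


def select_top_images(scored_images, top_k=1):
    """Return the top_k image paths with the highest score, best first.

    scored_images is an iterable of (path, score) pairs. In the real
    pipeline the score would come from a CLIP-style model; here it is
    just a float supplied by the caller (an assumption for illustration).
    """
    ranked = sorted(scored_images, key=lambda item: item[1], reverse=True)
    return [path for path, _ in ranked[:top_k]]


# Hypothetical candidates generated for one text prompt.
candidates = [
    (Path("candidate_0.png"), 0.21),
    (Path("candidate_1.png"), 0.17),
    (Path("candidate_2.png"), 0.34),
]

best = select_top_images(candidates, top_k=2)
print(best)
```

Sorting once and slicing keeps the selection independent of how many pictures were generated per prompt, so the same helper works for any top_k.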

  8. Visualize the generated news with the script:
python utils/visualize.py /PATH/TO/predict_sample.txt /PATH/TO/RUDALLE_IMAGES_FOLDER font=/PATH/TO/TTF.ttf

You also need to download a TrueType font for text visualization.
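
For readers who want to see what the parsing and train / val splitting steps of the pipeline roughly involve, here is a minimal sketch. It is not the code from utils/parse_telegram.py or dataset/utils.py. It assumes the Telegram Desktop export is a JSON object with a top-level "messages" list, where each message has an ISO-formatted "date" and a "text" that is either a string or a list of strings and {"text": ...} fragments. The function names extract_texts and split_train_val, and the seed parameter, are hypothetical.

```python
import random
from datetime import datetime


def extract_texts(export, start_date, special_remove=()):
    """Return one cleaned line per message dated on or after start_date."""
    texts = []
    for message in export.get("messages", []):
        date = message.get("date")
        if date is None or datetime.fromisoformat(date) < start_date:
            continue
        raw = message.get("text", "")
        if isinstance(raw, list):
            # Assumed export layout: formatted messages are a list of plain
            # strings and dict fragments that carry their text under "text".
            raw = "".join(
                part if isinstance(part, str) else part.get("text", "")
                for part in raw
            )
        for unwanted in special_remove:
            raw = raw.replace(unwanted, "")
        line = " ".join(raw.split())  # collapse newlines and repeated spaces
        if line:
            texts.append(line)
    return texts


def split_train_val(lines, val_portion, seed=0):
    """Shuffle lines and return (train, val) with val_portion of them in val."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_portion)
    return shuffled[n_val:], shuffled[:n_val]


# Tiny in-memory stand-in for a loaded export file (hypothetical data).
export = {
    "messages": [
        {"date": "1979-05-01T10:00:00", "text": "too old"},
        {
            "date": "2021-03-01T10:00:00",
            "text": ["Astronomers found ", {"type": "bold", "text": "a planet"}, " #ad"],
        },
        {"date": "2021-03-02T10:00:00", "text": ""},
    ]
}

texts = extract_texts(export, datetime(1980, 1, 1), special_remove=["#ad"])
train, val = split_train_val([str(i) for i in range(10)], 0.2)
```

A fixed seed makes the split reproducible between runs, which helps when comparing experiments trained on the same data.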

Acknowledgements

Project code mainly based on:

GPT-3 pretrained model:

Contributors

4sunshine
