NLG-System

A natural language generation system to generate technical docs

This code uses OpenAI GPT-2 and the fine-tuning code by nshepperd.

Preprocessing

Things you can or should do before training.

Download a model with the command below. The available models are "117M" and "354M" (not tested). Adjust output_dir in the script first.

python 1Preprocessing\download_model.py 117M

Be careful: we added a language identifier to the hyperparameters in h_params.json. Please add "h_params" = "en".
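Adding the identifier can be done by hand or with a few lines of Python. The sketch below is a minimal example, not part of the repository: `set_language` is a hypothetical helper, and both the file path and the key name are passed in by the caller, because this README mentions `"h_params"` here and `hparams.n_lang` in the training steps.

```python
import json
from pathlib import Path


def set_language(hparams_path, key, language="en"):
    """Add or overwrite a language identifier in a hyperparameter JSON file.

    `key` is whatever name your copy of the code expects (an assumption here).
    """
    path = Path(hparams_path)
    hparams = json.loads(path.read_text(encoding="utf-8"))
    hparams[key] = language
    path.write_text(json.dumps(hparams, indent=2), encoding="utf-8")
    return hparams


# Example call (path and key are placeholders, adjust them to your setup):
# set_language("models/117M/h_params.json", key="h_params", language="en")
```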

Create encoder.json and vocab.bpe

Use subword-nmt by Rico Sennrich to create a new byte pair encoding (BPE) for your language.

  1. Place the .txt file you want to extract embeddings from in data/embedding
  2. Start the process with:

    subword-nmt learn-joint-bpe-and-vocab --input data/embedding/yourfile.txt --output data/embedding/vocab.txt --write-vocabulary data/embedding/encoder.txt --separator Ġ --symbols 50257 -v

  3. Reformat the output so it fits GPT-2:

    python 1Preprocessing/format_embeddings.py

  4. Move encoder.json and vocab.bpe to your base language model in the models directory
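To give an idea of what the reformatting in step 3 involves, here is a minimal sketch. It is not the code of format_embeddings.py, which is not shown in this README. It assumes the vocabulary file written by subword-nmt has one `token count` pair per line, and that encoder.json is a JSON object mapping each token to an integer id with the end-of-text marker as the last entry. `build_encoder` is a hypothetical helper name.

```python
import json


def build_encoder(vocab_lines, end_token="<|endoftext|>"):
    """Map each token of a 'token count' vocabulary listing to an integer id."""
    encoder = {}
    for line in vocab_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        token = parts[0]
        if token not in encoder:  # keep the first id if a token repeats
            encoder[token] = len(encoder)
    # Append the end-of-text marker as the last id.
    encoder.setdefault(end_token, len(encoder))
    return encoder


sample = ["the 120", "Ġdoc 45", "", "the 3"]
print(json.dumps(build_encoder(sample), ensure_ascii=False))
```

With a real file, the lines would come from `open("data/embedding/encoder.txt", encoding="utf-8")`, and the result would be written with `json.dump(..., ensure_ascii=False)`.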

Convert training data PDFs to txt

  1. Place PDFs in training/PDF
  2. Clean the PDFs with your own rules (regex, str.replace) in pdf_to_txt.py
  3. Use pdf_to_txt.py to parse the PDFs into a txt file (with Apache Tika)

    python 1Preprocessing/pdf_to_txt.py
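The cleaning rules from step 2 depend on your documents. As an illustration only, the sketch below shows the kind of regex and str.replace rules that are typical for text extracted from PDFs. `clean_text` and its rules are assumptions, not the rules shipped in pdf_to_txt.py.

```python
import re


def clean_text(raw):
    """Apply a few illustrative cleanup rules to text extracted from a PDF."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain only a page number.
    text = re.sub(r"^[ \t]*\d+[ \t]*$\n?", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces and tabs, then runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(clean_text("Docu-\nment  text\n12\nnext   line\n\n\n\nend"))
```

Rules like the hyphenation fix can also join words that are legitimately hyphenated at a line end, so check the output on a sample of your own PDFs.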

Create .npz

If you don't want to encode your training data on every run, you can save it encoded with numpy savez and load it from that file.

python 1Preprocessing\pre_encode.py .\data\training\PDF .\data\training\trainingsdaten.npz --model_name ISW_Model
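The idea behind the .npz file can be sketched as follows. This is a minimal example that assumes the encoded training data is a list of integer token arrays. `save_chunks` and `load_chunks` are hypothetical names, and pre_encode.py may store the data differently.

```python
import numpy as np


def save_chunks(path, chunks):
    """Store a list of token-id arrays in one .npz archive."""
    # Positional arrays are saved under the names arr_0, arr_1, ...
    np.savez(path, *chunks)


def load_chunks(path):
    """Load all arrays back from the archive, in the order they were stored."""
    with np.load(path) as npz:
        return [npz[name] for name in npz.files]


if __name__ == "__main__":
    tokens = [np.array([15496, 995], dtype=np.int32), np.array([50256], dtype=np.int32)]
    save_chunks("example_tokens.npz", tokens)
    print([chunk.tolist() for chunk in load_chunks("example_tokens.npz")])
```

The token ids in the example are arbitrary placeholders. In practice they would come from the BPE encoder of the model you retrain.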

Training (based on nshepperd)

Steps to train your own model:
  1. We recommend parsing your files into a single .txt (see Preprocessing)

  2. Pre-encode to npz (recommended, see Preprocessing)

  3. Download the model to retrain and rename it

  4. Create embeddings (encoder.json and vocab.bpe) for your language (optional, not recommended because of problems)

  5. Replace the encoder and vocab files

  6. Start retraining with:

    python 2Training/train.py --dataset ./data/training/trainingsdaten.npz --model_name ISW_Model --sample_every 100 --sample_length 200 --run_name iswTrain1

  7. Wait

  8. If you are satisfied with the samples (data/training/samples) and the loss, stop (Ctrl+C)

  9. Get the newest checkpoint from data/training/checkpoint/runX

  10. Replace the following files in your model with the new ones

    * checkpoint
    * model.ckpt.data-00000-of-00001
    * model.ckpt.index
    * model.ckpt.meta
    
  11. Adjust hparams.n_lang to your language

  12. Your model is ready to use. To see some stats in TensorBoard, use:

    tensorboard --logdir=data/training/iswTrain1/checkpoint
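Steps 9 and 10 can be scripted. The sketch below is a minimal example under one assumption that this README does not confirm: the training run writes its checkpoint files as `model-<step>.<suffix>` (for example `model-1000.index`). `copy_newest_checkpoint` is a hypothetical helper. It copies the files of the highest step and renames them to the `model.ckpt.*` names listed in step 10. It does not touch the `checkpoint` index file, which you still need to replace or edit yourself.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme of the training run: model-<step>.<suffix>
STEP_PATTERN = re.compile(r"^model-(\d+)\.(.+)$")


def copy_newest_checkpoint(run_dir, model_dir, target_prefix="model.ckpt"):
    """Copy the files of the highest-numbered checkpoint into model_dir."""
    by_step = {}
    for path in Path(run_dir).iterdir():
        match = STEP_PATTERN.match(path.name)
        if match:
            step, suffix = int(match.group(1)), match.group(2)
            by_step.setdefault(step, []).append((path, suffix))
    if not by_step:
        raise FileNotFoundError(f"no model-<step>.* files in {run_dir}")
    newest = max(by_step)
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    copied = []
    for source, suffix in by_step[newest]:
        target = Path(model_dir) / f"{target_prefix}.{suffix}"
        shutil.copy2(source, target)
        copied.append(target.name)
    return newest, sorted(copied)


# Example call (both paths are placeholders):
# copy_newest_checkpoint("data/training/checkpoint/run1", "models/ISW_Model")
```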

Backend (based on OpenAI GPT-2)

See what the Backend does

Uses Hugging Face pytorch-transformers.

Communicates with the frontend via a REST API on a Flask server.
Receives the supporting words the text should contain, plus the settings, from the user input in the frontend.

Implements 4 different strategies to build text from the given supporting words:

1a. Beam search
1b. Beam search with scope
2. Search until fit
3. Cut-off and insert
4. BERT-GPT2 hybrid

For details, see the paper.
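For readers unfamiliar with the first strategy, the following is a generic beam search over a toy next-token table. It is a textbook sketch, not the backend's implementation: it ignores supporting words and the "scope" variant, and `TOY_MODEL`, `toy_probs` and `beam_search` are illustrative names. In the real system the probabilities would come from GPT-2.

```python
import math

# Toy stand-in for a language model: last token -> {next token: probability}.
TOY_MODEL = {
    "the": {"valve": 0.6, "pump": 0.4},
    "valve": {"opens": 0.55, "closes": 0.45},
    "pump": {"runs": 0.9, "stops": 0.1},
}


def toy_probs(seq):
    """Return the next-token distribution for the last token of seq."""
    return TOY_MODEL.get(seq[-1], {})


def beam_search(next_token_probs, start, beam_width=2, max_len=2):
    """Keep the beam_width most probable sequences after each step."""
    beams = [(0.0, [start])]  # (log-probability, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            expansions = next_token_probs(seq)
            if not expansions:
                candidates.append((score, seq))  # finished sequence, keep as is
                continue
            for token, prob in expansions.items():
                candidates.append((score + math.log(prob), seq + [token]))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


for score, seq in beam_search(toy_probs, "the"):
    print(" ".join(seq), round(math.exp(score), 2))
```

In this toy table a greedy search would pick "valve" first (0.6) and end at probability 0.33, while the beam keeps "pump" alive and finds "the pump runs" at 0.36.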

Frontend

How to use the Word add-in

Generated with the Yeoman generator for Office Add-ins.

To make changes, edit the React app in 4Frontend/src/taskpane/components

To sideload your add-in in Word, run the following command inside the 4Frontend directory:

npm start

To stop it again, run:

npm stop

Open Start > Show TDTG > Enter your Inputs and Settings > click generate

If changes you made to the manifest, e.g. file names of icons for ribbon buttons, do not seem to take effect, clear the Office cache on your computer by deleting the contents of the folder %LOCALAPPDATA%\Microsoft\Office\16.0\Wef\
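Clearing the Office cache folder mentioned above can also be scripted. The sketch below is a minimal example with hypothetical helper names. It deletes everything inside a folder but keeps the folder itself. Close Word before running it, and double-check the path, since the deletion is not reversible.

```python
import os
import shutil
from pathlib import Path


def clear_directory(folder):
    """Delete everything inside folder but keep the folder itself."""
    removed = 0
    for entry in Path(folder).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


def office_cache_dir():
    """Build the cache path named in this README (Windows only)."""
    return Path(os.environ["LOCALAPPDATA"]) / "Microsoft" / "Office" / "16.0" / "Wef"


# Example call on Windows:
# clear_directory(office_cache_dir())
```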

Be careful with spaces in your path name: they cause webpack errors when loading the CA certificate and block sideloading of your add-in.

Contributors

zaksel
