
Text2SQL

How many times have you pulled your hair out writing a SQL query? Use natural language to generate the appropriate SQL and save your precious hair.

It's a bitch to train large networks; I am thinking of porting the model to TensorFlow and running it on TPUs, so any help would be welcome!

Install all the dependencies before running; see the requirements file.

Parsing

To convert the dataset JSON objects into the language-modelling format, use parse_to_lm.py:

➜  text2sql git:(master) ✗ python3 parse_to_lm.py --help

This file converts the dataset sentences to my format to be used for language
modelling and uses GPT instead of BERT models.

optional arguments:
  -h, --help            show this help message and exit
  --pairs PAIRS         path to pairs dump
  --tables TABLES       path to tables lm dump
  --dev-pairs DEV_PAIRS
                        path to dev pairs dump
  --fresh-tokenizer [FRESH_TOKENIZER]
                        if passed create a new sentencepiece tokenizer model.
                        Change args from file.
  --corpus CORPUS       what will be the file to feed to tokenizer.
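
A hypothetical invocation is shown below; the file paths are placeholders rather than the repository's actual layout, so point them at wherever your dataset dumps live:

➜  text2sql git:(master) ✗ python3 parse_to_lm.py \
      --pairs data/pairs.json \
      --tables data/tables.json \
      --dev-pairs data/dev_pairs.json \
      --corpus data/lm_corpus.txt \
      --fresh-tokenizer

Passing --fresh-tokenizer creates a new sentencepiece tokenizer model from the --corpus file; omit it to reuse an existing tokenizer.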

Trainer File

Once parsing is complete, we can train or finetune a GPT-2 model. If the finetune option is chosen, we use Hugging Face's transformers.GPT2LMHeadModel; otherwise a new model is created from your arguments. The arguments are listed below:

➜  text2sql git:(master) ✗ python3 train_gpt2.py --help
Train GPT2 model on t2sql corpus

optional arguments:
  -h, --help            show this help message and exit
  --tf {t,f}            Whether to train the model from scratch (t) or
                        finetune (f)
  --pairs PAIRS         path to pairs dump
  --tables TABLES       path to tables lm dump
  --num_epochs NUM_EPOCHS
                        Number of epochs to train / finetune
  --save_folder SAVE_FOLDER
                        Folder to save model to
  --model MODEL         Saved model to have filepath `<model>.pt`
  --save_every SAVE_EVERY
                        Save model every this epoch
  --tensorboard [TENSORBOARD]
                        If passed, prepares tensorboard summaries
  --maxlen MAXLEN       Maximum sequence length
  --n_embd N_EMBD       Embedding Dimension of model
  --n_layer N_LAYER     Number of layers in the model
  --n_head N_HEAD       Number of attention heads in the model
  --activation_function ACTIVATION_FUNCTION
                        Activation function to use in between the layers
  --resid_pdrop RESID_PDROP
                        Residual connection dropout probability
  --embd_pdrop EMBD_PDROP
                        Embedding connection dropout probability
  --attn_pdrop ATTN_PDROP
                        Attention connection dropout probability
  --layer_norm_epsilon LAYER_NORM_EPSILON
                        Layer norm epsilon value
  --initializer_range INITIALIZER_RANGE
                        Range for initializer
  --summary_type SUMMARY_TYPE
                        summary_type in GPTConfig
  --summary_use_proj SUMMARY_USE_PROJ
                        summary_use_proj in GPTConfig
  --summary_activation SUMMARY_ACTIVATION
                        summary_activation in GPTConfig
  --summary_proj_to_labels SUMMARY_PROJ_TO_LABELS
                        summary_proj_to_labels in GPTConfig
  --summary_first_dropout SUMMARY_FIRST_DROPOUT
                        summary_first_dropout in GPTConfig
  --bos_token_id BOS_TOKEN_ID
                        beginning of statement ID in vocabulary
  --eos_token_id EOS_TOKEN_ID
                        end of statement ID in vocabulary
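
As an illustration, a finetuning run and a from-scratch run might look like the commands below; the paths and hyperparameter values are placeholders, not the project's defaults:

➜  text2sql git:(master) ✗ python3 train_gpt2.py --tf f \
      --pairs data/pairs.json --tables data/tables.json \
      --num_epochs 5 --save_folder models --model t2sql_ft --save_every 1

➜  text2sql git:(master) ✗ python3 train_gpt2.py --tf t \
      --pairs data/pairs.json --tables data/tables.json \
      --num_epochs 20 --maxlen 512 --n_embd 256 --n_layer 6 --n_head 8 \
      --save_folder models --model t2sql_scratch --tensorboard

Under the hood, the from-scratch path presumably maps these flags onto a transformers.GPT2Config; the sketch below is an assumption about how that wiring could look, not the project's actual code:

    from transformers import GPT2Config, GPT2LMHeadModel

    # Build a fresh GPT-2 from CLI-style hyperparameters
    # (values mirror the example flags above, not project defaults).
    config = GPT2Config(
        n_positions=512,                 # --maxlen
        n_embd=256,                      # --n_embd
        n_layer=6,                       # --n_layer
        n_head=8,                        # --n_head
        activation_function="gelu_new",  # --activation_function
        resid_pdrop=0.1,                 # --resid_pdrop
        embd_pdrop=0.1,                  # --embd_pdrop
        attn_pdrop=0.1,                  # --attn_pdrop
        layer_norm_epsilon=1e-5,         # --layer_norm_epsilon
        initializer_range=0.02,          # --initializer_range
    )
    model = GPT2LMHeadModel(config)  # --tf t: new model, trained from scratch

    # --tf f: finetune the pretrained weights instead
    # model = GPT2LMHeadModel.from_pretrained("gpt2")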

Datasets

We use the CoSQL, Spider, and SParC datasets; credit to their authors. A couple of things to note: there are 178 tables in total, but only 166 appear in the training data, and the dev set has 20 tables.

License

The code here is under the MIT License, and the transformers code is under the Apache License, Version 2.0.
