
SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task

Source code of our EMNLP 2018 paper: SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task.

👍 03/20/2022: We open-sourced a simple but SOTA model (just T5) for 20 tasks including text-to-SQL! Please check out our code in the UnifiedSKG repo!!

Citation

@InProceedings{Yu&al.18.emnlp.syntax,
  author =  {Tao Yu and Michihiro Yasunaga and Kai Yang and Rui Zhang and Dongxu Wang and Zifan Li and Dragomir Radev},
  title =   {SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task},
  year =    {2018},  
  booktitle =   {Proceedings of EMNLP},  
  publisher =   {Association for Computational Linguistics},
}

Environment Setup

  1. The code uses Python 2.7 and PyTorch 0.2.0 with GPU support.
  2. Install the Python dependencies: pip install -r requirements.txt

Download Data, Embeddings, Scripts, and Pretrained Models

  1. Download the dataset from the Spider task website, and put tables.json, train.json, and dev.json under the data/ directory.
  2. Download the pretrained GloVe embeddings and put the file at glove/glove.%dB.%dd.txt (a loading sketch follows this list).
  3. Download evaluation.py and process_sql.py from the Spider GitHub page.
  4. Download the preprocessed train/dev datasets and pretrained models from here. The archive contains generated_datasets/ with:
    • generated_data for the original Spider training datasets; pretrained models can be found at generated_data/saved_models
    • generated_data_augment for the original Spider + augmented training datasets; pretrained models can be found at generated_data_augment/saved_models
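
As a quick sanity check after downloading the embeddings, here is a minimal loading sketch (not the repo's own loader). The file name glove.42B.300d.txt is an assumption; substitute whatever %dB/%dd variant you downloaded.

import numpy as np

def load_glove(path):
    # Build a {word: vector} dictionary from a GloVe text file.
    word_emb = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word_emb[parts[0]] = np.array([float(x) for x in parts[1:]])
    return word_emb

# Example usage (file name is illustrative):
# word_emb = load_glove('glove/glove.42B.300d.txt')
# print(len(word_emb))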

Generating Train/dev Data for Modules

You can find preprocessed train/dev data in generated_datasets/.

To generate it yourself, update the directories marked TODO in preprocess_train_dev_data.py, and run the following command to generate training files for each module:

python preprocess_train_dev_data.py train|dev
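
For example, after filling in the TODO paths, run the script once per split:

python preprocess_train_dev_data.py train
python preprocess_train_dev_data.py dev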

Folder/File Description

  • data/ contains the raw train/dev/test data and the table file.
  • generated_datasets/ contains the preprocessed datasets described above.
  • models/ contains the code for each module.
  • evaluation.py is for evaluation. It uses process_sql.py.
  • train.py is the main file for training. Use train_all.sh to train all the modules (see below).
  • test.py is the main file for testing. It uses supermodel.py to call the trained modules and generate SQL queries; in practice, use test_gen.sh to generate SQL queries.
  • generate_wikisql_augment.py is for cross-domain data augmentation.

Training

Run train_all.sh to train all the modules. It looks like:

python train.py \
    --data_root       path/to/generated_data \
    --save_dir        path/to/save/trained/module \
    --history_type    full|no \
    --table_type      std|no \
    --train_component <module_name> \
    --epoch           <num_of_epochs>
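
A single-module run might look like the following. The save directory matches the SAVE_PATH used in test_gen.sh below, while the module name (col) and epoch count are illustrative only; check train.py for the exact --train_component choices.

python train.py \
    --data_root       generated_datasets/generated_data \
    --save_dir        generated_datasets/generated_data/saved_models_hs=full_tbl=std \
    --history_type    full \
    --table_type      std \
    --train_component col \
    --epoch           300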

Testing

Run test_gen.sh to generate SQL queries. test_gen.sh looks like:

SAVE_PATH=generated_datasets/generated_data/saved_models_hs=full_tbl=std
python test.py \
    --test_data_path  path/to/raw/test/data \
    --models          path/to/trained/module \
    --output_path     path/to/print/generated/SQL \
    --history_type    full|no \
    --table_type      std|no
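
For example, reusing the SAVE_PATH defined above (the test data and output paths are illustrative):

python test.py \
    --test_data_path  data/dev.json \
    --models          $SAVE_PATH \
    --output_path     output/dev_predicted_sql.txt \
    --history_type    full \
    --table_type      std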

Evaluation

Follow the general evaluation process described on the Spider GitHub page.
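
A typical invocation looks roughly like the sketch below. The flags are those of the Spider evaluation script as distributed on that page, but treat them as an assumption and check the downloaded evaluation.py (or its --help) for the exact arguments.

python evaluation.py \
    --gold  path/to/gold/sql/file \
    --pred  path/to/predicted/sql/file \
    --db    path/to/database/dir \
    --table data/tables.json \
    --etype match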

Cross-Domain Data Augmentation

You can find the preprocessed augmented data at generated_datasets/generated_data_augment.

If you would like to run the data augmentation yourself, first download wikisql_tables.json and train_patterns.json from here, then run python generate_wikisql_augment.py to generate more training data. Second, run get_data_wikisql.py to generate the WikiSQL-augment JSON file. Finally, use merge_jsons.py to generate the final Spider + WikiSQL + WikiSQL-augment dataset.
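
Put together, the three steps above look like this command sketch (script names are from the repo; the comments describe the intended output of each step):

python generate_wikisql_augment.py   # generate additional SQL/NL training pairs
python get_data_wikisql.py           # convert them into the WikiSQL-augment JSON file
python merge_jsons.py                # merge Spider + WikiSQL + WikiSQL-augment into the final dataset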

Acknowledgement

The implementation is based on SQLNet. Please cite it too if you use this code.


Issues

Performance on real life applications?

As far as I can see, your work shows results for cross-domain tasks with an accuracy of 27.2% on the test set.

While the results seem impressive in the context of this research work, I wonder how they transfer to real life.
The goal of cross-domain learning is to learn from one domain in order to improve in a target domain.
What I'm looking for is to apply this model to a real-life database, use the cross-domain knowledge, and refine it to perform well on that real-life database.
I'm missing that last part in this work.
To be a successful cross-domain learning model, wouldn't you need to show performance results on an unknown database (test set) such as WikiSQL? Or is your performance on that database also about 27.2%? If so, your performance would be far off from the current leading 83.6%.

What I'm asking is: is there a way to use the syntaxSQL model, leverage the cross-domain knowledge, and refine it for a target database (for example, WikiSQL)?

If this is not possible, I wonder whether this work has any real-life application. An accuracy of 27.2% is unusable for a real-life application. Or am I totally missing something?

Code doesn't have a license file

Hi,
Thanks for sharing your research and the code! Unfortunately, since there's no license file in the repo, the code isn't really usable right now. Could you add a LICENSE file to the repo (also for TypeSQL, since this is based on that)?
Since this seems to be based on TypeSQL, which is based on SQLNet, maybe using the same license as SQLNet would be easiest?
Thanks!

Code is not cpu-compatible

Wanted to let you know that a lot of the model code calls .cuda() before checking whether the gpu parameter is set to False. This is the case in net_utils and most of the model sub-classes.
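
A minimal guard pattern, assuming the boolean gpu flag the code already passes around, would be something like:

import torch

def maybe_cuda(tensor, gpu):
    # Move to GPU only when the gpu flag is set and CUDA is actually available.
    if gpu and torch.cuda.is_available():
        return tensor.cuda()
    return tensor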

Using BERT instead of GloVe

What changes would one have to make to swap out GloVe and use BERT embeddings instead? And do you have plans to do this in the future?

I was wondering whether you could provide the augment data which is similar to train_spider.json

I get 2,198,380 SQL and NL pairs if I run generate_wikisql_augment.py and get_data_wikisql.py, but your paper mentions that only 98,000 of these pairs were used.
I was wondering whether you could provide the augmented data in a format similar to train_spider.json or train_other.json in Spider. I can create the training dataset through preprocess_train_dev_data.py, but it is hard to recover the original JSON dataset from generated_datasets.
Thank you.

Which pretrained models are they?

generated_datasets.zip only contains two model sets: generated_data_augment and generated_data. Which history and table options were they trained with? full and std in both cases?

How to regenerate generated_datasets/generated_data_augment?

README.md states

If you would like to run data augmentation by yourself, first download wikisql_tables.json and train_patterns.json from here, and then run python generate_wikisql_augment.py to generate more training data.

which creates files with names like data_augment/1-23937219-2.txt. However, there does not appear to be any code for reading these files in preprocess_train_dev_data.py so they can be preprocessed and output into generated_data_augment. How can we reproduce this step?

Running your data augmentation on SParC - any recommendations?

Hi!

I'm trying to run your data augmentation scripts on the SParC dataset. The fields are slightly different (interaction, final, etc.), so I'm changing them.

Aside from the field names, do you think there could be any other problems? Or would it work fine in theory?

Do you have any recommendations or a list of things to check when doing this?

Thanks very much for the help!
