
SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task

Source code of our EMNLP 2018 paper: SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task.

👍 03/20/2022: We open-sourced a simple but SOTA model (just T5) for 20 tasks including text-to-SQL! Please check out our code in the UnifiedSKG repo!!

Citation

@InProceedings{Yu&al.18.emnlp.syntax,
  author =  {Tao Yu and Michihiro Yasunaga and Kai Yang and Rui Zhang and Dongxu Wang and Zifan Li and Dragomir Radev},
  title =   {SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task},
  year =    {2018},  
  booktitle =   {Proceedings of EMNLP},  
  publisher =   {Association for Computational Linguistics},
}

Environment Setup

  1. The code uses Python 2.7 and PyTorch 0.2.0 with GPU support.
  2. Install the Python dependencies: pip install -r requirements.txt

Download Data, Embeddings, Scripts, and Pretrained Models

  1. Download the dataset from the Spider task website, and put tables.json, train.json, and dev.json under the data/ directory.
  2. Download the pretrained GloVe embeddings and put the file at glove/glove.%dB.%dd.txt (a loading sketch follows this list).
  3. Download evaluation.py and process_sql.py from the Spider GitHub page.
  4. Download the preprocessed train/dev datasets and pretrained models from here. The archive contains generated_datasets/ with:
    • generated_data for the original Spider training datasets; pretrained models can be found at generated_data/saved_models
    • generated_data_augment for the original Spider + augmented training datasets; pretrained models can be found at generated_data_augment/saved_models
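
As a quick sanity check after downloading the embeddings, here is a minimal loading sketch (not the repo's own loader). The file name glove.42B.300d.txt is an assumption; substitute whatever %dB/%dd variant you downloaded.

import numpy as np

def load_glove(path):
    # Build a {word: vector} dictionary from a GloVe text file.
    word_emb = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word_emb[parts[0]] = np.array([float(x) for x in parts[1:]])
    return word_emb

# Example usage (file name is illustrative):
# word_emb = load_glove('glove/glove.42B.300d.txt')
# print(len(word_emb))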

Generating Train/dev Data for Modules

You can find preprocessed train/dev data in generated_datasets/.

To generate it yourself, update the directories marked TODO in preprocess_train_dev_data.py, and run the following command to generate training files for each module:

python preprocess_train_dev_data.py train|dev
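
For example, after filling in the TODO paths, run the script once per split:

python preprocess_train_dev_data.py train
python preprocess_train_dev_data.py dev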

Folder/File Description

  • data/ contains the raw train/dev/test data and the table file.
  • generated_datasets/ contains the preprocessed datasets described above.
  • models/ contains the code for each module.
  • evaluation.py is for evaluation. It uses process_sql.py.
  • train.py is the main file for training. Use train_all.sh to train all the modules (see below).
  • test.py is the main file for testing. It uses supermodel.py to call the trained modules and generate SQL queries; in practice, use test_gen.sh to generate SQL queries.
  • generate_wikisql_augment.py is for cross-domain data augmentation.

Training

Run train_all.sh to train all the modules. It looks like:

python train.py \
    --data_root       path/to/generated_data \
    --save_dir        path/to/save/trained/module \
    --history_type    full|no \
    --table_type      std|no \
    --train_component <module_name> \
    --epoch           <num_of_epochs>
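
A single-module run might look like the following. The save directory matches the SAVE_PATH used in test_gen.sh below, while the module name (col) and epoch count are illustrative only; check train.py for the exact --train_component choices.

python train.py \
    --data_root       generated_datasets/generated_data \
    --save_dir        generated_datasets/generated_data/saved_models_hs=full_tbl=std \
    --history_type    full \
    --table_type      std \
    --train_component col \
    --epoch           300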

Testing

Run test_gen.sh to generate SQL queries. test_gen.sh looks like:

SAVE_PATH=generated_datasets/generated_data/saved_models_hs=full_tbl=std
python test.py \
    --test_data_path  path/to/raw/test/data \
    --models          path/to/trained/module \
    --output_path     path/to/print/generated/SQL \
    --history_type    full|no \
    --table_type      std|no
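
For example, reusing the SAVE_PATH defined above (the test data and output paths are illustrative):

python test.py \
    --test_data_path  data/dev.json \
    --models          $SAVE_PATH \
    --output_path     output/dev_predicted_sql.txt \
    --history_type    full \
    --table_type      std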

Evaluation

Follow the general evaluation process described on the Spider GitHub page.
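
A typical invocation looks roughly like the sketch below. The flags are those of the Spider evaluation script as distributed on that page, but treat them as an assumption and check the downloaded evaluation.py (or its --help) for the exact arguments.

python evaluation.py \
    --gold  path/to/gold/sql/file \
    --pred  path/to/predicted/sql/file \
    --db    path/to/database/dir \
    --table data/tables.json \
    --etype match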

Cross-Domain Data Augmentation

You can find the preprocessed augmented data at generated_datasets/generated_data_augment.

If you would like to run the data augmentation yourself, first download wikisql_tables.json and train_patterns.json from here, then run python generate_wikisql_augment.py to generate more training data. Second, run get_data_wikisql.py to generate the WikiSQL-augment JSON file. Finally, use merge_jsons.py to generate the final Spider + WikiSQL + WikiSQL-augment dataset.
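
Put together, the three steps above look like this command sketch (script names are from the repo; the comments describe the intended output of each step):

python generate_wikisql_augment.py   # generate additional SQL/NL training pairs
python get_data_wikisql.py           # convert them into the WikiSQL-augment JSON file
python merge_jsons.py                # merge Spider + WikiSQL + WikiSQL-augment into the final dataset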

Acknowledgement

The implementation is based on SQLNet. Please cite it too if you use this code.


Issues

Performance on real life applications?

As far as I can see, your work shows results for cross-domain tasks with an accuracy of 27.2% on the test set.

While the results seem impressive in the context of this research work, I wonder how they transfer to real life.
The goal of cross-domain learning is to learn from one domain in order to improve in a target domain.
What I'm looking for is to apply this model to a real-life database, use the cross-domain knowledge, and refine it to perform well on that real-life database.
I'm missing that last part in this work.
To be a successful cross-domain learning model, wouldn't you need to show performance results on an unknown database (test set) such as WikiSQL? Or is your performance on that database also about 27.2%? If so, your performance would be far off from the current leading 83.6%.

What I'm asking is: is there a way to use the syntaxSQL model, leverage the cross-domain knowledge, and refine it for a target database (for example, WikiSQL)?

If this is not possible, I wonder whether this work has any real-life application. An accuracy of 27.2% is unusable for a real-life application. Or am I totally missing something?

Code doesn't have a license file

Hi,
Thanks for sharing your research and the code! Unfortunately, since there's no license file in the repo, the code isn't really usable right now. Could you add a LICENSE file to the repo (also for TypeSQL, since this is based on that)?
Since this seems to be based on TypeSQL, which is based on SQLNet, maybe using the same license as SQLNet would be easiest?
Thanks!

Code is not cpu-compatible

Wanted to let you know that a lot of the model code calls .cuda() before checking whether the gpu parameter is set to False. This is the case in net_utils and most of the model sub-classes.
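
A minimal guard pattern, assuming the boolean gpu flag the code already passes around, would be something like:

import torch

def maybe_cuda(tensor, gpu):
    # Move to GPU only when the gpu flag is set and CUDA is actually available.
    if gpu and torch.cuda.is_available():
        return tensor.cuda()
    return tensor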

Using BERT instead of GloVe

What changes would one have to make to swap out GloVe and use BERT embeddings instead? And do you have plans to do this in the future?

I was wondering whether you could provide the augment data which is similar to train_spider.json

I get 2,198,380 SQL and NL pairs if I run generate_wikisql_augment.py and get_data_wikisql.py, but your paper mentions that only 98,000 of these pairs were used.
I was wondering whether you could provide the augmented data in a format similar to train_spider.json or train_other.json in Spider. I can create the training dataset through preprocess_train_dev_data.py, but it is hard to recover the original JSON dataset from generated_datasets.
Thank you.

Which pretrained models are they?

generated_datasets.zip only contains two model sets: generated_data_augment and generated_data. Which history and table options were they trained with? full and std in both cases?

How to regenerate generated_datasets/generated_data_augment?

README.md states

If you would like to run data augmentation by yourself, first download wikisql_tables.json and train_patterns.json from here, and then run python generate_wikisql_augment.py to generate more training data.

which creates files with names like data_augment/1-23937219-2.txt. However, there does not appear to be any code for reading these files in preprocess_train_dev_data.py so they can be preprocessed and output into generated_data_augment. How can we reproduce this step?

Running your data augmentation on SParC - any recommendations?

Hi!

I'm trying to run your data augmentation scripts on the SParC dataset. The fields are slightly different (interaction, final, etc.), so I'm changing them.

Aside from the field names, do you think there could be any other problems? Or would it work fine in theory?

Do you have any recommendations or a list of things to check when doing this?

Thanks very much for the help!
