
tensor2struct-public's People

Contributors: berlino, tomsherborne

tensor2struct-public's Issues

Unable to pre-train for task sql2nl

According to the readme file, I should be able to run the pretraining step with the following command:
python experiments/sql2nl/run.py train configs/spider/run_config/run_en_spider_pretrain_syn.jsonnet

However, I run into this error:

  File "/tensor2struct-public/tensor2struct/models/spider/spider_enc_bert.py", line 175, in load
    with open(os.path.join(self.data_dir, "relations.json"), "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/spider/spider-0727-google/electra-base-discriminator,other_train-true,syn-r6-m128,sc_link=true,cv_link=true/enc/relations.json'

I've looked for relations.json in the repo but couldn't find it there.
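
My current understanding (an assumption from reading spider_enc_bert.py, not something confirmed by the authors) is that relations.json is written out during preprocessing, so the enc directory has to exist before training starts. A minimal check along these lines, with the path copied from the error above:

import os

# Sanity check (a sketch): verify the preprocessed enc/ directory exists
# before launching training; the path is copied from the error message.
enc_dir = ("data/spider/spider-0727-google/electra-base-discriminator,"
           "other_train-true,syn-r6-m128,sc_link=true,cv_link=true/enc")
if not os.path.exists(os.path.join(enc_dir, "relations.json")):
    print("Preprocessed artifacts missing; run `tensor2struct preprocess` "
          "with this config before training.")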

Spider preprocessing

registry.lookup("model", config["model"]).Preproc, config["model"]
KeyError: 'model'

This happens when preprocessing Spider.

run_ch_spider_bert_baseline.jsonnet doesn't seem to contain a model key either.

Did I miss something?

Thank you
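
In case it helps to narrow this down: the preprocessor calls registry.lookup("model", config["model"]), so whatever the config evaluates to must expose a top-level model key. A quick way to inspect what a config actually evaluates to (a sketch assuming the jsonnet Python binding, imported as _jsonnet; the inline snippet is hypothetical):

import json
import _jsonnet

# Hypothetical inline config; the real run configs are expected to pull in
# a model_config that supplies the "model" field.
snippet = """
{
    model: { name: 'EncDecV2' },
}
"""
config = json.loads(_jsonnet.evaluate_snippet("check", snippet))
print("model" in config)  # False here would reproduce the KeyError above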

Pre-trained models

Hi!

Thank you for open-sourcing this project.

Is it possible to access the pre-trained models?
I would like to evaluate the models on some databases that were not used during training.

Thanks,

A problem encountered while processing Spider data

I have already finished running "./setup" and downloaded Spider, but ran into a problem:

(tensor2struct) fsx@FoolishCow:~/model/tensor2struct$ tensor2struct preprocess configs/spider/run_config/run_en_spider_bert_baseline.jsonnet
Traceback (most recent call last):
  File "/home/fsx/anaconda3/envs/tensor2struct/bin/tensor2struct", line 33, in <module>
    sys.exit(load_entry_point('tensor2struct', 'console_scripts', 'tensor2struct')())
  File "/home/fsx/anaconda3/envs/tensor2struct/bin/tensor2struct", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/home/fsx/anaconda3/envs/tensor2struct/lib/python3.7/site-packages/importlib_metadata/__init__.py", line 954, in distribution
    return Distribution.from_name(distribution_name)
  File "/home/fsx/anaconda3/envs/tensor2struct/lib/python3.7/site-packages/importlib_metadata/__init__.py", line 542, in from_name
    raise PackageNotFoundError(name)
importlib_metadata.PackageNotFoundError: No package metadata was found for tensor2struct
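
This error usually means the tensor2struct package metadata is not registered in the active environment, for instance if "pip install -e ." was never run from the repo root or was run in a different environment. A small diagnostic sketch under that assumption:

import importlib_metadata

# Check whether the active environment can see the package metadata that
# the console script relies on.
try:
    dist = importlib_metadata.distribution("tensor2struct")
    print("tensor2struct found, version:", dist.version)
except importlib_metadata.PackageNotFoundError:
    print("No metadata for tensor2struct in this environment; "
          "try `pip install -e .` from the repo root.")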

Runtime of the base model

Hi,

Thanks for sharing the source code! I'm trying to reproduce the non-BERT results reported in "Meta-Learning for Domain Generalization". Unfortunately, I found that the training speed of the non-BERT model was not as fast as reported in the paper. Below are the config files I use:

run_config:

{
    local exp_id = 0,
    project: "spider_value_wo_bert",
    logdir: "log/spider/value_%d" %exp_id,
    model_config: "configs/spider/model_config/en_spider_value.jsonnet",
    model_config_args: {
        # data 
        use_other_train: true,

        # model
        num_layers: 6,
        sc_link: true,
        cv_link: true,
        loss_type: "softmax", # softmax, label_smooth

        # bert
        opt: "torchAdamw",   # bertAdamw, torchAdamw
        lr_scheduler: "bert_warmup_polynomial_group_v2", # bert_warmup_polynomial_group, bert_warmup_polynomial_group_v2

        # grammar
        include_literals: true,

        # training
        bs: 16,
        att: 0,
        lr: 5e-4,
        clip_grad: 0.3,
        num_batch_accumulated: 1,
        max_steps: 20000,
        save_threshold: 19000,
        use_bert_training: false,
        device: "cuda:0",
    },

    eval_section: "val",
    eval_type: "all", # match, exec, all
    eval_method: "spider_beam_search_with_heuristic",
    eval_output: "ie_dir/spider_value_wo_bert",
    eval_beam_size: 3,
    eval_debug: false,
    eval_name: "run_%d_%s_%s_%d_%d" % [exp_id, self.eval_section, self.eval_method, self.eval_beam_size, self.model_config_args.att],

    local _start_step = $.model_config_args.save_threshold / 1000,
    local _end_step = $.model_config_args.max_steps / 1000,
    eval_steps: [ 1000 * x for x in std.range(_start_step, _end_step)],
}

model_config:

local _data_path = 'data/spider/';
local spider_base = import "spider_base_0512.libsonnet";

function(args, data_path=_data_path) spider_base(args, data_path=_data_path) {
    data: {
        local PREFIX = data_path + "raw/",
        local ts = if $.args.use_other_train then
            ['spider', 'others']
        else
            ['spider'],

        train: {
            name: 'spider', 
            paths: [
              PREFIX + 'train_%s.json' % [s]
              for s in ts],
            tables_paths: [
              PREFIX + 'tables.json',
            ],
            db_path: PREFIX + 'database',
        },
        val: {
            name: 'spider', 
            paths: [PREFIX + 'dev.json'],
            tables_paths: [PREFIX + 'tables.json'],
            db_path: PREFIX + 'database',
        },

    },

}

The model is trained on an NVIDIA GeForce RTX 3090. Are there any problems with the two config files above? Or could you please share the config files you used?
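
For comparison against the reported runtime, here is a minimal throughput probe one could wrap around the training loop (an illustrative sketch, not code from this repo; step_fn stands for whatever executes one optimization step):

import time

def log_throughput(step_fn, num_steps=100):
    # Time a fixed number of training steps and report steps per second.
    start = time.time()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.time() - start
    print(f"{num_steps / elapsed:.2f} steps/sec")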

Thanks.

Segmentation Fault Issue Encountered in Spider Experiment

Hello,

I hope this message finds you well. I am currently engaged in an experiment with the Chinese Spider dataset, specifically training BERT models for CSpider. Unfortunately, I've encountered a segmentation fault issue while executing the following command:

tensor2struct preprocess configs/spider/run_config/run_ch_spider_bert_baseline.jsonnet

Below is the output and error information from the execution:

Using default log base dir /root/tensor2struct-public-main
WARNING <class 'tensor2struct.models.enc_dec.EncDecPreproc'>: superfluous {'name': 'EncDecV2'}
train:   0%| | 0/8659 [00:00<?, ?it/s]
2023-11-06 11:30:12 INFO: Loading these models for language: en (English):
=========================
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| depparse  | ewt       |
| ner       | ontonotes |
=========================

2023-11-06 11:30:12 INFO: Use device: cpu
2023-11-06 11:30:12 INFO: Loading: tokenize
2023-11-06 11:30:12 INFO: Loading: pos
Segmentation fault (core dumped)

I have ensured that memory is not an issue and that the versions of the dependencies comply with the project requirements. I suspect the problem might be related to the version of the Transformers library. Currently, I am using version 3.2.0. When I attempt to use the latest version of Transformers, I immediately encounter a segmentation fault. I have also tried using version 2.11.0, but to no avail.
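
To isolate the crash, one thing worth trying is loading the same stanza pipeline outside tensor2struct; if this minimal script also segfaults, the problem is in the stanza/torch installation rather than in the preprocessing code (a sketch, assuming the English models are already downloaded):

import stanza

# Load the same processors the preprocessing log shows; a segfault here
# would point at the stanza/torch installation, not at tensor2struct.
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse,ner")
doc = nlp("List all singers from France.")
print(doc.sentences[0].words[0])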

This issue has been a persistent challenge for me, and I would greatly appreciate any guidance or assistance you could provide. Thank you in advance for your time and help.

Base model performance

Hi,

I ran the base model with the following command:

python experiments/spider_dg/run.py train configs/spider/run_config/run_en_spider_bert_baseline.jsonnet

After 20K training steps, I got the performance as follows:

Step: 19000 match score, 0.6624758220502901 exe score: 0.6702127659574468
Step: 20000 match score, 0.660541586073501 exe score: 0.6789168278529981

Does this make sense? The number reported in the paper (https://arxiv.org/abs/2104.05827) is around 70 or higher.
Should I train for more steps, say 80K-100K?

Thanks!

A question on running the MLCG code

Hello, I want to run your code to reproduce the results of your paper "Meta-Learning to Compositionally Generalize", but unfortunately I ran into a problem.
Here is what happened: following the instructions in the README, I successfully installed all the packages mentioned in the setup script. But when I tried to follow the preprocessing step for COGS training and ran "tensor2struct preprocess configs/cogs/run_config/run_cogs_comp.jsonnet" on the command line, an error occurred (KeyError: 'cogs_enc'). The full error is:

Using default log base dir /data2/home/zhaoyi/SrcCode for MLCG
WARNING <class 'tensor2struct.models.enc_dec.EncDecPreproc'>: superfluous {'name': 'EncDec'}
Traceback (most recent call last):
  File "/data2/home/zhaoyi/anaconda3/envs/tensor2struct/bin/tensor2struct", line 33, in <module>
    sys.exit(load_entry_point('tensor2struct', 'console_scripts', 'tensor2struct')())
  File "/data2/home/zhaoyi/SrcCode for MLCG/tensor2struct/commands/run.py", line 241, in main
    preprocess.main(preprocess_config)
  File "/data2/home/zhaoyi/SrcCode for MLCG/tensor2struct/commands/preprocess.py", line 47, in main
    preprocessor = Preprocessor(config)
  File "/data2/home/zhaoyi/SrcCode for MLCG/tensor2struct/commands/preprocess.py", line 15, in __init__
    registry.lookup("model", config["model"]).Preproc, config["model"]
  File "/data2/home/zhaoyi/SrcCode for MLCG/tensor2struct/utils/registry.py", line 61, in instantiate
    return callable(**merged)
  File "/data2/home/zhaoyi/SrcCode for MLCG/tensor2struct/models/enc_dec.py", line 36, in __init__
    self.enc_preproc = registry.lookup("encoder", encoder["name"]).Preproc(
  File "/data2/home/zhaoyi/SrcCode for MLCG/tensor2struct/utils/registry.py", line 28, in lookup
    return _REGISTRY[kind][name]
KeyError: 'cogs_enc'

Could you tell me how to solve this? Thanks a lot for your reply!
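
For context on this kind of KeyError, here is a simplified sketch of the registry pattern the traceback goes through (my own illustration, not the repo's actual code): a class is only added to the registry when the module defining it is imported, so a missing 'cogs_enc' usually means the module that registers that encoder was never imported before the lookup.

_REGISTRY = {}

def register(kind, name):
    # Decorator that records a class under (kind, name) at import time.
    def decorator(cls):
        _REGISTRY.setdefault(kind, {})[name] = cls
        return cls
    return decorator

@register("encoder", "cogs_enc")  # runs only when this module is imported
class CogsEncoder:
    pass

# If the registering module was never imported, this lookup raises KeyError.
print(_REGISTRY["encoder"]["cogs_enc"])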

Where is the code for SQL generation by PCFG?

Hi,
Thanks for the source code, but I'm a little confused about where the code for generating SQL with the PCFG from the paper "Learning to Synthesize Data for Semantic Parsing" lives.
The guideline in readme.md mainly focuses on sql2nl.
I'm doing some experiments on generating code whose grammar differs from SQL, and your PCFG method inspired me a lot. Could you explain how we can get new SQL from the PCFG, i.e., where the source code is? Thanks!
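
While waiting for a pointer, here is the general shape of PCFG-based sampling as I understand it (a toy sketch, not the paper's implementation; the grammar below is made up): productions map each nonterminal to weighted right-hand sides, and sampling expands nonterminals recursively until only terminals remain.

import random

# Toy grammar: each nonterminal maps to (right-hand side, weight) pairs.
PRODUCTIONS = {
    "sql": [(["select_clause"], 0.7), (["select_clause", "where_clause"], 0.3)],
    "select_clause": [(["SELECT", "col", "FROM", "tab"], 1.0)],
    "where_clause": [(["WHERE", "col", "=", "val"], 1.0)],
}

def sample(symbol):
    if symbol not in PRODUCTIONS:  # terminal symbol
        return [symbol]
    rhss, weights = zip(*PRODUCTIONS[symbol])
    rhs = random.choices(rhss, weights=weights)[0]
    return [tok for s in rhs for tok in sample(s)]

print(" ".join(sample("sql")))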

Perhaps a bug in the code

Hi, @berlino

In tensor2struct/languages/ast/spider.py, line 253: should "if not infer_from_conditions" be "if infer_from_conditions"?

Looking forward to your reply, thanks.

A question on running the CSpider dataset

Hello, I want to run your code to reproduce the experiment on the CSpider dataset, but unfortunately I ran into a problem.
Here is what happened: following the instructions in the README, I successfully installed all the packages mentioned in the setup script. But when I tried to follow the preprocessing step for CSpider training and ran "tensor2struct preprocess configs/spider/run_config/run_ch_spider_bert_baseline.jsonnet" on the command line, an error occurred (KeyError: 'spider-bert'). The full error is:

Using default log base dir /data/data2/tyf/PycharmProject/tensor2struct-public
WARNING <class 'tensor2struct.models.enc_dec.EncDecPreproc'>: superfluous {'name': 'EncDecV2'}
Traceback (most recent call last):
  File "/data/data2/conda_envs/tensor2struct/bin/tensor2struct", line 33, in <module>
    sys.exit(load_entry_point('tensor2struct', 'console_scripts', 'tensor2struct')())
  File "/data/data2/tyf/PycharmProject/tensor2struct-public/tensor2struct/commands/run.py", line 241, in main
    preprocess.main(preprocess_config)
  File "/data/data2/tyf/PycharmProject/tensor2struct-public/tensor2struct/commands/preprocess.py", line 47, in main
    preprocessor = Preprocessor(config)
  File "/data/data2/tyf/PycharmProject/tensor2struct-public/tensor2struct/commands/preprocess.py", line 15, in __init__
    registry.lookup("model", config["model"]).Preproc, config["model"]
  File "/data/data2/tyf/PycharmProject/tensor2struct-public/tensor2struct/utils/registry.py", line 61, in instantiate
    return callable(**merged)
  File "/data/data2/tyf/PycharmProject/tensor2struct-public/tensor2struct/models/enc_dec.py", line 36, in __init__
    self.enc_preproc = registry.lookup("encoder", encoder["name"]).Preproc(
  File "/data/data2/tyf/PycharmProject/tensor2struct-public/tensor2struct/utils/registry.py", line 28, in lookup
    return _REGISTRY[kind][name]
KeyError: 'spider-bert'

Could you tell me how to solve this? Thanks a lot for your reply!

Problem with synthetic data having JOIN

Hi there,

Thanks for your great work.

I was looking at synthetic data (https://github.com/berlino/tensor2struct-public/tree/main/experiments/sql2nl/data-spider-with-ssp-synthetic) and found that:

For utterances with JOIN, e.g.,
Original: SELECT * FROM Courses JOIN Student_Course_Registrations ON course_id = course_id

The join columns should be qualified, e.g. with table aliases:
Expected:
SELECT * FROM Courses as T1 JOIN Student_Course_Registrations as T2 ON T1.course_id = T2.course_id
or
SELECT * FROM Courses JOIN Student_Course_Registrations ON Courses.course_id = Student_Course_Registrations.course_id

Otherwise, the Spider package cannot parse the original SQL query.
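
To make the ambiguity concrete, the unqualified form is rejected even by SQLite (standard library sqlite3; the schema below is a minimal reduction of the example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Courses (course_id INTEGER);
CREATE TABLE Student_Course_Registrations (course_id INTEGER);
""")
try:
    conn.execute("SELECT * FROM Courses JOIN Student_Course_Registrations "
                 "ON course_id = course_id")
except sqlite3.OperationalError as e:
    print(e)  # ambiguous column name: course_id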

By the way, it would be great if you could also provide the code for the synthetic data generator so that people can run it themselves.

Thanks!

cannot import name 'data_scheduler' from 'experiments.sql2nl'

  File "tensor2struct-public/experiments/sql2nl/__init__.py", line 1, in <module>
    from experiments.sql2nl import data_scheduler
ImportError: cannot import name 'data_scheduler' from 'experiments.sql2nl'
I want to run the code for "Learning from Executions for Semantic Parsing" and ran "tensor2struct preprocess configs/overnight/run_config/run_overnight_supervised.jsonnet", but it failed with the error above.
Could you please tell me how to solve it?
Thanks!

Fitting a general PCFG for all the DBs

Hi @berlino ,

# could also use a general pcfg for all dbs

Do you have any suggestions for modifying this code to fit a general PCFG over all the DBs?
I tried simply recording productions for all the trees, but during sampling, I think the current code has no way to differentiate col-i of one schema from col-i of another schema.
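
One workaround I am considering (my own idea, not something from the repo; the col-/tab- prefixes are assumptions about how schema symbols are named) is to namespace schema-specific terminals by db_id when recording productions, so grammar-level statistics stay pooled across databases while schema symbols stay distinct:

from collections import Counter, defaultdict

production_counts = defaultdict(Counter)

def record_production(db_id, lhs, rhs_symbols):
    # Schema-specific symbols (columns/tables) get a db_id namespace;
    # purely grammatical symbols stay shared across databases.
    rhs = tuple(
        f"{db_id}::{s}" if s.startswith(("col-", "tab-")) else s
        for s in rhs_symbols
    )
    production_counts[lhs][rhs] += 1

record_production("concert_singer", "select_clause",
                  ["SELECT", "col-3", "FROM", "tab-1"])
print(dict(production_counts))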

Any suggestions/pointers would be greatly appreciated.
Thanks for open-sourcing this code! :)
