microsoft / tuta_table_understanding

TUTA and ForTaP for Structure-Aware and Numerical-Reasoning-Aware Table Pre-Training

License: MIT License

Python 99.71% Shell 0.29%

tuta_table_understanding's Introduction

Table Understanding

This is the official repository of TUTA and ForTaP.

TUTA is a unified pretrained model for understanding generally structured tables.

Based on TUTA, ForTaP further endows the model with stronger numerical-reasoning skills by pretraining on spreadsheet formulas.

🍻 News

2022-03-22: We released ForTaP code.
2022-03-08: ForTaP was accepted by ACL 2022. The ForTaP paper is available as arXiv:2109.07323.
2022-01-09: We updated cell type classification code for TUTA.
2021-10-29: We released TUTA code.
2021-09-02: We released HiTab, a large dataset for question answering and data-to-text generation over complex hierarchical tables.
2021-08-17: TUTA was accepted by KDD 2021.
2020-10-21: We released our TUTA paper on arXiv.

Code and Usages

Detailed implementation and usage instructions for the pretrained models are in their respective folders:

TUTA
ForTaP

Citation

If you find TUTA and ForTaP useful in your research, please consider citing the following papers:

@inproceedings{wang2021tuta,
  title={TUTA: Tree-based Transformers for Generally Structured Table Pre-training},
  author={Wang, Zhiruo and Dong, Haoyu and Jia, Ran and Li, Jia and Fu, Zhiyi and Han, Shi and Zhang, Dongmei},
  booktitle={Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery \& Data Mining},
  pages={1780--1790},
  year={2021}
}

@article{cheng2021fortap,
  title={FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining},
  author={Cheng, Zhoujun and Dong, Haoyu and Cheng, Fan and Jia, Ran and Wu, Pengfei and Han, Shi and Zhang, Dongmei},
  journal={arXiv preprint arXiv:2109.07323},
  year={2021}
}

Contact

If you have any problems with the papers or the code, please feel free to submit an issue in this repository. You can also reach us by email.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

tuta_table_understanding's Issues

The paper says the "pre-trained model will be publicly available". How can I get it?

Hey all,
I'd like to experiment with the pre-trained models.
Where can I get them?

Thank you 👍

WCC dataset

Thanks for your great work.
May I ask where I can find the WCC dataset for the table type classification task?
Thanks!

The WebSheet dataset

It seems that the WebSheet dataset used for Cell Type Classification cannot be accessed. Could you also share the dataset?

How to convert WikiTables dataset to your JSON format?

I've downloaded the dataset but it needs some pre-processing to get it to your format, as in the sample you provide in the repo.
Do you have the scripts for this process?

How to get labels for CTC task?

Hi,
I've been trying to run ctc_finetune on JSON files other than the deex.json presented in your README, and I am wondering how to obtain the labels for this task. In deex.json, label_matrix is already part of the file, so I cannot reproduce it for other JSON files, including those in your repository such as wiki-tables-samples.json.
I would be very grateful for any advice :)

No requirements.txt file

Can you please upload the pip requirements of the project? Thanks

How to generate hierarchy files for Excel sheets with those unique keys?

I'd like to run a simple inference on a few Excel sheets of mine.
I have noticed that the JSON files hold some keys that are not present in the data originating from Wikipedia:

T: cell text 
V: cell value
NS: number string
DT: data type (internally stored in spreadsheets, text=0,number=1,data_time=2,percentage=3,currency=4,others=5)
HF: if has formula
A1: formula string with A1 form (absolute cell reference)
R1: formula string with R1C1 form (relative cell reference)
LB: if has left border
TB: if has top border
BB: if has bottom border
RB: if has right border
BC: if has non-white background color
FC: if has non-black font color
FB: if has font bold
I: if has font italic
HA: horizontal alignment (center=0, center_across_selection=1,distributed=2,fill=3,general=4,justify=5,left=6,right=7)
VA: vertical alignment (top=0,center=1,bottom=2,justify=3,distributed=4)
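
For readers working with these keys, the integer codes listed above can be decoded as in the minimal sketch below. The enum names and the `describe_cell` helper are hypothetical and not part of the repository, and the example cell is invented for illustration (in particular, the value types chosen for V and NS are assumptions). Only the key names and code-to-label mappings come from the list above.

```python
from enum import IntEnum


class DataType(IntEnum):
    # Codes for the DT key, as listed above.
    TEXT = 0
    NUMBER = 1
    DATA_TIME = 2
    PERCENTAGE = 3
    CURRENCY = 4
    OTHERS = 5


class HorizontalAlignment(IntEnum):
    # Codes for the HA key, as listed above.
    CENTER = 0
    CENTER_ACROSS_SELECTION = 1
    DISTRIBUTED = 2
    FILL = 3
    GENERAL = 4
    JUSTIFY = 5
    LEFT = 6
    RIGHT = 7


class VerticalAlignment(IntEnum):
    # Codes for the VA key, as listed above.
    TOP = 0
    CENTER = 1
    BOTTOM = 2
    JUSTIFY = 3
    DISTRIBUTED = 4


def describe_cell(cell: dict) -> dict:
    """Hypothetical helper: turn a few coded fields into readable labels."""
    return {
        "text": cell["T"],
        "data_type": DataType(cell["DT"]).name.lower(),
        "has_formula": bool(cell["HF"]),
        "bold": bool(cell["FB"]),
        "horizontal_alignment": HorizontalAlignment(cell["HA"]).name.lower(),
        "vertical_alignment": VerticalAlignment(cell["VA"]).name.lower(),
    }


# Invented example cell; the real files may store these fields differently.
example_cell = {
    "T": "12.5%", "V": 0.125, "NS": "12.5", "DT": 3, "HF": 0,
    "LB": 0, "TB": 0, "BB": 1, "RB": 0, "BC": 0, "FC": 0,
    "FB": 1, "I": 0, "HA": 7, "VA": 2,
}

description = describe_cell(example_cell)
```

An unknown code raises `ValueError` when passed to the enum, which makes unexpected values in a converted file easy to spot.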

@HaoAreYuDong, I couldn't find the script that generates those annotations/keys for the corresponding Excel files.
Could you please share it?

Thank you very much,
I appreciate the assistance!

Sample spreadsheet does not work during execution - Why?

Hi, I am trying to pretrain a model based on your spreadsheet sample. However, the code rejects the sample for some reason.
The same happens for some manual spreadsheet samples that I've created.

I believe this happens due to lines 681-682 in tokenizer.py:

if (max(top_pos_list[icell]) == -1) or (max(left_pos_list[icell]) == -1):
    return None

However, I cannot fully understand what these conditions check in those lists.
Can you please explain why your given spreadsheet sample is not working?
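
As a reading aid, the quoted condition can be reproduced in isolation. This is a minimal sketch under assumptions: it treats -1 as a "position not set" sentinel and assumes no entry is smaller than -1, neither of which is confirmed by the source. The function name and the sample lists are hypothetical. It shows only what the expression evaluates to, not why the tokenizer rejects a particular sample.

```python
def is_rejected(top_pos_list, left_pos_list, icell):
    """Mirror of the quoted check: True when either list has no entry above -1."""
    return (max(top_pos_list[icell]) == -1) or (max(left_pos_list[icell]) == -1)


# Hypothetical per-cell position lists.
top_pos_list = [[0, 1, -1], [-1, -1, -1]]
left_pos_list = [[2, -1, -1], [0, -1, -1]]

first_cell_rejected = is_rejected(top_pos_list, left_pos_list, 0)
second_cell_rejected = is_rejected(top_pos_list, left_pos_list, 1)
```

Under these assumptions the first cell passes because both of its lists contain at least one value above -1, while the second cell is rejected because every entry in its top list is -1. In the quoted code, one such cell is enough for the function to return None.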

The link to A preprocessed dataset of DeEx is not working

Hi @HaoAreYuDong, the link to the preprocessed DeEx dataset is not working. Could you provide a new link?

The link to preprocessed dataset of DeEx in tuta is broken

The link I refer to is as follows:
https://drive.google.com/file/d/1xJkq2DQciWvndhgm0aHZXMqzIWSan9z9/view?usp=sharing

TTC downstream task

Hi there,
do you plan to release code (and data, or example data) for the table type classification downstream task?
Thanks!

How to convert spreadsheets to your JSON format?

Hi and thanks for uploading your code repo.

How can someone preprocess their spreadsheet and generate a JSON for it according to your format?
https://github.com/microsoft/TUTA_table_understanding/blob/main/data/pretrain/spreadsheet/spreadsheet-sample.json

Script to convert formula prediction dataset to pretokenized fortap version

I want to convert these files enron_{train/dev/test}.pt to enron_{train/test}_fortap_input.pt. Is there a script to do so?

How to convert WDC dataset to your JSON format?

I've downloaded the dataset but it needs some pre-processing to get it to your JSON format, as in the sample you provide in the repo.
Do you have the scripts for this process?

How to convert DeEx dataset to your JSON format?

Could you give me the three datasets WebSheet, SAUS, and CIUS for "Cell Type Classification"?

@HaoAreYuDong

Could you give me the three datasets WebSheet, SAUS, and CIUS for "Cell Type Classification"?

I would like to have the data after conversion to TUTA input format.

Could you please tell me the command line arguments when you fine-tune with the TUTA-implicit model?

"--hidden_size", type=int, default=768
"--intermediate_size", type=int, default=3072
"--magnitude_size", type=int, default=10
"--precision_size", type=int, default=10
"--top_digit_size", type=int, default=10
"--low_digit_size", type=int, default=10
"--max_cell_length", type=int, default=16
"--row_size", type=int, default=2560
"--column_size", type=int, default=2560
"--tree_depth", type=int, default=4
"--node_degree", type=str, default="32,32,64,256"
"--attention_distance", type=int, default=2
"--attention_step", type=int, default=0
"--num_attention_heads", type=int, default=12
"--num_encoder_layers", type=int, default=12
"--hidden_dropout_prob", type=int, default=0.1
"--attention_dropout_prob", type=int, default=0.1
"--layer_norm_eps", type=float, default=1e-6
"--hidden_act", type=str, default="gelu"
"--learning_rate", type=float, default=8e-6

"--max_seq_len", type=int, default=512
"--max_cell_num", type=int, default=256
"--text_threshold", type=float, default=0.5
"--value_threshold", type=float, default=0.1
"--clc_rate", type=float, default=0.3
"--wcm_rate", type=float, default=0.3
"--add_separate", type=bool, default=True
"--num_ctc_type", type=int, default=6

"--attn_method", type=str, default="add", choices=["max", "add"]
"--hier_or_flat", type=str, default="both", choices=["hier", "flat", "both"]
"--org_or_weigh", type=str, default="original", choices=["original", "weighted"]
"--num_format_feature", type=int, default=11
"--sep_or_tok", type=int, default=0, choices=[0, 1]
"--sep_weight", type=float, default=0.0
"--aggregator", type=str, default="sum", choices=["sum", "avg"]

"--target", type=str, default="tuta"

"--batch_size", type=int, default=2
"--report_steps", type=int, default=200
"--epochs_num", type=int, default=40
"--dataset_num", type=int, default=1
"--early_stopping_bound", type=int, default=100

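The defaults quoted above can be overridden on the command line. The sketch below rebuilds a small subset of them with argparse to show how; it does not answer which values were used for TUTA-implicit. Three things here are assumptions rather than repository code: the dropout options are declared as `float` (argparse applies `type` only to command-line strings, so `type=int` leaves a default of 0.1 untouched but rejects an input such as `0.2`), the boolean option uses a hypothetical `str2bool` converter (with `type=bool`, any non-empty string, including "False", parses as True), and `node_degree` is split on commas into integers.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical converter so that 'false' on the command line means False."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Subset of the options quoted above, with the same default values.
    parser = argparse.ArgumentParser(description="Sketch of a fine-tuning argument subset")
    parser.add_argument("--hidden_size", type=int, default=768)
    parser.add_argument("--hidden_dropout_prob", type=float, default=0.1)
    parser.add_argument("--attention_dropout_prob", type=float, default=0.1)
    parser.add_argument("--node_degree", type=str, default="32,32,64,256")
    parser.add_argument("--add_separate", type=str2bool, default=True)
    parser.add_argument("--attn_method", type=str, default="add", choices=["max", "add"])
    parser.add_argument("--target", type=str, default="tuta")
    parser.add_argument("--batch_size", type=int, default=2)
    return parser


# Parse with no overrides to obtain the defaults.
args = build_parser().parse_args([])

# Assumed interpretation: one degree per tree level, comma-separated.
node_degree = [int(d) for d in args.node_degree.split(",")]

# Example override, equivalent to passing these flags on the command line.
custom = build_parser().parse_args(["--hidden_dropout_prob", "0.2", "--add_separate", "false"])
```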