tuta_table_understanding's Introduction

Table Understanding

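This is the official repository of:

  • TUTA: Tree-based Transformers for Generally Structured Table Pre-training

  • FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining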

TUTA is a unified pretrained model for understanding generally structured tables.

Based on TUTA, ForTaP further endows the model with stronger numerical-reasoning skills by pretraining on spreadsheet formulas.

๐Ÿป News

  • 2022-03-22: We released the ForTaP code.

  • 2022-03-08: ForTaP was accepted by ACL 2022. You may find the ForTaP paper here.

  • 2022-01-09: We updated the cell type classification code for TUTA.

  • 2021-10-29: We released the TUTA code.

  • 2021-09-02: We released HiTab, a large dataset on question answering and data-to-text over complex hierarchical tables.

  • 2021-08-17: TUTA was accepted by KDD 2021.

  • 2020-10-21: We released our TUTA paper on arXiv.

Code and Usage

Detailed implementations and usage instructions for the pretrained models are given in their respective folders.

Citation

If you find TUTA and ForTaP useful in your research, please consider citing the following papers:

@inproceedings{wang2021tuta,
  title={TUTA: Tree-based Transformers for Generally Structured Table Pre-training},
  author={Wang, Zhiruo and Dong, Haoyu and Jia, Ran and Li, Jia and Fu, Zhiyi and Han, Shi and Zhang, Dongmei},
  booktitle={Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery \& Data Mining},
  pages={1780--1790},
  year={2021}
}
@article{cheng2021fortap,
  title={FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining},
  author={Cheng, Zhoujun and Dong, Haoyu and Cheng, Fan and Jia, Ran and Wu, Pengfei and Han, Shi and Zhang, Dongmei},
  journal={arXiv preprint arXiv:2109.07323},
  year={2021}
}

Contact

If you have any problems regarding the paper or code, please feel free to submit an issue in this repository, or reach us by email.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.


tuta_table_understanding's Issues

WCC dataset

Thanks for your great work.
May I ask where I can find the WCC dataset for the table type classification task?
Thanks!

The WebSheet dataset

It seems that the WebSheet dataset used for Cell Type Classification cannot be accessed. Could you also share the dataset?

How to get labels for CTC task?

Hi,
I've been trying to run ctc_finetune on JSON files other than the deex.json presented in your Readme, and I am wondering how to get the labels for this task. In deex.json, label_matrix is already part of the file, so I cannot reproduce it for other JSON files, including those in your repository such as wiki-tables-samples.json.
I would be very grateful for any advice :)

How to generate hierarchy files for Excel sheets with those unique keys?

I'd like to run a simple inference on a few Excel sheets of mine.
I have noticed that the JSON files hold some keys which are not present in the data originating from wiki:

T: cell text 
V: cell value
NS: number string
DT: data type (internally stored in spreadsheets, text=0,number=1,data_time=2,percentage=3,currency=4,others=5)
HF: if has formula
A1: formula string with A1 form (absolute cell reference)
R1: formula string with R1C1 form (relative cell reference)
LB: if has left border
TB: if has top border
BB: if has bottom border
RB: if has right border
BC: if has non-white background color
FC: if has non-black font color
FB: if has font bold
I: if has font italic
HA: horizontal alignment (center=0, center_across_selection=1,distributed=2,fill=3,general=4,justify=5,left=6,right=7)
VA: vertical alignment (top=0,center=1,bottom=2,justify=3,distributed=4)

@HaoAreYuDong, I couldn't find the script that generates those annotations/keys for the corresponding Excel files.
Can you please share it?

Thank you very much,
I appreciate the assistance!
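
For readers trying to build such files by hand, the key list in this issue can be read as one record per cell. The following is only a minimal sketch: it assumes each cell is a flat JSON object keyed by these abbreviations and that boolean features are stored as 0/1, neither of which is confirmed by the repository. The helper `make_cell`, its parameters, and the sample values are hypothetical; only the key names and integer codes come from the list above.

```python
import json
from enum import IntEnum


class DataType(IntEnum):
    """DT codes as listed in the issue."""
    TEXT = 0
    NUMBER = 1
    DATA_TIME = 2  # spelled "data_time" in the issue
    PERCENTAGE = 3
    CURRENCY = 4
    OTHERS = 5


class HorizontalAlignment(IntEnum):
    """HA codes as listed in the issue."""
    CENTER = 0
    CENTER_ACROSS_SELECTION = 1
    DISTRIBUTED = 2
    FILL = 3
    GENERAL = 4
    JUSTIFY = 5
    LEFT = 6
    RIGHT = 7


class VerticalAlignment(IntEnum):
    """VA codes as listed in the issue."""
    TOP = 0
    CENTER = 1
    BOTTOM = 2
    JUSTIFY = 3
    DISTRIBUTED = 4


def make_cell(text, value, number_string, data_type, *,
              formula_a1=None, formula_r1c1=None, bold=False, italic=False,
              h_align=HorizontalAlignment.GENERAL,
              v_align=VerticalAlignment.BOTTOM):
    """Hypothetical helper: build one cell record with the 17 keys above.

    Assumption: flags are 0/1 integers and a missing formula is an empty string.
    Border and color flags are left at 0 to keep the sketch short.
    """
    return {
        "T": text, "V": value, "NS": number_string, "DT": int(data_type),
        "HF": int(formula_a1 is not None),
        "A1": formula_a1 or "", "R1": formula_r1c1 or "",
        "LB": 0, "TB": 0, "BB": 0, "RB": 0,
        "BC": 0, "FC": 0,
        "FB": int(bold), "I": int(italic),
        "HA": int(h_align), "VA": int(v_align),
    }


cell = make_cell("12%", "0.12", "0.12", DataType.PERCENTAGE, bold=True)
print(json.dumps(cell))
```

Filling these fields from a real workbook would additionally need a spreadsheet reader; that step is not shown because the authors' extraction script is exactly what this issue asks for.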

Sample spreadsheet does not work during execution - Why?

Hi, I am trying to pretrain a model based on your spreadsheet sample. However, the code rejects the sample for some reason.
The same happens for some manual spreadsheet samples that I've created.

I believe this happens due to lines 681-682 in tokenizer.py:

if (max(top_pos_list[icell]) == -1) or (max(left_pos_list[icell]) == -1):
    return None

But I cannot fully understand what these conditions check in these lists.
Can you please explain why your given spreadsheet sample is not working?
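
As a reading aid for the quoted condition: `max(xs) == -1` holds when no entry of `xs` is greater than -1, so if -1 is the smallest value ever stored, it means every entry is -1. The sketch below restates the check in inverted form. It assumes, without confirmation from the repository, that each list holds integer position coordinates and that -1 marks an unset entry; the function name `has_usable_position` and the sample lists are hypothetical.

```python
def has_usable_position(top_pos, left_pos):
    """Inverse of the quoted condition: True when the cell would not be rejected.

    Assumption: -1 is a sentinel for "no position assigned" and is the
    smallest value that appears in either list.
    """
    return max(top_pos) != -1 and max(left_pos) != -1


# A cell whose top positions are all -1 trips the quoted `return None`.
print(has_usable_position([-1, -1, -1], [0, 2, -1]))
# A cell with at least one non-negative entry in both lists passes.
print(has_usable_position([1, -1, -1], [0, 2, -1]))
```

Under that reading, one cell with no assigned position on either the top or the left side is enough for the quoted code to return None, which may be why a whole sample is dropped.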

TTC downstream task

Hi there,
do you plan to release code (and data, or example data) for the table type classification downstream task?
Thanks!

Could you give me the three datasets WebSheet, SAUS, and CIUS for "Cell Type Classification"?

@HaoAreYuDong

Could you give me the three datasets WebSheet, SAUS, and CIUS for "Cell Type Classification"?

I would like to have the data after conversion to TUTA input format.

Could you please tell me the command line arguments when you fine-tune with the TUTA-implicit model?

"--hidden_size", type=int, default=768
"--intermediate_size", type=int, default=3072
"--magnitude_size", type=int, default=10
"--precision_size", type=int, default=10
"--top_digit_size", type=int, default=10
"--low_digit_size", type=int, default=10
"--max_cell_length", type=int, default=16
"--row_size", type=int, default=2560
"--column_size", type=int, default=2560
"--tree_depth", type=int, default=4
"--node_degree", type=str, default="32,32,64,256"
"--attention_distance", type=int, default=2
"--attention_step", type=int, default=0
"--num_attention_heads", type=int, default=12
"--num_encoder_layers", type=int, default=12
"--hidden_dropout_prob", type=int, default=0.1
"--attention_dropout_prob", type=int, default=0.1
"--layer_norm_eps", type=float, default=1e-6
"--hidden_act", type=str, default="gelu"
"--learning_rate", type=float, default=8e-6

"--max_seq_len", type=int, default=512
"--max_cell_num", type=int, default=256
"--text_threshold", type=float, default=0.5
"--value_threshold", type=float, default=0.1
"--clc_rate", type=float, default=0.3
"--wcm_rate", type=float, default=0.3
"--add_separate", type=bool, default=True
"--num_ctc_type", type=int, default=6

"--attn_method", type=str, default="add", choices=["max", "add"]
"--hier_or_flat", type=str, default="both", choices=["hier", "flat", "both"]
"--org_or_weigh", type=str, default="original", choices=["original", "weighted"]
"--num_format_feature", type=int, default=11
"--sep_or_tok", type=int, default=0, choices=[0, 1]
"--sep_weight", type=float, default=0.0
"--aggregator", type=str, default="sum", choices=["sum", "avg"]

"--target", type=str, default="tuta"

"--batch_size", type=int, default=2
"--report_steps", type=int, default=200
"--epochs_num", type=int, default=40
"--dataset_num", type=int, default=1
"--early_stopping_bound", type=int, default=100
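
Two details in the listing above are worth checking before reusing it, because they depend on how `argparse` treats `type`. The sketch below rebuilds three of the listed arguments in a standalone parser (the function `build_parser` and the variable names are hypothetical, and splitting `node_degree` on commas is an assumption about how the value is consumed). It shows that a non-string default such as 0.1 is kept as-is even with `type=int`, that the same `type=int` rejects a fractional value passed on the command line, and that `type=bool` turns any non-empty string, including "False", into True.

```python
import argparse


def build_parser(dropout_type):
    """Hypothetical parser with three arguments copied from the listing."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hidden_dropout_prob", type=dropout_type, default=0.1)
    parser.add_argument("--node_degree", type=str, default="32,32,64,256")
    parser.add_argument("--add_separate", type=bool, default=True)
    return parser


# With type=int as listed, the float default is not converted.
as_listed = build_parser(int).parse_args([])
print(as_listed.hidden_dropout_prob)

# Passing a fractional value with type=int makes argparse exit with an error.
try:
    build_parser(int).parse_args(["--hidden_dropout_prob", "0.2"])
    int_rejected = False
except SystemExit:
    int_rejected = True

# With type=float the same command-line value parses normally.
with_float = build_parser(float).parse_args(["--hidden_dropout_prob", "0.2"])

# bool("False") is True, so this flag cannot be switched off this way.
flag = build_parser(float).parse_args(["--add_separate", "False"])

# Assumption: the comma-separated string is split into one integer per entry.
degrees = [int(d) for d in as_listed.node_degree.split(",")]
print(int_rejected, with_float.hidden_dropout_prob, flag.add_separate, degrees)
```

So the defaults in the listing work as long as the dropout and boolean options are left untouched; overriding them from the command line may need `type=float` or a custom string-to-bool converter.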
