
TabLLM: Few-shot Classification of Tabular Data with Large Language Models

[Figure: Overview]

This repository contains the code to reproduce the results of the paper TabLLM: Few-shot Classification of Tabular Data with Large Language Models by Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag.

Update 10/30/2023: Added Additional Instructions to the Readme

Since several issues were raised regarding the code, we decided to add some additional instructions to the readme. We now provide all steps to reproduce an entry of our final results table. Reproducing the remaining results mainly consists of changing the experimental parameters. Thanks to everyone who provided feedback!

Overview

Reproducing the main results consists of three steps:

  1. Creating textual serializations of the nine public tabular datasets
  2. Training and evaluating TabLLM (using code from the t-few project) on the serialized datasets
  3. Running the baseline models on the tabular datasets

We did not include the code to serialize and evaluate the private healthcare dataset due to privacy concerns. Also, code for some additional experiments is not included. Feel free to contact us if you have any questions concerning these experiments.

Setting the Correct Paths

TabLLM and the t-few project use the path /root/<project> by default, and for this readme we assume that you cloned both repositories to this location, i.e., /root/TabLLM for TabLLM and /root/t-few for t-few. You will most likely have to adapt these paths to your own setup; the easiest way is to replace all occurrences of /root with your own path. If you get an error running the code, first ensure that all paths are set correctly.

Preparing the Environments

We used conda to create the necessary virtual environments. For the TabLLM environment, we used Python 3.8:

conda create -n tabllm python==3.8
conda activate tabllm

Next, install the necessary requirements for TabLLM:

conda install numpy scipy pandas scikit-learn
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install datasets transformers sentencepiece protobuf xgboost lightgbm tabpfn
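
To quickly check that the environment works, you can verify that the core packages import and that the GPU is visible, e.g. (an optional sanity check, not part of the original setup):

python -c "import torch, datasets, transformers, xgboost; print(torch.__version__, torch.cuda.is_available())"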

If you want to run training and inference for TabLLM, you also have to set up the environment for t-few. You can follow their readme to do so; however, we had some dependency issues when following their instructions. Here are the commands that worked for us (taken and adapted from their instructions):

conda create -n tfew python==3.7
conda activate tfew
pip install fsspec==2021.05.0
pip install --use-deprecated=legacy-resolver -r requirements.txt -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install urllib3==1.26.6
pip install importlib-metadata==4.13.0
pip install scikit-learn

To ensure that the t-few project is set up correctly, you can run the command given in their repository:

export HF_HOME=~/.cache/huggingface
CUDA_VISIBLE_DEVICES=0 python -m src.pl_train -c t03b.json+rte.json -k save_model=False exp_name=first_exp

The result of the experiment should be stored in /root/t-few/exp_out/first_exp.

1. Creating Serialized Datasets

To create a textual serialization for one of the tabular datasets, execute the following script with optional arguments for a specific serialization type. This will create a folder with a huggingface dataset in datasets_serialized:

create_external_datasets.py --dataset (car|income|diabetes|heart|bank|blood|calhousing|creditg|jungle) (--list) (--list (--tabletotext|--t0serialization|--values|--permuted|--shuffled))

Here, the parentheses and pipes denote alternatives and optional flags; they are not literal shell syntax. For example, to serialize the heart dataset with the default settings, run:

python create_external_datasets.py --dataset heart

For the Text GPT serialization, we used a script querying the GPT-3 API with a row entry encoded as a list and the prompts given in the paper.
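
The querying script itself is not included. As a rough illustration only, a single query could look like the following (a minimal sketch assuming the legacy openai<1.0 Completion interface; the model name, prompt wording, and row encoding here are placeholders, the actual prompts are given in the paper):

import openai  # assumes openai < 1.0 (legacy Completion interface)

openai.api_key = "..."  # your API key

# Placeholder row encoding ("- column: value" list) and placeholder prompt;
# the actual prompts are given in the paper.
row_as_list = "- age: 63\n- chest pain type: typical angina\n- cholesterol: 233"
prompt = f"Rewrite the following attributes as natural text:\n{row_as_list}\n"

response = openai.Completion.create(
    model="text-davinci-002",  # placeholder GPT-3 model name
    prompt=prompt,
    max_tokens=256,
    temperature=0,
)
print(response["choices"][0]["text"].strip())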

We provide the Text serializations in datasets_serialized. The other serializations are omitted here due to size constraints. The Text serialization achieved the best results in our experiments.
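
For reference, the Text serialization verbalizes each column as a short sentence. A minimal sketch of the format (not the repository's code; the row below is made up):

# Sketch of the Text serialization ("The <column> is <value>."); illustrative only.
row = {"age": 63, "sex": "male", "chest pain type": "typical angina"}
note = " ".join(f"The {column} is {value}." for column, value in row.items())
print(note)  # -> The age is 63. The sex is male. The chest pain type is typical angina.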

2. Train and Evaluate TabLLM on Serialized Datasets

We used the codebase of the t-few project for our experiments and made some small modifications to their code to enable experiments with our custom datasets and templates. All changed files are included in the t-few folder of this repository; you have to copy them over:

cp /root/TabLLM/t-few/bin/few-shot-pretrained-100k.sh /root/t-few/bin/
cp /root/TabLLM/t-few/configs/* /root/t-few/configs/
cp /root/TabLLM/t-few/src/models/EncoderDecoder.py /root/t-few/src/models/
cp /root/TabLLM/t-few/src/data/* /root/t-few/src/data/
cp /root/TabLLM/t-few/src/scripts/get_result_table.py /root/t-few/src/scripts/

Please check that you also set the paths correctly for the t-few project. In particular, check /root/t-few/src/data/dataset_readers.py to ensure that DATASETS_OFFLINE in line 75 points to /root/TabLLM/datasets_serialized and that yaml_dict = yaml.load(open(...)) in line 233 uses the path /root/TabLLM/templates/templates_.

The script /root/t-few/bin/few-shot-pretrained-100k.sh runs all our TabLLM experiments for the different serializations and stores the results in /root/t-few/exp_out. To run the 4-shot heart experiment with the Text serialization using the T0-3B model, set the for-loops over the different experimental settings in /root/t-few/bin/few-shot-pretrained-100k.sh to:

for model in 't03b'
do
  [...]
  for num_shot in 4
  do
    [...]
    for dataset in heart 
    do
      [...]
      for seed in 42 1024 0 1 32  # Keep this for-loop as it is
      do
        [...]
      done
    done
  done
done

Then, you can run the specified setup from the t-few folder /root/t-few via:

./bin/few-shot-pretrained-100k.sh

The result of the experiment should be stored in /root/t-few/exp_out/t03b_heart_numshot4_seed*. Note that we do not use a validation set; hence, in the code the test data is treated as the validation (=pred) set. As a consequence, you can find the test performance for seed 42 in /root/t-few/exp_out/t03b_heart_numshot4_seed42_ia3_pretrained100k/dev_scores.json:

cat /root/t-few/exp_out/t03b_heart_numshot4_seed42_ia3_pretrained100k/dev_scores.json
{"AUC": 0.617825311942959, "PR": 0.6409831261754565, "micro_f1": 0.5869565217391305, "macro_f1": 0.5511042629686697, "accuracy": 0.5869565217391305, "num": 184, "num_steps": -1, "score_gt": 0.8486858865489131, "score_cand": 0.9136485224184783}

To collect the results of several runs, we slightly changed the /root/t-few/src/scripts/get_result_table.py script to report the mean AUC and standard deviation. For the above example, the script is used as follows:

python /root/t-few/src/scripts/get_result_table.py -e t03b* -d heart
================================================================================
Find 5 experiments fit into t03b*
heart: 67.65 (12.87)
Save result to exp_out/summary.csv
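
If you want to aggregate results manually instead, the following stand-in computes the mean and standard deviation of the AUC over the seed directories (a minimal sketch assuming the exp_out layout above; the standard-deviation convention may differ slightly from the script's):

import glob
import json
import statistics

# Collect the AUC from every seed run of the 4-shot heart experiment.
aucs = []
for path in glob.glob("/root/t-few/exp_out/t03b_heart_numshot4_seed*/dev_scores.json"):
    with open(path) as f:
        aucs.append(100 * json.load(f)["AUC"])

# statistics.stdev is the sample standard deviation (n - 1 denominator).
print(f"heart: {statistics.mean(aucs):.2f} ({statistics.stdev(aucs):.2f})")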

This result corresponds to the entry "TabLLM (T0 3B + Text Template)" for the heart dataset with 4 training examples (shots) on page 21 of our paper. To obtain the other results, adapt /root/t-few/bin/few-shot-pretrained-100k.sh accordingly. For more information, please consult the original t-few repository or raise an issue.

3. Running the Baseline Models

We tested TabLLM against several baselines. They use the standard non-serialized datasets. The hyperparameter ranges are given in the paper. You can specify the baseline models and datasets that you want to run in the code. To run a baseline model, execute:

evaluate_external_datasets.py
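
To illustrate what such a baseline run amounts to, here is a minimal sketch (not the repository's evaluate_external_datasets.py; the OpenML loading, the positive-class encoding, and the absence of a hyperparameter search are simplifying assumptions):

# Minimal few-shot baseline sketch on one of the nine datasets (blood),
# fetched from OpenML; NOT the repository's evaluation script.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = fetch_openml("blood-transfusion-service-center", version=1,
                    return_X_y=True, as_frame=False)
y = (y == "2").astype(int)  # assumption: class "2" (donated) is the positive class

# 4 labeled examples ("shots") for training, the remaining rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=4, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
auc = roc_auc_score(y_test, clf.predict_proba(scaler.transform(X_test))[:, 1])
print(f"blood, 4 shots: AUC = {auc:.3f}")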

We hope these instructions help you to reproduce our results. Feel free to contact us if you have any questions!

Citation

If you want to cite our work, please use:

@inproceedings{hegselmann2023tabllm,
  title={Tabllm: Few-shot classification of tabular data with large language models},
  author={Hegselmann, Stefan and Buendia, Alejandro and Lang, Hunter and Agrawal, Monica and Jiang, Xiaoyi and Sontag, David},
  booktitle={International Conference on Artificial Intelligence and Statistics},
  pages={5549--5581},
  year={2023},
  organization={PMLR}
}

We use the code of

@article{liu2022few,
  title={Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning},
  author={Liu, Haokun and Tam, Derek and Muqeeth, Mohammed and Mohta, Jay and Huang, Tenghao and Bansal, Mohit and Raffel, Colin A},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={1950--1965},
  year={2022}
}
@inproceedings{bach2022promptsource,
  title={PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts},
  author={Bach, Stephen and Sanh, Victor and Yong, Zheng Xin and Webson, Albert and Raffel, Colin and Nayak, Nihal V and Sharma, Abheesht and Kim, Taewoon and Bari, M Saiful and F{\'e}vry, Thibault and others},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  pages={93--104},
  year={2022}
}

tabllm's Issues

Custom dataset use

How can I use TabLLM on my own dataset? I am unable to figure out how to do so. Any help?

Element 0 of tensor does not require grad and does not have a grad_fn

I am getting the following error while trying to replicate your code.

Versions used:
torch == 2.0.1
pytorch_lightning == 1.9.2
deepspeed == 0.10.3

11,402.764 Total estimated model params size (MB)

Epoch 0: 0%| | 0/8 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/code/llm/t_few/src/pl_train.py", line 139, in <module>
    main(config)
  File "/code/llm/t_few/src/pl_train.py", line 99, in main
    trainer.fit(model, datamodule)
  File "python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
    result = self._run_optimization(
  File "python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1625, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 80, in optimizer_step
    return super().optimizer_step(model, optimizer, optimizer_idx, closure, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "python3.9/site-packages/transformers/optimization.py", line 649, in step
    loss = closure()
  File "python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 140, in _wrap_closure
    closure_result = closure()
  File "python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
    self._result = self.closure(*args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 143, in closure
    self._backward_fn(step_output.closure_loss)
  File "python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 311, in backward_fn
    self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
  File "python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1763, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 168, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
    model.backward(closure_loss, optimizer, *args, **kwargs)
  File "python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1370, in backward
    loss.backward(*args, **kwargs)
  File "python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Epoch 0: 0%| | 0/8 [03:19<?, ?it/s]

Problem with reproducing

Hi,
Thanks for your work.
When trying to run few-shot-pretrained-100k.sh with the "car" dataset, the pl_train file looks up the dataset via dataset_readers.get_dataset_reader in t-few's configs, but it's not there. Therefore the script cannot complete.

How do I enable a run for the "car" dataset?

Traceback:

/root/t-few/bin/few-shot-pretrained-100k.sh: 45: [[: not found
/root/t-few/bin/few-shot-pretrained-100k.sh: 46: [[: not found
/root/t-few/bin/few-shot-pretrained-100k.sh: 49: [[: not found
/root/t-few/bin/few-shot-pretrained-100k.sh: 52: [[: not found
/root/t-few/bin/few-shot-pretrained-100k.sh: 55: [[: not found
/root/t-few/bin/few-shot-pretrained-100k.sh: 58: [[: not found
/root/t-few/bin/few-shot-pretrained-100k.sh: 61: [[: not found
/root/t-few/bin/few-shot-pretrained-100k.sh: 64: [[: not found
/root/t-few/bin/few-shot-pretrained-100k.sh: 67: [[: not found
/root/t-few/bin/few-shot-pretrained-100k.sh: 70: [[: not found
[2023-10-05 12:21:45,078] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Start experiment t011b_car_numshot4_seed42_ia3_pretrained100k
{"exp_dir": "exp_out/t011b_car_numshot4_seed42_ia3_pretrained100k", "exp_name": "t011b_car_numshot4_seed42_ia3_pretrained100k", "allow_skip_exp": false, "seed": 42, "model": "EncDec", "max_seq_len": 1024, "origin_model": "bigscience/T0", "load_weight": "pretrained_checkpoints/t011b_ia3_finish.pt", "dataset": "car", "few_shot": true, "num_shot": 4, "few_shot_random_seed": 42, "train_template_idx": -1, "eval_template_idx": -1, "batch_size": 4, "eval_batch_size": 16, "num_workers": 8, "change_hswag_templates": false, "raft_cross_validation": true, "raft_validation_start": 0, "raft_labels_in_input_string": "comma", "cleaned_answer_choices_b77": false, "compute_precision": "bf16", "compute_strategy": "none", "num_steps": 30, "eval_epoch_interval": 30, "eval_before_training": false, "save_model": true, "save_step_interval": 20000, "mc_loss": 1, "unlikely_loss": 1, "length_norm": 1, "grad_accum_factor": 1, "split_option_at_inference": false, "optimizer": "adafactor", "lr": 0.003, "trainable_param_names": ".*lora_b.*", "scheduler": "linear_decay_with_warmup", "warmup_ratio": 0.06, "weight_decay": 0, "scale_parameter": true, "grad_clip_norm": 1, "model_modifier": "lora", "prompt_tuning_num_prefix_emb": 100, "prompt_tuning_encoder": true, "prompt_tuning_decoder": true, "lora_rank": 0, "lora_scaling_rank": 1, "lora_init_scale": 0.0, "lora_modules": ".*SelfAttention|.*EncDecAttention|.*DenseReluDense", "lora_layers": "k|v|wi_1.*", "bitfit_modules": ".*", "bitfit_layers": "q|k|v|o|wi_[01]|w_o", "adapter_type": "normal", "adapter_non_linearity": "relu", "adapter_reduction_factor": 4, "normal_adapter_residual": true, "lowrank_adapter_w_init": "glorot-uniform", "lowrank_adapter_rank": 1, "compacter_hypercomplex_division": 8, "compacter_learn_phm": true, "compacter_hypercomplex_nonlinearity": "glorot-uniform", "compacter_shared_phm_rule": false, "compacter_factorized_phm": false, "compacter_shared_W_phm": false, "compacter_factorized_phm_rule": false, "compacter_phm_c_init": "normal", "compacter_phm_rank": 1, "compacter_phm_init_range": 0.01, "compacter_kronecker_prod": false, "compacter_add_compacter_in_self_attention": false, "compacter_add_compacter_in_cross_attention": false, "intrinsic_projection": "fastfood", "intrinsic_said": true, "intrinsic_dim": 2000, "intrinsic_device": "cpu", "fishmask_mode": null, "fishmask_path": null, "fishmask_keep_ratio": 0.05, "prefix_tuning_num_input_tokens": 10, "prefix_tuning_num_target_tokens": 10, "prefix_tuning_init_path": null, "prefix_tuning_init_text": null, "prefix_tuning_parameterization": "mlp-512", "train_pred_file": "exp_out/t011b_car_numshot4_seed42_ia3_pretrained100k/train_pred.txt", "dev_pred_file": "exp_out/t011b_car_numshot4_seed42_ia3_pretrained100k/dev_pred.txt", "dev_score_file": "exp_out/t011b_car_numshot4_seed42_ia3_pretrained100k/dev_scores.json", "test_pred_file": "exp_out/t011b_car_numshot4_seed42_ia3_pretrained100k/test_pred.txt", "test_score_file": "exp_out/t011b_car_numshot4_seed42_ia3_pretrained100k/test_scores.json", "finish_flag_file": "exp_out/t011b_car_numshot4_seed42_ia3_pretrained100k/exp_completed.txt"}
Mark experiment t011b_car_numshot4_seed42_ia3_pretrained100k as claimed
Downloading (…)okenizer_config.json: 100% 1.86k/1.86k [00:00<00:00, 287kB/s]
Downloading (…)lve/main/config.json: 100% 633/633 [00:00<00:00, 107kB/s]
Downloading spiece.model: 100% 792k/792k [00:00<00:00, 18.9MB/s]
Downloading (…)cial_tokens_map.json: 100% 1.79k/1.79k [00:00<00:00, 1.00MB/s]
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Downloading pytorch_model.bin: 100% 44.5G/44.5G [03:06<00:00, 238MB/s]
Traceback (most recent call last):
  File "/usr/local/envs/tabllm/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/envs/tabllm/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/t-few/bin/src/pl_train.py", line 92, in <module>
    main(config)
  File "/root/t-few/bin/src/pl_train.py", line 40, in main
    dataset_reader = get_dataset_reader(config)
  File "/root/t-few/bin/src/data/dataset_readers.py", line 16, in get_dataset_reader
    dataset_class = {
KeyError: 'car'

Requesting the 'pl_train' module from src

Hello TabLLM authors, thanks for providing the source code. I am especially interested in the fine-tuning scheme of your model. However, I could not find any training script in the current repository. The closest thing I found is 'src.pl_train', which is referenced in the shell script located in t-few/bin and appears to be the module executed there. Could you kindly provide this module?

How can we use Llama2 here?

I see from the code repo that OpenAI APIs are used. How can we make this work for open-source models like Llama 2? Can someone give me details on this and the steps I need to follow?

The scores after every 30 epochs don't change at all during fine-tuning

Hi,

I am trying to fine-tune the t03b model with a 4-shot approach on the heart dataset. On checking the logs after every 30 epochs, I don't see any difference in the printed scores. It prints the same score after the first 30 epochs and the same at the end of 5600 epochs:

{"AUC": 0.5823586744639375, "PR": 0.6555849741948994, "micro_f1": 0.5706521739130435, "macro_f1": 0.4795001253267447, "accuracy": 0.5706521739130435, "num": 184, "num_steps": -1, "score_gt": 0.3382242697736491, "score_cand": 0.37344987593267276}
....
....
....
{"AUC": 0.5823586744639375, "PR": 0.6555849741948994, "micro_f1": 0.5706521739130435, "macro_f1": 0.4795001253267447, "accuracy": 0.5706521739130435, "num": 184, "num_steps": 30, "score_gt": 0.3382242697736491, "score_cand": 0.37344987593267276}

I also tried the same thing on a different dataset and got the same results.

Can you please tell me what am I doing wrong?

Only dev_scores.json available in exp_out folder

Hi,

I am trying to train a model on the heart dataset. The training works, but I am unable to find the following files in the exp_out folder:

train_pred.txt
dev_pred.txt
test_pred.txt
test_score.txt


My main concern is: now that we have the trained model checkpoint, how can we use it to run inference on the dev/test set and obtain the results?

Thanks!

The table_to_text function returns an empty list

Hi

I am trying to run the code in the create_external_datasets.py file for the income dataset using t0serialization or tabletotext with debug arguments. However, the function table_to_text has a regular expression that returns an empty list (around line 180 in the file). I think there is the same problem in the entry_to_text function.

def table_to_text(note):
    re_name_value = re.compile(r"^- (.*):([^:]*)$", re.MULTILINE)
    name_values = re_name_value.findall(note)  # ---> name_values = []
    examples = [write_into_table(x[0].strip(), x[1].strip()) for x in name_values]
    return [preprocess(e)['linearized_table'] for e in examples]

I cannot find what the issue is. Could you please let me know whether the problem is with the template or something else? Thanks

Can TabLLM run on Windows?

Dear authors, I can't get dev_scores.json and the other files when I run the t-few code. Can TabLLM only run on Linux?
Thanks!

AttributeError: Can't pickle local object 'get_linear_schedule_with_warmup.<locals>.lr_lambda'

Traceback (most recent call last):
  File "/opt/conda/envs/tfew/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/tfew/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/codespace/TABLLM/t-few/src/pl_train.py", line 86, in <module>
    main(config)
  File "/codespace/TABLLM/t-few/src/pl_train.py", line 57, in main
    trainer.fit(model, datamodule)
  File "/opt/conda/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/opt/conda/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/opt/conda/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 173, in start_training
    self.spawn(self.new_process, trainer, self.mp_queue, return_result=False)
  File "/opt/conda/envs/tfew/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 201, in spawn
    mp.spawn(self._wrapped_function, args=(function, args, kwargs, return_queue), nprocs=self.num_processes)
  File "/opt/conda/envs/tfew/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/envs/tfew/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/opt/conda/envs/tfew/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/tfew/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/tfew/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/tfew/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/tfew/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/tfew/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'get_linear_schedule_with_warmup.<locals>.lr_lambda'

How can I apply it to my own data set?

Hi @stefanhgm,
My current dataset contains test data and diagnostic reports for 3,000 patients. The number of rows of test data is not uniform. Each test recording is a three-column csv file: the first column is the time, starting from 0 and incrementing every 0.1 seconds, and the other two columns are the test content. The diagnostic report is a txt file that contains the patient's examination information and the doctor's diagnostic opinion. Each csv file corresponds to one txt file. How should I process my dataset to train your model?

Zero-shot inference with t-few does not produce the scores and inferences

Hi @stefanhgm,

I am able to run the t-few code to run inference on all the datasets with any number of shots except 0. Following your instructions, I changed num_shot to 0 in the .sh file to run pl_train.py, but the output experiment folder does not contain the expected dev_scores.json and the inferences. It works for the other shot counts; the issue only happens for zero-shot. I've attached the terminal output for your reference.

yaml template custom tag

Hi, the yaml template has two custom tags, !Template and !Meta. When I use FullLoader, the loader can't recognize the two custom tags and reports a constructor error. However, I didn't find a file or function relevant to the constructor definition. Thanks!

Error with Creating Serialized Datasets

When I was creating serialized datasets with the following command, it always shows the error 'no matches found'.
Code: create_external_datasets.py --dataset (car|income|diabetes|heart|bank|blood|calhousing|creditg|jungle) (--list) (--list (--tabletotext|--t0serialization|--values|--permuted|--shuffled))
Error: zsh: no matches found: (car|income|diabetes|heart|bank|blood|calhousing|creditg|jungle)

I did run the previous commands in the Preparing the Environments section.

Could you please upload the entire source code?

Thanks for your interesting work! I cloned your repo and tried to reproduce your results from the paper; however, I found that many modules are missing from the code, including in the data processing procedure, the models, etc. Could you please clean up the code and add the missing parts? I think it would significantly benefit the field of tabular learning and push it forward.

dev_scores.json is not found when num_shot is 0

Hi,

I tried to reproduce the results in the zero-shot scenario. However, when num_shot is set to 0 in few-shot-pretrained-100k.sh, I cannot see dev_scores.json in exp_out, while I can see it when num_shot is 4. Did I misconfigure anything or misunderstand the usage?

  # For zero-shot set to '0', for all to 'all'
  for num_shot in 4 8 16 32 64 128 256 512
  do
    ...

Key error: "probabilities"

Hi there,

I encounter this error when I try to run the code. Could you please take a look at it?

File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 146, in run
    self.on_advance_end()
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
    self._run_validation()
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
    self.val_loop.run()
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 131, in on_run_end
    self._evaluation_epoch_end(outputs)
  File "/data/home/jianfengchi/miniconda/envs/llmtab/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 236, in _evaluation_epoch_end
    model.validation_epoch_end(outputs)
  File "/data/home/jianfengchi/code/TabLLM/t-few/src/models/EncoderDecoder.py", line 256, in validation_epoch_end
    metrics = self.dataset_reader.compute_metric(accumulated)
  File "/data/home/jianfengchi/code/TabLLM/t-few/src/data/dataset_readers.py", line 312, in compute_metric
    pos_probs = [p[1] for p in accumulated['probabilities']]

global.json not found

Hi there,

It seems that your code is missing a json file, global.json. Could you please upload it? Thanks

  File "TabLLM/t-few/src/utils/Config.py", line 118, in __init__
    self.update_kwargs(json.load(open(filename)), eval=False)
FileNotFoundError: [Errno 2] No such file or directory: 'TabLLM/t-few/configs/global.json'

GPT-3 zero-shot prediction

Hi there,

In your evaluate_external_dataset.py, the gpt3_output file in add_gpt3_zero_shot_predictions is missing. Could you please share the script to query GPT-3 and generate the zero-shot predictions? Thanks!
