dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf

Home Page: https://dreamquark-ai.github.io/tabnet/

License: MIT License

Dockerfile 0.22% Makefile 1.90% Jupyter Notebook 35.41% Python 59.96% Shell 1.68% Batchfile 0.31% Smarty 0.05% CSS 0.25% HTML 0.22%

pytorch deep-neural-networks machine-learning-library tabular-data research-paper pytorch-tabnet tabnet

tabnet's Issues

Weight initialization different from the original paper

From the experiment section of the TabNet paper:

"Adam optimization algorithm (Kingma & Ba, 2014) and Glorot uniform initialization are used for training of all models."

Also, from the TensorFlow implementation provided by the authors, they used tf.layers.dense which seems to use glorot_uniform by default.

However, in the tab_network.py:

def initialize_non_glu(module, input_dim, output_dim):
    gain_value = np.sqrt((input_dim+output_dim)/np.sqrt(4*input_dim))
    torch.nn.init.xavier_normal_(module.weight, gain=gain_value)
    # torch.nn.init.zeros_(module.bias)
    return


def initialize_glu(module, input_dim, output_dim):
    gain_value = np.sqrt((input_dim+output_dim)/np.sqrt(input_dim))
    torch.nn.init.xavier_normal_(module.weight, gain=gain_value)
    # torch.nn.init.zeros_(module.bias)
    return

So my questions are:

Why use Glorot normal initialization instead of Glorot uniform initialization as described in the paper?
What are the reasons behind the formulas used here to calculate the gain value? Is there any reference for this? The recommended gain value for a linear layer should be the default value 1.

Thanks!
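To make the second question concrete, here is a small worked calculation (a sketch, not code from the repository). It assumes that input_dim and output_dim equal the layer's fan-in and fan-out, and it uses the documented PyTorch formulas: xavier_normal_ samples with std = gain * sqrt(2 / (fan_in + fan_out)), and xavier_uniform_ samples from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)). The helper names are hypothetical.

```python
import math


def xavier_normal_std(fan_in, fan_out, gain=1.0):
    """Standard deviation used by Glorot (Xavier) normal initialization."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound a of the U(-a, a) range used by Glorot (Xavier) uniform initialization."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def non_glu_gain(input_dim, output_dim):
    """Gain formula quoted above from initialize_non_glu."""
    return math.sqrt((input_dim + output_dim) / math.sqrt(4 * input_dim))


def glu_gain(input_dim, output_dim):
    """Gain formula quoted above from initialize_glu."""
    return math.sqrt((input_dim + output_dim) / math.sqrt(input_dim))


fan_in, fan_out = 16, 8  # arbitrary example sizes

std_default = xavier_normal_std(fan_in, fan_out)  # gain = 1
std_non_glu = xavier_normal_std(fan_in, fan_out, non_glu_gain(fan_in, fan_out))
std_glu = xavier_normal_std(fan_in, fan_out, glu_gain(fan_in, fan_out))

# A uniform U(-a, a) variable has std a / sqrt(3), so with the same gain both
# Glorot variants have the same variance and differ only in distribution shape.
std_uniform_default = xavier_uniform_bound(fan_in, fan_out) / math.sqrt(3)
```

Under these assumptions the (input_dim + output_dim) term cancels, so the resulting standard deviation depends only on input_dim: input_dim ** -0.25 for the non-GLU case and sqrt(2) * input_dim ** -0.25 for the GLU case.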

Can't really set n_independent or n_shared to zero

Describe the bug

What is the current behavior?
It is possible to train with n_independent=0 and n_shared=0 without any error, but looking at the code, a value of 0 seems to be treated as 1. The effective minimum is therefore 1, which should not be the case.

If the current behavior is a bug, please provide the steps to reproduce.

Expected behavior

Setting both values to 0 should raise a clear error, but otherwise 0 should mean 0.
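The expected behavior could be sketched as a small validation step. This is only an illustration: check_layer_counts is a hypothetical helper, not a function of pytorch-tabnet, and the error messages are made up.

```python
def check_layer_counts(n_independent, n_shared):
    """Hypothetical validation helper (not part of pytorch-tabnet)."""
    for name, value in (("n_independent", n_independent), ("n_shared", n_shared)):
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    if n_independent == 0 and n_shared == 0:
        # With no layer at all the feature transformer would be empty.
        raise ValueError("n_independent and n_shared cannot both be 0")
    return n_independent, n_shared


# 0 is accepted for one of the two values and is returned unchanged.
counts = check_layer_counts(0, 2)

# Both values at 0 produce a clear error instead of silently becoming 1.
try:
    check_layer_counts(0, 0)
    both_zero_rejected = False
except ValueError:
    both_zero_rejected = True
```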

Screenshots

Other relevant information:
poetry version:
python version:
Operating System:
Additional tools:

Additional context

Add Changelog, and process for release

question : only fc is shared, but bn (and glu) is not?

According to Figure 4(a) of the paper, all FC-BN-GLU blocks in the feature transformer appear to be shared. However, your implementation shares only the FC layers.

Is there a reason for this implementation?

Using Pytorch-Tabnet as nn.Module blocks or torchvision models

import torch
import torch.nn as nn
import transformers

# ROBERTA_PATH is assumed to be defined elsewhere (path to the pretrained weights).
class Roberta(transformers.BertPreTrainedModel):
    def __init__(self, conf):
        super().__init__(conf)
        self.roberta = transformers.RobertaModel.from_pretrained(ROBERTA_PATH, config=conf)

        self.dropout = nn.Dropout(0.1)
        self.l0 = nn.Linear(768, 2)

        # Initialize the weights of the only head defined above.
        torch.nn.init.normal_(self.l0.weight, std=0.02)

I want to do something like this with TabNet: build my own custom model so that I have the full flexibility of a neural network, without being limited to a scikit-learn style interface.

getting error while fit

Describe the bug

new() received an invalid combination of arguments - got (list, int), but expected one of:

(*, torch.device device)
didn't match because some of the arguments have invalid types: (!list!, !int!)
(torch.Storage storage)
(Tensor other)
(tuple of ints size, *, torch.device device)
(object data, *, torch.device device)

Make TabNet Scikit Compatible

Feature request

Currently, the library can't be used as simply as a scikit-learn model. It would be great for it to be fully scikit-learn compatible.

What is the expected behavior?
We need new classes for TabNetRegressor and TabNetClassifier.
We also need scikit-learn compatible global explanations.

What is motivation or use case for adding/changing the behavior?

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Not calling set_params is making model crash (no batch size)

Research : Change Attention Transformer Inputs

Main Remark

Currently in the TabNet architecture, one part of the Feature Transformer output (n_d) is used for the predictions and the rest (n_a) is used as input for the next Attentive Transformer.

But I see a flaw in this design: the Feature Transformer (let's call it FT_i) sees masked input from the previous Attentive Transformer (AT_{i-1}), so the input features of FT_i don't contain all the initial information. How can this help to select other useful features for the next step?

Proposed Solution

I think the attentive transformer should take the raw features as input to select the next step's features. Using the previous masks as a prior, to avoid always selecting the same features at each step, would still work.

So an easy way to try this idea would be to use the feature transformer only for predictions. The attentive transformer could be preceded by its own feature transformer if necessary, but the inputs of an attentive block would be the initial data plus the prior from the previous masks.

This could potentially improve the attentive transformer part.

If you find this interesting, don't hesitate to share your ideas in the comment section or open a PR to propose a solution!

Models don't accept model_name, saving_path

Describe the bug

Models don't accept model_name, saving_path as initialization arguments.

What is the current behavior?

See above.

If the current behavior is a bug, please provide the steps to reproduce.

clf: TabNetClassifier = TabNetClassifier(saving_path="/home/user123/dev/", device_name="cpu")

Expected behavior

Models should accept model_name, saving_path as initialization arguments as specified in the documentation.

Additional context

On a related note: how can models be persisted? The init parameters mentioned above strongly suggest that it is possible, but I couldn't find any information on this, neither in the documentation nor in the code.

RuntimeError: CUDA error: an illegal memory access was encountered

Describe the bug
I get this CUDA error when fitting the classifier. I googled it and found out that this is a common PyTorch error, so I tried to solve it by explicitly setting the GPU device (I have only one GPU, a Tesla T4), but it didn't work. However, when I set the classifier parameter device_name to 'auto', it does recognise my GPU device.
I also tried different batch sizes, without success.

It runs nicely on CPU though, and I'm really not sure how to make it work on GPU. I would appreciate any help if you have already encountered this issue.

Also, I have checked my dataset multiple times to ensure there are no NaN or Inf values in it.

Additional context

The details of the error:

RuntimeError Traceback (most recent call last)
in
7 batch_size=16384, virtual_batch_size=1024,
8 num_workers=0,
----> 9 drop_last=False
10 )

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in fit(self, X_train, y_train, X_valid, y_valid, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last)
133 virtual_batch_size=self.virtual_batch_size,
134 momentum=self.momentum,
--> 135 device_name=self.device_name).to(self.device)
136
137 self.reducing_matrix = create_explain_matrix(self.network.input_dim,

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in __init__(self, input_dim, output_dim, n_d, n_a, n_steps, gamma, cat_idxs, cat_dims, cat_emb_dim, n_independent, n_shared, epsilon, virtual_batch_size, momentum, device_name)
250 device_name = 'cpu'
251 self.device = torch.device(device_name)
--> 252 self.to(self.device)
253
254 def forward(self, x):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
--> 425 return self._apply(convert)
426
427 def register_backward_hook(self, hook):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
199 def _apply(self, fn):
200 for module in self.children():
--> 201 module._apply(fn)
202
203 def compute_should_use_set_data(tensor, tensor_applied):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
221 # with torch.no_grad():
222 with torch.no_grad():
--> 223 param_applied = fn(param)
224 should_use_set_data = compute_should_use_set_data(param, param_applied)
225 if should_use_set_data:

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in convert(t)
421
422 def convert(t):
--> 423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
425 return self._apply(convert)

RuntimeError: CUDA error: an illegal memory access was encountered

Verbosity with LR scheduler is not working properly

Describe the bug

What is the current behavior?
When verbose is set to a value > 1 together with a scheduler, the verbosity of the learning rate messages doesn't match the verbosity of the training logs:
see https://www.kaggle.com/tanulsingh077/achieving-sota-results-with-tabnet#877426

Expected behavior
Learning rate messages should follow the same verbosity (or possibly be hidden, not sure).

Additional context

[Question/Feature Request] You mentioned it works with GPU, does Fast-TabNet work with TPUs?

Feature request

What is the expected behavior?
Same as the outcome on CPUs and GPUs

What is motivation or use case for adding/changing the behaviour?
Better training performance

How should this be implemented in your opinion?
Similar to how TensorFlow/PyTorch send the data to the TPU.

Are you willing to work on this yourself?
Happy to contribute along with another experienced developer

Bug with 1 shared layer and 2 independent layers

Describe the bug

There is a problem with the way we handle layer indexing that leads to a bug.

What is the current behavior?

You'll get an error if you try to set n_shared to 1 and n_independent to 2, for example.

Expected behavior

We should be able to use any value without error.
A fairly simple fix should be enough.

Ghost Batch Norm : refactorize

Feature request

What is the expected behavior?
As discussed in #102 with @hengck23, the ghost batch norm implementation could probably be improved. His code could be a good solution: https://gist.github.com/hengck23/c21b8b6f2f34634687ebd8a4e963f560

What is motivation or use case for adding/changing the behavior?

Cleaner and faster implementation

How should this be implemented in your opinion?
see above

Are you willing to work on this yourself?
why not
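For readers unfamiliar with the technique, ghost batch normalization splits each batch into smaller virtual batches and normalizes every virtual batch with its own statistics. The sketch below illustrates this on a single feature in plain Python. It is neither the library code nor the gist linked above, and it leaves out the learnable scale and shift as well as the running statistics; ghost_batch_norm is a hypothetical name.

```python
import statistics


def ghost_batch_norm(values, virtual_batch_size, eps=1e-5):
    """Normalize one feature column chunk by chunk (hypothetical sketch)."""
    normalized = []
    for start in range(0, len(values), virtual_batch_size):
        chunk = values[start:start + virtual_batch_size]
        mean = statistics.fmean(chunk)
        # Batch norm normalizes with the biased (population) variance.
        var = statistics.pvariance(chunk, mu=mean)
        normalized.extend((v - mean) / (var + eps) ** 0.5 for v in chunk)
    return normalized


batch = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
out = ghost_batch_norm(batch, virtual_batch_size=4)
first_chunk, second_chunk = out[:4], out[4:]
```

Each chunk ends up centered on zero even though the two halves of the batch live on very different scales, which is the behavior a faster vectorized implementation would have to preserve.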

Add __str__ and __repr__ methods

It would be good to have __str__ and __repr__ methods.

Research : Boosted-TabNet?

Main Remark

The TabNet architecture uses sequential steps in order to mimic some kind of random forest paradigm.
But since boosting algorithms often outperform random forests, shouldn't we try to move towards boosting methods instead?

Proposed Solutions

One solution I see would be to predict a different target at each step of TabNet in order to perform boosting:

the first step would remain as it is now
the second step would try to predict the residuals (i.e. the difference between the actual target and the first step's predictions)
each following step would try to predict residuals as well (i.e. the difference between the actual target and the sum of the previous steps' predictions)

This looks like it could work quite easily for regression problems, but I'm not sure how it could work for classification tasks: you can't stay in the classification paradigm and try to predict residuals. If anyone knows of a specific loss function that would make this possible, I think it's worth a try!

If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!
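The residual scheme proposed above can be written down for the regression case as follows. This is only a sketch of the bookkeeping, with made-up numbers standing in for the per-step predictions; residual_targets is a hypothetical helper and nothing here is TabNet code.

```python
def residual_targets(y_true, step_predictions):
    """Return the target each step would be trained on, plus the final sum.

    Step 1 is trained on the actual target. Every later step is trained on
    the actual target minus the sum of the previous steps' predictions.
    """
    targets = []
    running_sum = [0.0] * len(y_true)
    for preds in step_predictions:
        targets.append([y - s for y, s in zip(y_true, running_sum)])
        running_sum = [s + p for s, p in zip(running_sum, preds)]
    return targets, running_sum


y_true = [10.0, 4.0]
# Illustrative predictions of three steps for two samples.
step_predictions = [[8.0, 5.0], [1.5, -0.5], [0.5, -0.5]]

targets, final_prediction = residual_targets(y_true, step_predictions)
```

The final prediction is the sum of all step outputs, which is what makes the scheme natural for regression and harder to transfer to classification.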

[Question/Feature Request] Any explainability output or examples available to try out

Feature request

What is the expected behavior?
No behaviour changes. Rather add examples to the docs or the examples section of the repo.

What is motivation or use case for adding/changing the behavior?
Make it easy for users to adopt it in their ML workflow, as explainability is currently an important topic.

How should this be implemented in your opinion?
No implementation needed, just docs and examples either as a python code snippet, a Jupyter notebook or a Kaggle kernel will be sufficient.

Are you willing to work on this yourself?
yes

Checkpoints

Feature request

Save/load/average checkpoints.

What is the expected behavior?

What is motivation or use case for adding/changing the behavior?
Smarter early stopping and possibly better generalization on predictions.

How should this be implemented in your opinion?
Good source of inspiration here: https://github.com/Qwicen/node/blob/master/lib/trainer.py

Are you willing to work on this yourself?
yes

[Question/Feature Request] Any example of applying tabnet in reinforcement and self-supervised learning

Feature request

What is the expected behavior?
New example in the examples section of the repo.

What is motivation or use case for adding/changing the behavior?
Adding a new application area.

How should this be implemented in your opinion?
Just docs and examples of using tabnet with openai and small data.

Are you willing to work on this yourself?
Maybe. Not sure.

#Abhishek-eBook

device in torch.nn.Module

Hello and thank you for your great work!

What was the idea behind passing device parameter to the constructor of nn.Module and storing it? I've never seen that pattern before in Pytorch.

Training is very slow

Hi all,
Thanks for the clean implementation of this model!
I'm comparing tabnet to an MLP and some gradient boosted tree models on a very large (~terabyte) dataset. Tabnet is several orders of magnitude slower than the MLP with a comparable parameter count. It also seems to occupy a lot of memory on the GPU. Is this expected and is there something I can do about this?

Improve verbosity

Feature request

Currently we plot scores every #verbose epochs, but we should incorporate callbacks, or at least a history object, to avoid calling matplotlib each time.

What is the expected behavior?
Something XGBoost like

What is motivation or use case for adding/changing the behavior?
Many

How should this be implemented in your opinion?
Not quite sure yet

Are you willing to work on this yourself?
yes

Sample weight support for regression problems

Feature request

What is the expected behavior?
It would be very helpful to add sample weight support for regression problems. The idea would be to add a 'sample_weight' parameter to the .fit() call, and give a weighted regression.

What is motivation or use case for adding/changing the behavior?
Many datasets involve different sample weights. This is especially common with sports data (where I work), but is frequently used elsewhere.

How should this be implemented in your opinion?
The usual implementation I've seen has been to multiply the individual residuals by the sample weight, but I am not very familiar with the underlying math here, so don't know how it would work.

Are you willing to work on this yourself?
I am happy to help, but my understanding of the underlying code is lacking at the moment.
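One common way to define a weighted regression loss is a weighted mean squared error, where each squared residual is multiplied by its sample weight and the total is divided by the sum of the weights. The sketch below shows that definition in plain Python. It is an assumption about how such a feature could behave, not the library's implementation, and weighted_mse is a hypothetical name.

```python
def weighted_mse(y_true, y_pred, sample_weight):
    """Weighted mean squared error (hypothetical sketch)."""
    total_weight = sum(sample_weight)
    if total_weight <= 0:
        raise ValueError("sample weights must sum to a positive value")
    weighted_sq_errors = (
        w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, sample_weight)
    )
    return sum(weighted_sq_errors) / total_weight


# The second sample counts twice as much as the others.
loss = weighted_mse([1.0, 2.0, 3.0], [1.0, 1.0, 5.0], [1.0, 2.0, 1.0])

# With equal weights the result is the ordinary mean squared error.
plain = weighted_mse([1.0, 2.0, 3.0], [1.0, 1.0, 5.0], [1.0, 1.0, 1.0])
```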

What is the objective function to optimize?

I notice that the train and valid accuracy are reported for every epoch.
Is accuracy the metric used for the optimization? I am currently dealing with a binary classification problem, and I would like to use AUC or recall as a metric. Would that be possible too?

Thank you very much for your response.

Adding Callbacks

Main Problem

Currently some things can be changed, like the scheduler or the optimizer, but it's not possible to change the loss function, the early stopping metric, and probably other things that matter for specific problems.

Proposed Solutions

We should find a simple way of using callbacks in order to make the training process more customizable.
Something that would resemble one of these:

The simpler and less invasive the solution is for the code, the better.

If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!

RandomizedSearchCV with pytorch-tabnet

It appears that the TabNetClassifier does not have a get_params method for hyperparameter estimation.

Is this reproducible on your end?

Many thanks

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-33-03d6c8d15377> in <module>()
      4 
      5 start = time()
----> 6 randomSearch.fit(X_train, y_train)
      7 
      8 

1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/base.py in clone(estimator, safe)
     65                             "it does not seem to be a scikit-learn estimator "
     66                             "as it does not implement a 'get_params' methods."
---> 67                             % (repr(estimator), type(estimator)))
     68     klass = estimator.__class__
     69     new_object_params = estimator.get_params(deep=False)

TypeError: Cannot clone object 'TabNetClassifier(n_d=32, n_a=32, n_steps=5,
                 lr=0.02, seed=0,
                 gamma=1.5, n_independent=2, n_shared=2,
                 cat_idxs=[],
                 cat_dims=[],
                 cat_emb_dim=1,
                 lambda_sparse=0.0001, momentum=0.3,
                 clip_value=2.0,
                 verbose=1, device_name="auto",
                 model_name="DreamQuarkTabNet", epsilon=1e-15,
                 optimizer_fn=<class 'torch.optim.adam.Adam'>,
                 scheduler_params={'gamma': 0.95, 'step_size': 20},
                 scheduler_fn=<class 'torch.optim.lr_scheduler.StepLR'>, saving_path="./")' (type <class 'pytorch_tabnet.tab_model.TabNetClassifier'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
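As the traceback shows, scikit-learn's clone calls estimator.get_params(deep=False) and then rebuilds the object from its class. The sketch below illustrates that contract with a toy class, without importing scikit-learn. ParamsMixin and ToyModel are hypothetical names, and the approach assumes every constructor argument is stored under an attribute of the same name. In practice, inheriting from sklearn.base.BaseEstimator provides get_params and set_params.

```python
import inspect


class ParamsMixin:
    """Hypothetical mixin providing scikit-learn style parameter access."""

    def get_params(self, deep=True):
        # Parameter names are read from the constructor signature.
        signature = inspect.signature(type(self).__init__)
        names = [name for name in signature.parameters if name != "self"]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"invalid parameter {name!r}")
            setattr(self, name, value)
        return self


class ToyModel(ParamsMixin):
    """Stand-in for an estimator; the parameters are illustrative only."""

    def __init__(self, n_d=8, n_steps=3):
        self.n_d = n_d
        self.n_steps = n_steps


model = ToyModel(n_d=32).set_params(n_steps=5)

# The same two steps that clone performs: read the params, rebuild the object.
rebuilt = type(model)(**model.get_params(deep=False))
```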

Add CI to launch unit test

Speed up: stop computing masks explanations during training

Feature request

What is the expected behavior?
During training, masks don't need to be available for users. We could skip some computations as discussed in #102

What is motivation or use case for adding/changing the behavior?
This should speed things up

How should this be implemented in your opinion?
not sure yet

Are you willing to work on this yourself?
yes

Refactorize embeddings

Feature request

Creating an external module for embeddings generation would make code clearer.
Some improvement to skip this part if no embeddings are needed would also make the training faster (see #97 ).

What is the expected behavior?
Nothing would change, just code optimization

What is motivation or use case for adding/changing the behavior?
Code clearer and faster.

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Make the model scikit compatible (eg, for grid search)

Feature request

What is the expected behavior?

What is motivation or use case for adding/changing the behavior?

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Build doc and make it github pages

Feature request

What is the expected behavior?

What is motivation or use case for adding/changing the behavior?

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

model saving produces error on Windows

Hi,

With pytorch-tabnet 1.0.4 on Windows I got this error:
OSError: [Errno 22] Invalid argument: './DreamQuarkTabNet_13-03-2020_12:47:25.pt'

In tab_model.py, lines 112-113, model_name is defined with:
dt_string = now.strftime("_%d-%m-%Y_%H:%M:%S")
self.model_name += dt_string

Once this has run, the following call produces the above error on Windows, because ":" is not allowed in Windows file names:
torch.save(self.network, self.saving_path+f"{self.model_name}.pt")

--> Please change
line 113 to
dt_string = now.strftime("_%d-%m-%Y_%H_%M_%S")
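The effect of the proposed change can be checked with the standard library alone. The sketch below uses the timestamp from the error message; the leading underscore in the format follows the file name shown in that message, and timestamp_suffix is a hypothetical helper name.

```python
from datetime import datetime


def timestamp_suffix(now):
    """Build a file name suffix that avoids ':' (hypothetical helper)."""
    return now.strftime("_%d-%m-%Y_%H_%M_%S")


# Same date and time as in the reported error message.
suffix = timestamp_suffix(datetime(2020, 3, 13, 12, 47, 25))
file_name = "DreamQuarkTabNet" + suffix + ".pt"
```

The resulting name contains only underscores, digits, hyphens and a dot, so it is valid on Windows as well as on Linux and macOS.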

Embedding dims does not work for cat_emb_dim > 1

Describe the bug

What is the current behavior?

If you set cat_emb_dim to a value greater than 1, you get a dimension error caused by explain and the embeddings.

If the current behavior is a bug, please provide the steps to reproduce.

Expected behavior

This should work and return the sum of importances over the embedded dimensions.

test slack

Example Regression not working in Google Colab

Describe the bug
I tried to run on gpu and then on cpu and with different embedding sizes.
Still I get a dimension error.

Here is the link to the notebook:
https://colab.research.google.com/drive/1wDQ28PNxtEJA1XZyN2eVA6iTSd6ctf-E?usp=sharing

Maybe related to #94

RuntimeError: CUDA error: device-side assert triggered

Describe the bug

I get this CUDA error when trying to fit the classifier (with GPU).

I've also tried switching to CPU and got a different error => "RuntimeError: Invalid index in gather at /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:657" where now the error seems to be related to an index tensor that has invalid indices and I'm not sure on how to solve this.

What is the current behavior?
This error happens when fitting a classifier with exactly the same parameters as in the "census_examples" notebook, but on a different dataset.

Additional context

Here are the details of the error when running fit on CPU:

RuntimeError Traceback (most recent call last)
in
7 batch_size=512, virtual_batch_size=128,
8 num_workers=0,
----> 9 drop_last=False
10 )

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in fit(self, X_train, y_train, X_valid, y_valid, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last)
165 self.patience_counter < self.patience):
166 starting_time = time.time()
--> 167 fit_metrics = self.fit_epoch(train_dataloader, valid_dataloader)
168
169 # leaving it here, may be used for callbacks later

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in fit_epoch(self, train_dataloader, valid_dataloader)
222 DataLoader with valid set
223 """
--> 224 train_metrics = self.train_epoch(train_dataloader)
225 valid_metrics = self.predict_epoch(valid_dataloader)
226

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in train_epoch(self, train_loader)
487
488 for data, targets in train_loader:
--> 489 batch_outs = self.train_batch(data, targets)
490 if self.output_dim == 2:
491 y_preds.append(torch.nn.Softmax(dim=1)(batch_outs["y_preds"])[:, 1]

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in train_batch(self, data, targets)
530 self.optimizer.zero_grad()
531
--> 532 output, M_loss = self.network(data)
533
534 loss = self.loss_fn(output, targets)

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in forward(self, x)
254 def forward(self, x):
255 x = self.embedder(x)
--> 256 return self.tabnet(x)
257
258 def forward_masks(self, x):

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in forward(self, x)
130
131 for step in range(self.n_steps):
--> 132 M = self.att_transformers[step](prior, att)
133 M_loss += torch.mean(torch.sum(torch.mul(M, torch.log(M+self.epsilon)),
134 dim=1))

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in forward(self, priors, processed_feat)
290 x = self.bn(x)
291 x = torch.mul(x, priors)
--> 292 x = self.sp_max(x)
293 return x
294

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/sparsemax.py in forward(self, input)
89
90 def forward(self, input):
---> 91 return sparsemax(input, self.dim)
92
93

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/sparsemax.py in forward(ctx, input, dim)
41 max_val, _ = input.max(dim=dim, keepdim=True)
42 input -= max_val # same numerical stability trick as for softmax
---> 43 tau, supp_size = SparsemaxFunction._threshold_and_support(input, dim=dim)
44 output = torch.clamp(input - tau, min=0)
45 ctx.save_for_backward(supp_size, output)

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/sparsemax.py in _threshold_and_support(input, dim)
74
75 support_size = support.sum(dim=dim).unsqueeze(dim)
---> 76 tau = input_cumsum.gather(dim, support_size - 1)
77 tau /= support_size.to(input.dtype)
78 return tau, support_size

RuntimeError: Invalid index in gather at /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:657

Performance of pytorch-tabnet on forest cover type dataset

When I run the forest_example out of the box, the results differ significantly from the ones in the original paper. Specifically, I get the following:

preds = clf.predict_proba(X_test)
y_true = y_test
test_acc = accuracy_score(y_pred=np.argmax(preds, axis=1), y_true=y_true)
print(f"BEST VALID SCORE FOR {dataset_name} : {clf.best_cost}")
BEST VALID SCORE FOR EPIGN : -0.8830427851320214

print(f"FINAL TEST SCORE FOR {dataset_name} : {test_acc}")
FINAL TEST SCORE FOR EPIGN : 0.0499728922661205

Do you get similar results? Many thanks.

Research : Binary Mask vs Sparse Mask?

Main Remark

The TabNet architecture uses the sparsemax function to perform instance-wise feature selection, and this is one of the important features of TabNet.

One of the interesting properties of sparsemax is that its outputs sum to 1, but do we really want this?
Is it the role of the mask to perform both selection (0s for unused features) and importance (a value between 0 and 1)?
I would say that the feature transformer should be used to create importance (by summing the values of the ReLU outputs, as is done in the paper) and the masks should be binary masks that would not sum to 1.

One problem I see with non-binary masks is that they change the values seen by the next layers. If someone is 50 years old and the attention layer thinks that age is half of the solution, then the attention for age would be 0.5 and the next layer would see age=25. But how can the next layers tell apart 75 / 3, 50 / 2 and 25? They can't really, so it seems that some information is lost along the way because of the masks. That's why I would be interested to see how binary masks perform!
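The ambiguity described above is easy to reproduce with the three cases from the text. The sketch uses exact fractions so that the comparison does not depend on floating point rounding; the variable names are illustrative.

```python
from fractions import Fraction

# (raw age, mask value applied to the age feature)
cases = [
    (75, Fraction(1, 3)),
    (50, Fraction(1, 2)),
    (25, Fraction(1, 1)),
]

# What the next layer receives is the product of the two.
masked_ages = [age * weight for age, weight in cases]

# With a binary mask a selected feature keeps its raw value.
binary_masked_ages = [age * 1 for age, _ in cases]
```

All three soft-masked inputs collapse to the same value, while the binary mask keeps the three ages distinct.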

Proposed Solutions

I'm not quite sure whether there are known solutions for this. Would thresholding a softmax work? Would you add this threshold as a parameter, or would it be learned by the model itself? I'm not even sure that it would
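To make the thresholding idea concrete, here is a minimal sketch with a fixed threshold passed as a parameter. It only shows the forward computation; a hard threshold has zero gradient almost everywhere, so training through it would need an additional trick that is not covered here. The function names and the example values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_mask(logits, threshold):
    """Keep the features whose softmax probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in softmax(logits)]


mask = binary_mask([2.0, 0.0, -1.0, 2.0], threshold=0.25)
```

Here two features are selected, so the mask sums to 2 instead of 1, which is exactly the property discussed above.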

If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!

Create a set of benchmark dataset

Feature request

I created some Research issues that would be interesting to work on. But it's hard to tell whether an idea is a good one without a clear benchmark on different datasets.

So it would be great to have a few notebooks that run on different datasets in order to monitor the performance uplift of a new implementation.

What is the expected behavior?
The idea would be to run this for each improvement proposal and see whether it helped or not.

How should this be implemented in your opinion?
This issue could be closed little by little by adding new notebooks that each perform a benchmark on one well-known dataset.

Or maybe it's a better idea to incorporate TabNet into existing benchmarks like the CatBoost benchmark: https://github.com/catboost/benchmarks

Are you willing to work on this yourself?
yes of course, but any help would be appreciated!

Num workers as parameters

Feature request

In order to improve speed, users could set num_workers directly in the model parameters or in the fit parameters (probably better in the fit parameters).

What is the expected behavior?

This would make it easy for users to use as many workers as possible through the num_workers argument of the torch DataLoader.

What is motivation or use case for adding/changing the behavior?
See #97

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Error with using Pytorch Lr Scheduler

I am trying to use the ReduceLROnPlateau LR scheduler with TabNetRegressor and I am getting the following error:
step() missing 1 required positional argument: 'metrics'

I couldn't find any argument for passing in the metrics, even after going through the TabNet code. Help would be appreciated.
Thanks in advance

Explain to tensor

Feature request

Currently output of explain is of tensor format, should be of numpy.

What is the expected behavior?
Should be numpy array

What is motivation or use case for adding/changing the behavior?
Everyone expects numpy arrays

How should this be implemented in your opinion?
.detach().numpy()

Are you willing to work on this yourself?
yes

Ensuring I have attention right

Hi there! Could you please help me verify that I'm handling attention correctly? I'm working off the fastai implementation, so it would be faster to read up here, but essentially I modified his model so that it can return the masks. It currently looks like this:

learn.model.eval()
for batch_nb, data in enumerate(dl):
  with torch.no_grad():
    out, M_loss, M_explain, masks = learn.model(data[0], data[1], True)
  for key, value in masks.items():
    masks[key] = csc_matrix.dot(value.numpy(), matrix)
  if batch_nb == 0:
    res_explain = csc_matrix.dot(M_explain.numpy(),
                                 matrix)
    res_masks = masks
  else:
    res_explain = np.vstack([res_explain,
                             csc_matrix.dot(M_explain.numpy(),
                                            matrix)])
    for key, value in masks.items():
      res_masks[key] = np.vstack([res_masks[key], value])

From here to plot, I do:

fig, axs = plt.subplots(1, 3, figsize=(20,20))
for i in range(3):
  axs[i].imshow(np.expand_dims(res_masks[0][i], 0))

I chose to use np.expand_dims because it lets us visualize what is going on at the level of an individual item. Is this the correct way to do this sort of analysis? Or should I have done it at the batch level (or does it really make no difference in the end)?

Thanks!

Add CI for Git lint

Add CI to enforce conventional commit : https://www.conventionalcommits.org/en/v1.0.0/

Bug with unordered cat_idx

Describe the bug

If the list of cat_idx is unordered, the corresponding cat_dims used in the embeddings will not match.

What is the current behavior?
The bug appears in the forward of EmbeddingGenerator.
A for loop walks through the features and takes the embedding corresponding to each categorical feature from the self.embeddings list, which is built in the same order as cat_idx.

If the current behavior is a bug, please provide the steps to reproduce.

Provide an unordered cat_idx list with corresponding cat_dims.

Solution

Sort the cat_dims and the corresponding emb_dims with respect to cat_idx

        self.embeddings = torch.nn.ModuleList()

        # Sort dims by cat_idx
        sorted_idxs = np.argsort(cat_idxs)
        cat_dims = [cat_dims[i] for i in sorted_idxs]
        self.cat_emb_dims = [self.cat_emb_dims[i] for i in sorted_idxs]

        for cat_dim, emb_dim in zip(cat_dims, self.cat_emb_dims):
            self.embeddings.append(torch.nn.Embedding(cat_dim, emb_dim))
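
The reordering can be checked outside the model with plain Python. The sketch below is only an illustration: the column indices, cardinalities and embedding sizes are hypothetical, sort_by_cat_idx is a helper name that does not exist in the library, and it assumes cat_emb_dims is already a list with one entry per categorical column.

```python
def sort_by_cat_idx(cat_idxs, cat_dims, cat_emb_dims):
    """Return the three lists reordered so that cat_idxs is ascending."""
    order = sorted(range(len(cat_idxs)), key=lambda i: cat_idxs[i])
    return (
        [cat_idxs[i] for i in order],
        [cat_dims[i] for i in order],
        [cat_emb_dims[i] for i in order],
    )


# Hypothetical input: column 5 has 10 categories, column 2 has 4, column 7 has 3.
cat_idxs = [5, 2, 7]
cat_dims = [10, 4, 3]
cat_emb_dims = [6, 2, 1]

sorted_idxs, sorted_dims, sorted_emb_dims = sort_by_cat_idx(
    cat_idxs, cat_dims, cat_emb_dims
)
# A forward loop that visits columns left to right meets column 2 first,
# so the first embedding must be the one sized for column 2.
print(sorted_idxs, sorted_dims, sorted_emb_dims)
```

Without the sort, the first categorical column met in the forward pass (column 2 here) would be looked up in an embedding sized for column 5.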

Multi-class output in binary classification

The original TabNet classifier by Google is hard-coded to output predictions in a multi-class format, even when num_classes is 2.

Would you know if the above means that:

there are two output neurons in the model
performance is affected for binary classification problems?

Is your implementation similar in this aspect?
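
As general background, and not as a statement about either implementation, a softmax over two logits gives the same probability as a sigmoid applied to the difference of the logits, so a two-neuron head can represent the same predictions as a single-neuron head. The sketch below checks this numerically with hypothetical logit values; softmax2 and sigmoid are helper names defined only for this example.

```python
import math


def softmax2(z0, z1):
    """Numerically stable softmax over exactly two logits."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


z0, z1 = 0.3, 1.7  # hypothetical logits of the two output neurons
p0, p1 = softmax2(z0, z1)
p1_sigmoid = sigmoid(z1 - z0)

# Both routes give the same probability for class 1 (up to rounding).
print(abs(p1 - p1_sigmoid) < 1e-12)
```

Whether training behaves identically in practice also depends on the loss and the optimisation, which this check does not cover.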

Add template for PR and issues

Create network on model instantiation

Feature request

What is the expected behavior?
The network attribute should be created as soon as a model classifier or regressor is instantiated.

What is motivation or use case for adding/changing the behavior?
The network's existence is independent of the fit function and this will help with saving/loading features. None of the network parameters depend on any fit-only information.

How should this be implemented in your opinion?

Are you willing to work on this yourself?
yes

Research : Embedding Aware Attention

Main Problem

When training with large embedding dimensions, the mask size goes up.

One problem I see is that sparsemax does not know which columns come from the same embedded column. This gives the model two things to learn at once, which could be difficult:

create embeddings that make sense
mask embeddings without destroying them; since sparsemax is sparse, it is very unlikely that all the columns from the same embedding are used, so you lose the power of your embedding

Proposed Solutions

It's an open problem, but one approach I see as promising is to create embedding-aware attention.

The idea would be to mask all dimensions from the same embedding the same way, using either the mean or the max of the initial mask.

I implemented a first version here : #92

If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!
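
To make the proposal concrete, here is a minimal sketch of sharing one mask value across all columns of an embedding, using either the max or the mean. It is not necessarily what the linked PR does: group_mask, the column layout and the mask values are all hypothetical and chosen only for illustration.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def group_mask(mask, groups, agg=max):
    """Give every column in the same embedding group the same mask value."""
    out = list(mask)
    for cols in groups:
        shared = agg(mask[c] for c in cols)
        for c in cols:
            out[c] = shared
    return out


# Hypothetical layout: columns 0-2 come from one embedding, column 3 is a
# plain numerical feature, and columns 4-5 come from a second embedding.
mask = [0.5, 0.0, 0.1, 0.2, 0.0, 0.2]
groups = [[0, 1, 2], [3], [4, 5]]

max_mask = group_mask(mask, groups, agg=max)
mean_mask = group_mask(mask, groups, agg=mean)
print(max_mask)
print(mean_mask)
```

With the mean, the total mass of the mask row is unchanged; with the max it generally grows, so a renormalisation step may be needed afterwards.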
