sets2sets's Introduction

Sets2Sets

This is our implementation for the paper:

Haoji Hu and Xiangnan He (2019). Sets2Sets: Learning from Sequential Sets with Neural Networks. In the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), Anchorage, AK, USA

Please cite our paper if you use our code and datasets. Thanks!

@inproceedings{hu2019sets2sets,
  title={Sets2Sets: Learning from Sequential Sets with Neural Networks},
  author={Hu, Haoji and He, Xiangnan},
  booktitle={Proceedings of the 25th ACM SIGKDD international conference on Knowledge discovery and data mining},
  pages={1491--1499},
  year={2019},
  organization={ACM}
}

Author: Haoji Hu

Environment Settings

We use PyTorch to implement our method.

  • Torch version: '1.0.1'
  • Python version: '3.6.8'

A quick start for running the code on the Ta-Feng data set.

Training:

python Sets2Sets.py ./data/TaFang_history.csv ./data/TaFang_future.csv TaFang 2 1 

The above command trains our model on 4 folds of the Ta-Feng data set. The three trailing parameters are the model name, the number of subsequent sets in each training instance, and the mode flag. Our example data supports at most 3 subsequent sets, which matches the setting used for the results reported in our paper. Note that our method can handle a variable number of subsequent sets because it is RNN-based; we fix the number only for the experiments. The flag is 1 for training mode and 0 for test mode. The models learned at different epochs are saved under the folder './models/' (our code creates this folder). We use a default maximum of 20 epochs for demonstration; you can change this if you need more epochs.
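To make the argument order concrete, here is a minimal sketch of how the five positional arguments could be mapped to named fields. It is an illustration only, not the parser used in Sets2Sets.py: the function name `parse_args` and the dictionary keys are hypothetical.

```python
def parse_args(argv):
    """Map the positional arguments described above to named fields.

    Hypothetical helper: the field names are illustrative and are not
    taken from Sets2Sets.py.
    """
    if len(argv) != 6:
        raise SystemExit(
            "usage: python Sets2Sets.py <history_csv> <future_csv> "
            "<model_name> <num_subsequent_sets> <mode_flag>"
        )
    return {
        "history_path": argv[1],
        "future_path": argv[2],
        "model_name": argv[3],
        "num_subsequent_sets": int(argv[4]),
        # 1 selects training mode, 0 selects test mode
        "training": int(argv[5]) == 1,
    }


cfg = parse_args([
    "Sets2Sets.py",
    "./data/TaFang_history.csv",
    "./data/TaFang_future.csv",
    "TaFang", "2", "1",
])
```

With the training command above, `cfg["training"]` is true and `cfg["num_subsequent_sets"]` is 2; switching the last argument to 0 gives the test configuration.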

Test:

python Sets2Sets.py ./data/TaFang_history.csv ./data/TaFang_future.csv TaFang 2 0 

The above command tests the learned model on the remaining fold. Only the mode flag changes, from 1 to 0. The test performance of the model that performs best on the validation set is printed.
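The selection step can be pictured as choosing the epoch with the best validation score. The sketch below assumes one validation score per saved epoch and that higher is better; the function name `select_best_epoch` and the scores are made up for illustration and do not come from the repository.

```python
def select_best_epoch(val_scores):
    """Return the epoch whose validation score is highest.

    Hypothetical helper: assumes `val_scores` maps epoch -> score and
    that a higher score is better.
    """
    if not val_scores:
        raise ValueError("no validation scores available")
    return max(val_scores, key=val_scores.get)


# Made-up scores for three saved epochs.
best_epoch = select_best_epoch({1: 0.31, 2: 0.42, 3: 0.40})
```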

Preprocess the Dunnhumby data set

If you want to try our method on the Dunnhumby data set, please visit the official website, open 'Let's Get Sort-of-Real', and download the data for the randomly selected sample of 50,000 customers. We provide a script to convert their data into the format our method needs. After extracting all the files from the zip file and putting them under a folder (e.g. ./dunnhumby_50k/), please remember to delete the file named time.csv, which our method does not need. Then put our script and the folder './dunnhumby_50k/' at the same level and run the script with the following command:

python Dunnhumby_data_preprocessing.py ./dunnhumby_50k/ past.csv future.csv

The data will be generated under the current folder. You can replace the two files (TaFang_history.csv and TaFang_future.csv) with these two generated files to apply our method to the Dunnhumby data set as before.

Update

We updated the training loss display, since the previous version made it hard to observe the training loss at each epoch. This version also adds model selection to the test step.

Last Update Date: Oct. 1, 2019

sets2sets's Issues

about the softmax function

When predicting the scores for every item in the set, you use softmax to normalize the vector, so the scores sum to 1.
For multi-label classification, should we use the sigmoid function instead?
In my own multi-label classification experiment, applying softmax to the output scores gave worse performance than using no softmax.
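The difference raised in this question can be seen on a toy example. The sketch below uses plain Python rather than the repository's code, and the logits are made up: softmax forces the scores to share a total of 1, while sigmoid scores each item independently.

```python
import math


def softmax(scores):
    """Normalize scores so they are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Squash each score into (0, 1) independently of the others."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


# Two items that both belong to the target set, one that does not.
logits = [2.0, 2.0, -1.0]
soft = softmax(logits)
sig = sigmoid(logits)
```

Here the two relevant items each get a softmax score below 0.5 because they compete for the same probability mass, whereas their sigmoid scores are both high.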

about the training process

It seems that when calculating the WMSE loss, we compute the MSE between the softmax probability vector generated by the decoder and the ground-truth multi-hot vector.
At test time, we take the top-k items to get the multi-hot prediction vector.
Have you tried also taking the top-k from the o(vi) generated by the decoder during training, and then computing the distance between the two multi-hot vectors?
That way, training and testing would be consistent.

Waiting for your response, thank you!
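One point that bears on this question: a hard top-k selection is piecewise constant, so small changes in the scores leave the multi-hot prediction unchanged and give a gradient-based optimizer nothing to follow. The sketch below illustrates this with an unweighted MSE; the weighting used by the paper's WMSE is not reproduced here, and the helper names and numbers are illustrative.

```python
def mse(pred, target):
    """Plain (unweighted) mean squared error between two vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def top_k_multi_hot(scores, k):
    """Return a multi-hot vector marking the k highest-scoring items."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]


target = [1, 1, 0]
scores_a = [0.50, 0.30, 0.20]
scores_b = [0.51, 0.29, 0.20]  # a small perturbation of scores_a

# The loss on the soft scores reacts to the perturbation ...
soft_loss_a = mse(scores_a, target)
soft_loss_b = mse(scores_b, target)

# ... while the loss on the hard top-k vectors does not change at all.
hard_loss_a = mse(top_k_multi_hot(scores_a, 2), target)
hard_loss_b = mse(top_k_multi_hot(scores_b, 2), target)
```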

Details about the OPTUM dataset

Can you give more details about the OPTUM dataset used in this paper?
I want to know where I can get this dataset. Thank you very much.

Bugs occurred in other datasets

I hit the following error on the T-mall dataset:

(base) wzk@ddst:~/work/Sets2Sets$ python Sets2Sets.py ./data/alibaba_history.csv ./data/alibaba_future.csv 1 2 1
start dictionary generation...
{'MATERIAL_NUMBER': 9531}
# dimensions of final vector: 9531 | 2962
finish dictionary generation*****
num of vectors having entries more than 1: 16462
num of vectors having entries more than 1: 15275
Traceback (most recent call last):
  File "Sets2Sets.py", line 990, in <module>
    main(sys.argv)
  File "Sets2Sets.py", line 955, in main
    codes_freq = get_codes_frequency_no_vector(data_chunk[past_chunk],input_size,data_chunk[future_chunk].keys())
  File "Sets2Sets.py", line 935, in get_codes_frequency_no_vector
    for idx in X[pid]:
KeyError: '371250'

Has anyone run into this before? I would really appreciate any help.
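The traceback shows the lookup `X[pid]` failing while iterating over the keys of the future chunk, which suggests (but does not prove) that the ID '371250' appears in the future file without a matching entry in the history data. A quick way to check this before training is sketched below. It assumes both files have been loaded into dictionaries keyed by ID; the helper `find_missing_ids` and the toy data are hypothetical, not part of Sets2Sets.py.

```python
def find_missing_ids(history, future):
    """Return IDs that occur in `future` but have no entry in `history`.

    Hypothetical helper: assumes both arguments are dicts keyed by ID.
    """
    return sorted(pid for pid in future if pid not in history)


# Toy data: '371250' has future sets but no history.
history = {"1001": [[3, 7]], "1002": [[5]]}
future = {"1001": [[7]], "371250": [[2]]}

missing = find_missing_ids(history, future)
```

If the list is non-empty on the real data, dropping those IDs from the future file (or adding their history) would be one way to make the two files consistent.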

About the two parts of loss function

Hi, your loss function contains the WMSE and PSE parts. Have you run experiments with each part alone? Can I use only the WMSE part for a multi-label classification task?
Waiting for your response, thank you!
