sjvasquez / instacart-basket-prediction

Kaggle | Instacart Market Basket Analysis🥕🥉

Python 99.44% Shell 0.56%

Instacart Market Basket Analysis

My solution for the Instacart Market Basket Analysis competition hosted on Kaggle.

The Task

The dataset is an open-source dataset provided by Instacart (source):

This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders.

Below is the full data schema (source)

orders (3.4m rows, 206k users):

order_id: order identifier

user_id: customer identifier

eval_set: which evaluation set this order belongs in (see SET described below)

order_number: the order sequence number for this user (1 = first, n = nth)

order_dow: the day of the week the order was placed on

order_hour_of_day: the hour of the day the order was placed on

days_since_prior_order: days since the last order, capped at 30 (with NAs for order_number = 1)

products (50k rows):

product_id: product identifier

product_name: name of the product

aisle_id: foreign key

department_id: foreign key

aisles (134 rows):

aisle_id: aisle identifier

aisle: the name of the aisle

departments (21 rows):

department_id: department identifier

department: the name of the department

order_products__SET (30m+ rows):

order_id: foreign key

product_id: foreign key

add_to_cart_order: order in which each product was added to cart

reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the three following evaluation sets (eval_set in orders):

"prior": orders prior to that user's most recent order (~3.2m orders)

"train": training data supplied to participants (~131k orders)

"test": test data reserved for machine learning competitions (~75k orders)

The task is to predict which products a user will reorder in their next order. The evaluation metric is the F1-score between the set of predicted products and the set of true products.
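As a rough illustration of the metric, the F1-score between two product sets can be computed as below. The function name `basket_f1` is hypothetical, and scoring an empty prediction against an empty true basket as 1.0 is an assumption made for this sketch, not something stated above.

```python
def basket_f1(predicted, actual):
    """F1-score between a predicted set of product ids and the true set."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        # Assumption: predicting nothing for an empty basket counts as perfect.
        return 1.0
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / float(len(predicted))
    recall = true_positives / float(len(actual))
    return 2 * precision * recall / (precision + recall)


# Two of three predicted products are correct, and two of three true
# products are found, so precision = recall = F1 = 2/3.
score = basket_f1({1, 2, 3}, {2, 3, 4})
```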

The Approach

The task was reformulated as a binary prediction task: Given a user, a product, and the user's prior purchase history, predict whether or not the given product will be reordered in the user's next order. In short, the approach was to fit a variety of generative models to the prior data and use the internal representations from these models as features to second-level models.
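A minimal sketch of the binary reformulation, assuming orders are held in plain dictionaries rather than the repository's own data structures: every product a user has bought before becomes one (user, product) row, labeled 1 if it appears in the next order. The name `build_candidates` and the input layout are hypothetical.

```python
def build_candidates(prior_orders, next_order):
    """Turn order histories into (user_id, product_id, label) rows.

    prior_orders: dict mapping user_id -> list of orders, each a list of product ids
    next_order:   dict mapping user_id -> set of product ids in the next order
    """
    rows = []
    for user_id, orders in sorted(prior_orders.items()):
        # Only products the user has already bought can be *re*ordered.
        seen = set()
        for order in orders:
            seen.update(order)
        target = next_order.get(user_id, set())
        for product_id in sorted(seen):
            rows.append((user_id, product_id, int(product_id in target)))
    return rows


rows = build_candidates({1: [[10, 20], [20, 30]]}, {1: {20, 40}})
# Product 40 is new to the user, so it is not a reorder candidate.
```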

First-level models

The first-level models vary in their inputs, architectures, and objectives, resulting in a diverse set of representations.

Product RNN/CNN (code): a combined RNN and CNN trained to predict the probability that a user will order a product at each timestep. The RNN is a single-layer LSTM and the CNN is a 6-layer causal CNN with dilated convolutions.
Aisle RNN (code): an RNN similar to the first model, but trained at the aisle level (predict whether a user purchases any products from a given aisle at each timestep).
Department RNN (code): an RNN trained at the department level.
Product RNN mixture model (code): an RNN similar to the first model, but instead trained to maximize the likelihood of a Bernoulli mixture model.
Order size RNN (code): an RNN trained to predict the next order size, minimizing RMSE.
Order size RNN mixture model (code): an RNN trained to predict the next order size, maximizing the likelihood of a Gaussian mixture model.
Skip-Gram with Negative Sampling (SGNS) (code): SGNS trained on sequences of ordered products.
Non-Negative Matrix Factorization (NNMF) (code): NNMF trained on a matrix of user-product order counts.
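
To make the "causal CNN with dilated convolutions" in the first model more concrete, here is a plain-Python sketch of a single causal dilated 1-D convolution. It is not the repository's TensorFlow layer. The kernel layout (kernel[0] multiplies the current timestep) and the dilation rates 1, 2, 4, 8, 16, 32 used in the receptive-field example are illustrative assumptions; the description above only says the CNN has 6 layers.

```python
def causal_dilated_conv1d(inputs, kernel, dilation=1):
    """output[t] = sum_j kernel[j] * inputs[t - j * dilation].

    Positions before the start of the sequence are treated as zeros,
    so output[t] never depends on inputs after timestep t.
    """
    outputs = []
    for t in range(len(inputs)):
        total = 0.0
        for j, weight in enumerate(kernel):
            index = t - j * dilation
            if index >= 0:
                total += weight * inputs[index]
        outputs.append(total)
    return outputs


def receptive_field(kernel_width, dilations):
    """Number of past timesteps (including the current one) visible to the top of a stack."""
    return 1 + (kernel_width - 1) * sum(dilations)


outputs = causal_dilated_conv1d([1, 2, 3, 4], [1, 1], dilation=2)
# Hypothetical 6-layer stack with kernel width 2 and doubling dilation rates.
field = receptive_field(2, [1, 2, 4, 8, 16, 32])
```

Stacking layers with growing dilation rates lets the receptive field grow much faster than the number of layers, which is the usual motivation for this design.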

Second-level models

The second-level models use the internal representations from the first-level models as features.

GBM (code): a LightGBM model.
Feedforward NN (code): a feedforward neural network.

The final reorder probabilities are a weighted average of the outputs from the second-level models. The final basket is chosen by using these probabilities and choosing the product subset with maximum expected F1-score.
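
The blending and basket-selection steps can be sketched as follows. This is a brute-force illustration, not the repository's implementation: it assumes purchases are independent given the probabilities, only considers the top-k products by probability as candidate baskets, and enumerates all outcomes, which is only feasible for a handful of products. The blend weights, function names, and the convention that an empty prediction for an empty basket scores 1.0 are all assumptions.

```python
from itertools import product as cartesian_product


def blend(prob_lists, weights):
    """Weighted average of several models' probability lists."""
    total = float(sum(weights))
    return [sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
            for i in range(len(prob_lists[0]))]


def f1(predicted, actual):
    if not predicted and not actual:
        return 1.0  # assumption: empty prediction for an empty basket is perfect
    true_positives = len(predicted & actual)
    return 2.0 * true_positives / (len(predicted) + len(actual))


def expected_f1(predicted, probs):
    """Exact expected F1 under independence, by enumerating every possible basket."""
    items = list(probs)
    expectation = 0.0
    for outcome in cartesian_product([0, 1], repeat=len(items)):
        likelihood = 1.0
        actual = set()
        for item, bought in zip(items, outcome):
            likelihood *= probs[item] if bought else 1.0 - probs[item]
            if bought:
                actual.add(item)
        expectation += likelihood * f1(predicted, actual)
    return expectation


def best_basket(probs):
    """Pick the top-k prefix (k = 0..n) with the highest expected F1."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    candidates = [set(ranked[:k]) for k in range(len(ranked) + 1)]
    return max(candidates, key=lambda basket: expected_f1(basket, probs))


blended = blend([[0.8, 0.2], [0.6, 0.4]], weights=[3, 1])
basket = best_basket({"a": 0.9, "b": 0.3})
```

Because the basket size is chosen per user, this differs from applying one fixed probability threshold to every product.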

Requirements

64 GB RAM and 12 GB GPU (recommended), Python 2.7

Python packages:

lightgbm==2.0.4
numpy==1.13.1
pandas==0.19.2
scikit-learn==0.18.1
tensorflow==1.3.0

instacart-basket-prediction's Issues

Can I run it with 16 GB of RAM and an 8 GB GPU?

Building product Wavenet model

Wavenet calls temporal_convolution_layer, which throws an exception (something about an int vs. float tensor being expected).
A workaround is to set causal=False.

Can you please fix this?

Why did you not use an LSTM

Why did you not use an LSTM in the Product RNN mixture model (code)? It is described as an RNN similar to the first model, but instead trained to maximize the likelihood of a Bernoulli mixture model.

print() is a function in Python 3

flake8 testing of https://github.com/sjvasquez/instacart-basket-prediction on Python 3.6.2

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./models/tf_base_model.py:101:27: E999 SyntaxError: invalid syntax
        print 'built graph'
                          ^

./models/blend/nn_blend.py:27:16: E999 SyntaxError: invalid syntax
        print df.shapes()
               ^

./models/nnmf/nnmf.py:22:26: E999 SyntaxError: invalid syntax
        print 'train size', len(self.train_df)
                         ^

./models/rnn_aisle/prepare_aisle_data.py:36:19: E999 SyntaxError: invalid syntax
            print i, num_rows
                  ^

./models/rnn_aisle/rnn_aisle.py:34:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_department/prepare_department_data.py:35:19: E999 SyntaxError: invalid syntax
            print i, num_rows
                  ^

./models/rnn_department/rnn_department.py:33:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_order/prepare_order_size_data.py:33:21: E999 SyntaxError: invalid syntax
            print idx
                    ^

./models/rnn_order/rnn_order_size.py:41:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_order/rnn_order_size_gmm.py:41:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_product/prepare_product_data.py:70:19: E999 SyntaxError: invalid syntax
            print i, num_rows
                  ^

./models/rnn_product/rnn_product.py:41:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_product/rnn_product_bmm.py:40:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/sgns/prepare_sgns_data.py:14:19: E999 SyntaxError: invalid syntax
            print _
                  ^

./models/sgns/sgns.py:23:26: E999 SyntaxError: invalid syntax
        print 'train size', len(self.train_df)
                         ^

./preprocessing/create_aisle_data.py:30:19: E999 SyntaxError: invalid syntax
            print _
                  ^

./preprocessing/create_department_data.py:26:19: E999 SyntaxError: invalid syntax
            print _
                  ^

./preprocessing/create_product_data.py:37:19: E999 SyntaxError: invalid syntax
            print _
                  ^

18    E999 SyntaxError: invalid syntax
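
One possible way to make such statements valid on both Python 2.7 and Python 3 is to format the message first and pass a single string to print. The helper name `progress_message` is hypothetical; it only mirrors the `print i, num_rows` pattern reported above and is not code from the repository.

```python
def progress_message(i, num_rows):
    # Build the text first so that print receives exactly one argument;
    # print(one_string) behaves the same on Python 2.7 and Python 3.
    return '{} {}'.format(i, num_rows)


print(progress_message(3, 10))  # replaces: print i, num_rows
print('built graph')            # replaces: print 'built graph'
```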

CUDA memory error

When running rnn_product.py I get the following error, although it seems that my GPU has enough memory. Any idea?

trainable parameter count:
79772859
2017-12-22 23:31:59.579386: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-12-22 23:31:59.594986: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-22 23:32:05.288986: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6575
pciBusID 0000:01:00.0
Total memory: 11.00GiB
Free memory: 10.71GiB
2017-12-22 23:32:05.288986: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-12-22 23:32:05.288986: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y
2017-12-22 23:32:05.429386: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
2017-12-22 23:32:06.131386: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to allocate 10.17G (10922166272 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-12-22 23:32:06.599386: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to allocate 9.15G (9829949440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-12-22 23:32:07.332586: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to allocate 8.24G (8846954496 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
built graph

Errors in rnn_order_size_gmm.py

Please check line 159. I think log_likelihood should be a 0-dimensional tensor, right?

log_likelihood = -tf.log(tf.reduce_sum(mixing_coefs*n_likelihoods, axis=2) + 1e-10)

Another error in this file: the definition of data_cols on line 17.
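
To illustrate the shape question, here is a plain-Python sketch of a Gaussian mixture negative log-likelihood, assuming inputs laid out as [batch][time][component]. Summing over the component axis (axis=2 in the quoted line) leaves one value per example and timestep; a scalar loss needs a further reduction such as a mean. Whether the repository applies that reduction elsewhere is not shown in the quoted line, and every name below is hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_nll(target, mixing_coefs, means, stds, eps=1e-10):
    """Negative log-likelihood of one scalar target under a Gaussian mixture."""
    likelihood = sum(pi * gaussian_pdf(target, mu, sigma)
                     for pi, mu, sigma in zip(mixing_coefs, means, stds))
    return -math.log(likelihood + eps)


def per_step_nll(targets, mixing_coefs, means, stds):
    """targets: [batch][time]; other arguments: [batch][time][component].

    The result is still [batch][time]: only the component axis is reduced.
    """
    return [[mixture_nll(targets[b][t], mixing_coefs[b][t], means[b][t], stds[b][t])
             for t in range(len(targets[b]))]
            for b in range(len(targets))]


def scalar_loss(nll):
    """Reduce the [batch][time] values to a single number by averaging."""
    values = [value for row in nll for value in row]
    return sum(values) / len(values)


nll = per_step_nll([[0.0, 1.0]],
                   [[[0.5, 0.5], [0.5, 0.5]]],
                   [[[0.0, 1.0], [0.0, 1.0]]],
                   [[[1.0, 1.0], [1.0, 1.0]]])
loss = scalar_loss(nll)
```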

predictions_bmm does not exist

https://github.com/sjvasquez/instacart-basket-prediction/blob/master/models/blend/prepare_blend_data.py#L43

The script tries to load predictions.npy, but this file is never generated because "predictions" is not a key in the prediction_tensor dictionary:

https://github.com/sjvasquez/instacart-basket-prediction/blob/master/models/rnn_product/rnn_product_bmm.py#L222

As a result, prepare_blend_data dies with a file-not-found error.

I'm not sure what 'predictions' means in the context of the rnn_product_bmm model, so I have set 'predictions': final_states as a stop-gap for now. It would be great to know what you mean by predictions (or what you used for your predictions.npy file in your blended model).