
instacart-basket-prediction's Introduction

Instacart Market Basket Analysis

My solution for the Instacart Market Basket Analysis competition hosted on Kaggle.

The Task

The dataset is an open-source dataset provided by Instacart (source). Instacart describes it as follows:

This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders.

Below is the full data schema (source)

orders (3.4m rows, 206k users):

  • order_id: order identifier
  • user_id: customer identifier
  • eval_set: which evaluation set this order belongs in (see SET described below)
  • order_number: the order sequence number for this user (1 = first, n = nth)
  • order_dow: the day of the week the order was placed on
  • order_hour_of_day: the hour of the day the order was placed on
  • days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)

products (50k rows):

  • product_id: product identifier
  • product_name: name of the product
  • aisle_id: foreign key
  • department_id: foreign key

aisles (134 rows):

  • aisle_id: aisle identifier
  • aisle: the name of the aisle

departments (21 rows):

  • department_id: department identifier
  • department: the name of the department

order_products__SET (30m+ rows):

  • order_id: foreign key
  • product_id: foreign key
  • add_to_cart_order: order in which each product was added to cart
  • reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the three following evaluation sets (eval_set in orders):

  • "prior": orders prior to that user's most recent order (~3.2m orders)
  • "train": training data supplied to participants (~131k orders)
  • "test": test data reserved for machine learning competitions (~75k orders)

The task is to predict which products a user will reorder in their next order. The evaluation metric is the F1-score between the set of predicted products and the set of true products.
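As a rough illustration of the metric, the sketch below computes the F1-score between one predicted basket and one true basket. The function name basket_f1, the product ids, and the handling of two empty baskets are assumptions made for this example, not details taken from the competition code.

```python
def basket_f1(predicted, actual):
    """F1-score between a predicted set and a true set of product ids."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        # Assumed convention: two empty baskets count as a perfect match.
        return 1.0
    true_positives = len(predicted & actual)
    # F1 = 2 * TP / (|predicted| + |actual|), which equals the harmonic
    # mean of precision and recall.
    return 2.0 * true_positives / (len(predicted) + len(actual))


# Hypothetical product ids: two of the three predictions are correct,
# and two of the four true products are recovered.
score = basket_f1({196, 12427, 10258}, {196, 10258, 25133, 13032})
print(round(score, 4))
```

Here precision is 2/3 and recall is 2/4, so the score is 4/7.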

The Approach

The task was reformulated as a binary prediction task: Given a user, a product, and the user's prior purchase history, predict whether or not the given product will be reordered in the user's next order. In short, the approach was to fit a variety of generative models to the prior data and use the internal representations from these models as features to second-level models.
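The binary reformulation can be sketched on toy data as follows. This is only an illustration under assumptions: the rows are invented, the tuple layout (user_id, eval_set, product_id) is a simplification of the schema above, and build_examples is a hypothetical helper rather than a function from the repository.

```python
from collections import defaultdict

# Hypothetical toy rows: (user_id, eval_set, product_id).
order_rows = [
    (1, "prior", 196), (1, "prior", 14084), (1, "prior", 196),
    (1, "train", 196), (1, "train", 25133),
    (2, "prior", 33120), (2, "prior", 28985),
    (2, "train", 28985),
]


def build_examples(rows):
    """Return {(user_id, product_id): label} for every previously bought product.

    The label is 1 if the product appears in the user's "train" order
    (it was reordered) and 0 otherwise.
    """
    prior = defaultdict(set)
    target = defaultdict(set)
    for user_id, eval_set, product_id in rows:
        if eval_set == "prior":
            prior[user_id].add(product_id)
        elif eval_set == "train":
            target[user_id].add(product_id)
    examples = {}
    for user_id, products in prior.items():
        for product_id in products:
            examples[(user_id, product_id)] = int(product_id in target[user_id])
    return examples


examples = build_examples(order_rows)
for key in sorted(examples):
    print(key, examples[key])
```

Product 25133 produces no example for user 1 because it never appears in that user's prior orders, so it cannot be a reorder.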

First-level models

The first-level models vary in their inputs, architectures, and objectives, resulting in a diverse set of representations.

  • Product RNN/CNN (code): a combined RNN and CNN trained to predict the probability that a user will order a product at each timestep. The RNN is a single-layer LSTM and the CNN is a 6-layer causal CNN with dilated convolutions.
  • Aisle RNN (code): an RNN similar to the first model, but trained at the aisle level (predict whether a user purchases any products from a given aisle at each timestep).
  • Department RNN (code): an RNN trained at the department level.
  • Product RNN mixture model (code): an RNN similar to the first model, but instead trained to maximize the likelihood of a Bernoulli mixture model.
  • Order size RNN (code): an RNN trained to predict the next order size, minimizing RMSE.
  • Order size RNN mixture model (code): an RNN trained to predict the next order size, maximizing the likelihood of a Gaussian mixture model.
  • Skip-Gram with Negative Sampling (SGNS) (code): SGNS trained on sequences of ordered products.
  • Non-Negative Matrix Factorization (NNMF) (code): NNMF trained on a matrix of user-product order counts.
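To make the "causal CNN with dilated convolutions" in the first model more concrete, here is a minimal pure-Python sketch of a single causal dilated 1-D convolution. It is not the repository's TensorFlow layer. The function names, the kernel size of 2, and the doubling dilation schedule are assumptions chosen for illustration; the description above only states that the CNN has 6 layers.

```python
def causal_dilated_conv1d(inputs, weights, dilation):
    """Apply a 1-D causal convolution with the given dilation.

    output[t] = sum_k weights[k] * inputs[t - k * dilation],
    where positions before the start of the sequence count as zero,
    so output[t] never depends on future timesteps.
    """
    outputs = []
    for t in range(len(inputs)):
        total = 0.0
        for k, w in enumerate(weights):
            source = t - k * dilation
            if source >= 0:
                total += w * inputs[source]
        outputs.append(total)
    return outputs


def receptive_field(kernel_size, dilations):
    """Number of timesteps visible to one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
out = causal_dilated_conv1d(signal, weights=[1.0, 1.0], dilation=2)
# Assumed schedule: dilation doubles at each of the 6 layers.
field = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8, 16, 32])
print(out, field)
```

Under these assumptions, each output at the top of the stack would see 64 timesteps, which is why stacking dilated layers is an inexpensive way to cover a long order history.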

Second-level models

The second-level models use the internal representations from the first-level models as features.

  • GBM (code): a LightGBM model.
  • Feedforward NN (code): a feedforward neural network.

The final reorder probabilities are a weighted average of the outputs from the second-level models. The final basket is chosen by using these probabilities and choosing the product subset with maximum expected F1-score.
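The blending and basket selection steps can be sketched as below. This is an illustrative brute-force version, not the author's implementation. It assumes reorders are independent given their probabilities, enumerates all 2^n outcomes (so it is only practical for a handful of products), and restricts the search to top-k prefixes of the products sorted by probability. The names blend, expected_f1, and best_basket, the weights, and the probabilities are all hypothetical.

```python
from itertools import product as cartesian


def blend(prob_lists, weights):
    """Weighted average of per-model probabilities for the same products."""
    total = float(sum(weights))
    return [sum(w * p for w, p in zip(weights, probs)) / total
            for probs in zip(*prob_lists)]


def f1(selected, outcome):
    """F1 between two 0/1 indicator lists of equal length."""
    tp = sum(1 for s, o in zip(selected, outcome) if s and o)
    n_sel, n_true = sum(selected), sum(outcome)
    if n_sel + n_true == 0:
        return 1.0  # assumed convention: empty prediction, empty basket
    return 2.0 * tp / (n_sel + n_true)


def expected_f1(selected, probs):
    """Exact expectation over all outcomes, assuming independent reorders."""
    total = 0.0
    for outcome in cartesian([0, 1], repeat=len(probs)):
        weight = 1.0
        for o, p in zip(outcome, probs):
            weight *= p if o else 1.0 - p
        total += weight * f1(selected, outcome)
    return total


def best_basket(probs):
    """Return the top-k prefix (as product indices) with the highest expected F1."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    best_k, best_score = 0, expected_f1([0] * len(probs), probs)
    for k in range(1, len(probs) + 1):
        selected = [0] * len(probs)
        for i in order[:k]:
            selected[i] = 1
        score = expected_f1(selected, probs)
        if score > best_score:
            best_k, best_score = k, score
    return sorted(order[:best_k]), best_score


# Two hypothetical second-level models scoring the same three products.
probs = blend([[0.9, 0.3, 0.1], [0.7, 0.5, 0.1]], weights=[0.5, 0.5])
basket, score = best_basket(probs)
print(basket, round(score, 4))
```

In this toy case the second product has a probability of only 0.4, yet including it raises the expected F1 from 0.672 to about 0.678, which shows why a fixed 0.5 threshold is not the same as maximizing expected F1.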

Requirements

64 GB RAM and 12 GB GPU (recommended), Python 2.7

Python packages:

  • lightgbm==2.0.4
  • numpy==1.13.1
  • pandas==0.19.2
  • scikit-learn==0.18.1
  • tensorflow==1.3.0

instacart-basket-prediction's Issues

Building product Wavenet model

Wavenet is calling temporal_convolution_layer which throws an exception (something about int vs. float tensor expected).
A workaround is to set causal=False.

Can you please fix?

Why did you not use an LSTM?

Why did you not use an LSTM in the following model?
Product RNN mixture model (code): an RNN similar to the first model, but instead trained to maximize the likelihood of a Bernoulli mixture model

print() is a function in Python 3

flake8 testing of https://github.com/sjvasquez/instacart-basket-prediction on Python 3.6.2

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./models/tf_base_model.py:101:27: E999 SyntaxError: invalid syntax
        print 'built graph'
                          ^

./models/blend/nn_blend.py:27:16: E999 SyntaxError: invalid syntax
        print df.shapes()
               ^

./models/nnmf/nnmf.py:22:26: E999 SyntaxError: invalid syntax
        print 'train size', len(self.train_df)
                         ^

./models/rnn_aisle/prepare_aisle_data.py:36:19: E999 SyntaxError: invalid syntax
            print i, num_rows
                  ^

./models/rnn_aisle/rnn_aisle.py:34:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_department/prepare_department_data.py:35:19: E999 SyntaxError: invalid syntax
            print i, num_rows
                  ^

./models/rnn_department/rnn_department.py:33:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_order/prepare_order_size_data.py:33:21: E999 SyntaxError: invalid syntax
            print idx
                    ^

./models/rnn_order/rnn_order_size.py:41:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_order/rnn_order_size_gmm.py:41:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_product/prepare_product_data.py:70:19: E999 SyntaxError: invalid syntax
            print i, num_rows
                  ^

./models/rnn_product/rnn_product.py:41:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/rnn_product/rnn_product_bmm.py:40:18: E999 SyntaxError: invalid syntax
        print self.test_df.shapes()
                 ^

./models/sgns/prepare_sgns_data.py:14:19: E999 SyntaxError: invalid syntax
            print _
                  ^

./models/sgns/sgns.py:23:26: E999 SyntaxError: invalid syntax
        print 'train size', len(self.train_df)
                         ^

./preprocessing/create_aisle_data.py:30:19: E999 SyntaxError: invalid syntax
            print _
                  ^

./preprocessing/create_department_data.py:26:19: E999 SyntaxError: invalid syntax
            print _
                  ^

./preprocessing/create_product_data.py:37:19: E999 SyntaxError: invalid syntax
            print _
                  ^

18    E999 SyntaxError: invalid syntax
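One possible way to address errors of this kind, sketched here rather than taken from the repository, is to pass a single pre-formatted string to print with parentheses. That form parses on both Python 2.7 and Python 3. The helper name progress_message is hypothetical.

```python
def progress_message(i, num_rows):
    """Format a progress line as one string, e.g. '3 10'."""
    return '{} {}'.format(i, num_rows)


# Python 2 only:    print i, num_rows
# Python 2 and 3:   print(progress_message(i, num_rows))
print(progress_message(3, 10))

# Python 2 only:    print 'built graph'
# Python 2 and 3:   print('built graph')
print('built graph')
```

Formatting first avoids the Python 2 pitfall where print(i, num_rows) would print a tuple instead of two space-separated values.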

CUDA memory error

When running rnn_product.py I get the following error, although it seems that my GPU has enough memory. Any idea?

trainable parameter count:
79772859
2017-12-22 23:31:59.579386: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-12-22 23:31:59.594986: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-22 23:32:05.288986: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.6575
pciBusID 0000:01:00.0
Total memory: 11.00GiB
Free memory: 10.71GiB
2017-12-22 23:32:05.288986: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-12-22 23:32:05.288986: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y
2017-12-22 23:32:05.429386: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
2017-12-22 23:32:06.131386: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to allocate 10.17G (10922166272 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-12-22 23:32:06.599386: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to allocate 9.15G (9829949440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-12-22 23:32:07.332586: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:924] failed to allocate 8.24G (8846954496 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
built graph

errors in file rnn_order_size_gmm.py

  1. Please check line 159. I think log_likelihood should be a 0-dimensional tensor, right?
log_likelihood = -tf.log(tf.reduce_sum(mixing_coefs*n_likelihoods, axis=2) + 1e-10)
  2. Another error in this file: the definition of data_cols on line 17.
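The shape question in the first point can be illustrated without TensorFlow. The sketch below assumes mixing_coefs and n_likelihoods are shaped [batch][time][components], which the issue does not state. Summing over the last axis then leaves one value per (batch, time) position, and a further reduction such as a mean is needed to obtain a scalar. Whether the file applies such a reduction elsewhere is not shown here. The names mixture_nll and mean_loss and the input values are hypothetical.

```python
import math


def mixture_nll(mixing_coefs, component_likelihoods, eps=1e-10):
    """Per-timestep negative log-likelihood, shaped [batch][time].

    Mirrors -log(sum_k pi_k * p_k + eps), with the sum over the last axis.
    """
    return [
        [-math.log(sum(pi * p for pi, p in zip(pis, ps)) + eps)
         for pis, ps in zip(batch_pis, batch_ps)]
        for batch_pis, batch_ps in zip(mixing_coefs, component_likelihoods)
    ]


def mean_loss(nll):
    """Reduce the [batch][time] values to one scalar by averaging."""
    values = [v for row in nll for v in row]
    return sum(values) / len(values)


# Assumed shapes: 1 sequence, 2 timesteps, 2 mixture components.
mixing = [[[0.5, 0.5], [0.25, 0.75]]]
likelihoods = [[[0.2, 0.6], [0.4, 0.4]]]
nll = mixture_nll(mixing, likelihoods)
loss = mean_loss(nll)
print(len(nll), len(nll[0]), round(loss, 4))
```

Both timesteps have a mixture likelihood of 0.4, so the averaged loss is approximately -log(0.4).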

predictions_bmm does not exist

In

https://github.com/sjvasquez/instacart-basket-prediction/blob/master/models/blend/prepare_blend_data.py#L43

you look to load predictions.npy - but this file is never generated because "predictions" is not a key in the prediction_tensor dictionary

https://github.com/sjvasquez/instacart-basket-prediction/blob/master/models/rnn_product/rnn_product_bmm.py#L222

as a result, the prepare_blend_data dies due to a file not found error

I'm not sure what 'predictions' means in the rnn_product_bmm model context, so I have set 'predictions': final_states as a stop-gap for now. It would be great to know what you mean by predictions (or what you used for your predictions.npy file in your blended model).
