johnson0722 / ctr_prediction
CTR prediction using FM FFM and DeepFM
CTR_Prediction/Deep_FM/utilities.py, lines 26 and 58 at ec4de3d
I am getting this error in DeepFM.py. Can anyone please help?
Hello. In the FFM algorithm, I don't quite understand why a field has to be added?
Hello, is there matching sample data for the code?
Many thanks!
Where does the train_sparse_data_frac_0.01.pkl file used in FM/FM.py come from?
import logging
import pickle

import numpy as np
import tensorflow as tf

def train_model(sess, model, epochs=10, print_every=50):
    """Training model. Assumes `feature_length` and `saver` are defined at module level."""
    # Merge all the summaries and write them out to train_logs
    merged = tf.summary.merge_all()
    train_writer = tf.summary.FileWriter('train_logs', sess.graph)
    # get sparse training data
    with open('../avazu_CTR/train_sparse_data_frac_0.01.pkl', 'rb') as f:
        sparse_data_fraction = pickle.load(f)
    # get number of batches
    num_batches = len(sparse_data_fraction)
    for e in range(epochs):
        num_samples = 0
        losses = []
        for ibatch in range(num_batches):
            # batch_size data
            batch_y = np.array(sparse_data_fraction[ibatch]['labels'])
            actual_batch_size = len(batch_y)
            batch_indexes = np.array(sparse_data_fraction[ibatch]['indexes'], dtype=np.int64)
            batch_shape = np.array([actual_batch_size, feature_length], dtype=np.int64)
            batch_values = np.ones(len(batch_indexes), dtype=np.float32)
            # create a feed dictionary for this batch
            feed_dict = {model.X: (batch_indexes, batch_values, batch_shape),
                         model.y: batch_y,
                         model.keep_prob: 1.0}
            loss, accuracy, summary, global_step, _ = sess.run(
                [model.loss, model.accuracy, merged, model.global_step, model.train_op],
                feed_dict=feed_dict)
            # aggregate performance stats
            losses.append(loss * actual_batch_size)
            num_samples += actual_batch_size
            # record summaries and training-set accuracy
            train_writer.add_summary(summary, global_step=global_step)
            # print training loss and accuracy
            if global_step % print_every == 0:
                logging.info("Iteration {0}: with minibatch training loss = {1} and accuracy of {2}"
                             .format(global_step, loss, accuracy))
                saver.save(sess, "checkpoints/model", global_step=global_step)
        # print loss of one epoch
        total_loss = np.sum(losses) / num_samples
        print("Epoch {1}, Overall loss = {0:.3g}".format(total_loss, e + 1))
Hello @Johnson0722, I hit this error when I run FM.py. I use the same dataset as you. Can you give me some help?
`Caused by op 'interaction_layer/SparseTensorDenseMatMul/SparseTensorDenseMatMul', defined at:
File "FM.py", line 223, in <module>
model.build_graph()
File "FM.py", line 94, in build_graph
self.inference()
File "FM.py", line 61, in inference
tf.pow(tf.sparse_tensor_dense_matmul(self.X, v), 2),
File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\sparse_ops.py", line 1822, in sparse_tensor_dense_matmul
adjoint_b=adjoint_b)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_sparse_ops.py", line 3213, in sparse_tensor_dense_mat_mul
name=name)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3155, in create_op
op_def=op_def)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): k (303) from index[19,1] out of bounds (>=303)
[[Node: interaction_layer/SparseTensorDenseMatMul/SparseTensorDenseMatMul = SparseTensorDenseMatMul[T=DT_FLOAT, Tindices=DT_INT64, adjoint_a=false, adjoint_b=false, _device="/job:localhost/replica:0/task:0/devi
ce:CPU:0"](_arg_Placeholder_2_0_2, _arg_Placeholder_1_0_1, _arg_Placeholder_0_0, interaction_layer/v/read)]]
`
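The `k (303) from index[19,1] out of bounds (>=303)` message usually means one of the sparse feature indexes being fed in is at least as large as `feature_length`, the row count of the latent matrix `v`. A minimal sanity check one could run on each batch before feeding it (the helper name is hypothetical):

```python
import numpy as np

def check_sparse_batch(batch_indexes, feature_length):
    """Verify every feature index in a sparse batch is within range.

    batch_indexes: array of shape [nnz, 2] with (row_in_batch, feature_index)
    feature_length: number of rows in the FM latent matrix v
    """
    max_idx = batch_indexes[:, 1].max()
    if max_idx >= feature_length:
        raise ValueError(
            "feature index %d >= feature_length %d; "
            "rebuild the feature dictionaries or enlarge feature_length"
            % (max_idx, feature_length))
    return True

# Example: a batch reproducing the error above (index 303 with feature_length=303)
bad = np.array([[0, 5], [19, 303]], dtype=np.int64)
try:
    check_sparse_batch(bad, feature_length=303)
except ValueError as e:
    print(e)
```

If this check fires, the usual cause is that `feature_length` was computed from a different (smaller) dictionary than the one used to build the pickled sparse data.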
tf.reduce_sum(tf.multiply(v[i,self.feature2field[i]], v[j,self.feature2field[j]])),
This should instead be:
tf.reduce_sum(tf.multiply(v[i,self.feature2field[j]], v[j,self.feature2field[i]])),
shouldn't it?
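For reference, the FFM pairwise term does pair feature i's vector for j's field with feature j's vector for i's field, as the corrected line suggests. A small NumPy sketch of that indexing (the shapes and the `feature2field` mapping are toy values, not taken from the repo):

```python
import numpy as np

n_features, n_fields, k = 4, 2, 3
rng = np.random.default_rng(0)
# v[i, f] is the length-k latent vector feature i uses when interacting with field f
v = rng.normal(size=(n_features, n_fields, k))
feature2field = {0: 0, 1: 0, 2: 1, 3: 1}

def ffm_pair(i, j):
    # feature i's vector for j's field, times feature j's vector for i's field
    return np.sum(v[i, feature2field[j]] * v[j, feature2field[i]])

# pairwise interaction term over all feature pairs (all x_i = 1 here for simplicity)
interaction = sum(ffm_pair(i, j)
                  for i in range(n_features)
                  for j in range(i + 1, n_features))
```

Note the term is symmetric: `ffm_pair(i, j)` equals `ffm_pair(j, i)`, since the two factors simply swap.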
For example, if there are 10 million users and 10 million items, would that require 10 million x 10 million = 10^14 entries of feature data?
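No explicit cross features are materialized: FFM stores n x f x k latent parameters, which stays linear in the number of one-hot features rather than quadratic in the ID spaces. A back-of-the-envelope comparison (the field count and latent size below are assumed, not from the repo):

```python
# Parameter-count comparison, using the numbers from the question above
n_users, n_items = 10_000_000, 10_000_000    # 10 million each
n = n_users + n_items                         # total one-hot feature count
f, k = 2, 8                                   # fields and latent size (assumed)

full_pairwise = n_users * n_items             # explicit user x item cross features
ffm_params = n * f * k                        # FFM latent parameters

print(full_pairwise)   # 100_000_000_000_000  (10^14)
print(ffm_params)      # 320_000_000          (3.2 * 10^8)
```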
Caused by op u'Ftrl/update_Variable/SparseApplyFtrl', defined at:
File "DeepFM.py", line 325, in <module>
model.build_graph()
File "DeepFM.py", line 132, in build_graph
self.train()
File "DeepFM.py", line 124, in train
self.train_op = optimizer.minimize(self.loss, global_step=self.global_step)
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 410, in minimize
name=name)
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 610, in apply_gradients
update_ops.append(processor.update_op(self, grad))
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 128, in update_op
return optimizer._apply_sparse_duplicate_indices(g, self._v)
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 1019, in _apply_sparse_duplicate_indices
return self._apply_sparse(gradient_no_duplicate_indices, var)
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/training/ftrl.py", line 224, in _apply_sparse
use_locking=self._use_locking)
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/training/gen_training_ops.py", line 3299, in sparse_apply_ftrl
use_locking=use_locking, name=name)
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/u2019101432/.conda/envs/tf1.12/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Index 131989 at offset 131989 in indices is out of range
[[node Ftrl/update_Variable/SparseApplyFtrl (defined at DeepFM.py:124) = SparseApplyFtrl[T=DT_FLOAT, Tindices=DT_INT64, use_locking=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Variable, Variable/Ftrl, Variable/Ftrl_1, Ftrl/update_Variable/UnsortedSegmentSum, Ftrl/update_Variable/Unique, Ftrl/learning_rate, Ftrl/l1_regularization_strength, Ftrl/update_DNN/b1/Cast, Ftrl/learning_rate_power)]]
Parameter Server architecture or All-Reduce architecture?
CPU or GPU?
Is there open-source reference code?
Does the TensorFlow source need to be modified?
What is the most cost-effective option?
I want to find the preprocessed dataset; the raw dataset is also fine. Thank you.
In DeepFM, why is the input to the deep side the latent vector v?
Code from DeepFM.py:
"""
# shape of [None, 2]
self.linear_terms = tf.add(tf.matmul(self.X, w1), b)
# shape of [None, 1]
self.interaction_terms = tf.multiply(0.5,
                                     tf.reduce_mean(
                                         tf.subtract(
                                             tf.pow(tf.matmul(self.X, v), 2),
                                             tf.matmul(tf.pow(self.X, 2), tf.pow(v, 2))),
                                         1, keep_dims=True))
"""
Question: in the DeepFM paper, each categorical feature (before one-hot encoding) is represented by an embedding of length latent_size, which essentially acts like a locally connected layer; note this applies to every individual categorical feature. After each feature is embedded, the embeddings are multiplied pairwise to form the interactions.
But in the code, multiplying via "tf.matmul(self.X, v)" collapses all features into a single embedding of size self.k, rather than giving every feature its own size-self.k embedding.
Is there a problem here? It seems it should be tf.multiply(self.X, v). Am I misunderstanding something?
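For what it's worth, the `tf.matmul(self.X, v)` form comes from the standard FM reformulation: the pairwise term sum_{i&lt;j} &lt;v_i, v_j&gt; x_i x_j equals 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ], so collapsing the features into one k-dimensional vector per sample still recovers all pairwise interactions. A NumPy check of that identity on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3                                     # features and latent size (toy values)
x = rng.integers(0, 2, size=n).astype(float)    # a single one-hot-like sample
v = rng.normal(size=(n, k))

# Naive pairwise form: sum over i<j of <v_i, v_j> * x_i * x_j
naive = sum(np.dot(v[i], v[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# FM reformulation: 0.5 * sum_f [ (x @ v)_f^2 - ((x^2) @ (v^2))_f ]
fast = 0.5 * np.sum((x @ v) ** 2 - (x ** 2) @ (v ** 2))

assert np.isclose(naive, fast)
```

The deep-side question (per-feature embeddings via something like `tf.multiply(self.X, v)` and field-wise gathering) is a separate issue from this interaction term.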
In the implementation, the second-order term is broadcast and added to the first-order term. I would like to know why they are added and what this means.
In my opinion, an alternative would be to sum the first-order term into a scalar and then add it to the second-order part.
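Given the shapes quoted in the code comments above ([None, 2] linear terms, [None, 1] interaction), broadcasting adds the single interaction value to both class logits of each sample; and if these logits then feed a two-class softmax, adding the same scalar to both logits leaves the probabilities unchanged, which may be part of the concern here. A toy demonstration of the broadcast (the batch size and values are arbitrary):

```python
import numpy as np

batch = 4
linear_terms = np.arange(batch * 2, dtype=float).reshape(batch, 2)  # shape [batch, 2]
interaction = np.full((batch, 1), 0.5)                              # shape [batch, 1]

# Broadcasting adds the one interaction value to both logits of each sample
logits = linear_terms + interaction

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Shifting both logits by the same amount does not change the softmax output
assert np.allclose(softmax(logits), softmax(linear_terms))
```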
fields_train_dict = {}
for field in fields_train:
    with open('dicts/' + field + '.pkl', 'rb') as f:
        fields_train_dict[field] = pickle.load(f)

fields_test_dict = {}
for field in fields_test:
    with open('dicts/' + field + '.pkl', 'rb') as f:
        fields_test_dict[field] = pickle.load(f)
In this code, what is stored in the per-field files under the "dicts/" path?
What is the relationship between feature_length and field_cnt?
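A plausible reading of the code above: `field_cnt` is the number of raw columns (fields), each `dicts/<field>.pkl` maps that field's distinct values to one-hot offsets, and `feature_length` is the total one-hot width, i.e. the sum of the per-field vocabulary sizes. A toy sketch of that relationship (the dictionary contents are invented for illustration):

```python
# Hypothetical per-field value dictionaries like those pickled under dicts/
fields_dict = {
    'site_id': {'a': 0, 'b': 1, 'c': 2},           # 3 distinct values
    'app_id':  {'x': 0, 'y': 1},                   # 2 distinct values
    'device':  {'p': 0, 'q': 1, 'r': 2, 's': 3},   # 4 distinct values
}

field_cnt = len(fields_dict)                                 # number of fields: 3
feature_length = sum(len(d) for d in fields_dict.values())   # one-hot width: 9

print(field_cnt, feature_length)   # 3 9
```

So `feature_length >= field_cnt` always, with equality only if every field had a single value.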