
bars's Introduction

RecZoo

RecZoo: A curated model zoo for recommendation tasks

Matching

No. Model Publication
1 UltraGCN Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, Xiuqiang He. UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation, in CIKM 2021.
2 SimpleX Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, Xiuqiang He. SimpleX: A Simple and Strong Baseline for Collaborative Filtering, in CIKM 2021.

Ranking

No. Model Publication
1 FinalMLP Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, Zhenhua Dong. FinalMLP: An Enhanced Two-Stream MLP Model for CTR Prediction, in AAAI 2023.
2 FinalNet Jieming Zhu, Qinglin Jia, Guohao Cai, Quanyu Dai, Jingjie Li, Zhenhua Dong, Ruiming Tang, Rui Zhang. FINAL: Factorized Interaction Layer for CTR Prediction, in SIGIR 2023.
3 RAT Yushen Li, Jinpeng Wang, Tao Dai, Jieming Zhu, Jun Yuan, Rui Zhang, Shu-Tao Xia. RAT: Retrieval-Augmented Transformer for Click-Through Rate Prediction, in WWW 2024.
4 STEM Liangcai Su, Junwei Pan, Ximei Wang, Xi Xiao, Shijie Quan, Xihua Chen, Jie Jiang. STEM: Unleashing the Power of Embeddings for Multi-task Recommendation, in AAAI 2024.
5 Helen Zirui Zhu, Yong Liu, Zangwei Zheng, Huifeng Guo, Yang You. Helen: Optimizing CTR Prediction Models with Frequency-wise Hessian Eigenvalue Regularization, in WWW 2024.
6 Combined-Pair Zhutian Lin, Junwei Pan, Shangyu Zhang, Ximei Wang, Xi Xiao, Shudong Huang, Lei Xiao, Jie Jiang. Understanding the Ranking Loss for Recommendation with Sparse User Feedback, in KDD 2024.
7 AdaGIN Lei Sang, Honghao Li, Yiwen Zhang, Yi Zhang, Yun Yang. AdaGIN: Adaptive Graph Interaction Network for Click-Through Rate Prediction, in TOIS 2024.
8 SimCEN Honghao Li, Lei Sang, Yi Zhang, Yiwen Zhang. SimCEN: Simple Contrast-enhanced Network for CTR Prediction, in MM 2024.
9 RecSys Qi Zhang, Jieming Zhu, Jiansheng Sun, Guohao Cai, Ruining Yu, Bangzheng He, Liangbi Li. Enhancing News Recommendation with Real-Time Feedback and Generative Sequence Modeling, in RecSys Challenge Workshop 2024.
10 DCNv3 Honghao Li, Yiwen Zhang, Yi Zhang, Hanwei Li, Lei Sang, Jieming Zhu. DCNv3: Towards Next Generation Deep Cross Network for CTR Prediction, in arXiv 2024.

Reranking

Pretraining

No. Model Publication
1 UNBERT Qi Zhang, Jingjie Li, Qinglin Jia, Chuyuan Wang, Jieming Zhu, Zhaowei Wang, Xiuqiang He. UNBERT: User-News Matching BERT for News Recommendation, in IJCAI 2021.

Personalization

No. Model Publication
1 PMG Xiaoteng Shen, Rui Zhang, Xiaoyan Zhao, Jieming Zhu, Xi Xiao. PMG: Personalized Multimodal Generation with Large Language Models, in WWW 2024.

bars's People

Contributors

kyriemao, liangcaisu, xpai, zhujiem


bars's Issues

parameter setting for BCE loss on yelp2018

Hi, I came across the CIKM 2021 paper SimpleX: A Simple and Strong Baseline for Collaborative Filtering. Nice work!
However, I could not reproduce the BCE loss result on yelp2018 reported in Table 1. Looking forward to your help!
Thanks a lot @zhujiem @xpai

About the TaobaoAd_x1 dataset

TaobaoAd_x1
Dataset description

Taobao is a dataset provided by Alibaba, which contains 8 days of ad click-through data (26 million records) randomly sampled from 1,140,000 users. Following the original data split, we use the first 7 days (i.e., 20170506-20170512) of samples for training and the last day's samples (i.e., 20170513) for testing. We follow the preprocessing steps applied to reproduce the DMR work. Note that a small part (~5%) of samples were dropped during preprocessing due to missing user or item profiles. The preprocessed data can be accessed from the BARS benchmark.

The description above states: a small part (~5%) of samples have been dropped during preprocessing due to missing user or item profiles.

Could you explain how these samples were removed? Was new logic introduced on top of the code provided by the DMR data preprocessing? Thanks!
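For concreteness, here is a hypothetical sketch of such a filtering and split step in pandas; the file and column names (raw_sample.csv, user_profile.csv, ad_feature.csv, user, userid, adgroup_id, time_stamp) are assumptions for illustration, not the actual DMR preprocessing code:

import pandas as pd

# Hypothetical sketch, not the actual DMR preprocessing code.
samples = pd.read_csv("raw_sample.csv")
users = pd.read_csv("user_profile.csv")
ads = pd.read_csv("ad_feature.csv")

# Inner joins keep only samples whose user and item profiles both exist;
# rows with a missing profile are dropped, which is one way ~5% could go.
samples = samples.merge(users, left_on="user", right_on="userid", how="inner")
samples = samples.merge(ads, on="adgroup_id", how="inner")

# Date-based split: first 7 days (20170506-20170512) for training,
# the last day (20170513) for testing.
day = pd.to_datetime(samples["time_stamp"], unit="s").dt.strftime("%Y%m%d")
train = samples[day <= "20170512"]
test = samples[day == "20170513"]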

Is it normal that I cannot exactly reproduce the results?

I am trying to reproduce the results of DeepFM on the criteo_x4_001 dataset. I have set up my environment as follows:

python             3.6.13
cuda                11.7
torch               1.10.2
fuxictr             1.0.2
h5py                3.1.0
numpy               1.19.5
pandas              1.1.5
scipy               1.5.4

It is not exactly the same as the environment in this repo, but at least I have pinned fuxictr to version 1.0.2 exactly.
Then I followed the config at
https://github.com/reczoo/BARS/tree/main/ranking/ctr/DeepFM/DeepFM_criteo_x4_001

The results were slightly different. I noticed that the AUC in the original experiment had a big jump from 0.809407 to 0.813303 between epochs 4 and 5. My results had the same kind of jump, but it came later, between epochs 8 and 9, where the AUC jumped from 0.809898 to 0.813443.
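Run-to-run variation like this is common on GPU. For reference, a minimal sketch of the usual PyTorch determinism knobs (generic advice, not part of the BARS scripts); even with all of them set, a different GPU model, CUDA version, or torch version can still shift results slightly:

import random
import numpy as np
import torch

def set_seed(seed: int = 2019) -> None:
    # Fix all the random sources PyTorch training typically touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels trade speed for reproducibility.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False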

Avazu_x4 weirdly requires an extremely large amount of video memory.

I directly downloaded the preprocessed Avazu_x4 dataset, discarded the id feature, and processed all other features as string categorical features. However, strangely, loading it requires a huge amount of GPU memory (about 31 GB!), and I can't even use this dataset on an Nvidia V100 32G because of OOM.
Is this normal? Is there any way to fix it?
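Most of that footprint usually comes from the embedding tables and the optimizer states kept alongside them. A back-of-the-envelope sketch; the vocabulary size below is an illustrative assumption, not the real Avazu_x4 statistics:

# Rough estimate of embedding-table memory; numbers are assumptions.
vocab_size = 30_000_000    # total entries across all categorical features
embedding_dim = 16
bytes_per_float = 4        # float32

params = vocab_size * embedding_dim
gb = params * bytes_per_float / 1024 ** 3
print(f"{params:,} weights ~= {gb:.1f} GB")
# Adam keeps two extra state tensors per weight, roughly tripling this,
# so raising min_categr_count to shrink the vocabulary is a common fix.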

sequential models

When will sequential models like DIN/DIEN be evaluated on these public benchmarks? Do you have a plan?

jit.trace not working for the models

torch.jit.trace(fibinet, batch_data)

File "/anaconda3/envs/fuxictr3.8/lib/python3.8/site-packages/torch/jit/_trace.py", line 794, in trace
return trace_module(
File "/anaconda3/envs/fuxictr3.8/lib/python3.8/site-packages/torch/jit/_trace.py", line 1056, in trace_module
module._c._create_method_from_trace(
File "/anaconda3/envs/fuxictr3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/anaconda3/envs/fuxictr3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1488, in _slow_forward
result = self.forward(*input, **kwargs)
TypeError: forward() takes 2 positional arguments but 3 were given
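The traceback suggests that batch_data is a tuple of two elements (e.g., inputs and labels): torch.jit.trace unpacks its example inputs into positional arguments, so forward() receives two while it accepts only one. A hedged sketch of the usual fix; that batch_data is exactly (inputs, labels) is an assumption about the data loader:

# jit.trace unpacks a tuple of example inputs into positional arguments.
# If batch_data is (X, y), forward() gets two arguments but takes one,
# which matches the error above. Pass only the model input instead:
X, y = batch_data                      # assumption: batch is (inputs, labels)
traced = torch.jit.trace(fibinet, (X,))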

Unable to reproduce the AUC of FinalNet

Here is the config:

{
"batch_norm": "True",
"batch_size": "4096",
"block1_dropout": "0.1",
"block1_hidden_activations": "ReLU",
"block1_hidden_units": "[400, 400]",
"block2_dropout": "0",
"block2_hidden_activations": "ReLU",
"block2_hidden_units": "[400, 400]",
"block_type": "2B",
"data_format": "csv",
"data_root": "/data/workspace/rec/dataset/Criteo_x1/",
"dataset_id": "criteo_x1_fuxictr2",
"debug_mode": "False",
"early_stop_patience": "2",
"embedding_dim": "10",
"embedding_regularizer": "1e-05",
"epochs": "100",
"eval_interval": "1",
"eval_steps": "None",
"feature_cols": "[{'active': True, 'dtype': 'float', 'name': ['I1', 'I2', 'I3', 'I4', 'I5', 'I6', 'I7', 'I8', 'I9', 'I10', 'I11', 'I12', 'I13'], 'type': 'numeric'}, {'active': True, 'dtype': 'float', 'name': ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26'], 'type': 'categorical'}]",
"feature_specs": "None",
"gpu": "0",
"group_id": "None",
"label_col": "{'dtype': 'float', 'name': 'label'}",
"learning_rate": "0.001",
"loss": "binary_crossentropy",
"metrics": "['AUC', 'logloss']",
"min_categr_count": "1",
"model": "FinalNet",
"model_id": "FINAL_criteo_x1_028_b6c861e4",
"model_root": "./checkpoints/FINAL_criteo_x1/",
"monitor": "AUC",
"monitor_mode": "max",
"net_regularizer": "0",
"num_workers": "3",
"optimizer": "adam",
"ordered_features": "None",
"pickle_feature_encoder": "True",
"save_best_only": "True",
"seed": "2021",
"shuffle": "True",
"task": "binary_classification",
"test_data": "/data/workspace/rec/dataset/Criteo_x1/test.csv",
"train_data": "/data/workspace/rec/dataset/Criteo_x1/train.csv",
"use_field_gate": "True",
"valid_data": "/data/workspace/rec/dataset/Criteo_x1/valid.csv",
"verbose": "1"
}

Here is the log:
FINAL_criteo_x1_028_b6c861e4.log

Experimental Results on Gowalla dataset

Hello,

Thank you so much for your very useful work.

I have a question about results on Gowalla dataset, based on the log in https://github.com/openbenchmark/BARS/tree/master/candidate_matching/benchmarks/SimpleX/SimpleX_gowalla_x1

How do you divide training data into training and validation sets? From the log, test set is currently treated as validation set and reported results are selected as the best iteration on test set, which causes test data leakage. A similar problem was reported in this paper https://arxiv.org/pdf/2005.09683.pdf?

Look forward to hearing from you soon,
Thank you so much.

the method of embedding table initialization for different CTR prediction models

How should the embedding table initialization method be chosen? If I choose xavier_uniform, the AUC is always 0.5 when I run the SAM model. But when I run other models (AutoInt, FiGNN, AFN) with xavier_uniform, the results are good and normal.
I see that the method used in the project is normal initialization (embedding_initializer="partial(nn.init.normal_, std=1e-4)"). Is that the method for all models?
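For reference, the two initializers under discussion, as a minimal generic sketch (the table size is arbitrary; applying the initializer per table is an assumption about the setup):

import torch.nn as nn

emb = nn.Embedding(num_embeddings=10000, embedding_dim=10)

# Default quoted above: small-variance normal initialization.
nn.init.normal_(emb.weight, std=1e-4)

# Alternative asked about: Xavier scales variance by fan-in/fan-out, so the
# values are orders of magnitude larger than std=1e-4; models built on
# multiplicative feature interactions may be sensitive to that difference.
nn.init.xavier_uniform_(emb.weight)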

About the DIEN sequence model

I want to run datasets such as Criteo and Avazu on the DIEN sequence model. How should I modify the parameters in the model_config.yaml file? Looking forward to your reply!

download csv and use h5

When I download the Avazu dataset from Kaggle, its format is csv,
but when I run the code, the format expected by DeepFM_avazu_x4_tuner_config_01.yaml is h5.
How can I transform csv to h5?
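In fuxictr, the h5 files are normally produced by the framework itself: point the dataset config at the raw csv with data_format: csv, and the first run preprocesses it into h5 files that later runs can reuse. For a standalone conversion, a generic h5py sketch follows; the flat per-column layout here is an assumption for illustration and does not reproduce fuxictr's internal h5 schema:

import h5py
import pandas as pd

# Generic csv -> h5 conversion; the layout is illustrative only.
df = pd.read_csv("train.csv")
with h5py.File("train.h5", "w") as f:
    for col in df.columns:
        values = df[col].to_numpy()
        if values.dtype == object:       # h5py cannot store Python objects
            values = values.astype("S")  # fixed-width byte strings
        f.create_dataset(col, data=values)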

Is fuxictr<=1.x a must?

I noticed that the latest version of fuxictr is 2.3.x. Does this repo run on fuxictr 2.3.x?
By the way, I noticed that many of the requirements files in this git group do not specify versions.
I strongly suggest doing so. I have encountered version-compatibility problems with some of the dependencies
and had to downgrade version by version to find out which one is compatible.

Can you provide more details about the implementation?

This is a nice project. Just one question: I can find very little documentation about the implementation.
Are all the methods implemented on the same base framework, like OpenMMLab projects (e.g., mmdetection)?
Or is each method a standalone bundle of code borrowed from other sources?

Amazon-CDs/Movies/Beauty

Though BGCF (KDD 2020) provides the dataset splitting details, it is still hard to fully reproduce (due to the lack of a random seed). Could you provide official download links for AmazonCDs/Movies/Beauty_m1?

Thanks.

DNN's parameter size is not consistent with the log

The DNN_avazu_x1 log shows "2022-02-08 10:25:00,299 P50417 INFO [Metrics] AUC: 0.763019 - logloss: 0.368178",
but the result on my machine is "2023-04-25 19:35:46,393 P47366 INFO [Metrics] AUC: 0.755872 - logloss: 0.371449".

The reason may be an inconsistency in the network: the number of parameters in the DNN_avazu_x1 log is 13805192 while mine is 13805191. I can't figure out where the one extra parameter comes from.
The details of my parameters are as follows:
embedding_layer.embedding_layer.embedding_layer.feat_1.weight torch.Size([8, 10])
embedding_layer.embedding_layer.embedding_layer.feat_2.weight torch.Size([8, 10])
embedding_layer.embedding_layer.embedding_layer.feat_3.weight torch.Size([3479, 10])
embedding_layer.embedding_layer.embedding_layer.feat_4.weight torch.Size([4270, 10])
embedding_layer.embedding_layer.embedding_layer.feat_5.weight torch.Size([25, 10])
embedding_layer.embedding_layer.embedding_layer.feat_6.weight torch.Size([4863, 10])
embedding_layer.embedding_layer.embedding_layer.feat_7.weight torch.Size([304, 10])
embedding_layer.embedding_layer.embedding_layer.feat_8.weight torch.Size([32, 10])
embedding_layer.embedding_layer.embedding_layer.feat_9.weight torch.Size([228185, 10])
embedding_layer.embedding_layer.embedding_layer.feat_10.weight torch.Size([1048284, 10])
embedding_layer.embedding_layer.embedding_layer.feat_11.weight torch.Size([6514, 10])
embedding_layer.embedding_layer.embedding_layer.feat_12.weight torch.Size([5, 10])
embedding_layer.embedding_layer.embedding_layer.feat_13.weight torch.Size([5, 10])
embedding_layer.embedding_layer.embedding_layer.feat_14.weight torch.Size([1939, 10])
embedding_layer.embedding_layer.embedding_layer.feat_15.weight torch.Size([9, 10])
embedding_layer.embedding_layer.embedding_layer.feat_16.weight torch.Size([10, 10])
embedding_layer.embedding_layer.embedding_layer.feat_17.weight torch.Size([348, 10])
embedding_layer.embedding_layer.embedding_layer.feat_18.weight torch.Size([5, 10])
embedding_layer.embedding_layer.embedding_layer.feat_19.weight torch.Size([60, 10])
embedding_layer.embedding_layer.embedding_layer.feat_20.weight torch.Size([170, 10])
embedding_layer.embedding_layer.embedding_layer.feat_21.weight torch.Size([51, 10])
embedding_layer.embedding_layer.embedding_layer.feat_22.weight torch.Size([25, 10])
dnn.dnn.0.weight torch.Size([400, 220])
dnn.dnn.0.bias torch.Size([400])
dnn.dnn.3.weight torch.Size([400, 400])
dnn.dnn.3.bias torch.Size([400])
dnn.dnn.6.weight torch.Size([400, 400])
dnn.dnn.6.bias torch.Size([400])
dnn.dnn.9.weight torch.Size([1, 400])
dnn.dnn.9.bias torch.Size([1])
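Since a vocabulary-size mismatch would change the count in multiples of the embedding dim (10 here), a difference of exactly 1 usually points to a scalar somewhere, e.g. an extra output or gate bias. A generic sketch for diffing two runs tensor by tensor (not BARS-specific):

import torch.nn as nn

# Stand-in model for illustration; substitute your instantiated DNN.
model = nn.Sequential(nn.Linear(220, 400), nn.ReLU(), nn.Linear(400, 1))

# Print every trainable tensor with its size so two runs can be diffed
# line by line to locate the extra scalar.
total = 0
for name, p in model.named_parameters():
    if p.requires_grad:
        print(name, tuple(p.shape), p.numel())
        total += p.numel()
print("total trainable parameters:", total)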

Data format mismatch

Why use h5 files? Is using csv directly not efficient?

Hi,

I am trying to use this repo to reproduce the results of some benchmarks, but h5 files do not work well on our computation system.
Is there a reason for using h5? Could we directly use csv files to construct the train, validation, and test data?
Is csv not efficient enough?

Thanks~
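The usual motivation for h5 is that the expensive preprocessing (categorical encoding, etc.) is done once, and the resulting arrays then load much faster than re-parsing csv on every run. On systems where h5 is awkward, it can help to first inspect what a file actually contains; a generic h5py sketch:

import h5py

# List every dataset in the file with its shape and dtype.
with h5py.File("train.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)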

Results obtained by following the Running steps do not match the logs in the README

Hello,
In DCN_criteo_x4_001, following the Running steps produces
"data_format": "csv",
"dataset_id": "criteo_x4_9ea3bdfc",
"model_id": "DCN_criteo_x4_012_dc8ab363",
"model_root": "./Criteo/DCN_criteo_x4_001/",
with the final result logloss: 0.437622 - AUC: 0.814319.

But the logs in the README show
"data_format": "h5",
"dataset_id": "criteo_x4_5c863b0f",
"model_id": "DCN_criteo_x4_5c863b0f_012_56382345",
"model_root": "./Criteo/DCN_criteo/min10/",
with the final result logloss: 0.437612 - AUC: 0.814437,
and I cannot find the content referenced in the README anywhere in the project.

@zhujiem @xpai Thanks for your answer!

Add papers to the model list

  • [TKDD'20] Core Interest Network for Click-Through Rate Prediction

  • [CIKM'20] Deep Multi-Interest Network for Click-through Rate Prediction

  • [CIKM'21] Efficient Learning to Learn a Robust CTR Model for Web-scale Online Sponsored Search Advertising

  • [CIKM'21] Enhancing Explicit and Implicit Feature Interactions via Information Sharing for Parallel Deep CTR Models

  • [CIKM'21] One Model to Serve All: Star Topology Adaptive Recommender for Multi-Domain CTR Prediction

  • [CIKM'21] AutoIAS: Automatic Integrated Architecture Searcher for Click-Through Rate Prediction

  • [CIKM'21] Click-Through Rate Prediction with Multi-Modal Hypergraphs

  • [CIKM'21] AutoHERI: Automated Hierarchical Representation Integration for Post-Click Conversion Rate Estimation

  • [CIKM'21] Disentangled Self-Attentive Neural Networks for Click-Through Rate Prediction

about code for NIA-GCN

I found that the results of NIA-GCN are reported in your survey paper. I was wondering if you could share the source code for NIA-GCN?

With fuxictr 1.0.2 and the FGCNN_avazu_x4_001 config, AUC stays at 0.5

Hello! When using fuxictr 1.0.2 + BARS and running with the FGCNN_avazu_x4_001 config file, the AUC stays at 0.5. I only get valid results after setting both conv_batch_norm and dnn_batch_norm in the config to false (the original config in the repo has conv_batch_norm set to true). Why is this?
Alternatively, lowering the learning rate from 0.001 to 0.0001 also resolves the problem.

Taobao dataset result and config

Hello~ I find there are no dataset config files for the Taobao dataset in the benchmarks folder. Could you provide them in a future version?

About the model suffixes in model_config.yaml

Thanks to the authors for building and open-sourcing the BARS project; it provides the shoulders of giants to stand on in the CTR prediction field.

model_config.yaml contains multiple model configs, such as:

  • DIN_taobaoad_x1_001_6c779213
  • DIN_taobaoad_x1_002_5f3df4a0
  • ...
  • DIN_taobaoad_x1_012_13c23a36

What do these eight-character alphanumeric suffixes mean, and how are they generated?

Thanks!
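From the naming pattern, the running number (001, 002, ...) looks like an index into the expanded grid of tuner hyperparameter combinations, and the eight-character suffix looks like a truncated hash of that combination, so each id stays unique and stable per config. A hypothetical illustration of such a scheme; this is an assumption about the pattern, not fuxictr's actual code:

import hashlib
import json

# Hypothetical id scheme; not fuxictr's actual implementation.
def experiment_id(model, dataset, index, params):
    digest = hashlib.md5(json.dumps(params, sort_keys=True).encode()).hexdigest()
    return f"{model}_{dataset}_{index:03d}_{digest[:8]}"

print(experiment_id("DIN", "taobaoad_x1", 1, {"learning_rate": 1e-3}))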
