
sasrec.pytorch's Introduction

Hi there 👋

I'm an R&D engineer who enjoys life by understanding how things work, and earns a living by making things faster.

From a research perspective, I work on making current models better with optimized initializers and optimizers, and on building better models guided by fine-grained evaluation methods and hinted at by model-attacking experiments.

From an engineering perspective, I work on software and hardware to serve these models at lower cost and higher throughput, and collaborate with teammates to make these products genuinely serve humankind as much as possible.

sasrec.pytorch's People

Contributors

pmixer


sasrec.pytorch's Issues

Predicting only for a single user sequence

Is there a way to make the predict method work in the following way (this could be used for real-time cases)?

Given a new sequence of items that a user interacted with in real time, and the model trained on the training data, return a set of recommended items with their scores.

P.S. https://github.com/THUwangcy/ReChorus is also a SASRec PyTorch implementation, but it is not easy to use for a single user.
Thanks,
Sara
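
A rough sketch of one way to score a single, real-time user sequence with the repo's predict() interface; the helper name, the dummy user id, and the candidate list are assumptions rather than repo code:

import numpy as np

def recommend_for_sequence(model, interacted_items, candidate_items, maxlen):
    # Right-align the user's recent interactions into a fixed-length sequence,
    # using 0 as the padding id, as the repo's evaluate() helper does.
    seq = np.zeros([maxlen], dtype=np.int32)
    idx = maxlen - 1
    for item in reversed(interacted_items):
        seq[idx] = item
        idx -= 1
        if idx == -1:
            break

    # predict() takes (user_ids, log_seqs, item_indices); SASRec has no user
    # embedding, so a dummy user id should be fine for a new user.
    scores = model.predict(np.array([1]), np.array([seq]), np.array(candidate_items))
    scores = scores[0].detach().cpu().numpy()   # one score per candidate item
    return sorted(zip(candidate_items, scores), key=lambda p: -p[1])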

When using multihead_attention, why are the queries normalized while the keys and values are not?

for i in range(len(self.attention_layers)):
    seqs = torch.transpose(seqs, 0, 1)
    Q = self.attention_layernorms[i](seqs)
    mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs,
                                              attn_mask=attention_mask)
                                              # key_padding_mask=timeline_mask
                                              # need_weights=False) this arg do not work?
    seqs = Q + mha_outputs
    seqs = torch.transpose(seqs, 0, 1)
In the SASRec paper, Section III (Methodology), part B (Self-Attention Block), the formula uses the same embedding as queries, keys, and values, and then transforms it through linear projections. Why are the queries normalized in the code while the keys and values are not?

Predictions in utils.py

utils.py:

for u in users:
    seq = np.zeros([args.maxlen], dtype=np.int32)
    predictions = -model.predict(*[np.array(l) for l in [[u], [seq], item_idx]])

I don't understand the purpose of this line of code:
predictions = predictions[0]

In the case of a single user, with seq defined as a one-dimensional array, the resulting predictions should only contain one user. So what is the meaning of predictions[0]?

Looking forward to your reply and answer!

Is evaluation only happening for < 10,000 users?

Hi. I was going through the code and noticed that during evaluation the number of users we evaluate on is capped at 10,000 (if usernum > 10000:). Is that correct?

if usernum > 10000:
    users = random.sample(range(1, usernum + 1), 10000)
else:
    users = range(1, usernum + 1)
for u in users:
    ...

In my own implementation I looped over all users (e.g., 52K for Amazon Beauty) and divided the nDCG and hit rates by the total number of users.

I'm actually not familiar with how evaluation in recommendation systems works and am wondering if I understood correctly and if this kind of sampled evaluation is common.

Thanks.

About the item_emb

Hello, I have a question. After loading the trained model and feeding in new data for a certain user, I found that the shape of item_emb is [3417, 50]; 3417 seems to be the largest item_id in the training data and 50 is max_len. In this case, does it mean that the largest item_id value must be 3416 when we enter new data? My understanding may be wrong, and I look forward to your reply.
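
For reference, a minimal sketch of how such an embedding table is typically sized in PyTorch (itemnum and hidden_units are assumed names; row 0 is reserved for padding, so valid item ids run from 1 to itemnum):

import torch

itemnum, hidden_units = 3416, 50
item_emb = torch.nn.Embedding(itemnum + 1, hidden_units, padding_idx=0)
print(item_emb.weight.shape)  # torch.Size([3417, 50])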

Confusion about the log2feats function

attention_mask = ~torch.tril(torch.ones((tl, tl), dtype=torch.bool, device=self.dev))
The shape of seqs should be (batch_size, seq_len, embedding), so how can a mask of shape (tl, tl) guarantee that batch_size == seq_len?
Also, why is the transpose seqs = torch.transpose(seqs, 0, 1) needed?
Looking forward to your reply.
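
For context, a standalone sketch (not repo code) of why the transpose is needed and why a mask of shape (tl, tl) works for any batch size:

import torch

# torch.nn.MultiheadAttention defaults to batch_first=False, i.e. it expects
# inputs of shape (seq_len, batch, embed); that is why seqs is transposed before
# and after each attention layer. A 2-D attn_mask of shape (seq_len, seq_len) is
# broadcast over the batch, so (tl, tl) never has to equal batch_size.
batch_size, seq_len, embed_dim, num_heads = 4, 6, 16, 2
mha = torch.nn.MultiheadAttention(embed_dim, num_heads)

x = torch.randn(batch_size, seq_len, embed_dim)
x = torch.transpose(x, 0, 1)                        # -> (seq_len, batch, embed)
mask = ~torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # True = blocked
out, _ = mha(x, x, x, attn_mask=mask)
out = torch.transpose(out, 0, 1)                    # back to (batch, seq_len, embed)
print(out.shape)                                    # torch.Size([4, 6, 16])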

Possible memory leak during iteration for large number users (10+ millions)?

I am testing with a large dataset with 10+ million users. I have 64 GB of RAM. The dataset fits in RAM initially, but as training progresses, e.g. during the epoch-1 iterations, RAM consumption keeps increasing until eventually all RAM is consumed. Is this expected behavior for a large dataset, or is there a possible memory leak somewhere in the iteration steps? I tried a smaller max sequence length (20) and a smaller batch size (64) and observed the same thing: RAM consumption keeps increasing during training. Thanks in advance.

Why aren't the positional embeddings used during inference?

Hi. I just had another question regarding the implementation. I'm not sure if you'd be able to answer or not, but I observed that the positional embeddings aren't being used for inference (inside the model's predict method).

Why is this? In my own implementation I made a separate EmbeddingLayer module that passes input sequences through an item embedding matrix and adds positional embeddings as well in one go (as is also implied in the official paper's figure).

Is there some sort of reason why it's done this way, or is this just some arbitrary choice by the original authors?
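
A rough sketch of the kind of EmbeddingLayer described above (names and sizes are illustrative, not taken from this repo):

import torch

class EmbeddingLayer(torch.nn.Module):
    def __init__(self, num_items, maxlen, hidden_units):
        super().__init__()
        self.item_emb = torch.nn.Embedding(num_items + 1, hidden_units, padding_idx=0)
        self.pos_emb = torch.nn.Embedding(maxlen, hidden_units)

    def forward(self, seqs):                       # seqs: (batch, seq_len) item ids
        positions = torch.arange(seqs.shape[1], device=seqs.device)
        return self.item_emb(seqs) + self.pos_emb(positions)  # broadcasts over batch

x = torch.randint(1, 100, (2, 8))                  # toy batch of item-id sequences
print(EmbeddingLayer(100, 8, 32)(x).shape)         # torch.Size([2, 8, 32])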

Clarification regarding evaluate vs evaluate_valid

Thank you for the implementation~

A quick point of clarification is that in utils.py, for evaluate, line 126 reads the same as evaluate_valid:
seq[idx] = valid[u][0]

Would this possibly be "seq[idx] = test[u][0]" instead?

Edit: I figured out my mistake, namely the validation item is the one preceding the test item, please disregard.

Explanation of why PyTorch 1.6 or above is required, and other info

Hi Guys,

Thanks for checking out the repo. As you may still run into problems due to varying HW & SW setups, here are a few links to help resolve potential issues:

  • Although the MultiheadAttention layer has been available since PyTorch 1.1, please be sure to use PyTorch 1.6 or above; there are problems with the attention-mask implementation (for enforcing causality) in older versions, as shown in pytorch/pytorch#21518.
  • There is a small bug in the original TF implementation of SASRec which has since been fixed: kang205/SASRec#15. As we are using PyTorch's official MultiheadAttention implementation, a similar problem should not exist here.
  • Please output logits and use torch.nn.BCEWithLogitsLoss rather than applying a sigmoid and using torch.nn.BCELoss for model training; depending on the PyTorch version, you may hit a bug with the second approach: NVIDIA/pix2pixHD#9 (see the sketch after this list).
  • The current version converges more slowly than the original TF implementation; I'm still checking the details to find the root cause. Please help if you happen to be interested and have bandwidth for small fixes :)
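
A minimal sketch of the two loss setups mentioned in the third bullet above (variable names are illustrative):

import torch

logits = torch.randn(8, requires_grad=True)     # raw model outputs
labels = torch.randint(0, 2, (8,)).float()      # 1 = positive item, 0 = negative

# Recommended: feed raw logits to BCEWithLogitsLoss (numerically stable).
loss = torch.nn.BCEWithLogitsLoss()(logits, labels)

# Discouraged: sigmoid followed by BCELoss, which on some PyTorch versions
# triggers the bug referenced above.
loss_alt = torch.nn.BCELoss()(torch.sigmoid(logits), labels)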

Stay Healthy,
Zan

Same output when right padding

I am experiencing an issue when I give the network sequences in which the last item is replaced by padding (0).
In this case, the trained model always produces the same output, regardless of the other values present in the sequence.
Is this by any chance a known problem?

From what I understand, the emb_dropout in log2feats should make the model robust to this type of sequence.
Am I wrong?

Thank you in advance for your response

Question about metrics??

Why do the metrics of SASRec and BERT4Rec on the ml-1m dataset differ so much between their respective papers? The papers seem to describe the same evaluation method. Please reply, thanks.

Is there any explanation for adjusting seqs after embedding?

I found seqs *= self.item_emb.embedding_dim ** 0.5 in the function log2feats(self, log_seqs). Is there any reason for scaling seqs after the embedding lookup?

seqs = self.item_emb(torch.LongTensor(log_seqs).to(self.dev))
seqs *= self.item_emb.embedding_dim ** 0.5

Calculation of Q in the code

Q = self.attention_layernorms[i](seqs)  # Why is Q calculated this way? Shouldn't it be seqs * W_q?
mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs,
                                          attn_mask=attention_mask)
                                          # key_padding_mask=timeline_mask
                                          # need_weights=False) this arg do not work?
seqs = Q + mha_outputs  # This line is hard to understand
seqs = torch.transpose(seqs, 0, 1)

seqs = self.forward_layernorms[i](seqs)
seqs = self.forward_layers[i](seqs)
seqs *= ~timeline_mask.unsqueeze(-1)

Doubt about the evaluation process

First of all, many thanks to the author for providing the PyTorch version of the code.
Although the author says that some parts are the same as in the original TensorFlow implementation,
I still have doubts about the evaluation process. The value 100 is set at line 135 of sasrec.pytorch/utils.py:

for _ in range(100):
    t = np.random.randint(1, itemnum + 1)
    while t in rated: t = np.random.randint(1, itemnum + 1)
    item_idx.append(t)

I think this code should generate a candidate set whose length equals the total number of POIs (i.e. score against all items), but the source code does not.
The reported recommendation performance is too high because only 100 negative candidates are sampled.
If you also have doubts, you can leave me a message.

Doubt about the attention_mask

First:
timeline_mask = torch.BoolTensor(log_seqs == 0).to(self.dev) -> timeline_mask = torch.FloatTensor(log_seqs > 0).to(self.dev)

Second:
attention_mask = ~torch.tril(torch.ones((tl, tl), dtype=torch.bool, device=self.dev)) -> attention_mask = torch.tril(torch.ones((tl, tl), dtype=torch.float, device=self.dev))

Why is there a ~ sign?
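
For context, a standalone sketch (not repo code) of the two mask conventions torch.nn.MultiheadAttention accepts, which is why the ~ is needed when a boolean mask is used:

import torch

# With a boolean attn_mask, True means "this position may NOT be attended to",
# so the causal mask keeps the upper triangle True, hence ~torch.tril(...).
tl = 4
bool_mask = ~torch.tril(torch.ones((tl, tl), dtype=torch.bool))
print(bool_mask)

# A float attn_mask is additive: blocked positions get -inf. A 0/1 float mask
# like torch.tril(torch.ones(...)) would just add 1.0 and would not block anything.
float_mask = torch.zeros((tl, tl)).masked_fill(bool_mask, float("-inf"))
print(float_mask)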

Infinite loop in sample_function

In the sample_function (utils module, line 36) there is a "while True" loop, where the sample() result is appended infinitely many times.

Pls undo the fix for the activation function sometimes...

Hi Guys,

Just FYI.

I found the mistake of forgetting the activation function in the former implementation and have now fixed it: 30c43cf

But according to personal experiments on TiSASRec, you may sometimes prefer to undo it to get better performance; please check the issue below for details:

pmixer/TiSASRec.pytorch#1

Currently, just for SASRec, after training for 601 epochs, w/ ReLU I got:

test (NDCG@10: 0.5626, HR@10: 0.8073)

w/o ReLU I got:

test (NDCG@10: 0.5715, HR@10: 0.8157)

w/o ReLU and with AdamW instead of Adam, I got:

test (NDCG@10: 0.5781, HR@10: 0.8096)

still a bit short of what the paper reported:

test (NDCG@10: 0.5905, HR@10: 0.8245)

All experiments used maxlen=200. My guess is that replacing PyTorch 1.6's MHA with the self-made MHA (https://github.com/pmixer/TiSASRec.pytorch/blob/e87342ead6e90898234432f7d9b86e76695008bc/model.py#L25), which may leak a bit of future information, could close the gap.

Regards,
Zan
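
For context, here is an illustrative sketch of a point-wise feed-forward block of the kind discussed above, with the inner activation that the referenced commit restores made switchable; the layer names and layout are an approximation, not the repo's exact code:

import torch

class PointWiseFeedForward(torch.nn.Module):
    def __init__(self, hidden_units, dropout_rate, use_relu=True):
        super().__init__()
        self.conv1 = torch.nn.Conv1d(hidden_units, hidden_units, kernel_size=1)
        self.conv2 = torch.nn.Conv1d(hidden_units, hidden_units, kernel_size=1)
        self.dropout = torch.nn.Dropout(dropout_rate)
        self.act = torch.nn.ReLU() if use_relu else torch.nn.Identity()

    def forward(self, x):                          # x: (batch, seq_len, hidden)
        out = x.transpose(-1, -2)                  # Conv1d expects (batch, hidden, seq_len)
        out = self.conv2(self.dropout(self.act(self.conv1(out))))
        return out.transpose(-1, -2) + x           # residual connection

print(PointWiseFeedForward(50, 0.2)(torch.randn(2, 10, 50)).shape)  # torch.Size([2, 10, 50])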

Why is only the query being layer normalized for the input to self-attention?

Hi. I was going through your code for some self-studying purposes and noticed that only the query is being layer normalized. This is also observed in the original TensorFlow implementation (https://github.com/kang205/SASRec/blob/master/model.py#L54). Is there a reason why you do so?

I would assume that the inputs to the query, key, and value should all be the same. I've implemented the Transformer myself and have done it this way.

why is the attention_mask's shape (tl, tl)

tl = seqs.shape[1]  # time dim len for enforce causality
attention_mask = ~torch.tril(torch.ones((tl, tl), dtype=torch.bool, device=self.dev))

I can't understand why the attention_mask is this shape. Can you give me an answer or some references? I would be very grateful for your help!

Best way to interpret final `matmul`?

In this line, what is the best way to think about this matmul? I see that it is calculating dot products for the final_feat and each emb in item_embs. If item_embs were normalized, then I could see this being essentially evaluating the cosine similarity (within a scaling factor) of the item_embs with respect to final_feat, but because the item_embs can vary in magnitude by ~30% or so it is not quite the same. Can you give any insight into this?

Thanks!
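
For reference, a standalone sketch of the scoring step being discussed (shapes and names are illustrative): the final sequence feature is compared to each candidate item embedding by a plain dot product, so differing item-embedding norms scale the scores and the result is not a cosine similarity:

import torch

hidden_units, num_candidates = 50, 101
final_feat = torch.randn(hidden_units)           # output feature at the last position
item_embs = torch.randn(num_candidates, hidden_units)

logits = item_embs.matmul(final_feat)            # (num_candidates,) dot-product scores
cosine = logits / (item_embs.norm(dim=1) * final_feat.norm())
# logits == cosine * ||final_feat|| * ||item emb||, so item-embedding norms act
# as a per-item scaling on top of cosine similarity.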

Why put test[u][0] into the item_idx list??

Hi, could you please answer my question?
In the evaluate function, it builds an item_idx list and puts test[u][0] in it.
My understanding is that test[u][0] is what we want to predict, but in this way the model knows it should predict from among these candidates, which include the one we want to predict.
Is this a kind of data leaking? Or did I misunderstand something?

for u in users:

        if len(train[u]) < 1 or len(test[u]) < 1: continue

        seq = np.zeros([args.maxlen], dtype=np.int32)
        idx = args.maxlen - 1
        seq[idx] = valid[u][0]
        idx -= 1
        for i in reversed(train[u]):
            seq[idx] = i
            idx -= 1
            if idx == -1: break
        rated = set(train[u])
        rated.add(0)
        item_idx = [test[u][0]]  ##### WHY???
        for _ in range(100):
            t = np.random.randint(1, itemnum + 1)
            while t in rated: t = np.random.randint(1, itemnum + 1)
            item_idx.append(t)

        predictions = -model.predict(*[np.array(l) for l in [[u], [seq], item_idx]])
        predictions = predictions[0] # - for 1st argsort DESC

        rank = predictions.argsort().argsort()[0].item()

My understanding of this phase is: the model randomly chooses 100 candidates from all items (except those in the training sequence) and adds the one it wants to predict to the candidate set. Then it scores these 101 candidates. The logic seems strange to me.

process ml-1m dataset

Hi, many thanks for your amazing work. When I checked the ml-1m dataset, I found that it has already been processed. Could you share how you processed this dataset (or the mapping between the original user/item ids and the remapped user/item ids)?

Adding license to the repo

Hello, thank you for sharing your implementation with the community.
Could you please add a license (such as an MIT license) to this repository?
Thank you.

notes on data pre-processing

As you can see, I borrowed the pre-processed datasets from the paper authors' repo. You can generate your own datasets by grouping by user, sorting by timestamp, and finally dropping all columns except user_id and item_id to generate the (user_id, item_id) pairs used for model training/validation/testing (see the sketch below).

FYI to those who care about data pre-processing: I just noticed the paper authors claimed to remove users and items with fewer than 3 interactions, so for the ml-1m dataset the pre-processed file has slightly fewer rows than the original ml-1m ratings file. It's reasonable to remove these users/items, otherwise we cannot generate training/validation (2nd-to-last interacted item for the user)/testing (last interacted item for the user) data.
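
A rough sketch of that pre-processing in pandas; the column and file names are assumptions, not the authors' actual script:

import pandas as pd

ratings = pd.read_csv("ratings.csv")  # assumed columns: user_id, item_id, rating, timestamp

# Optionally drop items/users with fewer than 3 interactions, as the paper describes.
ratings = ratings[ratings.groupby("item_id")["user_id"].transform("count") >= 3]
ratings = ratings[ratings.groupby("user_id")["item_id"].transform("count") >= 3]

# Group by user, sort by timestamp, and keep only (user_id, item_id) pairs.
pairs = ratings.sort_values(["user_id", "timestamp"])[["user_id", "item_id"]]
pairs.to_csv("my_dataset.txt", sep=" ", header=False, index=False)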

Something confused about the datasets

The code runs well, but if I replace the given dataset with the dataset used in TiSASRec, which was also written by you (the first line is 1 1193 5 ...), the result is too good to be normal (NDCG is larger than 0.75 after 20 epochs). I have not found the reason so far. What is the difference if I use the TiSASRec dataset with the third and fourth columns unused? Thank you.
P.S. If I use the dataset whose first line is 1 2970 4 ..., which is a re-ordered version of the dataset mentioned above, the result is normal...
