
sasrec.pytorch's Introduction

Hi there 👋

I'm an R&D engineer who enjoys life by understanding how things work, and earns a living by making things faster.

From a research perspective, I work on making current models better with optimized initializers and optimizers, and on building better models guided by fine-grained evaluation methods and hinted at by model-attacking experiments.

From an engineering perspective, I work on software and hardware to serve these models at lower cost and higher throughput, and collaborate with teammates to make these products genuinely serve humankind as much as possible.

sasrec.pytorch's People

Contributors

pmixer


sasrec.pytorch's Issues

Predicting only for a single user sequence

Is there a way to make the predict method work in the following way (this could be used for real-time cases)?

Given a new sequence of items that a user interacted with in real time, and the model trained on the training data, return a set of recommended items with their scores.

P.S. https://github.com/THUwangcy/ReChorus is also a SASRec PyTorch implementation, but it is not easy to use for a single user.
Thanks,
Sara
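
A rough sketch of one way to score a single, real-time user sequence with the repo's predict() interface; the helper name, the dummy user id, and the candidate list are assumptions rather than repo code:

import numpy as np

def recommend_for_sequence(model, interacted_items, candidate_items, maxlen):
    # Right-align the user's recent interactions into a fixed-length sequence,
    # using 0 as the padding id, as the repo's evaluate() helper does.
    seq = np.zeros([maxlen], dtype=np.int32)
    idx = maxlen - 1
    for item in reversed(interacted_items):
        seq[idx] = item
        idx -= 1
        if idx == -1:
            break

    # predict() takes (user_ids, log_seqs, item_indices); SASRec has no user
    # embedding, so a dummy user id should be fine for a new user.
    scores = model.predict(np.array([1]), np.array([seq]), np.array(candidate_items))
    scores = scores[0].detach().cpu().numpy()   # one score per candidate item
    return sorted(zip(candidate_items, scores), key=lambda p: -p[1])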

When using multihead_attention, why are the queries normalized while the keys and values are not?

for i in range(len(self.attention_layers)):
    seqs = torch.transpose(seqs, 0, 1)
    Q = self.attention_layernorms[i](seqs)
    mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs,
                                              attn_mask=attention_mask)
                                              # key_padding_mask=timeline_mask
                                              # need_weights=False) this arg do not work?
    seqs = Q + mha_outputs
    seqs = torch.transpose(seqs, 0, 1)
In the SASRec paper, Section III (Methodology), part B (Self-Attention Block), the formula uses the same embedding as queries, keys, and values, and then transforms it through linear projections. Why are the queries normalized in the code while the keys and values are not?

Predictions in utils.py

utils.py:

for u in users:
    seq = np.zeros([args.maxlen], dtype=np.int32)
    predictions = -model.predict(*[np.array(l) for l in [[u], [seq], item_idx]])

I don't understand the purpose of this line of code:
predictions = predictions[0]

In the case of a single user, with seq defined as a one-dimensional array, the resulting predictions should only contain one user. So what is the meaning of predictions[0]?

Looking forward to your reply and answer!

Is evaluation only happening for < 10,000 users?

Hi. I was going through the code and noticed that during evaluation the number of users we evaluate on is capped at 10,000 (if usernum > 10000:). Is that correct?

if usernum > 10000:
    users = random.sample(range(1, usernum + 1), 10000)
else:
    users = range(1, usernum + 1)
for u in users:
    ...

In my own implementation I looped over all users (e.g., 52K for Amazon Beauty) and divided the nDCG and hit rates by the total number of users.

I'm actually not familiar with how evaluation in recommendation systems works and am wondering if I understood correctly and if this kind of sampled evaluation is common.

Thanks.

About the item_emb

Hello, I have a question. After loading the trained model and feeding in new data for a certain user, I found that the shape of item_emb is [3417, 50]; 3417 seems to be the largest item_id in the training data and 50 is max_len. In this case, does it mean that the largest item_id value must be 3416 when we enter new data? My understanding may be wrong, and I look forward to your reply.
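
For reference, a minimal sketch of how such an embedding table is typically sized in PyTorch (itemnum and hidden_units are assumed names; row 0 is reserved for padding, so valid item ids run from 1 to itemnum):

import torch

itemnum, hidden_units = 3416, 50
item_emb = torch.nn.Embedding(itemnum + 1, hidden_units, padding_idx=0)
print(item_emb.weight.shape)  # torch.Size([3417, 50])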

Confusion about the log2feats function

attention_mask = ~torch.tril(torch.ones((tl, tl), dtype=torch.bool, device=self.dev))
The shape of seqs should be (batch_size, seq_len, embedding), so how can a mask of shape (tl, tl) guarantee that batch_size == seq_len?
Also, why is the transpose seqs = torch.transpose(seqs, 0, 1) needed?
Looking forward to your reply.
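
For context, a standalone sketch (not repo code) of why the transpose is needed and why a mask of shape (tl, tl) works for any batch size:

import torch

# torch.nn.MultiheadAttention defaults to batch_first=False, i.e. it expects
# inputs of shape (seq_len, batch, embed); that is why seqs is transposed before
# and after each attention layer. A 2-D attn_mask of shape (seq_len, seq_len) is
# broadcast over the batch, so (tl, tl) never has to equal batch_size.
batch_size, seq_len, embed_dim, num_heads = 4, 6, 16, 2
mha = torch.nn.MultiheadAttention(embed_dim, num_heads)

x = torch.randn(batch_size, seq_len, embed_dim)
x = torch.transpose(x, 0, 1)                        # -> (seq_len, batch, embed)
mask = ~torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # True = blocked
out, _ = mha(x, x, x, attn_mask=mask)
out = torch.transpose(out, 0, 1)                    # back to (batch, seq_len, embed)
print(out.shape)                                    # torch.Size([4, 6, 16])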

Possible memory leak during iteration for large number users (10+ millions)?

I am testing with a large dataset with 10+ million users. I have 64 GB of RAM. The dataset fits in RAM initially, but as training progresses, e.g. during the epoch-1 iterations, RAM consumption keeps increasing until eventually all RAM is consumed. Is this expected behavior for a large dataset, or is there a possible memory leak somewhere in the iteration steps? I tried a smaller max sequence length (20) and a smaller batch size (64) and observed the same thing: RAM consumption keeps increasing during training. Thanks in advance.

Why aren't the positional embeddings used during inference?

Hi. I just had another question regarding the implementation. I'm not sure if you'd be able to answer or not, but I observed that the positional embeddings aren't being used for inference (inside the model's predict method).

Why is this? In my own implementation I made a separate EmbeddingLayer module that passes input sequences through an item embedding matrix and adds positional embeddings as well in one go (as is also implied in the official paper's figure).

Is there some sort of reason why it's done this way, or is this just some arbitrary choice by the original authors?
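
A rough sketch of the kind of EmbeddingLayer described above (names and sizes are illustrative, not taken from this repo):

import torch

class EmbeddingLayer(torch.nn.Module):
    def __init__(self, num_items, maxlen, hidden_units):
        super().__init__()
        self.item_emb = torch.nn.Embedding(num_items + 1, hidden_units, padding_idx=0)
        self.pos_emb = torch.nn.Embedding(maxlen, hidden_units)

    def forward(self, seqs):                       # seqs: (batch, seq_len) item ids
        positions = torch.arange(seqs.shape[1], device=seqs.device)
        return self.item_emb(seqs) + self.pos_emb(positions)  # broadcasts over batch

x = torch.randint(1, 100, (2, 8))                  # toy batch of item-id sequences
print(EmbeddingLayer(100, 8, 32)(x).shape)         # torch.Size([2, 8, 32])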

Clarification regarding evaluate vs evaluate_valid

Thank you for the implementation~

A quick point of clarification is that in utils.py, for evaluate, line 126 reads the same as evaluate_valid:
seq[idx] = valid[u][0]

Would this possibly be "seq[idx] = test[u][0]" instead?

Edit: I figured out my mistake, namely the validation item is the one preceding the test item, please disregard.

Explanation of why PyTorch 1.6 or above is required, and other info

Hi Guys,

Thanks for checking out the repo. As you may still run into problems due to varying HW & SW setups, here are a few links to help resolve potential issues:

  • Although the MultiheadAttention layer has been available since PyTorch 1.1, please be sure to use PyTorch 1.6 or above; there are problems with the attention-mask implementation (for enforcing causality) in older versions, as shown in pytorch/pytorch#21518.
  • There is a small bug in the original TF implementation of SASRec which has since been fixed: kang205/SASRec#15. As we are using PyTorch's official MultiheadAttention implementation, a similar problem should not exist here.
  • Please output logits and use torch.nn.BCEWithLogitsLoss rather than applying a sigmoid and using torch.nn.BCELoss for model training; depending on the PyTorch version, you may hit a bug with the second approach: NVIDIA/pix2pixHD#9 (see the sketch after this list).
  • The current version converges more slowly than the original TF implementation; I'm still checking the details to find the root cause. Please help if you happen to be interested and have bandwidth for small fixes :)
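
A minimal sketch of the two loss setups mentioned in the third bullet above (variable names are illustrative):

import torch

logits = torch.randn(8, requires_grad=True)     # raw model outputs
labels = torch.randint(0, 2, (8,)).float()      # 1 = positive item, 0 = negative

# Recommended: feed raw logits to BCEWithLogitsLoss (numerically stable).
loss = torch.nn.BCEWithLogitsLoss()(logits, labels)

# Discouraged: sigmoid followed by BCELoss, which on some PyTorch versions
# triggers the bug referenced above.
loss_alt = torch.nn.BCELoss()(torch.sigmoid(logits), labels)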

Stay Healthy,
Zan

Same output when right padding

I am experiencing an issue when I give the network sequences in which the last item is replaced by padding (0).
In this case, the trained model always produces the same output, regardless of the other values present in the sequence.
Is this by any chance a known problem?

From what I understand, the emb_dropout in log2feats should make the model robust to this type of sequence.
Am I wrong?

Thank you in advance for your response

Question about metrics??

Why do the metrics of SASRec and BERT4Rec on the ml-1m dataset differ so much between their respective papers? The papers seem to describe the same evaluation method. Please reply, thanks.

Is there any explanation for adjusting seqs after embedding?

I found seqs *= self.item_emb.embedding_dim ** 0.5 in the function log2feats(self, log_seqs). Is there any reason for scaling seqs after the embedding lookup?

seqs = self.item_emb(torch.LongTensor(log_seqs).to(self.dev))
seqs *= self.item_emb.embedding_dim ** 0.5

Calculation of Q in the code

Q = self.attention_layernorms[i](seqs)  # Why is Q calculated this way? Shouldn't it be seqs * W_q?
mha_outputs, _ = self.attention_layers[i](Q, seqs, seqs,
                                          attn_mask=attention_mask)
                                          # key_padding_mask=timeline_mask
                                          # need_weights=False) this arg do not work?
seqs = Q + mha_outputs  # This line is hard to understand
seqs = torch.transpose(seqs, 0, 1)

seqs = self.forward_layernorms[i](seqs)
seqs = self.forward_layers[i](seqs)
seqs *= ~timeline_mask.unsqueeze(-1)

Doubt about the evaluation process

First of all, many thanks to the author for providing the PyTorch version of the code.
Although the author says that some parts are the same as in the original TensorFlow implementation,
I still have doubts about the evaluation process. The value 100 is set at line 135 of sasrec.pytorch/utils.py:

for _ in range(100):
    t = np.random.randint(1, itemnum + 1)
    while t in rated: t = np.random.randint(1, itemnum + 1)
    item_idx.append(t)

I think this code should generate a candidate set whose length equals the total number of POIs (i.e. score against all items), but the source code does not.
The reported recommendation performance is too high because only 100 negative candidates are sampled.
If you also have doubts, you can leave me a message.

Doubt about the attention_mask

First:
timeline_mask = torch.BoolTensor(log_seqs == 0).to(self.dev) -> timeline_mask = torch.FloatTensor(log_seqs > 0).to(self.dev)

Second:
attention_mask = ~torch.tril(torch.ones((tl, tl), dtype=torch.bool, device=self.dev)) -> attention_mask = torch.tril(torch.ones((tl, tl), dtype=torch.float, device=self.dev))

Why is there a ~ sign?
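
For context, a standalone sketch (not repo code) of the two mask conventions torch.nn.MultiheadAttention accepts, which is why the ~ is needed when a boolean mask is used:

import torch

# With a boolean attn_mask, True means "this position may NOT be attended to",
# so the causal mask keeps the upper triangle True, hence ~torch.tril(...).
tl = 4
bool_mask = ~torch.tril(torch.ones((tl, tl), dtype=torch.bool))
print(bool_mask)

# A float attn_mask is additive: blocked positions get -inf. A 0/1 float mask
# like torch.tril(torch.ones(...)) would just add 1.0 and would not block anything.
float_mask = torch.zeros((tl, tl)).masked_fill(bool_mask, float("-inf"))
print(float_mask)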

Infinite loop in sample_function

In the sample_function (utils module, line 36) there is a "while True" loop, where the sample() result is appended infinitely many times.

Pls undo the fix for the activation function sometimes...

Hi Guys,

Just FYI.

I found the mistake of forgetting the activation function in the former implementation and have now fixed it: 30c43cf

But according to personal experiments on TiSASRec, you may sometimes prefer to undo it to get better performance; please check the issue below for details:

pmixer/TiSASRec.pytorch#1

Currently, just for SASRec, after training for 601 epochs, w/ ReLU I got:

test (NDCG@10: 0.5626, HR@10: 0.8073)

w/o ReLU I got:

test (NDCG@10: 0.5715, HR@10: 0.8157)

w/o ReLU and with AdamW instead of Adam, I got:

test (NDCG@10: 0.5781, HR@10: 0.8096)

still a bit short of what the paper reported:

test (NDCG@10: 0.5905, HR@10: 0.8245)

All experiments used maxlen=200. My guess is that replacing PyTorch 1.6's MHA with the self-made MHA (https://github.com/pmixer/TiSASRec.pytorch/blob/e87342ead6e90898234432f7d9b86e76695008bc/model.py#L25), which may leak a bit of future information, could close the gap.

Regards,
Zan
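
For context, here is an illustrative sketch of a point-wise feed-forward block of the kind discussed above, with the inner activation that the referenced commit restores made switchable; the layer names and layout are an approximation, not the repo's exact code:

import torch

class PointWiseFeedForward(torch.nn.Module):
    def __init__(self, hidden_units, dropout_rate, use_relu=True):
        super().__init__()
        self.conv1 = torch.nn.Conv1d(hidden_units, hidden_units, kernel_size=1)
        self.conv2 = torch.nn.Conv1d(hidden_units, hidden_units, kernel_size=1)
        self.dropout = torch.nn.Dropout(dropout_rate)
        self.act = torch.nn.ReLU() if use_relu else torch.nn.Identity()

    def forward(self, x):                          # x: (batch, seq_len, hidden)
        out = x.transpose(-1, -2)                  # Conv1d expects (batch, hidden, seq_len)
        out = self.conv2(self.dropout(self.act(self.conv1(out))))
        return out.transpose(-1, -2) + x           # residual connection

print(PointWiseFeedForward(50, 0.2)(torch.randn(2, 10, 50)).shape)  # torch.Size([2, 10, 50])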

Why is only the query being layer normalized for the input to self-attention?

Hi. I was going through your code for some self-studying purposes and noticed that only the query is being layer normalized. This is also observed in the original TensorFlow implementation (https://github.com/kang205/SASRec/blob/master/model.py#L54). Is there a reason why you do so?

I would assume that the inputs to the query, key, and value should all be the same. I've implemented the Transformer myself and have done it this way.

why is the attention_mask's shape (tl, tl)

tl = seqs.shape[1]  # time dim len for enforce causality
attention_mask = ~torch.tril(torch.ones((tl, tl), dtype=torch.bool, device=self.dev))

I can't understand why the attention_mask is this shape. Can you give me an answer or some references? I would be very grateful for your help!

Best way to interpret final `matmul`?

In this line, what is the best way to think about this matmul? I see that it is calculating dot products for the final_feat and each emb in item_embs. If item_embs were normalized, then I could see this being essentially evaluating the cosine similarity (within a scaling factor) of the item_embs with respect to final_feat, but because the item_embs can vary in magnitude by ~30% or so it is not quite the same. Can you give any insight into this?

Thanks!
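
For reference, a standalone sketch of the scoring step being discussed (shapes and names are illustrative): the final sequence feature is compared to each candidate item embedding by a plain dot product, so differing item-embedding norms scale the scores and the result is not a cosine similarity:

import torch

hidden_units, num_candidates = 50, 101
final_feat = torch.randn(hidden_units)           # output feature at the last position
item_embs = torch.randn(num_candidates, hidden_units)

logits = item_embs.matmul(final_feat)            # (num_candidates,) dot-product scores
cosine = logits / (item_embs.norm(dim=1) * final_feat.norm())
# logits == cosine * ||final_feat|| * ||item emb||, so item-embedding norms act
# as a per-item scaling on top of cosine similarity.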

Why put test[u][0] into the item_idx list??

Hi, could you please answer my question?
In the evaluate function, it builds an item_idx list and puts test[u][0] in it.
My understanding is that test[u][0] is what we want to predict, but in this way the model knows it should predict from among these candidates, which include the one we want to predict.
Is this a kind of data leaking? Or did I misunderstand something?

for u in users:

        if len(train[u]) < 1 or len(test[u]) < 1: continue

        seq = np.zeros([args.maxlen], dtype=np.int32)
        idx = args.maxlen - 1
        seq[idx] = valid[u][0]
        idx -= 1
        for i in reversed(train[u]):
            seq[idx] = i
            idx -= 1
            if idx == -1: break
        rated = set(train[u])
        rated.add(0)
        item_idx = [test[u][0]]  ##### WHY???
        for _ in range(100):
            t = np.random.randint(1, itemnum + 1)
            while t in rated: t = np.random.randint(1, itemnum + 1)
            item_idx.append(t)

        predictions = -model.predict(*[np.array(l) for l in [[u], [seq], item_idx]])
        predictions = predictions[0] # - for 1st argsort DESC

        rank = predictions.argsort().argsort()[0].item()

My understanding of this phase is: the model randomly chooses 100 candidates from all items (except those in the training sequence) and adds the one it wants to predict to the candidate set. Then it scores these 101 candidates. The logic seems strange to me.

process ml-1m dataset

Hi, many thanks for your amazing work. When I checked the ml-1m dataset, I found that it has already been processed. Could you share how you processed this dataset (or the mapping between the original user/item ids and the remapped user/item ids)?

Adding license to the repo

Hello, thank you for sharing your implementation with the community.
Could you please add a license (such as an MIT license) to this repository?
Thank you.

notes on data pre-processing

As you can see, I borrowed the pre-processed datasets from the paper authors' repo. You can generate your own datasets by grouping by user, sorting by timestamp, and finally dropping all columns except user_id and item_id to generate the (user_id, item_id) pairs used for model training/validation/testing (see the sketch below).

FYI to those who care about data pre-processing: I just noticed the paper authors claimed to remove users and items with fewer than 3 interactions, so for the ml-1m dataset the pre-processed file has slightly fewer rows than the original ml-1m ratings file. It's reasonable to remove these users/items, otherwise we cannot generate training/validation (2nd-to-last interacted item for the user)/testing (last interacted item for the user) data.
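
A rough sketch of that pre-processing in pandas; the column and file names are assumptions, not the authors' actual script:

import pandas as pd

ratings = pd.read_csv("ratings.csv")  # assumed columns: user_id, item_id, rating, timestamp

# Optionally drop items/users with fewer than 3 interactions, as the paper describes.
ratings = ratings[ratings.groupby("item_id")["user_id"].transform("count") >= 3]
ratings = ratings[ratings.groupby("user_id")["item_id"].transform("count") >= 3]

# Group by user, sort by timestamp, and keep only (user_id, item_id) pairs.
pairs = ratings.sort_values(["user_id", "timestamp"])[["user_id", "item_id"]]
pairs.to_csv("my_dataset.txt", sep=" ", header=False, index=False)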

Something confused about the datasets

The code runs well, but if I replace the given dataset with the dataset used in TiSASRec, which was also written by you (the first line is 1 1193 5 ...), the result is too good to be normal (NDCG is larger than 0.75 after 20 epochs). I have not found the reason so far. What is the difference if I use the TiSASRec dataset with the third and fourth columns unused? Thank you.
P.S. If I use the dataset whose first line is 1 2970 4 ..., which is a re-ordered version of the dataset mentioned above, the result is normal...
