
ctpfrec's Introduction

Collaborative Topic Poisson Factorization

Python implementation of the algorithm for probabilistic matrix factorization described in Content-based recommendations with Poisson factorization (Gopalan, P.K., Charlin, L. and Blei, D., 2014).

This is a statistical model aimed at recommender systems with implicit data consisting of counts of user-item interactions (e.g. clicks by each user on different products) plus bag-of-words representations of the items. The model is fit using mean-field variational inference. It can also be fit to side information about the users consisting of counts over different attributes (same format as the bag-of-words for items).

As it takes side information about items, it has the advantage of being able to recommend items without any ratings/clicks/plays/etc. If extended with user side information, it can also make cold-start recommendations for new users, although speed is not great for that.

It supports parallelization, different stopping criteria for the optimization procedure, and adding users/items without refitting the model entirely. The bottleneck computations are written in fast Cython code.

For a similar package for explicit feedback data see also cmfrec.

For Poisson factorization without side information see hpfrec and poismf.

Model description

The model produces non-negative low-rank matrix factorizations of count data on user-item interactions (such as the number of times each user played each song in some internet service) and on item-word counts, according to a generative model specified as follows:

Item model:
B_vk ~ Gamma(a, b)
T_ik ~ Gamma(c, d)
W_iv ~ Poisson(T * B')

Interactions model:
N_uk ~ Gamma(e, f)
E_ik ~ Gamma(g, h)
R_ui ~ Poisson(N * (T + E)')

(Where W is the bag-of-words representation of the items, R is the user-item interaction matrix, the subscripts u, i and v index users, items and words respectively, and k is the number of latent factors or topics.)
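
As a rough illustration, the following sketch draws a small synthetic dataset from this generative process with NumPy, treating the second parameter of each Gamma as a rate. The dimensions and hyperparameter values below are arbitrary choices for the example, not the package's defaults:

import numpy as np

n_users, n_items, n_words, k = 30, 50, 200, 10
a = b = c = d = e = f = g = h = 0.3   ## arbitrary Gamma shape/rate values

rng = np.random.default_rng(123)

## Item model: word loadings B (words x k) and item topics T (items x k)
B = rng.gamma(shape=a, scale=1./b, size=(n_words, k))
T = rng.gamma(shape=c, scale=1./d, size=(n_items, k))
W = rng.poisson(T @ B.T)            ## bag-of-words counts, items x words

## Interactions model: user preferences N (users x k) and item offsets E (items x k)
N = rng.gamma(shape=e, scale=1./f, size=(n_users, k))
E = rng.gamma(shape=g, scale=1./h, size=(n_items, k))
R = rng.poisson(N @ (T + E).T)      ## user-item interaction counts, users x items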

For more details see the references section at the bottom.

When adding user information, the model becomes as follows:

Item model:
B_vk ~ Gamma(a, b)
T_ik ~ Gamma(c, d)
W_iv ~ Poisson(T * B')

User model:
K_ak ~ Gamma(e, f)
O_uk ~ Gamma(l, m)
Q_ua ~ Poisson(O * K')

Interactions model:
N_uk ~ Gamma(i, j)
E_ik ~ Gamma(g, h)
R_ui ~ Poisson((O + N) * (T + E)')
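
The only change relative to the basic model is that the user attribute factors O are added to the user preferences N in the Poisson rate for R. A minimal NumPy comparison of the two rates, using random placeholder matrices rather than fitted values:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 20, 30, 5
N = rng.gamma(0.3, 1.0, size=(n_users, k))   ## user preference factors
O = rng.gamma(0.3, 1.0, size=(n_users, k))   ## factors driven by user attributes
T = rng.gamma(0.3, 1.0, size=(n_items, k))   ## item topic factors
E = rng.gamma(0.3, 1.0, size=(n_items, k))   ## item offset factors

rate_basic    = N @ (T + E).T          ## expected interactions without user side info
rate_extended = (O + N) @ (T + E).T    ## expected interactions with user side info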

A huge drawback of this model compared to LDA is that, as the matrices are non-negative, items with more words will have larger values in their factors/topics, which will result in them having higher scores regardless of their popularity. This effect can be somewhat decreased by:

- Using only a limited number of words to represent each item (scaling upwards the ones that don't have enough words).
- Standardizing the bag-of-words so that all rows sum to the same number (see the sketch after this list). This is hard to do when the counts are supposed to be integers, but the package can still work mostly fine with decimals that are at least >= 0.9, and it has an option to standardize the inputs.
- To a lesser extent, standardizing the resulting Theta shape matrix so that its rows sum to 1 (also supported in the package options).
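
As a sketch of the row-standardization idea (this is not a function provided by the package, which has its own standardization option described in the documentation), one could rescale words_df with pandas so that every item's word counts sum to the same total before calling .fit. The target total of 100 and the function name are arbitrary choices for illustration:

import pandas as pd

def standardize_bag_of_words(words_df, target_total=100.0):
	## Rescale each item's word counts so that they sum to 'target_total'
	totals = words_df.groupby('ItemId')['Count'].transform('sum')
	out = words_df.copy()
	out['Count'] = out['Count'] * (target_total / totals)
	## Drop very small entries; the package copes with decimal counts,
	## but values below ~0.9 are reported to work less well
	return out.loc[out['Count'] >= 0.9].reset_index(drop=True)

The rescaled DataFrame can then be passed as words_df to .fit in place of the raw counts.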

Installation

Note: requires a C compiler configured for Python. See this guide for instructions.

The package is available on PyPI and can be installed with:

pip install ctpfrec

Or if that fails:

pip install --no-use-pep517 ctpfrec

Note for macOS users: on macOS, the Python version of this package might compile without multi-threading capabilities. In order to enable multi-threading support, first install OpenMP:

brew install libomp

And then reinstall this package: pip install --upgrade --no-deps --force-reinstall ctpfrec.


IMPORTANT: the setup script will try to add the compilation flag -march=native. This instructs the compiler to tune the package for the CPU on which it is being installed (by e.g. using AVX instructions if available), but the result might not be usable on other computers. If building a binary wheel of this package or putting it into a Docker image which will be used on different machines, this can be overridden either by (a) defining an environment variable DONT_SET_MARCH=1, or (b) manually supplying compilation CFLAGS as an environment variable with something related to architecture. For maximum compatibility (but slowest speed), it's possible to do something like this:

export DONT_SET_MARCH=1
pip install ctpfrec

or, by specifying some compilation flag for architecture:

export CFLAGS="-march=x86-64"
pip install ctpfrec

Sample usage

import numpy as np, pandas as pd
from ctpfrec import CTPF

## Generating a fake dataset
nusers = 10**2
nitems = 10**2
nwords = 5 * 10**2
nobs   = 10**4
nobs_bag_of_words = 10**4

np.random.seed(1)
counts_df = pd.DataFrame({
	'UserId' : np.random.randint(nusers, size=nobs),
	'ItemId' : np.random.randint(nitems, size=nobs),
	'Count'  : (np.random.gamma(1, 1, size=nobs) + 1).astype('int32')
	})
counts_df = counts_df.loc[~counts_df[['UserId', 'ItemId']].duplicated()].reset_index(drop=True)

words_df = pd.DataFrame({
	'ItemId' : np.random.randint(nitems, size=nobs_bag_of_words),
	'WordId' : np.random.randint(nwords, size=nobs_bag_of_words),
	'Count'  : (np.random.gamma(1, 1, size=nobs_bag_of_words) + 1).astype('int32')
	})
words_df = words_df.loc[~words_df[['ItemId', 'WordId']].duplicated()].reset_index(drop=True)

## Fitting the model
## (Can also pass the inputs as COO matrices)
recommender = CTPF(k = 15, reindex=True)
recommender.fit(counts_df=counts_df, words_df=words_df)

## Making predictions
recommender.topN(user=10, n=10, exclude_seen=True)
recommender.topN(user=10, n=10, exclude_seen=False, items_pool=np.array([1,2,3,4]))
recommender.predict(user=10, item=11)
recommender.predict(user=[10,10,10], item=[1,2,3])
recommender.predict(user=[10,11,12], item=[4,5,6])

## Evaluating Poisson log-likelihood
recommender.eval_llk(counts_df, full_llk=True)

## Adding new items without refitting
nitems_new = 10
nobs_bow_new = 2 * 10**3
np.random.seed(5)
words_df_new = pd.DataFrame({
	'ItemId' : np.random.randint(low=nitems, high=nitems+nitems_new, size=nobs_bow_new),
	'WordId' : np.random.randint(nwords, size=nobs_bow_new),
	'Count' : np.random.gamma(1, 1, size=nobs_bow_new).astype('int32')
	})
words_df_new = words_df_new.loc[words_df_new.Count > 0]

recommender.add_items(words_df_new)

If passing reindex=True, all user and item IDs that you pass to .fit will be reindexed internally (they need to be hashable types like str, int or tuple), and you can use these same IDs to make predictions later. The IDs returned by topN are likewise the same IDs that were passed to .fit.
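
For example, string IDs work directly. A small sketch reusing the same calls as the sample above (the tiny data here is only meant to show the ID handling, not to produce meaningful recommendations):

import pandas as pd
from ctpfrec import CTPF

counts_df = pd.DataFrame({
	'UserId' : ['u1', 'u1', 'u2', 'u3'],
	'ItemId' : ['doc_a', 'doc_b', 'doc_a', 'doc_c'],
	'Count'  : [3, 1, 2, 1]
	})
words_df = pd.DataFrame({
	'ItemId' : ['doc_a', 'doc_a', 'doc_b', 'doc_c'],
	'WordId' : ['cat', 'dog', 'cat', 'bird'],
	'Count'  : [2, 1, 4, 1]
	})

model = CTPF(k=5, reindex=True)
model.fit(counts_df=counts_df, words_df=words_df)
model.topN(user='u1', n=3)   ## returns ItemIds as the original strings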

For a more detailed example, see the IPython notebook recommending products with RetailRocket's event logs illustrating its usage with the RetailRocket dataset consisting of activity logs (view, add-to-basket, purchase) and item descriptions.

Documentation

Documentation is available at readthedocs: http://ctpfrec.readthedocs.io

It is also internally documented through docstrings (e.g. you can try help(ctpfrec.CTPF), help(ctpfrec.CTPF.fit), etc.).

Speeding up optimization procedure

For faster fitting and predictions, use SciPy and NumPy libraries compiled against MKL or OpenBLAS. In Anaconda installations, they come linked against MKL by default.

The constructor for CTPF allows some parameters to make it run faster (if you know what you're doing): these are allow_inconsistent_math=True, full_llk=False, stop_crit='diff-norm', reindex=False, verbose=False. See the documentation for more details.
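
As a quick sketch, these options are all arguments to the constructor. The values shown mirror the list above; k=15 is just the value from the sample usage, and whether these settings are appropriate depends on your data:

from ctpfrec import CTPF

recommender = CTPF(
	k = 15,
	allow_inconsistent_math = True,
	full_llk = False,
	stop_crit = 'diff-norm',
	reindex = False,
	verbose = False
	)
## See help(CTPF) or the readthedocs documentation for what each option trades off.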

Saving model with pickle

Don't use pickle to save a CTPF object, as it will fail due to problems with lambda functions. Use dill instead, which has the same syntax as pickle:

import dill
from ctpfrec import CTPF

c = CTPF()

## Serialize the object to disk and load it back later
with open("CTPF_obj.dill", "wb") as f:
	dill.dump(c, f)
with open("CTPF_obj.dill", "rb") as f:
	c = dill.load(f)

References

[1] Gopalan, Prem K., Laurent Charlin, and David Blei. "Content-based recommendations with Poisson factorization." Advances in Neural Information Processing Systems. 2014.


ctpfrec's Issues

"ValueError: Categorical categories must be unique" when using .additems()

I am trying to add new articles to the recommender class using recommender.add_items(word_counts_test); however, I am presented with the error message "ValueError: Categorical categories must be unique". Can you please explain to me what this means exactly? My pandas DataFrame word_counts_test is in the required form of
columns={"ItemId":, "WordId":, "Count":}.
Surely all three columns will have non-unique categorical values, as the articles contain more than a single word and words appear in multiple articles?

Thank you

Adding new users breaks

When trying to add new users to a previously trained model, NumPy complains with:

   1813    if self.keep_data and (counts_df is not None):
-> 1814        for u in range(new_max_id):
   1815            items_this_user = counts_df.ItemId.values[counts_df.UserId == u]

TypeError: 'numpy.float64' object cannot be interpreted as an integer

It seems that new_max_id is cast to a NumPy float somewhere along the way, even when the counts and words DataFrames only contain integer identifiers.

Minimal code block to reproduce the error:

import numpy as np
import pandas as pd
from ctpfrec import CTPF

# Dummy data
counts_df = pd.DataFrame([[0,0,1],[0,1,1]], columns = ['UserId','ItemId','Count'])
words_df = pd.DataFrame([[0,0,1],[0,1,1]], columns = ['ItemId','WordId','Count'])

# Fit model
recommender = CTPF(k = 5,
                   reindex = True)
recommender.fit(counts_df = counts_df,
                words_df = words_df)

# Generate new dummy user
counts_df_new = pd.DataFrame([[1,0,1],[1,1,1]], columns = ['UserId','ItemId','Count'])

# Add new dummy user !< This breaks
recommender.add_users(counts_df = counts_df_new)

I suspected this might have had something to do with the re-indexing, but when disabling this option in the instantiation of the CTPF object, the fit call complains with:

--> 896    items_intersect = np.in1d(items_words_df, items_counts_df)
    897    words_include = self._words_df.WordId.loc[np.in1d(self._words_df.ItemId, items_words_df[items_intersect])].unique()

NameError: name 'items_counts_df' is not defined

My NumPy version is 1.18.1.

ctpfrec outputs training error, unlike hpfrec which outputs validation error

I've noticed that when training the ctpfrec model, it outputs the training error, unlike hpfrec which outputs the validation error. This is going to be problematic for model selection, as the training error will obviously continue to decrease with model complexity and therefore result in overfitting on the test set. Do you have any advice on how I can find out the validation error?

Thank you
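
One possible way to obtain a held-out number (a sketch only, using the eval_llk method shown in the sample usage above; counts_val here stands for a hypothetical held-out triplet DataFrame in the same UserId/ItemId/Count format):

## After fitting on the training triplets, evaluate the Poisson log-likelihood
## on a held-out set instead of the training data
recommender.eval_llk(counts_val, full_llk=True)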

ctpfrec is unable to perform out-of-matrix prediction

It appears that ctpfrec is unable to make out-of-matrix predictions, i.e. it can't recommend items without any ratings/clicks/plays/etc.

You did ask me to upload a toy dataset to show you, which I am having trouble doing. I am also unable to upload the datasets I am using due to GDPR.

It is, however, very simple: I have three sets of user click data (in the required pandas triplet form {"UserId" : , "ItemId" : , "Count" : }), namely user_counts_train, user_counts_validation and user_counts_test, plus another set word_counts for the items (in the required pandas triplet form {"ItemId" : , "WordId" : , "Count" : }).
Importantly, there are no items in the three user sets that aren't in the word_counts set.

I fit my model using the training and validation sets:

recommender.fit(counts_df=user_counts_train, words_df=word_counts, val_set=user_counts_validation)

The issue is when I attempt to make an out-of-matrix prediction using an item that appears only in the user_counts_test and word_counts sets via:

new_user_count = pd.DataFrame({'UserId': 1.,'ItemId': [48081576,48081576,48081576],'Count': [1,1,2]}) # user clicks on item not in the training or validation sets
recommender4.add_users(new_user_count) # add new item to recommender4
recs = recommender4.topN(user = 1, n=k, exclude_seen = False) # output top k recommendations 

Is the issue with ctpfrec itself, or the way I am attempting to add a new user history and make predictions with topN?

Thank you

Error when using .items_pool

I am trying to restrict the set of items ctpfrec recommends. My items are each uniquely identified by a string e.g '48069855'.

I have tried the following yet they all result in an error being thrown:

Using either recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=user_counts_test.ItemId.unique())
or
recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=np.array(['48069855', '47994812', '47994813', '47811334', '47809545', '47770950']))

I'm presented with:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/home/research/jackmck/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     57     try:
---> 58         return bound(*args, **kwds)
     59     except TypeError:

TypeError: Partition index must be integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-94-c4d8742971d3> in <module>()
      5 # new_user_count = pd.DataFrame({'UserId': -1,'ItemId': ['48028651','48065053','48057353'],'Count': [1,1,1]})
      6 # recommender.add_users(new_user_count)
----> 7 recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=np.array(['48069855', '47994812', '47994813', '47811334', '47809545','47770950']) ) # think about excluding seen

/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in topN(self, user, n, exclude_seen, items_pool)
   1300                         raise Exception("Can only exclude seen items when passing 'keep_data=True' to .fit")
   1301 
-> 1302                 return self._topN(self._M1[user], n, exclude_seen, items_pool, user)
   1303 
   1304         def topN_cold(self, user_df, n=10, items_pool=None, maxiter=10, ncores=1, random_seed=1, stop_thr=1e-3):

/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in _topN(self, user_vec, n, exclude_seen, items_pool, user)
   1245                         if exclude_seen:
   1246                                 n_ext = np.min([n + self._n_seen_by_user[user], items_pool.shape[0]])
-> 1247                                 rec = np.argpartition(allpreds, n_ext-1)[:n_ext]
   1248                                 seen = self.seen[self._st_ix_user[user] : self._st_ix_user[user] + self._n_seen_by_user[user]]
   1249                                 if self.reindex:

<__array_function__ internals> in argpartition(*args, **kwargs)

/home/research/jackmck/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py in argpartition(a, kth, axis, kind, order)
    830 
    831     """
--> 832     return _wrapfunc(a, 'argpartition', kth, axis=axis, kind=kind, order=order)
    833 
    834 

/home/research/jackmck/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     65         # Call _wrapit from within the except clause to ensure a potential
     66         # exception has a traceback chain.
---> 67         return _wrapit(obj, method, *args, **kwds)
     68 
     69 

/home/research/jackmck/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     42     except AttributeError:
     43         wrap = None
---> 44     result = getattr(asarray(obj), method)(*args, **kwds)
     45     if wrap:
     46         if not isinstance(result, mu.ndarray):

TypeError: Partition index must be integer

If I pass the ItemIds as integers, rather than in their original string format:
recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=np.array([48069855, 47994812, 47994813, 4781133, 47809545, 47770950]) )
I am presented with the error:

ValueError                                Traceback (most recent call last)
<ipython-input-97-6594e242d050> in <module>()
      5 # new_user_count = pd.DataFrame({'UserId': -1,'ItemId': ['48028651','48065053','48057353'],'Count': [1,1,1]})
      6 # recommender.add_users(new_user_count)
----> 7 recommender.topN(user = -1, n=5, exclude_seen = True, items_pool=np.array([48069855, 47994812, 47994813, 4781133, 47809545, 47770950]) ) # think about excluding seen

/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in topN(self, user, n, exclude_seen, items_pool)
   1300                         raise Exception("Can only exclude seen items when passing 'keep_data=True' to .fit")
   1301 
-> 1302                 return self._topN(self._M1[user], n, exclude_seen, items_pool, user)
   1303 
   1304         def topN_cold(self, user_df, n=10, items_pool=None, maxiter=10, ncores=1, random_seed=1, stop_thr=1e-3):

/home/research/jackmck/.local/lib/python3.7/site-packages/ctpfrec/__init__.py in _topN(self, user_vec, n, exclude_seen, items_pool, user)
   1230                                         del nan_ix
   1231                                         if items_pool_reind.shape[0] == 0:
-> 1232                                                 raise ValueError("No items to recommend.")
   1233                                         elif items_pool_reind.shape[0] == 1:
   1234                                                 raise ValueError("Only 1 item to recommend.")

ValueError: No items to recommend.
