
collabtm's Introduction

Reference

P. Gopalan, L. Charlin, D.M. Blei, Content-based recommendations with Poisson factorization, NIPS 2014.

Paper PDF

Installation

Required libraries: gsl, gslblas, pthread

On Linux/Unix run

./configure
make; make install

Note: We have NOT tested the code on Mac OS; we recommend using Linux.

On Mac OS, the locations of the required gsl, gslblas and pthread libraries may need to be specified:

./configure LDFLAGS="-L/opt/local/lib" CPPFLAGS="-I/opt/local/include"
make; make install

The binary 'collabtm' will be installed in /usr/local/bin unless a different prefix is provided to configure. (See INSTALL.)

COLLABTM: Nonnegative Collaborative Topic Modeling tool

collabtm [OPTIONS]

-dir <string>            path to dataset directory with files described under INPUT below

-ndocs <int>             number of documents

-nusers <int>            number of users

-nvocab <int>            size of vocabulary
	    
-k <int>                 latent dimensionality

-fixeda                  fix the document length correction factor ('a') to 1

-lda-init                use LDA based initialization (see below)

OPTIONAL:

-binary-data             treat observed ratings data as binary; if rating > 0 then rating is treated as 1

-doc-only                use document data only

-ratings-only            use ratings data only

-content-only            use both data, but predict only with the topic affinities (i.e., topic offsets are 0)

EXPERIMENTAL:

-online                  use stochastic variational inference

-seq-init -doc-only	     use sequential initialization for document only fits

RECOMMENDED

We recommend running CTPF using the following options:

~/src/collabtm -dir /mendeley -nusers 80278 -ndocs 261248 -nvocab 10000 -k 100 -lda-init -fixeda

If the document lengths are expected to vary significantly, we recommend additionally running without the "-fixeda" option above.

The above options depend on LDA-based fits being available for the document portion of the model. See below.

INPUT

The following files must be present in the data directory (as indicated by the '-dir' switch):

train.tsv, test.tsv, validation.tsv, test_users.tsv

train/valid/test files contain triplets in the following format (one per line): userID itemID rating

where tab characters separate the fields.

test_users.tsv contains the userIDs of all users that are tested on (one per line).

Two additional files are needed: mult.dat and vocab.dat. (Despite the extension, they are plain text files.) These form the "document" portion of the data. Each line of mult.dat is a document and has the following format:

 <number of words> <word-id0:count0> <word-id1:count1>....

Each line of vocab.dat is a word. Note that both the word indices and the document indices start at 0: a word-id in vocab.dat can be 0, and the document id "rated" in train.tsv can be 0.
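For illustration, a minimal Python sketch that writes a toy mult.dat and vocab.dat in the format above. The documents and vocabulary here are made up, and the leading count on each mult.dat line is taken to be the number of distinct word ids (the lda-c convention referenced below); treat that as an assumption, not part of the original description.

# Toy documents: each maps word-id -> count. Word ids and document ids are 0-based.
docs = [
    {0: 2, 3: 1, 5: 4},   # document 0
    {1: 1, 2: 3},         # document 1
]
vocab = ["apple", "banana", "cherry", "date", "elderberry", "fig"]

with open("mult.dat", "w") as f:
    for doc in docs:
        # <number of words> <word-id0:count0> <word-id1:count1> ...
        pairs = " ".join("%d:%d" % (w, c) for w, c in sorted(doc.items()))
        f.write("%d %s\n" % (len(doc), pairs))

with open("vocab.dat", "w") as f:
    for word in vocab:
        # line i holds the word with word-id i
        f.write(word + "\n")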

EXAMPLE

Run two versions -- one with the correction scalar 'a' inferred and one with 'a' fixed at 1. One of these fits may be better than the other. The "-fixeda" option assumes that the documents are of similar lengths.

Always use LDA-based initialization.

~/src/collabtm -dir /mendeley -nusers 80278 -ndocs 261248 -nvocab 10000 -k 100 -lda-init

~/src/collabtm -dir /mendeley -nusers 80278 -ndocs 261248 -nvocab 10000 -k 100 -fixeda -lda-init

LDA BASED INITIALIZATION

  1. Run Chong's gibbs sampler to obtain LDA fits on the word frequencies (see below for details)

  2. Create a directory "lda-fits" within the "dataset directory" above and put two files in it: the topics beta-lda-k.tsv and the memberships theta-lda-k.tsv. If K=100, these files will be named beta-lda-k100.tsv and theta-lda-k100.tsv, respectively.

  3. Run collabtm inference with the -lda-init option as follows (the -fixeda option fixes 'a' at 1):

~/src/collabtm -dir /mendeley -nusers 80278 -ndocs 261248 -nvocab 10000 -k 100 -lda-init

~/src/collabtm -dir /mendeley -nusers 80278 -ndocs 261248 -nvocab 10000 -k 100 -lda-init -fixeda

CHONG's GIBBS SAMPLER

The LDA code is provided under the "lda" directory.

For example, run LDA with 50 topics, the topic Dirichlet (eta) set to 0.01, and the topic-proportion Dirichlet (alpha) set to 0.1, as follows:

./lda --directory fit_50/ --train_data ~/arxiv/dat/mult_lda.dat --num_topics 50 --eta 0.01 --alpha 0.1 --max_iter -1 --max_time -1

mult_lda.dat contains the documents (see David Blei's lda-c package for the exact format: http://www.cs.princeton.edu/~blei/lda-c/index.html).

Note: The values of eta and alpha need to match those used when loading the LDA fits in CTPF (see collabtm.cc:initialize()).

The output directory ("fit_50/" in the above example) will contain the fit files which can be used to initialize CTPF with -lda-init option. Specifically *.topics corresponds to beta-lda-k.tsv, and *.doc.states corresponds to theta-lda-k.tsv.
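As a rough sketch of step 2 of the LDA-based initialization above (assuming K=100, a sampler output directory named fit_100/, and the dataset directory /mendeley; adjust these to your setup), the fit files can be copied into lda-fits/ under the names CTPF expects:

import glob, os, shutil

fit_dir = "fit_100"       # output directory of the Gibbs sampler (assumed name)
data_dir = "/mendeley"    # the directory passed to collabtm with -dir
k = 100

lda_fits = os.path.join(data_dir, "lda-fits")
os.makedirs(lda_fits, exist_ok=True)

topics = glob.glob(os.path.join(fit_dir, "*.topics"))[0]       # the topics
states = glob.glob(os.path.join(fit_dir, "*.doc.states"))[0]   # the memberships

shutil.copy(topics, os.path.join(lda_fits, "beta-lda-k%d.tsv" % k))
shutil.copy(states, os.path.join(lda_fits, "theta-lda-k%d.tsv" % k))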

collabtm's People

Contributors

lcharlin, premgopalan


collabtm's Issues

Training likelihood decreasing with iterations

If I pass a validation file which is a copy of the train file and look at the numbers in validation.txt, I see that at some point the log-likelihood starts decreasing (moving away from zero) rather than increasing.

I’m not an expert in variational inference, but, if doing full-batch updates in which each parameter is set to its expected value given the other variables, shouldn’t the training likelihood be monotonically increasing with respect to the number of iterations?

The dataset is under this link:
https://drive.google.com/open?id=1FzBzQnGU3bQ3ojLIGy9Hby9A6Tcun-JQ

Called with the following parameters:
collabtm -dir path_to_data -nusers 191770 -ndocs 119448 -nvocab 342260 -k 100

Log-likelihood shows a decrease at iterations 60 and 70, after which it stops.

0	121	-14.392807046	434084
10	1331	-13.920642836	434084
20	2543	-12.258906021	434084
30	3767	-12.187407095	434084
40	4989	-12.173852715	434084
50	6210	-12.170551069	434084
60	7458	-12.172230009	434084
70	8680	-12.180070428	434084
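(For anyone reproducing this, a small sketch that flags decreases in such a log; the column layout -- iteration, elapsed time, log-likelihood, number of ratings -- is assumed from the excerpt above rather than taken from the source code.)

prev = None
with open("validation.txt") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 3:
            continue
        iteration, loglik = int(fields[0]), float(fields[2])
        if prev is not None and loglik < prev:
            # the log-likelihood moved away from zero between logged iterations
            print("log-likelihood decreased at iteration %d: %.9f -> %.9f"
                  % (iteration, prev, loglik))
        prev = loglik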

Bad_alloc with even the smallest datasets

I've been trying to run this software on an artificially generated dataset, and I constantly run out of memory (bad_alloc) even on small datasets.

As an example, I generated the following random data in a Python script:

import numpy as np, pandas as pd
from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split

nusers = 200
nitems = 300
ntopics = 30
nwords = 250

np.random.seed(1)
# gamma hyperparameters for the synthetic data
a = .3 + np.random.gamma(.1, .05)
b = .3 + np.random.gamma(.1, .05)
c = .3 + np.random.gamma(.1, .05)
d = .3 + np.random.gamma(.1, .05)
e = .3 + np.random.gamma(.1, .05)
f = .5 + np.random.gamma(.1, .05)
g = .3 + np.random.gamma(.1, .05)
h = .5 + np.random.gamma(.1, .05)

np.random.seed(1)
# document-word counts W from topics (Beta) and item topic intensities (Theta)
Beta = np.random.gamma(a, b, size=(nwords, ntopics))
Theta = np.random.gamma(c, d, size=(nitems, ntopics))
W = np.random.poisson(Theta.dot(Beta.T) + np.random.gamma(1, 1, size=(nitems, nwords)), size=(nitems, nwords))

# user-item counts R from user preferences (Eta) and item factors (Theta + Epsilon)
Eta = np.random.gamma(e, f, size=(nusers, ntopics))
Epsilon = np.random.gamma(g, h, size=(nitems, ntopics))
R = np.random.poisson(Eta.dot(Theta.T+Epsilon.T) + np.random.gamma(1, 1, size=(nusers, nitems)), size=(nusers, nitems))

Rcoo=coo_matrix(R)
df = pd.DataFrame({
    'UserId':Rcoo.row,
    'ItemId':Rcoo.col,
    'Count':Rcoo.data
})

df_train, df_test = train_test_split(df, test_size=0.3, random_state=1)
df_test, df_val = train_test_split(df_test, test_size=0.33, random_state=2)

df_train.sort_values(['UserId', 'ItemId'], inplace=True)
df_test.sort_values(['UserId', 'ItemId'], inplace=True)
df_val.sort_values(['UserId', 'ItemId'], inplace=True)

df_train['Count'] = df_train.Count.values.astype('float32')
df_test['Count'] = df_test.Count.values.astype('float32')
df_val['Count'] = df_val.Count.values.astype('float32')

df_train.to_csv("<dir>/train.tsv", sep='\t', index=False, header=False)
df_test.to_csv("<dir>/test.tsv", sep='\t', index=False, header=False)
df_val.to_csv("<dir>/validation.tsv", sep='\t', index=False, header=False)
pd.DataFrame({"UserId":list(set(list(df_test.UserId.values)))})\
.to_csv("<dir>/test_users.tsv", index=False, header=False)


Wcoo = coo_matrix(W)
Wdf = pd.DataFrame({
    'ItemId':Wcoo.row,
    'WordId':Wcoo.col,
    'Count':Wcoo.data
})
def mix(a, b):
    # format one mult.dat line: "<number of words> <word-id:count> ..."
    nx = len(a)
    out = str(nx) + " "
    for i in range(nx):
        out += str(a[i]) + ":" + str(float(b[i])) + " "
    return out
Wdf.groupby('ItemId').agg(lambda x: tuple(x)).apply(lambda x: mix(x['WordId'], x['Count']), axis=1)\
.to_frame().to_csv("<dir>/mult.dat", index=False, header=False)

pd.DataFrame({'col1':np.arange(nwords)}).to_csv("<dir>/vocab.dat", index=False, header=False)

This generates files that look as follows:

  • train.tsv:
0	0	4.0
0	1	6.0
0	5	5.0
0	7	5.0
0	9	2.0
0	10	5.0
  • test.tsv:
0	2	1.0
0	4	4.0
0	12	4.0
0	14	3.0
0	16	4.0
  • validation.tsv
0	23	5.0
0	30	3.0
0	32	1.0
0	33	2.0
0	46	3.0
  • test_users.tsv:
0
1
2
3
4
  • vocab.dat:
0
1
2
3
4
5
  • mult.dat:
141 0:2.0 1:4.0 2:1.0 3:2.0 5:1.0 6:2.0 9:2.0 11:1.0 15:2.0 16:3.0 17:4.0 19:3.0 21:1.0 22:4.0 23:1.0 24:3.0 26:1.0 27:1.0 29:1.0 32:3.0 33:2.0 34:1.0 35:2.0 36:1.0 39:2.0 41:1.0 42:6.0 44:1.0 45:2.0 47:1.0 48:1.0 53:5.0 54:2.0 57:1.0 63:6.0 65:1.0 66:2.0 67:1.0 68:1.0 69:1.0 72:1.0 73:1.0 76:5.0 78:1.0 79:5.0 80:1.0 83:2.0 84:3.0 86:1.0 88:5.0 89:1.0 90:4.0 92:1.0 93:2.0 94:1.0 96:2.0 98:1.0 100:4.0 107:2.0 108:1.0 109:2.0 112:2.0 113:4.0 116:1.0 119:1.0 120:2.0 124:3.0 125:7.0 129:2.0 130:1.0 132:3.0 136:1.0 137:1.0 138:3.0 139:2.0 140:1.0 143:4.0 144:2.0 145:2.0 146:10.0 148:2.0 149:2.0 150:1.0 152:4.0 155:6.0 156:2.0 157:3.0 159:2.0 161:4.0 162:1.0 163:2.0 170:1.0 171:1.0 173:3.0 174:4.0 175:3.0 176:1.0 177:1.0 180:2.0 183:1.0 185:1.0 186:2.0 187:4.0 189:1.0 190:2.0 194:1.0 196:2.0 197:2.0 198:2.0 199:4.0 200:3.0 202:2.0 204:1.0 205:1.0 206:1.0 208:1.0 209:1.0 210:3.0 212:2.0 214:1.0 217:1.0 218:1.0 219:2.0 220:1.0 221:1.0 223:2.0 226:1.0 227:1.0 228:1.0 231:1.0 232:4.0 233:4.0 235:1.0 236:2.0 238:3.0 239:1.0 242:1.0 243:1.0 246:4.0 248:2.0 249:2.0 
156 1:1.0 2:1.0 3:3.0 5:2.0 7:1.0 8:1.0 9:1.0 10:1.0 13:1.0 15:2.0 17:1.0 19:2.0 21:3.0 22:3.0 23:2.0 24:1.0 26:1.0 27:1.0 28:1.0 31:1.0 33:1.0 34:5.0 36:2.0 38:1.0 39:4.0 40:1.0 41:1.0 42:1.0 43:4.0 44:2.0 46:2.0 47:3.0 50:1.0 52:1.0 53:3.0 54:2.0 56:2.0 57:1.0 58:4.0 59:2.0 60:3.0 63:1.0 66:1.0 67:2.0 69:2.0 74:2.0 75:2.0 77:1.0 78:3.0 79:1.0 81:3.0 82:2.0 83:1.0 84:3.0 85:2.0 86:3.0 88:2.0 89:3.0 92:1.0 94:1.0 96:1.0 97:2.0 98:1.0 99:3.0 100:1.0 101:2.0 103:1.0 104:1.0 106:3.0 110:1.0 113:1.0 115:1.0 118:2.0 120:4.0 121:3.0 122:1.0 123:3.0 128:1.0 133:3.0 135:1.0 137:1.0 138:2.0 139:2.0 141:1.0 143:2.0 147:1.0 148:2.0 149:1.0 151:1.0 154:1.0 155:4.0 157:1.0 158:1.0 160:4.0 161:2.0 162:5.0 163:1.0 164:5.0 165:1.0 166:1.0 167:4.0 168:3.0 170:1.0 172:1.0 175:1.0 177:1.0 180:4.0 181:1.0 183:1.0 184:1.0 186:1.0 187:1.0 189:1.0 190:5.0 193:2.0 194:3.0 195:7.0 197:2.0 198:2.0 200:1.0 201:1.0 202:2.0 207:2.0 208:2.0 209:1.0 210:3.0 212:8.0 213:2.0 214:2.0 216:1.0 217:1.0 218:1.0 220:4.0 222:1.0 223:1.0 224:2.0 225:4.0 226:1.0 227:1.0 228:6.0 229:3.0 230:1.0 231:1.0 232:1.0 236:2.0 237:1.0 238:2.0 240:2.0 242:1.0 243:2.0 244:2.0 245:2.0 246:3.0 247:6.0 248:2.0 249:2.0 

(I tried both integer and decimal values in this last file, but it didn't make a difference.)

These seem to fit the description of the files on the main page.

However, after trying to run this program on this data (with and without the last two arguments):

collabtm -dir ~/<dir> -nusers 200 -ndocs 300 -nvocab 250 -k 20 -fixeda -lda-init

The program allocates more and more memory until it reaches around 8 GB, at which point it throws bad_alloc and terminates.

Am I missing something?

Validation and test not taking all the available data

When looking at the log file validation.txt, I see that the program is using fewer entries than there are in the file validation.tsv. As a concrete example, I've uploaded a dataset at this link:
https://drive.google.com/open?id=1FzBzQnGU3bQ3ojLIGy9Hby9A6Tcun-JQ

in which the train.tsv and validation.tsv files are identical, so I would not expect any data to be discarded (unless there is some filter on the minimum number of entries a user must have, or something similar). All users in train.tsv have >= 2 items.

The files contain 625315 entries each, but the log says there are 434084 ratings. The same happens for test.txt, which says there are 62197 entries, while the file test.tsv contains 96485 entries.

The program was run with the following parameters:
collabtm -dir /home/david/RRoff -nusers 191770 -ndocs 119448 -nvocab 342260 -k 50
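(A possible, unverified way to narrow this down: count how many validation entries have user and document ids within the ranges implied by -nusers/-ndocs, and how many belong to users that appear in train.tsv. The thresholds below come from the command line above; whether collabtm actually filters on either criterion is an assumption.)

nusers, ndocs = 191770, 119448   # values passed to collabtm above

train_users = set()
with open("train.tsv") as f:
    for line in f:
        user, doc, rating = line.rstrip("\n").split("\t")
        train_users.add(int(user))

total = in_range = seen_user = 0
with open("validation.tsv") as f:
    for line in f:
        user, doc, rating = line.rstrip("\n").split("\t")
        user, doc = int(user), int(doc)
        total += 1
        if 0 <= user < nusers and 0 <= doc < ndocs:
            in_range += 1
        if user in train_users:
            seen_user += 1

print("total:", total, "in range:", in_range, "user seen in train:", seen_user)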
