premgopalan / collabtm
Collaborative Topic Modeling. P. Gopalan, L. Charlin, D.M. Blei, Content-based recommendations with Poisson factorization, NIPS 2014.
License: GNU General Public License v3.0
When looking at the log file validation.txt, I see that the program is taking fewer entries than there are in the file validation.tsv. As a concrete example, I've uploaded a dataset under this link:
https://drive.google.com/open?id=1FzBzQnGU3bQ3ojLIGy9Hby9A6Tcun-JQ
in which the train.tsv and validation.tsv files are identical, so I see no reason to discard any data (unless there is some filter on the minimum number of entries a user must have, or something like that). All users in train.tsv have >= 2 items.
The files contain 625315 entries each, but the log says there are 434084 ratings. The same happens with test.txt, which says there are 62197 entries while the file test.tsv contains 96485 entries.
The program was run with the following parameters:
collabtm -dir /home/david/RRoff -nusers 191770 -ndocs 119448 -nvocab 342260 -k 50
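In case it helps narrow this down, here is a quick diagnostic (my own sketch, not part of collabtm) to check whether duplicate (user, item) pairs, out-of-range ids, or zero counts could account for the dropped entries:

import pandas as pd

# Load the raw validation file; columns assumed to be user, item, rating.
df = pd.read_csv("validation.tsv", sep="\t", header=None,
                 names=["user", "item", "rating"])

print("total entries:", len(df))
# Duplicate (user, item) pairs might be collapsed or skipped at load time.
print("duplicate (user, item) pairs:", df.duplicated(["user", "item"]).sum())
# Ids outside [0, nusers) / [0, ndocs) could be silently dropped.
print("max user id:", df["user"].max(), "max item id:", df["item"].max())
print("zero ratings:", (df["rating"] == 0).sum())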
I've been trying to run this software on an artificially generated dataset, and I constantly run out of memory (bad_alloc), even on small datasets.
As an example, I generated the following random data in a Python script:
import numpy as np, pandas as pd
from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split

nusers = 200
nitems = 300
ntopics = 30
nwords = 250

# Randomly drawn gamma hyperparameters.
np.random.seed(1)
a = .3 + np.random.gamma(.1, .05)
b = .3 + np.random.gamma(.1, .05)
c = .3 + np.random.gamma(.1, .05)
d = .3 + np.random.gamma(.1, .05)
e = .3 + np.random.gamma(.1, .05)
f = .5 + np.random.gamma(.1, .05)
g = .3 + np.random.gamma(.1, .05)
h = .5 + np.random.gamma(.1, .05)

# Latent factors and observed counts, roughly following the CTPF generative model.
np.random.seed(1)
Beta = np.random.gamma(a, b, size=(nwords, ntopics))     # topic-word weights
Theta = np.random.gamma(c, d, size=(nitems, ntopics))    # document topic intensities
W = np.random.poisson(Theta.dot(Beta.T) + np.random.gamma(1, 1, size=(nitems, nwords)))  # word counts
Eta = np.random.gamma(e, f, size=(nusers, ntopics))      # user preferences
Epsilon = np.random.gamma(g, h, size=(nitems, ntopics))  # document offsets
R = np.random.poisson(Eta.dot(Theta.T + Epsilon.T) + np.random.gamma(1, 1, size=(nusers, nitems)))  # ratings

# Keep the non-zero ratings as (user, item, count) triplets.
Rcoo = coo_matrix(R)
df = pd.DataFrame({
    'UserId': Rcoo.row,
    'ItemId': Rcoo.col,
    'Count': Rcoo.data
})
df_train, df_test = train_test_split(df, test_size=0.3, random_state=1)
df_test, df_val = train_test_split(df_test, test_size=0.33, random_state=2)
df_train = df_train.sort_values(['UserId', 'ItemId'])
df_test = df_test.sort_values(['UserId', 'ItemId'])
df_val = df_val.sort_values(['UserId', 'ItemId'])
df_train['Count'] = df_train.Count.values.astype('float32')
df_test['Count'] = df_test.Count.values.astype('float32')
df_val['Count'] = df_val.Count.values.astype('float32')
df_train.to_csv("<dir>/train.tsv", sep='\t', index=False, header=False)
df_test.to_csv("<dir>/test.tsv", sep='\t', index=False, header=False)
df_val.to_csv("<dir>/validation.tsv", sep='\t', index=False, header=False)
pd.DataFrame({"UserId": sorted(set(df_test.UserId.values))})\
    .to_csv("<dir>/test_users.tsv", index=False, header=False)

# Word counts in the LDA-C style expected for mult.dat:
# "<num unique words> wordid:count wordid:count ..."
Wcoo = coo_matrix(W)
Wdf = pd.DataFrame({
    'ItemId': Wcoo.row,
    'WordId': Wcoo.col,
    'Count': Wcoo.data
})

def mix(a, b):
    nx = len(a)
    out = str(nx) + " "
    for i in range(nx):
        out += str(a[i]) + ":" + str(float(b[i])) + " "
    return out

Wdf.groupby('ItemId').agg(lambda x: tuple(x)).apply(lambda x: mix(x['WordId'], x['Count']), axis=1)\
    .to_frame().to_csv("<dir>/mult.dat", index=False, header=False)
pd.DataFrame({'col1': np.arange(nwords)}).to_csv("<dir>/vocab.dat", index=False, header=False)
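Before running collabtm, a quick sanity check (my own sketch; <dir> is the same placeholder as above) can confirm that the line counts and id ranges of the generated files agree with the -nusers/-ndocs/-nvocab flags:

import pandas as pd

for name in ["train", "test", "validation"]:
    part = pd.read_csv("<dir>/%s.tsv" % name, sep="\t", header=None,
                       names=["UserId", "ItemId", "Count"])
    print(name, "entries:", len(part),
          "max user:", part.UserId.max(), "max item:", part.ItemId.max())

# mult.dat should have one line per item with at least one non-zero word count.
with open("<dir>/mult.dat") as f:
    print("mult.dat lines:", sum(1 for _ in f))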
The generated files look as follows.

train.tsv / test.tsv / validation.tsv (user, item, count):
0 0 4.0
0 1 6.0
0 5 5.0
0 7 5.0
0 9 2.0
0 10 5.0
0 2 1.0
0 4 4.0
0 12 4.0
0 14 3.0
0 16 4.0
0 23 5.0
0 30 3.0
0 32 1.0
0 33 2.0
0 46 3.0
test_users.tsv:
0
1
2
3
4
vocab.dat:
0
1
2
3
4
5

mult.dat:
141 0:2.0 1:4.0 2:1.0 3:2.0 5:1.0 6:2.0 9:2.0 11:1.0 15:2.0 16:3.0 17:4.0 19:3.0 21:1.0 22:4.0 23:1.0 24:3.0 26:1.0 27:1.0 29:1.0 32:3.0 33:2.0 34:1.0 35:2.0 36:1.0 39:2.0 41:1.0 42:6.0 44:1.0 45:2.0 47:1.0 48:1.0 53:5.0 54:2.0 57:1.0 63:6.0 65:1.0 66:2.0 67:1.0 68:1.0 69:1.0 72:1.0 73:1.0 76:5.0 78:1.0 79:5.0 80:1.0 83:2.0 84:3.0 86:1.0 88:5.0 89:1.0 90:4.0 92:1.0 93:2.0 94:1.0 96:2.0 98:1.0 100:4.0 107:2.0 108:1.0 109:2.0 112:2.0 113:4.0 116:1.0 119:1.0 120:2.0 124:3.0 125:7.0 129:2.0 130:1.0 132:3.0 136:1.0 137:1.0 138:3.0 139:2.0 140:1.0 143:4.0 144:2.0 145:2.0 146:10.0 148:2.0 149:2.0 150:1.0 152:4.0 155:6.0 156:2.0 157:3.0 159:2.0 161:4.0 162:1.0 163:2.0 170:1.0 171:1.0 173:3.0 174:4.0 175:3.0 176:1.0 177:1.0 180:2.0 183:1.0 185:1.0 186:2.0 187:4.0 189:1.0 190:2.0 194:1.0 196:2.0 197:2.0 198:2.0 199:4.0 200:3.0 202:2.0 204:1.0 205:1.0 206:1.0 208:1.0 209:1.0 210:3.0 212:2.0 214:1.0 217:1.0 218:1.0 219:2.0 220:1.0 221:1.0 223:2.0 226:1.0 227:1.0 228:1.0 231:1.0 232:4.0 233:4.0 235:1.0 236:2.0 238:3.0 239:1.0 242:1.0 243:1.0 246:4.0 248:2.0 249:2.0
156 1:1.0 2:1.0 3:3.0 5:2.0 7:1.0 8:1.0 9:1.0 10:1.0 13:1.0 15:2.0 17:1.0 19:2.0 21:3.0 22:3.0 23:2.0 24:1.0 26:1.0 27:1.0 28:1.0 31:1.0 33:1.0 34:5.0 36:2.0 38:1.0 39:4.0 40:1.0 41:1.0 42:1.0 43:4.0 44:2.0 46:2.0 47:3.0 50:1.0 52:1.0 53:3.0 54:2.0 56:2.0 57:1.0 58:4.0 59:2.0 60:3.0 63:1.0 66:1.0 67:2.0 69:2.0 74:2.0 75:2.0 77:1.0 78:3.0 79:1.0 81:3.0 82:2.0 83:1.0 84:3.0 85:2.0 86:3.0 88:2.0 89:3.0 92:1.0 94:1.0 96:1.0 97:2.0 98:1.0 99:3.0 100:1.0 101:2.0 103:1.0 104:1.0 106:3.0 110:1.0 113:1.0 115:1.0 118:2.0 120:4.0 121:3.0 122:1.0 123:3.0 128:1.0 133:3.0 135:1.0 137:1.0 138:2.0 139:2.0 141:1.0 143:2.0 147:1.0 148:2.0 149:1.0 151:1.0 154:1.0 155:4.0 157:1.0 158:1.0 160:4.0 161:2.0 162:5.0 163:1.0 164:5.0 165:1.0 166:1.0 167:4.0 168:3.0 170:1.0 172:1.0 175:1.0 177:1.0 180:4.0 181:1.0 183:1.0 184:1.0 186:1.0 187:1.0 189:1.0 190:5.0 193:2.0 194:3.0 195:7.0 197:2.0 198:2.0 200:1.0 201:1.0 202:2.0 207:2.0 208:2.0 209:1.0 210:3.0 212:8.0 213:2.0 214:2.0 216:1.0 217:1.0 218:1.0 220:4.0 222:1.0 223:1.0 224:2.0 225:4.0 226:1.0 227:1.0 228:6.0 229:3.0 230:1.0 231:1.0 232:1.0 236:2.0 237:1.0 238:2.0 240:2.0 242:1.0 243:2.0 244:2.0 245:2.0 246:3.0 247:6.0 248:2.0 249:2.0
(I tried switching between integers and decimals for the values in this last file, but it made no difference.)
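For reference, here is a minimal parser (my own sketch, with the layout as I understand it from the LDA-C convention) for double-checking the mult.dat lines:

def parse_mult_line(line):
    # Assumed layout: "<num unique words> wordid:count wordid:count ..."
    fields = line.split()
    n = int(fields[0])
    counts = {}
    for pair in fields[1:]:
        word, count = pair.split(":")
        counts[int(word)] = float(count)
    assert len(counts) == n, "leading count does not match number of pairs"
    return counts

with open("<dir>/mult.dat") as f:
    for line in f:
        parse_mult_line(line)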
These files seem to fit the description on the main page. However, when I run the program on this data (with and without the last two arguments):
collabtm -dir ~/<dir> -nusers 200 -ndocs 300 -nvocab 250 -k 20 -fixeda -lda-init
it starts allocating a lot of memory, up to around 8GB, at which point it throws bad_alloc and terminates.
Am I missing something?
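For scale, a back-of-the-envelope count (my own rough estimate, assuming the model keeps dense gamma shape/rate pairs for the theta, beta, eta and epsilon matrices of the paper) suggests the parameters themselves should take well under a megabyte at these dimensions, so the ~8GB allocation must be coming from somewhere else:

nusers, ndocs, nvocab, k = 200, 300, 250, 20

# Assuming one gamma variational distribution (shape + rate) per entry of
# the theta, beta, eta and epsilon matrices from the CTPF model.
entries = (ndocs * k      # theta: document topic intensities
           + nvocab * k   # beta: topics over the vocabulary
           + nusers * k   # eta: user preferences
           + ndocs * k)   # epsilon: document offsets
print("parameter entries:", entries)                       # 21000
print("approx. bytes (2 doubles each):", entries * 2 * 8)  # ~0.3 MB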
If I pass a validation file that is a copy of the training file and look at the numbers in validation.txt, I see that at some point the log-likelihood starts decreasing (moving away from zero) rather than increasing.
I'm not an expert in variational inference, but if the algorithm performs full-batch updates in which each parameter is set to its expected value given the other variables, shouldn't the training likelihood increase monotonically with the number of iterations? (Or at least the ELBO, which is what coordinate ascent is guaranteed to increase.)
The dataset is under this link:
https://drive.google.com/open?id=1FzBzQnGU3bQ3ojLIGy9Hby9A6Tcun-JQ
Called with the following parameters:
collabtm -dir path_to_data -nusers 191770 -ndocs 119448 -nvocab 342260 -k 100
The log-likelihood decreases at iterations 60 and 70, after which the program stops (presumably hitting its convergence criterion):
0 121 -14.392807046 434084
10 1331 -13.920642836 434084
20 2543 -12.258906021 434084
30 3767 -12.187407095 434084
40 4989 -12.173852715 434084
50 6210 -12.170551069 434084
60 7458 -12.172230009 434084
70 8680 -12.180070428 434084
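To spot the turnaround programmatically across runs, a small helper (my own sketch; the column meanings are my reading of the log above) that flags iterations where the validation log-likelihood dropped:

import pandas as pd

# Assumed columns: iteration, elapsed time, mean log-likelihood,
# number of validation ratings.
log = pd.read_csv("validation.txt", sep=r"\s+", header=None,
                  names=["iteration", "elapsed", "loglik", "nratings"])
drops = log[log["loglik"].diff() < 0]
print(drops[["iteration", "loglik"]])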