bstewart / stm
An R Package for the Structural Topic Model
License: Other
Does anyone know how duplicate documents can be deleted?
I am having difficulties finding a command to extract the per-document topic probabilities from the topic models I have created. Using the topicmodels package, it can be done like this:
library("topicmodels")
k = 30 # arbitrary number of topics (there are ways to optimise this)
JSS_TM <- LDA(JSS_dtm, k) # make topic model
# make data frame where rows are documents, columns are topics and cells
# are posterior probabilities of topics
JSS_topic_df <- setNames(as.data.frame(JSS_TM@gamma), paste0("topic_",1:k))
# add row names that link each document to a human-readable bit of data
# in this case we'll just use a few words of the title of each paper
row.names(JSS_topic_df) <- lapply(1:length(JSS_papers[,1]), function(i) gsub("\\s","_",substr(JSS_papers[,1][[i]], 1, 60)))
Is there a way to accomplish this using the stm package?
Kind Regards,
Rioh
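For reference: stm stores the equivalent quantity in the fitted model's theta matrix (one row per document, one column per topic), so the same kind of data frame can be built directly. A minimal sketch, where `fit` is a toy stand-in for a fitted stm model:

```r
# stm's analogue of JSS_TM@gamma is fit$theta: posterior topic
# proportions, rows = documents, columns = topics.
k <- 3
fit <- list(theta = matrix(c(0.2, 0.3, 0.5,
                             0.6, 0.1, 0.3),
                           nrow = 2, byrow = TRUE))  # toy stand-in
topic_df <- setNames(as.data.frame(fit$theta), paste0("topic_", 1:k))
```

With a real model, `fit$theta` comes straight from `stm()` and the rest is identical.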
Hi,
I have 86k short documents (each longer than 50 characters) and I got an error about integer overflow:
> processed <- textProcessor(data$documents, metadata = data,
verbose = T, lowercase = F, removestopwords = F, removenumbers = F, removepunctuation = F, stem = F)
Building corpus...
Creating Output...
Warning messages:
1: In nr * nc : NAs produced by integer overflow
2: In nr * nc : NAs produced by integer overflow
3: In nr * nc : NAs produced by integer overflow
4: In nr * nc : NAs produced by integer overflow
I'm using stm from CRAN (1.1.3). How many documents can stm handle?
P.S. I reduced the number of documents, but got another error after prepDocuments:
Error in stm(out$documents, out$vocab, K = 15, ... :
number of observations in content covariate (34086) prevalence
covariate (32733) and documents (34086) are not all equal
Hi,
It would be very convenient if date vectors worked not only in stm but also in estimateEffect. The required conversion to numeric is a little annoying, especially when it comes to plotting. Can you think of a way where this could be handled automatically?
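As a stopgap, the conversion itself is one line; a hedged sketch (assuming a Date column `date` in the metadata), with the reverse mapping that is needed for relabelling plot axes:

```r
# Sketch: estimateEffect needs numeric covariates, so convert the Date
# column up front and keep the origin for relabelling plot axes later.
meta <- data.frame(date = as.Date(c("2016-01-01", "2016-06-01", "2016-12-31")))
meta$date_num <- as.numeric(meta$date)                 # days since 1970-01-01
back <- as.Date(meta$date_num, origin = "1970-01-01")  # reverse, for labels
```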
Hi!
I have a problem when running stm() with K=0.
I'm working with a trimmed dfm (a 4613 x 6048 sparse Matrix of class "dfmSparse" with 396486 entries).
The following error shows:
Error in tsneAnchor(Q) : an unknown error has occured in Rtsne
The weird thing is that I have run the command several times and never had this problem before (it has only happened in the last week).
Do you have any suggestions about what the problem may be?
Thanks,
Noemi
When running a batch script I get the following error:
stm v1.2.1 (2017-03-06) successfully loaded. See ?stm for help.
Error in validObject(r) : invalid class “dgTMatrix” object: lengths of slots i and j must match
Calls: stm ... stm.control -> opt.beta -> mnreg -> sparseMatrix -> validObject
Execution halted
Batch script:
`
library(quanteda)
library(stm)
library(pryr)
setwd("/home/XXX/Scratch/R_output/tmpdir/Corpora2Analyse")
load("presDfmSTM")
load("PostProcessMeta") # called meta
mem_used()
mem_change(wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest, max.em.its = 1, data = meta, init.type = "Spectra$
t <- system.time(wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest, max.em.its = 1, data = meta, init.type = "S$
iterations <- 3
repeat {
wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest,
max.em.its = iterations, data = meta, model = wood_cement_content, init.type = "Spectral")
iterations <- iterations + 1
if(stmFitted$convergence$converged == TRUE){
break
}
save(wood_cement_content, file="/home/XXX/Scratch/R_output/tmpdir/Corpora2Analyse/wood_cement_content_not_converged")
print(iterations)
}
save(wood_cement_content, file="/home/XXX/Scratch/R_output/tmpdir/Corpora2Analyse/wood_cement_content_converged")
`
Output from batch script:
`>
library(quanteda)
library(stm)
library(pryr)
setwd("/home/XXX/Scratch/R_output/tmpdir/Corpora2Analyse")
load("presDfmSTM")
load("PostProcessMeta") # called meta
mem_used()
222 MB
mem_change(wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest, max.em.its = 1, data = meta, init.type = "Spect$
Note: no visible binding for global variable 'Dimnames'
Note: no visible binding for global variable 'Dimnames'
Note: no visible binding for global variable 'Dimnames'
Beginning Initialization.
Calculating the gram matrix...
Note: no visible binding for global variable 'Dimnames'
Note: no visible binding for global variable 'Dimnames'
Note: no visible binding for global variable 'Dimnames'
Finding anchor words...
Recovering initialization...
..............................................................................................
Initialization complete.
....................................................................................................
Completed E-Step (601 seconds).
....................................................................................................
Completed M-Step (613 seconds).
Model Terminated Before Convergence Reached
266 MB
t <- system.time(wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest, max.em.its = 1, data = meta, init.type = $
Beginning Initialization.
Calculating the gram matrix...
Finding anchor words...
Recovering initialization...
..............................................................................................
Initialization complete.
....................................................................................................
Completed E-Step (638 seconds).
....................................................................................................
Completed M-Step (906 seconds).
Model Terminated Before Convergence Reached
iterations <- 3
repeat {
max.em.its = iterations, data = meta, model = wood_cement_content, init.type = "Spectral")
break
`
SessionInfo:
`> sessioninfo()
Error: could not find function "sessioninfo"
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.2 (Maipo)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
`
Files:
I've been using the stm package successfully with a number of different text corpora.
I recently started using it on a new set of documents and I'm encountering an error that I can't seem to debug. This corpus is slightly larger than what I've worked with in the past (around 21,000 documents and about 4,900 terms). Other than that, it is a different sub-sample from the same data source that I've successfully used in the past.
I've pulled down some data from SQL with the documents and metadata in a single data frame, and I've followed the standard steps from the vignette to prepare the documents (which I've successfully done in the past with a number of different data sets).
When I attempt to use stm() I get the same error message each time:
Error: not compatible with requested type
The error is slightly different depending on the init.type that I use.
With spectral I get the following message:
Beginning Initialization.
Calculating the gram matrix...
Finding anchor words...
....................
Recovering initialization...
.................................................
Initialization complete.
Error: not compatible with requested type
For both Random and LDA initialization I get the same error message, but directly after the message:
Beginning Initialization.
I'm pretty sure that this is a scoping problem with the C++ code, but I can't figure it out.
Let me know if you have questions or if I can do more to clarify.
Thanks,
Lewis
I have a quick question about reproducibility. What precautions can I take (if any) to make sure that the same exact corpus results in the exact same stm fit from run-to-run (including different machines)?
Also, while you're at it: can I re-fit a subset of the original corpus using the fitNewDocuments function and retrieve the same topic model scores?
Thanks for reading!
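On the first question, a hedged sketch: stm() accepts a seed argument, and fixing it (plus set.seed for anything upstream) should reproduce a fit on the same machine and package version; bit-identical results across different machines are harder to guarantee because BLAS and compiler differences can change floating-point results.

```r
library(stm)
# Sketch: pin the RNG both globally and via stm's own seed argument.
# `documents`, `vocab`, and `meta` are assumed to come from prepDocuments().
set.seed(02138)
fit_a <- stm(documents, vocab, K = 20, data = meta, seed = 02138)
set.seed(02138)
fit_b <- stm(documents, vocab, K = 20, data = meta, seed = 02138)
all.equal(fit_a$theta, fit_b$theta)  # should be TRUE on one machine
```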
In investigating #71, I found that a lot of time is being spent doing a 50-dimension PCA as preprocessing for tSNE using princomp. We could replace this with a much faster randomized alternative, such as the rsvd package, and then jump straight to the tSNE step.
When doing the heldout likelihood fitting, you need to have the documents numbered in the same way as the original STM model (same vocab, same numbers corresponding to each word). We need a function that can convert another document set in STM format over to the vocab of the first (i.e. drop extra words and renumber).
@kbenoit Do you have any code for doing this in quanteda that I could include in the help file for users? The use case is basically: I create a quanteda dfm, I now have some new documents, and I want what their rows in the original dfm would have been. No worries if you guys don't do this sort of thing, just thought I'd check.
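A minimal base-R sketch of such a conversion (a hypothetical helper, not part of the package): drop words absent from the reference vocab and renumber the rest. stm-format documents are 2-row matrices of word index and count.

```r
# Hypothetical helper: re-express stm-format documents against a
# reference vocabulary, dropping words it does not contain.
align_to_vocab <- function(docs, vocab, ref_vocab) {
  lapply(docs, function(doc) {
    words <- vocab[doc[1, ]]              # word indices -> word strings
    keep  <- words %in% ref_vocab         # drop words unknown to reference
    idx   <- match(words[keep], ref_vocab)  # renumber against reference
    matrix(c(idx, doc[2, keep]), nrow = 2, byrow = TRUE)
  })
}

# Toy example: one document over a new vocab, aligned to an old one.
docs      <- list(matrix(c(1, 2, 3,     # word indices
                           2, 1, 4),    # counts
                         nrow = 2, byrow = TRUE))
vocab     <- c("apple", "banana", "cherry")
ref_vocab <- c("banana", "cherry", "date")
aligned   <- align_to_vocab(docs, vocab, ref_vocab)
```

Here "apple" is dropped, and "banana"/"cherry" are renumbered to the reference indices 1 and 2 with their counts intact.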
Hi, I encountered the following problem with searchK and wonder if anyone could shed some light on the possible causes. Thanks in advance!
I have a corpus that, after implementing prepDocuments, contains about 1.3 million documents, three thousand terms, and 15 million tokens. Then, when I used searchK, it shows an error during "Recovering initialization..." The error says: "Error in t.default(La.res$vt) : argument is not a matrix."
Could this be because the corpus is too large? When I test-ran it with a small subset of one thousand documents, it worked smoothly without any errors...
Thank you!
I'm getting an error from the textProcessor() function when I try to use it on my full 4000-row df (6 MB as an .rda), but no error on a 100-row sample of it.
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning messages:
1: In mclapply(content(x), FUN, ...) :
scheduled core 1 encountered error in user code, all values of the job will be affected
Here's a 1000 line sample of the data I'm using: https://www.dropbox.com/s/60n86skjf9f3ybo/recap-sample.rda
I used both the CRAN and current GitHub versions of STM with the same error.
Code to replicate:
library(stm)
load("recap-sample.rda")
recap.processed <- textProcessor(documents=recapsample$Text, metadata=recapsample)
# error
recapsample <- recapsample[1:100,]
recap.processed <- textProcessor(documents=recapsample$Text, metadata=recapsample)
# no error
Any idea what's going on?
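One thing worth checking before calling textProcessor (a guess at the cause, not a confirmed fix): NA, empty, or non-character entries in the text column, which can make tm's readers fail inside mclapply and surface as a "try-error" object. A base-R sketch with a toy stand-in for `recapsample$Text`:

```r
# Sketch: flag rows whose text field may break corpus construction.
texts <- c("a normal document", NA, "", 42)  # toy stand-in for recapsample$Text
texts <- as.character(texts)
bad <- which(is.na(texts) | !nzchar(trimws(texts)))
bad  # row indices to inspect or drop before calling textProcessor()
```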
Hi,
whenever I try to use spectral initialization (either for searchK or an stm fit) all my remaining memory (~13 GB) is instantly consumed and R crashes with the error:
Error: cannot allocate vector of size 1.7 Gb
I tried this on Windows 10 and Linux Mint 17. My corpus has 20133 documents, 21532 terms and 402629 tokens.
Despite this, LDA initialization works completely fine for model fitting but crashes in the searchK function:
storage<-searchK(documents, vocab, K=c(40,60), init.type = 'LDA',
prevalence =~ CIO2, data=meta)
Beginning Initialization.
Error in structure(.Call("collapsedGibbsSampler", documents, as.integer(K), :
document must be a matrix with 2 rows of type Integer.
CIO2 is a binary variable from the meta object. I'm using stm version 1.10. Do you have an idea what is going wrong here?
Thank you for making this wonderful package available to the research community! Do you have any script to parallelize stm? Any plans on this front? Thanks again!
Hi there,
I'm looking to apply my trained stm model to new data, something like the posterior() function in topicmodels. Any suggestions?
I should note that my new data has the same covariates as the training set, and I'm using a prevalence model.
Thanks,
Rochelle
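For what it's worth, newer versions of stm include fitNewDocuments() for exactly this use case. A hedged sketch (argument names as in the function's documentation; `fit`, `newdocs`, `newmeta`, `meta`, and `covariate` are assumptions, and `newdocs` must already be numbered against the training vocabulary):

```r
library(stm)
# Sketch: score held-out documents under an existing prevalence model.
scored <- fitNewDocuments(model = fit, documents = newdocs,
                          newData = newmeta, origData = meta,
                          prevalence = ~ covariate,
                          prevalencePrior = "Covariate")
scored$theta  # topic proportions for the new documents
```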
Any reason why permutationTest does not work? Thanks
temp<-textProcessor(documents=gadarian$open.ended.response,metadata=gadarian)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
documents <- out$documents
vocab <- out$vocab
meta <- out$meta
set.seed(02138)
mod.out <- stm(documents, vocab, 3, prevalence=~treatment + s(pid_rep), data=meta)
summary(mod.out)
prep <- estimateEffect(1:3 ~ treatment + s(pid_rep), mod.out, meta)
plot(prep, "treatment", model=mod.out,
method="difference",cov.value1=1,cov.value2=0)
test <- permutationTest(formula=~treatment + s(pid_rep), stmobj=mod.out,
treatment="treatment", nruns=25, documents=documents,
vocab=vocab, data=meta, stmverbose=FALSE)
plot(test,2, xlab="Effect", ylab="Model Index", main="Topic 2 Placebo Test")
Using the approach that Leeper shows in the pull request for plot.estimateEffect.
Hi,
I noticed some strange things in the summary plots and in the toLDAvis() results.
First, summary plots for STM models do not allow showing topic proportions with only one or zero terms per topic:
> plot.STM(model, type='summary', n=1)
Error in FUN(X[[i]], ...) : second argument must be a list
> plot.STM(model, type='summary', n=0)
Error in FUN(X[[i]], ...) : second argument must be a list
I think it would be useful if one could also choose to show only the proportions, without any terms. Even cooler would be the possibility to convert every STM plot into a ggplot, which could then be passed to plotly in one line of code, and we'd have interactive graphs ;-)
Second, if I understand correctly, the marginal topic distributions in LDAvis should be equivalent to the topic proportions in STM, right?
When comparing visualizations this does not seem to be the case:
plot.STM(model, type='summary', n=2)
visout <- toLDAvis(model, out$documents)
prepLDAvis()
For example, if you compare the blob sizes of topics 20 and 14, they are quite different to the topic proportions in the STM plot.
An image for reproduction is available here.
Using the same raw dataset and running searchK repeatedly generates a different "results" table each time. That is,
search_output = searchK(...)
search_output$results
prints a different table each time. Why is this? (It matters because different "results" tables may suggest different choices of K.)
Thanks!
Does it include English-specific code? Can I simply input Spanish text and will it work?
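For context, the model itself is language-agnostic; only the preprocessing defaults (stop words, stemming) are language-specific, and textProcessor exposes a language argument for that. A sketch with hypothetical inputs `docs_es` and `meta_es`:

```r
library(stm)
# Sketch: Spanish stop-word removal and stemming via textProcessor's
# language argument (passed through to tm / SnowballC).
processed <- textProcessor(docs_es, metadata = meta_es, language = "spanish")
```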
Is there a way to keep hashtags in the documents without having to rely on external textprocessors? I would like to remove punctuation and numbers but keep hashtags.
At the moment I'm using quanteda preprocessing which enables just this, but the resulting matrix has to be converted back to a stm friendly format afterwards.
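The conversion step is a one-liner in recent quanteda via convert(). A sketch, assuming a character vector `txts` (the hashtag behavior of remove_punct is worth verifying on your quanteda version):

```r
library(quanteda)
# Sketch: tokenise keeping #hashtags, then hand the dfm to stm directly.
toks  <- tokens(txts, remove_punct = TRUE, remove_numbers = TRUE)
mydfm <- dfm(toks)
out   <- convert(mydfm, to = "stm")  # list with $documents, $vocab, $meta
```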
Hi,
with the recent development version of stm, fitting a model always freezes during the E-step when including a specific factor covariate.
This works just fine:
library(stm)
stm14g <- stm(documents=prep_stm_14g$documents, vocab=prep_stm_14g$vocab,
data= prep_stm_14g$meta,
init.type='Spectral', K=5, prevalence=~ s(day))
And this always freezes:
stm14g <- stm(documents=prep_stm_14g$documents, vocab=prep_stm_14g$vocab,
data= prep_stm_14g$meta,
init.type='Spectral', K=5, prevalence=~ s(day) + topic_id)
Whenever I keyboard-interrupt after a freeze, a different arbitrary error is raised every time. Including the covariate in the model does not make much sense here, as it's basically just an identifier for each document. But I guess stm should still not freeze without an error.
Please find a file for reproduction here.
Hi,
I'm aware that the computation-intensive parts of STM are written in C++. But does STM in general benefit from multicore setups? As an example, would switching to MRO result in a substantial performance increase?
Re: #38 it might be a good idea to have a specific error that checks whether observations are dropped for missingness in the model.matrix. This would be clearer for the end user.
Thanks for a great package.
I would like to make an effect plot using ggplot2.
Is there any way to get the underlying data for the effect plot so I can draw it with ggplot?
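One hedged route: plot.estimateEffect() invisibly returns the quantities it drew, so they can be captured and reshaped for ggplot2 instead of being recomputed. The field names below ($topics, $means, $cis) are assumptions that vary by stm version, so inspect str(p) on your own object first; `prep` and `mod.out` are assumed from an earlier estimateEffect/stm call.

```r
library(stm)
library(ggplot2)
# Sketch: capture what plot.estimateEffect computed. With method
# "difference" there is one estimate and one CI per topic.
p <- plot(prep, "treatment", model = mod.out, method = "difference",
          cov.value1 = 1, cov.value2 = 0)
eff <- data.frame(topic = unlist(p$topics),
                  diff  = unlist(p$means),
                  lower = sapply(p$cis, `[`, 1),
                  upper = sapply(p$cis, `[`, 2))
ggplot(eff, aes(x = diff, y = factor(topic))) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower, xmax = upper))
```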
It's a wonderful package, and I'm trying to use the DMR topic model, but I find there is something wrong with the init.type parameter of the manyTopics function.
like this:
Error in match.arg(init.type) :
'arg' should be one of "LDA", "Random", "Spectral"
I installed the latest version, 1.0.12.
Why does this happen? Hope you can help me. Thanks!
Hello,
First of all, thank you for the package!
I have a corpus of about 10,000 documents with a mean length of about 6,000 characters (not words). In order to determine the number of topics I ran searchK over a wide range of values from 5 to 400.
The issue is that coherence peaks at 5 topics (which is too low), while the heldout likelihood as well as exclusivity seem to increase monotonically in K.
Looking at the data, it seems odd to me that the two indicators point in opposite directions. From a substantive point of view, 5 topics seem far too few while 400 are too many.
I wonder if this is kind of a common issue and if there is a way to handle it.
Thanks in advance,
ftt
R keeps "not responding" even when I am just typing in an R script. I tried uninstalling R and installing it again, but that did not work. Can anyone help?
This is what I got from the diagnostic report:
SysInfo:
sysname release version nodename machine login
"Windows" ">= 8 x64" "build 9200" "JOVANA" "x86-64" "jkarano1"
user effective_user
"jkarano1" "jkarano1"
R Version:
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
19 May 2016 15:37:29 [rsession-jkarano1] ERROR system error 32 (The process cannot access the file because it is being used by another process) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/monitored/user-settings/user-settings]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::open_w(boost::shared_ptr<std::basic_ostream >, bool) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:1052; LOGGED FROM: void rstudio::core::Settings::writeSettings() C:\Users\Administrator\rstudio\src\cpp\core\Settings.cpp:156
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/77D3D8A3, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/77D3D8A3]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/933D541B, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/933D541B]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/B2D7A33B, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/B2D7A33B]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/D063547B, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/D063547B]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/D63CB0E6, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/D63CB0E6]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/DC3DA251, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/DC3DA251]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/EC47F429, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/EC47F429]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
21 Mar 2017 16:13:58 [rsession-jkarano1] ERROR system error 32 (The process cannot access the file because it is being used by another process) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/addin_registry]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::open_w(boost::shared_ptr<std::basic_ostream >, bool) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:1111; LOGGED FROM: void rstudio::session::modules::r_addins::{anonymous}::AddinRegistry::saveToFile(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\session\modules\SessionRAddins.cpp:114
21 Mar 2017 17:16:36 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/Documents/Master thesis/.RDataTmp]; OCCURRED AT: time_t rstudio::core::FilePath::lastWriteTime() const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:586; LOGGED FROM: time_t rstudio::core::FilePath::lastWriteTime() const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:586
17 Apr 2017 13:17:20 [rsession-jkarano1] ERROR r-graphics error 6 (Plot rendering error); OCCURRED AT: rstudio::core::Error rstudio::r::session::graphics::Plot::renderFromDisplay() C:\Users\Administrator\rstudio\src\cpp\r\session\graphics\RGraphicsPlot.cpp:139; CAUSED BY: ERROR r error 4 (R code execution error) [errormsg=cannot open the connection]; OCCURRED AT: rstudio::core::Error rstudio::r::exec::{anonymous}::evaluateExpressionsUnsafe(SEXP, SEXP, SEXPREC**, rstudio::r::sexp::Protect*, rstudio::r::exec::{anonymous}::EvalType) C:\Users\Administrator\rstudio\src\cpp\r\RExec.cpp:159; LOGGED FROM: virtual void rstudio::r::session::graphics::PlotManager::render(boost::function<void(rstudio::r::session::graphics::DisplayState)>) C:\Users\Administrator\rstudio\src\cpp\r\session\graphics\RGraphicsPlotManager.cpp:481
20 Apr 2017 09:28:10 [rsession-jkarano1] ERROR system error 109 (The pipe has been ended); OCCURRED AT: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198; LOGGED FROM: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198
20 Apr 2017 16:43:48 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/Documents/Master thesis/~WRD2881.tmp]; OCCURRED AT: time_t rstudio::core::FilePath::lastWriteTime() const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:586; LOGGED FROM: time_t rstudio::core::FilePath::lastWriteTime() const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:586
24 Apr 2017 08:54:32 [rsession-jkarano1] CLIENT EXCEPTION (rsession-jkarano1): (TypeError) : undefined is not an object (evaluating 'a.n.applicable');|||com/google/gwt/dev/jjs/intrinsic/com/google/gwt/lang/Exceptions.java#28::wrap|||com/google/web/bindery/event/shared/SimpleEventBus.java#173::doFire|||com/google/gwt/event/shared/HandlerManager.java#117::fireEvent|||com/google/gwt/event/shared/HandlerManager.java#117::fireEvent|||com/google/gwt/user/client/ui/Widget.java#127::fireEvent|||com/google/gwt/user/client/ui/Widget.java#127::fireEvent|||com/google/gwt/event/logical/shared/ValueChangeEvent.java#40::fire|||org/rstudio/core/client/widget/SearchWidget.java#163::onKeyUp|||com/google/gwt/event/dom/client/KeyUpEvent.java#55::dispatch|||com/google/web/bindery/event/shared/SimpleEventBus.java#173::doFire|||Client-ID: 33e600bb-c1b1-46bf-b562-ab5cba070b0e|||User-Agent: Mozilla/5.0 (Windows NT 6.2 WOW64) AppleWebKit/538.1 (KHTML, like Gecko) rstudio Safari/538.1 Qt/5.4.1
28 Apr 2017 10:46:57 [rsession-jkarano1] ERROR system error 109 (The pipe has been ended); OCCURRED AT: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198; LOGGED FROM: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198
28 Apr 2017 19:08:33 [rsession-jkarano1] CLIENT EXCEPTION (rsession-jkarano1): (TypeError) : undefined is not an object (evaluating 'a.n.applicable');|||com/google/gwt/dev/jjs/intrinsic/com/google/gwt/lang/Exceptions.java#28::wrap|||com/google/web/bindery/event/shared/SimpleEventBus.java#173::doFire|||com/google/gwt/event/shared/HandlerManager.java#117::fireEvent|||com/google/gwt/event/shared/HandlerManager.java#117::fireEvent|||com/google/gwt/user/client/ui/Widget.java#127::fireEvent|||com/google/gwt/user/client/ui/Widget.java#127::fireEvent|||com/google/gwt/event/logical/shared/ValueChangeEvent.java#40::fire|||org/rstudio/core/client/widget/SearchWidget.java#163::onKeyUp|||com/google/gwt/event/dom/client/KeyUpEvent.java#55::dispatch|||com/google/web/bindery/event/shared/SimpleEventBus.java#173::doFire|||Client-ID: 33e600bb-c1b1-46bf-b562-ab5cba070b0e|||User-Agent: Mozilla/5.0 (Windows NT 6.2 WOW64) AppleWebKit/538.1 (KHTML, like Gecko) rstudio Safari/538.1 Qt/5.4.1
28 Apr 2017 21:54:58 [rsession-jkarano1] ERROR system error 109 (The pipe has been ended); OCCURRED AT: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198; LOGGED FROM: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198
I first came across this issue when trying to pass a custom stop word list to textProcessor. I was still ending up with some of my stop words in the final output because they abutted punctuation (e.g., including customstopwords = "hi" didn't remove "hi."). This behavior is well documented in textProcessor, but I still think it is not ideal.
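Until the ordering changes, one hypothetical workaround is to strip punctuation yourself before calling textProcessor(), so a custom stop word like "hi" also matches "hi.". Here `docs` is an assumed character vector of texts:

```r
# Remove punctuation up front so stop words match their punctuated forms;
# then disable textProcessor's own punctuation pass to avoid double work.
docs <- gsub("[[:punct:]]+", " ", docs)
processed <- textProcessor(docs, customstopwords = c("hi"),
                           removepunctuation = FALSE)
```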
I would like to reorder the operations in textProcessor so that punctuation is removed first, then stop words are removed, then stemming is performed. I think this will produce results more in line with what the user expects.
I'll open a PR with these changes (including documentation updates) if you're amenable to the change.
Chris
The method of handling labels in plot.STM falls apart at very large K because the text is shifted by a constant:
text(frequency[invrank[i]] + 0.01, i, lab[invrank[i]],
family = family, pos = 4, cex = text.cex)
whereas it should be shifted by an amount that depends on the scale of the data.
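A sketch of that suggestion, scaling the offset by the range of the plotted frequencies instead of hard-coding 0.01 (variable names follow the snippet above):

```r
# Offset proportional to the data range, so labels stay legible at any K
offset <- 0.01 * diff(range(frequency))
text(frequency[invrank[i]] + offset, i, lab[invrank[i]],
     family = family, pos = 4, cex = text.cex)
```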
Right now the minimum number of labels in labelTopics is 2, as noted in #21.
I added documentation but in the future we may want to change this. The key is to be careful about breaking the many things that are downstream of labelTopics().
Hi Brandon,
Can you confirm whether the following behavior is expected for fitNewDocuments?
If a hold-out corpus has only 1 term in it, fitNewDocuments will throw the error, "not a matrix", during the optimization step. If I force all documents to have at least 2 terms, I do not receive an error. Is this expected? I don't see it in the documentation, although it's reasonable to expect weird behavior when a document is almost empty.
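If this is indeed the expected behavior, a hypothetical pre-filtering step (the names `newdocs`/`newmeta` are assumptions) would be to drop hold-out documents with fewer than 2 unique terms before calling fitNewDocuments(). In stm's document format each document is a 2-row matrix whose columns are unique terms:

```r
# Keep only documents with at least 2 unique terms, and subset the
# metadata in lockstep so rows stay aligned with documents.
keep <- vapply(newdocs, function(d) ncol(d) >= 2, logical(1))
newdocs <- newdocs[keep]
newmeta <- newmeta[keep, , drop = FALSE]
```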
Thanks!
Hi Brandon,
The following describes a likely bug:
When scoring new documents with fitNewDocuments from a correlated topic model, I receive an error re: mu[,i]. Although I don't have the full error message at the moment, the error appears related to lines 181 and 282 of the fitNewDocuments code block here. Namely, when:
prevtype == "Average",
gamma is not null, and
the topic model is a CTM
mu is a vector (line 181). However, on line 282, mu is referenced as a matrix. The function appears to work properly after changing the source code to ensure mu is a matrix in the case described above.
Please let me know if my observations are not reasonable. Thanks!
Hi,
I'd like to use a date variable (Y-m-d) as a prevalence covariate in the model. While fitting the model works after converting to date format, trying to plot estimated effects (continuous) does not.
How can I prevent this from happening? I saw that you converted dates to a simple days
variable in your vignette. Are there alternatives?
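A minimal sketch of the vignette-style conversion (the column name `meta$date` is an assumption): turn the Y-m-d strings into Date objects, then into a numeric days-since-start covariate that estimateEffect can plot:

```r
# Convert date strings to a numeric covariate for use as prevalence
meta$date <- as.Date(meta$date, format = "%Y-%m-%d")
meta$days <- as.numeric(meta$date - min(meta$date))
# then use prevalence = ~ s(days) in stm() and estimateEffect()
```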
Cheers
Your comment on the last issue was a big help.
I ran searchK and the model diagnostics, and chose the number of topics based on the highest held-out likelihood.
As far as I know, there are other indicators for choosing the number of topics, such as perplexity.
Is there a specific function or way to calculate perplexity from the stm package or an stm-generated object?
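stm does not expose perplexity directly, but a close substitute it does provide is the held-out likelihood via make.heldout() and eval.heldout(). A sketch, assuming `out` is the result of prepDocuments():

```r
library(stm)
# Hold out a portion of tokens, fit on the rest, then score the held-out part
heldout <- make.heldout(out$documents, out$vocab)
fit <- stm(heldout$documents, heldout$vocab, K = 15)
eval.heldout(fit, heldout$missing)$expected.heldout
```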
Hi,
is there a reason why searchK
fits models and stores the diagnostics afterwards, but not the models themselves? Especially for models that take a long time to compute, this seems suboptimal.
See issue raised in #41
Hi,
after computing searchK
for models with content covariates, an error is raised:
Error in exclusivity(model, M = M, frexw = 0.7) :
Exclusivity calculation only designed for models without content covariates
I think a more elegant solution would be either to not calculate exclusivity for content models in searchK, or to forbid using searchK on them in the first place.
Cheers,
Carsten
prepDocuments is returning NULL for docs.removed. This prevents using the metadata properly, because the size of the metadata won't match up with the size of the dtm. Example:
out <- readCorpus(tst$term_document_matrix, type = "slam")
str(tst$term_document_matrix)
List of 6
$ i : int [1:6312] 22 92 116 40 42 119 132 133 113 119 ...
$ j : int [1:6312] 1 1 3 4 4 4 4 4 5 5 ...
$ v : num [1:6312] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 137
$ ncol : int 2547
$ dimnames:List of 2
..$ Terms: chr [1:137] "@beslimandtrim15" "@herbalifetruth" "@naomifrances1" "@quoththeravensa" ...
..$ Docs : chr [1:2547] "1" "2" "3" "4" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
out2 <- stm::prepDocuments(out$documents, out$vocab, tst$tweet.df, verbose = TRUE)
Detected Missing Terms, renumbering
Removing 586 of 2013 terms (586 of 6312 tokens) due to frequency
Your corpus now has 137 documents, 1427 terms and 5726 tokens.
out2$docs.removed
NULL
nrow(tst$tweet.df)
[1] 2547
nrow(out2$meta)
[1] 2547
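One possible explanation (an assumption on my part, not confirmed): readCorpus(type = "slam") treats rows as documents, but the object above is a TermDocumentMatrix with terms in rows (137 terms x 2547 docs), which is why the corpus ends up with 137 "documents" against 2547 metadata rows. Transposing first may align the counts:

```r
library(tm)   # provides t() for TermDocumentMatrix objects
# Transpose so documents are rows, as readCorpus(type = "slam") expects
dtm <- t(tst$term_document_matrix)   # now 2547 docs x 137 terms
out <- stm::readCorpus(dtm, type = "slam")
out2 <- stm::prepDocuments(out$documents, out$vocab, tst$tweet.df)
```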
Also, the documentation is not really clear about how the metadata is specified and used. Must every field be populated, or will it handle NAs? Does it distinguish between factors, dates, ordered factors, and numerics? I can't test this because I can't get stm to accept any of my metadata due to the error above.
Thanks!
LDA initialization behaves strangely at K=2:
Beginning Initialization.
Error in base::colMeans(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
In addition: Warning message:
In stm(doc, voc, K = 2, prevalence = ~Dec, data = meta, max.em.its = 1000, :
K=2 is equivalent to a unidimensional scaling model which you may prefer.
outputdata (in produce_cmatrix.r) returns the median or mode of a variable, depending on its type.
If we have a character or factor vector in which 2+ levels are tied for the same number of observations, outputdata will return all of those levels.
Ultimately this leads to a problem: either cdata will not have the same number of rows as the output or, more importantly, we'll have different values for the controls.
Hi, when trying to plot effect estimates in a model with a content covariate, an error is raised:
# post_type: factor (4 levels)
# numdate: continuous numeric
# core: dummy
comments <- stm(out$documents,out$vocab,K=50,
prevalence =~ post_type + s(numdate) * core,
content =~ core,
data=out$meta,
max.em.its =150, seed=1337, emtol= 1e-4,
init.type='Spectral', verbose=T)
prep <- estimateEffect(c(12) ~ post_type + s(numdate) * core,
comments, metadata=out$meta) # takes 20 minutes
Error in names(cdata) <- covariate : object 'cdata' not found
From what I can tell this does not depend on the plot method:
plot.estimateEffect(prep, 'post_type')
Error in names(cdata) <- covariate : object 'cdata' not found
Hi Brandon,
is the code in the development branch stable enough to use? As far as I understand, the fix for #16 is not yet in the master branch, right?
Hi Brandon,
as stated in the description of STM, fitting a model with only two topics corresponds to unidimensional scaling. I'd like to ask your opinion on whether this is reasonable, and whether it might even be a better solution than alternative scaling methods.
Say there is a corpus of legislative debates where document information such as speaker_id, speaker_party, debate_id, and debate_date is available. Wouldn't STM then be superior to simply using something like wordfish? Since we can incorporate more information into the model (e.g., all the covariates mentioned above), this should give us better estimates, right?
I played around with this a little bit and noticed some things that raised questions:
- topicCorr reveals that the topics are perfectly correlated. Shouldn't the sigma.prior then be adjusted, and if so, what would be a reasonable value?
- When running estimateEffect, it raises a warning about a non-singular matrix. How problematic is this for proper estimates? Obviously such factors are crucial for determining speaker positions.
- Should estimateEffect parameters like nsims be adjusted?
Thanks for your help :)
Carsten
Hi again -
I'd like to make three changes in my next PR:
- textProcessor uses the tm package heavily for manipulating text, but stm doesn't import that package. It would be easier to just import tm. What do you think? I understand wanting to reduce dependencies, but for such a critical function, I think we should just import it. Or should we use the :: operator to access these functions?
Chris
Hi Brandon,
when trying to estimate effects on topic proportions locally I receive an error:
prep <- estimateEffect(c(1:86) ~ s(numdate), pegida_stm, metadata=posts, uncertainty='Local',
documents=out$documents)
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be an array of at least two dimensions
Traceback:
1. estimateEffect(c(1:86) ~ s(numdate), pegida_stm, metadata = posts,
     uncertainty = "Local", documents = out$documents)
2. thetaPosterior(stmobj, nsims = 1, type = type, documents = documents)
3. thetapost.local(model, documents, nsims)
4. ln.hess(eta, theta, doc.beta, doc.ct, siginv)
5. colSums(EB)
6. colSums(EB)
7. base::colSums(x, na.rm = na.rm, dims = dims, ...)
8. stop("'x' must be an array of at least two dimensions")
Is it incorrect to use out$documents
for the documents argument? Global estimation works without problems for me. Could you give a short explanation of why one would prefer one over the other (local vs. global)?
Thanks and Cheers,
Carsten
Hi,
I just discovered the little helper function toLDAvis
and really like this way of quickly inspecting the model. Is it also possible to feed another metric to LDAvis, say, FREX values?
And is there an easy way to store the topic proportions in a variable? I'd like to feed them into a graph of topic correlations.
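For the topic proportions: the fitted stm object stores the document-topic proportions in its theta matrix (one row per document, one column per topic). A sketch, where `model` is an assumed name for an stm fit:

```r
# Extract per-document topic proportions and label the columns
theta <- model$theta
colnames(theta) <- paste0("topic_", seq_len(ncol(theta)))
props <- as.data.frame(theta)  # keep for later use, e.g. a correlation graph
```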