bstewart / stm
An R Package for the Structural Topic Model
License: Other
Does anyone know how duplicate documents can be deleted?
I am having difficulties finding a command to extract the per-document topic probabilities from the topic models I have created. Using the topicmodels package, it can be done like this:
library("topicmodels")
k = 30 # arbitrary number of topics (there are ways to optimise this)
JSS_TM <- LDA(JSS_dtm, k) # make topic model
# make data frame where rows are documents, columns are topics and cells
# are posterior probabilities of topics
JSS_topic_df <- setNames(as.data.frame(JSS_TM@gamma), paste0("topic_",1:k))
# add row names that link each document to a human-readable bit of data
# in this case we'll just use a few words of the title of each paper
row.names(JSS_topic_df) <- lapply(1:length(JSS_papers[,1]), function(i) gsub("\\s","_",substr(JSS_papers[,1][[i]], 1, 60)))
Is there a way to accomplish this using the stm package?
Kind Regards,
Rioh
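For reference: stm stores the equivalent quantity in the fitted model's theta matrix (one row per document, one column per topic), so the same kind of data frame can be built directly. A minimal sketch, where `fit` is a toy stand-in for a fitted stm model:

```r
# stm's analogue of JSS_TM@gamma is fit$theta: posterior topic
# proportions, rows = documents, columns = topics.
k <- 3
fit <- list(theta = matrix(c(0.2, 0.3, 0.5,
                             0.6, 0.1, 0.3),
                           nrow = 2, byrow = TRUE))  # toy stand-in
topic_df <- setNames(as.data.frame(fit$theta), paste0("topic_", 1:k))
```

With a real model, `fit$theta` comes straight from `stm()` and the rest is identical.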
Hi,
I have 86k short documents (each longer than 50 characters) and I got an error about integer overflow:
> processed <- textProcessor(data$documents, metadata = data,
verbose = T, lowercase = F, removestopwords = F, removenumbers = F, removepunctuation = F, stem = F)
Building corpus...
Creating Output...
Warning messages:
1: In nr * nc : NAs produced by integer overflow
2: In nr * nc : NAs produced by integer overflow
3: In nr * nc : NAs produced by integer overflow
4: In nr * nc : NAs produced by integer overflow
I'm using stm from CRAN (1.1.3). How many documents can stm handle?
P.S. I reduced the number of documents, but got another error after prepDocuments:
Error in stm(out$documents, out$vocab, K = 15, ... :
number of observations in content covariate (34086) prevalence
covariate (32733) and documents (34086) are not all equal
Hi,
It would be very convenient if date vectors worked not only in stm but also in estimateEffect. The required conversion to numeric is a little annoying, especially when it comes to plotting. Can you think of a way where this could be handled automatically?
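As a stopgap, the conversion itself is one line; a hedged sketch (assuming a Date column `date` in the metadata), with the reverse mapping that is needed for relabelling plot axes:

```r
# Sketch: estimateEffect needs numeric covariates, so convert the Date
# column up front and keep the origin for relabelling plot axes later.
meta <- data.frame(date = as.Date(c("2016-01-01", "2016-06-01", "2016-12-31")))
meta$date_num <- as.numeric(meta$date)                 # days since 1970-01-01
back <- as.Date(meta$date_num, origin = "1970-01-01")  # reverse, for labels
```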
Hi!
I have a problem when running stm() with K=0.
I'm working with a trimmed dfm (a 4613 x 6048 sparse Matrix of class "dfmSparse" with 396486 entries).
The following error shows:
Error in tsneAnchor(Q) : an unknown error has occured in Rtsne
The weird thing is that I have run the command several times and never had this problem before (it has only happened in the last week).
Do you have any suggestions about what the problem may be?
Thanks,
Noemi
When running a batch script I get the following error:
stm v1.2.1 (2017-03-06) successfully loaded. See ?stm for help.
Error in validObject(r) : invalid class “dgTMatrix” object: lengths of slots i and j must match
Calls: stm ... stm.control -> opt.beta -> mnreg -> sparseMatrix -> validObject
Execution halted
Batch script:
`
library(quanteda)
library(stm)
library(pryr)
setwd("/home/XXX/Scratch/R_output/tmpdir/Corpora2Analyse")
load("presDfmSTM")
load("PostProcessMeta") # called meta
mem_used()
mem_change(wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest, max.em.its = 1, data = meta, init.type = "Spectra$
t <- system.time(wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest, max.em.its = 1, data = meta, init.type = "S$
iterations <- 3
repeat {
wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest,
max.em.its = iterations, data = meta, model = wood_cement_content, init.type = "Spectral")
iterations <- iterations + 1
if(stmFitted$convergence$converged == TRUE){
break
}
save(wood_cement_content, file="/home/XXX/Scratch/R_output/tmpdir/Corpora2Analyse/wood_cement_content_not_converged")
print(iterations)
}
save(wood_cement_content, file="/home/XXX/Scratch/R_output/tmpdir/Corpora2Analyse/wood_cement_content_converged")
`
Output from batch script:
`>
library(quanteda)
library(stm)
library(pryr)
setwd("/home/XXX/Scratch/R_output/tmpdir/Corpora2Analyse")
load("presDfmSTM")
load("PostProcessMeta") # called meta
mem_used()
222 MB
mem_change(wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest, max.em.its = 1, data = meta, init.type = "Spect$
Note: no visible binding for global variable 'Dimnames'
Note: no visible binding for global variable 'Dimnames'
Note: no visible binding for global variable 'Dimnames'
Beginning Initialization.
Calculating the gram matrix...
Note: no visible binding for global variable 'Dimnames'
Note: no visible binding for global variable 'Dimnames'
Note: no visible binding for global variable 'Dimnames'
Finding anchor words...
Recovering initialization...
..............................................................................................
Initialization complete.
....................................................................................................
Completed E-Step (601 seconds).
....................................................................................................
Completed M-Step (613 seconds).
Model Terminated Before Convergence Reached
266 MB
t <- system.time(wood_cement_content <- stm(presDfmSTM$documents, presDfmSTM$vocab, K=0, prevalence = ~ Interest + Region, content= ~ Interest, max.em.its = 1, data = meta, init.type = $
Beginning Initialization.
Calculating the gram matrix...
Finding anchor words...
Recovering initialization...
..............................................................................................
Initialization complete.
....................................................................................................
Completed E-Step (638 seconds).
....................................................................................................
Completed M-Step (906 seconds).
Model Terminated Before Convergence Reached
iterations <- 3
repeat {
max.em.its = iterations, data = meta, model = wood_cement_content, init.type = "Spectral")
break
`
SessionInfo:
`> sessioninfo()
Error: could not find function "sessioninfo"
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.2 (Maipo)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
`
Files:
I've been using the stm package successfully with a number of different text corpora.
I recently started using it on a new set of documents and I'm encountering an error that I can't seem to debug. This corpus is slightly larger than what I've worked with in the past (around 21,000 documents and about 4,900 terms). Other than that, it is a different sub-sample from the same data source that I've successfully used in the past.
I've pulled down some data from SQL with the documents and metadata in a single data frame, and I've followed the standard steps from the vignette to prepare the documents (which I've successfully done in the past with a number of different data sets).
When I attempt to use stm() I get the same error message each time:
Error: not compatible with requested type
The error is slightly different depending on the init.type that I use.
With spectral I get the following message:
Beginning Initialization.
Calculating the gram matrix...
Finding anchor words...
....................
Recovering initialization...
.................................................
Initialization complete.
Error: not compatible with requested type
For both Random and LDA initialization I get the same error message, but directly after the message:
Beginning Initialization.
I'm pretty sure that this is a scoping problem with the C++ code, but I can't figure it out.
Let me know if you have questions or if I can do more to clarify.
Thanks,
Lewis
I have a quick question about reproducibility. What precautions can I take (if any) to make sure that the same exact corpus results in the exact same stm fit from run-to-run (including different machines)?
Also, while you're at it: can I re-fit a subset of the original corpus using the fitNewDocuments function and retrieve the same topic model scores?
Thanks for reading!
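On the first question, a hedged sketch: stm() accepts a seed argument, and fixing it (plus set.seed for anything upstream) should reproduce a fit on the same machine and package version; bit-identical results across different machines are harder to guarantee because BLAS and compiler differences can change floating-point results.

```r
library(stm)
# Sketch: pin the RNG both globally and via stm's own seed argument.
# `documents`, `vocab`, and `meta` are assumed to come from prepDocuments().
set.seed(02138)
fit_a <- stm(documents, vocab, K = 20, data = meta, seed = 02138)
set.seed(02138)
fit_b <- stm(documents, vocab, K = 20, data = meta, seed = 02138)
all.equal(fit_a$theta, fit_b$theta)  # should be TRUE on one machine
```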
In investigating #71, I found that a lot of time is being spent doing a 50-dimension PCA as preprocessing for tSNE using princomp. We could replace this with a much faster randomized alternative, such as the rsvd package, and then jump straight to the tSNE step.
When doing the heldout likelihood fitting, you need to have the documents numbered in the same way as the original STM model (same vocab, same numbers corresponding to each word). We need a function that can convert another document set in STM format over to the vocab of the first (i.e. drop extra words and renumber).
@kbenoit Do you have any code for doing this in quanteda that I could include in the help file for users? The use case is basically: I create a quanteda dfm, I now have some new documents, and I want what their rows in the original dfm would have been. No worries if you guys don't do this sort of thing, just thought I'd check.
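A minimal base-R sketch of such a conversion (a hypothetical helper, not part of the package): drop words absent from the reference vocab and renumber the rest. stm-format documents are 2-row matrices of word index and count.

```r
# Hypothetical helper: re-express stm-format documents against a
# reference vocabulary, dropping words it does not contain.
align_to_vocab <- function(docs, vocab, ref_vocab) {
  lapply(docs, function(doc) {
    words <- vocab[doc[1, ]]              # word indices -> word strings
    keep  <- words %in% ref_vocab         # drop words unknown to reference
    idx   <- match(words[keep], ref_vocab)  # renumber against reference
    matrix(c(idx, doc[2, keep]), nrow = 2, byrow = TRUE)
  })
}

# Toy example: one document over a new vocab, aligned to an old one.
docs      <- list(matrix(c(1, 2, 3,     # word indices
                           2, 1, 4),    # counts
                         nrow = 2, byrow = TRUE))
vocab     <- c("apple", "banana", "cherry")
ref_vocab <- c("banana", "cherry", "date")
aligned   <- align_to_vocab(docs, vocab, ref_vocab)
```

Here "apple" is dropped, and "banana"/"cherry" are renumbered to the reference indices 1 and 2 with their counts intact.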
Hi, I encountered the following problem with searchK and wonder if anyone could shed some light on the possible causes. Thanks in advance!
I have a corpus that, after implementing prepDocuments, contains about 1.3 million documents, three thousand terms, and 15 million tokens. Then, when I used searchK, it shows an error during "Recovering initialization..." The error says: "Error in t.default(La.res$vt) : argument is not a matrix."
Could this be because the corpus is too large? When I test-ran it with a small subset of one thousand documents, it worked smoothly without any errors...
Thank you!
I'm getting an error from the textProcessor() function when I try to use it on my full 4000-row df (6 MB as an .rda), but no error on a 100-row sample of it.
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning messages:
1: In mclapply(content(x), FUN, ...) :
scheduled core 1 encountered error in user code, all values of the job will be affected
Here's a 1000 line sample of the data I'm using: https://www.dropbox.com/s/60n86skjf9f3ybo/recap-sample.rda
I used both the CRAN and current GitHub versions of STM with the same error.
Code to replicate:
library(stm)
load("recap-sample.rda")
recap.processed <- textProcessor(documents=recapsample$Text, metadata=recapsample)
# error
recapsample <- recapsample[1:100,]
recap.processed <- textProcessor(documents=recapsample$Text, metadata=recapsample)
# no error
Any idea what's going on?
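One thing worth checking before calling textProcessor (a guess at the cause, not a confirmed fix): NA, empty, or non-character entries in the text column, which can make tm's readers fail inside mclapply and surface as a "try-error" object. A base-R sketch with a toy stand-in for `recapsample$Text`:

```r
# Sketch: flag rows whose text field may break corpus construction.
texts <- c("a normal document", NA, "", 42)  # toy stand-in for recapsample$Text
texts <- as.character(texts)
bad <- which(is.na(texts) | !nzchar(trimws(texts)))
bad  # row indices to inspect or drop before calling textProcessor()
```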
Hi,
whenever I try to use spectral initialization (either for searchK or an stm fit) all my remaining memory (~13 GB) is instantly consumed and R crashes with the error:
Error: cannot allocate vector of size 1.7 Gb
I tried this on Windows 10 and Linux Mint 17. My corpus has 20133 documents, 21532 terms and 402629 tokens.
Despite this, LDA initialization works completely fine for model fitting but crashes in the searchK function:
storage<-searchK(documents, vocab, K=c(40,60), init.type = 'LDA',
prevalence =~ CIO2, data=meta)
Beginning Initialization.
Error in structure(.Call("collapsedGibbsSampler", documents, as.integer(K), :
document must be a matrix with 2 rows of type Integer.
CIO2 is a binary variable from the meta object. I'm using stm version 1.10. Do you have an idea what is going wrong here?
Thank you for making this wonderful package available to the research community! Do you have any script to parallelize stm? Any plans on this front? Thanks again!
Hi there,
I'm looking to apply my trained stm model to new data, something like the posterior() function in topicmodels. Any suggestions?
I should note that my new data has the same covariates as the training set, and I'm using a prevalence model.
Thanks,
Rochelle
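For what it's worth, newer versions of stm include fitNewDocuments() for exactly this use case. A hedged sketch (argument names as in the function's documentation; `fit`, `newdocs`, `newmeta`, `meta`, and `covariate` are assumptions, and `newdocs` must already be numbered against the training vocabulary):

```r
library(stm)
# Sketch: score held-out documents under an existing prevalence model.
scored <- fitNewDocuments(model = fit, documents = newdocs,
                          newData = newmeta, origData = meta,
                          prevalence = ~ covariate,
                          prevalencePrior = "Covariate")
scored$theta  # topic proportions for the new documents
```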
Any reason why permutationTest does not work? Thanks
temp<-textProcessor(documents=gadarian$open.ended.response,metadata=gadarian)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
documents <- out$documents
vocab <- out$vocab
meta <- out$meta
set.seed(02138)
mod.out <- stm(documents, vocab, 3, prevalence=~treatment + s(pid_rep), data=meta)
summary(mod.out)
prep <- estimateEffect(1:3 ~ treatment + s(pid_rep), mod.out, meta)
plot(prep, "treatment", model=mod.out,
method="difference",cov.value1=1,cov.value2=0)
test <- permutationTest(formula=~treatment + s(pid_rep), stmobj=mod.out,
treatment="treatment", nruns=25, documents=documents,
vocab=vocab, data=meta, stmverbose=FALSE)
plot(test,2, xlab="Effect", ylab="Model Index", main="Topic 2 Placebo Test")
Using the approach that Leeper shows in the pull request for plot.estimateEffect.
Hi,
I noticed some strange things in the summary plots and in the toLDAvis() results.
First, summary plots for STM models do not allow showing topic proportions with only one or zero terms per topic:
> plot.STM(model, type='summary', n=1)
Error in FUN(X[[i]], ...) : second argument must be a list
> plot.STM(model, type='summary', n=0)
Error in FUN(X[[i]], ...) : second argument must be a list
I think it would be useful if one could also choose to show only the proportions, without any terms. Even cooler would be the possibility to convert every STM plot into a ggplot, which could then be passed to plotly in one line of code, and we'd have interactive graphs ;-)
Second, if I understand correctly, the marginal topic distributions in LDAvis should be equivalent to the topic proportions in STM, right?
When comparing visualizations this does not seem to be the case:
plot.STM(model, type='summary', n=2)
visout <- toLDAvis(model, out$documents)
prepLDAvis()
For example, if you compare the blob sizes of topics 20 and 14, they are quite different to the topic proportions in the STM plot.
An image for reproduction is available here.
Using the same raw dataset and running searchK repeatedly generates a different "results" table each time. That is,
search_output = searchK(...)
search_output$results
prints a different table each time. Why is this? (It matters because different "results" tables may suggest different choices of K.)
Thanks!
Does it include English-specific code? Can I simply input Spanish text and will it work?
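For context, the model itself is language-agnostic; only the preprocessing defaults (stop words, stemming) are language-specific, and textProcessor exposes a language argument for that. A sketch with hypothetical inputs `docs_es` and `meta_es`:

```r
library(stm)
# Sketch: Spanish stop-word removal and stemming via textProcessor's
# language argument (passed through to tm / SnowballC).
processed <- textProcessor(docs_es, metadata = meta_es, language = "spanish")
```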
Is there a way to keep hashtags in the documents without having to rely on external textprocessors? I would like to remove punctuation and numbers but keep hashtags.
At the moment I'm using quanteda preprocessing which enables just this, but the resulting matrix has to be converted back to a stm friendly format afterwards.
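The conversion step is a one-liner in recent quanteda via convert(). A sketch, assuming a character vector `txts` (the hashtag behavior of remove_punct is worth verifying on your quanteda version):

```r
library(quanteda)
# Sketch: tokenise keeping #hashtags, then hand the dfm to stm directly.
toks  <- tokens(txts, remove_punct = TRUE, remove_numbers = TRUE)
mydfm <- dfm(toks)
out   <- convert(mydfm, to = "stm")  # list with $documents, $vocab, $meta
```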
Hi,
with the recent development version of stm, fitting a model always freezes during the E-step when including a specific factor covariate.
This works just fine:
library(stm)
stm14g <- stm(documents=prep_stm_14g$documents, vocab=prep_stm_14g$vocab,
data= prep_stm_14g$meta,
init.type='Spectral', K=5, prevalence=~ s(day))
And this always freezes:
stm14g <- stm(documents=prep_stm_14g$documents, vocab=prep_stm_14g$vocab,
data= prep_stm_14g$meta,
init.type='Spectral', K=5, prevalence=~ s(day) + topic_id)
Whenever I keyboard-interrupt after a freeze, a different arbitrary error is raised every time. Including the covariate in the model does not make much sense here, as it's basically just an identifier for each document. But I guess stm should still not freeze without an error.
Please find a file for reproduction here.
Hi,
I'm aware that the computation-intensive parts of STM are written in C++. But does STM in general benefit from multicore setups? As an example, would switching to MRO result in a substantial performance increase?
Re: #38 it might be a good idea to have a specific error that checks whether observations are dropped for missingness in the model.matrix. This would be clearer for the end user.
Thanks for a great package.
I would like to make an effect plot using ggplot2.
Is there any way to get the underlying data for the effect plot so I can draw it with ggplot?
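One hedged route: plot.estimateEffect() invisibly returns the quantities it drew, so they can be captured and reshaped for ggplot2 instead of being recomputed. The field names below ($topics, $means, $cis) are assumptions that vary by stm version, so inspect str(p) on your own object first; `prep` and `mod.out` are assumed from an earlier estimateEffect/stm call.

```r
library(stm)
library(ggplot2)
# Sketch: capture what plot.estimateEffect computed. With method
# "difference" there is one estimate and one CI per topic.
p <- plot(prep, "treatment", model = mod.out, method = "difference",
          cov.value1 = 1, cov.value2 = 0)
eff <- data.frame(topic = unlist(p$topics),
                  diff  = unlist(p$means),
                  lower = sapply(p$cis, `[`, 1),
                  upper = sapply(p$cis, `[`, 2))
ggplot(eff, aes(x = diff, y = factor(topic))) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower, xmax = upper))
```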
It's a wonderful package, and I'm trying to use the DMR topic model, but I find there is something wrong with the init.type parameter of the manyTopics function.
like this:
Error in match.arg(init.type) :
'arg' should be one of "LDA", "Random", "Spectral"
I installed the latest version, 1.0.12.
Why does this happen? Hope you can help me. Thanks!
Hello,
First of all, thank you for the package!
I have a corpus of about 10,000 documents with a mean length of about 6,000 characters (not words). In order to determine the number of topics I ran searchK over a wide range of values from 5 to 400.
The issue is that coherence peaks at 5 topics (which is too low), while the heldout likelihood as well as exclusivity seem to increase monotonically in K.
Looking at the data, it seems odd to me that the two indicators point in opposite directions. From a substantive point of view, 5 topics seem far too few while 400 are too many.
I wonder if this is kind of a common issue and if there is a way to handle it.
Thanks in advance,
ftt
R keeps "not responding" even when I am just typing in an R script. I tried uninstalling R and installing it again, but that did not work. Can anyone help?
This is what I got from the diagnostic report:
SysInfo:
sysname release version nodename machine login
"Windows" ">= 8 x64" "build 9200" "JOVANA" "x86-64" "jkarano1"
user effective_user
"jkarano1" "jkarano1"
R Version:
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
19 May 2016 15:37:29 [rsession-jkarano1] ERROR system error 32 (The process cannot access the file because it is being used by another process) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/monitored/user-settings/user-settings]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::open_w(boost::shared_ptr<std::basic_ostream >, bool) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:1052; LOGGED FROM: void rstudio::core::Settings::writeSettings() C:\Users\Administrator\rstudio\src\cpp\core\Settings.cpp:156
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/77D3D8A3, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/77D3D8A3]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/933D541B, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/933D541B]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/B2D7A33B, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/B2D7A33B]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/D063547B, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/D063547B]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/D63CB0E6, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/D63CB0E6]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/DC3DA251, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/DC3DA251]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
14 Dec 2016 09:06:53 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/per/t/EC47F429, target-path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/sdb/s-112C45DB/EC47F429]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::move(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:673; LOGGED FROM: void rstudio::session::source_database::supervisor::{anonymous}::attemptToMoveSourceDbFiles(const rstudio::core::FilePath&, const rstudio::core::FilePath&) C:\Users\Administrator\rstudio\src\cpp\session\SessionSourceDatabaseSupervisor.cpp:226
21 Mar 2017 16:13:58 [rsession-jkarano1] ERROR system error 32 (The process cannot access the file because it is being used by another process) [path=C:/Users/jkarano1/AppData/Local/RStudio-Desktop/addin_registry]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::open_w(boost::shared_ptr<std::basic_ostream >, bool) const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:1111; LOGGED FROM: void rstudio::session::modules::r_addins::{anonymous}::AddinRegistry::saveToFile(const rstudio::core::FilePath&) const C:\Users\Administrator\rstudio\src\cpp\session\modules\SessionRAddins.cpp:114
21 Mar 2017 17:16:36 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/Documents/Master thesis/.RDataTmp]; OCCURRED AT: time_t rstudio::core::FilePath::lastWriteTime() const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:586; LOGGED FROM: time_t rstudio::core::FilePath::lastWriteTime() const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:586
17 Apr 2017 13:17:20 [rsession-jkarano1] ERROR r-graphics error 6 (Plot rendering error); OCCURRED AT: rstudio::core::Error rstudio::r::session::graphics::Plot::renderFromDisplay() C:\Users\Administrator\rstudio\src\cpp\r\session\graphics\RGraphicsPlot.cpp:139; CAUSED BY: ERROR r error 4 (R code execution error) [errormsg=cannot open the connection]; OCCURRED AT: rstudio::core::Error rstudio::r::exec::{anonymous}::evaluateExpressionsUnsafe(SEXP, SEXP, SEXPREC**, rstudio::r::sexp::Protect*, rstudio::r::exec::{anonymous}::EvalType) C:\Users\Administrator\rstudio\src\cpp\r\RExec.cpp:159; LOGGED FROM: virtual void rstudio::r::session::graphics::PlotManager::render(boost::function<void(rstudio::r::session::graphics::DisplayState)>) C:\Users\Administrator\rstudio\src\cpp\r\session\graphics\RGraphicsPlotManager.cpp:481
20 Apr 2017 09:28:10 [rsession-jkarano1] ERROR system error 109 (The pipe has been ended); OCCURRED AT: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198; LOGGED FROM: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198
20 Apr 2017 16:43:48 [rsession-jkarano1] ERROR system error 2 (The system cannot find the file specified) [path=C:/Users/jkarano1/Documents/Master thesis/~WRD2881.tmp]; OCCURRED AT: time_t rstudio::core::FilePath::lastWriteTime() const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:586; LOGGED FROM: time_t rstudio::core::FilePath::lastWriteTime() const C:\Users\Administrator\rstudio\src\cpp\core\FilePath.cpp:586
24 Apr 2017 08:54:32 [rsession-jkarano1] CLIENT EXCEPTION (rsession-jkarano1): (TypeError) : undefined is not an object (evaluating 'a.n.applicable');|||com/google/gwt/dev/jjs/intrinsic/com/google/gwt/lang/Exceptions.java#28::wrap|||com/google/web/bindery/event/shared/SimpleEventBus.java#173::doFire|||com/google/gwt/event/shared/HandlerManager.java#117::fireEvent|||com/google/gwt/event/shared/HandlerManager.java#117::fireEvent|||com/google/gwt/user/client/ui/Widget.java#127::fireEvent|||com/google/gwt/user/client/ui/Widget.java#127::fireEvent|||com/google/gwt/event/logical/shared/ValueChangeEvent.java#40::fire|||org/rstudio/core/client/widget/SearchWidget.java#163::onKeyUp|||com/google/gwt/event/dom/client/KeyUpEvent.java#55::dispatch|||com/google/web/bindery/event/shared/SimpleEventBus.java#173::doFire|||Client-ID: 33e600bb-c1b1-46bf-b562-ab5cba070b0e|||User-Agent: Mozilla/5.0 (Windows NT 6.2 WOW64) AppleWebKit/538.1 (KHTML, like Gecko) rstudio Safari/538.1 Qt/5.4.1
28 Apr 2017 10:46:57 [rsession-jkarano1] ERROR system error 109 (The pipe has been ended); OCCURRED AT: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198; LOGGED FROM: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198
28 Apr 2017 19:08:33 [rsession-jkarano1] CLIENT EXCEPTION (rsession-jkarano1): (TypeError) : undefined is not an object (evaluating 'a.n.applicable');|||com/google/gwt/dev/jjs/intrinsic/com/google/gwt/lang/Exceptions.java#28::wrap|||com/google/web/bindery/event/shared/SimpleEventBus.java#173::doFire|||com/google/gwt/event/shared/HandlerManager.java#117::fireEvent|||com/google/gwt/event/shared/HandlerManager.java#117::fireEvent|||com/google/gwt/user/client/ui/Widget.java#127::fireEvent|||com/google/gwt/user/client/ui/Widget.java#127::fireEvent|||com/google/gwt/event/logical/shared/ValueChangeEvent.java#40::fire|||org/rstudio/core/client/widget/SearchWidget.java#163::onKeyUp|||com/google/gwt/event/dom/client/KeyUpEvent.java#55::dispatch|||com/google/web/bindery/event/shared/SimpleEventBus.java#173::doFire|||Client-ID: 33e600bb-c1b1-46bf-b562-ab5cba070b0e|||User-Agent: Mozilla/5.0 (Windows NT 6.2 WOW64) AppleWebKit/538.1 (KHTML, like Gecko) rstudio Safari/538.1 Qt/5.4.1
28 Apr 2017 21:54:58 [rsession-jkarano1] ERROR system error 109 (The pipe has been ended); OCCURRED AT: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198; LOGGED FROM: virtual void rstudio::session::NamedPipeHttpConnection::close() C:\Users\Administrator\rstudio\src\cpp\session\http\SessionNamedPipeHttpConnectionListener.hpp:198
I first came across this issue when trying to pass a custom stop word list to textProcessor. I was still ending up with some of my stop words in the final output because they abutted punctuation (e.g., including customstopwords = "hi" didn't remove "hi."). This behavior is well documented in textProcessor, but I still think it is not ideal.
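Until the ordering changes, one hypothetical workaround is to strip punctuation yourself before calling textProcessor(), so a custom stop word like "hi" also matches "hi.". Here `docs` is an assumed character vector of texts:

```r
# Remove punctuation up front so stop words match their punctuated forms;
# then disable textProcessor's own punctuation pass to avoid double work.
docs <- gsub("[[:punct:]]+", " ", docs)
processed <- textProcessor(docs, customstopwords = c("hi"),
                           removepunctuation = FALSE)
```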
I would like to reorder the operations in textProcessor so that punctuation is removed first, then stop words are removed, then stemming is performed. I think this will produce results more in line with what the user expects.
I'll open a PR with these changes (including documentation updates) if you're amenable to the change.
Chris
The method of handling labels in plot.STM falls apart at very large K because the text is shifted by a constant:
text(frequency[invrank[i]] + 0.01, i, lab[invrank[i]],
family = family, pos = 4, cex = text.cex)
whereas it should be shifted by an amount that depends on the scale of the data.
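A sketch of that suggestion, scaling the offset by the range of the plotted frequencies instead of hard-coding 0.01 (variable names follow the snippet above):

```r
# Offset proportional to the data range, so labels stay legible at any K
offset <- 0.01 * diff(range(frequency))
text(frequency[invrank[i]] + offset, i, lab[invrank[i]],
     family = family, pos = 4, cex = text.cex)
```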
Right now the minimum number of labels in labelTopics is 2, as noted in #21.
I added documentation but in the future we may want to change this. The key is to be careful about breaking the many things that are downstream of labelTopics().
Hi Brandon,
Can you confirm whether the following behavior is expected for fitNewDocuments?
If a hold-out corpus has only 1 term in it, fitNewDocuments will throw the error, "not a matrix", during the optimization step. If I force all documents to have at least 2 terms, I do not receive an error. Is this expected? I don't see it in the documentation, although it's reasonable to expect weird behavior when a document is almost empty.
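If this is indeed the expected behavior, a hypothetical pre-filtering step (the names `newdocs`/`newmeta` are assumptions) would be to drop hold-out documents with fewer than 2 unique terms before calling fitNewDocuments(). In stm's document format each document is a 2-row matrix whose columns are unique terms:

```r
# Keep only documents with at least 2 unique terms, and subset the
# metadata in lockstep so rows stay aligned with documents.
keep <- vapply(newdocs, function(d) ncol(d) >= 2, logical(1))
newdocs <- newdocs[keep]
newmeta <- newmeta[keep, , drop = FALSE]
```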
Thanks!
Hi Brandon,
The following describes a likely bug:
When scoring new documents with fitNewDocuments from a correlated topic model, I receive an error re: mu[,i]. Although I don't have the full error message at the moment, the error appears related to lines 181 and 282 of the fitNewDocuments code block here. Namely, when:
prevtype == "Average",
gamma is not null, and
the topic model is a CTM
mu is a vector (line 181). However, on line 282, mu is referenced as a matrix. The function appears to work properly after changing the source code to ensure mu is a matrix in the case described above.
Please let me know if my observations are not reasonable. Thanks!
Hi,
I'd like to use a date variable (Y-m-d) as a prevalence covariate in the model. While fitting the model works after converting to date format, trying to plot estimated effects (continuous) does not.
How can I prevent this from happening? I saw that you converted dates to a simple days
variable in your vignette. Are there alternatives?
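A minimal sketch of the vignette-style conversion (the column name `meta$date` is an assumption): turn the Y-m-d strings into Date objects, then into a numeric days-since-start covariate that estimateEffect can plot:

```r
# Convert date strings to a numeric covariate for use as prevalence
meta$date <- as.Date(meta$date, format = "%Y-%m-%d")
meta$days <- as.numeric(meta$date - min(meta$date))
# then use prevalence = ~ s(days) in stm() and estimateEffect()
```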
Cheers
Your comment on the last issue was a big help.
I ran searchK and the model diagnostics, and chose the number of topics based on the highest held-out likelihood.
As far as I know, there are other indicators for choosing the number of topics, such as perplexity.
Is there a specific function or way to calculate perplexity from the stm package or an stm-generated object?
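stm does not expose perplexity directly, but a close substitute it does provide is the held-out likelihood via make.heldout() and eval.heldout(). A sketch, assuming `out` is the result of prepDocuments():

```r
library(stm)
# Hold out a portion of tokens, fit on the rest, then score the held-out part
heldout <- make.heldout(out$documents, out$vocab)
fit <- stm(heldout$documents, heldout$vocab, K = 15)
eval.heldout(fit, heldout$missing)$expected.heldout
```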
Hi,
is there a reason why searchK
fits models and stores the diagnostics afterwards, but not the models themselves? Especially for models that take a long time to compute, this seems suboptimal.
See issue raised in #41
Hi,
after computing searchK
for models with content covariates, an error is raised:
Error in exclusivity(model, M = M, frexw = 0.7) :
Exclusivity calculation only designed for models without content covariates
I think a more elegant solution would be either to not calculate exclusivity for content models in searchK, or to forbid using searchK on them in the first place.
Cheers,
Carsten
prepDocuments is returning NULL for docs.removed. This prevents using the metadata properly, because the size of the metadata won't match up with the size of the dtm. Example:
out <- readCorpus(tst$term_document_matrix, type = "slam")
str(tst$term_document_matrix)
List of 6
$ i : int [1:6312] 22 92 116 40 42 119 132 133 113 119 ...
$ j : int [1:6312] 1 1 3 4 4 4 4 4 5 5 ...
$ v : num [1:6312] 1 1 1 1 1 1 1 1 1 1 ...
$ nrow : int 137
$ ncol : int 2547
$ dimnames:List of 2
..$ Terms: chr [1:137] "@beslimandtrim15" "@herbalifetruth" "@naomifrances1" "@quoththeravensa" ...
..$ Docs : chr [1:2547] "1" "2" "3" "4" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
out2 <- stm::prepDocuments(out$documents, out$vocab, tst$tweet.df, verbose = TRUE)
Detected Missing Terms, renumbering
Removing 586 of 2013 terms (586 of 6312 tokens) due to frequency
Your corpus now has 137 documents, 1427 terms and 5726 tokens.
out2$docs.removed
NULL
nrow(tst$tweet.df)
[1] 2547
nrow(out2$meta)
[1] 2547
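One possible explanation (an assumption on my part, not confirmed): readCorpus(type = "slam") treats rows as documents, but the object above is a TermDocumentMatrix with terms in rows (137 terms x 2547 docs), which is why the corpus ends up with 137 "documents" against 2547 metadata rows. Transposing first may align the counts:

```r
library(tm)   # provides t() for TermDocumentMatrix objects
# Transpose so documents are rows, as readCorpus(type = "slam") expects
dtm <- t(tst$term_document_matrix)   # now 2547 docs x 137 terms
out <- stm::readCorpus(dtm, type = "slam")
out2 <- stm::prepDocuments(out$documents, out$vocab, tst$tweet.df)
```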
Also, the documentation is not really clear about how the metadata is specified and used. Must every field be populated, or will it handle NAs? Does it distinguish between factors, dates, ordered factors, and numerics? I can't test this because I can't get stm to accept any of my metadata due to the error above.
Thanks!
LDA initialization behaves strangely at K=2:
Beginning Initialization.
Error in base::colMeans(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
In addition: Warning message:
In stm(doc, voc, K = 2, prevalence = ~Dec, data = meta, max.em.its = 1000, :
K=2 is equivalent to a unidimensional scaling model which you may prefer.
outputdata (in produce_cmatrix.r) returns the median or mode of a variable, depending on its type.
If we have a character or factor vector in which 2+ levels are tied for the same number of observations, outputdata will return all of those levels.
Ultimately this leads to a problem: either cdata will not have the same number of rows as the output or, more importantly, we'll have different values for the controls.
Hi, when trying to plot effect estimates in a model with a content covariate, an error is raised:
# post_type: factor (4 levels)
# numdate: continuous numeric
# core: dummy
comments <- stm(out$documents,out$vocab,K=50,
prevalence =~ post_type + s(numdate) * core,
content =~ core,
data=out$meta,
max.em.its =150, seed=1337, emtol= 1e-4,
init.type='Spectral', verbose=T)
prep <- estimateEffect(c(12) ~ post_type + s(numdate) * core,
comments, metadata=out$meta) # takes 20 minutes
Error in names(cdata) <- covariate : object 'cdata' not found
From what I can tell this does not depend on the plot method:
plot.estimateEffect(prep, 'post_type')
Error in names(cdata) <- covariate : object 'cdata' not found
Hi Brandon,
is the code in the development branch stable enough to use? As far as I understand, the fix for #16 is not yet in the master branch, right?
Hi Brandon,
as stated in the description of STM, fitting a model with only two topics corresponds to unidimensional scaling. I'd like to ask your opinion on whether this is reasonable, and whether it might even be a better solution than alternative scaling methods.
Say there is a corpus of legislative debates where document information such as speaker_id, speaker_party, debate_id, and debate_date is available. Wouldn't STM then be superior to simply using something like wordfish? Since we can incorporate more information into the model (e.g., all the covariates mentioned above), this should give us better estimates, right?
I played around with this a little bit and noticed some things that raised questions:
- topicCorr reveals that the topics are perfectly correlated. Shouldn't the sigma.prior then be adjusted, and if so, what would be a reasonable value?
- When running estimateEffect, it raises a warning about a non-singular matrix. How problematic is this for proper estimates? Obviously such factors are crucial for determining speaker positions.
- Should estimateEffect parameters like nsims be adjusted?
Thanks for your help :)
Carsten
Hi again -
I'd like to make three changes in my next PR:
- textProcessor uses the tm package heavily for manipulating text, but stm doesn't import that package. It would be easier to just import tm. What do you think? I understand wanting to reduce dependencies, but for such a critical function, I think we should just import it. Or should we use the :: operator to access these functions?
Chris
Hi Brandon,
when trying to estimate effects on topic proportions locally I receive an error:
prep <- estimateEffect(c(1:86) ~ s(numdate), pegida_stm, metadata=posts, uncertainty='Local',
documents=out$documents)
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be an array of at least two dimensions
Traceback:
1. estimateEffect(c(1:86) ~ s(numdate), pegida_stm, metadata = posts,
     uncertainty = "Local", documents = out$documents)
2. thetaPosterior(stmobj, nsims = 1, type = type, documents = documents)
3. thetapost.local(model, documents, nsims)
4. ln.hess(eta, theta, doc.beta, doc.ct, siginv)
5. colSums(EB)
6. colSums(EB)
7. base::colSums(x, na.rm = na.rm, dims = dims, ...)
8. stop("'x' must be an array of at least two dimensions")
Is it incorrect to use out$documents
for the documents argument? Global estimation works without problems for me. Could you give a short explanation of why one would prefer one over the other (local vs. global)?
Thanks and Cheers,
Carsten
Hi,
I just discovered the little helper function toLDAvis
and really like this way of quickly inspecting the model. Is it also possible to feed another metric to LDAvis, say, FREX values?
And is there an easy way to store the topic proportions in a variable? I'd like to feed them into a graph of topic correlations.
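For the topic proportions: the fitted stm object stores the document-topic proportions in its theta matrix (one row per document, one column per topic). A sketch, where `model` is an assumed name for an stm fit:

```r
# Extract per-document topic proportions and label the columns
theta <- model$theta
colnames(theta) <- paste0("topic_", seq_len(ncol(theta)))
props <- as.data.frame(theta)  # keep for later use, e.g. a correlation graph
```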