An R package for Keyword Assisted Topic Models, created by Shusei Eshima, Tomoya Sasaki, and Kosuke Imai.
Please visit our website for a complete reference.
An R package for Keyword Assisted Topic Models
Home Page: https://keyatm.github.io/keyATM/
License: GNU General Public License v3.0
"Not compatible with requested type" error when fitting.
sessionInfo() # please run this in R and copy&paste the output
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.6_1/lib/libopenblasp-r0.3.6.dylib
locale:
[1] C/UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quanteda_2.0.1 keyATM_0.1.0 nvimcom_0.9-83
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4 magrittr_1.5 stopwords_1.0 tidyselect_1.0.0
[5] munsell_0.5.0 colorspace_1.4-1 lattice_0.20-38 R6_2.4.1
[9] rlang_0.4.5 fastmap_1.0.1 fastmatch_1.1-0 stringr_1.4.0
[13] dplyr_0.8.4 tools_3.6.1 parallel_3.6.1 grid_3.6.1
[17] data.table_1.12.8 gtable_0.3.0 RcppParallel_5.0.0 assertthat_0.2.1
[21] tibble_2.1.3 lifecycle_0.1.0 crayon_1.3.4 Matrix_1.2-18
[25] purrr_0.3.3 ggplot2_3.3.0 glue_1.3.2 stringi_1.4.6
[29] compiler_3.6.1 pillar_1.4.3 scales_1.1.0 pkgconfig_2.0.3
> out <- keyATM(docs = data$docs,
+ keywords = data$keywords,
+ no_keyword_topics = 0,
+ model = "base",
+ options = list(seed = 250, iterations = 10)
+ )
Initializing the model...
Warning in check_keywords(info$wd_names, keywords, options$prune) :
A keyword will be pruned because it does not appear in documents: appointment
Fitting the model. 10 iterations...
Error in keyATM_fit_base(key_model, iter = options$iterations) :
Not compatible with requested type: [type=NULL; target=integer].
It should run.
May I ask why this line is inside the loop?
It seems to me that this should only be done once, outside the loop, when computing the log of the conditional posterior distribution.
Or is there something I am missing?
It would be great if a progress bar (see this example) were added to track the running time of the keyATM_read() function.
# Example
keyATM_read(progress_bar = TRUE)
A progress bar could help track the remaining time of the keyATM_read() function.
I noticed that it took quite a while for the keyATM_read() function to parse a large document-term matrix object. I'm currently using tictoc::tic() and toc() to document the running time. It would be convenient, however, if a progress_bar option could be provided as one of the arguments of the function.
In the configuration of keyATM, one of the parameters to be specified is weights_type. My understanding is that "information-theory" refers to -log base 2, as presented in your paper. I would like to clarify how you define the computation of inverse frequency. Thank you.
I ran keyATM on just a subset of 5000 docs and I got this error:
> model.keyATM <- keyATM(
docs = keyATM_docs,
no_keyword_topics = NUM_TOPICS,
keywords = KEYWORDS,
model = "cov",
model_settings = list(covariates_data = data.matrix(stm_dfm$meta),
covariates_formula = ~ as.factor(meetingType))
)
Initializing the model...
Fitting the model. 1500 iterations...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Creating an output object. It may take time...
Error: Column `Proportion` must be length 40 (the number of rows) or one, not 47
This was an error internal to keyATM() - any idea what this might be about?
keyATM_read() raises an error if there is an empty document. Should it drop the document silently?
My position is that researchers should be explicit about all the modifications to the data. The number of documents that researchers think they use should match with the number actually used.
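In that spirit, here is a minimal sketch of dropping empty documents explicitly before keyATM_read(), so the number of removed documents is on record. With a quanteda dfm you could compute `keep` as `quanteda::ntoken(data_dfm) > 0`; a plain count matrix stands in below.

```r
# Toy document-term count matrix; row 2 is an empty document
dtm <- matrix(c(1, 0, 2,
                0, 0, 0,
                3, 1, 0), nrow = 3, byrow = TRUE)

keep <- rowSums(dtm) > 0                  # empty doc = all zero counts
message(sum(!keep), " empty documents dropped")
dtm <- dtm[keep, , drop = FALSE]          # then pass on to keyATM_read()
```

This keeps the modification explicit: the researcher sees exactly how many documents were removed before fitting.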
It would be nice to monitor the iterations of keyATM, for instance in shiny.
# Example
iter <- 1500
shiny::withProgress(message = "Running keyATM", max = iter, {
  out <- keyATM::keyATM(docs = docs, no_keyword_topics = num_topics,
                        keywords = keyw, model = "base",
                        options = list(iterations = iter,
                                       pb = shiny::setProgress))  # `pb` is the proposed new option
})
I would like to use this great package from within Shiny.
Hello, I haven't found a method to predict the topic of a new document with the base model. If such a method is indeed missing, could you please provide this feature?
Hey,
I hope this is the right channel to address this question:
In the preparation section on the keyATM website, you write: "Researchers can use other methods such as a keyword selection algorithm proposed in King, Lam and Roberts (2017)."
However, the keyATM package does not seem to have a function for this algorithm. Did I overlook something, or was this sentence meant to encourage readers to implement it themselves?
Thanks in advance!
I really enjoyed working with this package - thank you for all the work on it!
What I do miss when working with keyATM is a feature enabling comparison among models with different numbers of topics, based on several existing measures.
In particular, the package LDAtuning by @nikita-moor has been of immense help when working with LDA implementations (https://cran.r-project.org/web/packages/ldatuning/index.html).
I was wondering if any of those measures could be also used with keyATM?
I wanted to check whether some of the functions in LDAtuning could be "adjusted" to work with the keyATM base model.
However, since I do not understand all the nuances of how keyATM works compared to LDA models, I was not even sure if any of these adjustments would be valid.
The measures calculated by the LDAtuning package rely on output of LDA models built with topicmodels. So, by reading through the keyATM documentation, I concluded the following:
I gathered that
LDAmodel@logLiks
would be matched by the following for keyATM models:
keyATMmodel$model_fit$`Log Likelihood`
Other metrics call for the beta probabilities of terms over topics:
LDAmodel@beta
Reading the documentation of keyATM, I concluded this is comparable to phi?
keyATMmodel$phi
And the posterior topic distributions:
LDAmodel@gamma
would correspond to:
keyATMmodel$theta
Would any of the functions relying on the LDA model outputs above work with keyATM as well?
Thank you!
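One caveat worth flagging with this mapping: in topicmodels, LDAmodel@beta stores *log* word probabilities, while keyATM's phi holds probabilities, so the closer analogue is log(phi). A toy sketch (the keyATM slot names follow the post above; the object here is a hand-built stand-in, not a real fit):

```r
# Stand-in for a fitted keyATM object, mimicking the slots referenced above
out <- list(model_fit = data.frame(`Log Likelihood` = c(-100, -90),
                                   check.names = FALSE),
            phi   = matrix(0.5, 2, 2),   # word probabilities per topic
            theta = matrix(0.5, 2, 2))   # doc-topic proportions

log_liks   <- out$model_fit$`Log Likelihood`  # cf. LDAmodel@logLiks
beta_like  <- log(out$phi)                    # cf. LDAmodel@beta (log scale!)
gamma_like <- out$theta                       # cf. LDAmodel@gamma
```

Any metric that consumes @beta on the log scale would need this log() applied before it could be fed keyATM output.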
I successfully fit keyATM with the default single thread mode. I'm interested in going to parallel processing, since I have a 16 core CPU and lots of RAM. The help suggests to use future::plan() but I don't see any further documentation or vignettes or help about this. I have used doParallel() and foreach() before, but I know future is a whole new paradigm.
May I ask for an example to work from, specifically changing the keyATM base example from single-threaded to multithreaded?
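A minimal sketch of what such an example might look like, assuming (per the help page) that keyATM picks up the backend set via future::plan() and that the parallel_init option enables the parallelized step; I have not verified exactly which parts of the fit are parallelized, so treat this as a starting point rather than a definitive recipe:

```r
library(keyATM)
library(future)

plan(multisession, workers = 16)  # set the future backend for a 16-core CPU

out <- keyATM(docs = keyATM_docs,            # prepared with keyATM_read()
              no_keyword_topics = 5,
              keywords = keywords,
              model = "base",
              options = list(seed = 250,
                             parallel_init = TRUE))  # assumed parallelized step

plan(sequential)                  # reset the backend when done
```

Unlike doParallel/foreach, with future you only declare the plan once; code written against the future API then runs on whatever backend is active.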
"incorrect number of dimensions" error if there is only one keyword topic.
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.6_1/lib/libopenblasp-r0.3.6.dylib
locale:
[1] C/UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.26 quanteda_2.0.1 forcats_0.4.0 stringr_1.4.0
[5] dplyr_0.8.4 purrr_0.3.3 readr_1.3.1 tidyr_1.0.2
[9] tibble_2.1.3 ggplot2_3.3.0 tidyverse_1.2.1 keyATM_0.1.0
[13] nvimcom_0.9-83
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4 pillar_1.4.3 compiler_3.6.1 cellranger_1.1.0
[5] stopwords_1.0 tools_3.6.1 jsonlite_1.6.1 lubridate_1.7.4
[9] lifecycle_0.1.0 nlme_3.1-140 gtable_0.3.0 lattice_0.20-38
[13] pkgconfig_2.0.3 rlang_0.4.5 fastmatch_1.1-0 Matrix_1.2-18
[17] cli_2.0.1 rstudioapi_0.11 parallel_3.6.1 xfun_0.11
[21] haven_2.1.1 fastmap_1.0.1 withr_2.1.2 httr_1.4.1
[25] xml2_1.2.2 generics_0.0.2 vctrs_0.2.2 hms_0.5.2
[29] grid_3.6.1 tidyselect_1.0.0 data.table_1.12.8 glue_1.3.2
[33] R6_2.4.1 fansi_0.4.1 readxl_1.3.1 modelr_0.1.5
[37] magrittr_1.5 ellipsis_0.3.0 backports_1.1.5 scales_1.1.0
[41] assertthat_0.2.1 rvest_0.3.4 colorspace_1.4-1 stringi_1.4.6
[45] RcppParallel_5.0.0 munsell_0.5.0 broom_0.5.2 crayon_1.3.4
Error in phi[, which(colnames(phi) %in% colnames(phi_))] :
incorrect number of dimensions
Should run.
Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.
# Please copy and paste the code. If possible, please upload the data file as `.rds`.
Using keyATM 0.4.0
I am unable to change the credible interval inside the by_strata_DocTopic() function.
It works fine with the predict command; however, changing the "ci" argument does not seem to influence the results.
Am I misunderstanding something?
The preparation document says to run these lines:
save(out, file = "SAVENAME.rds")
out <- readRDS(file = "SAVENAME.rds")
However, that doesn't work because R uses separate function pairs for writing and reading RDS files. You can pair save() with load(), or saveRDS() with readRDS(), but you can't mix them. This is my first time coding in R, and it took me a while to figure out that problem.
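For reference, a minimal sketch of the two matched pairs in base R:

```r
out <- list(theta = matrix(0, 2, 2))  # stand-in for a fitted model
f <- tempfile(fileext = ".rds")

saveRDS(out, file = f)   # saveRDS() pairs with readRDS()
out2 <- readRDS(file = f)
identical(out, out2)     # the round trip returns the same object

save(out, file = f)      # save() pairs with load(),
load(f)                  # which restores objects under their original names
```

save()/load() writes a workspace image keyed by object names, while saveRDS()/readRDS() serializes a single object you assign yourself; mixing the two formats produces the "unknown input format" error above.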
sessionInfo() # please run this in R and copy&paste the output
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.3.3
Error in readRDS(file = file_name) : unknown input format
Execution halted
I expect the .rds file to be loaded back into the program from file.
When fitting the keyATM base model, I get "Error: Something goes wrong in sample_lambda_slice()" after a couple hundred iterations. What could be the cause of this?
Let me know what other information you need.
I have fitted the same model with no_keyword_topics = 0 before.
Settings for keyATM base model:
mod <- keyATM::keyATM(
docs = keyATM_counts,
no_keyword_topics = 2,
keywords = marker_list,
model = "base",
options = list(seed = 0,
iterations = 1500,
verbose = TRUE,
llk_per = 100,
use_weights = TRUE,
weights_type = "inv-freq",
prune = TRUE,
thinning = 10,
store_theta = FALSE,
store_pi = FALSE,
parallel_init = FALSE)
)
R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS/LAPACK: /home/pschaefer/miniconda3/envs/r_env/lib/libopenblasp-r0.3.25.so; LAPACK version 3.11.0
locale:
[1] C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] keyATM_0.5.0 blogdown_1.18 zellkonverter_1.12.1
[4] logging_0.10-108 here_1.0.1 cowplot_1.1.2
[7] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[10] dplyr_1.1.4 purrr_1.0.2 readr_2.1.4
[13] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4
[16] tidyverse_2.0.0 quanteda_3.3.1 Matrix_1.6-4
[19] MatrixGenerics_1.14.0 matrixStats_1.2.0
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.32.0 fastmatch_1.1-4
[3] gtable_0.3.4 dir.expiry_1.10.0
[5] xfun_0.41 Biobase_2.62.0
[7] lattice_0.22-5 tzdb_0.4.0
[9] vctrs_0.6.5 tools_4.3.2
[11] bitops_1.0-7 generics_0.1.3
[13] parallel_4.3.2 stats4_4.3.2
[15] fansi_1.0.6 pkgconfig_2.0.3
[17] S4Vectors_0.40.2 RcppParallel_5.1.6
[19] lifecycle_1.0.4 GenomeInfoDbData_1.2.11
[21] compiler_4.3.2 munsell_0.5.0
[23] GenomeInfoDb_1.38.1 RCurl_1.98-1.13
[25] pillar_1.9.0 crayon_1.5.2
[27] SingleCellExperiment_1.24.0 DelayedArray_0.28.0
[29] abind_1.4-5 basilisk_1.14.1
[31] stopwords_2.3 tidyselect_1.2.0
[33] stringi_1.8.3 rprojroot_2.0.4
[35] grid_4.3.2 colorspace_2.1-0
[37] cli_3.6.2 SparseArray_1.2.2
[39] magrittr_2.0.3 S4Arrays_1.2.0
[41] utf8_1.2.4 withr_2.5.2
[43] filelock_1.0.3 scales_1.3.0
[45] timechange_0.2.0 XVector_0.42.0
[47] reticulate_1.34.0 png_0.1-8
[49] hms_1.1.3 GenomicRanges_1.54.1
[51] IRanges_2.36.0 basilisk.utils_1.14.1
[53] rlang_1.1.2 Rcpp_1.0.11
[55] glue_1.6.2 BiocGenerics_0.48.1
[57] jsonlite_1.8.8 R6_2.5.1
[59] zlibbioc_1.48.0
v Initializing the model [32.2s]
[1] log likelihood: -108847548681.69 (perplexity: 14052.86)
[100] log likelihood: -95721577510.63 (perplexity: 4442.06)ETA: 11m
[200] log likelihood: -93863113203.29 (perplexity: 3773.68)ETA: 10m
[300] log likelihood: -93278637181.49 (perplexity: 3585.03)ETA: 9m
[400] log likelihood: -92909389512.57 (perplexity: 3470.74)ETA: 8m
[500] log likelihood: -92615573753.18 (perplexity: 3382.41)ETA: 7m
[600] log likelihood: -92455275564.52 (perplexity: 3335.17)ETA: 7m
[700] log likelihood: -92324715315.99 (perplexity: 3297.18)ETA: 6m
[800] log likelihood: -92207650271.44 (perplexity: 3263.48)ETA: 5m
[900] log likelihood: -92148011044.97 (perplexity: 3246.45)ETA: 5m
[1000] log likelihood: -92085611011.70 (perplexity: 3228.73)ETA: 4m
[1100] log likelihood: -92014578934.45 (perplexity: 3208.67)ETA: 3m
[1200] log likelihood: -91938402270.84 (perplexity: 3187.29)ETA: 2m
Error: Something goes wrong in sample_lambda_slice().0% | ETA: 2m
No error.
It would take some time to make the data available; I am not sure how else to reproduce this error.
Hi Shusei-E
Thanks for your reply, sorry I didn't see it ...
Well in your example just change the keywords list like this :
keywords <- list(Government = c("pol", "pal", "pil"),
Constitution = c("constitution", "rights"),
ForeignAffairs = c("foreign", "war", "missingword", "missing_word"))
visualize_keywords(docs = keyATM_docs, keywords = keywords)
and you will obtain:
Warning in check_keywords(unique(unlisted), keywords, prune) :
Keywords will be pruned because they do not appear in documents: pol, pal, pil, missingword, missing_word
Error in check_keywords(unique(unlisted), keywords, prune) :
All keywords are pruned. Please check: Government
I suppose the same thing happens whenever a topic has no matching words at all.
I know the documentation says this can happen, but I don't understand why the function is not protected against this possibility: what if no words in a topic match?
Hope this helps, regards
Rod
Originally posted by @rodtaq in #177 (comment)
I ran keyATM on a collection of survey responses I have. Each response is short and the total number of tokens is around 30,000. I set the number of topics to 5. Would the small N be the reason why out$theta returns 0 after I run the code below?
out <- keyATM(docs = keyATM_ALL, # text input of all 30000 tokens
no_keyword_topics = 1, # number of topics without keywords
keywords = keywords, # keywords
model = "base", # select the model
options = list(seed = 250,
store_theta = TRUE))
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5
out$theta is empty.
I expected out$theta to contain nonzero probabilities instead of being all 0.
Dear authors,
keyATM is pretty cool! I like it and intend to use it in my research and introduce it to my students. Nevertheless, several points in your User's Guide confuse me, and any help would be appreciated:
1. You set num_states to be 5. I guess this argument refers to the states in the HMM. Am I right? Apologies for not being an expert on it, but why 5? Is it an arbitrary number, based on some prior knowledge, or chosen for some other reason?
2. What does keep = c("Z", "S") mean? And how do I read the results?
Thank you!
sample_z, sample_s: try pass-by-value with const as well (it could be faster than pass-by-reference). In a different branch, after writing tests.
- sample_z
- sample_s
- doc_id
This is an example to show how to report a bug.
I can't run the dynamic topic modeling specified as follows:
dynamic_out_day <- keyATM(docs = keyATM_docs, # text input
no_keyword_topics = 2, # number of topics without keywords
keywords = keywords, # keywords
model = "dynamic", # select the model
model_settings = list(time_index = docvars(my_corpus)$index,
num_states = 5),
options = list(seed = 250, store_theta = TRUE, thinning = 5))
I assume it has something to do with the C side of the package, but don't know exactly what's going on. My current hunch is the error might be related to the size of the data the function can handle.
sessionInfo() # please run this in R and copy&paste the output
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] quanteda_2.1.0 forcats_0.5.0 stringr_1.4.0
[4] dplyr_1.0.0 purrr_0.3.4 readr_1.3.1
[7] tidyr_1.1.0 tibble_3.0.2 ggplot2_3.3.2
[10] tidyverse_1.3.0 here_0.1 keyATM_0.3.0
# Please copy and paste the error message
Initializing the model...
Fitting the model. 1500 iterations...
free(): invalid next size (normal)
Aborted (core dumped)
Please explain what you expected to happen.
When I ran the dynamic model with the month index, it worked (6 months). When I extended the time index to days (159 days), the function stopped working.
Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds
.
Unfortunately, I cannot share the data publicly.
Hello
Is it possible to use a wildcard character in the keywords list?
For example, bank* (similar to Seeded LDA).
Thank you
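To my knowledge keyATM takes literal keywords only, so one workaround is to expand glob patterns against the dfm vocabulary yourself before building the keyword list. A minimal sketch with a toy vocabulary; with a real dfm you could take `vocab <- quanteda::featnames(data_dfm)` instead:

```r
# Expand glob patterns (e.g. "bank*") to the matching vocabulary entries
expand_glob <- function(patterns, vocab) {
  unique(unlist(lapply(patterns, function(p)
    grep(utils::glob2rx(p), vocab, value = TRUE))))
}

vocab <- c("bank", "banking", "banks", "credit", "creditor", "market")
keywords <- list(Finance = expand_glob(c("bank*", "credit*"), vocab))
# keywords$Finance: "bank" "banking" "banks" "credit" "creditor"
```

utils::glob2rx() converts the wildcard to a regular expression, so the expanded list contains only words that actually occur in the documents and no keywords get pruned at fitting time.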
- keyATMvb() (as a separate function, reusing as much of keyATM() as possible)
- make_sz_key with C++
- Initialization.
Please add any other information about the feature request.
This is not a feature request, and it is also not a bug.
In the initialization phase (line 394 of the model.R script) you use the parallel::mclapply function. This effectively prevents users from running the package in parallel on Windows machines. If you changed this to future.apply::future_lapply, Windows users would also be able to use the parallel_init option, and it seems like you already use that function in the package.
It seems like only a few lines of code would need to be changed and I could not discern if there was a particular reason for using parallel::mclapply
. I would be happy to do this if required, but since it seems like such a minor thing, you may just want to do it (if you agree that it is a good change).
Allow the latent state of each time step to be exported for customised plots.
# Example
value_figure(fig_timetrend)
# Maybe one more column called latent state
## # A tibble: 290 × 5
## time_index Topic Lower Point Upper
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1789 1_Government 0.100 0.113 0.125
## 2 1789 2_Congress 0.154 0.182 0.211
## 3 1789 3_Peace 0.0441 0.0618 0.0800
## 4 1789 4_Constitution 0.215 0.232 0.254
## 5 1789 5_ForeignAffairs 0.121 0.148 0.173
## 6 1793 1_Government 0.209 0.247 0.287
## 7 1793 2_Congress 0.0207 0.0699 0.0991
## 8 1793 3_Peace 0.0574 0.0876 0.112
## 9 1793 4_Constitution 0.257 0.312 0.388
## 10 1793 5_ForeignAffairs 0.0522 0.0819 0.112
Useful for state transition interpretation.
Hi.
First of all, thank you for making great package.
It has been really helpful for my academic research.
What I'd like to ask about is extracting document numbers from the output model and checking the final perplexity of the output model.
I have news article data separated by news company, and I'd like to sort the docs by company together with their topic numbers, so that I can see the distribution of topics for each news company.
What I have looked into so far is the function named "top_docs" with the option "n".
I changed "n" to print all the doc numbers belonging exclusively to each topic, but the returned dataframe has duplicated doc numbers across topics.
Is there any way to extract the documents exclusive to each topic number?
Also, please provide a way to check the perplexity of the output model.
Thank you.
Hi all,
I have a question with regard to the preparation of the dfm
.
In the package description you highlight that one should "aim for 7,000 to 10,000 unique words at the maximum“.
However, I guess that this highly depends on the size of the whole corpus. In my case I am looking at a corpus of nearly 1 million documents with more than 800,000 unique words. Trimming this corpus down to 7,000-10,000 unique words would considerably reduce the complexity of the content of these documents.
Therefore, I wanted to ask why one should aim for max. 10,000 words and how one should deal with the case of such large corpora.
Thank you!
It would be very useful to connect the outputted theta values to the original verbatims for modal classification purposes. Although you can currently grab theta from the keyATM_output object and merge it back into the original df, it is less than obvious and explicit.
This would be used to examine the probability of topic assignment for each document.
Thank you for a great package!
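A minimal sketch of such a merge, assuming out$theta is the documents-by-topics matrix with one row per input document in the original order; a hand-built toy matrix stands in for a real fit here:

```r
# Toy theta: 2 documents x 2 topics (rows sum to 1)
theta <- matrix(c(0.7, 0.2,
                  0.3, 0.8), nrow = 2,
                dimnames = list(c("doc1", "doc2"), c("1_Gov", "2_Econ")))
df <- data.frame(doc_id = rownames(theta), text = c("a", "b"))

df$modal_topic <- colnames(theta)[max.col(theta)]  # most probable topic per doc
df_theta <- cbind(df, as.data.frame(theta))        # merge proportions back in
```

max.col() picks the index of the highest-probability topic in each row, which is the modal classification the request describes.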
This line does not work with the newest version of dplyr
(v1.0.0
).
https://github.com/keyATM/keyATM/blob/master/R/model.R#L181
The summarise function will automatically ungroup by default (see here). Although this functionality is labeled as experimental, it is included in the CRAN version, and Travis returns an error.
How to fix (my understanding): pass the new .groups argument, e.g.
dplyr::summarize(WordCount = dplyr::n(), .groups = "drop_last")
In either case, we need to fix the DESCRIPTION (dplyr >= 1.0.0):
https://github.com/keyATM/keyATM/blob/master/DESCRIPTION#L12
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quanteda_2.0.1 keyATM_0.3.0 testthat_2.3.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 magrittr_1.5 usethis_1.6.1 stopwords_2.0
[5] tidyselect_1.1.0 munsell_0.5.0 colorspace_1.4-1 lattice_0.20-38
[9] R6_2.4.1 rlang_0.4.6 fastmatch_1.1-0 fansi_0.4.1
[13] stringr_1.4.0 dplyr_1.0.0 tools_3.6.3 grid_3.6.3
[17] data.table_1.12.8 gtable_0.3.0 utf8_1.1.4 cli_2.0.2
[21] ellipsis_0.3.1 assertthat_0.2.1 RcppParallel_5.0.1 tibble_3.0.1
[25] lifecycle_0.2.0 crayon_1.3.4 Matrix_1.2-18 purrr_0.3.4
[29] ggplot2_3.3.1 fs_1.4.1 vctrs_0.3.0 glue_1.4.1
[33] stringi_1.4.6 compiler_3.6.3 pillar_1.4.4 generics_0.0.2
[37] scales_1.1.1 pkgconfig_2.0.3
> p <- visualize_keywords(keyATM_docs, bills_keywords)
`summarise()` ungrouping output (override with `.groups` argument)
Error: `...` is not empty.
We detected these problematic arguments:
* `..1`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
visualize_keywords should run.
The code is in the testthat tests:
data(keyATM_data_bills)
bills_dfm <- keyATM_data_bills$doc_dfm
bills_keywords <- keyATM_data_bills$keywords
keyATM_docs <- keyATM_read(bills_dfm)
bills_cov <- keyATM_data_bills$cov
bills_time_index <- keyATM_data_bills$time_index
labels_use <- keyATM_data_bills$labels
p <- visualize_keywords(keyATM_docs, bills_keywords)
Hi folks, I'm wondering if you could add an option to export the iteration number to a .txt file while the model is converging.
I'm currently working on a project that must be completed on a virtual machine through Jupyter Notebooks. Basically, I'm analyzing newspapers on ProQuest, and they require all analysis to be done through their VM. However, there is a lot of data and the interface is very poor and doesn't show the progress bar while the code is running. I could get around this issue if the keyATM
function could export the iteration number to a .txt file. I completely understand if this use case is too specific to justify editing the main function in the package.
With much thanks to https://github.com/phargarten2/matrixNormal/issues/1, the Kronecker Product in matrixNormal::rmatnorm
has been changed from koch(U,V)
to koch(V,U)
. The original version used a citation in a paper that was found to be incorrect. I have updated the package (submitting it to CRAN for approval). I am sorry for any inconvenience.
Hello, thanks for your library. I ran some tests and I get the same error (Windows, R 3.6) 👍
Warning in check_keywords(info$wd_names, keywords, options$prune) :
Keywords will be pruned because they do not appear in documents: "interest_rates",
"net_margins", "cash_margins", [... truncated]
Error in mapped$set(keys[x], values[x]) : key must be not be "" or NA
Is there any restriction on how to write the keywords?
Thanks in advance
Keyword:
Dictionary object with 2 key entries.
Text : any text
Results:
values_fig(key_viz)
A tibble: 13 x 5, Groups: Topic [2]
Word WordCount Proportion(%) Ranking Topic
1 contracts 12 0.002 1 1_Corpo
2 patents 3 0.001 2 1_Corpo
3 regulations 3 0.001 3 1_Corpo
4 legal 2 0 4 1_Corpo
5 settlements 1 0 5 1_Corpo
6 m_a NA NA 6 1_Corpo
7 operational_risk NA NA 7 1_Corpo
8 sanction NA NA 8 1_Corpo
9 fiscal 1890 0.339 1 2_FxTax
10 tax_rate 6 0.001 2 2_FxTax
11 foreign_exchange 1 0 3 2_FxTax
12 taxe NA NA 4 2_FxTax
13 government_tariffs NA NA 5 2_FxTax
I am repeatedly running into a complete "R session aborted" in RStudio while attempting to run the model. I have a fairly large dfm with ~80k documents and 50k features. I expect this sample to grow in both N and P.
Have you experienced any issues with large matrices?
Hi I am new to keyATM, and I am learning keyATM Dynamic. My data are Chinese newspaper articles, and there are 3-5 news reports per day (see the screenshot below please). I followed instructions on your website to prepare time index, but got the following error message:
Error in check_arg_model_settings(obj, model, info) : model_settings$time_index does not increment by 1.
How do I fix this issue? Any help is appreciated!
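keyATM's dynamic model expects time_index to be consecutive integers starting at 1, so raw dates with gaps (missing days) trigger this check. One common workaround (an assumption on my part, not an official keyATM recipe) is to map each distinct date to its rank, so the index increments by exactly 1:

```r
# Toy dates with gaps (Jan 2 and Jan 4-6 are missing)
dates <- as.Date(c("2020-01-01", "2020-01-01", "2020-01-03", "2020-01-07"))

time_index <- as.integer(factor(dates))  # ranks of distinct dates: 1 1 2 3
```

Note that this treats non-consecutive calendar days as adjacent time steps, which may or may not be appropriate for the model's smoothness assumptions; an alternative is to aggregate the index to a coarser unit (e.g. weeks) that has no gaps.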
Running the example model provided in the user guide, I am able to replicate the model output on Mac and Linux, but not on Windows, despite using the same seed. The model output is consistent between different Windows machines (at least on the three I have tried), but consistently produces different results from the other platforms (see top_words()
and top_docs()
results pasted below).
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Germany.1252 LC_CTYPE=English_Germany.1252
## [3] LC_MONETARY=English_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] magrittr_1.5 quanteda_2.1.2 keyATM_0.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.3
## [4] tools_4.0.3 stopwords_2.0 digest_0.6.27
## [7] evaluate_0.14 lifecycle_0.2.0 tibble_3.0.4
## [10] gtable_0.3.0 lattice_0.20-41 pkgconfig_2.0.3
## [13] rlang_0.4.8 Matrix_1.2-18 fastmatch_1.1-0
## [16] parallel_4.0.3 yaml_2.2.1 xfun_0.19
## [19] fastmap_1.0.1 stringr_1.4.0 dplyr_1.0.2
## [22] knitr_1.30 fs_1.5.0 generics_0.1.0
## [25] vctrs_0.3.4 grid_4.0.3 tidyselect_1.1.0
## [28] glue_1.4.2 data.table_1.13.2 R6_2.5.0
## [31] rmarkdown_2.5 tidyr_1.1.2 ggplot2_3.3.2
## [34] purrr_0.3.4 ISOcodes_2020.03.16 usethis_1.6.3
## [37] scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0
## [40] colorspace_2.0-0 stringi_1.5.3 RcppParallel_5.0.2
## [43] munsell_0.5.0 crayon_1.3.4
For top_words(out):
## 1_Government 2_Congress 3_Peace
## 1 national government world [✓]
## 2 laws [✓] people peace [✓]
## 3 law [✓] states new
## 4 office union people
## 5 secure congress [✓] freedom [✓]
## 6 order interests america
## 7 republic policy let
## 8 business made government
## 9 respect administration nation
## 10 american present life
## 4_Constitution 5_ForeignAffairs Other_1 Other_2 Other_3
## 1 constitution [✓] country power great public
## 2 rights [✓] every state nations political
## 3 now citizens powers nation executive [1]
## 4 duty united support good system
## 5 free war [✓] general men confidence
## 6 institutions spirit well justice necessary
## 7 commerce fellow right many far
## 8 trust foreign [✓] principle first duties
## 9 honor time part purpose federal
## 10 citizen years high action prosperity
## Other_4 Other_5
## 1 hope one
## 2 american make
## 3 know much
## 4 day president
## 5 strength just
## 6 need always
## 7 land better
## 8 things others
## 9 power home
## 10 earth place
For top_docs(out):
## 1_Government 2_Congress 3_Peace 4_Constitution 5_ForeignAffairs Other_1
## 1 31 15 47 10 7 14
## 2 36 18 52 6 2 11
## 3 34 19 53 1 13 16
## 4 26 24 50 14 44 15
## 5 21 12 46 9 5 12
## 6 27 23 45 12 1 8
## 7 35 2 56 3 9 9
## 8 24 8 40 13 20 3
## 9 37 25 44 29 33 10
## 10 28 28 51 28 3 5
## Other_2 Other_3 Other_4 Other_5
## 1 38 11 58 47
## 2 41 26 46 49
## 3 32 31 43 48
## 4 36 16 54 52
## 5 43 28 51 57
## 6 3 17 42 53
## 7 6 23 39 46
## 8 35 1 55 50
## 9 30 7 37 51
## 10 37 6 48 32
See the base model output in the user guide, which I am able to replicate on other platforms. For example, the top three documents of topic 1_Government should be 9, 14, and 8.
library(keyATM)
library(quanteda)
library(magrittr)
data(data_corpus_inaugural, package = "quanteda")
data_tokens <- tokens(data_corpus_inaugural,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_url = TRUE) %>%
tokens_tolower() %>%
tokens_remove(c(stopwords("english"),
"may", "shall", "can",
"must", "upon", "with", "without")) %>%
tokens_select(min_nchar = 3)
data_dfm <- dfm(data_tokens) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 2)
keyATM_docs <- keyATM_read(texts = data_dfm)
summary(keyATM_docs)
keywords <- list(Government = c("laws", "law", "executive"),
Congress = c("congress", "party"),
Peace = c("peace", "world", "freedom"),
Constitution = c("constitution", "rights"),
ForeignAffairs = c("foreign", "war"))
out <- keyATM(docs = keyATM_docs,
no_keyword_topics = 5,
keywords = keywords,
model = "base",
options = list(seed = 250))
top_words(out)
top_docs(out)
I have a use-case of using the dynamic keyATM model. My transition matrix
However, I would need the transition matrix to allow backward state switching. That is
I am thinking of changing the sampling procedure for
Do you think this is possible? I hope you can offer some advice on which lines in the source code I should be aware of.
Thank you.
fastmap seems to be slower than hashmap. Write make_sz_key in C++ if needed. Keep the old initialization as old.
As in the examples, I expected the top_words function to give me a tibble with as many columns as named elements in the list of keywords. I have four such elements, but I get 8 columns: the first four are as expected, and the remaining are labelled "Other_1", "Other_2", etc.
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
[4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] quanteda.textstats_0.95 keyATM_0.4.0 quanteda_3.2.1
[4] ggpubr_0.4.0 deeptime_0.2.1 zoo_1.8-9
[7] scales_1.1.1 ggtext_0.1.1 readabs_0.4.11
[10] pbmcapply_1.5.0 tictoc_1.0.1 tidytable_0.6.7
[13] lubridate_1.8.0 forcats_0.5.1 stringr_1.4.0
[16] dplyr_1.0.7 purrr_0.3.4 readr_2.1.1
[19] tidyr_1.1.4 tibble_3.1.6 ggplot2_3.3.5
[22] tidyverse_1.3.1 reticulate_1.23
loaded via a namespace (and not attached):
[1] fs_1.5.2 httr_1.4.2 tools_4.0.3 backports_1.4.1
[5] utf8_1.2.2 R6_2.5.1 DBI_1.1.2 colorspace_2.0-2
[9] withr_2.4.3 tidyselect_1.1.1 gridExtra_2.3 compiler_4.0.3
[13] cli_3.1.0 rvest_1.0.2 pacman_0.5.1 xml2_1.3.3
[17] labeling_0.4.2 digest_0.6.29 rmarkdown_2.11 pkgconfig_2.0.3
[21] htmltools_0.5.2 parallelly_1.30.0 dbplyr_2.1.1 fastmap_1.1.0
[25] rlang_0.4.12 readxl_1.3.1 rstudioapi_0.13 farver_2.1.0
[29] generics_0.1.1 jsonlite_1.7.2 car_3.0-12 magrittr_2.0.1
[33] Matrix_1.4-0 Rcpp_1.0.8 munsell_0.5.0 fansi_1.0.2
[37] ggfittext_0.9.1 abind_1.4-5 ggnewscale_0.4.5 lifecycle_1.0.1
[41] yaml_2.2.1 stringi_1.7.6 carData_3.0-5 MASS_7.3-55
[45] grid_4.0.3 listenv_0.8.0 crayon_1.4.2 lattice_0.20-45
[49] haven_2.4.3 gridtext_0.1.4 hms_1.1.1 knitr_1.37
[53] pillar_1.6.4 ggsignif_0.6.3 codetools_0.2-18 future.apply_1.8.1
[57] stopwords_2.3 fastmatch_1.1-3 reprex_2.0.1 glue_1.6.0
[61] evaluate_0.14 data.table_1.14.2 RcppParallel_5.1.5 modelr_0.1.8
[65] png_0.1-7 vctrs_0.3.8 tzdb_0.2.0 tweenr_1.0.2
[69] cellranger_1.1.0 gtable_0.3.0 polyclip_1.10-0 future_1.23.0
[73] assertthat_0.2.1 xfun_0.29 ggforce_0.3.3 broom_0.7.11
[77] rstatix_0.7.0 nsyllable_1.0.1 globals_0.14.0 ellipsis_0.3.2
> names(keywords)
[1] "left" "right" "gal" "tan"
---(following the model instructions for keyATMcovariates here)----
> names(top_words(res))
[1] "1_left" "2_right" "3_gal" "4_tan" "Other_1" "Other_2" "Other_3" "Other_4"
I'd expect only the first four items to appear, or else the meaning of these "Other" values to be explained somewhere, as no error is thrown.
Can't share the data at the moment.
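For what it's worth, the extra "Other" columns appear to track the no_keyword_topics argument (the base-model examples earlier in this thread pass no_keyword_topics = 0 and no_keyword_topics = 5 explicitly). A sketch, assuming that parameter behaves the same way for the covariate model; the covariate data and formula here are placeholders:

```r
library(keyATM)

# Sketch: no_keyword_topics controls how many topics WITHOUT keywords are
# estimated alongside the keyword topics. With 0, only the four keyword
# topics should appear in top_words(). `keyATM_docs`, `keywords`, `cov_df`,
# and the formula are assumed/placeholder objects.
res <- keyATM(docs = keyATM_docs,
              no_keyword_topics = 0,            # suppress "Other_*" topics
              keywords = keywords,              # four named elements
              model = "covariates",
              model_settings = list(covariates_data = cov_df,      # placeholder
                                    covariates_formula = ~ group)) # placeholder
```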
Hi authors!
Thank you very much for this wonderful package; it is a great tool and is helping me a lot with my research! I have a question/suggestion below. I am not very experienced in topic modelling, but this comes from my experience using the package so far.
Currently, I am exploring the package in a research project on Twitter data. However, I am slightly confused about the use of keywords and whether the keywords option supports dictionaries that contain regular expressions. For instance, when I select "Terror*" (aiming for terrorism/terrorist/terror), visualize_keywords() reports that the word does not appear in the corpus when it actually does. Thus, I am led to believe that keyATM does not support keywords with regexes.
Yet, this presents two problems, I think:
Maybe the option of allowing regular expressions for keywords could be added? Or, if that is not possible, it may be a good idea to document this limitation, along with any recommended workarounds, as I looked for that in the documentation but couldn't find it.
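A possible workaround along these lines (a sketch of my own, not from the keyATM documentation) is to expand the pattern against the dfm vocabulary first and pass the matching literal terms as ordinary keywords; data_dfm is an assumed, already-constructed quanteda dfm:

```r
library(quanteda)

# Sketch: expand a wildcard-style pattern into the literal vocabulary
# terms it matches, then use those terms as plain keywords.
terror_terms <- grep("^terror", featnames(data_dfm), value = TRUE)
keywords <- list(Terror = terror_terms)
```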
Hope this helps! Thanks again for the package!
Thank you so much for making this wonderful package! I have successfully replicated the sample estimations shown on the package homepage. However, when I tried to run keyATM_read() on Chinese text data, a garbled-characters issue arose in my environment. I have checked the text encoding with as_utf8() and utf8_valid(), but it seems something goes wrong during keyATM_read().
I regularly use quanteda with the same Chinese data and have had no such trouble. I also found that another user does not face this issue when using the same text data and running the same code, so I suspect my environment has some problem. I guess this is primarily an encoding issue that I should resolve myself, but let me post it here. For your information, my default text encoding in R is UTF-8 (checked under Tools -> Code -> Saving).
Please let me know if you need more details on my setup, script, and/or data.
Thank you!
sessionInfo() # please run this in R and copy&paste the output
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] showtext_0.9-5 showtextdb_3.0 sysfonts_0.8.8 utf8_1.2.2
[5] lubridate_1.8.0 jiebaR_0.11 jiebaRD_0.1 readxl_1.4.1
[9] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4
[13] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.0
[17] tidyverse_1.3.2 keyATM_0.4.1 quanteda_3.2.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 lattice_0.20-45 assertthat_0.2.1
[4] R6_2.5.1 cellranger_1.1.0 backports_1.4.1
[7] reprex_2.0.2 httr_1.4.4 pillar_1.8.1
[10] rlang_1.0.6 googlesheets4_1.0.1 rstudioapi_0.14
[13] quanteda.corpora_0.9.2 Matrix_1.4-1 textshaping_0.3.6
[16] googledrive_2.0.0 munsell_0.5.0 broom_1.0.1
[19] compiler_4.1.2 modelr_0.1.9 systemfonts_1.0.4
[22] pkgconfig_2.0.3 tidyselect_1.2.0 fansi_1.0.3
[25] crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1
[28] withr_2.5.0 grid_4.1.2 jsonlite_1.8.2
[31] gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3
[34] magrittr_2.0.3 scales_1.2.1 RcppParallel_5.1.5
[37] cli_3.4.1 stringi_1.7.6 fs_1.5.2
[40] xml2_1.3.3 ragg_1.2.2 ellipsis_0.3.2
[43] stopwords_2.3 generics_0.1.3 vctrs_0.5.1
[46] fastmatch_1.1-3 RColorBrewer_1.1-3 tools_4.1.2
[49] glue_1.6.2 hms_1.1.2 colorspace_2.0-3
[52] gargle_1.2.1 rvest_1.0.3 haven_2.5.1
By checking the wd_names object in keyATM_docs after running keyATM_read(), I found garbled characters. The dfm is constructed by quanteda, and there are no garbled characters before keyATM_read(): I checked topfeatures(dfm, 100) and found no issue just prior to the call.
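One way to narrow this down might be the following sketch (my own suggestion, not from the package documentation; it assumes the quanteda dfm is named dfm, and CP932 corresponds to the Japanese_Japan.932 locale shown in the sessionInfo above; the iconv direction is an untested assumption):

```r
library(quanteda)

# Sketch: check whether the dfm's feature names are valid UTF-8 before
# keyATM_read(), to locate where the re-encoding happens.
feats <- featnames(dfm)
feats[!validUTF8(feats)]   # any invalid entries?

# If the strings are being interpreted in the native CP932 locale,
# forcing a declared encoding may help (assumption, untested):
feats_fixed <- iconv(feats, from = "CP932", to = "UTF-8")
```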
# Please copy and paste the error message
> keyATM_docs <- keyATM_read(dfm,
+ encoding = "UTF-8",
+ check = TRUE,
+ keep_docnames = FALSE,
+ progress_bar = FALSE,
+ split = 0)
Using quanteda dfm.
> summary(keyATM_docs)
keyATM_docs object of: 983 documents.
Length of documents:
Avg: 542.942
Min: 48
Max: 4706
SD: 551.392
Number of unique words: 12616
keyATM_docs[["wd_names"]]
[1] "蝮壼ョ壻ク咲ァサ" "襍ー"
[3] "荳ュ蝗ス迚ケ濶イ遉セ莨壻クサ荵<89>" "豕墓イサ"
[5] "驕楢キッ" "蜈ィ髱「"
[7] "謗ィ霑<9b>" "萓晄ウ墓イサ蝗ス"
Because wd_names is garbled, visualize_keywords() does not work.
> key_viz <- visualize_keywords(docs = keyATM_docs, keywords = keywords)
Warning in check_keywords(unique(unlisted), keywords, prune) :
Keywords will be pruned because they do not appear in documents: <U+7ECF><U+6D4E>, 商<U+4E1A>, <U+519B>
Error in check_keywords(unique(unlisted), keywords, prune) :
All keywords are pruned. Please check: econ, military
I hope to have a basic result as shown in the Preparation page (https://keyatm.github.io/keyATM/articles/pkgdown_files/Preparation.html).
Thank you for your consideration!